13.11 Review

The following are the main points you should have learned from this chapter:

  • A Markov decision process is an appropriate formalism for reinforcement learning. A common method is to learn an estimate of the value of doing each action in each state, represented by the Q(s,a) function.

  • In reinforcement learning, an agent should trade off exploiting its knowledge and exploring to improve its knowledge.

  • Off-policy learning, such as Q-learning, learns the value of the optimal policy. On-policy learning, such as SARSA, learns the value of the policy the agent is actually carrying out (which includes the exploration).

  • Model-based reinforcement learning separates learning the dynamics and reward models from the decision-theoretic planning of what to do given the models.

  • For large state or action spaces, reinforcement learning algorithms can be designed to use generalizing learners (such as neural networks) to represent the value function, the Q-function, and/or the policy.
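The off-policy/on-policy distinction and the exploration–exploitation trade-off above can be made concrete with a small sketch. The following is illustrative only, not code from the chapter: a two-state toy MDP (states, actions, and rewards invented here) on which tabular Q-learning and SARSA, each with epsilon-greedy exploration, learn Q-functions. Note how Q-learning backs up the best next action while SARSA backs up the action its exploring policy actually takes.

```python
import random
from collections import defaultdict

# Invented toy MDP: "move" toggles between states 0 and 1;
# landing in state 1 yields reward 1, otherwise 0.
STATES = [0, 1]
ACTIONS = ["stay", "move"]

def step(s, a):
    s2 = 1 - s if a == "move" else s
    return s2, (1.0 if s2 == 1 else 0.0)

def epsilon_greedy(Q, s, eps=0.1):
    # Explore with probability eps; otherwise exploit the current estimate.
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

def q_learning(episodes=500, alpha=0.1, gamma=0.9):
    Q = defaultdict(float)
    for _ in range(episodes):
        s = 0
        for _ in range(10):
            a = epsilon_greedy(Q, s)
            s2, r = step(s, a)
            # Off-policy: back up the best next action, regardless of
            # what the exploring policy will actually do in s2.
            best_next = max(Q[(s2, a2)] for a2 in ACTIONS)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q

def sarsa(episodes=500, alpha=0.1, gamma=0.9):
    Q = defaultdict(float)
    for _ in range(episodes):
        s = 0
        a = epsilon_greedy(Q, s)
        for _ in range(10):
            s2, r = step(s, a)
            a2 = epsilon_greedy(Q, s2)
            # On-policy: back up the action actually taken next, so the
            # estimate reflects the policy being followed, exploration included.
            Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])
            s, a = s2, a2
    return Q

random.seed(0)
Qq = q_learning()
Qs = sarsa()
# Both learners should come to prefer "move" in state 0
# (it leads to the rewarding state).
print(max(ACTIONS, key=lambda a: Qq[(0, a)]))
print(max(ACTIONS, key=lambda a: Qs[(0, a)]))
```

On this deterministic toy problem the two algorithms agree on the greedy policy; they differ on problems where exploration is costly, since SARSA's estimates account for the exploratory actions the agent will actually take.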
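The model-based approach can be sketched in the same toy setting. The following is an invented illustration, not the chapter's algorithm: the agent first estimates transition probabilities and expected rewards from experience, then plans with value iteration on the learned model, keeping learning and decision-theoretic planning as separate steps.

```python
import random
from collections import defaultdict

# Same invented toy MDP as above: "move" toggles the state;
# landing in state 1 yields reward 1.
STATES = [0, 1]
ACTIONS = ["stay", "move"]

def step(s, a):
    s2 = 1 - s if a == "move" else s
    return s2, (1.0 if s2 == 1 else 0.0)

# Step 1: learn the dynamics and reward models from random experience.
counts = defaultdict(lambda: defaultdict(int))  # (s,a) -> {s': count}
reward_sum = defaultdict(float)                 # (s,a) -> total reward seen
visits = defaultdict(int)                       # (s,a) -> visit count
random.seed(0)
for _ in range(2000):
    s = random.choice(STATES)
    a = random.choice(ACTIONS)
    s2, r = step(s, a)
    counts[(s, a)][s2] += 1
    reward_sum[(s, a)] += r
    visits[(s, a)] += 1

def P(s, a, s2):   # estimated transition probability
    return counts[(s, a)][s2] / visits[(s, a)]

def R(s, a):       # estimated expected reward
    return reward_sum[(s, a)] / visits[(s, a)]

# Step 2: plan with value iteration on the learned model.
gamma = 0.9
V = {s: 0.0 for s in STATES}
for _ in range(100):
    V = {s: max(R(s, a) + gamma * sum(P(s, a, s2) * V[s2] for s2 in STATES)
                for a in ACTIONS)
         for s in STATES}

policy = {s: max(ACTIONS,
                 key=lambda a: R(s, a) + gamma * sum(P(s, a, s2) * V[s2]
                                                     for s2 in STATES))
          for s in STATES}
print(policy)
```

Because the model is learned once from experience, the planning step can be rerun (for example, with a different discount factor) without gathering new data, which is one practical advantage of separating the two.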