# 12.11 Review

The following are the main points you should have learned from this chapter:

• A Markov decision process is an appropriate formalism for reinforcement learning. A common method is to learn an estimate of the value of doing each action in a state, as represented by the $Q(S,A)$ function.

• In reinforcement learning, an agent should trade off exploiting its knowledge and exploring to improve its knowledge.

• Off-policy learning, such as Q-learning, learns the value of the optimal policy, whatever policy the agent follows while learning, provided it explores enough. On-policy learning, such as SARSA, learns the value of the policy the agent is actually carrying out, including its exploration steps.

• Model-based reinforcement learning separates learning the dynamics and reward models from the decision-theoretic planning of what to do given the models.
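The difference between the off-policy and on-policy updates, and the explore/exploit trade-off, can be seen in a minimal tabular sketch. The function names (`epsilon_greedy`, `q_learning_update`, `sarsa_update`), the dictionary representation of $Q$, and the toy states and values are assumptions made for illustration, not code from this chapter. The sketch uses $\epsilon$-greedy as one possible exploration strategy and does not treat terminal states specially.

```python
import random
from collections import defaultdict


def epsilon_greedy(Q, state, actions, epsilon, rng):
    """Explore with probability epsilon; otherwise exploit current estimates."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])


def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """Off-policy: bootstrap from the best action available in s_next."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy: bootstrap from the action the agent actually takes next."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])


# Hypothetical two-action example: the same experience, two different updates.
actions = ["left", "right"]
initial = {("s1", "left"): 1.0, ("s1", "right"): 5.0}

Q_off = defaultdict(float, initial)
Q_on = defaultdict(float, initial)

# The agent did "right" in s0, received reward 0, and arrived in s1.
q_learning_update(Q_off, "s0", "right", 0.0, "s1", actions, alpha=0.5, gamma=0.9)

# Suppose exploration makes the agent choose "left" in s1.
sarsa_update(Q_on, "s0", "right", 0.0, "s1", "left", alpha=0.5, gamma=0.9)

print(Q_off[("s0", "right")])  # about 2.25: uses max over actions in s1
print(Q_on[("s0", "right")])   # about 0.45: uses the action actually taken

# With epsilon = 0 the agent only exploits.
greedy_choice = epsilon_greedy(Q_off, "s1", actions, 0.0, random.Random(0))
```

Given the same experience, Q-learning moves its estimate toward the value of the best next action, whereas SARSA moves toward the value of the exploratory action that was actually chosen, so its estimates reflect the cost of exploration.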