# 12.7 On-Policy Learning

Dimensions: flat, states, infinite horizon, fully observable, stochastic, utility, learning, single agent, online, bounded rationality

Q-learning is an off-policy learner. An off-policy learner learns the value of an optimal policy independently of the agent’s actions, as long as it explores enough. An off-policy learner can learn the optimal policy even if it is acting randomly. A learning agent should, however, try to exploit what it has learned by choosing the best action, but it cannot just exploit because then it will not explore enough to find the best action. An off-policy learner does not learn the value of the policy it is following, because the policy it is following includes exploration steps.

There may be cases where ignoring what the agent actually does is dangerous: where there are large negative rewards. An alternative is to learn the value of the policy the agent is actually carrying out, which includes exploration steps, so that it can be iteratively improved. As a result, the learner can take into account the costs associated with exploration. An on-policy learner learns the value of the policy being carried out by the agent, including the exploration steps.

SARSA (so called because it uses state–action–reward–state–action experiences to update the $Q$-values) is an on-policy reinforcement learning algorithm that estimates the value of the policy being followed. An experience in SARSA is of the form $\left$, which means that the agent was in state $s$, did action $a$, received reward $r$, and ended up in state $s^{\prime}$, from which it decided to do action $a^{\prime}$. This provides a new experience to update $Q(s,a)$. The new value that this experience provides is $r+\gamma Q(s^{\prime},a^{\prime})$.

Figure 12.5 gives the SARSA algorithm.

The Q-values that SARSA computes depend on the current exploration policy which, for example, may be greedy with random steps. It can find a different policy than Q-learning in situations when exploring may incur large penalties. For example, when a robot goes near the top of stairs, even if this is an optimal policy, it may be dangerous for exploration steps. SARSA will discover this and adopt a policy that keeps the robot away from the stairs. It will find a policy that is optimal, taking into account the exploration inherent in the policy.

###### Example 12.5.

In Example 12.1, the optimal policy is to go up from state $s_{0}$ in Figure 12.1. However, if the agent is exploring, this action may be bad because exploring from state $s_{2}$ is very dangerous.

If the agent is carrying out the policy that includes exploration, “when in state $s$, 80% of the time select the action $a$ that maximizes $Q[s,a]$, and 20% of the time select an action at random,” going up from $s_{0}$ is not optimal. An on-policy learner will try to optimize the policy the agent is following, not the optimal policy that does not include exploration.

The $Q$-values of the optimal policy are less in SARSA than in $Q$-learning. The values for $Q$-learning and for SARSA (the exploration rate in parentheses) for the domain of Example 12.1, for a few state–action pairs are:

Algorithm $Q[s_{0},right]$ $Q[s_{0},up]$ $Q[s_{2},upC]$ $Q[s_{2},up]$ $Q[s_{4},left]$
$Q$-learning 19.48 23.28 26.86 16.9 30.95
SARSA (20%) 9.27 7.9 14.8 4.43 18.09
SARSA (10%) 13.04 13.95 18.9 8.93 22.47

The optimal policy using SARSA with 20% exploration is to go right in state $s_{0}$, but with 10% exploration the optimal policy is to go up in state $s_{0}$. With 20% exploration, this is the optimal policy because exploration is dangerous. With $10\%$ exploration, going into state $s_{2}$ is less dangerous. Thus, if the rate of exploration is reduced, the optimal policy changes. However, with less exploration, it would take longer to find an optimal policy. The value $Q$-learning converges to does not depend on the exploration rate.

SARSA is useful when deploying an agent that is exploring in the world. If you want to do offline learning, and then use that policy in an agent that does not explore, Q-learning may be more appropriate.