*Q*-learning is an off-policy learner.
An off-policy learner learns the value of an optimal policy
independently of the agent’s actions, as long as it explores
enough. An off-policy learner can learn the optimal policy even if it
is acting randomly. However, an off-policy learner that is exploring does not learn
the value of the policy it is actually following, because that policy includes
exploration steps.

There may be cases, such as when there are large negative rewards, where ignoring what the agent actually does is dangerous. An alternative is to learn the value of the policy the agent is actually carrying out, including its exploration steps, so that the policy can be iteratively improved. The learner can then take into account the costs associated with exploration. An on-policy learner learns the value of the policy being carried out by the agent, including the exploration steps.

SARSA (so called because it uses state–action–reward–state–action experiences to
update the $Q$-values) is an *on-policy* reinforcement learning
algorithm that estimates the value of the policy being
followed. An experience in SARSA is of the form $$, which means that
the agent was in state $s$, did action $a$, received reward $r$, and
ended up in state ${s}^{\prime}$, from which it decided to do action ${a}^{\prime}$. This
provides a new experience to update $Q(s,a)$. The new value that this
experience provides is $r+\gamma Q({s}^{\prime},{a}^{\prime})$.
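To make the update concrete, the following is a minimal sketch of the SARSA update, with the *Q*-learning update alongside for comparison. The dictionary representation of $Q$ and the function names are illustrative assumptions, not the book's code.

```python
# Minimal sketch (not the book's code): tabular updates contrasting the
# SARSA target with the Q-learning target. Q is assumed to be a dictionary
# mapping (state, action) pairs to numbers.

def sarsa_update(Q, s, a, r, s_prime, a_prime, alpha, gamma):
    # On-policy target: uses the action a' the agent will actually do in s'.
    target = r + gamma * Q[(s_prime, a_prime)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_prime, actions, alpha, gamma):
    # Off-policy target: uses the best action available in s', regardless of
    # what the agent does next.
    target = r + gamma * max(Q[(s_prime, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```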

Figure 13.5 gives the SARSA
algorithm.
The *Q*-values that SARSA computes depend on
the current exploration policy which, for example, may be greedy with random
steps. It can find a different policy than *Q*-learning in
situations where exploring may incur large penalties. For example, even if
going near the top of a flight of stairs is part of an optimal policy for a robot,
exploration steps taken there may be dangerous. SARSA will discover this
and adopt a policy that keeps the robot away from the stairs. It will
find a policy that is optimal, taking into account the exploration
inherent in the policy.
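A sketch of such a controller, in the spirit of the algorithm in Figure 13.5, is given below. The environment interface (`env.initial_state`, `env.do`), the argument names, and the fixed step budget are assumptions for illustration; Figure 13.5 gives the algorithm itself.

```python
import random

def sarsa(env, states, actions, alpha=0.1, gamma=0.9, epsilon=0.2, steps=100000):
    """Sketch of a SARSA controller. Assumes env.do(state, action) returns
    (reward, next_state); this interface is illustrative, not the book's."""
    Q = {(s, a): 0.0 for s in states for a in actions}

    def select_action(s):
        # Explore with probability epsilon, otherwise act greedily on Q.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    s = env.initial_state
    a = select_action(s)
    for _ in range(steps):
        r, s_next = env.do(s, a)            # carry out a, observe r and s'
        a_next = select_action(s_next)      # choose a' with the same policy
        # On-policy update: the target uses the action a' that will actually be taken.
        Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
        s, a = s_next, a_next
    return Q
```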

In Example 13.1, the optimal policy is to go up from state ${s}_{0}$ in Figure 13.1. However, if the agent is exploring, this action may be bad because exploring from state ${s}_{2}$ is very dangerous.

If the agent is carrying out the policy that includes exploration, “when in state $s$, 80% of the time select the action $a$ that maximizes $Q[s,a]$, and 20% of the time select an action at random,” going up from ${s}_{0}$ is not optimal. An on-policy learner will try to optimize the policy the agent is following, not the optimal policy that does not include exploration.

The $Q$-values of the optimal policy are lower in SARSA than in $Q$-learning. The values for $Q$-learning and for SARSA (with the exploration rate in parentheses) for the domain of Example 13.1, for a few state–action pairs, are:

Algorithm | $Q[{s}_{0},right]$ | $Q[{s}_{0},up]$ | $Q[{s}_{2},upC]$ | $Q[{s}_{2},up]$ | $Q[{s}_{4},left]$
---|---|---|---|---|---
$Q$-learning | 19.48 | 23.28 | 26.86 | 16.9 | 30.95
SARSA (20%) | 9.27 | 7.90 | 14.80 | 4.43 | 18.09
SARSA (10%) | 13.04 | 13.95 | 18.90 | 8.93 | 22.47

The optimal policy using SARSA with 20% exploration is to go right in state ${s}_{0}$, but with 10% exploration the optimal policy is to go up in state ${s}_{0}$. With 20% exploration, going right is optimal because exploring from ${s}_{2}$ is too dangerous. With 10% exploration, going into state ${s}_{2}$ is less dangerous. Thus, if the rate of exploration is reduced, the optimal policy changes. However, with less exploration, it would take longer to find an optimal policy. The value that $Q$-learning converges to does not depend on the exploration rate.
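As a usage sketch, this comparison could be reproduced with the hypothetical `sarsa` function above, assuming an `env`, `states`, and `actions` for the domain of Example 13.1 with states named by strings such as "s0":

```python
# Hypothetical usage: compare the greedy action at s0 under two exploration rates.
for eps in (0.2, 0.1):
    Q = sarsa(env, states, actions, epsilon=eps)
    best_action = max(actions, key=lambda a: Q[("s0", a)])
    print(f"epsilon={eps}: greedy action in s0 is {best_action}")
```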

SARSA is useful when deploying an agent that
is exploring in the world. If you want to do offline learning, and then deploy the
learned policy in an agent that does not explore, *Q*-learning may be more
appropriate.