Artificial Intelligence 2E

foundations of computational agents

Dimensions: flat, states, infinite horizon, fully observable, stochastic, utility, learning, single agent, online, bounded rationality

*Q*-learning is an off-policy learner.
An off-policy learner learns the value of an optimal policy
independently of the agent’s actions, as long as it explores
enough. An off-policy learner can learn the optimal policy even if it
is acting randomly. A learning agent should, however, try to exploit what
it has learned by choosing the best action, but it cannot just exploit
because then it will not explore enough to find the best action. An
off-policy learner does not learn the value of the policy it is
following, because the policy it is following includes exploration steps.
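To make the off-policy target concrete, here is a minimal sketch of a single $Q$-learning update in Python (the dictionary representation and function name are illustrative, not from the text):

```python
# Off-policy Q-learning update: the target uses the best next action,
# regardless of which action the exploring agent actually takes next.
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Move Q[(s, a)] toward r + gamma * max over a' of Q[(s_next, a')]."""
    next_actions = [a2 for (s2, a2) in Q if s2 == s_next]
    best_next = max(Q[(s_next, a2)] for a2 in next_actions) if next_actions else 0.0
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

Q = {("s0", "up"): 0.0, ("s1", "left"): 2.0, ("s1", "right"): 5.0}
q_learning_update(Q, "s0", "up", 1.0, "s1")
# target is 1.0 + 0.9 * 5.0 = 5.5; with alpha = 0.1, Q[("s0", "up")] becomes 0.55
```

The update takes the maximum over next actions, so the value learned does not depend on whether the agent actually exploits or explores in `s_next`.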

There may be cases where ignoring what the agent actually does is dangerous, such as when some actions incur large negative rewards. An alternative is to learn the value of the policy the agent is actually carrying out, including its exploration steps, so that the policy can be iteratively improved. The learner can then take into account the costs associated with exploration. An on-policy learner learns the value of the policy being carried out by the agent, including the exploration steps.

SARSA (so called because it uses state–action–reward–state–action experiences to
update the $Q$-values) is an *on-policy* reinforcement learning
algorithm that estimates the value of the policy being
followed. An experience in SARSA is of the form $\langle s,a,r,{s}^{\prime},{a}^{\prime}\rangle$, which means that
the agent was in state $s$, did action $a$, received reward $r$, and
ended up in state ${s}^{\prime}$, from which it decided to do action ${a}^{\prime}$. This
provides a new experience to update $Q(s,a)$. The new value that this
experience provides is $r+\gamma Q({s}^{\prime},{a}^{\prime})$.
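As a sketch, this update for a single experience can be written as follows (illustrative Python, with a temporal-difference step size `alpha` added, since some step size is needed to mix the new value in):

```python
# On-policy SARSA update: the target uses the action a_next the agent
# actually chose in s_next (possibly an exploration step), not the max.
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Move Q[(s, a)] toward r + gamma * Q[(s_next, a_next)]."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

Q = {("s0", "up"): 0.0, ("s1", "left"): 2.0, ("s1", "right"): 5.0}
sarsa_update(Q, "s0", "up", 1.0, "s1", "left")
# target is 1.0 + 0.9 * 2.0 = 2.8; with alpha = 0.1, Q[("s0", "up")] becomes 0.28
```

Note that if the agent's exploration step chose `"left"` (value 2.0) rather than the greedy `"right"` (value 5.0), the SARSA target is lower than the Q-learning target would be for the same experience.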

Figure 12.5 gives the SARSA algorithm.
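A minimal sketch of such a control loop, assuming a hypothetical environment interface `env_do(s, a)` that returns the next state and reward (all names here are illustrative, not the book's code):

```python
import random
from collections import defaultdict

def sarsa(env_do, actions, s0, alpha=0.1, gamma=0.9, epsilon=0.1, steps=1000):
    """SARSA control loop: the next action a2 is chosen by the same
    epsilon-greedy policy whose value is being learned, and the update
    moves Q[(s, a)] toward r + gamma * Q[(s2, a2)]."""
    Q = defaultdict(float)
    def select(s):
        if random.random() < epsilon:
            return random.choice(actions)       # exploration step
        return max(actions, key=lambda a: Q[(s, a)])  # greedy step
    s, a = s0, select(s0)
    for _ in range(steps):
        s2, r = env_do(s, a)   # environment returns next state and reward
        a2 = select(s2)        # commit to the next action before updating
        Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])
        s, a = s2, a2
    return Q
```

The key point is that `a2` is selected before the update and then actually carried out, so the value learned reflects the exploration in the policy.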

The *Q*-values that SARSA computes depend on
the current exploration policy which, for example, may be greedy with random
steps. It can find a different policy from *Q*-learning in
situations where exploring may incur large penalties. For example,
going near the top of a flight of stairs may be part of an optimal
policy for a robot, but it is dangerous when exploration steps are
taken there. SARSA will discover this and adopt a policy that keeps
the robot away from the stairs. It will
find a policy that is optimal, taking into account the exploration
inherent in the policy.

In Example 12.1, the optimal policy is to go up from state ${s}_{0}$ in Figure 12.1. However, if the agent is exploring, this action may be bad because exploring from state ${s}_{2}$ is very dangerous.

If the agent is carrying out the policy that includes exploration, “when in state $s$, 80% of the time select the action $a$ that maximizes $Q[s,a]$, and 20% of the time select an action at random,” going up from ${s}_{0}$ is not optimal. An on-policy learner will try to optimize the policy the agent is following, not the optimal policy that does not include exploration.
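The 80%/20% policy described here is epsilon-greedy selection with epsilon = 0.2; a minimal sketch (function and dictionary names are illustrative):

```python
import random

# Epsilon-greedy action selection: with probability epsilon pick an
# action at random; otherwise pick an action maximizing Q[(s, a)].
def epsilon_greedy(Q, s, actions, epsilon=0.2):
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

Q = {("s0", "up"): 1.0, ("s0", "right"): 2.0}
epsilon_greedy(Q, "s0", ["up", "right"])  # usually "right", sometimes random
```

An on-policy learner evaluates exactly this mixed policy, random steps included, rather than the purely greedy one.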

The $Q$-values of the optimal policy are smaller in SARSA than in $Q$-learning. The values for $Q$-learning and for SARSA (with the exploration rate in parentheses) for the domain of Example 12.1, for a few state–action pairs, are:

Algorithm | $Q[s_0,\mathit{right}]$ | $Q[s_0,\mathit{up}]$ | $Q[s_2,\mathit{upC}]$ | $Q[s_2,\mathit{up}]$ | $Q[s_4,\mathit{left}]$
---|---|---|---|---|---
$Q$-learning | 19.48 | 23.28 | 26.86 | 16.9 | 30.95
SARSA (20%) | 9.27 | 7.9 | 14.8 | 4.43 | 18.09
SARSA (10%) | 13.04 | 13.95 | 18.9 | 8.93 | 22.47

The optimal policy using SARSA with 20% exploration is to go right in state ${s}_{0}$, but with 10% exploration the optimal policy is to go up in state ${s}_{0}$. With 20% exploration, going right is optimal because exploring from ${s}_{2}$ is too dangerous; with 10% exploration, going into state ${s}_{2}$ is less dangerous. Thus, if the rate of exploration is reduced, the optimal policy changes. However, with less exploration, it would take longer to find an optimal policy. The value that $Q$-learning converges to does not depend on the exploration rate.

SARSA is useful when deploying an agent that
is exploring in the world. If you want to do offline learning, and then use that
policy in an agent that does not explore, *Q*-learning may be more
appropriate.