13.7 On-Policy Learning

Q-learning is an off-policy learner. An off-policy learner learns the value of the optimal policy independently of the agent’s actions, as long as it explores enough. An off-policy learner can learn the optimal policy even if it is acting randomly. However, an off-policy learner that is exploring does not learn the value of the policy it is actually following, because that policy includes exploration steps.

There may be cases, such as domains with large negative rewards, where ignoring what the agent actually does is dangerous. An alternative is to learn the value of the policy the agent is actually carrying out, including its exploration steps, so that that policy can be iteratively improved. The learner can thus take into account the costs associated with exploration. An on-policy learner learns the value of the policy being carried out by the agent, including the exploration steps.

SARSA (so called because it uses state–action–reward–state–action experiences to update the Q-values) is an on-policy reinforcement learning algorithm that estimates the value of the policy being followed. An experience in SARSA is of the form ⟨s,a,r,s′,a′⟩, which means that the agent was in state s, did action a, received reward r, and ended up in state s′, from which it decided to do action a′. This provides a new experience to update Q(s,a). The new value that this experience provides is r+γQ(s′,a′).
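For instance, with the Q-values stored in a dictionary keyed by (state, action) pairs, the update that a single ⟨s,a,r,s′,a′⟩ experience provides could be written as the following minimal Python sketch; the function and variable names are illustrative, not part of the algorithm's specification.

def sarsa_update(Q, s, a, r, s1, a1, gamma, alpha):
    """Apply the SARSA update for one experience (s, a, r, s', a'),
    where s1 and a1 stand for s' and a'. Q maps (state, action) to a value."""
    target = r + gamma * Q[(s1, a1)]           # the new value this experience provides
    Q[(s, a)] += alpha * (target - Q[(s, a)])  # move Q[s,a] toward that target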

controller SARSA(S, A, γ, α)
     Inputs
          S is a set of states
          A is a set of actions
          γ is the discount factor
          α is the step size
     Local
          real array Q[S,A]
          states s, s′
          actions a, a′
     initialize Q[S,A] arbitrarily
     observe current state s
     select an action a using a policy based on Q[s,a]
     repeat
          do(a)
          observe reward r and state s′
          select an action a′ using a policy based on Q[s′,a′]
          Q[s,a] := Q[s,a] + α(r + γQ[s′,a′] - Q[s,a])
          s := s′
          a := a′
     until termination
Figure 13.5: SARSA: on-policy reinforcement learning
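A runnable Python version of the controller in Figure 13.5 might look as follows. The environment interface (an env object with a current_state attribute and a do(action) method returning a reward and the next state), the fixed number of steps, and the ε-greedy exploration policy are assumptions made for this sketch; the pseudocode leaves the policy and the termination condition unspecified.

import random

def sarsa(env, states, actions, gamma, alpha, epsilon=0.2, steps=100000):
    """On-policy SARSA controller (a sketch of Figure 13.5).

    Assumes env.current_state gives the current state and env.do(a)
    carries out action a, returning (reward, next_state).
    """
    Q = {s: {a: 0.0 for a in actions} for s in states}  # initialize Q arbitrarily

    def select_action(s):
        # epsilon-greedy: explore with probability epsilon, otherwise be greedy
        if random.random() < epsilon:
            return random.choice(list(actions))
        return max(Q[s], key=Q[s].get)

    s = env.current_state
    a = select_action(s)
    for _ in range(steps):              # stands in for "repeat ... until termination"
        r, s1 = env.do(a)               # do(a); observe reward r and state s'
        a1 = select_action(s1)          # select a' using the policy based on Q
        Q[s][a] += alpha * (r + gamma * Q[s1][a1] - Q[s][a])
        s, a = s1, a1
    return Q

Because the agent updates Q[s,a] with the action a′ it actually selects next, including any random exploration steps, the Q-values estimate the value of the exploration policy itself rather than of the greedy policy.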

Figure 13.5 gives the SARSA algorithm. The Q-values that SARSA computes depend on the current exploration policy, which, for example, may be greedy with random steps. SARSA can find a different policy than Q-learning in situations where exploring may incur large penalties. For example, even if it is optimal for a robot to go near the top of a flight of stairs, taking exploration steps there may be dangerous. SARSA will discover this and adopt a policy that keeps the robot away from the stairs. It will find a policy that is optimal, taking into account the exploration inherent in the policy.

Example 13.5.

In Example 13.1, the optimal policy is to go up from state s0 in Figure 13.1. However, if the agent is exploring, this action may be bad because exploring from state s2 is very dangerous.

If the agent is carrying out the policy that includes exploration, “when in state s, 80% of the time select the action a that maximizes Q[s,a], and 20% of the time select an action at random,” going up from s0 is not optimal. An on-policy learner will try to optimize the policy the agent is following, not the optimal policy that does not include exploration.

The Q-values of the optimal policy are lower in SARSA than in Q-learning. The values for Q-learning and for SARSA (with the exploration rate in parentheses) for the domain of Example 13.1, for a few state–action pairs, are

Algorithm       Q[s0,right]   Q[s0,up]   Q[s2,upC]   Q[s2,up]   Q[s4,left]
Q-learning         19.48        23.28       26.86       16.9        30.95
SARSA (20%)         9.27         7.90       14.80        4.43       18.09
SARSA (10%)        13.04        13.95       18.90        8.93       22.47

The policy learned by SARSA with 20% exploration is to go right in state s0, but with 10% exploration it is to go up in state s0. With 20% exploration, going right is optimal because exploring from state s2 is too dangerous. With 10% exploration, going into state s2 is less dangerous. Thus, if the rate of exploration is reduced, the optimal policy changes. However, with less exploration, it would take longer to find an optimal policy. The values that Q-learning converges to do not depend on the exploration rate.
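Reading the greedy action in s0 off the Q-values reported above makes the switch explicit; the following small sketch uses only the numbers from the table.

# Greedy action in s0 under the two SARSA runs, taken from the table above.
q_s0 = {
    "SARSA (20%)": {"right": 9.27, "up": 7.90},
    "SARSA (10%)": {"right": 13.04, "up": 13.95},
}
for run, q in q_s0.items():
    print(run, "->", max(q, key=q.get))
# prints: SARSA (20%) -> right
#         SARSA (10%) -> up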

SARSA is useful when deploying an agent that is exploring in the world. If you want to do offline learning, and then use that policy in an agent that does not explore, Q-learning may be more appropriate.