
11.3.7 Assigning Credit and Blame to Paths

In Q-learning and SARSA, only the previous state-action pair has its value revised when a reward is received. Intuitively, when an agent takes a number of steps that lead to a reward, all of the steps along the way could be held responsible, and so receive some of the credit or blame for the reward. This section gives an algorithm that assigns credit and blame to all of the steps that led to a reward.

Example 11.13: Suppose there is an action $\mathit{right}$ that visits the states $s_1$, $s_2$, $s_3$, and $s_4$ in this order, and a reward is given only when the agent enters $s_4$ from $s_3$; any action from $s_4$ returns to state $s_1$. There is also an action $\mathit{left}$ that moves to the left, except in state $s_4$. In Q-learning and SARSA, after traversing right through the states $s_1$, $s_2$, $s_3$, and $s_4$ and receiving the reward, only the value of $Q[s_3,\mathit{right}]$ is updated. If the same sequence of states is visited again, the value of $Q[s_2,\mathit{right}]$ will be updated when the agent transitions into $s_3$. The value of $Q[s_1,\mathit{right}]$ is only updated after the next transition from state $s_1$ to $s_2$. In this sense, we say that Q-learning does a one-step backup.
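To see the one-step backup concretely, here is a minimal Python sketch of Example 11.13 (not the book's code; the reward of 1 for entering $s_4$, the discount $\gamma = 0.9$, the step size $\alpha = 0.5$, and restricting attention to the action $\mathit{right}$ are assumptions made to keep the illustration small):

```python
# One-step Q-learning backups along the chain s1 -> s2 -> s3 -> s4,
# with a reward of 1 only for the transition s3 -> s4.
gamma, alpha = 0.9, 0.5
Q = {(s, "right"): 0.0 for s in ["s1", "s2", "s3", "s4"]}

def traverse_right():
    """One traversal of the chain, applying a one-step update at each step."""
    for s, s_next, r in [("s1", "s2", 0.0), ("s2", "s3", 0.0), ("s3", "s4", 1.0)]:
        Q[(s, "right")] += alpha * (r + gamma * Q[(s_next, "right")] - Q[(s, "right")])

traverse_right()
print(Q)   # only Q[('s3', 'right')] has changed after the first traversal
traverse_right()
print(Q)   # only now does Q[('s2', 'right')] change as well
```

Propagating the reward back to $Q[s_1,\mathit{right}]$ in this way takes a third traversal; SARSA(λ), developed below, spreads the credit over all of the visited pairs after a single traversal.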

Consider updating the value of $Q[s_3,\mathit{right}]$ based on the reward for entering state $s_4$. From the perspective of state $s_4$, the algorithm is doing a one-step backup. From the perspective of state $s_3$, it is doing a one-step look-ahead. To allow the credit and blame to be associated with more than the previous step, the reward from entering state $s_4$ could do a two-step backup to update $Q[s_2,\mathit{right}]$ or, equivalently, the agent could do a two-step look-ahead from $s_2$ and update $s_2$'s value when the reward from entering $s_4$ is received. We will describe the algorithm in terms of a look-ahead but implement it using a backup.

With a two-step look-ahead, suppose the agent is in state $s_t$, does action $a_t$, ends up in state $s_{t+1}$, and receives reward $r_{t+1}$, then does action $a_{t+1}$, resulting in state $s_{t+2}$ and reward $r_{t+2}$. A two-step look-ahead at time $t$ gives the return $R_t^{(2)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 V(s_{t+2})$, thus giving the TD error

$$\delta_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 V(s_{t+2}) - Q[s_t,a_t],$$

where $V(s_{t+2})$ is an estimate of the value of $s_{t+2}$. The two-step update is

$$Q[s_t,a_t] \leftarrow Q[s_t,a_t] + \alpha\,\delta_t.$$
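As a concrete illustration, here is a minimal Python sketch of the two-step update for a tabular Q (the function name and the choice $V(s) = \max_a Q[s,a]$ are assumptions for illustration; the SARSA-style variant discussed below would instead use $Q[s_{t+2},a_{t+2}]$ as the estimate):

```python
# Two-step backup for a tabular Q stored as a dict keyed by (state, action),
# assuming V(s) is estimated by max_a Q[s, a].
def two_step_update(Q, s_t, a_t, r1, r2, s_t2, actions, gamma=0.9, alpha=0.1):
    """Update Q[(s_t, a_t)] from the next two rewards r1 = r_{t+1}, r2 = r_{t+2}
    and the state s_t2 = s_{t+2} reached two steps later; returns the TD error."""
    v_t2 = max(Q[(s_t2, a)] for a in actions)            # estimate of V(s_{t+2})
    delta = r1 + gamma * r2 + gamma**2 * v_t2 - Q[(s_t, a_t)]
    Q[(s_t, a_t)] += alpha * delta
    return delta
```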

Unfortunately, this is not a good estimate of the optimal Q-value, $Q^*$, because action $a_{t+1}$ may not be an optimal action. For example, if $a_{t+1}$ was the action that takes the agent into a position with a reward of $-10$, and better actions were available, the agent should not update $Q[s_t,a_t]$ with that experience. However, this multiple-step backup does provide an improved estimate of the policy that the agent is actually following: if the agent is following policy $\pi$, the backup gives an improved estimate of $Q^\pi$. Thus multiple-step backups can be used in an on-policy method such as SARSA.

Suppose the agent is in state $s_t$ and performs action $a_t$, resulting in reward $r_{t+1}$ and state $s_{t+1}$. It then does action $a_{t+1}$, resulting in reward $r_{t+2}$ and state $s_{t+2}$, and so forth. The $n$-step return at time $t$, where $n \ge 1$, written $R_t^{(n)}$, is a data point for the estimated future value of the action at time $t$, obtained by looking $n$ steps ahead:

$$R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V(s_{t+n}).$$

This could be used to update $Q[s_t,a_t]$ using the TD error $R_t^{(n)} - Q[s_t,a_t]$. However, it is difficult to know which $n$ to use. Instead of selecting one particular $n$ and looking forward $n$ steps, it is possible to average a number of $n$-step returns. One way to do this is to take a weighted average of all $n$-step returns, in which returns further in the future are exponentially decayed, with decay rate $\lambda$. This is the intuition behind the method called SARSA(λ): when a reward is received, the values of all of the visited state-action pairs are updated, and those visited farther in the past receive less of the credit or blame for the reward.
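A Python sketch of the $n$-step return may help make the indexing concrete (the list representation is an assumption: `rewards[k]` holds $r_{k+1}$, the reward for the $k$-th transition, and `values[k]` holds $V(s_k)$):

```python
# R_t^(n): discounted sum of the next n rewards plus a discounted bootstrap
# from the value estimate n steps ahead.
def n_step_return(rewards, values, t, n, gamma):
    discounted_rewards = sum(gamma**i * rewards[t + i] for i in range(n))
    return discounted_rewards + gamma**n * values[t + n]

# e.g. n_step_return([0, 0, 1, 0], [0, 0, 0, 2, 0], t=0, n=3, gamma=0.9)
# is 0.9**2 * 1 + 0.9**3 * 2, approximately 2.27
```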

Let

$$R_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)},$$

where $1-\lambda$ is a normalizing constant that ensures we are taking an average: the weights $(1-\lambda)\lambda^{n-1}$ sum to 1. The following table gives the details of the sum:

look-ahead | weight | return
1 step | $1-\lambda$ | $r_{t+1} + \gamma V(s_{t+1})$
2 step | $(1-\lambda)\lambda$ | $r_{t+1} + \gamma r_{t+2} + \gamma^2 V(s_{t+2})$
3 step | $(1-\lambda)\lambda^2$ | $r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 V(s_{t+3})$
4 step | $(1-\lambda)\lambda^3$ | $r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 r_{t+4} + \gamma^4 V(s_{t+4})$
··· | ··· | ···
n step | $(1-\lambda)\lambda^{n-1}$ | $r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V(s_{t+n})$
··· | ··· | ···
total | 1 |
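The weighted average itself is easy to compute directly, as in the following Python sketch (the short trajectory, $\gamma$, and $\lambda$ are made-up values; steps beyond the given data are treated as having zero reward and zero value, so truncating the infinite sum after $N$ terms introduces an error of order $\lambda^N$):

```python
# Lambda-return: weighted average of n-step returns, truncated at N terms.
def lambda_return(rewards, values, t, gamma, lam, N=200):
    rewards = rewards + [0.0] * N            # pad so every n-step return exists
    values = values + [0.0] * N

    def n_step(n):                           # R_t^(n), as in the table above
        return (sum(gamma**i * rewards[t + i] for i in range(n))
                + gamma**n * values[t + n])

    return (1 - lam) * sum(lam**(n - 1) * n_step(n) for n in range(1, N + 1))

gamma, lam = 0.9, 0.9
print(sum((1 - lam) * lam**(n - 1) for n in range(1, 201)))   # weights sum to about 1
print(lambda_return([0, 0, 1], [0, 0, 0, 2], t=0, gamma=gamma, lam=lam))
```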

Collecting together the common $r_{t+i}$ terms gives

$$\begin{aligned}
R_t^{\lambda} = {} & r_{t+1} + \gamma V(s_{t+1}) - \lambda\gamma V(s_{t+1}) \\
& + \lambda\gamma\, r_{t+2} + \lambda\gamma^2 V(s_{t+2}) - \lambda^2\gamma^2 V(s_{t+2}) \\
& + \lambda^2\gamma^2\, r_{t+3} + \lambda^2\gamma^3 V(s_{t+3}) - \lambda^3\gamma^3 V(s_{t+3}) \\
& + \lambda^3\gamma^3\, r_{t+4} + \lambda^3\gamma^4 V(s_{t+4}) - \lambda^4\gamma^4 V(s_{t+4}) \\
& + \cdots.
\end{aligned}$$

This will be used in a version of SARSA in which the future estimate $V(s_{t+i})$ is the value of $Q[s_{t+i},a_{t+i}]$. The TD error, the return minus the current estimate $Q[s_t,a_t]$, is

$$\begin{aligned}
R_t^{\lambda} - Q[s_t,a_t] = {} & \bigl(r_{t+1} + \gamma Q[s_{t+1},a_{t+1}] - Q[s_t,a_t]\bigr) \\
& + \lambda\gamma \bigl(r_{t+2} + \gamma Q[s_{t+2},a_{t+2}] - Q[s_{t+1},a_{t+1}]\bigr) \\
& + \lambda^2\gamma^2 \bigl(r_{t+3} + \gamma Q[s_{t+3},a_{t+3}] - Q[s_{t+2},a_{t+2}]\bigr) \\
& + \lambda^3\gamma^3 \bigl(r_{t+4} + \gamma Q[s_{t+4},a_{t+4}] - Q[s_{t+3},a_{t+3}]\bigr) \\
& + \cdots.
\end{aligned}$$
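This says that the λ-return TD error is a $(\lambda\gamma)$-discounted sum of the ordinary one-step TD errors along the trajectory. The following Python sketch checks the identity numerically on a made-up finite trajectory (the rewards, Q-values, $\gamma$, and $\lambda$ are arbitrary; steps after the last one are treated as having zero reward and zero value, so the directly computed weighted average differs only by a truncation error of order $\lambda^N$):

```python
# Check: R_t^lambda - Q[s_t,a_t] equals the (lambda*gamma)-discounted sum of
# one-step TD errors.  rewards[k] is r_{k+1}; q[k] is Q[s_k, a_k].
gamma, lam = 0.9, 0.8
rewards = [0.0, 1.0, -2.0, 5.0]
q = [0.3, -0.1, 0.4, 0.2, 0.0]

# Direct definition: weighted average of n-step returns, truncated at N terms.
N = 500
r_pad, q_pad = rewards + [0.0] * N, q + [0.0] * N
def n_step(n):
    return sum(gamma**i * r_pad[i] for i in range(n)) + gamma**n * q_pad[n]
lam_return = (1 - lam) * sum(lam**(n - 1) * n_step(n) for n in range(1, N + 1))

# Decomposition: Q[s_t,a_t] plus the discounted sum of one-step TD errors.
deltas = [rewards[i] + gamma * q[i + 1] - q[i] for i in range(len(rewards))]
decomposed = q[0] + sum((lam * gamma)**i * d for i, d in enumerate(deltas))

print(lam_return, decomposed)   # the two values agree
```

Because each one-step TD error appears with weight $(\lambda\gamma)^i$, each error can be applied to all previously visited state-action pairs with these weights, which is exactly what the eligibility trace below implements.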

Instead of waiting until the end, which may never occur, SARSA(λ) updates the value of $Q[s_t,a_t]$ at every subsequent time step. When the agent receives reward $r_{t+i}$, it can use the appropriate term in the preceding sum to update $Q[s_t,a_t]$. The preceding description applies to all earlier times; for example, the one-step TD error $r_{t+3} + \gamma Q[s_{t+3},a_{t+3}] - Q[s_{t+2},a_{t+2}]$ can be used to update every previously visited state-action pair, weighted appropriately. An agent can do this by keeping an eligibility trace that specifies how much each state-action pair should be updated at each time step. When a state-action pair is first visited, its eligibility is set to 1. At each subsequent time step its eligibility is multiplied by $\lambda\gamma$. When the state-action pair is visited again, 1 is added to its eligibility.

The eligibility trace is implemented by an array e[S,A], where S is the set of all states and A is the set of all actions. After every action is carried out, the Q-value for every state-action pair is updated.

controller SARSA(λ, S, A, γ, α)
    inputs:
        S is a set of states
        A is a set of actions
        γ the discount
        α is the step size
        λ is the decay rate
    internal state:
        real array Q[S,A]
        real array e[S,A]
        previous state s
        previous action a
    begin
        initialize Q[S,A] arbitrarily
        initialize e[s,a] = 0 for all s, a
        observe current state s
        select action a using a policy based on Q
        repeat forever:
            carry out action a
            observe reward r and state s'
            select action a' using a policy based on Q
            δ ← r + γQ[s',a'] - Q[s,a]
            e[s,a] ← e[s,a] + 1
            for all s'', a'':
                Q[s'',a''] ← Q[s'',a''] + αδe[s'',a'']
                e[s'',a''] ← γλe[s'',a'']
            s ← s'
            a ← a'
        end-repeat
    end

Figure 11.14: SARSA(λ)

The algorithm, known as SARSA(λ), is given in Figure 11.14.

Although this algorithm specifies that Q[s,a] is updated for every state s and action a whenever a new reward is received, it may be much more efficient, and only slightly less accurate, to update only those values whose eligibility is above some threshold.
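For concreteness, here is a minimal Python sketch of the controller of Figure 11.14 for a tabular problem, including the thresholding just mentioned (the environment interface `env.current_state` and `env.step(a)` returning a reward and the next state, the ε-greedy action selection, and the trace cutoff are assumptions for illustration; this is a sketch of the idea, not the book's code):

```python
import random
from collections import defaultdict

def sarsa_lambda(env, actions, gamma=0.9, alpha=0.1, lam=0.9,
                 epsilon=0.1, steps=10000, trace_cutoff=1e-4):
    """Tabular SARSA(lambda) with accumulating eligibility traces.
    Assumes env.current_state gives the current state and env.step(a)
    returns (reward, next_state)."""
    Q = defaultdict(float)              # Q[(s, a)], initialized to 0
    e = defaultdict(float)              # eligibility trace e[(s, a)]

    def select_action(s):
        # epsilon-greedy policy based on Q
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    s = env.current_state
    a = select_action(s)
    for _ in range(steps):
        r, s_next = env.step(a)                 # carry out a; observe r and s'
        a_next = select_action(s_next)          # select a' using the same policy
        delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
        e[(s, a)] += 1                          # bump the eligibility of (s, a)
        for sa in list(e):                      # update every eligible pair
            Q[sa] += alpha * delta * e[sa]
            e[sa] *= gamma * lam                # decay the trace
            if e[sa] < trace_cutoff:
                del e[sa]                       # drop negligible traces
        s, a = s_next, a_next
    return Q
```

Because pairs with zero eligibility would receive a zero update anyway, iterating only over the pairs stored in `e` gives the same result as the "for all s'', a''" loop of Figure 11.14, up to the traces dropped by the cutoff.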