13.3 Temporal Differences

To understand how reinforcement learning works, consider how to average values that arrive at an agent sequentially. Section A.1 discusses how to maintain rolling averages, which is the basis of temporal differences.

Suppose there is a sequence of numerical values, $v_1, v_2, v_3, \dots$, and the aim is to predict the next. A rolling average $A_k$ is maintained, and updated using the temporal difference equation, derived in Section A.1:

$$
\begin{aligned}
A_k &= (1-\alpha_k)\,A_{k-1} + \alpha_k v_k \\
    &= A_{k-1} + \alpha_k\,(v_k - A_{k-1})
\end{aligned}
\tag{13.1}
$$

where $\alpha_k = 1/k$. The difference, $v_k - A_{k-1}$, is called the temporal difference error or TD error; it specifies how different the new value, $v_k$, is from the old prediction, $A_{k-1}$. The old estimate, $A_{k-1}$, is updated by $\alpha_k$ times the TD error to get the new estimate, $A_k$.

A qualitative interpretation of the temporal difference equation is that if the new value is higher than the old prediction, increase the predicted value; if the new value is less than the old prediction, decrease the predicted value. The change is proportional to the difference between the new value and the old prediction. Note that this equation is still valid for the first value, $k=1$, in which case $A_1 = v_1$.
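
As an illustration, the following Python sketch (the function name `rolling_average` and the example values are not from the text) applies Equation 13.1 with $\alpha_k = 1/k$; the resulting estimates are exactly the running averages of the values seen so far.

```python
# Sketch of the temporal-difference update of Equation 13.1 with alpha_k = 1/k,
# which makes A_k the average of the first k values.

def rolling_average(values):
    """Return the estimates A_1, A_2, ... produced by the TD update."""
    A = 0.0                        # A_0; irrelevant because alpha_1 = 1
    estimates = []
    for k, v in enumerate(values, start=1):
        alpha = 1 / k
        A = A + alpha * (v - A)    # A_k = A_{k-1} + alpha_k * (v_k - A_{k-1})
        estimates.append(A)
    return estimates

print(rolling_average([4, 8, 6, 10]))   # [4.0, 6.0, 6.0, 7.0] -- the running means
```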

In reinforcement learning, the values are often estimates of the effects of actions; more recent values are more accurate than earlier values because the agent is learning, and so they should be weighted more. One way to weight later examples more is to use Equation 13.1, but with $\alpha$ as a constant ($0 < \alpha \le 1$) that does not depend on $k$. This does not converge to the average value when there is variability in the values of the sequence, but it can track changes when the underlying process generating the values changes. See Section A.1.
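
The following sketch illustrates this with an arbitrary constant $\alpha = 0.1$ and synthetic values (the function name `track` and the data are not from the text): a constant step size discounts older values geometrically, so the estimate follows a shift in the process generating the values rather than settling on the overall average.

```python
import random

def track(values, alpha=0.1):
    """Apply the update of Equation 13.1 with a constant step size."""
    A = values[0]                  # initialize with the first value
    for v in values[1:]:
        A = A + alpha * (v - A)    # constant alpha: old values fade geometrically
    return A

random.seed(0)
before = [random.gauss(5, 1) for _ in range(200)]    # values with mean 5
after = [random.gauss(10, 1) for _ in range(200)]    # process shifts to mean 10
print(round(track(before + after), 2))   # near 10 (tracks the shift), not 7.5 (the overall mean)
```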

One way to give more weight to more recent experiences, but also converge to the average, is to set $\alpha_k = (r+1)/(r+k)$ for some $r > 0$. For the first experience, $\alpha_1 = 1$, so the prior $A_0$ is ignored. If $r = 9$, then after 11 experiences $\alpha_{11} = 0.5$, so the eleventh experience is weighted equally to all of its prior experiences combined. The parameter $r$ should be set to be appropriate for the domain.
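
A minimal sketch of this schedule, checking the numbers above for $r = 9$ (the function name `weighted_average` is chosen here for illustration):

```python
# Sketch of the step-size schedule alpha_k = (r+1)/(r+k). With r = 9,
# alpha_1 = 1 and alpha_11 = 10/20 = 0.5, so the eleventh value is weighted
# as much as all earlier values combined.

def weighted_average(values, r=9):
    A = 0.0                        # A_0; ignored because alpha_1 = 1
    for k, v in enumerate(values, start=1):
        alpha = (r + 1) / (r + k)
        A = A + alpha * (v - A)    # later values get relatively more weight
    return A

r = 9
print([(k, (r + 1) / (r + k)) for k in (1, 2, 11)])   # [(1, 1.0), (2, 0.909...), (11, 0.5)]
```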

Guaranteeing convergence to the average is not compatible with being able to adapt to make better predictions when the underlying process generating the values changes, as happens with non-stationary dynamics or rewards.