
11.3.2 Temporal Differences

To understand how reinforcement learning works, first consider how to average experiences that an agent receives sequentially.

Suppose there is a sequence of numerical values, v1,v2,v3,..., and the goal is to predict the next value, given all of the previous values. One way to do this is to maintain a running approximation of the expected value of the v's. For example, given a sequence of students' grades and the aim of predicting the next grade, a reasonable prediction is the average of the previous grades.

Let Ak be an estimate of the expected value based on the first k data points v1,...,vk. A reasonable estimate is the sample average:

Ak=(v1+···+vk)/k

Thus,

kAk = v1+···+vk-1+vk
= (k-1) Ak-1 + vk.

Dividing by k gives

Ak = (1-1/k)Ak-1 + vk/k

Let αk=1/k; then

Ak = (1-αk) Ak-1 + αk vk
= Ak-1 + αk(vk-Ak-1).

The difference, vk-Ak-1, is called the temporal difference error or TD error; it specifies how different the new value, vk, is from the old prediction, Ak-1. The old estimate, Ak-1, is updated by αk times the TD error to get the new estimate, Ak. The qualitative interpretation of the temporal difference formula is that if the new value is higher than the old prediction, increase the predicted value; if the new value is less than the old prediction, decrease the predicted value. The change is proportional to the difference between the new value and the old prediction. Note that this equation is still valid for the first value, k=1: in that case α1=1, so A1=v1 and the initial estimate A0 does not matter.
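
As a small illustration (the function name and grade data below are made up, not code from the book), the recursive update with αk = 1/k reproduces the sample average exactly:

def running_average(values):
    """Incremental average: A_k = A_{k-1} + alpha_k * (v_k - A_{k-1}) with alpha_k = 1/k."""
    A = 0.0                      # A_0; its value is irrelevant because alpha_1 = 1
    for k, v in enumerate(values, start=1):
        alpha = 1 / k            # alpha_k = 1/k recovers the sample average
        A = A + alpha * (v - A)  # move the estimate by alpha_k times the TD error
    return A

grades = [3.7, 3.3, 4.0, 2.7, 3.3]
assert abs(running_average(grades) - sum(grades) / len(grades)) < 1e-12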

This analysis assumes that all of the values have an equal weight. However, suppose you are keeping an estimate of the expected value of students' grades. If schools start giving higher grades, the newer values are more useful for the estimate of the current situation than older grades, and so they should be weighted more in predicting new grades.

In reinforcement learning, the later values of vi (i.e., the more recent values) are more accurate than the earlier values and should be weighted more. One way to weight later examples more is to use the same update, Ak = Ak-1 + α(vk-Ak-1), but with α as a constant (0<α ≤ 1) that does not depend on k. Unfortunately, this does not converge to the average value when there is variability in the values of the sequence, but it can track changes when the underlying process generating the values changes.
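
A minimal sketch of the constant-α case, using made-up grade data in which the underlying level shifts partway through; the exponentially weighted estimate tracks the recent level while the equal-weight average lags behind:

def tracking_average(values, alpha):
    """Exponentially weighted average: a constant alpha weights recent values more."""
    A = values[0]                # start from the first observation
    for v in values[1:]:
        A = A + alpha * (v - A)  # same update, but alpha no longer shrinks with k
    return A

# Illustrative data: older grades around 3.0, newer grades around 3.6.
grades = [3.0, 2.9, 3.1, 3.0, 3.6, 3.7, 3.5, 3.6]
print(tracking_average(grades, alpha=0.5))  # about 3.55, tracks the recent level
print(sum(grades) / len(grades))            # 3.3, the equal-weight average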

You could reduce α more slowly and potentially have the benefits of both approaches: weighting recent observations more and still converging to the average. You can guarantee convergence if

∑k=1^∞ αk = ∞   and   ∑k=1^∞ αk^2 < ∞.

The first condition is to ensure that random fluctuations and initial conditions get averaged out, and the second condition guarantees convergence. Note that guaranteeing convergence to the average is not compatible with being able to adapt to make better predictions when the underlying process generating the values keeps changing.
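
For example, αk = 10/(9+k) satisfies both conditions: it behaves like 10/k for large k, so its sum diverges while the sum of its squares converges, and the early steps stay large enough to wash out the initial estimate. The following sketch, an illustration rather than code from the book, shows this schedule settling near the mean of a noisy but stationary sequence:

import random

def decaying_average(values):
    """Update with alpha_k = 10/(9+k); the sum of alpha_k diverges while the sum of
    alpha_k^2 converges, so the estimate converges to the mean of a stationary sequence."""
    A = 0.0                      # alpha_1 = 1, so this initial value is washed out
    for k, v in enumerate(values, start=1):
        alpha = 10 / (9 + k)
        A = A + alpha * (v - A)
    return A

random.seed(0)
noisy = [5 + random.gauss(0, 1) for _ in range(10000)]  # stationary values with mean 5
print(decaying_average(noisy))   # close to 5, despite the noise in individual values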

For the rest of this chapter, α without a subscript is assumed to be a constant. With a subscript it is a function of the number of cases that have been combined for the particular estimate.