### 11.3.2 Temporal Differences

To understand how reinforcement learning works, first consider how to average experiences that arrive to an agent sequentially.

Suppose there is a sequence of numerical values, *v _{1},v_{2},v_{3},...*,
and the goal is to predict the next value, given all of the previous
values. One way to do this is to have a running approximation of the expected value
of the

*v*'s. For example, given a sequence of students' grades and the aim of predicting the next grade, a reasonable prediction is to predict the average grade.

Let *A _{k}* be an
estimate of the expected value based on the first

*k*data points

*v*. A reasonable estimate is the sample average:

_{1},...,v_{k}A_{k}=(v_{1}+···+v_{k})/k

Thus,

kA _{k}= v _{1}+···+v_{k-1}+v_{k}= (k-1) A _{k-1}+ v_{k}.

Dividing by *k* gives

A_{k}= (1-1/k)A_{k-1}+ v_{k}/k

Let *α _{k}=1/k*; then

A _{k}= (1-α _{k}) A_{k-1}+ α_{k}v_{k}= A _{k-1}+ α_{k}(v_{k}-A_{k-1}).

The difference, *v _{k}-A_{k-1}*, is called the

**temporal difference error**or

**TD error**; it specifies how different the new value,

*v*, is from the old prediction,

_{k}*A*. The old estimate,

_{k-1}*A*, is updated by

_{k-1}*α*times the TD error to get the new estimate,

_{k}*A*. The qualitative interpretation of the temporal difference formula is that if the new value is higher than the old prediction, increase the predicted value; if the new value is less than the old prediction, decrease the predicted value. The change is proportional to the difference between the new value and the old prediction. Note that this equation is still valid for the first value,

_{k}*k=1*.

This analysis assumes that all of the values have an equal weight. However, suppose you are keeping an estimate of the expected value of students' grades. If schools start giving higher grades, the newer values are more useful for the estimate of the current situation than older grades, and so they should be weighted more in predicting new grades.

In reinforcement learning,
the latter values of
*v _{i}* (i.e., the more recent values) are more accurate than the
earlier values and should be weighted more.
One way
to weight later examples more is to use Equation (11.3.2), but with

*α*as a constant (

*0<α ≤ 1*) that does not depend on

*k*. Unfortunately, this does not converge to the average value when variability exists in the values in the sequence, but it can track changes when the underlying process generating the values changes.

You
could reduce *α* more slowly and potentially have the benefits of both approaches: weighting recent observations more and still converging
to the average.
You can guarantee convergence if

∑_{k=1}^{∞}α_{k}= ∞ and ∑_{k=1}^{∞}α_{k}^{2}< ∞.

The first condition is to ensure that random fluctuations and initial conditions get averaged out, and the second condition guarantees convergence. Note that guaranteeing convergence to the average is not compatible with being able to adapt to make better predictions when the underlying process generating the values keeps changing.

For the rest of
this chapter, *α* without a subscript is assumed to be a
constant. With a subscript it is a function of the number of cases
that have been combined for the particular estimate.