# A.1 Rolling Average

It is common to receive information sequentially and to require a rolling average: the mean of the values seen so far, or the mean of the most recent values.

Suppose there is a sequence of numerical values, $v_{1},v_{2},v_{3},\dots$, and the goal is to predict the mean after the first $k$ values for each $k$. The rolling average, $A_{k}$, is the mean of the first $k$ data points $v_{1},\dots,v_{k}$, namely

 $A_{k}=\frac{v_{1}+\cdots+v_{k}}{k}.$

Thus

 $kA_{k}=v_{1}+\cdots+v_{k-1}+v_{k}=(k-1)A_{k-1}+v_{k}.$

Dividing by $k$ gives

 $A_{k}=\left(1-\frac{1}{k}\right)A_{k-1}+\frac{v_{k}}{k}.$

Let $\alpha_{k}=\frac{1}{k}$; then

 $A_{k}=(1-\alpha_{k})\,A_{k-1}+\alpha_{k}\,v_{k}=A_{k-1}+\alpha_{k}\,(v_{k}-A_{k-1}).$ (A.1)
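Equation (A.1) can be computed incrementally, without storing any past values. A minimal Python sketch (the function name is illustrative, not from the text):

```python
def rolling_average(values):
    """Yield the mean of the first k values for each k,
    using Equation (A.1) with alpha_k = 1/k."""
    average = 0.0
    for k, v in enumerate(values, start=1):
        average += (v - average) / k  # A_k = A_{k-1} + (1/k)(v_k - A_{k-1})
        yield average
```

For example, `list(rolling_average([1, 2, 3]))` gives the running means `[1.0, 1.5, 2.0]`.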

Predicting the mean makes sense if all of the values have an equal weight. However, suppose you are keeping an estimate of the expected price of some item in the grocery store. Prices go up and down in the short term, but tend to increase slowly; the newer prices are more useful for the estimate of the current price than older prices, and so they should be weighted more in predicting new prices.

Suppose, instead, you want to maintain the average of the previous $n$ values. The first $n$ values need to be treated specially. Each value (after the $n$th) has weight $1/n$ in the average. When a new value arrives, the oldest is dropped. If $A_{k-1}$ is the average of the previous $n$ values, the next average is

 $A_{k}=A_{k-1}+\frac{v_{k}-v_{k-n}}{n}.$

To implement this, the $n$ most recent values need to be stored, and the average is sensitive to what happened $n$ steps ago. One way to simplify this is to use the rolling average, $A_{k-1}$, instead of $v_{k-n}$. This results in Equation A.1, but with $\alpha_{k}$ a constant, namely $1/n$.
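The windowed update can be implemented by keeping the $n$ most recent values in a queue; a sketch (names are illustrative):

```python
from collections import deque

def window_average(values, n):
    """Yield the mean of the (at most) n most recent values,
    dropping the oldest value once the window holds n values."""
    window = deque()
    total = 0.0
    for v in values:
        if len(window) == n:
            total -= window.popleft()  # drop v_{k-n}
        window.append(v)
        total += v
        yield total / len(window)
```

This treats the first $n$ values specially by averaging however many values have arrived so far.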

Having $\alpha_{k}=\frac{1}{k}$ averages out noise, but treats all data equally. Having $\alpha_{k}$ constant means more recent data is weighted more, but any noise in the data becomes noise in the rolling average. Using a constant $\alpha$ gives an exponentially-decaying rolling average: item $v_{k-n}$, which is $n$ steps before the current value, has weight $\alpha(1-\alpha)^{n}$ in the average.
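With a constant $\alpha$, Equation (A.1) needs no history at all. A sketch of the exponentially-decaying rolling average (initializing the average to the first value is one common choice, assumed here; the name is illustrative):

```python
def exponential_average(values, alpha):
    """Yield an exponentially-decaying rolling average with constant
    step size alpha, initialized to the first value."""
    average = None
    for v in values:
        # A_k = A_{k-1} + alpha * (v_k - A_{k-1})
        average = v if average is None else average + alpha * (v - average)
        yield average
```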

You could reduce $\alpha_{k}$ more slowly than $1/k$ and potentially have the benefits of both approaches: weighting recent observations more and still converging to the mean. You can guarantee convergence if

 $\sum_{k=1}^{\infty}\alpha_{k}=\infty\mbox{ ~{}~{} and ~{}~{} }\sum_{k=1}^{\infty}\alpha_{k}^{2}<\infty.$

The first condition ensures that random fluctuations and initial conditions get averaged out, and the second condition guarantees convergence.
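As an illustration (not from the text), the schedule $\alpha_{k}=10/(9+k)$ satisfies both conditions, while a constant $\alpha$ violates the second. On a deterministic "noisy" stream that alternates around a mean of 5, the decaying schedule converges to the mean, while the constant one keeps oscillating:

```python
# Values alternate 4, 6, 4, 6, ... so the true mean is 5.
values = [4 if k % 2 == 1 else 6 for k in range(1, 10001)]

decaying = constant = 0.0
for k, v in enumerate(values, start=1):
    decaying += (10 / (9 + k)) * (v - decaying)  # alpha_k = 10/(9+k)
    constant += 0.5 * (v - constant)             # constant alpha = 0.5

# `decaying` ends very close to 5; `constant` settles into a persistent
# oscillation around 5, reflecting the noise in the data.
```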

The rolling average is used for the simple controller of Example 2.3, some of the optimizers for neural networks in Section 8.2, and for reinforcement learning in Section 13.3.