# 8.2 Improved Optimization

Stochastic gradient descent is the workhorse for parameter learning in neural networks. However, setting the step size is challenging. The structure of the multidimensional search space – the error as a function of the parameter settings – is complex. For a model with, say, a million parameters, the search space has a million dimensions, and the local structure can vary across the dimensions.

One particular challenge is a canyon, where the error function is U-shaped in one dimension while another dimension has a downhill slope. Another is a saddle point, where a minimum in one dimension meets a maximum in another. Close to a saddle point, moving along the dimension that is at a maximum, the surface forms a canyon, so a way to handle canyons should also help with saddle points.

###### Example 8.4.

Figure 8.6 depicts a canyon with two parameters. The error is plotted as a function of parameters $x$ and $y$. In the $x$-direction the error has a steep valley, but in the $y$-direction there is a gentle slope. A large step size would keep jumping from side to side of the valley, perhaps diverging. A small step size is needed for the $x$-value to decrease. However, a small step size in the $y$-direction means very slow convergence. Ideally, you would like small steps in the $x$-direction and large steps in the $y$-direction.
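The behavior in Example 8.4 can be sketched numerically. The surface below is hypothetical (not the one in Figure 8.6), chosen only so that the $x$-direction is steep and the $y$-direction is gentle; `descend` is plain gradient descent with a single step size:

```python
def grad(x, y):
    """Gradient of the made-up error E(x, y) = 10*x**2 + 0.1*y**2."""
    return 20 * x, 0.2 * y

def descend(eta, steps=100, x=1.0, y=10.0):
    """Plain gradient descent with a single step size eta."""
    for _ in range(steps):
        gx, gy = grad(x, y)
        x -= eta * gx
        y -= eta * gy
    return x, y

# A large step size diverges in the steep x-direction
# (the per-step factor |1 - 0.11*20| = 1.2 exceeds 1):
x_big, _ = descend(eta=0.11)
# A small step size converges in x but barely moves y:
x_small, y_small = descend(eta=0.01)
print(abs(x_big) > 1e3, abs(x_small) < 1e-6, y_small > 8.0)
```

With $\eta=0.01$, the $y$-value only drops from 10 to about 8.2 after 100 steps: no single step size serves both dimensions.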

There are two main approaches to adapting step size to handle canyons and related problems. Both have a separate step size for each parameter. They can be combined. The intuition behind each is:

• If the sign of the gradient doesn’t change, the step size can be larger; if the sign keeps changing, the step size should be smaller.

• Each weight update should follow the direction of the gradient, and the magnitude should depend on whether the gradient is more or less than its historic value.

## 8.2.1 Momentum

The momentum for each parameter acts like a velocity of the update, with the standard stochastic gradient descent update acting like an acceleration. The momentum serves as the step for the parameter: it is increased if the acceleration has the same sign as the momentum and is decreased if it has the opposite sign.

To implement momentum, a velocity $v[i,j]$ is stored for each weight $w[i,j]$. Hyperparameter $\alpha$, with $0\leq\alpha<1$, specifies how much of the momentum should be used for each batch. The $update$ method for $Dense$ in Figure 8.3 becomes

1: method update()$\triangleright$ update all weights
2:      for each $i,j$ do
3:          $v[i,j]\;{:}{=}\;\mbox{}\alpha*v[i,j]-\eta/batch\_size*d[i,j]$
4:          $w[i,j]\;{:}{=}\;\mbox{}w[i,j]+v[i,j]$
5:          $d[i,j]\;{:}{=}\;\mbox{}0$.
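As a minimal sketch of this update, the following tracks a single weight with a constant gradient (both values are made up, with $batch\_size=1$):

```python
alpha, eta = 0.9, 0.1          # hyperparameters (illustrative values)
v, w, g = 0.0, 0.0, 1.0        # velocity, weight, constant gradient d[i,j]

for _ in range(200):
    v = alpha * v - eta * g    # line 3: accumulate momentum
    w = w + v                  # line 4: step by the velocity

# With a consistent gradient, the speed settles at eta/(1 - alpha) times
# the plain step size eta*g: here 0.1/0.1 = 1.0 per update.
print(abs(abs(v) - eta * g / (1 - alpha)) < 1e-6)
```

The weight moves opposite to the positive gradient, and each update is up to $1/(1-\alpha)$ times larger than it would be without momentum.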

The change in weight is not necessarily in the direction of the gradient of the error, because the momentum might be too great.

The value of $\alpha$ affects how much the step size can increase. Common values for $\alpha$ are 0.5, 0.9, and 0.99. If the gradients have the same value, the step size will approach $\sum_{i=0}^{\infty}\alpha^{i}=1/(1-\alpha)$ times the step size without momentum. If $\alpha$ is 0.5, the step size could double. If $\alpha$ is 0.9, the step size could increase up to 10 times, and $\alpha=0.99$ allows the step size to increase by up to 100 times.

###### Example 8.5.

Consider the error surface of Figure 8.6, described in Example 8.4. In the $y$-direction, the gradients are consistent, and so momentum will increase the step size, up to $1/(1-\alpha)$ times. In the $x$-direction, as the valley is crossed, the sign of the gradient changes, and the momentum, and so the steps, get smaller. This means that it eventually makes large steps in the $y$-direction, but small steps in the $x$-direction, enabling it to move down the canyon.

As well as handling canyons, momentum can average out noise due to the batches not including all the examples.

## 8.2.2 RMS-Prop

RMS-Prop (root mean squared propagation) is the default optimizer for the Keras deep learning library. The idea is that the magnitude of the change in each weight depends on how (the square of) the gradient for that weight compares to its historic value, rather than depending on the absolute value of the gradient. For each weight, a rolling average of the square of the gradient is maintained. This is used to determine how much the weight should change. A correction to avoid numeric instability of dividing by approximately zero is also used.

The hyperparameters for RMS-Prop are the learning rate $\eta$ (with a default of $0.001$ in Keras), $\rho$ (with a default of $0.9$ in Keras) that controls the time horizon of the rolling average, and $\epsilon$ (defaults to $10^{-7}$ in Keras) to ensure numerical stability. The algorithm below maintains a rolling average of the square of the gradient for $w[i,j]$ in $r[i,j]$. The $update$ method for $Dense$ in Figure 8.3 becomes

1: method update()$\triangleright$ update weights
2:      for each $i,j$ do
3:          $g\;{:}{=}\;\mbox{}d[i,j]/batch\_size$
4:          $r[i,j]\;{:}{=}\;\mbox{}\rho*r[i,j]+(1-\rho)*g^{2}$
5:          $\displaystyle w[i,j]\;{:}{=}\;\mbox{}w[i,j]-\frac{\eta*g}{\sqrt{r[i,j]+% \epsilon}}$
6:          $d[i,j]\;{:}{=}\;\mbox{}0$ .
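A scalar sketch of this update, with the Keras-style defaults above ($batch\_size=1$; the gradient sequence is made up for illustration):

```python
eta, rho, eps = 0.001, 0.9, 1e-7
r, w = 0.0, 1.0

steps = []
for g in [0.5] * 50:                # consistent gradients
    r = rho * r + (1 - rho) * g**2  # rolling average of the squared gradient
    step = eta * g / (r + eps) ** 0.5
    w -= step
    steps.append(step)

# Once r has settled near g**2, each step has magnitude close to eta,
# regardless of how large or small g itself is.
print(abs(steps[-1] - eta) < 1e-4)
```

Note that the first steps overshoot $\eta$, because $r[i,j]$ starts at 0 and so underestimates the average; this initialization bias is what Adam corrects for.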

To understand the algorithm, assume the value of $r[i,j]$ is initially much bigger than $\epsilon$, so that $r[i,j]+\epsilon\approx r[i,j]$.

• When $r[i,j]\approx g^{2}$, the ratio ${g}/{\sqrt{r[i,j]+\epsilon}}$ is approximately 1 or $-1$, depending on the sign of $g$, so the magnitude of the change in weight is approximately $\eta$.

• When $g^{2}$ is bigger than $r[i,j]$, the gradient has a larger magnitude than its historic value, $r[i,j]$ is increased, and the step size increases.

• When $g^{2}$ is smaller than $r[i,j]$, the value $r[i,j]$ is decreased, and the step size decreases. When a local minimum is approached, the values of $g^{2}$ and $r[i,j]$ become smaller than $\epsilon$, so the updates are dominated by $\epsilon$, and the steps are small.

RMS-Prop only affects the magnitude of the step; the direction of change of $w[i,j]$ is always opposite to $d[i,j]$.

###### Example 8.6.

Consider the error surface of Figure 8.6, described in Example 8.4. In the $y$-direction, the gradients are consistent, and so $g^{2}$ will be approximately the same as $r[i,j]$, the rolling average of the squared gradient; assuming $r[i,j]$ is much greater than $\epsilon$, the change in $w[i,j]$ will have magnitude approximately $\eta$.

In the $x$-direction, it might start with a large gradient, but when it encounters a flatter region, $g^{2}$ becomes less than the rolling average $r[i,j]$, so the step size is reduced. As $g$ for that parameter becomes very small, the steps get very small because they are less than their historical value, eventually becoming dominated by $\epsilon$.

## 8.2.3 Adam

Adam, for “adaptive moments”, is an optimizer that uses both momentum and the square of the gradient, similar to RMS-Prop. It also corrects the rolling averages to account for the fact that they are initialized at 0, which is not a good estimate to average with. Other mixes of RMS-Prop and momentum are also common.

It takes as hyperparameters: the learning rate ($\eta$), with a default of 0.001; $\beta_{1}$, with a default of 0.9; $\beta_{2}$, with a default of 0.999; and $\epsilon$, with a default of $10^{-7}$. (Names and defaults are consistent with Keras.)

Adam takes the gradient, $d[i,j]/batch\_size$, for the weight $w[i,j]$ and maintains a rolling average of the gradient in $s[i,j]$, a rolling average of the square of the gradient in $r[i,j]$, and the step number, $t$. These are all initialized to zero. To implement Adam, the $update$ method for $Dense$ in Figure 8.3 becomes:

1: method update()$\triangleright$ update weights
2:      $t\;{:}{=}\;\mbox{}t+1$
3:      for each $i,j$ do
4:          $g\;{:}{=}\;\mbox{}d[i,j]/batch\_size$
5:          $s[i,j]\;{:}{=}\;\mbox{}\beta_{1}*s[i,j]+(1-\beta_{1})*g$
6:          $r[i,j]\;{:}{=}\;\mbox{}\beta_{2}*r[i,j]+(1-\beta_{2})*g^{2}$
7:          $\displaystyle w[i,j]\;{:}{=}\;\mbox{}w[i,j]-\frac{\eta*s[i,j]/(1-\beta_{1}^{t}% )}{\sqrt{r[i,j]/(1-\beta_{2}^{t})}+\epsilon}$
8:          $d[i,j]\;{:}{=}\;\mbox{}0$.
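A scalar sketch of this update, with the stated defaults ($batch\_size=1$; the constant gradient is made up for illustration):

```python
eta, b1, b2, eps = 0.001, 0.9, 0.999, 1e-7
s = r = 0.0
w, t = 1.0, 0

first_step = None
for g in [0.3] * 5:
    t += 1
    s = b1 * s + (1 - b1) * g        # rolling average of the gradient
    r = b2 * r + (1 - b2) * g**2     # rolling average of its square
    s_hat = s / (1 - b1**t)          # bias-corrected: equals g when t = 1
    r_hat = r / (1 - b2**t)
    step = eta * s_hat / (r_hat**0.5 + eps)
    if first_step is None:
        first_step = step
    w -= step

# Because of the bias correction, even the very first step has magnitude
# close to eta (here s_hat / sqrt(r_hat) = g / |g| = 1 at t = 1):
print(abs(first_step - eta) < 1e-4)
```

Unlike RMS-Prop, the first update is not inflated by the zero initialization of the rolling averages.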

The weight update step (line 7) is like RMS-Prop’s but uses $s[i,j]$ in the numerator instead of the gradient, corrected by dividing by $1-\beta_{1}^{t}$. Here the subscripts on $\beta_{1}$ and $\beta_{2}$ are part of the names, while the superscript $t$ is a power. In the first update, when $t=1$, $s[i,j]/(1-\beta_{1}^{t})$ is equal to $g$; subsequent updates are corrected similarly. It also corrects $r[i,j]$ by dividing by $1-\beta_{2}^{t}$. Note that $\epsilon$ is inside the square root in RMS-Prop, but outside it in Adam.

## 8.2.4 Initialization

For convergence, it helps to normalize the input parameters, and sensibly initialize the weights.

For example, consider a dataset of people with features including height (in cm), the number of steps walked in a day, and a Boolean which specifies whether they have passed high school. These use very different units, and might have very different utility in predictions. To make learning independent of the units, it is typical to normalize each real-valued feature by subtracting the mean, so the result has a mean of 0, and dividing by the standard deviation, so the result has a standard deviation and variance of 1. Each feature is scaled independently.
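A sketch of this standardization for one feature (the heights are made up):

```python
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]

n = len(heights_cm)
mean = sum(heights_cm) / n
var = sum((h - mean) ** 2 for h in heights_cm) / n
std = var ** 0.5

# Subtract the mean and divide by the standard deviation:
normalized = [(h - mean) / std for h in heights_cm]

# The result has mean 0 and variance 1, whatever the original units were.
mean_norm = sum(normalized) / n
var_norm = sum(z**2 for z in normalized) / n
print(abs(mean_norm) < 1e-9, abs(var_norm - 1.0) < 1e-9)
```

The same transformation, with the mean and standard deviation computed on the training set, is applied to each real-valued feature separately.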

Categorical inputs are usually represented as indicator variables, so that categorical variable $X$ with domain $\{v_{1},\dots,v_{k}\}$ is represented as $k$ inputs, $X_{1},\dots,X_{k}$. An input example with $X=v_{j}$ is represented with $X_{j}=1$ and every other $X_{j^{\prime}}=0$. This is also called a one-hot encoding.
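A sketch of this encoding (the domain values are made up):

```python
domain = ["red", "green", "blue"]   # the domain v_1, ..., v_k of X

def one_hot(value, domain):
    """Return the k indicator inputs X_1, ..., X_k for X = value."""
    return [1 if v == value else 0 for v in domain]

print(one_hot("green", domain))   # → [0, 1, 0]
```

Exactly one of the $k$ indicator inputs is 1 for any value in the domain.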

The weights for the linear functions in hidden layers cannot all be assigned the same value, otherwise they will converge in lock-step, never learning different features. If the $w[i,j]$ parameters are initialized to the same value, all of the units at one layer will implement the same function. Thus they need to be initialized to random values. In Keras, the default initializer is the Glorot uniform initializer, which chooses values uniformly in the interval between $\pm\sqrt{6/(n_{i}+n_{o})}$, where $n_{i}$ is the number of inputs and $n_{o}$ is the number of outputs for the linear function. This has a theoretical basis and tends to work well in practice for many cases. The biases (the weights multiplied by input unit 1) are typically initialized to 0.
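A sketch of this initializer for one layer (the layer sizes are made up, and the draws are random):

```python
import random

def glorot_uniform(n_in, n_out):
    """Weights uniform in [-limit, limit] with limit = sqrt(6/(n_in + n_out))."""
    limit = (6.0 / (n_in + n_out)) ** 0.5
    return [[random.uniform(-limit, limit) for _ in range(n_out)]
            for _ in range(n_in)]

w = glorot_uniform(n_in=4, n_out=3)   # random weights, all within the limit
b = [0.0] * 3                         # biases initialized to 0

limit = (6.0 / 7) ** 0.5
print(all(-limit <= wij <= limit for row in w for wij in row))
```

Because the draws are random, the units start out computing different functions and can learn different features.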

For the output units, the bias can be initialized to the value that would predict best when all of the other inputs are zero. This is the mean for regression, or the inverse-sigmoid of the empirical probability for binary classification. The other output weights can be set to 0. This allows the gradient descent to learn the signal from the inputs, and not the background average.
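A sketch of this output-bias initialization (the training targets are made up):

```python
import math

# Regression: initialize the output bias to the mean target.
regression_targets = [2.0, 4.0, 6.0]
bias_regression = sum(regression_targets) / len(regression_targets)

# Binary classification: initialize the bias to the inverse sigmoid
# (logit) of the empirical probability of the positive class.
binary_labels = [1, 0, 1, 1]                 # empirical probability 0.75
p = sum(binary_labels) / len(binary_labels)
bias_binary = math.log(p / (1 - p))

# Check: applying the sigmoid to the bias recovers the empirical probability.
sigmoid = 1 / (1 + math.exp(-bias_binary))
print(bias_regression, round(sigmoid, 6))   # → 4.0 0.75
```

With the other output weights set to 0, the network initially predicts the background average, and gradient descent only has to learn the signal from the inputs.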