8.3 Improving Generalization

When building a new model, it is often useful to ensure initially that the model can fit the training data, and then tackle overfitting.

First make sure the model is learning something. The error on the training set should beat the naive baseline for the loss being evaluated; for example, for squared loss or log loss it should do better than always predicting the mean of the training targets. If it does not, the algorithm is not learning, and the model, architecture, optimization algorithm, step size, or initialization might need to be changed. Try one of the algorithms from the previous chapter, as well as a neural network.
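For squared loss, this sanity check can be sketched as follows (the targets and model outputs are made-up numbers, assuming NumPy):

```python
import numpy as np

# Hypothetical training targets; any regression training set would do.
y_train = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

# Naive baseline for squared loss: always predict the mean of the targets.
baseline_pred = y_train.mean()
baseline_mse = np.mean((y_train - baseline_pred) ** 2)

# A model that is learning should beat this baseline on the training set.
model_pred = np.array([1.2, 2.8, 2.1, 4.7, 4.2])  # made-up model outputs
model_mse = np.mean((y_train - model_pred) ** 2)

print(baseline_mse, model_mse, model_mse < baseline_mse)
```

If `model_mse` is not below `baseline_mse`, the model is not learning anything about the inputs.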

If it beats the naive baseline but does not perform as well as you might expect on the training set, try changing the model. A model will generally not do as well on the test set and on new examples as it does on the training set, so if it is poor on the training set, it will be poor on new cases. Poor performance on the training set is an indication of underfitting: the model is too simple to represent the data. One example of underfitting is using logistic regression for a function that is not linearly separable (see Example 7.11). Try increasing the capacity (e.g., the width and/or depth), but also check other algorithms and parameter settings. It might also be that the input features do not contain enough information to predict the target features, in which case the naive baseline might be the best you can do.

Once you know the algorithm can at least fit the training set, test the error on a validation set. If the validation error does not improve as the algorithm proceeds, the model is fitting noise rather than generalizing. In this case, it is probably overfitting and the model should be simplified. When there is little data, models with more than one or two hidden layers tend to overfit severely. In general, small amounts of data require small models.

At this stage it is useful to carry out hyperparameter tuning using cross validation. Automating hyperparameter tuning, a process known as AutoML, is often the best way to select the hyperparameters.

k-fold cross validation works well when there is little data; when there is a lot of data, a single held-out validation set is usually sufficient.
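A minimal sketch of k-fold cross validation, assuming NumPy and treating the learner and the loss as black-box functions (`fit` and `error` here are placeholders, not a real API):

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Randomly partition n example indices into k folds (a minimal sketch)."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

def cross_validate(X, y, fit, error, k=5):
    """Average validation error of a learner over k folds.
    fit(X, y) returns a predictor; error(pred, X, y) returns a number."""
    folds = kfold_indices(len(y), k)
    errs = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        pred = fit(X[train], y[train])          # train on k-1 folds
        errs.append(error(pred, X[val], y[val]))  # evaluate on the held-out fold
    return float(np.mean(errs))

# Example: evaluate the constant-mean predictor under squared loss.
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 2 * X[:, 0] + 1
fit = lambda X, y: (lambda Xv: np.full(len(Xv), y.mean()))
error = lambda pred, Xv, yv: float(np.mean((pred(Xv) - yv) ** 2))
print(cross_validate(X, y, fit, error, k=5))
```

The same loop can score each candidate hyperparameter setting, with the best-scoring setting then retrained on all the data.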

Some of the hyperparameters that can be tuned include:

  • the algorithm (a decision tree or gradient-boosted trees may be more appropriate than a neural network)

  • number of layers

  • width of each layer

  • number of epochs, to allow for early stopping

  • learning rate

  • batch size

  • regularization parameters; L1 and L2 regularization are often useful when there is little data.
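Tuning such hyperparameters can be as simple as a grid search, evaluating each combination on validation data. The sketch below uses an illustrative grid and a stand-in evaluator in place of real training; in practice `evaluate` would train a model with the given setting and return its validation error:

```python
import itertools

# Hypothetical hyperparameter grid; the names and values are illustrative.
grid = {
    "num_layers": [1, 2, 3],
    "width": [16, 64, 256],
    "learning_rate": [0.1, 0.01, 0.001],
}

def grid_search(grid, evaluate):
    """Return the setting with the lowest validation error.
    evaluate(setting) should train a model and return its validation error."""
    best, best_err = None, float("inf")
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        setting = dict(zip(keys, values))
        err = evaluate(setting)
        if err < best_err:
            best, best_err = setting, err
    return best, best_err

# Stand-in evaluator: pretend deeper/wider helps and big steps hurt slightly.
fake_eval = lambda s: 1.0 / (s["num_layers"] * s["width"]) + s["learning_rate"]
best, err = grid_search(grid, fake_eval)
print(best, err)
```

Grid search grows exponentially in the number of hyperparameters; AutoML systems use smarter search strategies, but the interface is the same: propose a setting, evaluate it, keep the best.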

One effective mechanism for deep networks is dropout, which involves randomly dropping some units during training. Ignoring a unit is equivalent to temporarily setting its output to zero. Dropout is controlled by a parameter rate, which specifies the proportion of values that are zeroed. A common choice is a rate of 0.5 for hidden units and 0.2 for input units. These probabilities are applied to each unit independently for each example in a batch.

This can be implemented by treating dropout as a layer consisting of a single function, as in Figure 8.7. This function is used when learning, but not when making a prediction for a new example.

class Dropout(rate)            ▷ rate is the probability of an input being zeroed
     method output(in)         ▷ in is an array of length ni
          scaling := 1/(1 − rate)
          for each i ∈ [0, ni) do
               mask[i] := 0 with probability rate, else 1
               out[i] := in[i] * mask[i] * scaling
          return out
     method Backprop(error)    ▷ error is an array of length ni
          for each i ∈ [0, ni) do
               ierror[i] := error[i] * mask[i]
          return ierror
Figure 8.7: Pseudocode for dropout

If rate = 0.5, only half of the values are kept, so the sum of the outputs would be, on average, half of what it would be without dropout. To make the learner with dropout comparable to the one without, the outputs that are kept should be doubled. In general, the output needs to be scaled by 1/(1 − rate), which is the value of scaling in Figure 8.7.
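The same scheme as Figure 8.7 can be sketched vectorized in NumPy (the function names and array sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(x, rate):
    """Dropout as in Figure 8.7: zero each unit with probability rate,
    and scale the surviving units by 1/(1 - rate)."""
    mask = (rng.random(x.shape) >= rate).astype(x.dtype)
    scaling = 1.0 / (1.0 - rate)
    return x * mask * scaling, mask

def dropout_backprop(error, mask):
    """Gradients flow only through the units that were kept."""
    return error * mask

x = np.ones(10000)
out, mask = dropout_forward(x, rate=0.5)
# Because of the scaling, the sum of the outputs is close to the
# sum without dropout, on average.
print(out.sum() / x.sum())
```

As in Figure 8.7, the forward function is used only during learning; a prediction for a new example uses the inputs directly, with no mask and no scaling.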

Improving the algorithm is not the only way to improve the prediction. Other methods that are useful for building a better model include the following.

  • Collecting more data is sometimes the most cost-effective way to improve a model.

  • Sometimes more data can be obtained by data augmentation: generating new examples from the existing data. For recognizing objects in images, for example, the image can be translated, scaled, or rotated (but be careful this does not change the class, e.g., a rotated “6” might become a “9”), noise can be added, or the context can be changed (e.g., once you have a picture of a cat, put the cat in different contexts).
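As a sketch, label-preserving augmentations for a 2D image array might look as follows (the particular transformations and magnitudes are illustrative; note that a horizontal flip preserves the class for objects like cats but not for digits):

```python
import numpy as np

def augment(image, rng):
    """Return a new training example from an existing 2D image array:
    a random horizontal flip, a small translation, and additive noise."""
    out = image
    if rng.random() < 0.5:
        out = out[:, ::-1]                      # horizontal flip
    shift = int(rng.integers(-2, 3))            # translate by up to 2 pixels
    out = np.roll(out, shift, axis=1)
    out = out + rng.normal(0, 0.01, out.shape)  # small pixel noise
    return out

rng = np.random.default_rng(1)
img = np.zeros((8, 8))
img[2:6, 3:5] = 1.0        # a toy "image" with a bright rectangle
aug = augment(img, rng)
print(img.shape == aug.shape)
```

Each call produces a slightly different example with the same label, effectively multiplying the size of the training set.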

  • Feature engineering is still useful when the data is limited or there are limited resources for training. For example, there are many representations for the positions of the hands on a clock, and some are much easier to learn with than others; it is much easier to learn the time from the angles of the hands than from an image of the clock.

  • It is sometimes easier to learn a model for a task for which there is limited data, by sharing the lower-level features among multiple tasks. The lower-level features then have many more examples to learn from than they would with any single task. This is known as multi-task learning. In a neural network, it can be achieved by sharing the lower layers (those closest to the inputs), with the different tasks having their own higher layers. When learning a task, all of the weights for the units used for the task (including the shared lower-level units) are updated. An alternative, which is used when one task has already been learned, is for a new task with limited data to use the lower-level features of the original task and only learn higher-level features for the new task. This is explored more in Section 8.5.5.
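The shared-lower-layer structure (without the training loop) might be sketched as follows, with illustrative sizes and task names:

```python
import numpy as np

rng = np.random.default_rng(0)

# One shared lower layer, plus a private head (higher layer) per task.
n_in, n_hidden = 4, 8
shared_W = rng.normal(0, 0.1, (n_in, n_hidden))   # shared by all tasks
heads = {t: rng.normal(0, 0.1, (n_hidden, 1)) for t in ("task_a", "task_b")}

def predict(x, task):
    h = np.maximum(0, x @ shared_W)   # shared lower-level features (ReLU)
    return h @ heads[task]            # task-specific higher layer

x = rng.normal(0, 1, (3, n_in))
ya = predict(x, "task_a")
yb = predict(x, "task_b")
print(ya.shape, yb.shape)
```

In multi-task learning, training an example for one task would update both `shared_W` and that task's head; in the transfer setting described above, `shared_W` is kept fixed from the original task and only the new task's head is learned.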