foundations of computational agents
Ockham’s razor specifies that we should prefer simpler models over more complex models. Instead of just optimizing fit-to-data, as done in Section 7.2.1, we can optimize fit-to-data plus a term that rewards simplicity and penalizes complexity. The penalty term is a regularizer.
The typical form for a regularizer is to find a hypothesis to minimize:
where the is the error of example for hypothesis , which specifies how well hypothesis fits example . The regularization parameter, , trades off fit-to-data and model simplicity, and is a penalty term that penalizes complexity or deviation from the mean. Notice that as the number of examples increases the leftmost sum tends to dominate and the regularizer has little effect. The regularizer has most effect when there are few examples. The regularization parameter is needed because the error and complexity terms are typically in different units. The regularization parameter can be chosen by prior knowledge, past experience with similar problems or by cross validation.
For example, in learning a decision tree one complexity measure is the number of splits in the decision tree (which is one less than the number of leaves for a binary decision tree). When building a decision tree, we could optimize the sum-of-squares error plus a function of the size of the decision tree, minimizing
where is the number of splits in the tree. When splitting, a single split is worthwhile if it reduces the sum-of-squares error by .
For models where there are real-valued parameters, an regularizer, penalizes the sum of squares of the parameters. To optimize the sum-of-squares error for linear regression with an regularizer, minimize
which is known as ridge regression.
An regularization is implemented by adding
after line 18 of Figure 7.8 or after line 18 of Figure 7.10 (in the scope of both “for each”). This divides by the number of examples () because it is carried out once for each example. It is also possible to regularize after each iteration through all of the examples, in which case the regularizer should not divide by the number of examples. Note that does not change as so should be computed once and stored.
An regularizer adds a penalty for the sum of the absolute values of the parameters.
Adding an regularizer to the log loss entails minimizing
The partial derivative of the sum of absolute values with respect to is the sign of , either or (defined as ), at every point except at 0. We do not need to make a step at 0, because the value is already a minimum. To implement an regularizer, each parameter is moved towards zero by a constant, except if that constant would change the sign of the parameter, in which case the parameter becomes zero. Thus, an regularizer can be incorporated into the logistic regression gradient descent algorithm of Figure 7.10 by adding after line 18 (in the scope of both “for each”):
This is called iterative soft-thresholding and is a special case of the proximal-gradient method.
An regularizer when there are many features tends to make many weights zero, which means the corresponding feature is ignored. This is a way to implement feature selection. An regularizer tends to make all of the parameters smaller, but not zero.