foundations of computational agents
The third edition of Artificial Intelligence: foundations of computational agents, Cambridge University Press, 2023 is now available (including full text).
Ockham’s razor specifies that we should prefer simpler models over more complex models. Instead of just optimizing fit-to-data, as done in Section 7.2.1, we can optimize fit-to-data plus a term that rewards simplicity and penalizes complexity. The penalty term is a regularizer.
The typical form for a regularizer is to find a hypothesis to minimize:
(7.4) |
where the is the error of example for hypothesis , which specifies how well hypothesis fits example . The regularization parameter, , trades off fit-to-data and model simplicity, and is a penalty term that penalizes complexity or deviation from the mean. Notice that as the number of examples increases the leftmost sum tends to dominate and the regularizer has little effect. The regularizer has most effect when there are few examples. The regularization parameter is needed because the error and complexity terms are typically in different units. The regularization parameter can be chosen by prior knowledge, past experience with similar problems or by cross validation.
For example, in learning a decision tree one complexity measure is the number of splits in the decision tree (which is one less than the number of leaves for a binary decision tree). When building a decision tree, we could optimize the sum-of-squares error plus a function of the size of the decision tree, minimizing
where is the number of splits in the tree. When splitting, a single split is worthwhile if it reduces the sum-of-squares error by .
For models where there are real-valued parameters, an regularizer, penalizes the sum of squares of the parameters. To optimize the sum-of-squares error for linear regression with an regularizer, minimize
which is known as ridge regression.
An regularization is implemented by adding
after line 18 of Figure 7.8 or after line 18 of Figure 7.10 (in the scope of both “for each”). This divides by the number of examples () because it is carried out once for each example. It is also possible to regularize after each iteration through all of the examples, in which case the regularizer should not divide by the number of examples. Note that does not change as so should be computed once and stored.
An regularizer adds a penalty for the sum of the absolute values of the parameters.
Adding an regularizer to the log loss entails minimizing
The partial derivative of the sum of absolute values with respect to is the sign of , either or (defined as ), at every point except at 0. We do not need to make a step at 0, because the value is already a minimum. To implement an regularizer, each parameter is moved towards zero by a constant, except if that constant would change the sign of the parameter, in which case the parameter becomes zero. Thus, an regularizer can be incorporated into the logistic regression gradient descent algorithm of Figure 7.10 by adding after line 18 (in the scope of both “for each”):
This is called iterative soft-thresholding and is a special case of the proximal-gradient method.
An regularizer when there are many features tends to make many weights zero, which means the corresponding feature is ignored. This is a way to implement feature selection. An regularizer tends to make all of the parameters smaller, but not zero.
Regularization, Pseudocounts and Probabilistic Mixtures
Consider the simplest case of a learner that takes a sequence of examples , with no input features. Suppose you regularize to some default value , and so penalize the difference from . The regularizers in Section 7.4.2 regularize to 0.
Consider the following programs that do stochastic gradient descent with regularization in different ways. Each takes in the data set , the value for , the learning rate and the regularization parameter .
procedure ()The programs differ as to whether the regularization happens for each element of the data set or for the whole data set at each iteration.
Program minimizes which is minimal when Program minimizes which is minimal whenProgram is equivalent to having a pseudocount with extra examples, each with value .
Program is equivalent to a probabilistic mixture of and the average of the data.
For a fixed number of examples, , these can be mapped into each other; for is for divided by . They act differently when the number of examples varies, for example, in cross validation, when using a single for multiple data sets, or in more complicated cases such as collaborative filtering.
For a fixed , with varying, they are qualitatively different. In , as the number of examples increases the regularization gets less and less important. In , has the same effect on the prediction, no matter what is. Using the strategy of is appropriate if the examples are independent of each other, where it is appropriate that enough examples will dominate any prior model. The strategy of may appropriate if there is some chance the whole data set is misleading.