The third edition of Artificial Intelligence: foundations of computational agents, Cambridge University Press, 2023 is now available (including full text).

# 7.4.2 Regularization

Ockham’s razor specifies that we should prefer simpler models over more complex models. Instead of just optimizing fit-to-data, as done in Section 7.2.1, we can optimize fit-to-data plus a term that rewards simplicity and penalizes complexity. The penalty term is a regularizer.

Regularization typically involves finding a hypothesis $h$ that minimizes:

 $\left(\sum_{e}error(e,h)\right)+\lambda*regularizer(h)$ (7.4)

where $error(e,h)$ is the error of example $e$ for hypothesis $h$, which specifies how well hypothesis $h$ fits example $e$. The regularization parameter, $\lambda$, trades off fit-to-data and model simplicity, and $regularizer(h)$ is a penalty term that penalizes complexity or deviation from the mean. Notice that as the number of examples increases, the leftmost sum tends to dominate and the regularizer has little effect. The regularizer has the most effect when there are few examples. The regularization parameter is needed because the error and complexity terms are typically in different units. The regularization parameter can be chosen by prior knowledge, past experience with similar problems, or cross validation.

For example, in learning a decision tree one complexity measure is the number of splits in the decision tree (which is one less than the number of leaves for a binary decision tree). When building a decision tree, we could optimize the sum-of-squares error plus a function of the size of the decision tree, minimizing

 $\left(\sum_{e\in E}({Y}({e})-\widehat{Y}({e}))^{2}\right)+\lambda*|tree|$

where $|tree|$ is the number of splits in the tree. When splitting, a single split adds 1 to $|tree|$, so it is worthwhile only if it reduces the sum-of-squares error by more than $\lambda$.
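As a minimal sketch of this criterion, the following Python compares the error before and after one candidate split. The function names `sse` and `split_is_worthwhile` are illustrative, not from the text, and the sketch assumes each leaf predicts the mean of its target values, which is the prediction that minimizes sum-of-squares error.

```python
def sse(ys):
    """Sum-of-squares error when predicting the mean of ys."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def split_is_worthwhile(left_ys, right_ys, lam):
    """One split adds 1 to |tree|, costing lam; it pays off only if
    the error drops by more than lam."""
    before = sse(left_ys + right_ys)      # one leaf holding all examples
    after = sse(left_ys) + sse(right_ys)  # two leaves after the split
    return before - after > lam


# The split below removes an error of 4, so it is kept for lam=1
# and rejected for lam=5.
print(split_is_worthwhile([1.0, 1.0], [3.0, 3.0], lam=1.0))
print(split_is_worthwhile([1.0, 1.0], [3.0, 3.0], lam=5.0))
```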

For models where there are real-valued parameters, an $L_{2}$ regularizer penalizes the sum of squares of the parameters. To optimize the sum-of-squares error for linear regression with an $L_{2}$ regularizer, minimize

 $\displaystyle\left(\sum_{e\in E}\left({Y}({e})-\sum_{i=0}^{n}w_{i}*{X_{i}}({e})\right)^{2}\right)+\lambda\left(\sum_{i=0}^{n}w_{i}^{2}\right)$

which is known as ridge regression.
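To make the effect of $\lambda$ concrete, here is a sketch of the simplest special case: a single weight $w$ and no intercept, so the prediction is $w*x$. Setting the derivative of the objective to zero gives $w=\sum xy/(\sum x^{2}+\lambda)$. The function names and the data are assumptions made for illustration.

```python
def ridge_objective(w, xs, ys, lam):
    """Sum-of-squares error plus the L2 penalty, for one weight."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) + lam * w ** 2


def ridge_weight(xs, ys, lam):
    """Minimizer of ridge_objective, from setting its derivative to zero."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]   # generated by y = 2x, so the unregularized fit is w = 2
for lam in (0.0, 1.0, 14.0):
    print(lam, ridge_weight(xs, ys, lam))
```

With $\lambda=0$ the weight is 2; larger values of $\lambda$ pull it toward 0 (for $\lambda=14$ it is 1).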

To optimize the log loss error for logistic regression with an $L_{2}$ regularizer, minimize

 $\displaystyle-\left(\sum_{e\in E}\left({Y}({e})\log\widehat{Y}({e})+(1-{Y}({e}))\log(1-\widehat{Y}({e}))\right)\right)+\lambda\left(\sum_{i=0}^{n}w_{i}^{2}\right)$

where $\widehat{Y}({e})=sigmoid\left(\sum_{i=0}^{n}w_{i}*{X_{i}}({e})\right)$.

An $L_{2}$ regularization is implemented by adding

 $\displaystyle w_{i}\;{:}{=}\;\mbox{}w_{i}-\eta*(\lambda/|E|)*w_{i}$

after line 18 of Figure 7.8 or after line 18 of Figure 7.10 (in the scope of both “for each”). This divides by the number of examples ($|E|$) because it is carried out once for each example. It is also possible to regularize after each iteration through all of the examples, in which case the regularizer should not divide by the number of examples. Note that $\eta*\lambda/|E|$ does not change between iterations, so it should be computed once and stored.
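The placement of this update can be sketched in Python for linear regression. This is not a transcription of Figure 7.8: the function name, the data layout (each example is a pair of a feature list and a target, with a constant 1 as the first feature for $w_{0}$), and the fixed number of epochs are all assumptions. Constant factors of 2 from the derivatives are taken to be absorbed into $\eta$.

```python
def sgd_linear_l2(examples, eta, lam, epochs):
    """Stochastic gradient descent for linear regression with an
    L2 step applied once per example and per weight."""
    n_weights = len(examples[0][0])
    w = [0.0] * n_weights
    decay = eta * lam / len(examples)  # constant, so computed once
    for _ in range(epochs):
        for xs, y in examples:
            pred = sum(wi * xi for wi, xi in zip(w, xs))
            err = y - pred
            for i in range(n_weights):
                w[i] += eta * err * xs[i]  # step on the squared error
                w[i] -= decay * w[i]       # L2 regularization step
    return w


# Hypothetical data generated by y = 2x, with a leading 1 for w_0.
data = [([1.0, float(x)], 2.0 * x) for x in range(4)]
print(sgd_linear_l2(data, eta=0.01, lam=0.0, epochs=5000))
print(sgd_linear_l2(data, eta=0.01, lam=4.0, epochs=5000))
```

With `lam=0.0` the weights approach the exact fit (0 and 2); with `lam=4.0` they settle at values with a smaller sum of squares.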

An $L_{1}$ regularizer adds a penalty for the sum of the absolute values of the parameters.

Adding an $L_{1}$ regularizer to the log loss entails minimizing

 $\displaystyle-\left(\sum_{e\in E}\left({Y}({e})\log\widehat{Y}({e})+(1-{Y}({e}))\log(1-\widehat{Y}({e}))\right)\right)+\lambda\left(\sum_{i=0}^{n}\left|w_{i}\right|\right).$

The partial derivative of the sum of absolute values with respect to $w_{i}$ is the sign of $w_{i}$, either $1$ or $-1$ (defined as $sign(w_{i})=w_{i}/|w_{i}|$), at every point except at 0. We do not need to make a step at 0, because the value is already a minimum. To implement an $L_{1}$ regularizer, each parameter is moved towards zero by a constant, except if that constant would change the sign of the parameter, in which case the parameter becomes zero. Thus, an $L_{1}$ regularizer can be incorporated into the logistic regression gradient descent algorithm of Figure 7.10 by adding after line 18 (in the scope of both “for each”):

 $\displaystyle w_{i}\;{:}{=}\;\mbox{}sign(w_{i})*max(0,\>|w_{i}|-\eta*\lambda/|E|)$

This is called iterative soft-thresholding and is a special case of the proximal-gradient method.
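The update itself is small enough to sketch directly in Python. The function name `soft_threshold` and the argument `step` (standing for $\eta*\lambda/|E|$) are illustrative choices, not from the text. The case $w_{i}=0$ is handled explicitly because $w_{i}/|w_{i}|$ is undefined there.

```python
def soft_threshold(w, step):
    """Move w toward zero by step, stopping at zero instead of
    crossing it."""
    magnitude = max(0.0, abs(w) - step)
    if magnitude == 0.0:
        return 0.0  # also covers w == 0, where sign(w) is undefined
    return magnitude if w > 0 else -magnitude


eta, lam, n_examples = 0.1, 5.0, 5     # hypothetical settings
step = eta * lam / n_examples          # 0.1, computed once
for w in (0.5, -0.5, 0.05, -0.05, 0.0):
    print(w, soft_threshold(w, step))
```

Weights with magnitude above `step` shrink by `step`; weights within `step` of zero become exactly zero.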

An $L_{1}$ regularizer when there are many features tends to make many weights zero, which means the corresponding feature is ignored. This is a way to implement feature selection. An $L_{2}$ regularizer tends to make all of the parameters smaller, but not zero.
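The difference can be seen by applying only the regularization steps repeatedly to the same starting weight, ignoring the error term. This isolates the regularizers and is not a full learner; the function names, starting weight, and step sizes are assumptions.

```python
def l1_step(w, step):
    """L1 step: subtract a constant, clamping at zero."""
    magnitude = max(0.0, abs(w) - step)
    if magnitude == 0.0:
        return 0.0
    return magnitude if w > 0 else -magnitude


def l2_step(w, rate):
    """L2 step: shrink by a constant fraction of the current value."""
    return w - rate * w


w1 = w2 = 0.5
for _ in range(100):
    w1 = l1_step(w1, 0.01)
    w2 = l2_step(w2, 0.01)
print(w1, w2)
```

After 100 steps the $L_{1}$ weight is exactly 0, while the $L_{2}$ weight has decayed geometrically to about 0.18 and would never reach 0.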

Regularization, Pseudocounts and Probabilistic Mixtures

Consider the simplest case of a learner that takes a sequence $E$ of examples $e_{1}\dots e_{n}$, with no input features. Suppose you regularize to some default value $m$, and so penalize the difference from $m$. The regularizers in Section 7.4.2 regularize to 0.

Consider the following programs that do stochastic gradient descent with $L_{2}$ regularization in different ways. Each takes in the data set $E$, the value for $m$, the learning rate $\eta$ and the regularization parameter $\lambda$.

procedure $Learn_{0}$($E,m,\eta,\lambda$)
    $p\;{:}{=}\;\mbox{}m$
    repeat
        for each $e_{i}\in E$ do
            $p\;{:}{=}\;\mbox{}p-\eta*(p-e_{i})$
        $p\;{:}{=}\;\mbox{}p-\eta*\lambda*(p-m)$
    until termination
    return $p$

procedure $Learn_{1}$($E,m,\eta,\lambda$)
    $p\;{:}{=}\;\mbox{}m$
    repeat
        for each $e_{i}\in E$ do
            $p\;{:}{=}\;\mbox{}p-\eta*(p-e_{i})$
            $p\;{:}{=}\;\mbox{}p-\eta*\lambda*(p-m)$
    until termination
    return $p$

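The programs differ as to whether the regularization happens for each element of the data set or for the whole data set at each iteration: $Learn_{0}$ regularizes once per pass through the data set, whereas $Learn_{1}$ regularizes once per example.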

Program $Learn_{0}$ minimizes $\bigg{(}\sum_{i}(p-e_{i})^{2}\bigg{)}+\lambda(p-m)^{2}$ which is minimal when $p=\frac{m\lambda+\sum_{i}e_{i}}{\lambda+n}.$ Program $Learn_{1}$ minimizes $\sum_{i}\bigg{(}(p-e_{i})^{2}+\lambda(p-m)^{2}\bigg{)}$ which is minimal when $p=\frac{\lambda}{1+\lambda}m+\frac{1}{1+\lambda}\frac{\sum_{i}e_{i}}{n}.$
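A Python sketch of the two programs lets these closed forms be checked numerically. The fixed number of passes (`passes`) stands in for "until termination" and is an assumption, as are the function names and the data. Because the step size is constant, the iterates settle only approximately at the minimizers, with a gap that shrinks as $\eta$ does.

```python
def learn0(examples, m, eta, lam, passes=5000):
    """Regularize once per pass through the data set."""
    p = m
    for _ in range(passes):
        for e in examples:
            p -= eta * (p - e)
        p -= eta * lam * (p - m)
    return p


def learn1(examples, m, eta, lam, passes=5000):
    """Regularize once per example."""
    p = m
    for _ in range(passes):
        for e in examples:
            p -= eta * (p - e)
            p -= eta * lam * (p - m)
    return p


def closed_form0(examples, m, lam):
    return (m * lam + sum(examples)) / (lam + len(examples))


def closed_form1(examples, m, lam):
    mean = sum(examples) / len(examples)
    return lam / (1 + lam) * m + mean / (1 + lam)


data = [1.0, 1.0, 1.0, 0.0]  # hypothetical examples
print(learn0(data, 0.5, 0.001, 2.0), closed_form0(data, 0.5, 2.0))
print(learn1(data, 0.5, 0.001, 2.0), closed_form1(data, 0.5, 2.0))
```

For this data the closed forms are 2/3 for $Learn_{0}$ and 7/12 for $Learn_{1}$, and each learned value lands close to its closed form.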

Program $Learn_{0}$ is equivalent to having a pseudocount with $\lambda$ extra examples, each with value $m$.

Program $Learn_{1}$ is equivalent to a probabilistic mixture of $m$ and the average of the data.

For a fixed number of examples, $n$, these can be mapped into each other; $\lambda$ for $Learn_{1}$ is $\lambda$ for $Learn_{0}$ divided by $n$. They act differently when the number of examples varies, for example, in cross validation, when using a single $\lambda$ for multiple data sets, or in more complicated cases such as collaborative filtering.

For a fixed $\lambda$, with $n$ varying, they are qualitatively different. In $Learn_{0}$, as the number of examples increases the regularization gets less and less important. In $Learn_{1}$, $m$ has the same effect on the prediction, no matter what $n$ is. Using the strategy of $Learn_{0}$ is appropriate if the examples are independent of each other, where it is appropriate that enough examples will dominate any prior model. The strategy of $Learn_{1}$ may be appropriate if there is some chance the whole data set is misleading.
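The contrast can be tabulated from the two minimizers directly. In this sketch every example has value 1, $m=0.5$ and $\lambda=1$; these values and the function names are assumptions chosen for illustration.

```python
def estimate0(total, n, m, lam):
    """Minimizer for Learn_0: pseudocount of lam examples with value m."""
    return (m * lam + total) / (lam + n)


def estimate1(total, n, m, lam):
    """Minimizer for Learn_1: mixture of m and the data average."""
    return lam / (1 + lam) * m + (total / n) / (1 + lam)


m, lam = 0.5, 1.0
for n in (1, 10, 100):
    total = float(n)  # n examples, each with value 1
    print(n, estimate0(total, n, m, lam), estimate1(total, n, m, lam))

# For fixed n, dividing lam by n maps Learn_0 onto Learn_1.
n = 10
print(estimate0(10.0, n, m, lam), estimate1(10.0, n, m, lam / n))
```

The $Learn_{0}$ estimate moves from 0.75 toward 1 as $n$ grows, while the $Learn_{1}$ estimate stays at 0.75 for every $n$. The last line prints the same value twice, up to rounding.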