# 7.3 Basic Models for Supervised Learning

Supervised learning methods take the input features, the target features, and the training data and return predictors, functions on input features that predict values for the target features. Learning methods are often characterized by how the predictors are represented. This section considers some basic methods from which other methods are built. Initially, assume there is a single target feature and that the learner returns a predictor for this target.

## 7.3.1 Learning Decision Trees

A decision tree is a simple representation for classifying examples. Decision tree learning is one of the simplest useful techniques for supervised classification learning.

A decision tree is a tree in which

• each internal (non-leaf) node is labeled with a condition, a Boolean function of examples

• each internal node has two branches, one labeled $true$ and the other $false$

• each leaf of the tree is labeled with a point estimate.

Decision trees are also called classification trees when the target (leaf) is a classification, and regression trees when the target is real-valued.

To classify an example, filter it down the tree, as follows. Each condition encountered in the tree is evaluated and the arc corresponding to the result is followed. When a leaf is reached, the classification corresponding to that leaf is returned. A decision tree corresponds to a nested if–then–else structure in a programming language.

###### Example 7.8.

Figure 7.8 shows two possible decision trees for the examples of Figure 7.1. Each decision tree can be used to classify examples according to the user’s action. To classify a new example using the tree on the left, first determine the length. If it is long, predict $skips$. Otherwise, check the thread. If the thread is new, predict $reads$. Otherwise, check the author and predict $reads$ only if the author is known. This decision tree can correctly classify all examples in Figure 7.1.

The left tree corresponds to the program defining $\widehat{UserAction}({e})$:

 define $\widehat{UserAction}({e})$: if $long(e)$: return $skips$ else if $new(e)$: return $reads$ else if $unknown(e)$: return $skips$ else: return $reads$

The tree on the right returns a numerical prediction for $reads$:

 define $\widehat{UserAction}({e})$: if $long(e)$: return 0 else: return 0.82
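The two programs above translate almost directly into code. The following is a minimal Python sketch, assuming each example is a dictionary of Boolean values with the hypothetical keys `long`, `new`, and `unknown`; this representation is an assumption made for illustration, not one fixed by the text.

```python
def user_action_left(e):
    """Left tree of Figure 7.8: returns a class label."""
    if e["long"]:
        return "skips"
    elif e["new"]:
        return "reads"
    elif e["unknown"]:
        return "skips"
    else:
        return "reads"


def user_action_right(e):
    """Right tree of Figure 7.8: returns a numerical prediction for reads."""
    if e["long"]:
        return 0.0
    else:
        return 0.82


# A short follow-up article by an unknown author.
example = {"long": False, "new": False, "unknown": True}
print(user_action_left(example))   # skips
print(user_action_right(example))  # 0.82
```

Each path from the root to a leaf corresponds to one branch of the nested conditional, so the depth of the tree is the maximum number of conditions evaluated for any example.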

To use decision trees as a target representation, there are a number of questions that arise.

• Given some training examples, what decision tree should be generated? Because a decision tree can represent any function of discrete input features, the bias that is necessary is incorporated into the preference of one decision tree over another. One proposal is to prefer the smallest tree that is consistent with the data, which could mean the tree with the least depth or the tree with the fewest nodes. Which decision trees are the best predictors of unseen data is an empirical question.

• How should an agent go about building a decision tree? One way is to search the space of decision trees for the smallest decision tree that fits the data. Unfortunately, the space of decision trees is enormous (see Exercise 7.7). A practical solution is to carry out a greedy search on the space of decision trees, with the goal of minimizing the error. This is the idea behind the algorithm described below.

### Searching for a Good Decision Tree

A decision tree can be seen as a branching program that takes an example and returns a prediction for that example. The program is either

• a function that ignores its argument and returns a point prediction for all of the examples that reach this point (which corresponds to a leaf), or

• of the form “if $c(e)$ then $t_{1}(e)$ else $t_{0}(e)$” where $c$ is a Boolean condition, and $t_{1}$ and $t_{0}$ are decision trees; $t_{1}$ is the tree that is used when the condition $c$ is true of example $e$, and $t_{0}$ is the tree used when $c(e)$ is false.

The algorithm $Decision\_tree\_learner$ of Figure 7.9 mirrors the recursive decomposition of a tree. It builds a decision tree from the top down as follows. The input to the algorithm is a set of input conditions (Boolean functions of examples that use only input features), the target feature, a set of training examples, and a real-valued parameter, $\gamma$, discussed below. If the input features are Boolean, they can be used directly as the conditions.

A greedy optimal split is a condition that results in the lowest error if the learner were allowed only one split and it splits on that condition. $sum\_loss(Es)$ gives the sum of losses of training examples $Es$ for the loss function assumed, given that the optimal prediction is used for that loss function, as given in Figure 7.5. The procedure $select\_split$ returns a greedy optimal split if the sum of losses after the split improves the sum of losses before the split by at least the threshold $\gamma$. If there is no such condition, $select\_split$ returns $None$.

Adding a split increases the size of the tree by 1. The threshold $\gamma$ can be seen as a penalty for increasing the size of the tree by 1. If positive, $\gamma$ is also useful to prevent $val<best\_val$ holding solely due to rounding error.

If $select\_split$ returns $None$, the decision tree learning algorithm creates a leaf. The function $leaf\_value(Es)$ returns the value that is used as a prediction of the target for examples $Es$ that reach this node. It ignores the input features for these examples and returns a point estimate, which is the case considered in Section 7.2.2. The decision tree algorithm returns a function that takes an example and returns that point estimate.

If $select\_split$ returns condition $c$, the learner splits on condition $c$ by partitioning the training examples into those examples $e$ with $c(e)$ true and those examples with $c(e)$ false. It recursively builds a subtree for each of these bags of examples. It returns a function that, given an example, tests whether $c$ is true of the example, and then uses the prediction from the appropriate subtree.
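The recursive procedure described above can be sketched in Python. This is an illustrative sketch rather than the pseudocode of Figure 7.9: it assumes a 0/1 target, log loss with base 2, and the empirical probability as the leaf value. The names `leaf_value`, `sum_loss`, and `select_split` follow the text; representing examples as dictionaries and conditions as Python functions is an assumption.

```python
import math
from itertools import product


def leaf_value(ys):
    """Point estimate for a bag of 0/1 targets: the empirical probability of 1."""
    return sum(ys) / len(ys) if ys else 0.5


def sum_loss(ys):
    """Sum of log losses (base 2) when predicting the empirical probability."""
    p = leaf_value(ys)
    total = 0.0
    for y in ys:
        q = p if y == 1 else 1 - p   # probability assigned to the observed value
        total -= math.log2(q)        # q > 0 because the observed value occurs in ys
    return total


def select_split(examples, conditions, target, gamma):
    """Return a greedy optimal condition, or None if no split improves by gamma."""
    best_val = sum_loss([target(e) for e in examples]) - gamma
    best_split = None
    for c in conditions:
        true_ys = [target(e) for e in examples if c(e)]
        false_ys = [target(e) for e in examples if not c(e)]
        val = sum_loss(true_ys) + sum_loss(false_ys)
        if val < best_val:
            best_val, best_split = val, c
    return best_split


def decision_tree_learner(conditions, target, examples, gamma=0.0):
    """Build a tree top-down; return it as a function from example to prediction."""
    c = select_split(examples, conditions, target, gamma)
    if c is None:
        v = leaf_value([target(e) for e in examples])
        return lambda e: v
    rest = [d for d in conditions if d is not c]
    t1 = decision_tree_learner(rest, target, [e for e in examples if c(e)], gamma)
    t0 = decision_tree_learner(rest, target, [e for e in examples if not c(e)], gamma)
    return lambda e: t1(e) if c(e) else t0(e)


# Usage on a small made-up dataset: target is "if x then y else z".
data = [{"x": x, "y": y, "z": z, "t": (y if x else z)}
        for x, y, z in product([0, 1], repeat=3)]
conditions = [lambda e: e["x"] == 1, lambda e: e["y"] == 1, lambda e: e["z"] == 1]
tree = decision_tree_learner(conditions, lambda e: e["t"], data)
print([tree(e) for e in data])
```

Returning closures keeps the learned tree in the "function on examples" form used in the text; a practical implementation would more likely return an explicit tree structure that can be inspected and pruned.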

###### Example 7.9.

Consider applying $Decision\_tree\_learner$ to the classification data of Figure 7.1, with $\gamma=0$. The initial call is

 $decisionTreeLearner(\{known,new,long,home\},User\_action,$ $\{e_{1},e_{2},\ldots,e_{18}\},0)$

where $known$ is true when $Author{\,{=}\,}known$, and similarly for the other conditions.

Suppose the stopping criterion is not true and the algorithm selects the condition $long$ to split on. It then calls

 $decisionTreeLearner(\{known,new,home\},User\_action,$ $\{e_{1},e_{3},e_{4},e_{6},e_{9},e_{10},e_{12}\},0)$

where $\{e_{1},e_{3},e_{4},e_{6},e_{9},e_{10},e_{12}\}$ is the set of training examples with $Length{\,{=}\,}long$.

All of these examples agree on the user action; therefore, the algorithm returns the prediction $skips$. The second step of the recursive call is

 $decisionTreeLearner(\{known,new,home\},User\_action,$ $\{e_{2},e_{5},e_{7},e_{8},e_{11},e_{13},e_{14},e_{15},e_{16},e_{17},e_{18}\},0).$

Not all of the examples agree on the user action, so assuming the stopping criterion is false, the algorithm selects a condition to split on. Suppose it selects $new$. Eventually, this recursive call returns the function on example $e$ in the case when $Length$ is $short$:

 $\displaystyle\mbox{if }new(e)$ $\displaystyle\mbox{ then }reads$ $\displaystyle\mbox{ else if }unknown(e)\mbox{ then }skips\mbox{ else }reads.$

The final result is the first predictor of Example 7.8.

When the loss is log loss with base 2, the mean of the losses, $sum\_loss(Es)/|Es|$, is the entropy of the empirical distribution of $Es$. The number of bits to describe $Es$ after testing the condition $c$ is $val$, defined on line 25 of Figure 7.9. The entropy of the distribution created by the split is $val/|Es|$. The difference of these two is the information gain of the split. Sometimes information gain is used even when the optimality criterion is some other error measure; for example, when maximizing accuracy it is possible to select a split to optimize log loss, but return the mode as the leaf value. See Exercise 7.6.

The following example shows details of the split choice for the case where the split is chosen using log loss, and the empirical distribution is used as the leaf value.

###### Example 7.10.

In the running example of learning the user action from the data of Figure 7.1, suppose the aim is to minimize the log loss. The algorithm greedily chooses a split that minimizes the log loss. Suppose $\gamma$ is 0.

Without any splits, the optimal prediction on the training set is the empirical frequency. There are nine examples with $User\_action{\,{=}\,}reads$ and nine examples with $User\_action{\,{=}\,}skips$, and so $reads$ is predicted with probability 0.5. The mean log loss is equal to $(-18*\log_{2}0.5)/18=1$.

Consider splitting on $Author$. This partitions the examples into $[e_{1}$, $e_{4}$, $e_{5}$, $e_{6}$, $e_{9}$, $e_{10}$, $e_{12}$, $e_{13}$, $e_{14}$, $e_{15}$, $e_{16}$, $e_{17}]$ with $Author{\,{=}\,}known$ and $[e_{2}$, $e_{3}$, $e_{7}$, $e_{8}$, $e_{11}$, $e_{18}]$ with $Author{\,{=}\,}unknown$, each of which is evenly split between the different user actions. The optimal prediction for each partition is again 0.5, and so the log loss after the split is again 1. In this case, finding out whether the author is known, by itself, provides no information about what the user action will be.

Splitting on $Thread$ partitions the examples into $[e_{1}$, $e_{2}$, $e_{5}$, $e_{8}$, $e_{10}$, $e_{12}$, $e_{14}$, $e_{15}$, $e_{17}$, $e_{18}]$ with $Thread{\,{=}\,}new$ and $[e_{3}$, $e_{4}$, $e_{6}$, $e_{7}$, $e_{9}$, $e_{11}$, $e_{13}$, $e_{16}]$ with $Thread{\,{=}\,}followup$. The examples with $Thread{\,{=}\,}new$ contain three examples with $User\_action{\,{=}\,}skips$ and seven examples with $User\_action{\,{=}\,}reads$; thus the optimal prediction for these is to predict $reads$ with probability $7/10$. The examples with $Thread{\,{=}\,}followup$ have two $reads$ and six $skips$. Thus, the best prediction for these is to predict $reads$ with probability $2/8$. The mean log loss after the split is

 $\displaystyle-(3*\log_{2}(3/10)+7*\log_{2}(7/10)+2*\log_{2}(2/8)+6*\log_{2}(6/8))/18$ $\displaystyle\approx 15.3/18\approx 0.85.$

Splitting on $Length$ divides the examples into $[e_{1}$, $e_{3}$, $e_{4}$, $e_{6}$, $e_{9}$, $e_{10}$, $e_{12}]$ and $[e_{2}$, $e_{5}$, $e_{7}$, $e_{8}$, $e_{11}$, $e_{13}$, $e_{14}$, $e_{15}$, $e_{16}$, $e_{17}$, $e_{18}]$. The former all agree on the value of $User\_action$ and predict with probability 1. The user action divides the second set $9:2$, and so the mean log loss is

 $-(7*\log_{2}1+9*\log_{2}(9/11)+2*\log_{2}(2/11))/18\approx 7.5/18\approx 0.417.$

Therefore, splitting on $Length$ is better than splitting on $Thread$ or $Author$, when greedily optimizing the log loss.
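The arithmetic in this example can be checked with a few lines of Python. The sketch below takes the counts of $reads$ and $skips$ in each partition from the example; the function name `mean_log_loss` and the representation of a split as a list of count pairs are assumptions made for illustration.

```python
import math


def mean_log_loss(partitions, total):
    """Mean log loss (base 2) after a split.

    Each partition is a (reads, skips) pair of counts; within a partition the
    prediction is the empirical frequency of each value.
    """
    loss = 0.0
    for reads, skips in partitions:
        n = reads + skips
        for count in (reads, skips):
            if count > 0:                      # a value that never occurs adds no loss
                loss -= count * math.log2(count / n)
    return loss / total


no_split = mean_log_loss([(9, 9)], 18)
author = mean_log_loss([(6, 6), (3, 3)], 18)   # known, unknown
thread = mean_log_loss([(7, 3), (2, 6)], 18)   # new, followup
length = mean_log_loss([(0, 7), (9, 2)], 18)   # long, short
print(round(no_split, 2), round(author, 2), round(thread, 2), round(length, 2))
```

To two decimal places the results are 1, 1, 0.85, and 0.42, so the information gain is 0 for $Author$, about 0.15 for $Thread$, and about 0.58 for $Length$.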

### Constructing Conditions

In the decision tree learning algorithm (Figure 7.9), Boolean input features can be used directly as the conditions. Non-Boolean input features are handled in a number of ways.

• Suppose input variable $X$ is categorical, with domain $\{v_{1},\dots,v_{k}\}$. A binary indicator variable, $X_{i}$, can be associated with each value $v_{i}$, where $X_{i}(e)=1$ if $X(e)=v_{i}$ and $X_{i}(e)=0$ otherwise. For each example $e$, exactly one of $X_{1}(e),\dots,X_{k}(e)$ is 1 and the others are 0.

• When the domain of a feature is totally ordered, the feature is called an ordinal feature. This includes real-valued features as a special case, but might also include other features such as clothing sizes (S, M, L, XL, etc.), and highest level of education (none, primary, secondary, bachelor, etc.).

For an ordinal input feature $X$ and for a given value $v$, a Boolean feature can be constructed as a cut: a new feature that has value 1 when $X>v$ and 0 otherwise. Combining cuts allows for features that are true for intervals; for example, a branch might include the conditions $X>9$ is true and $X>17$ is false, which corresponds to the interval $9<X\leq 17$.

Suppose the domain of input variable $X$ is totally ordered. To select the optimal value for the cut value $v$, sort the examples on the value of $X$ and sweep through the examples to consider each split value and select the best. See Exercise 7.8.

• For ordinal features (including real-valued features), binning involves choosing a set of thresholds, creating a feature for each interval between the thresholds. The thresholds $\alpha_{1}<\alpha_{2}<\dots<\alpha_{k}$ make $k+1$ Boolean features: one that is true for $X$ when $X\leq\alpha_{1}$, one for $\alpha_{k}<X$, and one for $\alpha_{i}<X\leq\alpha_{i+1}$ for each $1\leq i<k$. A bin of the form $\alpha_{i}<X\leq\alpha_{i+1}$ would require two splits to represent using cuts. The $\alpha_{i}$ can be chosen upfront, for example, using percentiles of the training data, or chosen depending on the target.

• For categorical feature $X$, there might be a better split of the form $X\in S$ where $S$ is a set of values, rather than only splitting on a single value, as is done with indicator variables. When the target $Y$ is Boolean, to find an appropriate set $S$, sort the values of $X$ by the proportion of $Y$ that are true; a greedy optimal split will be between values in this sorted list.

• It is possible to expand the algorithm to allow multiway splits. To split on a multivalued variable, there would be a child for each value in the domain of the variable. This means that the representation of the decision tree becomes more complicated than the simple if–then–else form used for binary features. There are two main problems with this approach. The first is what to do with values of a feature for which there are no training examples. The second is that for most greedy splitting heuristics, including information gain, it is generally better to split on a variable with a larger domain because it produces more children and so can fit the data better than splitting on a feature with a smaller domain. However, splitting on a feature with a smaller domain keeps the representation more compact. A four-way split, for example, is equivalent to three binary splits; they both result in four leaves.
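The sort-and-sweep search for a cut on an ordinal feature, mentioned in the list above, can be sketched as follows. The sketch assumes a numeric feature, squared loss with the mean as the leaf value, and a cut placed midway between adjacent observed values; the midpoint convention and the name `best_cut` are assumptions, not something the text specifies.

```python
def best_cut(xs, ys):
    """Return (v, loss) for the cut X > v that minimizes the summed squared error.

    Examples are sorted on X once, then swept left to right while running
    sums are maintained, so each candidate cut is evaluated in constant time.
    Returns v = None if no cut improves on making no split.
    """
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    total, total_sq = sum(ys), sum(y * y for y in ys)

    def sse(s, sq, k):
        # Squared error of k values with sum s and sum of squares sq,
        # when predicting their mean.
        return sq - s * s / k if k else 0.0

    best_v, best_loss = None, sse(total, total_sq, n)
    left, left_sq = 0.0, 0.0
    for i in range(n - 1):
        x, y = pairs[i]
        left += y
        left_sq += y * y
        if x == pairs[i + 1][0]:
            continue  # a cut cannot separate examples with equal X
        loss = (sse(left, left_sq, i + 1)
                + sse(total - left, total_sq - left_sq, n - i - 1))
        if loss < best_loss:
            best_v, best_loss = (x + pairs[i + 1][0]) / 2, loss
    return best_v, best_loss


xs = [1, 2, 3, 10, 11, 12]
ys = [0, 0, 0, 1, 1, 1]
v, loss = best_cut(xs, ys)
print(v, loss)   # 6.5 0.0
```

Sorting dominates the cost, so finding the best cut for one feature takes time proportional to $n \log n$ for $n$ examples, rather than the $n^{2}$ needed if each candidate cut were evaluated from scratch.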

### Alternative Design Choices

The algorithm does not split when $select\_split$ returns $None$. This occurs when there are no examples, when there are no conditions remaining, when all examples have the same value on each condition, when all of the examples have the same target value, and when the improvement of the evaluation is less than the parameter $\gamma$. A number of other criteria have been suggested for stopping earlier.

• Minimum child size: do not split more if one of the children will have fewer examples than a threshold.

• Maximum depth: do not split more if the depth reaches a maximum.

It is possible that one condition may only work well in conjunction with other conditions, and the greedy method may not work when this occurs. One particularly tricky case is a parity function of $k$ Boolean variables that is true if an odd (or even) number of variables are true; knowing the values of fewer than $k$ of the variables gives no information about the value of the parity function. The simplest parity functions (for $k=2$) are exclusive-or and equivalence. Parity functions have complicated decision trees.

In some cases, greedy splitting does not find a simplest decision tree and it is often useful to simplify the tree resulting from the top-down algorithm, as shown in the following example.

###### Example 7.11.

Consider a dataset with inputs $x$, $y$, and $z$ and target $t$. The target is true if $x$ is true and $y$ is true, or $x$ is false and $z$ is true. Figure 7.10 (a) shows a tree representation of this function. This tree can generate the data in the center (b).

Although the simplest tree first splits on $x$, splitting on $x$ provides no information; there is the same proportion of $t$ true when $x$ is true as when $x$ is false. Instead, the algorithm can split on $y$. When $y$ is true, there is a larger proportion of $t$ true than when $y$ is false. For the case where $y$ is true, splitting on $x$ perfectly predicts the target when $x$ is true. The resulting tree is given in Figure 7.10(c). Following the paths to $t{\,{=}\,}1$, this tree corresponds to $t$ being true when $(x\land y)\vee(y\land\neg x\land z)\vee(\neg y\land\neg x\land z)$, which can be simplified to $(x\land y)\vee(\neg x\land z)$. This is essentially the original tree.
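The claims in this example can be checked by enumeration. The sketch below assumes the data consists of each of the eight combinations of $x$, $y$, and $z$ exactly once, which is an assumption about Figure 7.10(b); the helper name `proportion_true` is hypothetical.

```python
from itertools import product


def t(x, y, z):
    """Target of Example 7.11: y if x is true, otherwise z."""
    return y if x else z


rows = [(x, y, z, t(x, y, z)) for x, y, z in product([0, 1], repeat=3)]


def proportion_true(index, value):
    """Proportion of rows with t = 1 among rows whose input at `index` equals `value`."""
    selected = [r for r in rows if r[index] == value]
    return sum(r[3] for r in selected) / len(selected)


for name, index in [("x", 0), ("y", 1), ("z", 2)]:
    print(name, proportion_true(index, 1), proportion_true(index, 0))

# The formula read off the learned tree agrees with the target on every row.
agrees = all(
    bool((x and y) or (y and not x and z) or (not y and not x and z)) == bool(target)
    for x, y, z, target in rows
)
print(agrees)   # True
```

Under this assumption, the proportion of $t$ true is 0.5 on both sides of a split on $x$, but 0.75 and 0.25 for a split on $y$ (and likewise for $z$), which is why a greedy learner prefers $y$ or $z$ at the root.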

## 7.3.2 Linear Regression and Classification

Linear functions provide a basis for many learning algorithms. This section first covers regression, then considers classification.

Linear regression is the problem of fitting a linear function to a set of training examples, in which the input and target features are real numbers.

Suppose the input features, $X_{1},\dots,X_{m}$, are all real numbers (which includes the $\{0,1\}$ case) and there is a single target feature $Y$. A linear function of the input features is a function of the form

 $\displaystyle\widehat{Y}^{\overline{w}}({e})$ $\displaystyle=w_{0}+w_{1}*{X_{1}}({e})+\dots+w_{m}*{X_{m}}({e})$ $\displaystyle=\sum_{i=0}^{m}w_{i}*{X_{i}}({e})$

where $\overline{w}=\left\langle w_{0},w_{1},\dots,w_{m}\right\rangle$ is a vector (tuple) of weights, and $X_{0}$ is a special feature whose value is always 1.

Suppose $E$ is a set of examples. The mean squared loss on examples $E$ for target $Y$ is the error

 $\displaystyle error(E,\overline{w})$ $\displaystyle=\frac{1}{\left|E\right|}\sum_{e\in E}(\widehat{Y}^{\overline{w}}({e})-{Y}({e}))^{2}$ $\displaystyle=\frac{1}{\left|E\right|}\sum_{e\in E}\left(\sum_{i=0}^{m}w_{i}*{X_{i}}({e})-{Y}({e})\right)^{2}.$ (7.1)

Consider minimizing the mean squared loss. There is a unique minimum, which occurs when the partial derivatives with respect to the weights are all zero. The partial derivative of the error in Equation 7.1 with respect to weight $w_{i}$ is

 $\frac{\partial}{\partial w_{i}}error(E,\overline{w})=\frac{1}{\left|E\right|}\sum_{e\in E}2*\delta(e)*{X_{i}}({e})$ (7.2)

where $\delta(e)=\widehat{Y}^{\overline{w}}({e})-{Y}({e})$, a linear function of the weights. The weights that minimize the error can be computed analytically by setting the partial derivatives to zero and solving the resulting linear equations in the weights (see Exercise 7.11).
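For the special case of a single input feature ($m=1$), setting the two partial derivatives to zero gives a well-known closed-form solution. The following sketch, with the hypothetical names `fit_line` and `gradient` and a small made-up dataset, computes that solution and confirms that Equation 7.2 evaluates to zero at the result.

```python
def fit_line(xs, ys):
    """Least-squares weights (w0, w1) for one input feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w1 = sxy / sxx            # assumes the xs are not all equal
    w0 = mean_y - w1 * mean_x
    return w0, w1


def gradient(w0, w1, xs, ys):
    """Partial derivatives of the mean squared loss (Equation 7.2)."""
    n = len(xs)
    deltas = [w0 + w1 * x - y for x, y in zip(xs, ys)]
    d0 = sum(2 * d for d in deltas) / n                   # X_0(e) = 1
    d1 = sum(2 * d * x for d, x in zip(deltas, xs)) / n
    return d0, d1


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 4.0, 8.0]
w0, w1 = fit_line(xs, ys)
print(w0, w1, gradient(w0, w1, xs, ys))
```

On this data the fitted weights are approximately $w_{0}=0.7$ and $w_{1}=2.2$, and both partial derivatives are zero up to rounding error. With more input features, the same idea leads to a system of $m+1$ linear equations.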

### Squashed Linear Functions

Consider binary classification, where the domain of the target variable is $\{0,1\}$.

A linear function does not work well for such classification tasks; a learner should never make a prediction greater than 1 or less than 0. However, a linear function could make a prediction of, say, 3 for one example just to fit other examples better.

A squashed linear function is of the form

 $\displaystyle\widehat{Y}^{\overline{w}}({e})$ $\displaystyle=\phi(w_{0}+w_{1}*{X_{1}}({e})+\dots+w_{m}*{X_{m}}({e}))$ $\displaystyle=\phi(\sum_{i}w_{i}*X_{i}(e))$

where $\phi$, an activation function, is a function from the real line $(-\infty,\infty)$ into some subset of the real line, such as $[0,1]$.

A prediction based on a squashed linear function is a linear classifier.

One differentiable activation function is the sigmoid or logistic function:

 $sigmoid(x)=\frac{1}{1+exp({-x})}$

where $exp(v)=e^{v}$ and $e$ is Euler’s number (approximately 2.718). The sigmoid function, depicted in Figure 7.11, squashes the real line into the interval $(0,1)$, which is appropriate for classification because you would never want to make a prediction greater than 1 or less than 0. The sigmoid function can be justified in terms of probabilities. It is also differentiable, with derivative

 $\frac{d}{dx}sigmoid(x)=sigmoid(x)*(1-sigmoid(x)).$
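The derivative formula can be checked numerically. The sketch below compares it with a central finite difference at a few arbitrarily chosen points; the step size `h` is an assumption chosen for illustration.

```python
import math


def sigmoid(x):
    return 1 / (1 + math.exp(-x))


def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)


h = 1e-6
for x in (-2.0, 0.0, 3.0):
    numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)   # central difference
    print(x, sigmoid_derivative(x), numeric)
```

The derivative is largest at $x=0$, where it equals 0.25, and approaches 0 as $x$ moves away from 0 in either direction.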

The problem of determining weights for the sigmoid of a linear function that minimize an error on a set of examples is called logistic regression.

The mean log loss for logistic regression is

 $\displaystyle LL(E,\overline{w})=-\frac{1}{\left|E\right|}*\sum_{e\in E}\left({Y}({e})*\log\widehat{Y}({e})+(1-{Y}({e}))*\log(1-\widehat{Y}({e}))\right)$

where $\widehat{Y}({e})=sigmoid\left(\sum_{i=0}^{m}w_{i}*{X_{i}}({e})\right)$. To minimize this, consider weight $w_{i}$. The partial derivative with respect to weight $w_{i}$ is

 $\frac{\partial}{\partial w_{i}}LL(E,\overline{w})=\frac{1}{\left|E\right|}\sum_{e\in E}\delta(e)*X_{i}(e)$ (7.3)

where $\delta(e)=\widehat{Y}^{\overline{w}}({e})-{Y}({e})$. This is very similar to Equation 7.2; the main difference is the definition of the predicted value. Unlike Equation 7.2, this is not a linear function of the parameters (because $\widehat{Y}^{\overline{w}}({e})$ is not linear in the parameters) and is difficult to solve analytically.
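Equation 7.3 follows from the chain rule and the derivative of the sigmoid. As a sketch for a single example $e$, assume $\log$ is the natural logarithm (another base only rescales the loss by a constant) and write $\widehat{Y}$ for $\widehat{Y}({e})$ and $Y$ for ${Y}({e})$. Because $\frac{\partial}{\partial w_{i}}\widehat{Y}=\widehat{Y}*(1-\widehat{Y})*X_{i}(e)$,

 $\displaystyle\frac{\partial}{\partial w_{i}}\left(-Y*\log\widehat{Y}-(1-Y)*\log(1-\widehat{Y})\right)$ $\displaystyle=\left(-\frac{Y}{\widehat{Y}}+\frac{1-Y}{1-\widehat{Y}}\right)*\widehat{Y}*(1-\widehat{Y})*X_{i}(e)$ $\displaystyle=\frac{\widehat{Y}-Y}{\widehat{Y}*(1-\widehat{Y})}*\widehat{Y}*(1-\widehat{Y})*X_{i}(e)$ $\displaystyle=(\widehat{Y}-Y)*X_{i}(e).$

Averaging this over the examples in $E$ gives Equation 7.3.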

The problem of finding a set of parameters to minimize errors is an optimization problem; see Section 4.8.

Gradient descent is an iterative method to find a local minimum of a function. To find a set of weights to minimize an error, it starts with an initial set of weights. In each step, it decreases each weight in proportion to its partial derivative:

 $w_{i}\;{:}{=}\;w_{i}-\eta*\frac{\partial}{\partial w_{i}}error(E,\overline{w})$

where $\eta$, the gradient descent step size, is called the learning rate. The learning rate, as well as the features and the data, is given as input to the learning algorithm. The partial derivative specifies how much a small change in the weight would change the error.

For linear regression with squared error and logistic regression with log loss, the derivatives are given in Equation 7.2 and Equation 7.3. For each of these (ignoring the constant factor of 2), gradient descent has the update

 $w_{i}\;{:}{=}\;w_{i}-\eta*\frac{1}{\left|E\right|}*\sum_{e\in E}\delta(e)*X_{i}(e)$ (7.4)

where $\delta(e)=\widehat{Y}^{\overline{w}}({e})-{Y}({e})$.

A direct implementation of gradient descent does not update any weights until all examples have been considered. This can be very wasteful for large datasets. It is possible to make progress with a subset of the data. This gradient descent step takes a mean value. Often you can compute means approximately by using a random sample of examples. For example, you can get a good estimate of the mean height of a large population of people by selecting 100 or 1000 people at random and using their mean height.

Instead of using all of the data for an update, stochastic gradient descent uses a random sample of examples to update the weights. It is called stochastic because of the random sampling. Random sampling is explored more in Section 9.7. The set of $b$ examples used in each update is called a minibatch or a batch.

The stochastic gradient descent algorithm for logistic regression is shown in Figure 7.12. This returns a function, $pred$, that can be used for predictions on new examples. The algorithm collects the update for each weight $w_{i}$ for a batch in a corresponding $d_{i}$, and updates the weights after each batch. The learning rate $\eta$ is assumed to be per example, and so the update needs to be divided by the batch size.

An epoch is $\lceil{|Es|/b}\rceil$ batches, which corresponds to one pass through all of the data, on average. Epochs are useful when reporting results, particularly with different batch sizes.
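The algorithm described above can be sketched in Python as follows. This is an illustrative sketch, not a transcription of Figure 7.12: the small random initial weights, the fixed number of steps, sampling each batch without replacement, and the parameter names `batch_size`, `steps`, and `seed` are all assumptions.

```python
import math
import random


def sigmoid(x):
    return 1 / (1 + math.exp(-x))


def logistic_regression_learner(xs, ys, eta=0.05, batch_size=2, steps=3000, seed=0):
    """Learn weights by stochastic gradient descent on log loss.

    xs is a list of feature tuples and ys a list of 0/1 targets.
    Returns (pred, w), where w[0] is the weight of the constant feature X_0 = 1.
    """
    rng = random.Random(seed)
    m = len(xs[0])
    w = [rng.uniform(-0.1, 0.1) for _ in range(m + 1)]   # small random start

    def pred(x):
        return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))

    for _ in range(steps):
        d = [0.0] * (m + 1)                               # accumulated update per weight
        for k in rng.sample(range(len(xs)), batch_size):  # random minibatch
            delta = pred(xs[k]) - ys[k]                   # predicted minus actual
            d[0] += delta
            for i, xi in enumerate(xs[k], start=1):
                d[i] += delta * xi
        for i in range(m + 1):
            w[i] -= eta * d[i] / batch_size               # Equation 7.4 on the batch
    return pred, w


# Usage: learn the Boolean "or" function, which is linearly separable.
xs = [(0, 0), (0, 1), (1, 0), (1, 1)]
ys = [0, 1, 1, 1]
pred, w = logistic_regression_learner(xs, ys)
print([round(pred(x), 2) for x in xs])
```

With four examples and a batch size of 2, an epoch is two batches, so 3000 steps correspond to 1500 epochs. The weights are updated only after the whole batch has been accumulated in `d`, matching the description of $d_{i}$ above.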

###### Example 7.12.

Consider learning a squashed linear function for classifying the data of Figure 7.1. One function that correctly classifies the examples is

 $\widehat{Reads}({e})=sigmoid(-8+7*Short(e)+3*New(e)+3*Known(e))~{},$

where $sigmoid$ is the sigmoid function defined above. A function similar to this can be found with about 3000 iterations of stochastic gradient descent with a learning rate $\eta=0.05$. According to this function, $\widehat{Reads}({e})$ is true (the predicted value for example $e$ is closer to 1 than 0) if and only if $Short(e)$ is true and either $New(e)$ or $Known(e)$ is true. Thus, in this case, the linear classifier learns the same function as the decision tree learner.
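The claim about which examples are predicted to be read can be checked by enumerating the eight combinations of the three Boolean inputs. The function name `reads_hat` in this sketch is hypothetical; the weights are those given in the example.

```python
import math
from itertools import product


def reads_hat(short, new, known):
    """Squashed linear function of Example 7.12."""
    return 1 / (1 + math.exp(-(-8 + 7 * short + 3 * new + 3 * known)))


for short, new, known in product([0, 1], repeat=3):
    predicted = reads_hat(short, new, known) > 0.5
    expected = bool(short and (new or known))
    print(short, new, known, round(reads_hat(short, new, known), 3), predicted == expected)
```

The linear part is positive only when $Short(e)$ is true and at least one of $New(e)$ or $Known(e)$ is true: $-8+7+3=2$, whereas $-8+7=-1$ and $-8+3+3=-2$.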

Smaller batch sizes tend to learn faster as fewer examples are required for an update. However, smaller batches may not converge to a local optimum solution, whereas more data, up to all of the data, will. To see this, consider being at an optimum. A batch containing all of the examples would end up with all of the $d_{i}$ being zero. However, for smaller batches, the weights will vary and later batches will be using non-optimal parameter settings and so use incorrect derivatives. It is common to start with a small batch size and increase the batch size until convergence, or until good enough performance has been obtained.

Incremental gradient descent, or online gradient descent, is a special case of stochastic gradient descent using minibatches of size 1. In this case, there is no need to store the intermediate values in $d_{i}$, but the weights can be directly updated. This is sometimes used for streaming data where each example is used once and then discarded. If the examples are not selected at random, it can suffer from catastrophic forgetting, where it fits the later data and forgets about earlier examples.

### Linear Separability

Each input feature can be seen as a dimension; $m$ features result in an $m$-dimensional space. A hyperplane in an $m$-dimensional space is a set of points that all satisfy a constraint that some linear function of the variables is zero. The hyperplane forms an $(m-1)$-dimensional space. For example, in a (two-dimensional) plane, a hyperplane is a line, and in a three-dimensional space, a hyperplane is a plane. A Boolean classification is linearly separable if there exists a hyperplane where the classification is true on one side of the hyperplane and false on the other side.

The $Logistic\_regression\_learner$ algorithm can learn any linearly separable binary classification. The error can be made arbitrarily small for arbitrary sets of examples if, and only if, the target classification is linearly separable. The hyperplane is the set of points where $\sum_{i}w_{i}*X_{i}=0$ for the learned weights $\overline{w}$. On one side of this hyperplane, the prediction is greater than 0.5; on the other side, the prediction is less than 0.5.

###### Example 7.13.

Figure 7.13 shows linear separators for “or” (a) and “and” (b). The dashed line separates the positive (true) cases from the negative (false) cases. One simple function that is not linearly separable is the exclusive-or (xor) function (c). There is no straight line that separates the positive examples from the negative examples. As a result, a linear classifier cannot represent, and therefore cannot learn, the exclusive-or function.

Suppose there are three input features $x$, $y$, and $z$, each with domain $\{0,1\}$, and the ground truth is the function “if $x$ then $y$ else $z$” (represented by $t$ in Figure 7.10). This function is depicted by the cube in Figure 7.13(d) with the origin ($x$, $y$, $z$ all zero) at the bottom left and the ground truth for $t$ labelled with $+$ and $-$. This function is not linearly separable.

The following example shows what happens in gradient descent for logistic regression when the data is not linearly separable.

###### Example 7.14.

Consider target $t$ from the previous example that is true if $x$ is true and $y$ is true, or $x$ is false and $z$ is true. The prediction of $t$ is not linearly separable, as shown in Figure 7.13(d) – there is no hyperplane that separates the positive and negative cases of $t$.

After 1000 epochs of gradient descent with a learning rate of 0.05, one run found the following weights (to two decimal places):

 $\displaystyle lin(e)=\mbox{}$ $\displaystyle-0.12*x(e)+4.06*y(e)+4.06*z(e)-3.98$ $\displaystyle\widehat{t}({e})=\mbox{}$ $\displaystyle sigmoid(lin(e))~{}.$

The linear function $lin$ and the prediction for each example are shown in Figure 7.14(b). Four examples are predicted reasonably well, and the other four are predicted with a value of approximately $0.5$. This function is quite stable with different initializations. Increasing the number of iterations makes the predictions approach 0, 1, or 0.5.
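The predictions can be reproduced from the reported weights. This sketch evaluates $lin$ and $\widehat{t}$ on all eight combinations of the inputs; the function name `t_hat` is an assumption.

```python
import math
from itertools import product


def lin(x, y, z):
    """Linear function with the weights reported in Example 7.14."""
    return -0.12 * x + 4.06 * y + 4.06 * z - 3.98


def t_hat(x, y, z):
    return 1 / (1 + math.exp(-lin(x, y, z)))


for x, y, z in product([0, 1], repeat=3):
    t = y if x else z                     # ground truth: if x then y else z
    print(x, y, z, t, round(lin(x, y, z), 2), round(t_hat(x, y, z), 2))
```

The four cases where $y$ and $z$ agree have $lin$ close to $-4$ or $4$ and are predicted within about 0.02 of the target. The four cases where $y$ and $z$ disagree have $lin$ close to 0 and a prediction close to 0.5, because the weight on $x$ is too small to distinguish them.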

### Categorical Target Features

When the domain of the target variable is categorical with more than two values, indicator variables can be used to convert the classification to binary variables. These binary variables could be learned separately. Because exactly one of the values must be true for each example, the predicted probabilities should add to 1. One way to handle this is to learn for all-but-one value, and predict the remaining value as 1 minus the sum of the other values. This is effectively how the binary case works. However, this introduces an asymmetry in the values by treating one value differently from the other values. This is problematic because the errors for the other values accumulate, making for a poor prediction on the value treated specially; it’s even possible that the prediction for the remaining value is negative if the others sum to more than 1.

The standard alternative is to learn a linear function for each value of the target variable, exponentiate, and normalize. This has more parameters than necessary to represent the function (it is said to be over-parametrized) but treats all of the values in the same way. Suppose the target $Y$ is categorical with domain represented by the tuple of values $(v_{1},\dots,v_{k})$. The softmax function takes a vector (tuple) of real numbers, $(\alpha_{1},\dots,\alpha_{k})$, and returns a vector of the same size, where the $i$th component of the result is

 $softmax((\alpha_{1},\dots,\alpha_{k}))_{i}=\frac{exp(\alpha_{i})}{\sum_{j=1}^{k}exp(\alpha_{j})}.$

This ensures that the resulting values are all positive and sum to 1, and so can be considered as a probability distribution.

Sigmoid and softmax are closely related:

 $\displaystyle sigmoid(x)$ $\displaystyle=\frac{1}{exp(-x)+1}$ $\displaystyle=\frac{exp(x)}{exp(0)+exp(x)}$ $\displaystyle=softmax((0,x))_{2}$

where $(0,x)$ corresponds to the values $(false,true)$ and $softmax((0,x))_{2}$ is the second component of the pair that results from the softmax. The second equality follows from multiplying the numerator and the denominator by $exp(x)$, and noticing that $exp(x)*exp(-x)=exp(0)=1$. Thus, $sigmoid$ is equivalent to $softmax$ where the false component is fixed to be 0.

A softmax, like a sigmoid, cannot represent zero probabilities.

The generalization of logistic regression to predicting a categorical feature is called softmax regression, multinomial logistic regression, or multinomial logit. It involves a linear equation for each value in the domain of the target variable, $Y$. Suppose $Y$ has domain $(v_{1},\dots,v_{k})$. The prediction for example $e$ is a tuple of $k$ values, $softmax((u_{1}(e),\dots,u_{k}(e)))$, where the $j$th component is the prediction for $Y=v_{j}$ and

 $u_{j}(e)=w_{0,j}+{X_{1}}({e})*w_{1,j}+\dots+{X_{m}}({e})*w_{m,j}.$

This is typically optimized with categorical log loss.

Consider weight $w_{ij}$ that is used for input $X_{i}$ for output value $v_{j}$, and example $e$ that has $Y(e)=v_{q}$:

 $\displaystyle\frac{\partial}{\partial w_{ij}}$ $\displaystyle logloss(softmax((u_{1}(e),\dots,u_{k}(e))),v_{q})$ $\displaystyle=\frac{\partial}{\partial w_{ij}}-\log\left(\frac{exp(u_{q}(e))}{\sum_{l}exp(u_{l}(e))}\right)$ $\displaystyle=\frac{\partial}{\partial w_{ij}}(\log(\sum_{l}exp(u_{l}(e)))-u_{q}(e))$ $\displaystyle=((\widehat{Y}(e))_{j}-\mathbf{1}(j=q))*X_{i}(e)$

where $\mathbf{1}(j=q)$ is 1 if $j$ is the index of the observed value, $v_{q}$, and 0 otherwise, and $(\widehat{Y}(e))_{j}$ is the $j$th component of the prediction. This is the predicted value minus the actual value.

To implement this effectively, you need to consider how computers represent numbers. Taking the exponential of a large number can result in a number larger than the largest number able to be represented on the computer, resulting in overflow. For example, $exp(800)$ will overflow for most modern CPUs. Taking exponentials of a large negative number can result in a number that is represented as zero, resulting in underflow. For example, $exp(-800)$ results in zero on many CPUs. Adding a constant to each $\alpha_{i}$ in a softmax does not change the value of the softmax. To prevent overflow and prevent all values from underflowing, the maximum value can be subtracted from each value, so there is always a zero, and the rest are negative. On GPUs and similar parallel hardware, often lower precision is used to represent weights, and so it becomes more important to correct for underflow and overflow.
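Subtracting the maximum before exponentiating can be sketched as follows. The code uses only Python's standard `math` module with double-precision floats; the example inputs are made up.

```python
import math


def softmax(alphas):
    """Softmax that subtracts the maximum value before exponentiating."""
    top = max(alphas)
    exps = [math.exp(a - top) for a in alphas]   # largest term is exp(0) = 1
    total = sum(exps)                            # at least 1, so no division by zero
    return [v / total for v in exps]


# math.exp(800.0) on its own raises OverflowError in Python.
print(softmax([800.0, 799.0, 0.0]))

# sigmoid(x) equals the second component of softmax((0, x)).
x = 1.5
print(softmax([0.0, x])[1], 1 / (1 + math.exp(-x)))
```

In the first call the third component still underflows to zero, but the two large components are computed correctly, which is the behavior described above: overflow is avoided and not all values underflow.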

When there is a large number of possible values, the computation of the denominator can be expensive, as it requires summing over all values. For example, in natural language, we may want to predict the next word in a text, in which case the number of values could be up to a million or more (particularly when phrases and names such as “Mahatma Gandhi” are included). In this case, it is possible to represent the prediction in terms of a binary tree of the values, forming hierarchical softmax. This implements the same function as softmax, just more efficiently for large domains.

### Creating Input Features

The definitions of linear and logistic regression assume that the input features are numerical.

Categorical features can be converted into features with domain $\{0,1\}$ by using indicator variables, as was done for decision tree learning. This is known as a one-hot encoding.

A real-valued feature can be used directly as long as the target is a linear function of that input feature, when the other input features are fixed. If the target is not a linear function, often some transformation of the feature is used to create new features.

For ordinal features, including real-valued features, cuts can be used to define a Boolean feature from a real feature. For input feature $x$, choose a value $v$ and use a feature that is true if $x>v$, or equivalently $x-v>0$. It is also common to use binning. Binning involves choosing a set of thresholds, $\alpha_{1}<\alpha_{2}<\dots<\alpha_{k}$, and using a feature with domain $\{0,1\}$ for each interval between $\alpha_{i}$ and $\alpha_{i+1}$. Binning allows for a piecewise constant function. Constructing a feature using $max(x-v,0)$ allows for a connected piecewise linear approximation; this is the basis of the rectified linear unit (ReLU), further investigated in the next chapter.
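The feature constructions in this section can be sketched as small Python functions. The function names are hypothetical, and `bins` follows the earlier definition in which $k$ thresholds give $k+1$ indicator features, including the two unbounded outer intervals.

```python
def one_hot(value, domain):
    """Indicator features for a categorical value: exactly one entry is 1."""
    return [1 if value == v else 0 for v in domain]


def cut(x, v):
    """Boolean feature that is 1 when x > v."""
    return 1 if x > v else 0


def bins(x, thresholds):
    """One indicator per interval defined by sorted thresholds."""
    index = sum(1 for a in thresholds if x > a)   # number of thresholds below x
    return [1 if i == index else 0 for i in range(len(thresholds) + 1)]


def relu_feature(x, v):
    """max(x - v, 0): zero up to v, then linear in x."""
    return max(x - v, 0)


print(one_hot("M", ["S", "M", "L", "XL"]))        # [0, 1, 0, 0]
print(cut(12, 9), cut(12, 17))                    # 1 0
print(bins(12, [9, 17]))                          # [0, 1, 0]
print(relu_feature(12, 9), relu_feature(5, 9))    # 3 0
```

A linear function of the `bins` features is piecewise constant in $x$, whereas a linear function of several `relu_feature` values with different $v$ is continuous and piecewise linear.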

Designing appropriate features is called feature engineering. It is often difficult to design good features. Gradient-boosted trees use conjunctions of input features. Learning features is part of representation learning; see Chapter 8.