#### 7.3.2.1 Squashed Linear Functions

The use of a linear function does not work well for classification
tasks. When there are only two values, say *0* and *1*, a learner
should never make a prediction of greater than 1 or less than 0. However,
a linear function could make a
prediction of, say, 3 for one example just to fit other examples better.

Initially let's consider binary classification,
where the domain of the target
variable is *{0,1}*. If multiple binary target variables exist,
they can be learned separately.

For classification, we often use a
**squashed linear function** of the form

f^{w}(X_{1},...,X_{n}) = f( w_{0}+w_{1}×X_{1}+ ...+ w_{n}×X_{n}) ,

where *f* is an **activation function**, which is a function from
real numbers into *[0,1]*. Using a squashed linear function to
predict a value for the target feature means
that the prediction for example *e* for target feature *Y* is

pval^{w}(e,Y)=f(w_{0}+w_{1}×val(e,X_{1}) + ...+ w_{n}×val(e,X_{n})) .

A simple activation function is the
**step function**, *f(x)*, defined by

f(x)=

1 if x ≥ 0 0 if x< 0 .

A step function was the basis for the **perceptron**
[Rosenblatt (1958)], which was one of the first methods developed for
learning. It is difficult to adapt gradient
descent to step functions because gradient descent takes derivatives and
step functions are not differentiable.

If the activation is differentiable, we can use gradient descent to update the weights. The sum-of-squares error is

Error_{E}(w) = ∑_{e∈E}(val(e,Y)-f(∑_{i}w_{i}×val(e,X_{i})))^{2}.

The partial derivative with respect to weight *w _{i}* for example

*e*is

∂Error_{E}(w)/∂w_{i}=-2×δ×f'(∑_{i}w_{i}×val(e,X_{i})) ×val(e,X_{i}) .

where *δ=val(e,Y)-pval ^{w}(e,Y)*, as before.
Thus, each example

*e*updates each weight

*w*as follows:

_{i}

w _{i}← w _{i}+η×δ×f'(∑_{i}w_{i}×val(e,X_{i})) ×val(e,X_{i}) .

A typical differentiable activation function is the **sigmoid** or **logistic function**:

f(x)= 1/(1+e^{-x}) .

This function, depicted in Figure 7.7, squashes the
real line into the interval *(0,1)*, which is appropriate for
classification
because we would never want to make a prediction of greater than 1 or less than
0. It is also differentiable, with
a simple derivative - namely, *f'(x)=f(x)×(1-f(x))* - which can be
computed using just the values of the outputs.

The *Linear Learner* algorithm of Figure 7.6 can be changed to use the sigmoid
function by changing
line 17 to

w_{i}←w_{i}+η×δ×pval^{w}(e,Y)×[1-pval^{w}(e,Y)] ×val(e,X_{i}) .

where *pval ^{w}(e,Y)=f(∑_{i} w_{i} ×val(e,X_{i}))* is the predicted value
of feature

*Y*for example

*e*.

**Example 7.10:**Consider learning a squashed linear function for classifying the data of Figure 7.1. One function that correctly classifies the examples is

Reads = f(-8+7×Short + 3 ×New + 3 ×Known) ,

where *f* is the sigmoid function. A function similar to this can be found with
about 3,000 iterations of gradient descent with a learning rate *η=0.05*. According to this function, *Reads* is true (the predicted value is
closer to 1 than 0) if and only if *Short* is true
and either *New* or *Known* is true.
Thus, the linear classifier learns the same
function as the decision tree learner. To see how this works, see the
"mail reading" example of the Neural `AISpace.org` applet.

This algorithm with the sigmoid function as the activation function
can learn any linearly separable classification in the sense that the
error can be made arbitrarily small on arbitrary sets of examples if, and only if, the target
classification is
linearly separable. A classification is
**linearly separable** if there exists a hyperplane where the
classification is true on one side of the hyperplane and false on the
other side. The hyperplane is defined as where the predicted value,
*f ^{w}(X_{1},...,X_{n}) = f(w_{0}+w_{1} ×val(e,X_{1}) +
...+ w_{n} ×val(e,X_{n}))*,
is 0.5. For the sigmoid activation function, this occurs when

*w*for the learned weights

_{0}+w_{1}×val(e,X_{1}) + ...+ w_{n}×val(e,X_{n})=0*w*. On one side of this hyperplane, the prediction is greater than 0.5; on the other side, it is less than 0.5.

Figure 7.8 shows linear separators for "or" and
"and". The dashed line separates the positive (true) cases from the
negative (false) cases.
One simple function that is not linearly separable is the
**exclusive-or** (xor) function, shown on the right. There is no
straight line that
separates the positive examples from the negative examples. As a result, a linear classifier cannot represent,
and therefore cannot learn, the exclusive-or function.

Often it is difficult to determine a priori whether a data set is linearly separable.

**Example 7.11:**Consider the data set of Figure 7.9, which is used to predict whether a person likes a holiday as a function of whether there is culture, whether the person has to fly, whether the destination is hot, whether there is music, and whether there is nature. In this data set, the value 1 means true and 0 means false. The linear classifier requires the numerical representation.

Culture | Fly | Hot | Music | Nature | Likes |

0 | 0 | 1 | 0 | 0 | 0 |

0 | 1 | 1 | 0 | 0 | 0 |

1 | 1 | 1 | 1 | 1 | 0 |

0 | 1 | 1 | 1 | 1 | 0 |

0 | 1 | 1 | 0 | 1 | 0 |

1 | 0 | 0 | 1 | 1 | 1 |

0 | 0 | 0 | 0 | 0 | 0 |

0 | 0 | 0 | 1 | 1 | 1 |

1 | 1 | 1 | 0 | 0 | 0 |

1 | 1 | 0 | 1 | 1 | 1 |

1 | 1 | 0 | 0 | 0 | 1 |

1 | 0 | 1 | 0 | 1 | 1 |

0 | 0 | 0 | 1 | 0 | 0 |

1 | 0 | 1 | 1 | 0 | 0 |

1 | 1 | 1 | 1 | 0 | 0 |

1 | 0 | 0 | 1 | 0 | 0 |

1 | 1 | 1 | 0 | 1 | 0 |

0 | 0 | 0 | 0 | 1 | 1 |

0 | 1 | 0 | 0 | 0 | 1 |

After 10,000 iterations of gradient descent with a learning rate of 0.05, the optimal prediction is (to one decimal point)

Likes=f( 2.3×Culture +0.01×Fly-9.1×Hot -4.5×Music+6.8×Nature+0.01) ,

which approximately predicts the target value for all of the tuples in the training set except for the last and the third-to-last tuple, for which it predicts a value of about 0.5. This function seems to be quite stable with different initializations. Increasing the number of iterations makes it predict the other tuples better.

When the domain of the target variable is greater than 2 - there are more than two classes - indicator variables can be used to convert the classification to binary variables. These binary variables could be learned separately. However, the outputs of the individual classifiers must be combined to give a prediction for the target variable. Because exactly one of the values must be true for each example, the learner should not predict that more than one will be true or that none will be true. A classifier that predicts a probability distribution can normalize the predictions of the individual predictions. A learner that must make a definitive prediction can use the mode.