7.3.2 Linear Regression and Classification

Linear functions provide the basis for many learning algorithms. In this section, we first cover regression - the problem of predicting a real-valued function from training examples. Then we consider the discrete case of classification.

Linear regression is the problem of fitting a linear function to a set of input-output pairs given a set of training examples, in which the input and output features are numeric.

Suppose the input features are X1,...,Xn. A linear function of these features is a function of the form

fw(X1,...,Xn) = w0+w1 ×X1 + ...+ wn ×Xn ,

where w=⟨w0,w1,...,wn is a tuple of weights. To make w0 not be a special case, we invent a new feature, X0, whose value is always 1.

We will learn a function for each target feature independently, so we consider only one target, Y. Suppose a set E of examples exists, where each example e∈E has values val(e,Xi) for feature Xi and has an observed value val(e,Y). The predicted value is thus

pvalw(e,Y)=w0+w1×val(e,X1) + ...+ wn×val(e,Xn)
=∑i=0n wi×val(e,Xi) ,

where we have made it explicit that the prediction depends on the weights, and where val(e,X0) is defined to be 1.

The sum-of-squares error on examples E for target Y is

ErrorE(w) = ∑e∈E(val(e,Y)-pvalw(e,Y))2
= ∑e∈E (val(e,Y)-∑i=0n wi×val(e,Xi))2.

In this linear case, the weights that minimize the error can be computed analytically [see Exercise 7.5]. A more general approach, which can be used for wider classes of functions, is to compute the weights iteratively.

Gradient descent is an iterative method to find the minimum of a function. Gradient descent starts with an initial set of weights; in each step, it decreases each weight in proportion to its partial derivative:

wi ←wi - η×(∂ErrorE(w))/(∂wi)

where η, the gradient descent step size, is called the learning rate. The learning rate, as well as the features and the data, is given as input to the learning algorithm. The partial derivative specifies how much a small change in the weight would change the error.

Consider minimizing the sum-of-squares error. The error is a sum over the examples. The partial derivative of a sum is the sum of the partial derivatives. Thus, we can consider each example separately and consider how much it changes the weights. The error with respect to example e has a partial derivative with respect to weight of wi of -2×[val(e,Y)-pvalw(e,Y)]×val(e,Xi). For each example e, let δ=val(e,Y)-pvalw(e,Y). Thus, each example e updates each weight wi:


where we have ignored the constant 2, because we assume it is absorbed into the constant η.

1: Procedure LinearLearner(X,Y,E,η)
2:           Inputs
3:                     X: set of input features, X={X1,...,Xn}
4:                     Y: target feature
5:                     E: set of examples from which to learn
6:                     η: learning rate
7:           Output
8:                     parameters w0,...,wn
9:           Local
10:                     w0,...,wn: real numbers
11:                     pvalw(e,Y)=w0+w1×val(e,X1) + ...+ wn ×val(e,Xn)
12:           initialize w0,...,wn randomly
13:           repeat
14:                     for each example e in E do
15:                               δ←val(e,Y)-pvalw(e,Y)
16:                               for each i∈[0,n] do
17:                                         wi←wi+η×δ×val(e,Xi)
20:           until termination
21:           return w0,...,wn
Figure 7.6: Gradient descent for learning a linear function

Figure 7.6 gives an algorithm, LinearLearner(X,Y,E,η), for learning a linear function for minimizing the sum-of-squares error. Note that, in line 17, val(e,X0) is 1 for all e. Termination is usually after some number of steps, when the error is small or when the changes get small.

The algorithm presented in Figure 7.6 is sometimes called incremental gradient descent because of the weight change while it iterates through the examples. An alternative is to save the weights at each iteration of the while loop, use the saved weights for computing the function, and then update these saved weights after all of the examples. This process computes the true derivative of the error function, but it is more complicated and often does not work as well.

The same algorithm can be used for other error functions. For the absolute error, which is not actually differentiable at zero, the derivative can be defined to be zero at that point because the error is already at a minimum and the parameters do not have to change. See Exercise 7.12.