### 7.3.2 Linear Regression and Classification

Linear functions provide the basis for many learning algorithms. In this section, we first cover regression - the problem of predicting a real-valued function from training examples. Then we consider the discrete case of classification.

**Linear
regression** is the
problem of fitting a linear function to a set of input-output pairs
given a set of training examples, in which the input and
output features are numeric.

Suppose the input features are *X _{1},...,X_{n}*. A

**linear function**of these features is a function of the form

f^{w}(X_{1},...,X_{n}) = w_{0}+w_{1}×X_{1}+ ...+ w_{n}×X_{n},

where *w=⟨w _{0},w_{1},...,w_{n}⟩* is a tuple of
weights. To make

*w*not be a special case, we invent a new feature,

_{0}*X*, whose value is always 1.

_{0}We will learn a function for each target feature independently, so we
consider only one target, *Y*.
Suppose a set *E* of examples exists, where each example *e∈E* has
values *val(e,X _{i})* for feature

*X*and has an observed value

_{i}*val(e,Y)*. The predicted value is thus

pval ^{w}(e,Y)=w _{0}+w_{1}×val(e,X_{1}) + ...+ w_{n}×val(e,X_{n})=∑ _{i=0}^{n}w_{i}×val(e,X_{i}) ,

where we have made it explicit that the prediction depends on the
weights, and where *val(e,X _{0})* is defined to be 1.

The sum-of-squares error on examples *E* for target *Y* is

Error _{E}(w)= ∑ _{e∈E}(val(e,Y)-pval^{w}(e,Y))^{2}= ∑ _{e∈E}(val(e,Y)-∑_{i=0}^{n}w_{i}×val(e,X_{i}))^{2}.

In this linear case, the weights that minimize the error can be computed analytically [see Exercise 7.5]. A more general approach, which can be used for wider classes of functions, is to compute the weights iteratively.

**Gradient descent** is an iterative
method to find the minimum of a function. Gradient descent starts
with an initial set of weights; in each step, it decreases each
weight in proportion to its partial derivative:

w_{i}←w_{i}- η×(∂Error_{E}(w))/(∂w_{i})

where *η*, the gradient descent step size, is called
the **learning rate**. The learning rate, as well
as the features and the data, is given as input to the learning
algorithm.
The partial derivative
specifies how much a small change in
the weight would change the error.

Consider minimizing the
sum-of-squares error.
The error is a sum over the examples. The partial derivative of a sum is the sum of the partial
derivatives. Thus, we can consider each example separately and
consider how much it changes the weights. The error with respect to example *e* has a partial
derivative with respect to weight of *w _{i}* of

*-2×[val(e,Y)-pval*. For each example

^{w}(e,Y)]×val(e,X_{i})*e*, let

*δ=val(e,Y)-pval*. Thus, each example

^{w}(e,Y)*e*updates each weight

*w*:

_{i}

w _{i}← w _{i}+η×δ×val(e,X_{i}),

where we have ignored the constant 2, because we assume it is
absorbed into the constant *η*.

**Procedure**LinearLearner(

*X,Y,E,η*)

2:

**Inputs**

3:

*X*: set of input features,

*X={X*

_{1},...,X_{n}}4:

*Y*: target feature

5:

*E*: set of examples from which to learn

6:

*η*: learning rate

7:

**Output**

8: parameters

*w*

_{0},...,w_{n}9:

**Local**

10:

*w*: real numbers

_{0},...,w_{n}11:

*pval*

^{w}(e,Y)=w_{0}+w_{1}×val(e,X_{1}) + ...+ w_{n}×val(e,X_{n})12: initialize

*w*randomly

_{0},...,w_{n}13:

**repeat**

14:

**for each**example

*e*in

*E*

**do**

15:

*δ←val(e,Y)-pval*

^{w}(e,Y)16:

**for each**

*i∈[0,n]*

**do**

17:

*w*

_{i}←w_{i}+η×δ×val(e,X_{i})18:

19:

20:

**until**termination

21:

**return**

*w*

_{0},...,w_{n}Figure 7.6 gives an algorithm, *LinearLearner(X,Y,E,η)*, for learning a linear
function for minimizing the sum-of-squares error. Note that, in
line 17, *val(e,X _{0})* is 1 for all

*e*. Termination is usually after some number of steps, when the error is small or when the changes get small.

The algorithm presented in Figure 7.6 is sometimes
called **incremental gradient descent** because of the weight change
while it iterates through the examples. An alternative is to save the
weights at each iteration of the while loop, use the saved weights for
computing the function, and then update these saved weights after
all of the examples. This process computes the true derivative of the
error function, but it is more complicated and often does not work as well.

The same algorithm can be used for other error functions. For the absolute error, which is not actually differentiable at zero, the derivative can be defined to be zero at that point because the error is already at a minimum and the parameters do not have to change. See Exercise 7.12.