### 7.2.1 Evaluating Predictions

If *e* is an example, a **point estimate** for target feature
*Y* is a prediction of a particular value for *Y* on *e*. Let *pval(e,Y)* be
the predicted value for target feature *Y* on example *e*. The **error**
for this example on this feature is a measure of how close *pval(e,Y)*
is to *val(e,Y)*, where *val(e,Y)* is the actual value for feature *Y*
in *e*.

For regression, when the target feature *Y* is real valued, both *pval(e,Y)* and
*val(e,Y)* are real numbers that can be compared arithmetically.

For classification, when the target feature *Y* is a discrete variable, a number
of alternatives exist:

- When *Y* is binary, one value can be associated with 0, the other value with 1, and a prediction can be some real number. The predicted and actual values can be compared numerically.
- When the domain of *Y* has more than two values, sometimes the values are totally ordered and can be scaled so a real number can be associated with each value of the domain of *Y*. In this case, the predicted and actual values can be compared on this scale. Often, this is not appropriate even when the values are totally ordered; for example, suppose the values are *short*, *medium*, and *long*. The prediction that the value is *short ∨ long* is very different from the prediction that the value is *medium*.
- When the domain of *Y* is $\{v_1,\dots,v_k\}$, where $k>2$, a separate prediction can be made for each $v_i$. This can be modeled by having a binary **indicator variable** associated with each $v_i$ which, for each example, has value 1 when the example has value $v_i$ and has value 0 otherwise. For each training example, exactly one of the indicator variables associated with *Y* will be 1 and the others will be 0. A prediction gives *k* real numbers, one for each $v_i$.

**Example 7.2:** Suppose the trading agent wants to learn a person's preference for the length of holidays. Suppose the holiday can be for 1, 2, 3, 4, 5, or 6 days.

One representation is to have a real-valued variable *Y* that is the number
of days in the holiday.

Another representation is to have six real-valued variables, $Y_1,\dots,Y_6$,
where $Y_i$ represents the proposition that the person would like to stay for
*i* days. For each example, $Y_i=1$ when there are *i* days in the holiday, and
$Y_i=0$ otherwise.

The following is a sample of five data points using the two representations:

| Example | $Y$ |
|---|---|
| $e_1$ | 1 |
| $e_2$ | 6 |
| $e_3$ | 6 |
| $e_4$ | 2 |
| $e_5$ | 1 |

| Example | $Y_1$ | $Y_2$ | $Y_3$ | $Y_4$ | $Y_5$ | $Y_6$ |
|---|---|---|---|---|---|---|
| $e_1$ | 1 | 0 | 0 | 0 | 0 | 0 |
| $e_2$ | 0 | 0 | 0 | 0 | 0 | 1 |
| $e_3$ | 0 | 0 | 0 | 0 | 0 | 1 |
| $e_4$ | 0 | 1 | 0 | 0 | 0 | 0 |
| $e_5$ | 1 | 0 | 0 | 0 | 0 | 0 |
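The mapping from the first representation to the second can be sketched in a few lines of Python. This is only an illustration: the function name `to_indicators` and the variable names are assumptions made here, not part of the text.

```python
# Holiday lengths for examples e1..e5, taken from the first table above.
days = [1, 6, 6, 2, 1]
domain = [1, 2, 3, 4, 5, 6]  # possible values of Y

def to_indicators(value, domain):
    """Return one 0/1 indicator per domain value; exactly one entry is 1.

    Hypothetical helper: position i holds the indicator for domain[i].
    """
    return [1 if value == v else 0 for v in domain]

# One row of indicator values (Y_1..Y_6) per example.
indicator_rows = [to_indicators(y, domain) for y in days]

for name, row in zip(["e1", "e2", "e3", "e4", "e5"], indicator_rows):
    print(name, row)
```

Each printed row should match the corresponding row of the second table, with a single 1 in the column for that example's holiday length.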

A prediction for a new example in the first representation can be any
real number, such as *Y=3.2*.

In the second representation, the learner would predict a value for
each $Y_i$ for each example. One such prediction may be
$Y_1=0.5$, $Y_2=0.3$, $Y_3=0.1$, $Y_4=0.1$, $Y_5=0.1$, and $Y_6=0.5$.
This is a prediction that the person may like 1 day or 6 days, but will not like a stay of 3, 4, or 5 days.

In the following definitions, *E* is the set of all examples and *T* is the set of
target features.

There are a number of prediction measures that can be defined:

- The **absolute error** on *E* is the sum of the absolute errors of the predictions on each example. That is,

  $$\sum_{e\in E}\sum_{Y\in T}\left|\mathit{val}(e,Y)-\mathit{pval}(e,Y)\right|.$$

  This is always non-negative, and is only zero when the predictions exactly fit the observed values.

- The **sum-of-squares error** on *E* is

  $$\sum_{e\in E}\sum_{Y\in T}\left(\mathit{val}(e,Y)-\mathit{pval}(e,Y)\right)^{2}.$$

  This measure treats large errors as worse than small errors. An error twice as big is four times as bad, and an error 10 times as big is 100 times worse.

- The **worst-case error** on *E* is the maximum absolute error:

  $$\max_{e\in E}\max_{Y\in T}\left|\mathit{val}(e,Y)-\mathit{pval}(e,Y)\right|.$$

  In this case, the learner is evaluated by how bad it can be.
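The three measures can be sketched directly from their definitions. The snippet below is a minimal illustration that assumes a single target feature, so the inner sum (or maximum) over *T* collapses to one term; the function names and the constant prediction of 3.0 are assumptions chosen for the example.

```python
def absolute_error(actual, predicted):
    """Sum of |val - pval| over all examples."""
    return sum(abs(a - p) for a, p in zip(actual, predicted))

def sum_of_squares_error(actual, predicted):
    """Sum of (val - pval)^2 over all examples."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))

def worst_case_error(actual, predicted):
    """Largest |val - pval| over all examples."""
    return max(abs(a - p) for a, p in zip(actual, predicted))

# Actual holiday lengths from Example 7.2 and a hypothetical
# learner that predicts 3.0 for every example.
actual = [1, 6, 6, 2, 1]
predicted = [3.0] * len(actual)

print(absolute_error(actual, predicted))        # 2+3+3+1+2 = 11.0
print(sum_of_squares_error(actual, predicted))  # 4+9+9+1+4 = 27.0
print(worst_case_error(actual, predicted))      # 3.0
```

The two examples with an error of 3 contribute 6 of the 11 units of absolute error but 18 of the 27 units of sum-of-squares error, which shows how squaring weights large errors more heavily.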

**Example 7.3:** Suppose there is a real-valued target feature, *Y*, that is based on a single real-valued input feature, *X*. Suppose the data contains the following *(X,Y)* points:

(0.7,1.7), (1.1,2.4), (1.3,2.5), (1.9,1.7), (2.6,2.1), (3.1,2.3), (3.9,7).

Figure 7.2 shows a plot of the training data (filled
circles) and three lines, $P_1$, $P_2$, and $P_3$, that predict the
*Y*-value for all *X* points. $P_1$ is the line that minimizes the absolute error,
$P_2$ is the line that minimizes the sum-of-squares error, and
$P_3$ minimizes the worst-case error of the training examples.

Lines $P_1$ and $P_2$ give similar predictions for $X=1.1$; namely,
$P_1$ predicts 1.805 and $P_2$ predicts 1.709, whereas the data contain a data point
$(1.1,2.4)$. $P_3$ predicts 0.7. They give predictions within 1.5 of each other when interpolating in the range
$[1,3]$. Their predictions diverge when extrapolating from the data.
$P_1$ and $P_3$ give very different predictions for $X=10$.

The difference between the lines that minimize the various error
measures is most pronounced in how they handle
the outlier examples, in this case the point *(3.9,7)*. The other
points are approximately in a line.

The prediction with the least worst-case error for this example,
$P_3$, only depends on three data points, $(1.1,2.4)$, $(3.1,2.3)$, and
$(3.9,7)$, each of which has the same worst-case error for prediction
$P_3$. The other data points could be at different locations, as long as they are not farther away from
$P_3$ than these three points.

In contrast, the prediction that minimizes the absolute error, $P_1$,
does not change as a function of the actual
*Y*-value of the training examples, as long as the points above the line stay above the line, and those below the line stay below. For example, the prediction that minimizes the absolute error would be the same, even if the last data point was
$(3.9,107)$ instead of $(3.9,7)$.

Prediction $P_2$ is sensitive to all of the data points; if the
*Y*-value for any point changes, the line that minimizes the sum-of-squares error will change.
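The sum-of-squares line for the data of this example can be reproduced with the standard closed-form least-squares formulas (slope equal to the covariance of *X* and *Y* divided by the variance of *X*). The sketch below is an illustration, not the method used to draw Figure 7.2; the function name `least_squares_line` is an assumption.

```python
# The seven (X, Y) training points from Example 7.3.
points = [(0.7, 1.7), (1.1, 2.4), (1.3, 2.5), (1.9, 1.7),
          (2.6, 2.1), (3.1, 2.3), (3.9, 7.0)]

def least_squares_line(points):
    """Return (slope, intercept) of the line minimizing sum-of-squares error."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = least_squares_line(points)
prediction_at_1_1 = slope * 1.1 + intercept
print(round(prediction_at_1_1, 3))  # about 1.709, the value quoted for P_2

# Moving the outlier changes the fitted line, unlike the absolute-error line.
outlier_moved = points[:-1] + [(3.9, 107.0)]
slope_moved, intercept_moved = least_squares_line(outlier_moved)
print(slope, slope_moved)
```

The second fit has a much steeper slope, which illustrates the sensitivity of the sum-of-squares line to the *Y*-value of a single point.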

There are a number of prediction measures that can be used for the
special case where the domain of *Y* is *{0,1}*, and the prediction
is in the range *[0,1]*. These measures can be used for Boolean
domains where *true* is treated as 1, and *false* is treated as 0.

- The **likelihood of the data** is the probability of the data when the predicted value is interpreted as a probability:

  $$\prod_{e\in E}\prod_{Y\in T}\mathit{pval}(e,Y)^{\mathit{val}(e,Y)}\,\left(1-\mathit{pval}(e,Y)\right)^{(1-\mathit{val}(e,Y))}.$$

  One of *val(e,Y)* and *(1-val(e,Y))* is 1, and the other is 0. Thus, this product uses *pval(e,Y)* when *val(e,Y)=1* and *(1-pval(e,Y))* when *val(e,Y)=0*. A better prediction is one with a higher likelihood. The model with the greatest likelihood is the **maximum likelihood model**.

- The **entropy** of the data is the number of bits it will take to encode the data given a code that is based on *pval(e,Y)* treated as a probability. The entropy is

  $$-\sum_{e\in E}\sum_{Y\in T}\left[\mathit{val}(e,Y)\log \mathit{pval}(e,Y) + \left(1-\mathit{val}(e,Y)\right)\log\left(1-\mathit{pval}(e,Y)\right)\right].$$

  A better prediction is one with a lower entropy.

  A prediction that minimizes the entropy is a prediction that maximizes the likelihood. This is because the entropy is the negative of the logarithm of the likelihood.

- Suppose the predictions are also restricted to be *{0,1}*. A **false-positive error** is a positive prediction that is wrong (i.e., the predicted value is *1*, and the actual value is *0*). A **false-negative error** is a negative prediction that is wrong (i.e., the predicted value is *0*, and the actual value is *1*). Often different costs are associated with the different sorts of errors. For example, if there are data about whether a product is safe, there may be different costs for claiming it is safe when it is not safe, and for claiming it is not safe when it is safe.

  We can separate the question of whether the agent has a good learning algorithm from whether it makes good predictions based on preferences that are outside of the learner. The predicting agent can at one extreme choose to only claim a positive prediction when it is sure the prediction is positive. At the other extreme, it can claim a positive prediction unless it is sure the prediction should be negative. It can often make predictions between these extremes.

  One way to test the prediction independently of the decision is to consider the four cases between the predicted value and the actual value:

  |  | actual positive | actual negative |
  |---|---|---|
  | predict positive | true positive (*tp*) | false positive (*fp*) |
  | predict negative | false negative (*fn*) | true negative (*tn*) |

  Suppose *tp* is the number of true positives, *fp* is the number of false positives, *fn* is the number of false negatives, and *tn* is the number of true negatives. The **precision** is *tp/(tp+fp)*, which is the proportion of positive predictions that are actual positives. The **recall** or **true-positive rate** is *tp/(tp+fn)*, which is the proportion of actual positives that are predicted to be positive. The **false-positive rate** is *fp/(fp+tn)*, which is the proportion of actual negatives predicted to be positive.

  An agent should try to maximize precision and recall and to minimize the false-positive rate; however, these goals are incompatible. An agent can maximize precision and minimize the false-positive rate by only making positive predictions it is sure about. However, this choice worsens recall. To maximize recall, an agent can be risky in making predictions, which makes precision smaller and the false-positive rate larger. The predicting agent often has parameters that can vary a threshold of when to make positive predictions. A **precision-recall curve** plots the precision against the recall as these parameters change. An **ROC curve**, or receiver operating characteristic curve, plots the true-positive rate against the false-positive rate as these parameters change. Each of these approaches may be used to compare learning algorithms independently of the actual claim of the agent.

- The prediction can be seen as an action of the predicting agent. The agent should choose the action that maximizes a preference function that involves a trade-off among the costs associated with its actions. The actions need not be just true or false; they may be more complex, such as "proceed with caution" or "definitely true." What an agent should do when faced with uncertainty is discussed in Chapter 9.
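The trade-off between precision, recall, and the false-positive rate can be seen by sweeping a threshold over real-valued predictions. The sketch below uses made-up labels and scores purely for illustration; the names `confusion_counts` and `evaluate` and the rule "predict positive when the score is at least the threshold" are assumptions, not something the text prescribes.

```python
def confusion_counts(actual, predicted):
    """Return (tp, fp, fn, tn) for 0/1 actual values and 0/1 predictions."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn

def evaluate(actual, scores, threshold):
    """Return (precision, recall, false-positive rate) at a given threshold."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp, fp, fn, tn = confusion_counts(actual, predicted)
    # Precision is undefined when no positive predictions are made.
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    recall = tp / (tp + fn)
    false_positive_rate = fp / (fp + tn)
    return precision, recall, false_positive_rate

# Hypothetical data: actual 0/1 values and predictions in [0, 1].
actual = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.6, 0.1]

for threshold in (0.75, 0.5, 0.35):
    print(threshold, evaluate(actual, scores, threshold))
# 0.75 -> (1.0, 0.5, 0.0)    cautious: high precision, low recall
# 0.5  -> (0.75, 0.75, 0.25)
# 0.35 -> (0.8, 1.0, 0.25)   risky: full recall, some false positives
```

Collecting the (recall, precision) pairs over many thresholds gives the points of a precision-recall curve; collecting the (false-positive rate, recall) pairs gives the points of an ROC curve.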

**Example 7.4:** Consider the data of Example 7.2. Suppose there are no input features, so all of the examples get the same prediction.

In the first representation, the prediction that minimizes the sum of absolute errors on the training data presented in Example 7.2 is 2, with an error of 10. The prediction that minimizes the sum-of-squares error on the training data is 3.2. The prediction that minimizes the worst-case error is 3.5.
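These three values are the median, the mean, and the midpoint of the range of the observed holiday lengths, which are the standard minimizers of the absolute, sum-of-squares, and worst-case errors for a single constant prediction. A short check in Python, with variable names chosen here for illustration:

```python
from statistics import mean, median

# Holiday lengths for e1..e5 from Example 7.2 (first representation).
days = [1, 6, 6, 2, 1]

# A median minimizes the sum of absolute errors.
best_absolute = median(days)
# The mean minimizes the sum-of-squares error.
best_squares = mean(days)
# The midpoint of the smallest and largest values minimizes the worst-case error.
best_worst_case = (min(days) + max(days)) / 2

absolute_error_at_median = sum(abs(y - best_absolute) for y in days)

print(best_absolute, absolute_error_at_median)  # 2 10
print(best_squares)                             # 3.2
print(best_worst_case)                          # 3.5
```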

For the second representation, the prediction that minimizes the sum
of absolute errors for the training examples is to predict 0 for each
$Y_i$. The prediction that minimizes the sum-of-squares error for the
training examples is $Y_1=0.4$, $Y_2=0.2$, $Y_3=0$, $Y_4=0$, $Y_5=0$, and $Y_6=0.4$,
which is the proportion of training examples in which each indicator has value 1.
This is also the prediction that minimizes the entropy and maximizes the likelihood of the training data. The prediction that minimizes the worst-case error for the training examples is to predict 0.5 for
$Y_1$, $Y_2$, and $Y_6$ and to predict 0 for the other features.
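The link between likelihood and entropy can be checked numerically for a single indicator. The sketch below looks only at $Y_1$, whose training values are 1, 0, 0, 0, 1, and compares a few candidate constant predictions. It assumes base-2 logarithms (so that entropy is measured in bits); the function names and the candidate values are illustrative assumptions.

```python
import math

def likelihood(actual, p):
    """Probability of the 0/1 data when every example is predicted as p."""
    result = 1.0
    for v in actual:
        result *= p if v == 1 else (1 - p)
    return result

def entropy_bits(actual, p):
    """Negative base-2 log-likelihood: bits needed to encode the data."""
    return -sum(math.log2(p) if v == 1 else math.log2(1 - p) for v in actual)

# Values of the indicator Y_1 for e1..e5 in Example 7.2.
y1 = [1, 0, 0, 0, 1]
candidates = [0.2, 0.4, 0.6]  # hypothetical predictions to compare

for p in candidates:
    print(p, likelihood(y1, p), entropy_bits(y1, p))

best_by_likelihood = max(candidates, key=lambda p: likelihood(y1, p))
best_by_entropy = min(candidates, key=lambda p: entropy_bits(y1, p))
print(best_by_likelihood, best_by_entropy)  # both 0.4
```

Among these candidates, 0.4 (the proportion of examples with $Y_1=1$) has both the highest likelihood and the lowest entropy, and for every candidate the entropy equals the negative base-2 logarithm of the likelihood.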

Thus, whichever prediction is preferred depends on how the prediction will be evaluated.