7.2 Supervised Learning


7.2.3 Point Estimates with No Input Features

The simplest case for learning is when there are no input features and there is a single target feature. This is the base case for many learning algorithms and corresponds to the case where all inputs are ignored. In this case, a learning algorithm predicts a single value for the target feature for all of the examples. The prediction that minimizes the error depends on which error measure is being minimized.

Suppose E is a set of examples and Y is a numeric feature. The best an agent can do is to make a single point estimate, v, for all examples. The errors for this case are

The 0/1 error on E of prediction v is $\sum_{e\in E} \mathbf{1}(Y(e)\neq v)$, the number of examples on which the prediction is wrong.

The sum-of-squares error on E of prediction v is $\sum_{e\in E} (Y(e)-v)^2$.

The absolute error on E of prediction v is $\sum_{e\in E} |Y(e)-v|$.

The worst-case error on E of prediction v is $\max_{e\in E} |Y(e)-v|$.
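As a concrete illustration (a minimal sketch, not from the book; the function name `errors` is made up), these four measures can be computed directly from their definitions:

```python
def errors(values, v):
    """0/1, sum-of-squares, absolute, and worst-case errors of predicting v
    for every example, where `values` are the observed values Y(e)."""
    zero_one = sum(1 for y in values if y != v)       # number of wrong predictions
    sum_squares = sum((y - v) ** 2 for y in values)   # sum of squared differences
    absolute = sum(abs(y - v) for y in values)        # sum of absolute differences
    worst_case = max(abs(y - v) for y in values)      # largest absolute difference
    return zero_one, sum_squares, absolute, worst_case

print(errors([3, 4, 6, 17], 5))   # (4, 150, 16, 12)
```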

Proposition 7.1.

Suppose V is the multiset of values of Y(e) for $e\in E$.

1. A prediction that minimizes the 0/1 error is a mode: one of the values that appears most often. When there are multiple modes, any of them can be chosen.

2. The prediction that minimizes the sum-of-squares error on E is the mean of V (the average value).

3. The absolute error is minimized by any median of V. A median is a middle value when the values are sorted: a number m such that half or more of the values of V are less than or equal to m and half or more are greater than or equal to m. For example, for the values {3,4,6,17}, any number between 4 and 6 is a median.

4. The value that minimizes the worst-case error is (max+min)/2, where max is the maximum value and min is the minimum value of V.
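As an illustrative check of Proposition 7.1 (a minimal sketch, not from the book; the function name `optimal_point_estimates` is made up), the four optimal point estimates can be computed with the Python standard library:

```python
import statistics

def optimal_point_estimates(values):
    """Optimal single-value predictions for each error measure (Proposition 7.1)."""
    mode = statistics.multimode(values)[0]      # minimizes the 0/1 error (any mode works)
    mean = statistics.mean(values)              # minimizes the sum-of-squares error
    median = statistics.median(values)          # minimizes the absolute error
    midrange = (max(values) + min(values)) / 2  # minimizes the worst-case error
    return mode, mean, median, midrange

print(optimal_point_estimates([3, 4, 6, 17]))   # (3, 7.5, 5.0, 10.0)
```

Here statistics.median returns 5.0, one of the medians of {3,4,6,17}; by part 3, any value between 4 and 6 would do equally well.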

Proof.

The details of the proof are left as an exercise. The proof sketch is as follows.

1. This should be obvious: the 0/1 error counts the examples whose value differs from v, and that count is smallest when v is one of the most common values.

2. Differentiate the formula for the sum-of-squares error with respect to v and set the derivative to zero. This is elementary calculus; the step is worked out after this list.

3. The absolute error is a piecewise linear function of v. The slope for a value that is not in V depends on the number of elements greater than that value minus the number of elements less than it; v is a minimizer when there are as many elements greater than v as there are less than v.

4. This prediction has a worst-case error of (max-min)/2; increasing or decreasing the prediction will increase the worst-case error.
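To fill in the calculus step in part 2 (a standard derivation, not spelled out in the text): setting the derivative of the sum-of-squares error to zero gives

$$\frac{d}{dv}\sum_{e\in E}(Y(e)-v)^2 = -2\sum_{e\in E}\bigl(Y(e)-v\bigr) = 0
\quad\Longrightarrow\quad
v = \frac{1}{|E|}\sum_{e\in E} Y(e),$$

which is the mean of V.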

When the target feature has domain {0,1}, the training examples can be summarized in two numbers: $n_0$, the number of examples with value 0, and $n_1$, the number of examples with value 1. The prediction for each new case is the same number, $p$.

The optimal prediction $p$ depends on the optimality criterion. The value of each criterion on the training examples can be computed analytically, and so can the prediction that optimizes it. The results are summarized in Figure 7.5.

| Prediction measure | Measure of prediction $p$ on the training data | Optimal prediction for the training data |
|---|---|---|
| 0/1 error | $n_0$ if $p=1$; $n_1$ if $p=0$ | 0 if $n_0>n_1$, else 1 |
| absolute error | $n_0\,p + n_1\,(1-p)$ | 0 if $n_0>n_1$, else 1 |
| sum squares | $n_0\,p^2 + n_1\,(1-p)^2$ | $n_1/(n_0+n_1)$ |
| worst case | $p$ if $n_1=0$; $1-p$ if $n_0=0$; $\max(p,1-p)$ otherwise | 0 if $n_1=0$; 1 if $n_0=0$; 0.5 otherwise |
| likelihood | $p^{n_1}(1-p)^{n_0}$ | $n_1/(n_0+n_1)$ |
| log-likelihood | $n_1\log p + n_0\log(1-p)$ | $n_1/(n_0+n_1)$ |

Figure 7.5: Optimal prediction for binary classification where the training data consist of $n_0$ examples with value 0 and $n_1$ examples with value 1, and there are no input features
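As a rough numerical check of Figure 7.5 (an illustrative sketch, not from the book; the function name `best_prediction` and the grid search are made up), the following evaluates each continuous training-data measure over a grid of predictions $p$ and reports the best $p$, matching the right-hand column:

```python
import math

def best_prediction(n0, n1):
    """Grid-search the prediction p that optimizes each measure in Figure 7.5."""
    grid = [i / 1000 for i in range(1, 1000)]   # p in (0, 1); endpoints omitted to avoid log(0)
    measures = {
        "absolute error": lambda p: n0 * p + n1 * (1 - p),
        "sum squares": lambda p: n0 * p**2 + n1 * (1 - p)**2,
        "worst case": lambda p: max(p, 1 - p) if n0 and n1 else (p if n1 == 0 else 1 - p),
        "neg log-likelihood": lambda p: -(n1 * math.log(p) + n0 * math.log(1 - p)),
    }
    return {name: min(grid, key=f) for name, f in measures.items()}

print(best_prediction(n0=30, n1=10))
# absolute error -> 0.001 (the grid excludes 0), sum squares -> 0.25,
# worst case -> 0.5, neg log-likelihood -> 0.25 = n1/(n0+n1)
```

Minimizing the negative log-likelihood is equivalent to maximizing the likelihood and the log-likelihood, so one row of the check covers all three.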

Notice that minimizing the absolute error means predicting the median, which in this case is also the mode. The absolute error is linear in $p$, and so is minimized at one of the endpoints, $p=0$ or $p=1$.

The optimal prediction for the training data for the sum-of-squares error, the likelihood, and the log-likelihood is the empirical frequency: the proportion of 1s in the training data, namely $n_1/(n_0+n_1)$. This can be seen as a prediction of the probability that a new example has value 1. The empirical frequency is often called the maximum-likelihood estimate.
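To see why the empirical frequency maximizes the likelihood, a standard derivation (this worked step is not in the text): setting the derivative of the log-likelihood to zero gives

$$\frac{d}{dp}\Bigl(n_1\log p + n_0\log(1-p)\Bigr) = \frac{n_1}{p} - \frac{n_0}{1-p} = 0
\quad\Longrightarrow\quad
p = \frac{n_1}{n_0+n_1}.$$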

This analysis does not specify the optimal prediction for the test data. We would not expect the empirical frequency of the training data to be the optimal prediction for the test data when maximizing the likelihood or the log-likelihood. If $n_0=0$ or $n_1=0$, all of the training data are classified the same. However, if just one of the test examples is not classified in this way, the likelihood would be 0 (its lowest possible value) and the log-likelihood would be $-\infty$. This is an example of overfitting. See Exercise 1.
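The following sketch shows this effect (illustrative, not from the book; the function name `log_likelihood` is made up): with $n_0=0$ the empirical frequency is $p=1$, and a single test example with value 0 makes the likelihood 0 and the log-likelihood $-\infty$.

```python
import math

def log_likelihood(p, test_values):
    """Log-likelihood of test_values under a constant predicted probability p of value 1."""
    total = 0.0
    for y in test_values:
        prob = p if y == 1 else 1 - p          # probability the prediction assigns to y
        total += math.log(prob) if prob > 0 else -math.inf
    return total

p = 8 / (0 + 8)                        # empirical frequency when n0 = 0 and n1 = 8
print(log_likelihood(p, [1, 1, 1, 1])) # 0.0: test data that match the training data
print(log_likelihood(p, [1, 1, 1, 0])) # -inf: a single test example with value 0
```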