Artificial Intelligence - foundations of computational agents -- 7.2.2 Point Estimates with No Input Features

Third edition of Artificial Intelligence: foundations of computational agents, Cambridge University Press, 2023 is now available (including the full text).

7.2.2 Point Estimates with No Input Features

The simplest case for learning is when there are no input features and where there is a single target feature. This is the base case for many of the learning algorithms and corresponds to the case where all inputs are ignored. In this case, a learning algorithm predicts a single value for the target feature for all of the examples. The prediction that minimizes the error depends on the error that is being minimized.

Suppose E is a set of examples and Y is a numeric feature. The best an agent can do is to make a single point estimate for all examples. Note that it is possible for the agent to make stochastic predictions, but these are not better; see Exercise 7.2.

The sum-of-squares error on E of prediction v is

∑_e∈E (val(e,Y)-v)².

The absolute error on E of prediction v is

∑_e∈E |val(e,Y)-v|.

The worst-case error on E of prediction v is

max_e∈E |val(e,Y)-v|.

Proposition. Suppose V is the multiset of values of val(e,Y) for e∈E.
The prediction that minimizes the sum-of-squares error on E is the mean of V (the average value).
The value that minimizes the absolute error is the median of V. In particular, any number v such that there is the same number of values of V less than v as there are values greater than v minimizes the error.
The value that minimizes the worst-case error is (max+min)/2, where max is the maximum value and min is the minimum value.

Proof. The details of the proof are left as an exercise. The basic idea follows:
Differentiate the formula for the sum-of-squares error with respect to v and set to zero. This is elementary calculus. To make sure the point(s) with a derivative of zero is(are) a minimum, the end points also must be checked.
The absolute error is a piecewise linear function of v. The slope for a value that is not in V depends on the number of elements greater minus the number of elements less than that value: v is a minimum if there are the same number of elements greater than v as there are less than v.
This prediction has an error of (max-min)/2; increasing or decreasing the prediction will increase the error.

When the target feature has domain {0,1}, the training examples can be summarized in two numbers: n₀, the number of examples with the value 0, and n₁, the number of examples with value 1. The prediction for each new case is the same number, p.

The optimal prediction p depends on the optimality criteria. The value of the optimality criteria for the training examples can be computed analytically and can be optimized analytically. The results are summarized in Figure 7.3.

Prediction

Measure of

Optimal

measure

prediction p for

prediction for

the training data

training data

absolute error

n₀ p+n₁(1-p)

median(n₀,n₁)

sum squares

n₀ p²+n₁(1-p)²

(n₁)/(n₀+n₁)

worst case

{
p if n₁=0

1-p if n₀=0

max(p,1-p) otherwise

{
0 if n₁=0

1 if n₀=0

0.5 otherwise

likelihood

p^n₁(1-p)^n₀

(n₁)/(n₀+n₁)

entropy

-n₁ log p-n₀ log (1-p)

(n₁)/(n₀+n₁)

Figure 7.3: Optimal prediction for binary classification where the training data consist of n₀ examples of 0 and n₁ examples of 1, with no input features. median(n₀,n₁) is 0 if n₀>n₁, 1 if n₀<n₁, and any value in [0,1] if n₀=n₁.

Notice that optimizing the absolute error means predicting the median, which in this case is also the mode; this should not be surprising because the error is linear in p.

The optimal prediction for the training data for the other criteria is to predict the empirical frequency: the proportion of 1's in the training data, namely (n₁)/(n₀+n₁). This can be seen as a prediction of the probability. The empirical frequency is often called the maximum-likelihood estimate.

This analysis does not specify the optimal prediction for the test data. We would not expect the empirical frequency of the training data to be the optimal prediction for the test data for maximizing the likelihood or minimizing the entropy. If n₀=0 or if n₁=0, all of the training data are classified the same. However, if just one of the test examples is not classified in this way, the likelihood would be 0 (its lowest possible value) and the entropy would be infinite. See Exercise 7.1.