7.2 Supervised Learning

The third edition of Artificial Intelligence: foundations of computational agents, Cambridge University Press, 2023 is now available (including full text).

7.2.1 Evaluating Predictions

A point estimate for target feature Y on example e is a prediction of the value of Y(e). Let Y^(e) be the predicted value for target feature Y on example e. The error for this example on this feature is a measure of how close Y^(e) is to Y(e).

For regression, when the target feature Y is real valued, both Y^(e) and Y(e) are real numbers that can be compared arithmetically.

For classification, when the target feature Y is a discrete function, there are a number of alternatives:

  • When the domain of Y is binary, one value can be associated with 0, the other value with 1, and a prediction can be some real number. For Boolean features, with domain {false,true}, we associate 0 with false and 1 with true. The predicted value could be any real number or could be restricted to be 0 or 1. Here we assume that the prediction can be any real number, except where explicitly noted. The predicted and actual values can be compared numerically. There is nothing special about {0,1} for binary features; it is possible to use {-1,1} or to use zero and non-zero.

  • In a cardinal feature the values are mapped to real numbers. This is appropriate when values in the domain of Y are totally ordered, and the differences between the values are meaningful. In this case, the predicted and actual values can be compared on this scale.

    Often, mapping values to the real line is not appropriate even when the values are totally ordered; for example, suppose the values are short, medium, and long. The prediction that the value is “either short or long” is very different from the prediction that the value is medium. When the domain of a feature is totally ordered, but the differences between the values are not comparable, the feature is called an ordinal feature.

  • For a totally ordered feature, either cardinal or ordinal, and for a given value v, a Boolean feature can be constructed as a cut: a new feature that has value 1 when Yv and 0 otherwise. Which cut-values are used to construct features may be chosen according to the data or be selected a priori. Note that a cut for the maximal value in the domain, if there is one, is redundant as it is always true. It is also possible to construct a cut using less-than rather than less-than-or-equal to. Combining cuts allows for features that are true for intervals.

  • When Y is discrete with domain {v1,,vk}, where k>2, a separate prediction can be made for each vi. This can be modeled by having a binary indicator variable, Yi, associated with each value vi, where Yi(e)=1 if Y(e)=vi, and Yi(e)=0 otherwise. For each example, e, exactly one of Y1(e)Yk(e) will be 1 and the others will be 0. A prediction gives k real numbers – one real number for each Yi.

Example 7.3.

A trading agent wants to learn a person’s preference for the length of holidays. The holiday can be for 1, 2, 3, 4, 5, or 6 days.

One representation is to have a real-valued variable Y that is the number of days in the holiday.

Another representation is in terms of indicator variables, Y1,,Y6, where Yi represents the proposition that the person would like to stay for i days. For each example, Yi=1 when there are i days in the holiday, and Yi=0 otherwise.

The following are five data points using these two representations:

Example Y
e1 1
e2 6
e3 6
e4 2
e5 1
Example Y1 Y2 Y3 Y4 Y5 Y6
e1 1 0 0 0 0 0
e2 0 0 0 0 0 1
e3 0 0 0 0 0 1
e4 0 1 0 0 0 0
e5 1 0 0 0 0 0

A third representation is to have a binary cut feature of Yi for various values of i:

Example Y1 Y2 Y3 Y4 Y5
e1 1 1 1 1 1
e2 0 0 0 0 0
e3 0 0 0 0 0
e4 0 1 1 1 1
e5 1 1 1 1 1

A prediction for a new example e in the first representation can be any real number, such as Y^(e)=3.2.

In the second representation, the learner would predict a value for each Yi for each example. One such prediction may be, for each example e, predict Y1^(e)=0.5, Y2^(e)=0.3, Y3^(e)=0.1, Y4^(e)=0.1, Y5^(e)=0.1, and Y6^(e)=0.5. This is a prediction that the person may like 1 day or 6 days, but will like a stay of 3, 4, or 5 days much less.

In the third representation, the learner could predict a value for Yi for each value of i. One such prediction may be to predict Y1 with value 0.4, Yi with 0.6, for the other values for i. It is not rational to predict, for example, that Y2 but not Y4, because one implies the other.

In the following measures of prediction error, E is a set of examples and 𝐓 is a set of target features. For target feature Y𝐓 and example eE, the actual value is Y(e) and the predicted value is Y^(e).

  • The 0/1 error on E is the sum of the number of predictions that are wrong:

    eEY𝐓Y(e)Y^(e),

    where Y(e)Y^(e) is 0 when false, and 1 when true. This is the number of incorrect predictions. It does not take into account how wrong the predictions are, just whether they are correct or not.

  • The absolute error on E is the sum of the absolute differences between the actual and predicted values on each example:

    eEY𝐓|Y(e)-Y^(e)|.

    This is always non-negative, and is only zero when all the predictions exactly fit the observed values. Unlike for the 0/1 error, close predictions are better than far-away predictions.

  • The sum-of-squares error on E is

    eEY𝐓(Y(e)-Y^(e))2.

    This measure treats large errors as much worse than small errors. For example, an error of 2 on an example is as bad as 4 errors of 1, and an error of 10 on one example is as bad as 100 errors of 1. Minimizing sum-of-squares error is equivalent to minimizing the root-mean-square (RMS) error, obtained by dividing by the number of examples and taking the square root. Taking the square root and dividing by a constant do not affect which predictions are minimal.

  • The worst-case error on E is the maximum absolute difference:

    maxeEmaxY𝐓|Y(e)-Y^(e)|.

    In this case, the learner is evaluated by how bad it can be.

These are often described in terms of the norms of the difference between the predicted and actual values. The 0/1 error is the 𝐋𝟎 error, the absolute error is the 𝐋𝟏 error, the sum-of-squares error is the square of the 𝐋𝟐 error, and the worst-case error is the 𝐋 error. The sum-of-squares error is often written as L22, as the L2 norm takes the square root of the sum of squares. Taking square roots does not affect which value is the minimum. Note that the L0 error does not fit the mathematical definition of a norm.

Example 7.4.

Consider the data of Figure 7.2. Figure 7.3 shows a plot of the training data (filled circles) and three lines, L1, L2, and L, that predict the Y-value for all X points. L1 is the line that minimizes the absolute error, L2 is the line that minimizes the sum-of-squares error, and L minimizes the worst-case error of the training examples.

As no three points are collinear, any line through any pair of the points minimizes the 0/1, L0, error.

0

1

2

3

4

5

0

1

2

3

4

5

6

7

8

L1

L2

L

Figure 7.3: Linear regression predictions for a simple prediction example. Filled circles are the training examples. L1 is the prediction that minimizes the absolute error of the training examples. L2 is the prediction that minimizes the sum-of-squares error of the training examples. L is the prediction that minimizes the worst-case error of the training examples. See Example 7.4.

Lines L1 and L2 give similar predictions for X=1.1; namely, L1 predicts 1.805 and L2 predicts 1.709, whereas the data contain a data point (1.1,2.4). L predicts 0.7. They give predictions within 1.5 of each other when interpolating in the range [1,3]. Their predictions diverge when extrapolating from the data. L1 and L give very different predictions for X=5.

An outlier is an example that does not follow the pattern of the other examples. The difference between the lines that minimize the various error measures is most pronounced in how they handle outliers. The point (3.9,7) can be seen as an outlier as the other points are approximately in a line.

The prediction with the least worse-case error for this example, L, only depends on three data points, (1.1,2.4), (3.1,2.3), and (3.9,7), each of which has the same worst-case error for prediction L. The other data points could be at different locations, as long as they are not farther away from L than these three points.

The prediction that minimizes the absolute error, L1, does not change as a function of the actual Y-value of the training examples, as long as the points above the line stay above the line, and those below the line stay below. For example, the prediction that minimizes the absolute error would be the same, even if the last data point was (3.9,107) instead of (3.9,7).

Prediction L2 is sensitive to all of the data points; if the Y-value for any point changes, the line that minimizes the sum-of-squares error will change. Changes to outliers will have more effect on the line than changes to points close to the line.

For the special case where the domain of Y is {0,1}, and the prediction is in the range [0,1] (and so for Boolean domains where true is treated as 1, and false as 0), the following can also be used to evaluate predictions:

  • The likelihood of the data is the probability of the data when the predicted value is interpreted as a probability, and each of the examples are predicted independently:

    eEY𝐓Y^(e)Y(e)(1-Y^(e))(1-Y(e)).

    One of Y(e) and (1-Y(e)) is 1, and the other is 0. Thus, this product uses Y^(e) when Y(e)=1 and (1-Y^(e)) when Y(e)=0. A better prediction is one with a higher likelihood. The model with the greatest likelihood is the maximum likelihood model.

  • The log-likelihood is the logarithm of the likelihood, which is:

    eEY𝐓(Y(e)logY^(e)+(1-Y(e))log(1-Y^(e))).

    A better prediction is one with a higher log-likelihood. To make this into an error term to minimize, the log loss is the negative of the log-likelihood divided by the number of examples.

    The log loss is closely related to the notion of entropy. The log loss can be seen as the average number of bits it will take to encode the data given a code that is based on Y^(e) treated as a probability.

Information Theory

A bit is a binary digit. Because a bit has two possible values (0 and 1), it can be used to distinguish two items. Two bits can distinguish four items, each associated with either 00, 01, 10, or 11. In general, n bits can distinguish 2n items. Thus, we can distinguish n items with log2n bits. It may be surprising, but we can do better than this using probabilities.

Consider this code to distinguish the elements of the set {a,b,c,d}, with P(a)=12, P(b)=14, P(c)=18, and P(d)=18:

a0c110b10d111

This code sometimes uses 1 bit, sometimes 2 bits and sometimes uses 3 bits. On average, it uses

P(a)*1+P(b)*2+P(c)*3+P(d)*3=12+24+38+38=134 bits.

For example, the string aacabbda with 8 characters has code 00110010101110, which uses 14 bits.

With this code, -log2P(a)=1 bit is required to distinguish a from the other symbols. Distinguishing b uses -log2P(b)=2 bits. Distinguishing c or d requires -log2P(c)=3 bits.

It is possible to build a code that, to identify x, requires -log2P(x) bits (or the integer greater than this). Suppose there is a sequence of symbols we want to transmit or store and we know the probability distribution over the symbols. A symbol x with probability P(x) can use -log2P(x) bits. To transmit a sequence, each symbol requires, on average,

x-P(x)*log2P(x)

bits to send it. This is called the information content or entropy of the distribution. This value just depends on the probability distribution of the symbols.

Analogous to conditioning in probability, the expected number of bits it takes to describe a distribution for x given evidence e is

x-P(xe)*log2P(xe).

For a test that can distinguish the cases where α is true from the cases where α is false, the expected information after the test is

-P(α)*xP(xα)*log2P(xα)-P(¬α)*xP(x¬α)*log2P(x¬α).

The entropy of the distribution minus the expected information after the test is called the information gain of the test.

Example 7.5.

Consider the length of holiday data of Example 7.3. Suppose there are no input features, so all of the examples get the same prediction.

In the first representation, the prediction that minimizes the sum of absolute errors on the training data presented in Example 7.3 is 2, with an error of 10. The prediction that minimizes the sum-of-squares error on the training data is 3.2. The prediction the minimizes the worst-case error is 3.5.

For the second representation, the prediction that minimizes the sum of absolute errors for the training examples is to predict 0 for each Yi. The prediction that minimizes the sum-of-squares error for the training examples is Y1=0.4, Y2=0.1, Y3=0, Y4=0, Y5=0, and Y6=0.4. This is also the prediction that maximizes the likelihood of the training data. The prediction that minimizes the worst-case error for the training examples is to predict 0.5 for Y1, Y2, and Y6 and to predict 0 for the other features.

Thus, which prediction is preferred depends on how the prediction is represented and how it will be evaluated.