A common learning task is supervised learning, where there is a set of examples, described by features, which are partitioned into input features and target features. The aim is to predict the values of the target features from the values of the input features for each possible example.
A feature is a function from examples into a value. If $e$ is an example and $X$ is a feature, $X(e)$ is the value of feature $X$ for example $e$. The domain of a feature is the set of values it can return. Note that this is the range of the function defining the feature, but it becomes the domain of functions that take features and make predictions. A Boolean feature is one with domain $\{false, true\}$.
A feature has exactly one value for each example. Suppose an example is a news article. The topic of an article can be a feature, with domain the set of possible topics (perhaps including “none” if it is possible there is no topic), if each article has a unique topic. If each article can have multiple topics, topic would not be a feature. Instead, there could be a Boolean feature for each possible topic. Alternatively, there could be a feature that maps each news article to the set of topics it covers.
In a supervised learning task, the learner is given
a set of input features, $X_1, \dots, X_m$
a set of target features, $Y_1, \dots, Y_k$
a bag of training examples, where each example $e$ is a pair $(x_e, y_e)$, where $x_e$ is a tuple of a value for each input feature and $y_e$ is a tuple of a value for each target feature.
The output is a predictor, a function that predicts $Y$s from $X$s. The aim is to predict the values of the target features for examples that the learner has not seen. For this book, consider only a single target feature $Y$, except where explicitly noted.
Supervised learning is called regression when the domain of the target is (a subset of) the real numbers. It is called classification when the domain of the target is a fixed finite set; for example, the domain could be Boolean $\{false, true\}$, clothing sizes $\{XS, S, M, L, XL, \dots\}$, or countries of birth (the set of countries, or None for those people who were born outside of any country). Other forms of supervised learning include relational learning, such as predicting a person's birth mother when test examples might be from a different population of people than the training examples, and structured prediction, such as predicting the shape of a molecule.
Some examples of applications of supervised learning are
a smart watch predicting the activity of the wearer (e.g., sleeping, sitting, walking, running, driving) from their heart rate and movement
predicting how likely a location will be to flood in the next decade or century, based on the local topography, climate, geology, and land use, which might be used for insurance or for a decision about whether to build in the location
predicting the words that someone may have hand-written in a phone or tablet based on their writing, perhaps taking the order of the strokes into account
in a machine translation system, predicting Indonesian text that corresponds to some Swahili text (where the training examples might include text in other languages).
Before tackling sophisticated applications such as these, you need to understand the basics.
Figure 7.1 shows training examples typical of a classification task. The aim is to predict whether a person reads an article posted to a threaded discussion website given properties of the article. The input features are Author, Thread, Length, and Where_read. There is one target feature, User_action. The domain of Author is $\{known, unknown\}$, the domain of Thread is $\{new, followup\}$, and so on.
There are eighteen training examples, each of which has a value for all of the features. In this dataset, $Author(e_{11}) = unknown$, $Thread(e_{11}) = followup$, and $User\_action(e_{11}) = skips$.
There are two new cases, $e_{19}$ and $e_{20}$, for which the model needs to predict the user action.
Example | $X$ | $Y$
---|---|---
$e_1$ | 0.7 | 1.7
$e_2$ | 1.1 | 2.4
$e_3$ | 1.3 | 2.5
$e_4$ | 1.9 | 1.7
$e_5$ | 2.6 | 2.1
$e_6$ | 3.1 | 2.3
$e_7$ | 3.9 | 7
$e_8$ | 2.9 | ?
$e_9$ | 5.0 | ?

Figure 7.2: Training and test examples for a regression task
Figure 7.2 shows some data for a regression task, where the aim is to predict the value of feature $Y$ on examples for which the value of feature $X$ is provided. This is a regression task because $Y$ is a real-valued feature. Predicting a value of $Y$ for example $e_8$ is an interpolation problem, as its value for the input feature is between two of the values of the training examples. Predicting a value of $Y$ for example $e_9$ is an extrapolation problem, because its value is outside the range of the training examples.
Suppose $E$ is a bag of examples and $Y$ is a target feature. Given example $e \in E$, the actual value of $Y$ for $e$, the ground truth, is $Y(e)$. $|E|$ is the number of examples in $E$.
A predictor for target feature $Y$ is a function from the domains of the input features into the domain of $Y$. A predictor for $Y$ is written as $\hat{Y}$, so $\hat{Y}(e)$ is the predicted value for target feature $Y$ on example $e$. $\hat{Y}$ can only use the input features of an example for its prediction. $\hat{Y}$ is built using the examples in $E$, but should be applicable for all possible examples.
A point estimate for target feature $Y$ on example $e$ is a prediction of $\hat{Y}(e)$ that is a number if $Y(e)$ is real or Boolean, or can be a vector if $Y$ has a finite set of possible values or is vector-valued. For example, if $Y$ is Boolean, or has domain $\{0, 1\}$, a point prediction could be 1 or 0.7 or even 1.3 (although 1.3 would never be a good prediction). An example of a prediction that is not a point estimate is the prediction that the value of real-valued variable $Y$ lies between 1.7 and 1.8. If $Y$ is discrete with values red, yellow, and green, the prediction can be a vector containing a number for each of the three values.
The loss for example $e$ on feature $Y$ is a measure of how close the prediction $\hat{Y}(e)$ is to the actual value $Y(e)$. The measures below define a nonnegative real-valued function $loss(\hat{y}, y)$ that gives the loss for prediction $\hat{y}$ on an example when the actual value is $y$.
A common error function on a dataset is the mean loss, which for predictor $\hat{Y}$ on a dataset $E$ is

$$\frac{1}{|E|} \sum_{e \in E} loss(\hat{Y}(e), Y(e))$$

where $|E|$ is the number of examples in $E$. An alternative error is the sum of losses (without the $1/|E|$ factor). When finding a predictor to minimize the error, either the mean or the sum can be used, because they are minimized by the same predictors. When implementing an algorithm, it is important to check whether the mean or the sum is used, particularly when this value is added to another one. The mean loss is typically reported as the error for a dataset because it can be interpreted without knowing the number of examples.
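To make the distinction concrete, here is a minimal sketch (ours, not the book's code) of both error functions; `predictor`, `dataset`, and `loss` are hypothetical names, with a dataset represented as a list of (input, actual) pairs.

```python
# A minimal sketch of mean loss versus sum of losses.
# `dataset` is a list of (x, y) pairs; `predictor` maps an input x to a
# prediction; `loss` is any loss(prediction, actual) function as above.
def sum_loss(predictor, dataset, loss):
    return sum(loss(predictor(x), y) for x, y in dataset)

def mean_loss(predictor, dataset, loss):
    return sum_loss(predictor, dataset, loss) / len(dataset)
```

Both are minimized by the same predictor; only the reported number differs, by the factor $1/|E|$.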
Real-valued features are used when the values are totally ordered, and the differences are meaningful. This covers cases where the values are all integers as well as cases where the values can be arbitrary reals. For example, height in centimeters, student marks, and number of children could all be real-valued features. Clothing sizes, when mapped to numbers, could also be considered real-valued.
For regression, when the target feature is real-valued, both the actual value and the prediction are numbers. The following losses are common:
The 0–1 loss, or $L_0$ loss, has

$$loss(\hat{y}, y) = \begin{cases} 1 & \text{if } \hat{y} \ne y \\ 0 & \text{if } \hat{y} = y \end{cases}$$

This is sometimes written as $\mathbf{1}(\hat{y} \ne y)$, where $\mathbf{1}$ is a function from Booleans into $\{0, 1\}$, defined by $\mathbf{1}(true) = 1$ and $\mathbf{1}(false) = 0$.
The mean 0–1 loss of a dataset is the average number of predictions that are wrong. It does not take into account how wrong the predictions are, just whether they are correct or not. The accuracy of a predictor on a dataset is one minus the mean 0–1 loss, which is the number of correct predictions divided by the number of examples. Accuracy is maximized when 0–1 loss is minimized.
Testing equality for real or floating-point numbers is unreliable. This means that 0–1 loss may not be appropriate for the prediction of the mean, but is still applicable for predicting the mode or the median.
The absolute loss, or $L_1$ loss, is

$$loss(\hat{y}, y) = |\hat{y} - y|$$
the absolute difference between the predicted and actual value. This is always nonnegative. The mean absolute loss of a dataset is only zero when the predictions exactly fit the observed values. Unlike for the 0–1 loss, close predictions are better than far-away predictions.
The squared loss, or $L_2$ loss, is

$$loss(\hat{y}, y) = (\hat{y} - y)^2$$
This measure treats large losses as much worse than small losses. For example, a loss of 2 on an example is as bad as 4 losses of 1, and a loss of 10 on one example is as bad as 100 losses of 1. Minimizing mean squared loss is equivalent to minimizing the root-mean-square (RMS) error, the square root of the mean squared loss, because the square root function is a monotonically increasing function on nonnegative values.
Sometimes you might see squared loss written as $\frac{1}{2}(\hat{y} - y)^2$. The $\frac{1}{2}$ does not affect the minimization, but makes some calculations simpler (in particular, when taking derivatives).
The worst-case loss, or $L_\infty$ loss, on examples is the maximum absolute difference

$$\max_{e} \left| \hat{Y}(e) - Y(e) \right|$$
In this case, the learner is evaluated by its worst prediction. This is the only error covered that is not a mean or sum.
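The per-example losses above transcribe directly into code; the following sketch (ours, not the book's) treats the worst-case loss separately, since it aggregates over a dataset rather than scoring a single example.

```python
# Sketches of the losses defined above, for numeric predictions.
def zero_one_loss(y_hat, y):
    return 1 if y_hat != y else 0          # 0-1 (L0) loss

def absolute_loss(y_hat, y):
    return abs(y_hat - y)                  # absolute (L1) loss

def squared_loss(y_hat, y):
    return (y_hat - y) ** 2                # squared (L2) loss

def worst_case_loss(predictor, dataset):
    # worst-case (L-infinity) loss: the maximum over the whole dataset
    return max(abs(predictor(x) - y) for x, y in dataset)
```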
Consider the data of Figure 7.2. Figure 7.3 shows a plot of the training data (filled circles) and three lines, $P_1$, $P_2$, and $P_3$, that predict the $Y$-value for all $X$ points. $P_1$ minimizes the mean absolute loss, $P_2$ minimizes the mean squared loss, and $P_3$ minimizes the worst-case loss of the training examples.
As no three points are collinear, any line through any pair of the points minimizes the 0–1, $L_0$, loss.
Lines $P_1$ and $P_2$ give similar predictions for $X = 1.1$; namely, $P_1$ predicts 1.805 and $P_2$ predicts 1.709, whereas the data contain the point $(1.1, 2.4)$. $P_3$ predicts 0.7. The lines give predictions within 1.5 of each other when interpolating in the range $[1, 3]$. Their predictions diverge when extrapolating from the data. $P_1$ and $P_3$ give very different predictions for $X = 5$.
An outlier is an example that does not follow the pattern of the other examples. The difference between the lines that minimize the various error measures is most pronounced in how they handle outliers. The point $(3.9, 7)$ can be seen as an outlier, as the other points lie approximately on a line.
The prediction with the lowest worst-case loss for this example, $P_3$, only depends on three data points, $(1.1, 2.4)$, $(3.1, 2.3)$, and $(3.9, 7)$, each of which has the same worst-case loss for prediction $P_3$. The other data points could be at different locations, as long as they are not farther away from $P_3$ than these three points.
A prediction that minimizes the absolute loss, $P_1$, does not change as a function of the actual $Y$-value of the training examples, as long as the points above the line stay above the line, and those below the line stay below. For example, the prediction that minimizes the absolute loss would be the same even if the last data point were $(3.9, 107)$ instead of $(3.9, 7)$.
Prediction $P_2$ is sensitive to all of the data points; if the $Y$-value of any point changes, the line that minimizes the squared loss will change. Changes to outliers will have more effect on the line than changes to points close to the line.
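The three lines can be recovered numerically. The sketch below (ours, not the book's code; the exact fitted values depend on the optimizer and its starting point) minimizes each loss over a line's intercept and slope on the training data of Figure 7.2, using scipy.

```python
import numpy as np
from scipy.optimize import minimize

# Training data from Figure 7.2 (examples e1 to e7).
x = np.array([0.7, 1.1, 1.3, 1.9, 2.6, 3.1, 3.9])
y = np.array([1.7, 2.4, 2.5, 1.7, 2.1, 2.3, 7.0])

def fit_line(objective):
    """Return (intercept, slope) minimizing objective(predicted, actual)."""
    f = lambda w: objective(w[0] + w[1] * x, y)
    return minimize(f, x0=[y.mean(), 0.0], method="Nelder-Mead").x

p1 = fit_line(lambda yh, ya: np.mean(np.abs(yh - ya)))  # minimizes absolute loss
p2 = fit_line(lambda yh, ya: np.mean((yh - ya) ** 2))   # minimizes squared loss
p3 = fit_line(lambda yh, ya: np.max(np.abs(yh - ya)))   # minimizes worst-case loss

print(p2[0] + p2[1] * 1.1)  # close to 1.709, the least-squares prediction at X=1.1
```

Changing the last data point to a more extreme outlier and refitting illustrates the behavior described above: the absolute-loss line barely moves, the squared-loss line shifts, and the worst-case line chases the outlier.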
A categorical feature is one where the domain is a fixed finite set. Examples include the main topic of a news article (when each news article only has one main topic), the species of an animal, the country of birth of a person (perhaps with the extra value None for those who were not born in any country), or the letter of the alphabet for a hand-written letter.
Suppose target variable $Y$ is categorical with domain $D = \{v_1, \dots, v_k\}$. A point estimate is one of the following two forms:
A definitive prediction where the prediction is one of $v_1, \dots, v_k$.
A probabilistic prediction $p$, a function or dictionary that maps the domain into nonnegative real numbers such that $\sum_{v \in D} p[v] = 1$, where $p[v]$ is the value that prediction $p$ makes for value $v$. Thus a probabilistic prediction makes a nonnegative prediction for each value in the domain, and the predictions sum to 1.
A definitive prediction of $v_i$ is equivalent to the probabilistic prediction with $p[v_i] = 1$, where the other numbers in the probabilistic prediction are all 0.
Whether a variable is real-valued or categorical is often a matter of data design – how the data is represented.
A trading agent wants to learn a person’s preference for the length of holidays. The holiday can be for 1, 2, 3, 4, 5, or 6 days. See Figure 7.4 for example data. Note that there are no input features, and one output feature in this example.
Example | $Y$
---|---
$e_1$ | 1
$e_2$ | 6
$e_3$ | 6
$e_4$ | 2
$e_5$ | 1

Figure 7.4: Training examples for the holiday length task
One representation is to treat the target as a real-valued variable $Y$ that is the number of days in the holiday. A prediction for a new example $e$ can be any real number, such as $\hat{Y}(e) = 3.2$.
The target could also be treated as categorical with domain $\{1, 2, 3, 4, 5, 6\}$. A definitive prediction for example $e$ might be the prediction $\hat{Y}(e) = 1$. A probabilistic prediction for example $e$ might be the prediction represented by the dictionary $p$ with

$$p[1] = 0.25, \quad p[2] = 0.2, \quad p[3] = 0.1, \quad p[4] = 0.1, \quad p[5] = 0.1, \quad p[6] = 0.25$$

which means that it is predicting the holiday is of length 1 with probability 0.25, length 2 with probability 0.2, and so on.
The losses used for real-valued target features can be applied to categorical features by using the prediction $p[y]$ for an actual $y$, compared with the value 1. For example, if the actual is 6, the squared loss of the probabilistic prediction $p$ above is $(1 - p[6])^2 = 0.5625$.
It is common to report mean accuracy. Accuracy for a definitive prediction is 1 if the prediction is correct, and 0 otherwise. For probabilistic predictions, the accuracy is 1 when there is a unique mode (the $v$ with the highest corresponding $p[v]$) that corresponds to the actual value. Accuracy is a crude measure of how good probabilistic predictions are, as it only depends on the mode.
A probabilistic prediction is typically optimized with categorical log loss or categorical cross entropy, often just known as log loss, where the loss for prediction $p$ for actual $y$ is

$$logloss(p, y) = -\log p[y].$$

That is, when the actual is $y$, log loss selects the corresponding prediction, $p[y]$, and returns its negative logarithm.
The log loss is always greater than or equal to 0. It is only 0 when $p[y] = 1$ and all other probabilities are 0.
For categorical $Y$ and dataset $E$, predictor $\hat{Y}$ has mean log loss

$$-\frac{1}{|E|} \sum_{e \in E} \log \hat{Y}(e)[Y(e)]$$

where $\hat{Y}(e)[Y(e)]$ is the value of the prediction $\hat{Y}(e)$ evaluated at the actual value $Y(e)$ for example $e$.
The probabilistic prediction $p$ from Example 7.4 for the dataset of Figure 7.4, with four data points ($e_1$, $e_2$, $e_3$, and $e_5$) having the prediction 0.25, and one ($e_4$) having the prediction 0.2, has mean log loss (where the logarithm is base 2)

$$-\frac{4 \log_2 0.25 + \log_2 0.2}{5} \approx 2.064.$$
The prediction $q$ with $q[1] = 0.4$, $q[6] = 0.4$, and $q[2] = 0.2$, with the others 0, has mean log loss (base 2)

$$-\frac{4 \log_2 0.4 + \log_2 0.2}{5} \approx 1.522.$$
Thus, the second prediction, which uses the proportion of the data for each value, is better than the first prediction.
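As a quick check (ours, not the book's code), both mean log losses can be computed directly from the definitions:

```python
import math

actual = [1, 6, 6, 2, 1]                                 # Y values from Figure 7.4
p = {1: 0.25, 2: 0.2, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.25}   # prediction from Example 7.4
q = {1: 0.4, 2: 0.2, 6: 0.4}                             # empirical frequencies

def mean_log_loss(pred, ys):
    # categorical log loss, base 2; raises an error if pred[y] is 0 or missing
    return -sum(math.log2(pred[y]) for y in ys) / len(ys)

print(mean_log_loss(p, actual))   # approximately 2.064
print(mean_log_loss(q, actual))   # approximately 1.522
```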
There are two main justifications for log loss; one is in terms of probability, and one in terms of decision making.
You might want the model that best predicts the data; this is the model where the data is most likely given the model. The likelihood of the examples given model $\hat{Y}$, assuming that the examples are independent, is the product of the prediction for each example:

$$\prod_{e \in E} \hat{Y}(e)[Y(e)].$$

Taking the logarithm gives the log-likelihood

$$\sum_{e \in E} \log \hat{Y}(e)[Y(e)].$$
The mean log loss is the negative of the log-likelihood divided by the number of examples. Thus, log loss is minimized when the log-likelihood – and so also the likelihood – is maximized.
Mean log loss is appropriate when the prediction is used as a probability for gambling and other reasoning under uncertainty (see Chapter 9). For example, under squared loss, the predictions 0.001 and 0.0001 are very close; a prediction of 0.001 will have a very similar error to a prediction of 0.0001. However, as probabilities they are very different. An insurance company making its decisions based on a probability of 0.0001 will lose a lot of money if the correct probability is 0.001; the event will occur 10 times as often as expected. This difference is reflected in the log loss, but not in the losses that only use the differences between the actual and predicted values.
The logarithm of 0 is undefined. As $p[y]$ gets closer to 0, $-\log p[y]$ gets larger. For this reason, the log loss of a prediction of 0 is usually treated as infinity. Averaging infinity with any finite set of numbers gives infinity. Log loss thus discourages any probabilistic prediction of 0 for a value that is possible.
Log loss is closely related to the notion of entropy. The log loss (when the base of the logarithm is 2) can be seen as the mean number of bits it will take to encode the data given a code that is based on the prediction treated as a probability. The base of the logarithm provides a constant difference, which doesn't matter when the goal is to minimize the loss. However, when reporting results it is important to specify the base of the logarithm, where both 2 and $e$ (the base of the natural logarithm) are common.
Information Theory
A bit is a binary digit. Because a bit has two possible values (0 and 1), it can be used to distinguish two items. Two bits can distinguish four items, each associated with either 00, 01, 10, or 11. In general, $k$ bits can distinguish $2^k$ items. Thus, $n$ items can be distinguished with $\log_2 n$ bits (or the smallest integer greater than or equal to this value). It may be surprising, but you can do better than this using probabilities.
Consider this code to distinguish the elements of the set $\{a, b, c, d\}$, with $P(a) = \frac{1}{2}$, $P(b) = \frac{1}{4}$, $P(c) = \frac{1}{8}$, and $P(d) = \frac{1}{8}$:

$$a \mapsto 0 \qquad b \mapsto 10 \qquad c \mapsto 110 \qquad d \mapsto 111$$
This code sometimes uses one bit, sometimes two bits, and sometimes three bits. On average, it uses

$$P(a) \cdot 1 + P(b) \cdot 2 + P(c) \cdot 3 + P(d) \cdot 3 = \frac{1}{2} + \frac{2}{4} + \frac{3}{8} + \frac{3}{8} = 1\tfrac{3}{4} \text{ bits.}$$
For example, the string $aacabbda$, with 8 characters, has code 00110010101110, which uses 14 bits.
With this code, $-\log_2 P(a) = 1$ bit is required to distinguish $a$ from the other symbols. Distinguishing $b$ uses $-\log_2 P(b) = 2$ bits. Distinguishing $c$ or $d$ requires $-\log_2 P(c) = -\log_2 P(d) = 3$ bits.
It is possible to build a code that, to identify $x$, requires $-\log_2 P(x)$ bits (or the smallest integer greater than this). Suppose there is a sequence of symbols you want to transmit or store and you know the probability distribution over the symbols. A symbol with probability $p$ can use $-\log_2 p$ bits. To transmit a sequence, each symbol requires, on average,

$$\sum_{x} -P(x) \log_2 P(x)$$

bits to send, where the sum is over each value $x$ in the domain. This is called the information content or entropy of the distribution.
Suppose $P$ and $Q$ are both probability distributions. The expected number of bits to describe $P$ using a code optimized for $Q$ is

$$\sum_{x} -P(x) \log_2 Q(x).$$

This is the cross entropy of distribution $Q$ relative to $P$. For a fixed $P$, the cross entropy is minimized when $Q = P$. In machine learning, $Q$ is typically the learned model and $P$ the distribution of the data. Log loss allows for a separate prediction for each example.
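A small sketch (ours, not part of the box) of these quantities for the code above; the uniform distribution $Q$ is a made-up comparison model:

```python
import math

P = {"a": 1/2, "b": 1/4, "c": 1/8, "d": 1/8}        # distribution from the box
code = {"a": "0", "b": "10", "c": "110", "d": "111"}

avg_bits = sum(P[s] * len(code[s]) for s in P)           # 1.75 bits per symbol
entropy = -sum(P[s] * math.log2(P[s]) for s in P)        # also 1.75: the code is optimal

Q = {"a": 1/4, "b": 1/4, "c": 1/4, "d": 1/4}             # hypothetical uniform model
cross_entropy = -sum(P[s] * math.log2(Q[s]) for s in P)  # 2.0 bits, at least the entropy
```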
When the domain of $Y$ is binary, one value can be associated with 0, the other value with 1. For Boolean features, with domain $\{false, true\}$, it is traditional to associate 0 with $false$ and 1 with $true$. Boolean variables are very common; for example, predicting the topics of a news article that can have multiple topics can be modeled using a Boolean variable for each topic.
A Boolean target variable can be treated either as a real-valued prediction or as a categorical prediction. The real-valued prediction $\hat{p}$ for variable $Y$ is equivalent to the categorical prediction where the prediction for 1 is $\hat{p}$ and the prediction for 0 is $1 - \hat{p}$.
The binary log loss, or binary cross-entropy loss, for prediction $\hat{p}$ and actual value $y$ is defined by

$$logloss(\hat{p}, y) = -y \log \hat{p} - (1 - y) \log(1 - \hat{p})$$

and by convention, $0 \log 0 = 0$, even though $\log 0$ is undefined (or infinity). The log loss is $-\log \hat{p}$ when $y = 1$ and $-\log(1 - \hat{p})$ when $y = 0$, and so is the same as the categorical log loss where the prediction for value 1 is $\hat{p}$ and the prediction for value 0 is $1 - \hat{p}$. The term log loss includes both categorical log loss and binary log loss; which is meant should be clear from the context.
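A direct transcription of this definition (a sketch of ours, not the book's code), with the $0 \log 0$ convention handled explicitly:

```python
import math

def binary_log_loss(p_hat, y, log=math.log):
    """Return -y*log(p_hat) - (1-y)*log(1-p_hat), treating 0*log(0) as 0."""
    def term(weight, prob):
        if weight == 0:
            return 0.0                       # convention: 0 * log 0 = 0
        return -weight * (log(prob) if prob > 0 else -math.inf)
    return term(y, p_hat) + term(1 - y, 1 - p_hat)

binary_log_loss(0.8, 1)   # -log(0.8), about 0.223 in base e
binary_log_loss(1.0, 0)   # infinity: probability 0 was given to the actual value
```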
A binary point prediction could be any real number or could be restricted to be 0 or 1. Here we assume that the prediction can be any real number, except where explicitly noted.
The simplest case for learning is when there is a single target feature and the input features are ignored (or there are no input features). In this case, a learning algorithm predicts a single value for the target feature for all of the examples. This forms a naive baseline that any more sophisticated algorithm should be able to beat. It is also the base case for some of the learning algorithms.
Suppose $E$ is a bag of examples, $Y$ is a numeric feature, and the learner makes a point estimate, $v$, for all examples.
Loss | Error for prediction $v$ on training data | Optimal prediction for training data
---|---|---
mean 0–1 loss | $\frac{1}{n} \sum_{e} \mathbf{1}(Y(e) \ne v)$ | mode
mean absolute loss | $\frac{1}{n} \sum_{e} \lvert Y(e) - v \rvert$ | median
mean squared loss | $\frac{1}{n} \sum_{e} (Y(e) - v)^2$ | mean
worst case | $\max_{e} \lvert Y(e) - v \rvert$ | $(\min_e Y(e) + \max_e Y(e)) / 2$

Special case for domain of $Y$ is $\{0, 1\}$, with $n_1$ examples of value 1 and $n_0$ examples of value 0:

Loss | Error for prediction $v$ on training data | Optimal prediction for training data
---|---|---
mean log loss | $-\frac{n_1 \log v + n_0 \log(1 - v)}{n_0 + n_1}$ | $\frac{n_1}{n_0 + n_1}$

For categorical $Y$ with domain $\{v_1, \dots, v_k\}$, where value $v_i$ occurs $n_i$ times in the training data, and prediction $p$ has $p[v_i] = p_i$:

Loss | Error for prediction $p$ on training data | Optimal prediction for training data
---|---|---
mean log loss | $-\frac{\sum_i n_i \log p_i}{\sum_i n_i}$ | $p_i = \frac{n_i}{\sum_j n_j}$

Figure 7.5: Optimal predictions on training data
The optimal prediction depends on the optimality criterion. Figure 7.5 gives the optimal prediction for various optimality criteria for the case with no input features and a single real-valued feature $Y$, where the training set consists of $n$ examples, namely the numbers $Y(e_1), \dots, Y(e_n)$. The mode is the value that occurs most often. The median is the middle value after sorting the values; if there are two middle values, either of them, or any value between them, will minimize the absolute error. The mean is the sum of the values divided by the number of values. The best worst-case error is given by the mean of the minimum value and the maximum value. Exercise 7.1 asks about the proofs.
When the target feature has domain $\{0, 1\}$, the training examples can be summarized in two numbers: $n_1$, the number of examples with value 1, and $n_0$, the number of examples with value 0 ($n_0 = n - n_1$). The mean is the empirical frequency: the proportion of 1s in the training data, namely $n_1 / (n_0 + n_1)$. This is also the maximum likelihood estimate, and the prediction with the minimum log loss. The 0–1 error and the absolute error are minimized by predicting whichever of 0 or 1 occurs more often.
When the target feature is categorical with domain $\{v_1, \dots, v_k\}$, a dataset can be summarized by $k$ numbers $n_1, \dots, n_k$, where $n_i$ is the number of occurrences of $v_i$ in the dataset. In this case, the prediction that minimizes log loss on the training set is the empirical frequency, $p[v_i] = n_i / \sum_j n_j$. The prediction that minimizes 0–1 error and maximizes accuracy is to predict the mode: a value that appears most often.
Consider the length of holiday data of Example 7.4. Suppose only the target is given, and there are no input features, so all of the examples get the same prediction.
In the first representation, the prediction of 1 and the prediction of 6 both minimize the 0–1 error, and so maximize accuracy, on the training data. The prediction that minimizes the absolute loss is the median, 2, with a total absolute loss of 10 and a mean absolute loss of 2. The prediction that minimizes the squared error on the training data is the mean, 3.2. The prediction that minimizes the worst-case error is 3.5, halfway between the minimum and the maximum.
Treating the feature as categorical, the prediction that minimizes log loss on the training set is $p[1] = 0.4$, $p[2] = 0.2$, $p[6] = 0.4$, and 0 for the other values, which is the empirical distribution of the training data.
Thus, which prediction is preferred depends on how the prediction is represented and how it will be evaluated.
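The optimal point estimates for this example can be checked directly; the sketch below (ours, not the book's code) computes each from the data of Figure 7.4.

```python
import statistics
from collections import Counter

ys = [1, 6, 6, 2, 1]                   # holiday lengths from Figure 7.4

median = statistics.median(ys)         # 2: minimizes mean absolute loss
mean = statistics.mean(ys)             # 3.2: minimizes mean squared loss
midrange = (min(ys) + max(ys)) / 2     # 3.5: minimizes worst-case error
empirical = {v: n / len(ys) for v, n in Counter(ys).items()}
# {1: 0.4, 6: 0.4, 2: 0.2}: minimizes mean log loss on the training data
# (1 and 6 tie as the mode, so either minimizes 0-1 loss)
```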
This analysis does not specify the optimal prediction for unseen examples. You should not expect the empirical frequency of the training data to be the optimal prediction for new examples when maximizing the likelihood or minimizing the log loss. If some $n_i = 0$, the corresponding $v_i$ does not appear in the training data. However, if just one test example has value $v_i$, the likelihood would be 0 (its lowest possible value) and the log loss would be infinite (or undefined). This is an example of overfitting. See Exercise 7.1.
Optimality Criterion | Naive Prediction | Mean Loss
---|---|---
mean 0–1 loss | 0.5 | 1
mean absolute loss | 0.5 | 0.5
mean squared loss | 0.5 | 0.25
worst-case error | 0.5 | 0.5
mean log loss (base 2) | 0.5 | 1
mean log loss (base $e$) | 0.5 | 0.693…

Figure 7.6: Mean losses of the naive prediction 0.5 for a target with domain $\{0, 1\}$
Figure 7.6 provides mean losses that any method for binary target domains should beat, as a function of the optimality criterion. The naive prediction is a prediction that can be made before seeing any data. These losses act as a quick check for determining if a method is not working. For example, if the mean squared loss of a predictor for a 0/1 variable is greater than 0.25, the method is not working.
Given a dataset, for the optimality criterion used, you should compute the average loss for the optimal prediction on the training set – ignoring the input features – as a baseline to beat. Often you will find such a baseline is difficult to beat by a lot.
Not all errors are equal; the consequences of some errors may be much worse than others. For example, it may be much worse to predict a patient does not have a disease that they actually have, so that the patient does not get appropriate treatment, than it is to predict that a patient has a disease they do not actually have, which will force the patient to undergo further tests, but may result in more anxiety.
Consider a simple case where the domain of the target feature is Boolean (which you can consider as “positive” and “negative”) and the predictions are restricted to be Boolean. One way to evaluate a prediction independently of the decision is to consider the four cases between the predicted value and the actual value:
 | actual positive ($y = 1$) | actual negative ($y = 0$)
---|---|---
predict positive ($\hat{y} = 1$) | true positive ($tp$) | false positive ($fp$)
predict negative ($\hat{y} = 0$) | false negative ($fn$) | true negative ($tn$)
A false-positive error or type I error is a positive prediction that is wrong (i.e., the predicted value is positive and the actual value is negative). A false-negative error or type II error is a negative prediction that is wrong (i.e., the predicted value is negative and the actual value is positive). For a given predictor on a given set of examples, suppose $tp$ is the number of true positives, $fp$ is the number of false positives, $fn$ is the number of false negatives, and $tn$ is the number of true negatives.
Different choices of learning algorithm or parameters can affect the number of false positives and false negatives. Suppose false positives are $c$ times as bad as false negatives. $c < 1$ means false negatives are worse than false positives, $c = 1$ means they are equally bad, and $c > 1$ means false positives are worse than false negatives. Assuming the costs of errors can be added, the cost of a prediction is proportional to

$$c \cdot fp + fn.$$

One would then select the predictor with the lowest cost.
Sometimes you want to evaluate a predictor without knowing anything about the relative costs of different types of errors.
The following measures are often used (a code sketch follows the list):
The recall or true-positive rate is $tp / (tp + fn)$, the proportion of actual positives that are predicted to be positive.
The false-positive rate is $fp / (fp + tn)$, the proportion of actual negatives that are predicted to be positive.
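These counts and rates can be computed as in the sketch below (ours, not the book's code; predictions and actual values are assumed to be parallel sequences of 0/1), together with the cost for a given $c$:

```python
def confusion_counts(predicted, actual):
    """Return (tp, fp, fn, tn) for parallel sequences of 0/1 values."""
    tp = sum(p == 1 and a == 1 for p, a in zip(predicted, actual))
    fp = sum(p == 1 and a == 0 for p, a in zip(predicted, actual))
    fn = sum(p == 0 and a == 1 for p, a in zip(predicted, actual))
    tn = sum(p == 0 and a == 0 for p, a in zip(predicted, actual))
    return tp, fp, fn, tn

def true_positive_rate(tp, fp, fn, tn):      # recall
    return tp / (tp + fn)

def false_positive_rate(tp, fp, fn, tn):
    return fp / (fp + tn)

def cost(tp, fp, fn, tn, c=1.0):
    return c * fp + fn   # false positives are c times as bad as false negatives
```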
An agent should try to maximize the true-positive rate and minimize the false-positive rate; however, these goals are incompatible. An agent can maximize the true-positive rate by predicting positive for every example; however, this choice also maximizes the false-positive rate, because every actual negative is then also predicted to be positive.
Predictor $A$ dominates $B$ if $A$ has a higher true-positive rate and a lower false-positive rate than $B$. If $A$ dominates $B$ then $A$ has a lower cost (and so is better) than $B$ for all cost functions that depend only on the number of false positives and false negatives (assuming the costs to be minimized are additive and nonnegative). See Exercise 7.3.
For a given set of examples where both the prediction and the actual are given, a receiver operating characteristic space, or an ROC space, plots the false-positive rate against the true-positive rate for various predictors. Each predictor becomes a point in the space.
Consider a case where there are 100 examples that are actually positive and 1000 examples that are actually negative. Figure 7.7 shows the performance of six possible predictors for these 1100 examples. Predictor (a) predicts 70 of the positive examples correctly and 850 of the negative examples correctly. Predictor (e) predicts every example as positive, and (f) predicts all examples as negative.
Figure 7.7: Six predictors, (a) to (f), each plotted as a point in the ROC space (plot and per-predictor counts not shown)
In the ROC space, any predictor lower and to the right of another predictor is dominated by the other predictor. For example, (d) is dominated by (c); there would be no reason to choose (d) if (c) were available as a predictor.
Consider predictions (a) and (c) in Figure 7.7. The true-positive rate of (a) is 0.7 and the false-positive rate is 0.15. Predictor (c) has a true-positive rate of 0.98 and a false-positive rate of 0.2. If false positives were much more important than false negatives, then (a) would be better than (c), as it has fewer false positives. If false negatives were much more important (much worse) than false positives, then (c) would be better than (a), as it has fewer false negatives. Neither dominates the other in the ROC space.
Any predictor that is below the upper envelope of predictors (shown with line segments in Figure 7.7) is dominated by the other predictors. For example, although (a) is not dominated by (b) or by (c), it is dominated by the randomized predictor: with probability 0.5 use the prediction of (b), else use the prediction of (c). This randomized predictor would expect to have 26 false negatives and 112.5 false positives. The line between (b) and (c) reflects all probabilistic mixes of (b) and (c).
For some algorithms, there is a parameter which, when varied, results in predictors with different false-positive and true-positive rates. For example, a model that returns a real value can predict true if the output value is greater than a threshold, and false if the output value is less than the threshold. As the threshold varies, so do the predictions. In such cases, a single algorithm can provide a curve in the ROC space. One algorithm is better than another algorithm if its curve is above and to the left of the other. Often two algorithms have different parts of the curve where they dominate, in which case the preferred algorithm depends on the relative costs of the two types of errors. The area under the ROC curve (AUROC) is a single number that can be used to compare algorithms across the whole parameter space. One predictor having a larger AUROC than another means it is better for some cost functions, but does not imply it is better for all cost functions.
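A sketch (ours, not the book's code) of sweeping such a threshold to trace the curve in ROC space and computing AUROC with the trapezoid rule; `scores` are assumed to be the model's real-valued outputs:

```python
import numpy as np

def roc_points(scores, actual):
    """(FPR, TPR) points obtained by thresholding real-valued scores."""
    pos = sum(actual)                  # number of actual positives
    neg = len(actual) - pos            # number of actual negatives
    points = [(0.0, 0.0)]              # threshold above every score: predict none
    for t in sorted(set(scores), reverse=True):
        tp = sum(s >= t and a == 1 for s, a in zip(scores, actual))
        fp = sum(s >= t and a == 0 for s, a in zip(scores, actual))
        points.append((fp / neg, tp / pos))
    return points                      # ends at (1.0, 1.0): predict all positive

def auroc(scores, actual):
    xs, ys = zip(*roc_points(scores, actual))
    return np.trapz(ys, xs)            # area under the ROC curve
```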
Another common measure is precision, which is $tp / (tp + fp)$, the proportion of positive predictions that are actual positives. Some have suggested comparing algorithms using just precision and recall. However, it is possible that one algorithm might have a better (higher) recall and a better (higher) precision than another but be a worse algorithm for some cost functions. See Exercise 7.3.