The third edition of Artificial Intelligence: foundations of computational agents, Cambridge University Press, 2023 is now available (including full text).
Not all errors are equal; the consequences of some errors may be much worse than others. For example, predicting that a patient does not have a disease the patient actually has, so that the patient does not get appropriate treatment, may be much worse than predicting that a patient has a disease the patient does not actually have, which forces the patient to undergo further tests.
A prediction can be seen as an action of a predicting agent. The agent should choose the best prediction according to the costs associated with the errors. What an agent should do when faced with decisions under uncertainty is discussed in Chapter 9. The possible actions may be more complex than simply true or false, such as "watch for worsening symptoms" or "go and see a specialist."
Consider a simple case where the domain of the target feature is Boolean (which we can consider as “positive” and “negative”) and the predictions are restricted to be Boolean. One way to evaluate a prediction independently of the decision is to consider the four cases between the predicted value and the actual value:
                         actual positive ($ap$)   actual negative ($an$)
predict positive ($pp$)  true positive ($tp$)     false positive ($fp$)
predict negative ($pn$)  false negative ($fn$)    true negative ($tn$)
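The four counts can be tallied directly from paired actual and predicted labels. The following is a minimal sketch in Python; the function name and example lists are illustrative, not from the text:

```python
def confusion_counts(actual, predicted):
    """Return (tp, fp, fn, tn) for paired Boolean labels."""
    tp = sum(a and p for a, p in zip(actual, predicted))
    fp = sum((not a) and p for a, p in zip(actual, predicted))
    fn = sum(a and (not p) for a, p in zip(actual, predicted))
    tn = sum((not a) and (not p) for a, p in zip(actual, predicted))
    return tp, fp, fn, tn

# Five hypothetical examples: actual value vs. predicted value.
actual    = [True, True, False, False, True]
predicted = [True, False, True, False, True]
print(confusion_counts(actual, predicted))  # (2, 1, 1, 1)
```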
A false-positive error or type I error is a positive prediction that is wrong (i.e., the predicted value is positive, but the actual value is negative). A false-negative error or type II error is a negative prediction that is wrong (i.e., the predicted value is negative, but the actual value is positive).
A predictor or predicting agent could, at one extreme, choose to only claim a positive prediction for an example when it is sure the example is actually positive. At the other extreme, it could claim a positive prediction for an example unless it is sure the example is actually negative. It could also make predictions between these extremes. We can separate the question of whether a predicting agent has a good learning algorithm from whether it makes good predictions based on preferences or costs that are outside the learner.
For a given predictor for a given set of examples, suppose $tp$ is the number of true positives, $fp$ is the number of false positives, $fn$ is the number of false negatives, and $tn$ is the number of true negatives. The following measures are often used:
The precision, $\frac{tp}{tp+fp}$, is the proportion of positive predictions that are actual positives.
The recall or true-positive rate, $\frac{tp}{tp+fn}$, is the proportion of actual positives that are predicted to be positive.
The false-positive rate, $\frac{fp}{fp+tn}$, is the proportion of actual negatives that are predicted to be positive.
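These three measures are direct ratios of the counts. A minimal sketch, using for illustration the counts of predictor (a) from Figure 7.4 (70 true positives, 150 false positives, 30 false negatives, 850 true negatives); the guard for an empty denominator anticipates predictors that make no positive predictions:

```python
def precision(tp, fp):
    # Undefined when no positive predictions are made.
    return tp / (tp + fp) if tp + fp > 0 else None

def recall(tp, fn):
    return tp / (tp + fn)

def false_positive_rate(fp, tn):
    return fp / (fp + tn)

# Predictor (a) of Figure 7.4: tp=70, fp=150, fn=30, tn=850.
print(recall(70, 30))                 # 0.7
print(false_positive_rate(150, 850))  # 0.15
print(round(precision(70, 150), 3))   # 0.318
```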
An agent should try to maximize precision and recall and to minimize the false-positive rate; however, these goals are incompatible. An agent can maximize precision and minimize the false-positive rate by only making positive predictions it is sure about. However, this choice worsens recall. To maximize recall, an agent can be risky in making predictions, which makes precision smaller and the false-positive rate larger.
To compare predictors for a given set of examples, an ROC space, or receiver operating characteristic space, plots the false-positive rate against the true-positive rate. Each predictor for these examples becomes a point in the space.
A precision-recall space plots the precision against the recall. Each of these approaches may be used to compare learning algorithms independently of the actual costs of the prediction errors.
Consider a case where there are 100 examples that are actually positive ($ap$) and 1000 examples that are actually negative ($an$). Figure 7.4 shows the performance of six possible predictors for these 1100 examples. Predictor (a) predicts 70 of the positive examples correctly and 850 of the negative examples correctly. Predictor (e) predicts every example as positive, and (f) predicts all examples as negative. The precision for (f) is undefined.
(d)    $ap$   $an$
$pp$     90    500
$pn$     10    500

(e)    $ap$   $an$
$pp$    100   1000
$pn$      0      0

(f)    $ap$   $an$
$pp$      0      0
$pn$    100   1000

(The confusion matrices for predictors (a), (b), and (c) appear in Figure 7.4.)
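The measures for predictors (d), (e), and (f) can be computed from their counts; a minimal sketch, with the same empty-denominator guard that makes (f)'s precision come out undefined:

```python
# Counts (tp, fp, fn, tn) for predictors (d), (e), (f) of Figure 7.4.
predictors = {
    "d": (90, 500, 10, 500),
    "e": (100, 1000, 0, 0),
    "f": (0, 0, 100, 1000),
}

measures = {}
for name, (tp, fp, fn, tn) in predictors.items():
    prec = tp / (tp + fp) if tp + fp else None  # undefined for (f)
    rec = tp / (tp + fn)
    fpr = fp / (fp + tn) if fp + tn else None
    measures[name] = (prec, rec, fpr)

print(measures["f"])  # (None, 0.0, 0.0)
```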
The recall (true-positive rate) of (a) is 0.7, the false-positive rate is 0.15, and the precision is $70/220 \approx 0.318$. Predictor (c) has a recall of 0.98, a false-positive rate of 0.2, and a precision of $98/298 \approx 0.329$. Thus (c) is better than (a) in terms of precision and recall, but is worse in terms of the false-positive rate. If false positives were much more important than false negatives, then (a) would be better than (c). This trade-off is reflected in the ROC space, but not in the precision-recall space.
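In ROC space, one predictor dominates another when it is at least as far up (true-positive rate) and at least as far left (false-positive rate). A minimal sketch confirming that, on the rates just given, neither (a) nor (c) dominates the other; the function name is illustrative:

```python
def dominates(p1, p2):
    """ROC points p = (fpr, tpr): p1 dominates p2 if it is at least
    as far left and at least as high, and the points differ."""
    return p1[0] <= p2[0] and p1[1] >= p2[1] and p1 != p2

a = (0.15, 0.7)   # predictor (a): (false-positive rate, recall)
c = (0.2, 0.98)   # predictor (c)
print(dominates(a, c), dominates(c, a))  # False False
```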
In the ROC space, any predictor lower and to the right of another predictor is worse than the other predictor. For example, (d) is worse than (c); there would be no reason to choose (d) if (c) were available as a predictor. Any predictor that is below the upper envelope of predictors (shown with line segments in Figure 7.4) is dominated by the other predictors. For example, although (a) is not dominated by (b) or by (c), it is dominated by the randomized predictor: with probability 0.5 use the prediction of (b), else use the prediction of (c). This randomized predictor would be expected to have 26 false negatives and 112.5 false positives.
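The expected error counts of such a mixture are just probability-weighted averages. A minimal sketch; note that the counts for (b) used below ($fn = 50$, $fp = 25$) are not stated in this excerpt and are inferred from Figure 7.4 so that the 0.5/0.5 mixture with (c) ($fn = 2$, $fp = 200$) matches the stated expectations, so treat them as an assumption:

```python
def mix(p1, p2, q):
    """Expected error counts of a randomized predictor that uses
    p1 with probability q, else p2. Counts are (fn, fp) pairs."""
    return tuple(q * x + (1 - q) * y for x, y in zip(p1, p2))

# (b): fn=50, fp=25 (inferred, see lead-in); (c): fn=2, fp=200.
print(mix((50, 25), (2, 200), 0.5))  # (26.0, 112.5)
```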