7.4 Overfitting


7.4.1 Pseudocounts

For many of the prediction measures, the optimal prediction on the training data is the mean (the average value). In the case of Boolean data (assuming true is represented as 1, and false as 0), the mean can be interpreted as a probability. However, the empirical mean, the mean of the training set, is typically not a good estimate of the probability of new cases. For example, just because an agent has not observed some value of a variable does not mean that the value should be assigned a probability of zero, which would mean it is impossible. Similarly, when predicting the future grades of a student, the average grade of the student may be appropriate if the student has taken many courses, but may not be appropriate for a student with just one recorded grade, and is not appropriate for a student with no recorded grades (where the average is undefined).

A simple way both to solve the zero-probability problem and to take prior knowledge into account is to use a real-valued pseudocount or prior count to which the training data is added.

Suppose the examples are values $v_1, \ldots, v_n$ and you want to make a prediction for the next value, which we will write as $\hat{v}$.

One prediction is the average. Suppose $a_n$ is the average of the first $n$ values; then

$$a_n = \frac{v_1 + \cdots + v_{n-1} + v_n}{n}
      = \frac{n-1}{n}\, a_{n-1} + \frac{v_n}{n}
      = a_{n-1} + \frac{v_n - a_{n-1}}{n}.$$

The running average keeps the current average of all of the data points seen. It can be implemented by storing the current average, a, and the number of values seen, n. When a new value v arrives, n is incremented and (v-a)/n is added to a.
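As a concrete illustration, here is a minimal running-average sketch in Python; the class and method names are illustrative, not taken from the book's code:

```python
class RunningAverage:
    """Maintain the average of the values seen so far without storing them."""

    def __init__(self):
        self.n = 0     # number of values seen
        self.a = 0.0   # current average (meaningless until the first value arrives)

    def add(self, v):
        """Incorporate a new value v and return the updated average."""
        self.n += 1
        self.a += (v - self.a) / self.n
        return self.a

avg = RunningAverage()
for v in [3, 5, 4]:
    avg.add(v)
print(avg.a)   # 4.0, the average of 3, 5, and 4
```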

When $n = 0$, assume you use the prediction $a_0$ (which you cannot get from data, as there are no data for this case). A prediction that takes regression to the mean into account is

$$\hat{v} = \frac{v_1 + \cdots + v_n + c \cdot a_0}{n + c}$$

where $c$ is a constant, the pseudocount, giving the number of assumed fictional data points. If $c = 0$, the prediction is the average value. The value of $c$ controls the amount of regression to the mean. This can be implemented using the running average by initializing $a$ with $a_0$ and $n$ with $c$.
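A sketch of that initialization, using the same running-average update as above; the function name and the example numbers are illustrative:

```python
def pseudocount_prediction(values, a0, c):
    """Return (v1 + ... + vn + c*a0) / (n + c), computed as a running
    average that starts from a = a0 and n = c."""
    n, a = c, a0
    for v in values:
        n += 1
        a += (v - a) / n
    return a

# With c = 0 this is the plain average; larger c pulls the prediction toward a0.
print(pseudocount_prediction([5], a0=3, c=1/3))   # 4.5
print(pseudocount_prediction([5], a0=3, c=3))     # 3.5
```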

Example 7.17.

Consider how to better estimate the ratings of restaurants in Example 7.14. The aim is to predict the ratings in the test data, not the average of the ratings already seen.

You can use the existing data about other restaurants to make estimates about the new cases, assuming that the new cases are like the old. Before seeing any ratings, it may be reasonable to use the average rating of all restaurants as the value for $a_0$. This would be like assuming that a new restaurant is like an average restaurant (which may or may not be a good assumption). Suppose you are most interested in being accurate for top-rated restaurants. To estimate $c$, consider a restaurant with a single 5-star rating. You could expect this restaurant to be like the other restaurants with a 5-star rating. Let $a$ be the average rating of the restaurants with a 5-star rating (where the average is weighted by the number of 5-star ratings each restaurant has). Then you would expect a restaurant with a single 5-star rating to be like the others and have this rating, and so $a = a_0 + (5 - a_0)/(c+1)$. You can then solve for $c$.

Suppose the average rating is 3, and the average rating for the restaurants with a 5-star rating is 4.5. Solving $4.5 = 3 + (5-3)/(c+1)$ gives $c = 1/3$. If the average for 5-star restaurants was instead 3.5, then $c$ would be 3. See Exercise 12.
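Solving the equation for $c$ in general makes these numbers easy to check:

$$a - a_0 = \frac{5 - a_0}{c + 1}
\quad\Longrightarrow\quad
c + 1 = \frac{5 - a_0}{a - a_0}
\quad\Longrightarrow\quad
c = \frac{5 - a}{a - a_0}.$$

With $a_0 = 3$: $a = 4.5$ gives $c = 0.5/1.5 = 1/3$, and $a = 3.5$ gives $c = 1.5/0.5 = 3$.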

Example 7.18.

Consider the following thought experiment (or, better yet, implement it). First select a number $p$ uniformly at random from the range $[0,1]$. Suppose this is the ground truth for the probability that $Y=1$ for a variable $Y$ with domain $\{0,1\}$. Then generate $n$ training examples with $P(Y=1)=p$, for a number of values of $n$, such as 1, 2, 3, 4, 5, 10, 20, 100, 1000. Let $n_1$ be the number of samples with $Y=1$, so there are $n_0 = n - n_1$ samples with $Y=0$. The learning problem for this scenario is: from $n_0$ and $n_1$, create an estimator $\hat{p}$ that can be used to predict new cases. Then generate some (e.g., 100) test cases from the same $p$. The aim is to produce the estimator $\hat{p}$ with the smallest error on the test cases. If you repeat this 1000 times, you will get a good idea of what is going on.
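A minimal sketch of this simulation in Python, comparing the two estimators discussed below; the function name and bookkeeping are illustrative, while the 100 test cases and 1000 repetitions follow the text:

```python
import math
import random

def run_experiment(n, trials=1000, n_test=100):
    """Average test log-likelihood and sum-of-squares error for two estimators:
    the empirical mean n1/(n0+n1) and Laplace smoothing (n1+1)/(n0+n1+2)."""
    totals = {"empirical": [0.0, 0.0], "laplace": [0.0, 0.0]}  # [log-likelihood, squared error]
    for _ in range(trials):
        p = random.random()                                   # ground truth P(Y=1)
        n1 = sum(random.random() < p for _ in range(n))       # training examples with Y=1
        n0 = n - n1
        estimators = {"empirical": n1 / (n0 + n1),
                      "laplace": (n1 + 1) / (n0 + n1 + 2)}
        test = [1 if random.random() < p else 0 for _ in range(n_test)]
        for name, q in estimators.items():
            for y in test:
                prob = q if y == 1 else 1 - q
                # log(0) is -infinity: one impossible test case ruins the whole likelihood
                totals[name][0] += math.log(prob) if prob > 0 else float("-inf")
                totals[name][1] += (y - q) ** 2
    return {name: (ll / trials, sq / trials) for name, (ll, sq) in totals.items()}

for n in [1, 2, 3, 4, 5, 10, 20, 100, 1000]:
    print(n, run_experiment(n))
```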

If you try this, you will find that under log-likelihood the estimator $\hat{p} = n_1/(n_0+n_1)$ works very poorly; one reason is that if either $n_0$ or $n_1$ is 0, and that value appears in the test set, the likelihood of the test set will be 0, which is the worst it could possibly be! It turns out that Laplace smoothing, defined by $\hat{p} = (n_1+1)/(n_0+n_1+2)$, has the maximum likelihood of all estimators on the test set. It also works better than $\hat{p} = n_1/(n_0+n_1)$ for sum-of-squares error.

If you were to select $p$ from some distribution other than the uniform distribution, adding 1 to the numerator and 2 to the denominator may not result in the best predictor.