
7.2.3 Learning Probabilities

For many of the prediction measures, the optimal prediction on the training data is the empirical frequency. Thus, making a point estimate can be interpreted as learning a probability. However, the empirical frequency is typically not a good estimate of the probability of new cases; just because an agent has not observed some value of a variable does not mean that the value should be assigned a probability of zero. A probability of zero means that the value is impossible.

Typically, we do not have data without any prior knowledge. There is usually a great deal of knowledge about a domain, either in the meaning of the symbols or in experience with similar examples, that can be used to improve predictions.

A standard way both to solve the zero-probability problem and to take prior knowledge into account is to use a pseudocount or prior count for each value, to which the observed counts from the training data are added.

Suppose there is a binary feature Y, and an agent has observed n0 cases where Y=0 and n1 cases where Y=1. The agent can use a pseudocount c0 ≥ 0 for Y=0 and a pseudocount c1 ≥ 0 for Y=1 and estimate the probability as

P(Y=1)=(n1+c1)/(n0+c0+n1+c1).

This takes into account both the data and the prior knowledge. This formula can be justified in terms of a prior on the parameters (see Section 7.8). Choosing pseudocounts is part of designing the learner.
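As a concrete illustration, the following is a minimal Python sketch of this estimate; the function name and argument names are chosen for this example only:

    def estimate_prob_y1(n0, n1, c0=1.0, c1=1.0):
        """Estimate P(Y=1) from observed counts n0, n1 and pseudocounts c0, c1."""
        return (n1 + c1) / (n0 + c0 + n1 + c1)

    # With 7 observations of Y=0, 3 of Y=1, and pseudocounts of 1:
    # (3+1)/(7+1+3+1) = 4/12 ≈ 0.333
    print(estimate_prob_y1(n0=7, n1=3))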

More generally, suppose Y has domain {y1,...,yk}. The agent starts with a pseudocount ci for each yi. These counts are chosen before the agent has seen any of the data. Suppose the agent observes some training examples, where ni is the number of data points with Y=yi. It can then use

P(Y=yi)=(ci+ni)/∑i' (ci'+ni').

To determine the pseudocounts, consider the question, "How much more should an agent believe yi if it had seen one example with yi true than if it had seen no examples with yi true?" If, with no examples of yi true, the agent believes that yi is impossible, ci should be zero. If not, the ratio chosen in answer to that question should be equal to the ratio (1+ci):ci. If the pseudocount is 1, a value that has been seen once would be twice as likely as one that has been seen no times. If the pseudocount is 10, a value observed once would be 10% more likely than a value observed no times. If the pseudocount is 0.1, a value observed once would be 11 times more likely than a value observed no times. If there is no reason to choose one value in the domain of Y over another, all of the values of ci should be equal.
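As an illustration of these ratios, here is a small Python sketch of the general formula; the dictionary representation of the counts and the function name are choices made for this example:

    def estimate_dist(counts, pseudocounts):
        """Estimate P(Y=yi) for each value yi from observed counts ni and pseudocounts ci."""
        total = sum(counts[y] + pseudocounts[y] for y in pseudocounts)
        return {y: (counts[y] + pseudocounts[y]) / total for y in pseudocounts}

    # Domain {a, b, c}: b has been observed once, a and c not at all.
    counts = {"a": 0, "b": 1, "c": 0}
    print(estimate_dist(counts, {y: 1 for y in counts}))    # b is twice as likely as a or c
    print(estimate_dist(counts, {y: 0.1 for y in counts}))  # b is 11 times as likely as a or c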

If there is no prior knowledge, Laplace (1812) suggested that it is reasonable to set ci=1. See Section 7.8 for a justification of why this may be reasonable. We will see examples where it is not appropriate, however.

The same idea can be used to learn a conditional probability distribution. To estimate a conditional distribution, P(Y|X), of variable Y conditioned on variable(s) X, the agent can maintain a count for each pair of a value for Y and a value for X. Suppose cij is a non-negative number that will be used as a pseudocount for Y=yi ∧X=xj. Suppose nij is the number of observed cases of Y=yi ∧X=xj. The agent can use

P(Y=yi|X=xj)=(cij+nij)/∑i' (ci'j+ni'j),

but this does not work well when the denominator is small, which occurs when some values of X are rare. When X has structure (for example, when it is composed of other variables), it is often the case that some assignments to X are very rare or do not appear in the training data at all. In these cases, the learner must use other methods that are discussed in this chapter.
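The same pseudocount idea for a conditional distribution can be sketched in Python as follows; the nested-dictionary layout, the function name, and the use of a single shared pseudocount are choices made for this illustration. Note how the estimate for the rare value X=b is dominated by the pseudocounts.

    def estimate_cond_dist(counts, pseudocount=1.0):
        """Estimate P(Y | X=xj) for each observed xj.

        counts maps each value xj of X to a dictionary of counts nij over
        the domain of Y; the same pseudocount is used for every (yi, xj) pair.
        """
        dist = {}
        for xj, y_counts in counts.items():
            total = sum(y_counts.values()) + pseudocount * len(y_counts)
            dist[xj] = {yi: (nij + pseudocount) / total
                        for yi, nij in y_counts.items()}
        return dist

    # Y has domain {0, 1}; X=b has only one observation.
    counts = {"a": {0: 40, 1: 10}, "b": {0: 1, 1: 0}}
    print(estimate_cond_dist(counts))
    # {'a': {0: 0.788..., 1: 0.211...}, 'b': {0: 0.666..., 1: 0.333...}}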