### 7.2.3 Learning Probabilities

For many of the prediction measures, the optimal prediction on the training data is the empirical frequency. Thus, making a point estimate can be interpreted as learning a probability. However, the empirical frequency is typically not a good estimate of the probability of new cases; just because an agent has not observed some value of a variable does not mean that the value should be assigned a probability of zero. A probability of zero means that the value is impossible.

Typically, we do not have data without any prior knowledge. There is typically a great deal of knowledge about a domain, either in the meaning of the symbols or in experience with similar examples that can be used to improve predictions.

A standard way both to solve the zero-probability problem and to take
prior knowledge into account is to use a
**pseudocount** or **prior count**
for each value
to which the
training data is added.

Suppose there is a binary
feature *Y*, and an agent has observed *n _{0}*
cases where

*Y=0*and

*n*cases where

_{1}*Y=1*. The agent can use a pseudocount

*c*for

_{0}≥ 0*Y=0*and a pseudocount

*c*for

_{1}≥ 0*Y=1*and estimate the probability as

P(Y=1)=(n_{1}+c_{1})/(n_{0}+c_{0}+n_{1}+c_{1}).

This takes into account both the data and the prior knowledge. This formula can be justified in terms of a prior on the parameters (see Section 7.8). Choosing pseudocounts is part of designing the learner.

More generally, suppose
*Y* has domain *{y _{1},...,y_{k}}*. The agent starts with a pseudocount

*c*for each

_{i}*y*. These counts are chosen before the agent has seen any of the data. Suppose the agent observes some training examples, where

_{i}*n*is the number of data points with

_{i}*Y=y*. It can then use

_{i}P(Y=y_{i})=(c_{i}+n_{i})/(∑_{i'}c_{i'}+n_{i'}).

To determine the pseudocounts, consider the question, "How much more
should an agent believe *y _{i}* if it had seen one example with

*y*true than if it had seen no examples with

_{i}*y*true?" If, with no examples of

_{i}*y*true, the agent believes that

_{i}*y*is impossible,

_{i}*c*should be zero. If not, the ratio chosen in answer to that question should be equal to the ratio

_{i}*(1+c*. If the pseudocount is 1, a value that has been seen once would be twice as likely as one that has been seen no times. If the pseudocount is 10, a value observed once would be 10% more likely than a value observed no times. If the pseudocount is 0.1, a value observed once would be 11 times more likely than a value observed no times. If there is no reason to choose one value in the domain of

_{i}):c_{i}*Y*over another, all of the values of

*c*should be equal.

_{i}If there is no prior knowledge, Laplace(1812) suggested
that it is
reasonable to set
*c _{i}=1*. See
Section 7.8 for a justification of why this may be
reasonable. We will see examples where it is not appropriate, however.

The same idea can be used to learn a **conditional probability distribution**.
To estimate a conditional distribution, *P(Y|X)*, of variable *Y*
conditioned on variable(s) *X*, the agent can maintain a count for
each pair of a value for *Y* and a value for *X*. Suppose *c _{ij}* is a
non-negative number that will be used as a pseudocount for

*Y=y*. Suppose

_{i}∧X=x_{j}*n*is the number of observed cases of

_{ij}*Y=y*. The agent can use

_{i}∧X=x_{j}P(Y=y_{i}|X=x_{j})=(c_{ij}+n_{ij})/(∑_{i'}c_{i'j}+n_{i'j}),

but this does not work well when the denominator is
small, which occurs when some values of *X* are rare. When *X* has
structure - for example, when it is composed of other variables - it is often the
case that some assignments to *X* are very rare or even do not appear
in the training data. In these cases, the learner must use other methods
that are discussed in this chapter.