# 10.4 Bayesian Learning

Rather than choosing the most likely model or delineating the set of all models that are consistent with the training data, another approach is to compute the posterior probability of each model given the training examples.

The idea of Bayesian learning is to compute the posterior probability distribution of the target features of a new example conditioned on its input features and all the training examples.

Suppose a new case has inputs $X{=}x$ (which we write simply as $x$) and target features $Y$. The aim is to compute $P(Y\mid x\wedge E)$, where $E$ is the set of training examples. This is the probability distribution of the target variables given the particular inputs and the examples. The role of a model is to be the assumed generator of the examples. If we let $M$ be a set of disjoint and covering models, then reasoning by cases and the chain rule give

 $\begin{aligned}P(Y\mid x\wedge E)&=\sum_{m\in M}P(Y\wedge m\mid x\wedge E)\\&=\sum_{m\in M}P(Y\mid m\wedge x\wedge E)*P(m\mid x\wedge E)\\&=\sum_{m\in M}P(Y\mid m\wedge x)*P(m\mid E)~{}.\end{aligned}$

The first equality follows from reasoning by cases over the disjoint and covering models, and the second from the chain rule (the definition of conditional probability). The last equality relies on two assumptions: the model includes all the information about the examples that is necessary for a particular prediction, $P(Y\mid m\wedge x\wedge E)=P(Y\mid m\wedge x)$, and the model does not change depending on the inputs of the new example, $P(m\mid x\wedge E)=P(m\mid E)$. Instead of choosing the best model, Bayesian learning relies on model averaging: it averages over the predictions of all the models, where each model is weighted by its posterior probability given the training examples.

$P(m\mid E)$ can be computed using Bayes’ rule:

 $P(m\mid E)=\frac{P(E\mid m)*P(m)}{P(E)}~{}.$

Thus, the weight of each model depends on how well it predicts the data (the likelihood) and its prior probability. The denominator, $P(E)$, is a normalizing constant to make sure the posterior probabilities of the models sum to 1. $P(E)$ is called the partition function. Computing $P(E)$ may be very difficult when there are many models.

A set $\{e_{1},\dots,e_{k}\}$ of examples is independent and identically distributed (i.i.d.) given model $m$ if every example is drawn from the same distribution determined by $m$, and examples $e_{i}$ and $e_{j}$, for $i\neq j$, are independent given $m$. If the set of training examples $E$ is $\{e_{1},\dots,e_{k}\}$, the assumption that the examples are i.i.d. implies

 $P(E\mid m)=\prod_{i=1}^{k}P(e_{i}\mid m)~{}.$
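As a minimal sketch of how Bayes' rule, the i.i.d. product, and model averaging fit together, the following assumes a small finite set of models for a Boolean target with no input features. Each hypothetical model is just a candidate value for $P(Y{=}true)$; the function names, priors, and data are invented for illustration and are not part of the text.

```python
from math import prod

def posterior_over_models(prior, likelihood, examples):
    """Return P(m | E) for each model m, assuming the examples are i.i.d. given m.

    prior: dict mapping model -> P(m)
    likelihood: function (example, model) -> P(example | m)
    """
    # Unnormalized weight P(E | m) * P(m), where P(E | m) is a product over examples.
    weights = {m: p * prod(likelihood(e, m) for e in examples)
               for m, p in prior.items()}
    p_e = sum(weights.values())  # P(E), the normalizing constant
    return {m: w / p_e for m, w in weights.items()}

def model_average(posterior, predict):
    """Return sum over m of P(y | m) * P(m | E), where predict(m) gives P(y | m)."""
    return sum(predict(m) * p for m, p in posterior.items())

# Hypothetical models: each key is a candidate value for P(Y=true).
prior = {0.2: 1 / 3, 0.5: 1 / 3, 0.8: 1 / 3}
bernoulli = lambda e, m: m if e else 1 - m
examples = [True, True, False]

post = posterior_over_models(prior, bernoulli, examples)
p_true = model_average(post, lambda m: m)  # model-averaged P(Y=true | E)
```

With these illustrative numbers, the model with $P(Y{=}true)=0.8$ receives the largest posterior weight, and the averaged prediction for $Y{=}true$ is about $0.60$ rather than the prediction of any single model.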

The i.i.d. assumption can be represented as a belief network, shown in Figure 10.10, where each $e_{i}$ is independent of the others given model $m$.

If $m$ is made into a discrete variable, any of the inference methods of the previous chapter could be used for inference in this network. A standard reasoning technique in such a network is to condition on every observed $e_{i}$ and to query the model variable or an unobserved $e_{i}$ variable.

The set of models may include structurally different models in addition to models that differ in the values of the parameters. One of the techniques of Bayesian learning is to make the parameters of the model explicit and to determine the distribution over the parameters.

###### Example 10.14.

Consider the simplest learning task of learning a single Boolean random variable, $Y$, with no input features. (This is the case covered in Section 7.2.3.) Each example specifies $Y=true$ or $Y=false$. The aim is to learn the probability distribution of $Y$ given the set of training examples.

There is a single parameter, $\phi$, that determines the set of all models. Suppose that $\phi$ represents the probability of $Y{=}true$. We treat $\phi$ as a real-valued random variable on the interval $[0,1]$. Thus, by definition of $\phi$, $P(Y=true\mid\phi)=\phi$ and $P(Y=false\mid\phi)=1-\phi$.

Suppose, first, an agent has no prior information about the probability of Boolean variable $Y$ and no knowledge beyond the training examples. This ignorance can be modeled by having the prior probability distribution of the variable $\phi$ as a uniform distribution over the interval $[0,1]$. This is the probability density function labeled $n_{0}{=}0,n_{1}{=}0$ in Figure 10.11.

We can update the probability distribution of $\phi$ given some examples. Assume that the examples, obtained by running a number of independent experiments, are a particular sequence of outcomes that consists of $n_{0}$ cases where $Y$ is false and $n_{1}$ cases where $Y$ is true.

The posterior distribution for $\phi$ given the training examples can be derived by Bayes’ rule. Let the examples $E$ be the particular sequence of observations that resulted in $n_{1}$ occurrences of $Y{=}true$ and $n_{0}$ occurrences of $Y{=}false$. Bayes’ rule gives us

 $P(\phi\mid E)=\frac{P(E\mid\phi)*P(\phi)}{P(E)}~{}.$

The denominator is a normalizing constant to make sure the area under the curve is 1.

Given that the examples are i.i.d.,

 $P(E\mid\phi)=\phi^{n_{1}}*(1-\phi)^{n_{0}}$

because there are $n_{0}$ cases where $Y{=}false$, each with a probability of $1-\phi$, and $n_{1}$ cases where $Y{=}true$, each with a probability of $\phi$.

Note that $E$ is the particular sequence of observations made. If the observation was just that there were a total of $n_{0}$ occurrences of $Y=false$ and $n_{1}$ occurrences of $Y=true$, we would get a different answer, because we would have to take into account all the possible sequences that could have given this count. The latter is known as the binomial distribution.
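The difference between the two likelihoods can be sketched as follows. The function names are hypothetical; the only ingredient beyond the text is the standard binomial coefficient, which counts the sequences that share the same totals.

```python
from math import comb

def sequence_likelihood(phi, n0, n1):
    """P(E | phi) for one particular sequence with n1 trues and n0 falses."""
    return phi**n1 * (1 - phi)**n0

def count_likelihood(phi, n0, n1):
    """P(counts | phi): sums over every sequence with the same counts (binomial)."""
    return comb(n0 + n1, n1) * sequence_likelihood(phi, n0, n1)
```

For instance, with $\phi=0.5$, $n_{0}=1$, and $n_{1}=2$, a particular sequence has probability $0.125$, whereas the count has probability $3*0.125=0.375$. Because the extra factor does not depend on $\phi$, it cancels when the posterior over $\phi$ is normalized.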

One possible prior probability, $P(\phi)$, is a uniform distribution on the interval $[0,1]$. This would be reasonable when the agent has no prior information about the probability.

Figure 10.11 gives some posterior distributions of the variable $\phi$ based on different sample sizes, and given a uniform prior. The cases are $(n_{0}{\,{=}\,}1,n_{1}{\,{=}\,}2)$, $(n_{0}{\,{=}\,}2,n_{1}{\,{=}\,}4)$, and $(n_{0}{\,{=}\,}4,n_{1}{\,{=}\,}8)$. Each of these peaks at the same place, namely at $\frac{2}{3}$. More training examples make the curve sharper.
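One way to check these claims numerically is a grid approximation of the posterior. This is only a sketch under a uniform prior: the grid size, the function names, and the use of "posterior mass within $0.1$ of $\frac{2}{3}$" as a sharpness measure are all choices made here for illustration.

```python
def grid_posterior(n0, n1, steps=3000):
    """Approximate the posterior density of phi under a uniform prior on a grid."""
    width = 1 / steps
    grid = [(i + 0.5) * width for i in range(steps)]  # midpoints of equal cells
    unnorm = [p**n1 * (1 - p)**n0 for p in grid]       # likelihood * uniform prior
    z = sum(unnorm) * width                            # approximates P(E)
    return grid, [u / z for u in unnorm]

def summarize(n0, n1):
    """Return (approximate mode, posterior mass within 0.1 of 2/3)."""
    grid, density = grid_posterior(n0, n1)
    width = grid[1] - grid[0]
    mode = grid[density.index(max(density))]
    mass = sum(d * width for p, d in zip(grid, density) if abs(p - 2 / 3) <= 0.1)
    return mode, mass

for n0, n1 in [(1, 2), (2, 4), (4, 8)]:
    print(n0, n1, summarize(n0, n1))
```

In each case the approximate mode lies next to $\frac{2}{3}$, and the mass concentrated near $\frac{2}{3}$ grows with the number of examples.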

The distribution of this example is known as the beta distribution; it is parameterized by two counts, $\alpha_{0}$ and $\alpha_{1}$, and a probability $p$. Traditionally, the $\alpha_{i}$ parameters for the beta distribution are one more than the counts; thus, $\alpha_{i}=n_{i}+1$. The beta distribution is

 $Beta^{\alpha_{0},\alpha_{1}}(p)=\frac{1}{Z}p^{\alpha_{1}-1}*(1-p)^{\alpha_{0}-1}$

where $Z$ is a normalizing constant that ensures the integral over all values is 1. Thus, the uniform distribution on $[0,1]$ is the beta distribution $Beta^{1,1}$.
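The text leaves $Z$ unspecified. A minimal sketch, assuming the standard closed form $Z=\frac{\Gamma(\alpha_{0})*\Gamma(\alpha_{1})}{\Gamma(\alpha_{0}+\alpha_{1})}$ (a well-known result, not derived in this section), is given below; `beta_pdf` is a hypothetical name that keeps the section's parameter order.

```python
from math import gamma

def beta_pdf(p, alpha0, alpha1):
    """Density Beta^{alpha0,alpha1}(p), with alpha1 tied to Y=true as in the text."""
    # Normalizing constant Z, assuming the standard gamma-function closed form.
    z = gamma(alpha0) * gamma(alpha1) / gamma(alpha0 + alpha1)
    return p**(alpha1 - 1) * (1 - p)**(alpha0 - 1) / z
```

With $\alpha_{0}=\alpha_{1}=1$ the density is $1$ everywhere on $[0,1]$, which is the uniform distribution. With $\alpha_{0}=2,\alpha_{1}=3$ (that is, $n_{0}=1,n_{1}=2$) the density at $p=\frac{2}{3}$ is $\frac{16}{9}$.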

Suppose instead that $Y$ is a discrete variable with $k$ different values. The generalization of the beta distribution to cover this case is known as the Dirichlet distribution. The Dirichlet distribution with two sorts of parameters, the “counts” $\alpha_{1},\dots,\alpha_{k}$, and the probability parameters $p_{1},\dots,p_{k}$, is

 $Dirichlet^{\alpha_{1},\dots,\alpha_{k}}(p_{1},\dots,p_{k})=\frac{1}{Z}\prod_{j=1}^{k}p_{j}^{\alpha_{j}-1}$

where $p_{i}$ is the probability of the $i$th outcome (and so $0\leq p_{i}\leq 1$), $\alpha_{i}$ is a positive real, and $Z$ is a normalizing constant that ensures the integral over all the probability values is 1. We can think of $\alpha_{i}$ as one more than the count of the $i$th outcome, $\alpha_{i}=n_{i}+1$. The Dirichlet distribution looks like Figure 10.11 along each dimension (i.e., as each $p_{j}$ varies between 0 and 1).

For many cases, averaging over all models weighted by their posterior distribution is difficult, because the models may be complicated (e.g., if they are decision trees or even belief networks). For the Dirichlet distribution, the expected value for outcome $i$ (averaging over all $p_{j}$) is

 $\frac{\alpha_{i}}{\sum_{j}\alpha_{j}}~{}.$

The reason that the $\alpha_{i}$ parameters are one more than the counts in the definitions of the beta and Dirichlet distributions is to make this formula simple. This fraction is well defined only when the $\alpha_{j}$ are all non-negative and not all are zero.

###### Example 10.15.

Consider Example 10.14, which determines the value of $\phi$ based on a sequence of observations made up of $n_{0}$ cases where $Y$ is false and $n_{1}$ cases where $Y$ is true. Consider the posterior distribution as shown in Figure 10.11. What is interesting about this is that, whereas the most likely posterior value of $\phi$ is $\frac{n_{1}}{n_{0}+n_{1}}$, the expected value of this distribution is $\frac{n_{1}+1}{n_{0}+n_{1}+2}$.

Thus, the expected value of the $n_{0}{=}1,n_{1}{=}2$ curve is $\frac{3}{5}$, for the $n_{0}{=}2,n_{1}{=}4$ case the expected value is $\frac{5}{8}$, and for the $n_{0}{=}4,n_{1}{=}8$ case it is $\frac{9}{14}$. As the learner gets more training examples, this value approaches $\frac{n}{m}$, where $n=n_{1}$ is the number of cases where $Y$ is true and $m=n_{0}+n_{1}$ is the total number of cases.

This estimate is better than $\frac{n}{m}$ for a number of reasons. First, it tells us what to do if the learning agent has no examples: use the expected value of the uniform prior, $\frac{1}{2}$. This is the expected value of the $n{=}0,m{=}0$ case. Second, consider the case where $n{=}0$ and $m{=}3$. The agent should not use $P(y){=}0$, because this says that $Y{=}true$ is impossible, and it certainly does not have evidence for this! The expected value of this curve with a uniform prior is $\frac{1}{5}$.
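The arithmetic in this example can be reproduced with exact fractions. The sketch below assumes a uniform prior and uses the $n_{0}$, $n_{1}$ counts of Example 10.14; the function names are hypothetical.

```python
from fractions import Fraction

def most_likely(n0, n1):
    """Most likely posterior value of phi: n1 / (n0 + n1).

    Undefined (raises ZeroDivisionError) when there are no examples.
    """
    return Fraction(n1, n0 + n1)

def expected(n0, n1):
    """Expected value of the posterior: (n1 + 1) / (n0 + n1 + 2)."""
    return Fraction(n1 + 1, n0 + n1 + 2)

for n0, n1 in [(1, 2), (2, 4), (4, 8), (3, 0)]:
    print(n0, n1, most_likely(n0, n1), expected(n0, n1))
```

The last case has three false observations and no true ones: the most likely value is $0$, whereas the expected value is $\frac{1}{5}$. With no examples at all, `expected(0, 0)` gives $\frac{1}{2}$.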

An agent does not have to start with a uniform prior; it could start with any prior distribution. If the agent starts with a prior that is a Dirichlet distribution, its posterior will be a Dirichlet distribution. The posterior distribution can be obtained by adding the observed counts to the $\alpha_{i}$ parameters of the prior distribution.

Thus, the beta and Dirichlet distributions provide a justification for using pseudocounts for estimating probabilities. The pseudocount represents the prior knowledge. A flat prior gives a pseudocount of 1. Thus, Laplace smoothing can be justified in terms of making predictions from initial ignorance.
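The update rule and the resulting estimate can be sketched in a few lines. The function names and the three-outcome counts below are hypothetical, chosen only to illustrate a flat prior with a pseudocount of 1 per outcome.

```python
def dirichlet_update(alphas, counts):
    """Posterior Dirichlet parameters: prior alphas plus observed counts."""
    if len(alphas) != len(counts):
        raise ValueError("alphas and counts must have the same length")
    return [a + n for a, n in zip(alphas, counts)]

def dirichlet_mean(alphas):
    """Expected probability of each outcome: alpha_i / sum_j alpha_j."""
    total = sum(alphas)
    return [a / total for a in alphas]

prior = [1, 1, 1]    # flat prior over three outcomes (pseudocount of 1 each)
counts = [5, 0, 2]   # hypothetical observed counts
posterior = dirichlet_update(prior, counts)  # [6, 1, 3]
estimate = dirichlet_mean(posterior)         # [0.6, 0.1, 0.3]
```

The outcome that was never observed still receives probability $0.1$, which is the effect of Laplace smoothing. A more informative prior would simply start from larger $\alpha_{i}$ values.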

In addition to using the posterior distribution of $\phi$ to derive the expected value, we can use it to answer other questions such as: What is the probability that the posterior probability, $\phi$, is in the range $[a,b]$? In other words, derive $P((\phi\geq a\wedge\phi\leq b)\mid e)$. This is the problem that the Reverend Thomas Bayes solved more than 250 years ago [Bayes, 1763]. The solution he gave – although in much more cumbersome notation – was

 $\frac{\int_{a}^{b}p^{n}*(1-p)^{m-n}\,dp}{\int_{0}^{1}p^{n}*(1-p)^{m-n}\,dp}~{}.$
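This ratio of integrals can be approximated numerically. The sketch below uses a simple midpoint rule, which is a choice made here rather than anything prescribed by the text; $n$ is taken to be the number of true cases and $m$ the total number of cases, matching the exponents in the formula.

```python
def interval_probability(a, b, n, m, steps=10_000):
    """Approximate P(a <= phi <= b | e) under a uniform prior.

    n: number of cases where Y is true; m: total number of cases.
    """
    def f(p):
        return p**n * (1 - p)**(m - n)

    def integral(lo, hi):
        # Midpoint rule over `steps` equal-width cells of [lo, hi].
        width = (hi - lo) / steps
        return width * sum(f(lo + (i + 0.5) * width) for i in range(steps))

    return integral(a, b) / integral(0.0, 1.0)
```

For example, with $n=2$ true cases out of $m=3$, `interval_probability(0.5, 1.0, 2, 3)` is approximately $\frac{11}{16}=0.6875$.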

This kind of knowledge is used in surveys when it may be reported that a survey is correct with an error of at most $5\%$, $19$ times out of $20$. It is also the same type of information that is used by probably approximately correct (PAC) learning, which guarantees an error of at most $\epsilon$ at least $1-\delta$ of the time. If an agent chooses the midpoint of the range $[a,b]$, namely $\frac{a+b}{2}$, as its hypothesis, it will have error less than or equal to $\frac{b-a}{2}$ exactly when the true value of $\phi$ is in $[a,b]$. The value $1-\delta$ corresponds to $P(\phi\geq a\wedge\phi\leq b\mid e)$. If $\epsilon=\frac{b-a}{2}$ and $\delta=1-P(\phi\geq a\wedge\phi\leq b\mid e)$, choosing the midpoint will result in an error of at most $\epsilon$ a fraction $1-\delta$ of the time.

PAC learning gives worst-case results, whereas Bayesian learning gives expected-case results. Typically, the Bayesian estimate is more accurate, but the PAC results give a guarantee of a bound on the error. The sample complexity, the number of samples required to obtain some given accuracy, is typically much lower for Bayesian learning than for PAC learning: many fewer examples are required to expect to achieve the desired accuracy than are needed to guarantee it.
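As a worked instance of the correspondence between the interval $[a,b]$ and the pair $\epsilon,\delta$, take the hypothetical case of $2$ true cases out of $3$ examples, a uniform prior, and the range $[a,b]=[\frac{1}{2},1]$ (values chosen purely for illustration):

 $\int_{1/2}^{1}p^{2}*(1-p)\,dp=\frac{11}{192},\qquad\int_{0}^{1}p^{2}*(1-p)\,dp=\frac{1}{12}=\frac{16}{192},\qquad P(\phi\geq\tfrac{1}{2}\wedge\phi\leq 1\mid e)=\frac{11}{16}~{}.$

The midpoint hypothesis is $\frac{3}{4}$, with $\epsilon=\frac{1}{4}$ and $\delta=1-\frac{11}{16}=\frac{5}{16}$.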