Full text of the second edition of Artificial Intelligence: foundations of computational agents, Cambridge University Press, 2017 is now available.

## 7.8 Bayesian Learning

Rather than choosing the most likely model or delineating the set of all models that are consistent with the training data, another approach is to compute the posterior probability of each model given the training examples.

The idea of **Bayesian learning**
is to compute the posterior probability distribution of the target
features of a new example
conditioned on its input features and all of the
training examples.

Suppose a new case has inputs *X=x* and has target
features, *Y*; the aim is to compute *P(Y|X=x∧ e)*, where

*is the set of training examples. This is the probability distribution of the target variables given the particular inputs and the examples. The role of a model is to be the assumed generator of the examples. If we let*

**e***M*be a set of disjoint and covering models, then reasoning by cases and the chain rule give

P(Y|x∧ e)= ∑ _{m∈M}P(Y ∧m |x∧e)= ∑ _{m∈M}P(Y | m ∧x∧e) ×P(m|x∧e)= ∑ _{m∈M}P(Y | m ∧x) ×P(m|e) .

The first two equalities are theorems from the definition of probability. The last
equality makes two assumptions: the model includes all of the
information about the examples that is necessary for a particular prediction
[i.e., *P(Y | m ∧x∧ e)= P(Y | m ∧x) *], and
the model does not change depending on the
inputs of the new example [i.e.,

*P(m|x∧*]. This formula says that we average over the prediction of all of the models, where each model is weighted by its posterior probability given the examples.

**e**)= P(m|**e**)*P(m| e)* can be computed using Bayes' rule:

P(m|e) = (P(e|m)×P(m))/(P(e)) .

Thus, the weight of each model depends on how well it predicts the data
(the likelihood) and its prior probability. The denominator, *P( e)*,
is a normalizing constant to make sure the posterior probabilities of
the models sum to 1. Computing

*P(*can be very difficult when there are many models.

**e**)A set *{e _{1},...,e_{k}}* of examples are

**i.i.d. (independent and identically distributed)**, where the distribution is given by model

*m*if, for all

*i*and

*j*, examples

*e*and

_{i}*e*are independent given

_{j}*m*, which means

*P(e*. We usually assume that the examples are i.i.d.

_{i}∧e_{j}|m)=P(e_{i}|m)×P(e_{j}|m)Suppose the set of training examples * e* is

*{e*. That is,

_{1},...,e_{k}}*is the conjunction of the*

**e***e*, because all of the examples have been observed to be true. The assumption that the examples are i.i.d. implies

_{i}P(e|m) = ∏_{i=1}^{k}P(e_{i}|m) .

The set of models may include structurally different models in addition to models that differ in the values of the parameters. One of the techniques of Bayesian learning is to make the parameters of the model explicit and to determine the distribution over the parameters.

**Example 7.30:**Consider the simplest learning task under uncertainty. Suppose there is a single Boolean random variable,

*Y*. One of two outcomes,

*a*and

*¬a*, occurs for each example. We want to learn the probability distribution of

*Y*given some examples.

There is a single parameter, *φ*, that determines the set of all
models. Suppose that *φ* represents the
probability of *Y=true*. We treat this parameter as a real-valued random
variable on the interval *[0,1]*. Thus, by definition of *φ*, *P(a|φ)=φ* and *P(¬a|φ)=1-φ*.

Suppose an agent has no prior information about the probability of Boolean
variable *Y* and no knowledge beyond the training examples. This ignorance can be
modeled by having the prior probability distribution of the variable
*φ* as a uniform
distribution over the interval *[0,1]*. This is the
the probability density function labeled *n _{0}=0, n_{1}=0* in
Figure 7.15.

We can update the probability distribution of *φ* given
some examples. Assume that the examples, obtained by running a number of
independent experiments, are a particular sequence of outcomes that
consists of *n _{0}* cases where

*Y*is false and

*n*cases where

_{1}*Y*is true.

The posterior distribution for *φ* given the training examples can
be derived by Bayes' rule. Let the examples * e* be the
particular sequence of observation that resulted in

*n*occurrences of

_{1}*Y=true*and

*n*occurrences of

_{0}*Y=false*. Bayes' rule gives us

P(φ|e)=(P(e|φ)×P(φ))/(P(e)) .

The denominator is a normalizing constant to make sure the area under the curve is 1.

Given that the examples are i.i.d.,

P(e|φ)=φ^{n1}×(1-φ)^{n0}

because there are
*n _{0}* cases where

*Y=false*, each with a probability of

*1-φ*, and

*n*cases where

_{1}*Y=true*, each with a probability of

*φ*.

One possible prior probability,
*P(φ)*, is a uniform distribution on the
interval *[0,1]*. This would be reasonable when the agent has no prior information about the probability.

Figure 7.15 gives some posterior distributions of the
variable *φ* based on different sample sizes, and given a uniform prior. The cases are
*(n _{0}=1, n_{1}=2)*,

*(n*, and

_{0}=2, n_{1}=4)*(n*. Each of these peak at the same place, namely at

_{0}=4, n_{1}=8)*(2)/(3)*. More training examples make the curve sharper.

The distribution of this example is known as the **beta
distribution**; it is parametrized by two
counts, *α _{0}* and

*α*, and a probability

_{1}*p*. Traditionally, the

*α*parameters for the beta distribution are one more than the counts; thus,

_{i}*α*. The beta distribution is

_{i}=n_{i}+1Beta^{α0,α1}(p)=(1)/(K) p^{α1-1}×(1-p)^{α0-1}

where *K* is a normalizing constant that ensures the integral
over all values is 1. Thus,
the uniform distribution on *[0,1]* is the beta distribution
*Beta ^{1,1}*.

The generalization of the beta distribution to more
than two parameters is known as the Dirichlet
distribution.
The **Dirichlet distribution**
with two sorts of parameters, the "counts"
*α _{1},...,α_{k}*, and the probability parameters

*p*, is

_{1},...,p_{k}Dirichlet^{α1,...,αk}(p_{1},...,p_{k}) = (1)/(K) ∏_{j=1}^{k}p_{j}^{αj-1}

where *K* is a normalizing constant that ensures the integral
over all values is 1; *p _{i}* is the probability of the

*i*th outcome (and so

*0 ≤ p*) and

_{i}≤ 1*α*is one more than the count of the

_{i}*i*th outcome. That is,

*α*. The Dirichlet distribution looks like Figure 7.15 along each dimension (i.e., as each

_{i}=n_{i}+1*p*varies between 0 and 1).

_{j}For many cases, summing over all models weighted by their posterior distribution is difficult, because the models may be
complicated (e.g., if they are decision trees or even belief
networks). However, for the Dirichlet distribution, the expected value
for outcome *i* (averaging over all *p _{j}*'s) is

(α_{i})/(∑_{j}α_{j}) .

The reason that the *α _{i}* parameters are one more than the counts is
to make this formula simple. This fraction is well
defined only when the

*α*are all non-negative and not all are zero.

_{j}**Example 7.31:**Consider Example 7.30, which determines the value of

*φ*based on a sequence of observations made up of

*n*cases where

_{0}*Y*is false and

*n*cases where

_{1}*Y*is true. Consider the posterior distribution as shown in Figure 7.15. What is interesting about this is that, whereas the most likely posterior value of

*φ*is

*(n*, the expected value of this distribution is

_{1})/(n_{0}+n_{1})*(n*.

_{1}+1)/(n_{0}+n_{1}+2)Thus, the expected value of the *n _{0}=1, n_{1}=2* curve is

*(3)/(5)*, for the

*n*case the expected value is

_{0}=2, n_{1}=4*(5)/(8)*, and for the

*n*case it is

_{0}=4, n_{1}=8*(9)/(14)*. As the learner gets more training examples, this value approaches

*(n)/(m)*.

This estimate is better than *(n)/(m)* for a number of reasons. First, it tells
us what to do if the learning agent has no examples: Use the uniform
prior of *(1)/(2)*. This is the expected value of the *n=0,
m=0* case. Second, consider the case where *n=0* and *m=3*. The agent
should not use *P(y)=0*, because this says that *Y* is impossible, and it
certainly does not have evidence for this! The expected value of this
curve with a uniform prior is *(1)/(5)*.

An agent does not have to start with a uniform prior; it can start
with any prior distribution. If the agent starts with a prior that is
a Dirichlet distribution, its posterior will be a Dirichlet
distribution. The posterior distribution can be obtained by adding the
observed counts to the *α _{i}* parameters of the prior distribution.

The i.i.d. assumption can be represented as a belief network, where
each of the *e _{i}* are independent given model

*m*. This independence assumption can be represented by the belief network shown on the left side of Figure 7.16.

If *m* is made into
a discrete variable, any of the
inference methods of the previous chapter can be used for inference in this
network. A standard reasoning technique in such a network is to
condition on all of the observed *e _{i}* and to query the model variable
or an unobserved

*e*variable.

_{i}The problem with specifying a belief network for a learning problem is that the model grows with the number
of observations. Such a network can be specified before any
observations have been received by using a **plate
model**. A plate model specifies what
variables will be used in the model and what will be repeated in the observations.
The right side of Figure 7.16 shows a plate model that
represents the same information as the left side. The plate is drawn
as a rectangle that contains some nodes, and an index (drawn on the
bottom right of the plate). The nodes in the plate are indexed by the index. In the plate model, there are multiple copies
of the variables in the plate, one for each value of the index. The
intuition is that there is a pile of plates, one for each value of
the index. The number of plates can be varied depending on the number
of observations and what is queried. In this figure, all of the nodes
in the plate share a common parent. The probability of each copy of a
variable in a plate given the parents is the same for each index.

A plate model lets us specify more complex relationships between the
variables. In a **hierarchical Bayesian
model**, the parameters of the model can
depend on other parameters. Such a model is hierarchical in the sense
that some parameters can depend on other parameters.

**Example 7.32:**Suppose a diagnostic assistant agent wants to model the probability that a particular patient in a hospital is sick with the flu before symptoms have been observed for this patient. This prior information about the patient can be combined with the observed symptoms of the patient. The agent wants to learn this probability, based on the statistics about other patients in the same hospital and about patients at different hospitals. This problem can range from the cases where a lot of data exists about the current hospital (in which case, presumably, that data should be used) to the case where there is no data about the particular hospital that the patient is in. A hierarchical Bayesian model can be used to combine the statistics about the particular hospital the patient is in with the statistics about the other hospitals.

Suppose that for patient *X* in hospital *H* there is a random
variable *S _{HX}* that is true when the patient is sick with the flu. (Assume that
the patient identification number and the hospital uniquely determine the patient.) There is a value

*φ*for each hospital

_{H}*H*that will be used for the prior probability of being sick with the flu for each patient in

*H*. In a Bayesian model,

*φ*is treated as a real-valued random variable with domain

_{H}*[0,1]*.

*S*depends on

_{HX}*φ*, with

_{H}*P(S*. Assume that

_{HX}|φ_{H})=φ_{H}*φ*is distributed according to a beta distribution. We don't assume that

_{H}*φ*and

_{hi}*φ*are independent of each other, but depend on

_{h2}**hyperparameters**. The hyperparameters can be the prior counts

*α*and

_{0}*α*. The parameters depend on the hyperparameters in terms of the conditional probability

_{1}*P(φ*;

_{hi}|α_{0},α_{1})= Beta^{α0,α1}(φ_{hi})*α*and

_{0}*α*are real-valued random variables, which require some prior distribution.

_{1}The plate model and
the corresponding belief network are shown in
Figure 7.17. Part (a) shows the plate model, where
there is a copy of the outside plate for each hospital and a copy of
the inside plate for each patient in the hospital. Part of the
resulting belief network is shown in part (b). Observing some of the *S _{HX}* will
affect the

*φ*and so

_{H}*α*and

_{0}*α*, which will in turn affect the other

_{1}*φ*variables and the unobserved

_{H}*S*variables.

_{HX}Sophisticated methods exist to evaluate such networks. However, if the variables are made discrete, any of the methods of the previous chapter can be used.

In addition to using the posterior distribution of *φ* to derive the
expected value, we can use it to answer other questions such as: What
is the probability that the posterior probability of *φ* is in
the range *[a,b]*? In other words, derive *P((φ ≥ a ∧φ ≤ b) | e)*. This is the problem that the Reverend
Thomas Bayes solved more than 200 years ago [Bayes(1763)]. The solution he gave - although in much more cumbersome notation - was

(∫_{a}^{b}p^{n}×(1-p)^{m-n})/(∫_{0}^{1}p^{n}×(1-p)^{m-n}) .

This kind of knowledge is used in surveys when it may be reported that
a survey is correct with an error of at most *5%*, *19* times out of *20*. It is also the
same type of information that is used by probably approximately correct (PAC) learning, which guarantees an error at most
*ε* at least *1-δ* of the time. If an agent chooses the
midpoint of the range *[a,b]*, namely *(a+b)/(2)*, as its
hypothesis, it will have error less than or equal to *(b-a)/(2)*,
just when the hypothesis is in *[a,b]*. The value *1-δ* corresponds to
*P(φ ≥ a ∧φ ≤ b | e)*. If
*ε=(b-a)/(2)* and *δ=1-P(φ ≥ a ∧φ ≤ b | e)*, choosing the midpoint will result in an error at most
*ε* in *1-δ* of the time. PAC learning gives worst-case results, whereas
Bayesian learning gives the expected number. Typically, the Bayesian
estimate is more accurate, but the PAC results give a guarantee of
the error. The sample complexity (see Section 7.7.2)
required for Bayesian learning is typically much less than that of
PAC learning - many fewer examples are required to *expect to achieve* the
desired accuracy than are needed to *guarantee* the desired accuracy.