8.6 Stochastic Simulation


8.6.7 Markov Chain Monte Carlo

The previously described methods went forward through the network (parents were sampled before children), and were not good at passing information back through the network. The method described in this section can sample variables in any order.

A stationary distribution of a Markov chain is a distribution of its variables that is not changed by the transition function of the Markov chain. If the Markov chain mixes enough, there is a unique stationary distribution, which can be approached by running the Markov chain long enough. The idea behind Markov chain Monte Carlo (MCMC) methods to generate samples from a distribution (e.g., the posterior distribution given a belief network) is to construct a Markov chain with the desired distribution as its (unique) stationary distribution and then sample from the Markov chain; these samples will be distributed according to the desired distribution. We typically discard the first few samples in a burn-in period, as these samples may be far from the stationary distribution.
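
To make the stationary-distribution idea concrete, here is a minimal Python sketch (not from the book; the two-state transition matrix is an invented example). It estimates the stationary distribution both by repeatedly applying the transition function and by simulating the chain with a burn-in period:

    import numpy as np

    # A hypothetical two-state Markov chain: T[i][j] = P(next = j | current = i).
    T = np.array([[0.9, 0.1],
                  [0.4, 0.6]])

    # Repeatedly applying the transition function: a stationary distribution
    # satisfies pi = pi @ T, so iteration converges to it when the chain mixes.
    pi = np.array([1.0, 0.0])              # arbitrary initial distribution
    for _ in range(1000):
        pi = pi @ T
    print("stationary, by iteration: ", pi)             # -> [0.8, 0.2]

    # Simulating the chain: long-run state frequencies approach the same
    # distribution once the early (burn-in) samples are discarded.
    rng = np.random.default_rng(0)
    state, counts = 0, np.zeros(2)
    for step in range(100_000):
        state = rng.choice(2, p=T[state])
        if step >= 1_000:                  # burn-in period
            counts[state] += 1
    print("stationary, by simulation:", counts / counts.sum())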

One way to create a Markov chain from a belief network with observations is to use Gibbs sampling. The idea is to clamp observed variables to the values they were observed to have, and sample the other variables. Each variable is sampled from the distribution of the variable given the current values of the other variables. Note that each variable only depends on the values of the variables in its Markov blanket. The Markov blanket of a variable X in a belief network contains X's parents, X's children, and the other parents of X's children; these are all of the variables that appear in factors with X.
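
The Markov blanket can be read directly off the graph. As a small sketch (the parents-dict representation is an assumption for illustration, not the book's code):

    def markov_blanket(parents, X):
        # parents maps each variable to the tuple of its parents
        children = [v for v, ps in parents.items() if X in ps]
        blanket = set(parents[X]) | set(children)
        for c in children:                 # the other parents of X's children
            blanket |= set(parents[c])
        blanket.discard(X)                 # X itself is not in its blanket
        return blanket

    # For the chain A -> B -> C used in Example 8.46 below:
    print(markov_blanket({"A": (), "B": ("A",), "C": ("B",)}, "B"))   # {'A', 'C'}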

procedure Gibbs_sampling(B, e, Q, n, burn_in)
     Inputs
          B: belief network
          e: the evidence; a variable-value assignment to some of the variables
          Q: query variable
          n: number of samples to generate
          burn_in: number of samples to discard initially
     Output
          posterior distribution on Q
     Local
          array sample[var], where sample[var] ∈ domain(var)
          real array counts[k] for k ∈ domain(Q), initialized to 0
     initialize sample[X] := e[X] if X observed, otherwise assign randomly
     repeat burn_in times
          for each non-observed variable X, in any order do
               sample[X] := a random sample from P(X ∣ markov_blanket(X))
     repeat n times
          for each non-observed variable X, in any order do
               sample[X] := a random sample from P(X ∣ markov_blanket(X))
          v := sample[Q]
          counts[v] := counts[v] + 1
     return counts / ∑_v counts[v]
Figure 8.32: Gibbs sampling for belief network inference

Figure 8.32 gives pseudocode for Gibbs sampling. The only part that needs further specification is how to randomly sample from P(X ∣ markov_blanket(X)). For each value of X, P(X ∣ markov_blanket(X)) is proportional to the product of the values of the factors in which X appears, projected onto the current values of all of the other variables; normalizing these products gives the distribution to sample from.
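
As a concrete rendering of Figure 8.32, here is a minimal Python sketch. It is one possible implementation under assumed representations, not the book's code: a network is a dict of parent tuples plus conditional probability tables, where cpt[X][(parent values, x)] gives P(X=x ∣ parents):

    import random

    def gibbs_sampling(parents, cpt, domains, e, Q, n, burn_in, rng=None):
        # parents: var -> tuple of parent variables
        # cpt:     var -> dict mapping (parent-values tuple, value) -> probability
        # domains: var -> list of possible values
        # e:       dict of observed variable-value assignments
        rng = rng or random.Random()
        variables = list(parents)
        children = {X: [v for v in variables if X in parents[v]]
                    for X in variables}
        sample = {X: e[X] if X in e else rng.choice(domains[X])
                  for X in variables}

        def blanket_weights(X):
            # P(X | markov_blanket(X)) is proportional to the product of the
            # factors containing X, at the current values of the other variables
            weights = []
            for val in domains[X]:
                sample[X] = val
                w = cpt[X][tuple(sample[p] for p in parents[X]), val]
                for c in children[X]:
                    w *= cpt[c][tuple(sample[p] for p in parents[c]), sample[c]]
                weights.append(w)
            return weights

        counts = {val: 0 for val in domains[Q]}
        for i in range(burn_in + n):
            for X in variables:
                if X not in e:
                    sample[X] = rng.choices(domains[X],
                                            weights=blanket_weights(X))[0]
            if i >= burn_in:
                counts[sample[Q]] += 1
        total = sum(counts.values())
        return {val: c / total for val, c in counts.items()}

Note that random.choices normalizes the weights itself, so the products over factors never need to be normalized explicitly, matching the observation above that P(X ∣ markov_blanket(X)) only needs to be known up to a constant factor.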

Gibbs sampling will approach the correct probabilities as long as there are no zero probabilities. How quickly it approaches the distribution depends on how well the chain mixes (how quickly it explores the space of assignments), which depends on how extreme the probabilities are. Gibbs sampling works well when the probabilities are not extreme.

Example 8.46.

As a problematic case for Gibbs sampling, consider a simple example with three Boolean variables A, B, C, with A as the parent of B, and B as the parent of C. Suppose P(a)=0.5, P(b ∣ a)=0.99, P(b ∣ ¬a)=0.01, P(c ∣ b)=0.99, P(c ∣ ¬b)=0.01. There are no observations and the query variable is C. The two assignments with all variables having the same value are equally likely and are much more likely than the other assignments. Gibbs sampling will quickly get to one of these assignments, and will take a long time to transition to the other (as it requires some very unlikely choices). If 0.99 and 0.01 were replaced by numbers closer to 1 and 0, it would take even longer to converge.
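
Running the gibbs_sampling sketch from above on this network illustrates the slow mixing; the encoding is again an assumed one for illustration:

    import random

    parents = {"A": (), "B": ("A",), "C": ("B",)}
    domains = {v: [True, False] for v in parents}
    cpt = {
        "A": {((), True): 0.5, ((), False): 0.5},
        "B": {((True,), True): 0.99, ((True,), False): 0.01,
              ((False,), True): 0.01, ((False,), False): 0.99},
        "C": {((True,), True): 0.99, ((True,), False): 0.01,
              ((False,), True): 0.01, ((False,), False): 0.99},
    }

    # The exact answer is P(c) = 0.5, but short runs tend to report a value
    # near 0 or near 1, depending on which mode the chain falls into first:
    for seed in range(3):
        est = gibbs_sampling(parents, cpt, domains, e={}, Q="C",
                             n=1000, burn_in=100, rng=random.Random(seed))
        print(round(est[True], 3))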