
6.4.2.5 Importance Sampling

Instead of creating a sample and then rejecting it, it is possible to mix sampling with inference to reason about the probability that a sample would be rejected. In importance sampling, each sample has a weight, and the sample average is computed using the weighted average of samples. The weights of samples come from two sources:

  • The samples do not have to be selected in proportion to their probability, but they can be selected according to some other distribution, called the proposal distribution.
  • Evidence is used to update the weights and to account for the probability that a sample would be rejected.
Example 6.25: Consider variable A with no parents; variable E has A as its only parent, but A has other children. Suppose P(e|a)=0.003, P(e|¬a)=0.63, and e is observed. Consider the samples with A=true. Out of 1,000 such samples, only about 3 will not be rejected. Instead of rejecting 99.7% of the samples with A=true, each sample with A=true can be weighted by 0.003. Thus, just one sample is able to convey the information of many rejections.

Suppose P(a)=0.98. If the algorithm samples according to the probability, A would be false in only about 20 samples out of 1,000. Instead of sampling according to the probability, suppose A=true is sampled 50% of the time, but each sample is weighted as follows. Each sample with A=true can be weighted by 0.98/0.5=1.96 and each sample with A=false can be weighted by 0.02/0.5=0.04. It is easy to show that the weighted sample average is the same as the probability.

In rejection sampling, given the preceding probabilities and the observation of e, A will be true in 98% of the samples and 99.7% of these will be rejected due to the evidence of e. A=false would be selected in 2% of the samples and 37% of these will be rejected. Rejection sampling would thus accept only 0.98×0.003+0.02×0.63 = 0.01554 of the samples and would reject more than 98% of the samples.

If you combine the ideas in the first two paragraphs of this example, half of the samples will have A=true, and these will be weighted by 1.96×0.003=0.00588, and the other half of the samples will have A=false with a weighting of 0.04×0.63=0.0252. Even two such samples convey the information of many rejected samples.
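To make the arithmetic concrete, here is a minimal Python sketch (not part of the book; the variable names are made up for illustration) that samples A from the 50/50 proposal, weights each sample by P(A)/q(A) times P(e|A), and recovers P(a|e) as the weighted proportion of samples with A=true.

import random

P_a = 0.98                                 # P(A=true)
P_e_given_A = {True: 0.003, False: 0.63}   # P(e|A)
q_a = 0.5                                  # proposal probability of A=true

def weighted_estimate(n=100_000):
    total_weight = 0.0      # sum of the weights of all samples
    weight_a_true = 0.0     # sum of the weights of samples with A=true
    for _ in range(n):
        a = random.random() < q_a                      # sample A from the proposal
        prior_weight = (P_a if a else 1 - P_a) / q_a   # 1.96 or 0.04
        w = prior_weight * P_e_given_A[a]              # 0.00588 or 0.0252
        total_weight += w
        if a:
            weight_a_true += w
    return weight_a_true / total_weight

print(weighted_estimate())
# approaches 0.98*0.003 / (0.98*0.003 + 0.02*0.63) ≈ 0.189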

Importance sampling differs from rejection sampling in two ways:

  • Importance sampling does not sample all variables, only some of them. The variables that are not sampled and are not observed are summed out (i.e., some exact inference is carried out).

    In particular, you probably do not want it to sample the observed variables (although the algorithm does not preclude this). If all of the non-observed variables are sampled, it is easy to determine the probability of a sample given the evidence (see Exercise 6.10).

  • Importance sampling does not have to sample the variables according to their prior probability; it can sample them using any distribution. The distribution that it uses to sample the variables is called the proposal distribution. Any distribution can be used as a proposal distribution as long as the proposal distribution does not have a zero probability for choosing some sample that is possible in the model (otherwise this part of the space will never be explored). Choosing a good proposal distribution is non-trivial.

In general, to sum a product f(S) q(S) over the variables S, you can choose a set of samples {s1,...,sN} from the distribution q(S). Then

∑S f(S) q(S) = limN→∞ (1/N) ∑si f(si),     (6.25)

which essentially computes the expected value of f(S), where the expectation is over the distribution q(S).

In forward sampling, the samples are drawn from the prior distribution defined by the belief network, so q(s) is the prior probability of sample s, but Equation (6.25) works for any distribution.
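As a quick check of Equation (6.25), the following Python sketch (with made-up values for f and q, chosen only for illustration) estimates ∑S f(S) q(S) by averaging f over samples drawn from q.

import random

f = {'a': 2.0, 'b': 5.0, 'c': 1.0}   # an arbitrary function f(S)
q = {'a': 0.5, 'b': 0.3, 'c': 0.2}   # the distribution the samples are drawn from

exact = sum(f[s] * q[s] for s in f)  # 2*0.5 + 5*0.3 + 1*0.2 = 2.7

values, probs = zip(*q.items())
samples = random.choices(values, weights=probs, k=100_000)
estimate = sum(f[s] for s in samples) / len(samples)

print(exact, estimate)               # the estimate approaches 2.7 as N grows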

In importance sampling, let S be the set of variables that will be sampled. As in VE, we introduce some variables and sum them out; in this case, we sum over the sampled variables:

P(h|e) = ∑S P(h|S, e) P(S|e).

Multiplying the top and the bottom by the proposal distribution q(S) gives

P(h|e) = ∑S (P(h|S, e) P(S|e) q(S))/(q(S)).

Note that this does not give a divide-by-zero error; if q(s)=0, s would never be chosen as a sample.

Using Equation (6.25), suppose {s1,...,sN} is the set of all samples:

P(h|e) = limN→∞ (1/N) ∑si P(h|si, e) P(si|e)/q(si).

Using Bayes' rule on P(si|e), and noting that P(e) is a constant, gives

P(h|e) = limN→∞ (1/k) ∑si P(h|si, e) P(e|si) P(si)/q(si),

where k is a normalizing constant that ensures that the posterior probabilities of the values for a mutually exclusive and covering set of hypotheses sum to 1.

Thus, for each sample, the weighting P(si)/q(si) acts like a prior that is multiplied by the probability of the evidence, given the sample, to get the weight of the sample. Given many samples, the preceding formula shows how to predict the posterior on any h by getting the weighted average prediction of each sample.

Note how importance sampling generalizes rejection sampling. Rejection sampling is the case with q(si)=P(si) and S includes all of the variables, including the observed variables.
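To see this reduction concretely, the following sketch uses a made-up two-variable network (A with P(a)=0.3, and an observed child E with P(e|a)=0.8 and P(e|¬a)=0.1; these numbers are illustrative only, not from the book). When every variable, including the observed one, is sampled from the prior, the weight P(e|s) P(s)/q(s) is 1 for samples that agree with the evidence and 0 for those that do not, so the weighted average is exactly the rejection-sampling estimate.

import random

def sample_from_prior():
    # hypothetical network: A with P(a)=0.3; E with P(e|a)=0.8, P(e|not a)=0.1
    a = random.random() < 0.3
    e = random.random() < (0.8 if a else 0.1)
    return a, e

n = 100_000
total_weight = 0.0
weight_a_true = 0.0
for _ in range(n):
    a, e = sample_from_prior()
    w = 1.0 if e else 0.0   # P(evidence|sample) when the evidence variable is itself sampled
    total_weight += w
    if a:
        weight_a_true += w

# the weighted average equals the average over the accepted (non-rejected) samples
print(weight_a_true / total_weight)   # ≈ 0.3*0.8/(0.3*0.8 + 0.7*0.1) ≈ 0.774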


1: Procedure IS_BN(Vs,P,e,Q,S,q,n)
2:           Inputs
3:                     Vs: set of variables
4:                     P: a procedure to compute conditional probabilities
5:                     e: the evidence; an assignment of a value to some of the variables
6:                     Q: a query variable
7:                     S={S1,...,Sk}: a set of variables to sample
8:                     q: a distribution on S (the proposal distribution)
9:                     n: number of samples to generate
              Output
10:                     posterior distribution on Q
11:           Local
12:                     array v[k], where v[i]∈dom(Si)
13:                     real array ans[m], where m is the size of dom(Q), initialized to 0
14:                     s assignment of a value to each element of S
15:           mass←0
16:           repeat n times
17:                     for i=1:k do
18:                               select vi∈dom(Si) using distribution q(Si=vi|S1=v1,...,Si-1=vi-1)
19:                     
20:                     s ← assignment S1=v1,...,Sk=vk
21:                     p ←P(e|s)×P(s)/q(s)
22:                     for each vi∈dom(Q) do
23:                     ans[i]←ans[i]+P(Q=vi| s ∧e)×p
24:                     
25:                     mass←mass+p
26:           
27:           return ans[]/mass
Figure 6.12: Importance sampling for belief network inference

Figure 6.12 shows the details of the importance sampling algorithm for computing P(Q|e) for query variable Q and evidence e. The first for loop (line 17) creates the sample s on S. The variable p (on line 21) is the weight of sample s. The algorithm updates the weight of each value of the query variable and adds the weight p of the sample s to the variable mass, which represents the probability mass, the sum of the weights over all of the values of the query variable. Finally, it returns the probability by dividing the weight of each value of the query variable by the probability mass.
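The pseudocode of Figure 6.12 can be transcribed into Python along the following lines. This is only a sketch: the belief network is abstracted into callables for P(e|s), P(s), the proposal sampler, and P(Q|s∧e); these interfaces are assumptions made for the example, not part of the book's code. It is instantiated on the numbers of Example 6.25, where the only sampled variable is A and the query is A itself.

import random

def is_bn(sample_proposal, prior, prob_evidence, query_posterior,
          query_domain, n=10_000):
    # importance sampling as in Figure 6.12 (sketch)
    ans = {v: 0.0 for v in query_domain}   # weighted prediction for each value of Q
    mass = 0.0                             # total weight of the samples
    for _ in range(n):
        s, q_s = sample_proposal()                 # sample S from q, with its probability q(s)
        p = prob_evidence(s) * prior(s) / q_s      # weight of the sample (line 21)
        posterior = query_posterior(s)             # P(Q | s and e)
        for v in query_domain:
            ans[v] += posterior[v] * p
        mass += p
    return {v: ans[v] / mass for v in query_domain}

# Instantiation on Example 6.25: sample A with proposal q(a)=0.5,
# prior P(a)=0.98, evidence e with P(e|a)=0.003 and P(e|not a)=0.63,
# and query Q=A, so P(Q | s and e) is 1 for the sampled value of A.
def sample_proposal():
    a = random.random() < 0.5
    return a, 0.5

print(is_bn(sample_proposal,
            prior=lambda a: 0.98 if a else 0.02,
            prob_evidence=lambda a: 0.003 if a else 0.63,
            query_posterior=lambda a: {True: 1.0 if a else 0.0,
                                       False: 0.0 if a else 1.0},
            query_domain=[True, False]))
# expected: {True: about 0.189, False: about 0.811}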

Example 6.26: Suppose we want to use importance sampling to compute P(alarm|smoke ∧report). We must choose which variables to sample and the proposal distribution. Suppose we sample Tampering, Fire, and Leaving, and we use the following proposal distribution:
q(tampering)=0.02
q(fire)=0.5
q(Alarm|Tampering,Fire)=P(Alarm|Tampering,Fire)
q(Leaving|Alarm)=P(Leaving|Alarm)

Thus, the proposal distribution is the same as the original distribution, except for Fire.

The following table gives a few samples. In this table, s is the sample; e is smoke ∧report; P(e|s) is equal to P(smoke|Fire)×P(report|Leaving), where the values for Fire and Leaving are taken from the sample; P(s)/q(s) is 0.02 when Fire=true in the sample and is 1.98 when Fire=false; p=P(e|s)×P(s)/q(s) is the weight of the sample.

Tampering  Fire   Alarm  Leaving  P(e|s)  P(s)/q(s)  p
false      true   false  true     0.675   0.02       0.0135
true       true   true   false    0.009   0.02       0.00018
false      false  false  true     0.0075  1.98       0.01485
false      true   false  false    0.009   0.02       0.00018

P(alarm|smoke ∧report) is the weighted proportion of the samples that have Alarm true.
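That weighted proportion can be computed directly from the four samples in the table (four samples give only a very rough estimate, of course):

samples = [            # (value of Alarm in the sample, weight p)
    (False, 0.0135),
    (True,  0.00018),
    (False, 0.01485),
    (False, 0.00018),
]
mass = sum(p for _, p in samples)
alarm_weight = sum(p for alarm, p in samples if alarm)
print(alarm_weight / mass)   # ≈ 0.0063 from these four samples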

The efficiency of this algorithm, in terms of how the accuracy depends on the run time, is determined by

  • the proposal distribution. To get the best result, the proposal distribution should be as close as possible to the posterior distribution. However, an agent typically cannot sample directly from the posterior distribution; if it could, it could produce posterior probabilities much more simply.
  • which variables to sample. Sampling fewer variables means that there is more information from each sample, but each sample requires more time, because computing the probability of the sample involves more exact inference.

Determining the proposal distribution and which variables to sample is an art.