#### 6.4.2.5 Importance Sampling

Instead of creating a sample and then rejecting it, it is possible to mix sampling with inference to reason about the probability that a sample would be rejected. In importance sampling, each sample has a weight, and the sample average is computed using the weighted average of samples. The weights of samples come from two sources:

- The samples do not have to be selected in proportion to their
probability, but they can be selected according to some other
distribution, called the
**proposal distribution**. - Evidence is used to update the weights and is used to compute the probability that a sample would be rejected.

**Example 6.25:**Consider variable

*A*with no parents; variable

*E*has

*A*as its only parent, but

*A*has other children. Suppose

*P(e|a)=0.003*,

*P(e|¬a)=0.63*, and

*e*is observed. Consider the samples with

*A=true*. Out of 1,000 such samples, only about 3 will not be rejected. Instead of rejecting 99.7% of the samples with

*A=true*, each sample with

*A=true*can be weighted by 0.003. Thus, just one sample is able to convey the information of many rejections.

Suppose *P(a)=0.98*. If the algorithm samples according to the
probability, *A=false* would only be true in about 20 samples out of
1,000. Instead of sampling according to the probability, suppose
*A=true* is sampled 50% of the time, but each sample is weighted as follows. Each sample
with *A=true* can be weighted by *0.98/0.5=1.96* and each sample with
*A=false* can be weighted by
*0.02/0.5=0.04*. It is easy to show that the weighted sample average
is the same as the probability.

In rejection sampling, given the preceding probabilities and the
observation of *e*, *A* will be true in 98% of the samples and
99.7% of these will be rejected due to the evidence
of *e*. *A=false* would be selected in 2% of the samples and
37% of these will be rejected. Rejection sampling would thus accept only *0.98×0.003+0.02×0.63 = 0.01554* of the samples and would reject more than 98%
of the samples.

If you combine the ideas in the first two paragraphs of this
example, half of the examples will have *A=true*, and these will be
weighted by *1.96×0.003=0.00588*, and the other half of the
samples will have *A=false* with a weighting of
*0.04×0.63=0.0252*. Even two such samples convey the
information of many rejected samples.

Importance sampling differs from rejection sampling in two ways:

- Importance sampling does not sample all variables, only some of
them. The variables that are not sampled and are not observed are summed out (i.e, some
exact inference is carried out).
In particular, you probably do not want it to sample the observed variables (although the algorithm does not preclude this). If all of the non-observed variables are sampled, it is easy to determine the probability of a sample given the evidence (see Exercise 6.10).

- Importance sampling does not have to sample the variables according to their prior
probability; it can sample them using any distribution. The
distribution that it uses to sample the variables is called the
**proposal distribution**. Any distribution can be used as a proposal distribution as long as the proposal distribution does not have a zero probability for choosing some sample that is possible in the model (otherwise this part of the space will never be explored). Choosing a good proposal distribution is non-trivial.

In general, to sum over variables *S* from a product *f(S)q(S)*, you
can choose a set of samples *{s _{1},...,s_{N}}* from the distribution

*q(s)*. Then

∑ _{S}f(S) q(S)= lim _{N→∞}1/N∑_{si}f(s_{i}),

which essentially computes the expected value of *f(S)*, where the expectation is over the distribution *q(S)*.

In forward sampling, *q(s)* is the uniform sample, but Equation (6.25) works for
any distribution.

In importance sampling, let *S* be the set of variables that will be
sampled. As in VE, we introduce some variables
and sum them out; in this case, we sum over the sampled variables:

P(h|e) = ∑_{S}P(h|S, e) P(S|e).

Multiplying the top and the bottom by proposal distribution
*q(S)* gives

P(h|e) = ∑_{S}(P(h|S, e) P(S|e) q(S))/(q(S)).

Note that this does not give a divide-by-zero error; if *q(s)=0*,
*s* would never be chosen as a sample.

Using Equation (6.25), suppose *{s _{1},...,s_{N}}* is the set
of all samples:

P(h|e)=lim_{N→∞}∑_{si}( P(h|s_{i}, e) P(s_{i}|e))/(q(s_{i})).

Using Bayes' rule on *P(s _{i}|e)*, and noting that

*P(e)*is a constant, gives

P(h|e) =lim_{n→∞}1/k ∑_{si}( P(h|s_{i}, e) P(e|s_{i})P(s_{i}))/q(s_{i}),

where *k* is a normalizing constant that ensures that the posterior probabilities of the values for a mutually exclusive and covering set of hypotheses sum to 1.

Thus, for each sample, the weighting *P(s _{i})/q(s_{i})* acts like a prior
that is multiplied
by the probability of the evidence, given the
sample, to get the weight of the sample. Given many samples, the preceding
formula shows how to predict the
posterior on any

*h*by getting the weighted average prediction of each sample.

Note how importance sampling generalizes rejection sampling. Rejection
sampling is the case with *q(s _{i})=P(s_{i})* and

*S*includes all of the variables, including the observed variables.

**Procedure**IS_BN(

*Vs,P,e,Q,*)

**S**,q,n2:

**Inputs**

3:

*Vs*: set of variables

4:

*P*: a procedure to compute conditional probabilities

5:

*e*: the evidence; an assignment of a value to some of the variables

6:

*Q*: a query variable

7:

*: a set of variables to sample*

**S**={S_{1},...,S_{k}}8:

*q*: a distribution on

*(the proposal distribution)*

**S**9:

*n*: number of samples to generate

**Output**

10: posterior distribution on

*Q*

11:

**Local**

12: array

*v[k]*, where

*v[i]∈dom(S*

_{i})13: real array

*ans[m]*where

*m*is the size of

*dom(Q)*

14:

*s*assignment of a value to each element of

*S*

15:

*mass←0*

16:

**repeat**

*n*

**times**

17:

**for**

*i=1:k*

**do**

18:

**select**

*v*using distribution

_{i}∈dom(S_{i})*q(S*

_{i}=v_{i}|S_{0}=v_{0},...,S_{i-1}=v_{i-1})19:

20:

*s ←*assignment

*S*

_{0}=v_{0},...,S_{k}=v_{k}21:

*p ←P(e|s)×P(s)/q(s)*

22:

**for each**

*v*

_{i}∈dom(Q)**do**

23:

*ans[i]←ans[i]+P(Q=v*

_{i}| s ∧e)×p24:

25:

*mass←mass+p*

26:

27:

**return**ans[]/mass

Figure 6.12 shows the details of the
importance sampling
algorithm for computing *P(Q|e)* for query variable *Q* and evidence
*e*. The first *for* loop
(line 17) creates the
sample *s* on *S*. The variable *p* (on
line 21) is the weight
of sample *s*. The algorithm updates the weight of each value of
the query variable and adds the probability of the sample *s* to the
variable *mass*, which represents the **probability mass** - the
sum of probabilities for all of the values for the query
variable. Finally, it returns the probability by dividing the weight of
each value of the query variable by the probability mass.

**Example 6.26:**Suppose we want to use importance sampling to compute

*P(alarm|smoke ∧report)*. We must choose which variables to sample and the proposal distribution. Suppose we sample

*Tampering*,

*Fire*, and

*Leaving*, and we use the following proposal distribution:

*q(tampering)=0.02*

*q(fire)=0.5*

*q(Alarm|Tampering,Fire)=P(Alarm|Tampering,Fire)*

*q(Leaving|Alarm)=P(Leaving|Alarm)*

Thus, the proposal distribution is
the same as the original distribution, except for *Fire*.

The following table gives a few samples. In this table, *s* is the sample; *e* is *smoke
∧report*; *P(e|s)* is equal to *P(smoke|Fire)×P(report|Leaving)*, where the value for *Fire* and *Leaving* are from
the sample; *P(s)/q(s)* is 0.02 when *Fire=true* in the sample and is
1.98 when *Fire=false*; *p=P(e|s)×P(s)/q(s)* is the weight of
the sample.

Tampering | Fire | Alarm | Leaving | P(e|s) | P(s)/q(s) | p |

false | true | false | true | 0.675 | 0.02 | 0.0135 |

true | true | true | false | 0.009 | 0.02 | 0.00018 |

false | false | false | true | 0.0075 | 1.98 | 0.01485 |

false | true | false | false | 0.009 | 0.02 | 0.00018 |

*P(alarm|smoke
∧report)* is the weighted proportion of the samples that have *Alarm* true.

The efficiency of this algorithm, in terms of how accuracy depends on the run time, depends on

- the proposal distribution. To get the best result, the proposal distribution should be as close as possible to the posterior distribution. However, an agent typically cannot sample directly from the posterior distribution; if it could, it could produce posterior probabilities much more simply.
- which variables to sample. Sampling fewer variables means that there is more information from each sample, but each sample requires more time to compute the probability of the sample.

Determining the proposal distribution and which variables to sample is an art.