The third edition of Artificial Intelligence: foundations of computational agents, Cambridge University Press, 2023 is now available (including full text).
Likelihood weighting is an instance of importance sampling. Importance sampling algorithms have the following characteristics:
Samples are weighted.
The samples do not need to come from the actual distribution, but can be from (almost) any distribution, with the weights adjusted to reflect the difference between the distributions.
Some variables can be summed out and some sampled.
This freedom to sample from a different distribution allows the algorithm to choose better sampling distributions to give better estimates.
Stochastic simulation can be used to compute the expected value of real-valued variable $f$ under probability distribution $P$ using:
${\mathcal{E}}_{P}(f)=\sum _{w}f(w)*P(w)\approx \frac{1}{n}\sum _{s}f(s)$
where $s$ ranges over $n$ samples drawn according to the distribution $P$. The estimate gets more accurate as the number of samples grows.
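As a sketch of this estimator, assuming a tiny hypothetical distribution $P$ over $\{0,1\}$ with $P(1)=0.3$ and taking $f$ to be the identity (so the expected value is $0.3$; both $P$ and $f$ are illustrative, not from the text):

```python
import random

# Hypothetical distribution P over {0, 1} with P(1) = 0.3.
def sample_p():
    return 1 if random.random() < 0.3 else 0

# f is the identity, so E_P[f] = P(1) = 0.3.
def f(w):
    return w

def monte_carlo_estimate(n):
    # Average f over n samples drawn directly from P.
    return sum(f(sample_p()) for _ in range(n)) / n

random.seed(0)
est = monte_carlo_estimate(100_000)  # approaches 0.3 as n grows
```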
Suppose it is difficult to sample with the distribution $P$, but it is easy to sample from a distribution $Q$. We adapt the previous equation to estimate the expected value from $P$, by sampling from $Q$ using:
${\mathcal{E}}_{P}(f)=\sum _{w}f(w)*P(w)=\sum _{w}f(w)*(P(w)/Q(w))*Q(w)\approx \frac{1}{n}\sum _{s}f(s)*P(s)/Q(s)$
where the last sum is over $n$ samples selected according to the distribution $Q$. The distribution $Q$ is called a proposal distribution. The only restriction on $Q$ is that it should not be zero for any cases where $P$ is not zero (i.e., if $Q(c)=0$ then $P(c)=0$).
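A minimal sketch of this reweighting, with hypothetical distributions over $\{0,1\}$: $P$ puts probability $0.01$ on the value $1$ (a rare event that direct sampling would seldom hit), the proposal $Q$ is uniform, and $f$ is the identity, so the true value of ${\mathcal{E}}_{P}(f)$ is $0.01$:

```python
import random

# Illustrative distributions (not from the text): P makes the value 1
# rare; the proposal Q is uniform, so 1 is sampled half the time.
P = {1: 0.01, 0: 0.99}
Q = {1: 0.5, 0: 0.5}

def importance_estimate(n):
    # Sample from Q and weight each sample by P(s)/Q(s); f is the identity.
    total = 0.0
    for _ in range(n):
        s = 1 if random.random() < Q[1] else 0
        total += s * P[s] / Q[s]
    return total / n

random.seed(0)
est = importance_estimate(100_000)  # estimates E_P[f] = 0.01
```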
Recall that for Boolean variables, with $true$ represented as 1 and $false$ as 0, the expected value is the probability. So the methods here can be used to compute probabilities.
The algorithm of Figure 8.30 can be adapted to use a proposal distribution as follows: in line 20, it should sample from $Q(X\mid parents(X))$, and in a new line after that, it updates the value of $weight$ by multiplying it by $P(X\mid parents(X))/Q(X\mid parents(X))$.
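This adaptation can be sketched on the two-variable fire/smoke fragment of the alarm example (the probabilities are from the text; the function names are illustrative, not those of Figure 8.30):

```python
import random

# Likelihood weighting with a proposal distribution, on the fragment
# P(fire) = 0.01, P(smoke | fire) = 0.9, P(smoke | ~fire) = 0.01,
# with evidence Smoke = true.
P_FIRE = 0.01                             # P(fire)
P_SMOKE_GIVEN = {True: 0.9, False: 0.01}  # P(smoke | Fire)

def weighted_sample(q_fire):
    # Sample the non-evidence variable Fire from Q rather than from P ...
    fire = random.random() < q_fire
    # ... and multiply the weight by P(Fire)/Q(Fire), as described above.
    weight = (P_FIRE if fire else 1 - P_FIRE) / (q_fire if fire else 1 - q_fire)
    # The evidence Smoke = true contributes its probability, as in
    # ordinary likelihood weighting.
    weight *= P_SMOKE_GIVEN[fire]
    return weight

def estimate_p_smoke(n, q_fire=0.5):
    # The average weight estimates P(smoke) = 0.01*0.9 + 0.99*0.01 = 0.0189.
    return sum(weighted_sample(q_fire) for _ in range(n)) / n

random.seed(0)
est = estimate_p_smoke(100_000)
```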
In the running alarm example, $P(smoke)=0.0189$. As explained in Example 8.42, if the algorithm samples according to the prior probability, $Smoke=true$ would only be true in about 19 samples out of 1000. Likelihood weighting ended up with a few samples with high weights and many samples with low weights, even though the samples represented similar numbers of cases.
Suppose, instead of sampling according to the prior probability, a proposal distribution with $Q(fire)=0.5$ is used. Then $Fire=true$ is sampled 50% of the time. According to the model, $P(fire)=0.01$; thus each sample with $Fire=true$ is weighted by $0.01/0.5=0.02$ and each sample with $Fire=false$ is weighted by $0.99/0.5=1.98$.
With importance sampling using $Q$ as the proposal distribution, half of the samples will have $Fire=true$, and the model specifies $P(smoke\mid fire)=0.9$. Given the evidence $e$, these will be weighted by $0.9*0.02=0.018$. The other half of the samples will have $Fire=false$, and the model specifies $P(smoke\mid \neg fire)=0.01$; these samples will have a weight of $0.01*1.98=0.0198$. Notice how all of the samples have weights of the same order of magnitude, which makes the estimates much more accurate.
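The weight arithmetic above can be checked directly. Under the prior, a sample's weight is just $P(smoke\mid Fire)$, giving weights that differ by a factor of 90; under the proposal $Q(fire)=0.5$, the $P/Q$ correction for $Fire$ brings the two weights close together (a small arithmetic sketch of the example):

```python
# Weights for the two values of Fire, given evidence Smoke = true.
prior_weights = (0.9, 0.01)             # sampling Fire from its prior
q_weights = ((0.01 / 0.5) * 0.9,        # Fire = true under Q: 0.018
             (0.99 / 0.5) * 0.01)       # Fire = false under Q: 0.0198

prior_spread = max(prior_weights) / min(prior_weights)  # about 90
q_spread = max(q_weights) / min(q_weights)              # about 1.1
```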
Importance sampling can also be combined with exact inference. Not all variables need to be sampled. Those not sampled can be summed out by variable elimination.
The best proposal distribution is one where the samples have approximately equal weight. This occurs when sampling from the posterior distribution. In adaptive importance sampling the proposal distribution is modified to approximate the posterior probability of the variable being sampled.