
### 6.4.1 Variable Elimination for Belief Networks

This section gives an algorithm for finding the posterior distribution for a variable in an arbitrarily structured belief network. Many of the efficient exact methods can be seen as optimizations of this algorithm. This algorithm can be seen as a variant of variable elimination (VE) for constraint satisfaction problems (CSPs) or VE for soft constraints.

The algorithm is based on the notion that a belief network
specifies a factorization of the joint probability
distribution.
Before we give the algorithm, we define factors and the
operations that will be performed on them. Recall that *P(X|Y)* is a
function from variables (or sets of variables) *X* and *Y* into the
real numbers that, given a value for *X* and a value for *Y*, gives the
conditional probability of the value for *X*, given the value for
*Y*. This idea of a function of variables is generalized as the notion
of a factor. The VE algorithm for belief networks
manipulates factors to compute posterior probabilities.

A **factor** is a representation of a function from a tuple of random variables into a number. We will write factor *f* on variables *X_{1},...,X_{j}* as *f(X_{1},...,X_{j})*. The variables *X_{1},...,X_{j}* are the **variables of** the factor *f*, and *f* is a factor **on** *X_{1},...,X_{j}*.

Suppose *f(X_{1},...,X_{j})* is a factor and each *v_{i}* is an element of the domain of *X_{i}*. *f(X_{1}=v_{1},X_{2}=v_{2},...,X_{j}=v_{j})* is a number that is the value of *f* when each *X_{i}* has value *v_{i}*. Some of the variables of a factor can be assigned to values to make a new factor on the other variables. For example, *f(X_{1}=v_{1},X_{2},...,X_{j})*, sometimes written as *f(X_{1},X_{2},...,X_{j})_{X_{1}=v_{1}}*, where *v_{1}* is an element of the domain of variable *X_{1}*, is a factor on *X_{2},...,X_{j}*.
_{2},...,X_{j}**Representations of Conditional Probabilities and Factors**

A conditional probability distribution can be seen as a function of the variables involved. A factor is a representation of a function on variables; thus, a factor can be used to represent conditional probabilities.

When the variables have finite domains, these factors can be implemented as arrays. If there is an ordering of the variables (e.g., alphabetical) and the values in the domains are mapped into the non-negative integers, there is a canonical representation of each factor as a one-dimensional array that is indexed by natural numbers. Operations on factors can be implemented efficiently using such a representation. However, this form is not a good one to present to users because the structure of the conditional is lost. Factors do not have to be implemented as arrays. The tabular representation is often too large when there are many parents. Often, more structure exists in conditional probabilities that can be exploited.

One such structure exploits **context-specific independence**, where
one variable is conditionally independent of another, given a particular value of a
third variable. For example, suppose the robot can
go outside or get coffee. Whether it gets wet depends on whether there is
rain in the context that it went out, or on whether the cup was full
if it got coffee. There are a number of ways to represent the
conditional probability *P(Wet|Out,Rain,Full)* - for example,
as a decision tree, as rules with probabilities, or as tables with contexts:

wet ← out ∧ rain: 0.8
wet ← out ∧ ∼rain: 0.1
wet ← ∼out ∧ full: 0.6
wet ← ∼out ∧ ∼full: 0.3

out:

| Rain | Wet | Prob |
|---|---|---|
| t | t | 0.8 |
| t | f | 0.2 |
| f | t | 0.1 |
| f | f | 0.9 |

∼out:

| Full | Wet | Prob |
|---|---|---|
| t | t | 0.6 |
| t | f | 0.4 |
| f | t | 0.3 |
| f | f | 0.7 |

Another common representation is a **noisy or**. For
example, suppose the robot can get wet from rain,
coffee, sprinkler, or kids. There can be a probability that it gets
wet from rain if it rains, and a probability that it gets wet from
coffee if it has coffee, and so on (these probabilities give the
noise). The robot gets wet if it gets wet from one
of them, giving the "or".
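A noisy-or can be sketched in a few lines of Python. This is an illustration only: the function name and the noise probabilities below are made up, not taken from the book.

```python
# Noisy-or: the robot stays dry only if every active cause independently
# fails to make it wet, so P(wet) = 1 - product over active causes of (1 - p).
def noisy_or(noise_probs, active):
    """noise_probs: dict mapping cause -> probability it alone causes wetness.
    active: set of causes that are present."""
    p_dry = 1.0
    for cause in active:
        p_dry *= 1.0 - noise_probs[cause]
    return 1.0 - p_dry

# Hypothetical noise probabilities (illustrative only):
probs = {"rain": 0.8, "coffee": 0.3, "sprinkler": 0.6, "kids": 0.2}

print(noisy_or(probs, set()))               # no active cause: probability 0
print(noisy_or(probs, {"rain"}))            # one cause: its noise probability
print(noisy_or(probs, {"rain", "coffee"}))  # 1 - (1-0.8)(1-0.3)
```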

The next chapter explores other representations that can be used for conditional probabilities.

**Example 6.17:** Figure 6.5 shows a factor *r(X,Y,Z)* on variables *X*, *Y*, and *Z* as a table. This assumes that each variable is binary with domain *{t,f}*. The figure gives a table for the factor *r(X=t,Y,Z)*, which is a factor on *Y,Z*. Similarly, *r(X=t,Y,Z=f)* is a factor on *Y*, and *r(X=t,Y=f,Z=f)* is a number.

*r(X,Y,Z)* =

| X | Y | Z | val |
|---|---|---|---|
| t | t | t | 0.1 |
| t | t | f | 0.9 |
| t | f | t | 0.2 |
| t | f | f | 0.8 |
| f | t | t | 0.4 |
| f | t | f | 0.6 |
| f | f | t | 0.3 |
| f | f | f | 0.7 |

*r(X=t,Y,Z)* =

| Y | Z | val |
|---|---|---|
| t | t | 0.1 |
| t | f | 0.9 |
| f | t | 0.2 |
| f | f | 0.8 |

*r(X=t,Y,Z=f)* =

| Y | val |
|---|---|
| t | 0.9 |
| f | 0.8 |

*r(X=t,Y=f,Z=f) = 0.8*
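When the domains are finite, a factor can be sketched in Python as a dictionary from assignment tuples to numbers; assigning a variable then just selects the matching rows. This `Factor` class is a hypothetical minimal representation assumed for illustration (not the book's implementation), filled with the values of *r(X,Y,Z)* from Figure 6.5.

```python
# A factor maps tuples of values (one per variable, in order) to numbers.
class Factor:
    def __init__(self, variables, table):
        self.variables = list(variables)   # e.g. ["X", "Y", "Z"]
        self.table = dict(table)           # e.g. {("t","t","t"): 0.1, ...}

    def assign(self, var, value):
        """Return a new factor with var set to value (var is dropped)."""
        i = self.variables.index(var)
        new_vars = self.variables[:i] + self.variables[i+1:]
        new_table = {key[:i] + key[i+1:]: p
                     for key, p in self.table.items() if key[i] == value}
        return Factor(new_vars, new_table)

r = Factor(["X", "Y", "Z"], {
    ("t","t","t"): 0.1, ("t","t","f"): 0.9, ("t","f","t"): 0.2, ("t","f","f"): 0.8,
    ("f","t","t"): 0.4, ("f","t","f"): 0.6, ("f","f","t"): 0.3, ("f","f","f"): 0.7,
})

r_xt = r.assign("X", "t")                        # factor on Y, Z
r_num = r_xt.assign("Z", "f").assign("Y", "f")   # factor on no variables
print(r_num.table[()])                           # 0.8, as in Figure 6.5
```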

Factors can be multiplied together. Suppose *f_{1}* and *f_{2}* are factors, where *f_{1}* is a factor that contains variables *X_{1},...,X_{i}* and *Y_{1},...,Y_{j}*, and *f_{2}* is a factor with variables *Y_{1},...,Y_{j}* and *Z_{1},...,Z_{k}*, where *Y_{1},...,Y_{j}* are the variables in common to *f_{1}* and *f_{2}*. The **product** of *f_{1}* and *f_{2}*, written *f_{1}×f_{2}*, is a factor on the union of the variables, namely *X_{1},...,X_{i},Y_{1},...,Y_{j},Z_{1},...,Z_{k}*, defined by:

(f_{1}×f_{2})(X_{1},...,X_{i},Y_{1},...,Y_{j},Z_{1},...,Z_{k}) = f_{1}(X_{1},...,X_{i},Y_{1},...,Y_{j}) × f_{2}(Y_{1},...,Y_{j},Z_{1},...,Z_{k}).

*f_{1}* =

| A | B | val |
|---|---|---|
| t | t | 0.1 |
| t | f | 0.9 |
| f | t | 0.2 |
| f | f | 0.8 |

*f_{2}* =

| B | C | val |
|---|---|---|
| t | t | 0.3 |
| t | f | 0.7 |
| f | t | 0.6 |
| f | f | 0.4 |

*f_{1}×f_{2}* =

| A | B | C | val |
|---|---|---|---|
| t | t | t | 0.03 |
| t | t | f | 0.07 |
| t | f | t | 0.54 |
| t | f | f | 0.36 |
| f | t | t | 0.06 |
| f | t | f | 0.14 |
| f | f | t | 0.48 |
| f | f | f | 0.32 |

**Example 6.18:** Figure 6.6 shows the product of *f_{1}(A,B)* and *f_{2}(B,C)*, which is a factor on *A,B,C*. Note that *(f_{1}×f_{2})(A=t,B=f,C=f) = f_{1}(A=t,B=f)×f_{2}(B=f,C=f) = 0.9×0.4 = 0.36*.

The remaining operation is to sum out a variable in a factor. Given factor *f(X_{1},...,X_{j})*, summing out a variable, say *X_{1}*, results in a factor on the other variables, *X_{2},...,X_{j}*, defined by

(∑_{X_{1}} f)(X_{2},...,X_{j}) = f(X_{1}=v_{1},X_{2},...,X_{j}) + ··· + f(X_{1}=v_{k},X_{2},...,X_{j}),

where *{v_{1},...,v_{k}}* is the set of possible values of variable *X_{1}*.

*f_{3}* =

| A | B | C | val |
|---|---|---|---|
| t | t | t | 0.03 |
| t | t | f | 0.07 |
| t | f | t | 0.54 |
| t | f | f | 0.36 |
| f | t | t | 0.06 |
| f | t | f | 0.14 |
| f | f | t | 0.48 |
| f | f | f | 0.32 |

*∑_{B} f_{3}* =

| A | C | val |
|---|---|---|
| t | t | 0.57 |
| t | f | 0.43 |
| f | t | 0.54 |
| f | f | 0.46 |

**Example 6.19:** Figure 6.7 gives an example of summing out variable *B* from the factor *f_{3}(A,B,C)*; the result, *∑_{B} f_{3}*, is a factor on *A,C*. Notice how

(∑_{B} f_{3})(A=t,C=f) = f_{3}(A=t,B=t,C=f) + f_{3}(A=t,B=f,C=f) = 0.07 + 0.36 = 0.43.
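The product and summing-out operations of Examples 6.18 and 6.19 can be sketched directly in Python. The `multiply` and `sum_out` helpers and the dict-of-tuples representation are assumptions for illustration; the *f_{1}* and *f_{2}* values are consistent with the product table in Figure 6.6.

```python
from itertools import product as cartesian

DOMAIN = ("t", "f")  # domains assumed binary for this sketch

def multiply(vars1, t1, vars2, t2):
    """Product of two factors; the result is on the union of the variables."""
    out_vars = vars1 + [v for v in vars2 if v not in vars1]
    out = {}
    for key in cartesian(DOMAIN, repeat=len(out_vars)):
        env = dict(zip(out_vars, key))
        k1 = tuple(env[v] for v in vars1)
        k2 = tuple(env[v] for v in vars2)
        out[key] = t1[k1] * t2[k2]
    return out_vars, out

def sum_out(variables, table, var):
    """Sum a variable out of a factor."""
    i = variables.index(var)
    out = {}
    for key, p in table.items():
        k = key[:i] + key[i+1:]
        out[k] = out.get(k, 0.0) + p
    return variables[:i] + variables[i+1:], out

f1 = (["A", "B"], {("t","t"): 0.1, ("t","f"): 0.9, ("f","t"): 0.2, ("f","f"): 0.8})
f2 = (["B", "C"], {("t","t"): 0.3, ("t","f"): 0.7, ("f","t"): 0.6, ("f","f"): 0.4})

vars12, t12 = multiply(f1[0], f1[1], f2[0], f2[1])
print(round(t12[("t", "f", "f")], 2))   # f1(A=t,B=f) * f2(B=f,C=f)

vars_b, tb = sum_out(vars12, t12, "B")
print(round(tb[("t", "f")], 2))         # sums the B=t and B=f entries
```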

A conditional probability distribution *P(X|Y_{1},...,Y_{j})* can be seen as a factor *f* on *X,Y_{1},...,Y_{j}*, where

f(X=u,Y_{1}=v_{1},...,Y_{j}=v_{j}) = P(X=u|Y_{1}=v_{1}∧···∧Y_{j}=v_{j}).

Usually, humans prefer the *P(·|·)* notation, but internally
the computer just treats conditional probabilities as factors.

The **belief network inference problem** is the problem of computing the
posterior distribution of a variable, given some evidence.

The problem of computing posterior probabilities can be reduced to the
problem of computing the probability of conjunctions. Given evidence
*Y_{1}=v_{1},...,Y_{j}=v_{j}* and query variable *Z*:

P(Z|Y_{1}=v_{1},...,Y_{j}=v_{j}) = P(Z,Y_{1}=v_{1},...,Y_{j}=v_{j}) / P(Y_{1}=v_{1},...,Y_{j}=v_{j}) = P(Z,Y_{1}=v_{1},...,Y_{j}=v_{j}) / ∑_{z} P(Z=z,Y_{1}=v_{1},...,Y_{j}=v_{j}).

So the agent computes the factor *P(Z,Y_{1}=v_{1},...,Y_{j}=v_{j})* and normalizes. Note that this is a factor only of *Z*; given a value for *Z*, it returns a number that is the probability of the conjunction of the evidence and that value for *Z*.

Suppose the variables of the belief network are *X_{1},...,X_{n}*. To compute the factor *P(Z,Y_{1}=v_{1},...,Y_{j}=v_{j})*, sum out the other variables from the joint distribution. Suppose *Z_{1},...,Z_{k}* is an enumeration of the other variables in the belief network - that is,

{Z_{1},...,Z_{k}} = {X_{1},...,X_{n}} - {Z} - {Y_{1},...,Y_{j}}.

The probability of *Z* conjoined with the evidence is

P(Z,Y_{1}=v_{1},...,Y_{j}=v_{j}) = ∑_{Z_{k}}···∑_{Z_{1}} P(X_{1},...,X_{n})_{Y_{1}=v_{1},...,Y_{j}=v_{j}}.

The order in which the variables *Z_{i}* are summed out is an **elimination ordering**.

Note how this is related to the possible worlds semantics of probability. There is a possible world for each assignment of a value to each variable. The **joint probability distribution**, *P(X_{1},...,X_{n})*, gives the probability (or measure) for each possible world. The VE algorithm thus selects the worlds with the observed values for the *Y_{i}*'s and sums over the possible worlds with the same value for *Z*. This corresponds to the definition of conditional probability. However, VE does this more efficiently than by summing over all of the worlds.

By the rule for conjunction of probabilities and the definition of a belief network,

P(X_{1},...,X_{n}) = P(X_{1}|parents(X_{1})) ×···× P(X_{n}|parents(X_{n})),

where *parents(X_{i})* is the set of parents of variable *X_{i}*.

We have now reduced the belief network inference problem to a
problem of summing out a set of variables from a product of factors.
To solve this problem efficiently, we use the distributive law learned
in high school: to compute a sum of products such as
*xy+xz* efficiently, distribute out the common factor (here
*x*), which results in *x(y+z)*. This is the essence of the VE algorithm. We
call the elements multiplied together "factors" because of the use
of the term in algebra. Initially, the factors represent the
conditional probability distributions, but the intermediate factors are just
functions on variables that are created by adding and multiplying
factors.

To compute the posterior distribution of a query variable given observations:

- Construct a factor for each conditional probability distribution.
- Eliminate each of the non-query variables:
  - if the variable is observed, its value is set to the observed value in each of the factors in which the variable appears;
  - otherwise, the variable is summed out.
- Multiply the remaining factors and normalize.

To sum out a variable *Z* from a product *f_{1},...,f_{k}* of factors, first partition the factors into those that do not contain *Z*, say *f_{1},...,f_{i}*, and those that contain *Z*, say *f_{i+1},...,f_{k}*; then distribute the common factors out of the sum:

∑_{Z} f_{1}×···×f_{k} = f_{1}×···×f_{i}×(∑_{Z} f_{i+1}×···×f_{k}).

VE explicitly constructs a representation (in terms of a multidimensional array, a tree, or a set of rules) of the rightmost factor.

```
procedure VE_BN(Vs, Ps, O, Q)
  Inputs:
    Vs: set of variables
    Ps: set of factors representing the conditional probabilities
    O: set of observations of values on some of the variables
    Q: a query variable
  Output: posterior distribution on Q
  Local: Fs, a set of factors

  Fs ← Ps
  for each X ∈ Vs − {Q} using some elimination ordering do
    if X is observed then
      for each F ∈ Fs that involves X do
        set X in F to its observed value in O
        project F onto its remaining variables
    else
      Rs ← {F ∈ Fs : F involves X}
      let T be the product of the factors in Rs
      N ← ∑_X T
      Fs ← Fs \ Rs ∪ {N}
  let T be the product of the factors in Fs
  N ← ∑_Q T
  return T/N
```

**Figure 6.8:** Variable elimination for belief networks
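The pseudocode can be turned into a short Python sketch. Everything below is a hypothetical minimal implementation assumed for illustration (a dict-of-tuples `Factor`, Boolean domains, no elimination-ordering heuristic), not the book's code, and the fire-alarm numbers are illustrative rather than quoted from Example 6.10. The result can be checked against brute-force enumeration of the joint distribution.

```python
from itertools import product as cartesian

VALS = (True, False)  # all variables assumed Boolean in this sketch

class Factor:
    """An ordered variable list plus a table from value tuples to numbers."""
    def __init__(self, variables, table):
        self.variables, self.table = list(variables), dict(table)

def multiply(f, g):
    """Product of two factors, on the union of their variables."""
    out_vars = f.variables + [v for v in g.variables if v not in f.variables]
    out = {}
    for key in cartesian(VALS, repeat=len(out_vars)):
        env = dict(zip(out_vars, key))
        out[key] = (f.table[tuple(env[v] for v in f.variables)]
                    * g.table[tuple(env[v] for v in g.variables)])
    return Factor(out_vars, out)

def sum_out(f, var):
    """Sum var out of factor f."""
    i = f.variables.index(var)
    out = {}
    for key, p in f.table.items():
        k = key[:i] + key[i+1:]
        out[k] = out.get(k, 0.0) + p
    return Factor(f.variables[:i] + f.variables[i+1:], out)

def observe(f, var, val):
    """Set var to val in f and project onto the remaining variables."""
    i = f.variables.index(var)
    table = {k[:i] + k[i+1:]: p for k, p in f.table.items() if k[i] == val}
    return Factor(f.variables[:i] + f.variables[i+1:], table)

def product_all(factors):
    t = factors[0]
    for f in factors[1:]:
        t = multiply(t, f)
    return t

def ve_bn(vs, ps, o, q):
    """Posterior distribution on query variable q, given observations o."""
    fs = list(ps)
    for x in vs:
        if x == q:
            continue
        if x in o:   # observed: condition every factor that involves x
            fs = [observe(f, x, o[x]) if x in f.variables else f for f in fs]
        else:        # otherwise: multiply the factors involving x, sum x out
            rs = [f for f in fs if x in f.variables]
            fs = [f for f in fs if x not in f.variables]
            fs.append(sum_out(product_all(rs), x))
    t = product_all(fs)
    total = sum(t.table.values())   # prior probability of the evidence
    return {val: t.table[(val,)] / total for val in VALS}

def cpt(child, parents, p_true):
    """Factor for P(child|parents); p_true maps parent-value tuples to
    P(child=True | those parent values)."""
    table = {}
    for pv in cartesian(VALS, repeat=len(parents)):
        table[(True,) + pv] = p_true[pv]
        table[(False,) + pv] = 1.0 - p_true[pv]
    return Factor([child] + parents, table)

# Fire-alarm network with illustrative numbers (assumed for this sketch):
factors = [
    cpt("Tampering", [], {(): 0.02}),
    cpt("Fire", [], {(): 0.01}),
    cpt("Alarm", ["Tampering", "Fire"],
        {(True, True): 0.5, (True, False): 0.85,
         (False, True): 0.99, (False, False): 0.0001}),
    cpt("Smoke", ["Fire"], {(True,): 0.9, (False,): 0.01}),
    cpt("Leaving", ["Alarm"], {(True,): 0.88, (False,): 0.001}),
    cpt("Report", ["Leaving"], {(True,): 0.75, (False,): 0.01}),
]
vs = ["Tampering", "Fire", "Alarm", "Smoke", "Leaving", "Report"]
post = ve_bn(vs, factors, {"Smoke": True, "Report": True}, "Tampering")
print(post)   # posterior P(Tampering | Smoke=true, Report=true)
```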

Figure 6.8 gives pseudocode for the VE algorithm. The elimination ordering can be given a priori or can be computed on the fly. It is worthwhile to select observed variables first in the elimination ordering, because eliminating these simplifies the problem.

This assumes that the query variable is not observed. If it is observed to have a particular value, its posterior probability is just 1 for the observed value and 0 for the other values.

**Example 6.20:** Consider Example 6.10 with the query

P(Tampering|Smoke=true ∧ Report=true).

After eliminating the observed variables, *Smoke* and *Report*, the following factors remain:

| Conditional Probability | Factor |
|---|---|
| P(Tampering) | f_{0}(Tampering) |
| P(Fire) | f_{1}(Fire) |
| P(Alarm\|Tampering,Fire) | f_{2}(Tampering,Fire,Alarm) |
| P(Smoke=true\|Fire) | f_{3}(Fire) |
| P(Leaving\|Alarm) | f_{4}(Alarm,Leaving) |
| P(Report=true\|Leaving) | f_{5}(Leaving) |

The algorithm ignores the conditional probability reading and just works with the factors. The intermediate factors do not always represent a conditional probability.

Suppose *Fire* is selected next in the elimination ordering. To eliminate *Fire*, collect all of the factors containing *Fire* - *f_{1}(Fire)*, *f_{2}(Tampering,Fire,Alarm)*, and *f_{3}(Fire)* - multiply them together, and sum out *Fire* from the resulting factor. Call this factor *f_{6}(Tampering,Alarm)*. At this stage, the following factors remain:

f_{0}(Tampering), f_{4}(Alarm,Leaving), f_{5}(Leaving), f_{6}(Tampering,Alarm).

Suppose *Alarm* is eliminated next. VE multiplies the factors
containing *Alarm* and sums out *Alarm* from the product, giving a
factor, call it *f_{7}*:

f_{7}(Tampering,Leaving) = ∑_{Alarm} f_{4}(Alarm,Leaving) × f_{6}(Tampering,Alarm).

It then has the following factors:

f _{0}(Tampering),f _{5}(Leaving),f _{7}(Tampering,Leaving).

Eliminating *Leaving* results in the factor

f_{8}(Tampering) = ∑_{Leaving}f_{5}(Leaving) ×f_{7}(Tampering,Leaving).

To determine the distribution over *Tampering*, multiply the
remaining factors, giving

f_{9}(Tampering) = f_{0}(Tampering) ×f_{8}(Tampering).

The posterior distribution over tampering is given by

(f_{9}(Tampering))/(∑_{Tampering}f_{9}(Tampering)) .

Note that the denominator is the prior probability of the evidence.

**Example 6.21:** Consider the same network as in the previous example but with the following query:

P(Alarm|Fire=true).

When *Fire* is eliminated, the factor *P(Fire)* becomes a
factor of no variables; it is just a number, *P(Fire=true)*.

Suppose *Report* is eliminated next. It is in one factor, which represents
*P(Report|Leaving)*. Summing over all of the values of *Report* gives
a factor on *Leaving*, all of whose values are 1. This is because
*P(Report=true|Leaving=v)+P(Report=false|Leaving=v) = 1* for any value
*v* of *Leaving*.

Similarly, if you eliminate *Leaving* next, you multiply a factor that
is all 1 by a factor representing *P(Leaving|Alarm)* and sum out
*Leaving*. This, again, results in a factor all of whose values are
1.

Similarly, eliminating *Smoke* results in a factor of no variables,
whose value is *1*. Note that even if smoke had been observed,
eliminating smoke
would result in a factor of no variables, which would not affect the
posterior distribution on *Alarm*.

Eventually, there is only the factor on *Alarm* that represents its
prior probability and a constant factor that will cancel in the
normalization.

The complexity of the algorithm depends on a measure of complexity of
the network. The size of a tabular representation of a factor is
exponential in the number of variables in the factor. The
**treewidth** of a network, given an elimination ordering, is
the maximum number of variables in a factor created by summing out a
variable, given the elimination ordering. The **treewidth** of a
belief network is the minimum treewidth over all elimination
orderings. The treewidth depends only on the graph structure and is a
measure of the sparseness of the graph. The complexity of VE is exponential in the treewidth and linear in the number
of variables. Finding the elimination ordering with minimum treewidth
is NP-hard, but there are some good elimination ordering
heuristics, as discussed for CSP VE.

There are two main ways to speed up this algorithm. Irrelevant variables can be pruned, given the observations and the query. Alternatively, it is possible to compile the graph into a secondary structure that allows for caching of values.