8.4 Probabilistic Inference


8.4.1 Variable Elimination for Belief Networks

The variable elimination (VE) algorithm, as used for finding solutions to CSPs and for optimization with soft constraints, can be adapted to find the posterior distribution for a variable in a belief network given conjunctive evidence. Many of the efficient exact methods are variants of this algorithm.

The algorithm is based on the notion that a belief network specifies a factorization of the joint probability distribution.

Before we provide the algorithm, we define factors and the operations that will be performed on them. Recall that P(X | Y) is a function from variables (or sets of variables) X and Y into the real numbers that, given a value for X and a value for Y, returns the conditional probability of the value for X, given the value for Y. A function of variables is called a factor. The VE algorithm for belief networks manipulates factors to compute posterior probabilities.

Conditional Probability Tables

A conditional probability, P(Y | X1,…,Xk), is a function from the variables Y, X1,…,Xk into non-negative numbers that satisfies the constraint that, for each assignment of values to all of X1,…,Xk, the values for Y sum to 1. That is, given values for all of the variables, the function returns a number, and these numbers satisfy the constraint:

∀x1 … ∀xk:  ∑_{y ∈ domain(Y)} P(Y=y | X1=x1, …, Xk=xk) = 1    (8.1)

With a finite set of variables with finite domains, conditional probabilities can be implemented as arrays. If there is an ordering of the variables (e.g., alphabetical) and the values in the domains are mapped into non-negative integers, there is a unique representation of each factor as a one-dimensional array that is indexed by natural numbers. This representation for a conditional probability is called a conditional probability table or CPT.

If the child variable is treated the same as the parent variables, the information is redundant; more numbers are specified than are required, and a table could be inconsistent if it does not satisfy the above constraint. Using the redundant representation is common, but the following two methods are also used to specify and store probabilities:

  • Store unnormalized probabilities, which are non-negative numbers that are proportional to the probability. The probability can be computed by normalizing: dividing each value by the sum of the values, summing over all values for the domain of Y.

  • Store the probability for all-but-one of the values of Y. In this case, the probability of the remaining value can be computed to obey the constraint above. In particular, if Y is binary, we only need to represent the probability for one value, say Y=true, and the probability for the other value, Y=false, can be computed from this.

Fire    P(smoke | Fire)
true    0.9
false   0.01

Fire    Tampering   P(alarm | Fire, Tampering)
true    true        0.5
true    false       0.99
false   true        0.85
false   false       0.0001

X   Y   P(Z=t | X, Y)
t   t   0.1
t   f   0.2
f   t   0.4
f   f   0.3

Figure 8.6: Conditional probability tables
Example 8.18.

Figure 8.6 shows three conditional probability tables. The first is P(Smoke | Fire) and the second is P(Alarm | Fire, Tampering), from Example 8.15; both use Boolean variables.

These tables do not specify the probability for the child being false. This can be computed from the given probabilities, for example,

P(Alarm=false | Fire=false, Tampering=true) = 1 − 0.85 = 0.15

The third is a simple example, with domains {t,f}, which will be used in the following examples.

Given a total ordering of the parents, such as Fire is before Tampering in the right table, and a total ordering of the values, such as true is before false, the table can be specified by giving the array of numbers in lexicographic order, such as [0.5,0.99,0.85,0.0001].
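To make the flat-array representation concrete, here is a minimal Python sketch (not code from the book) that stores P(alarm | Fire, Tampering) as a one-dimensional list in lexicographic order and computes the index for an assignment of the parents. The variable ordering, value ordering, and function name are choices made for this illustration.

```python
# A CPT stored as a one-dimensional array in lexicographic order.
# Assumed orderings for this sketch: variables Fire then Tampering;
# within each domain, True before False.
domains = {"Fire": [True, False], "Tampering": [True, False]}
parent_order = ["Fire", "Tampering"]

# P(alarm=true | Fire, Tampering) from Figure 8.6, listed in lexicographic order.
p_alarm = [0.5, 0.99, 0.85, 0.0001]

def cpt_index(assignment, order, domains):
    """Map an assignment of parent values to its position in the flat array."""
    index = 0
    for var in order:
        index = index * len(domains[var]) + domains[var].index(assignment[var])
    return index

# P(alarm=true | Fire=false, Tampering=true) is 0.85.
print(p_alarm[cpt_index({"Fire": False, "Tampering": True}, parent_order, domains)])
```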

Factors

A factor is a function from a set of random variables into a number. A factor f on variables X1,…,Xj is written as f(X1,…,Xj). The variables X1,…,Xj are the variables of the factor f, and f is a factor on X1,…,Xj.

Conditional probabilities are factors that also obey the constraint of Equation 8.1. This section describes some operations on factors, including conditioning, multiplying factors and summing out variables. The operations can be used for conditional probabilities, but do not necessarily result in conditional probabilities.

Suppose f(X1,…,Xj) is a factor and each vi is an element of the domain of Xi. f(X1=v1, X2=v2, …, Xj=vj) is a number that is the value of f when each Xi has value vi. Some of the variables of a factor can be assigned values to make a new factor on the other variables. This operation is called conditioning on the values of the variables that are assigned. For example, f(X1=v1, X2,…,Xj), sometimes written as f(X1, X2,…,Xj)_{X1=v1}, where v1 is an element of the domain of variable X1, is a factor on X2,…,Xj.

Example 8.19.

Figure 8.7 shows a factor r(X,Y,Z) on variables X, Y and Z as a table. This assumes that each variable is binary with domain {t,f}. This factor could be obtained from the last conditional probability table given in Figure 8.6. Figure 8.7 also gives a table for the factor r(X=t,Y,Z), which is a factor on Y,Z. Similarly, r(X=t,Y,Z=f) is a factor on Y, and r(X=t,Y=f,Z=f) is a number.

r(X,Y,Z) =
X   Y   Z   val
t   t   t   0.1
t   t   f   0.9
t   f   t   0.2
t   f   f   0.8
f   t   t   0.4
f   t   f   0.6
f   f   t   0.3
f   f   f   0.7

r(X=t,Y,Z) =
Y   Z   val
t   t   0.1
t   f   0.9
f   t   0.2
f   f   0.8

r(X=t,Y,Z=f) =
Y   val
t   0.9
f   0.8

r(X=t,Y=f,Z=f) = 0.8

Figure 8.7: An example factor and assignments
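As a concrete illustration of conditioning, the following Python sketch represents a factor as a tuple of variable names plus a dictionary from value tuples to numbers, and builds r(X, Y, Z) from the values in Figure 8.7. The representation and the helper names (make_factor, condition) are assumptions of this sketch, not a prescribed implementation.

```python
from itertools import product

def make_factor(variables, values, domain=(True, False)):
    """Build a factor (variables, table) from values listed in lexicographic order."""
    keys = list(product(domain, repeat=len(variables)))
    return (tuple(variables), dict(zip(keys, values)))

# r(X, Y, Z) from Figure 8.7, with t represented as True and f as False.
r = make_factor(["X", "Y", "Z"], [0.1, 0.9, 0.2, 0.8, 0.4, 0.6, 0.3, 0.7])

def condition(factor, assignment):
    """Assign some variables of a factor, producing a factor on the remaining ones."""
    variables, table = factor
    kept = [v for v in variables if v not in assignment]
    new_table = {}
    for key, value in table.items():
        row = dict(zip(variables, key))
        if all(row[v] == val for v, val in assignment.items()):
            new_table[tuple(row[v] for v in kept)] = value
    return (tuple(kept), new_table)

print(condition(r, {"X": True}))              # r(X=t, Y, Z), a factor on Y and Z
print(condition(r, {"X": True, "Z": False}))  # r(X=t, Y, Z=f), a factor on Y
```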

Factors can be multiplied together. Suppose f1 and f2 are factors, where f1 is a factor that contains variables X1,…,Xi and Y1,…,Yj, and f2 is a factor with variables Y1,…,Yj and Z1,…,Zk, where Y1,…,Yj are the variables in common to f1 and f2. The product of f1 and f2, written f1 * f2, is a factor on the union of the variables, namely X1,…,Xi, Y1,…,Yj, Z1,…,Zk, defined by:

(f1 * f2)(X1,…,Xi, Y1,…,Yj, Z1,…,Zk) = f1(X1,…,Xi, Y1,…,Yj) * f2(Y1,…,Yj, Z1,…,Zk).
f1 =
A   B   val
t   t   0.1
t   f   0.9
f   t   0.2
f   f   0.8

f2 =
B   C   val
t   t   0.3
t   f   0.7
f   t   0.6
f   f   0.4

f1 * f2 =
A   B   C   val
t   t   t   0.03
t   t   f   0.07
t   f   t   0.54
t   f   f   0.36
f   t   t   0.06
f   t   f   0.14
f   f   t   0.48
f   f   f   0.32

Figure 8.8: Multiplying factors
Example 8.20.

Figure 8.8 shows the product of f1(A,B) and f2(B,C), which is a factor on A,B,C. Note that (f1*f2)(A=t,B=f,C=f)=f1(A=t,B=f)*f2(B=f,C=f)=0.9*0.4=0.36.
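A sketch of factor multiplication under the same dictionary-based representation, using the values of f1 and f2 from Figure 8.8 (the representation and helper name are, again, choices made for this illustration):

```python
from itertools import product

# f1(A, B) and f2(B, C) from Figure 8.8, as (variables, table) pairs.
f1 = (("A", "B"), {(True, True): 0.1, (True, False): 0.9,
                   (False, True): 0.2, (False, False): 0.8})
f2 = (("B", "C"), {(True, True): 0.3, (True, False): 0.7,
                   (False, True): 0.6, (False, False): 0.4})

def multiply(fa, fb, domain=(True, False)):
    """Pointwise product: a factor on the union of the variables of fa and fb."""
    vars_a, table_a = fa
    vars_b, table_b = fb
    out_vars = tuple(vars_a) + tuple(v for v in vars_b if v not in vars_a)
    out_table = {}
    for key in product(domain, repeat=len(out_vars)):
        row = dict(zip(out_vars, key))
        out_table[key] = (table_a[tuple(row[v] for v in vars_a)]
                          * table_b[tuple(row[v] for v in vars_b)])
    return (out_vars, out_table)

product_factor = multiply(f1, f2)
# (f1 * f2)(A=t, B=f, C=f) = 0.9 * 0.4 = 0.36
print(product_factor[1][(True, False, False)])
```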

The remaining operation is to sum out a variable in a factor. Given factor f(X1,…,Xj), summing out a variable, say X1, results in a factor on the other variables, X2,…,Xj, defined by

(∑_{X1} f)(X2,…,Xj) = f(X1=v1, X2,…,Xj) + ⋯ + f(X1=vk, X2,…,Xj),

where {v1,…,vk} is the set of possible values of variable X1.

f3 =
A   B   C   val
t   t   t   0.03
t   t   f   0.07
t   f   t   0.54
t   f   f   0.36
f   t   t   0.06
f   t   f   0.14
f   f   t   0.48
f   f   f   0.32

∑_B f3 =
A   C   val
t   t   0.57
t   f   0.43
f   t   0.54
f   f   0.46

Figure 8.9: Summing out a variable from a factor
Example 8.21.

Figure 8.9 gives an example of summing out variable B from a factor f3(A,B,C), resulting in a factor on A and C. Notice how

(∑_B f3)(A=t, C=f) = f3(A=t, B=t, C=f) + f3(A=t, B=f, C=f)
                   = 0.07 + 0.36
                   = 0.43
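In the same dictionary-based sketch, summing out a variable adds the table entries that agree on the remaining variables. The following reproduces the calculation above, with the values of f3 taken from Figure 8.9 (an illustration, not a prescribed implementation):

```python
# f3(A, B, C) from Figure 8.9, as a (variables, table) pair.
f3 = (("A", "B", "C"), {(True, True, True): 0.03, (True, True, False): 0.07,
                        (True, False, True): 0.54, (True, False, False): 0.36,
                        (False, True, True): 0.06, (False, True, False): 0.14,
                        (False, False, True): 0.48, (False, False, False): 0.32})

def sum_out(factor, var):
    """Sum a variable out of a factor, giving a factor on the remaining variables."""
    variables, table = factor
    kept = tuple(v for v in variables if v != var)
    out_table = {}
    for key, value in table.items():
        row = dict(zip(variables, key))
        new_key = tuple(row[v] for v in kept)
        out_table[new_key] = out_table.get(new_key, 0.0) + value
    return (kept, out_table)

# (sum_B f3)(A=t, C=f) = 0.07 + 0.36 = 0.43
print(sum_out(f3, "B")[1][(True, False)])
```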

Variable Elimination

Given evidence Y1=v1, …, Yj=vj, and query variable or variables Q, the problem of computing the posterior distribution on Q can be reduced to the problem of computing the probability of conjunctions:

P(Q | Y1=v1, …, Yj=vj) = P(Q, Y1=v1, …, Yj=vj) / P(Y1=v1, …, Yj=vj)
                       = P(Q, Y1=v1, …, Yj=vj) / ∑_Q P(Q, Y1=v1, …, Yj=vj).

The algorithm computes the factor P(Q, Y1=v1, …, Yj=vj) and normalizes. Note that this is a factor only of Q; given a value for Q, it returns a number that is the probability of the conjunction of the evidence and the value for Q.

Suppose the variables of the belief network are X1,…,Xn. To compute the factor P(Q, Y1=v1, …, Yj=vj), sum out the other variables from the joint distribution. Suppose Z1,…,Zk is an enumeration of the other variables in the belief network, that is,

{Z1,…,Zk} = {X1,…,Xn} ∖ {Q, Y1,…,Yj},

and the variables Zi are ordered according to an elimination ordering.

The probability of Q conjoined with the evidence is

P(Q, Y1=v1, …, Yj=vj) = ∑_{Zk} ⋯ ∑_{Z1} P(X1,…,Xn)_{Y1=v1, …, Yj=vj}.

By the chain rule and the definition of a belief network,

P(X1,…,Xn) = ∏_{i=1}^{n} P(Xi | parents(Xi)),

where parents(Xi) is the set of parents of variable Xi.

The belief network inference problem is thus reduced to a problem of summing out a set of variables from a product of factors. The distribution law specifies that a sum of products such as xy + xz can be simplified by distributing out the common factors (here x), which results in x(y + z). The resulting form is more efficient to compute. Distributing out common factors is the essence of the VE algorithm. The elements multiplied together are called “factors” because of the use of the term in algebra. Initially, the factors represent the conditional probability distributions, but the intermediate factors are just functions on variables created by adding and multiplying factors.

To compute the posterior distribution of a query variable given observations:

  1. Construct a factor for each conditional probability distribution.

  2. Eliminate each of the non-query variables:

    • if the variable is observed, its value is set to the observed value in each of the factors in which the variable appears,

    • otherwise the variable is summed out.

  3. Multiply the remaining factors and normalize.

To sum out a variable Z from a product f1,…,fk of factors, first partition the factors into those not containing Z, say f1,…,fi, and those containing Z, fi+1,…,fk; then distribute the common factors out of the sum:

∑_Z f1 * ⋯ * fk = f1 * ⋯ * fi * (∑_Z fi+1 * ⋯ * fk).

VE explicitly constructs a representation (in terms of a multidimensional array, a tree, or a set of rules) of the rightmost factor.

1:  procedure VE_BN(Vs, Ps, e, Q)
2:      Inputs
3:          Vs: set of variables
4:          Ps: set of factors representing the conditional probabilities
5:          e: the evidence, a variable-value assignment to some of the variables
6:          Q: a query variable
7:      Output
8:          posterior distribution on Q
9:      Fs := Ps                          ▷ Fs is the current set of factors
10:     for each X ∈ Vs − {Q} using some elimination ordering do
11:         if X is observed then
12:             for each F ∈ Fs that involves X do
13:                 assign X in F to its observed value in e
14:         else
15:             Rs := {F ∈ Fs : F involves X}
16:             let T be the product of the factors in Rs
17:             N := ∑_X T
18:             Fs := Fs ∖ Rs ∪ {N}
19:     let T be the product of the factors in Fs
20:     N := ∑_Q T
21:     return T/N

Figure 8.10: Variable elimination for belief networks

Figure 8.10 gives pseudocode for the VE algorithm. The elimination ordering could be given a priori or computed on the fly. It is worthwhile to select observed variables first in the elimination ordering, because eliminating these simplifies the problem.

This algorithm assumes that the query variable is not observed. If it is observed to have a particular value, its posterior probability is just 1 for the observed value and 0 for the other values.
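Putting the factor operations together, the following Python sketch mirrors the structure of Figure 8.10. It is an illustrative implementation under the assumptions of the earlier sketches (factors as (variables, table) pairs over finite domains, here generalized to per-variable domains), not the book's code. Observed variables are conditioned away up front, which corresponds to placing them first in the elimination ordering; any remaining ordering supplied by the caller is used as given.

```python
from itertools import product
from functools import reduce

# A factor is (variables, table): a tuple of variable names and a dictionary
# mapping value tuples (ordered as in variables) to numbers.

def condition(factor, evidence):
    """Assign any observed variables appearing in the factor; drop them from its scope."""
    variables, table = factor
    kept = [v for v in variables if v not in evidence]
    out = {}
    for key, value in table.items():
        row = dict(zip(variables, key))
        if all(row[v] == val for v, val in evidence.items() if v in variables):
            out[tuple(row[v] for v in kept)] = value
    return (tuple(kept), out)

def multiply(fa, fb, domains):
    """Pointwise product of two factors."""
    vars_a, table_a = fa
    vars_b, table_b = fb
    out_vars = tuple(vars_a) + tuple(v for v in vars_b if v not in vars_a)
    out = {}
    for key in product(*(domains[v] for v in out_vars)):
        row = dict(zip(out_vars, key))
        out[key] = (table_a[tuple(row[v] for v in vars_a)]
                    * table_b[tuple(row[v] for v in vars_b)])
    return (out_vars, out)

def sum_out(factor, var):
    """Sum a variable out of a factor."""
    variables, table = factor
    kept = tuple(v for v in variables if v != var)
    out = {}
    for key, value in table.items():
        new_key = tuple(dict(zip(variables, key))[v] for v in kept)
        out[new_key] = out.get(new_key, 0.0) + value
    return (kept, out)

def ve_bn(variables, factors, evidence, query, domains, ordering=None):
    """Posterior distribution on query given evidence, by variable elimination."""
    fs = [condition(f, evidence) for f in factors]          # set observed values
    if ordering is None:
        ordering = [v for v in variables if v != query and v not in evidence]
    for x in ordering:                                      # eliminate each non-query variable
        involving_x = [f for f in fs if x in f[0]]
        if involving_x:
            t = reduce(lambda a, b: multiply(a, b, domains), involving_x)
            fs = [f for f in fs if x not in f[0]] + [sum_out(t, x)]
    t = reduce(lambda a, b: multiply(a, b, domains), fs)    # remaining factors involve at most the query
    total = sum(t[1].values())                              # normalizing constant: probability of the evidence
    return {key[0]: value / total for key, value in t[1].items()}
```

For the query of Example 8.22 below, once the network's variables, CPT factors, and domains have been constructed, a call would look like ve_bn(Vs, Ps, {"Smoke": True, "Report": True}, "Tampering", domains); the returned dictionary is the posterior distribution over Tampering.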

Example 8.22.

Consider Example 8.15 with the query

P(Tampering | Smoke=true ∧ Report=true).

Suppose VE first eliminates the observed variables, Smoke and Report. After these are eliminated, the following factors remain:

Conditional Probability        Factor
P(Tampering)                   f0(Tampering)
P(Fire)                        f1(Fire)
P(Alarm | Tampering, Fire)     f2(Tampering, Fire, Alarm)
P(Smoke=true | Fire)           f3(Fire)
P(Leaving | Alarm)             f4(Alarm, Leaving)
P(Report=true | Leaving)       f5(Leaving)

Suppose Fire is next in the elimination ordering. To eliminate Fire, collect all of the factors containing Fire, namely f1(Fire), f2(Tampering,Fire,Alarm), and f3(Fire), multiply them together, and sum out Fire from the resulting factor. Call this factor f6(Tampering,Alarm). At this stage, Fs contains the factors:

f0(Tampering),f4(Alarm,Leaving),f5(Leaving),f6(Tampering,Alarm).

Suppose Alarm is eliminated next. VE multiplies the factors containing Alarm and sums out Alarm from the product, giving a factor, call it f7:

f7(Tampering, Leaving) = ∑_{Alarm} f4(Alarm, Leaving) * f6(Tampering, Alarm).

Fs then contains the factors:

f0(Tampering),f5(Leaving),f7(Tampering,Leaving).

Eliminating Leaving results in the factor

f8(Tampering) = ∑_{Leaving} f5(Leaving) * f7(Tampering, Leaving).

To determine the distribution over Tampering, multiply the remaining factors, giving

f9(Tampering)=f0(Tampering)*f8(Tampering).

The posterior distribution over tampering is given by

f9(Tampering) / ∑_{Tampering} f9(Tampering).

Note that the denominator is the prior probability of the evidence, namely P(Smoke=true ∧ Report=true).

Example 8.23.

Consider the same network as in the previous example but with the following query:

P(Alarm | Fire=true).

When Fire is eliminated, the factor P(Fire) becomes a factor of no variables; it is just a number, P(Fire=true).

Suppose Report is eliminated next. It is in one factor, which represents P(Report | Leaving). Summing over all of the values of Report gives a factor on Leaving, all of whose values are 1. This is because P(Report=true | Leaving=v) + P(Report=false | Leaving=v) = 1 for any value v of Leaving.

If Leaving is eliminated next, a factor that is all 1 is multiplied by a factor representing P(Leaving | Alarm), and Leaving is summed out. This, again, results in a factor all of whose values are 1.

Similarly, eliminating Smoke results in a factor of no variables, whose value is 1. Note that even if smoke had also been observed, eliminating Smoke would result in a factor of no variables, which would not affect the posterior distribution on Alarm.

Eventually, there is only the factor on Alarm that represents its prior probability and a constant factor that will cancel in the normalization.

To speed up the inference, variables that are irrelevant to answering a query given the observations can be pruned. In particular, any node that has no observed or queried descendants and is itself not observed or queried may be pruned. This may result in a smaller network with fewer factors and variables. For example, to compute P(Alarm | Fire=true), the variables Report, Leaving, and Smoke may be pruned, as the sketch below illustrates.
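This pruning rule is equivalent to keeping only the variables that are observed, queried, or ancestors of an observed or queried variable. A small Python sketch of that rule, assuming the parent sets of the fire-alarm network as listed in Example 8.22 (the function name and representation are choices made for this illustration):

```python
def relevant_variables(parents, observed, queried):
    """Variables that are observed, queried, or ancestors of such variables;
    all others have no observed or queried descendants and may be pruned."""
    relevant = set()
    frontier = list(observed) + list(queried)
    while frontier:
        var = frontier.pop()
        if var not in relevant:
            relevant.add(var)
            frontier.extend(parents[var])   # parents of a relevant variable are relevant
    return relevant

# Parent sets matching the factors of Example 8.22.
parents = {"Tampering": [], "Fire": [], "Alarm": ["Tampering", "Fire"],
           "Smoke": ["Fire"], "Leaving": ["Alarm"], "Report": ["Leaving"]}

# For P(Alarm | Fire=true): prints the set {'Alarm', 'Fire', 'Tampering'}
# (in some order), so Report, Leaving, and Smoke are pruned.
print(relevant_variables(parents, observed={"Fire"}, queried={"Alarm"}))
```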

The complexity of the algorithm depends on a measure of complexity of the network. The size of a tabular representation of a factor is exponential in the number of variables in the factor. The treewidth of a network, given an elimination ordering, is the maximum number of variables in a factor created by summing out a variable when using the elimination ordering. The treewidth of a belief network is the minimum treewidth over all elimination orderings. The treewidth depends only on the graph structure and is a measure of the sparseness of the graph. The complexity of VE is exponential in the treewidth and linear in the number of variables. Finding the elimination ordering with minimum treewidth is NP-hard, but there are some good elimination ordering heuristics, as discussed for CSP VE.
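The width induced by a particular elimination ordering can be computed directly from the factor scopes, without building any tables. The following sketch is an illustration of that calculation; the scopes are taken from the conditioned factors of Example 8.22, and the function name is an assumption of this sketch.

```python
def width_of_ordering(scopes, ordering):
    """Maximum number of variables in a factor created by summing out a variable,
    when eliminating along the given ordering."""
    scopes = [set(s) for s in scopes]
    width = 0
    for x in ordering:
        involving_x = [s for s in scopes if x in s]
        new_scope = set().union(*involving_x) - {x} if involving_x else set()
        width = max(width, len(new_scope))
        scopes = [s for s in scopes if x not in s] + [new_scope]
    return width

# Scopes of f0..f5 after conditioning on Smoke and Report, as in Example 8.22.
scopes = [{"Tampering"}, {"Fire"}, {"Tampering", "Fire", "Alarm"},
          {"Fire"}, {"Alarm", "Leaving"}, {"Leaving"}]
print(width_of_ordering(scopes, ["Fire", "Alarm", "Leaving"]))   # 2
```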

Example 8.24.

Consider the belief network of Figure 8.4. To compute the probability of Sneezing, the variables Fever and "Achoo" sound may be pruned, as they have no children and are not observed or queried. Summing out Season involves multiplying the factors

P(Season), P(Pollen | Season), P(Influenza | Season)

and results in a factor on Influenza and Pollen. The treewidth of this belief network is two; there is an ordering of the variables that only constructs factors of size one or two, and there is no ordering of the variables that has a smaller treewidth.

The moral graph of a belief network is the undirected graph where there is an arc between any two nodes that appear in the same initial factor. This is obtained by “marrying” the parents of a node, and removing the directions. If we prune as outlined in the previous paragraph, moralize the graph, and remove all observed variables, only those variables connected to the query in this graph are relevant to answering the query. The other variables can be pruned.

Many modern exact algorithms use what is essentially variable elimination, and they speed it up by preprocessing as much as possible into a secondary structure before any evidence arrives. This is appropriate when, for example, the same belief network may be used for many different queries, and where observations are added incrementally. The algorithms save intermediate results so that evidence can be added incrementally. Unfortunately, extensive preprocessing, allowing arbitrary sequences of observations and deriving the posterior on each variable, precludes pruning the network. So, for each application, you need to choose whether you will save more by pruning irrelevant variables for each query and observation or by preprocessing before any observations arrive.