9.5 Exact Probabilistic Inference

The task of probabilistic inference is to compute the probability of a query variable Q given evidence e, where e is a conjunction of assignments of values to some of the variables; that is, to compute P(Q | e). This is the task that is needed for decision making.

Given evidence e, and query variable or variables Q, the problem of computing the posterior distribution on Q can be reduced to the problem of computing the probability of conjunctions, using the definition of conditioning:

P(Q | e) = P(Q, e) / P(e)
         = P(Q, e) / ∑_Q P(Q, e)    (9.4)

where ∑_Q means summing over all of the values of Q, which is called marginalizing Q, and ∑_Q P(Q, e) is an abbreviation for ∑_{v ∈ domain(Q)} P(Q=v, e), where Q=v is the proposition that is true when Q has value v.

To compute the probability of a conjunction, first introduce all of the other variables. With all of the variables, the definition of a belief network can be used to decompose the resulting joint probability into a product of factors.

Suppose {X_1, ..., X_n} is the set of all variables in the belief network and the evidence e is Y_1=v_1 ∧ ... ∧ Y_j=v_j, where {Y_1, ..., Y_j} ⊆ {X_1, ..., X_n}.

Suppose Z_1, ..., Z_k is a total ordering of the variables in the belief network other than the observed variables, Y_1, ..., Y_j, and the query variable, Q. That is:

{Z_1, ..., Z_k} = {X_1, ..., X_n} ∖ {Q, Y_1, ..., Y_j}.

The probability of Q conjoined with the evidence is

P(Q, e) = ∑_{Z_1} ⋯ ∑_{Z_k} P(X_1, ..., X_n)_e

where the subscript e means that the observed variables are set to their observed values.

Once all of the variables are modeled, the definition of a belief network, Equation 9.1, shows how to decompose the joint probability:

P(X_1, ..., X_n) = ∏_{i=1}^{n} P(X_i | parents(X_i))

where parents(X_i) is the set of parents of variable X_i.

So

P(Q, e) = ∑_{Z_1} ⋯ ∑_{Z_k} (∏_{i=1}^{n} P(X_i | parents(X_i)))_e.    (9.5)

The belief-network inference problem is thus reduced to a problem of summing out a set of variables from a product of factors.

Naive Search-Based Algorithm

The naive search-based algorithm shown in Figure 9.9 computes Equation 9.5 from the outside-in. The algorithm is similar to the search-based algorithm for constraint satisfaction problems (see Figure 4.1).

1: procedure prob_dfs(Con, Fs)
2:      Inputs
3:          Con: context
4:          Fs: set of factors
5:      Output
6:          ∑_{v_1} ⋯ ∑_{v_k} ∏_{f ∈ Fs} f_{Con} where {v_1, ..., v_k} = vars(Fs) ∖ vars(Con)
7:      if Fs = {} then
8:          return 1
9:      else if there is f ∈ Fs that can be evaluated in Con then
10:          return eval(f, Con) × prob_dfs(Con, Fs ∖ {f})
11:      else
12:          select variable var in vars(Fs) ∖ vars(Con)
13:          sum := 0
14:          for val in domain(var) do
15:               sum := sum + prob_dfs(Con ∪ {var=val}, Fs)
16:          return sum
Figure 9.9: Naive search-based inference algorithm

The algorithm takes in a context, Con, an assignment of values to some of the variables, and a set, Fs, of factors. The aim is to compute the value of the product of the factors, evaluated with the assignment Con, after all of the variables not in the context are summed out. Initially the context is usually the observations and an assignment of a value to the query variable, and the factors are the conditional probabilities of a belief network.

In this algorithm, the function vars returns the set of variables in a context or set of factors, so vars(context) is the set of variables assigned in context and vars(factors) is the set of all variables that appear in any factor in factors. Thus, vars(factors) ∖ vars(context) is the set of variables that appear in factors but are not assigned in the context. The algorithm returns the value of the product of the factors with all of these variables summed out. This is the probability of the context, given the model specified by the factors.

If there are no factors, the probability is 1. Factors are evaluated as soon as they can be. A factor on a set of variables can be evaluated when all of the variables in the factor are assigned in the context. Some representations of factors, such as decision trees, can sometimes be evaluated before all of the variables are assigned. The function eval(f, context) returns the value of factor f in context; it is only called when it can return a value.

If neither of the previous conditions arise, the algorithm selects a variable and branches on it, summing the values. This is marginalizing the selected variable. Which variable is selected does not affect the correctness of the algorithm, but may affect the efficiency.
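The following Python sketch is one way to render Figure 9.9; the factor representation (a tuple of variables paired with a table of values), the domains argument, and the helper names are assumptions made for illustration, not the book's code.

```python
# A factor is a pair (variables, table): variables is a tuple of variable names
# and table maps a tuple of values for those variables (in order) to a number.
# A context is a dict mapping assigned variables to their values.

def can_evaluate(factor, context):
    variables, _ = factor
    return all(v in context for v in variables)

def eval_factor(factor, context):
    variables, table = factor
    return table[tuple(context[v] for v in variables)]

def prob_dfs(context, factors, domains):
    if not factors:
        return 1.0
    for f in factors:
        if can_evaluate(f, context):
            rest = [g for g in factors if g is not f]
            return eval_factor(f, context) * prob_dfs(context, rest, domains)
    # otherwise branch on some unassigned variable and sum over its values
    var = next(v for vs, _ in factors for v in vs if v not in context)
    return sum(prob_dfs({**context, var: val}, factors, domains)
               for val in domains[var])
```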

Exploiting Independence in Inference

The naive search-based algorithm evaluates the sum of Equation 9.5 directly. However, it is possible to do much better.

The distribution law specifies that a sum of products, such as xy+xz, can be simplified by distributing out the common factors (here x), which results in x(y+z). The resulting form is more efficient to compute, because it has only one addition and one multiplication. Distributing out common factors is the essence of the efficient exact algorithms. The elements multiplied together are called “factors” because of the use of the term in algebra.

Figure 9.10: A simple chain belief network, A → B → C → D
Example 9.22.

Consider the simple chain belief network of Figure 9.10. Suppose you want to compute P(D), the probability of D with no observations; the denominator of Equation 9.4 is 1, so let's ignore that. The variable order A, B, C results in the factorization

P(D) = ∑_A ∑_B ∑_C P(A, B, C, D)
     = ∑_A ∑_B ∑_C P(A) P(B|A) P(C|B) P(D|C).

Consider the rightmost sum (∑_C). The left two terms do not include C and can be distributed out of that sum, giving

P(D) = ∑_A ∑_B P(A) P(B|A) ∑_C P(C|B) P(D|C).

Similarly, P(A) can be distributed out of ∑_B, giving

P(D) = ∑_A P(A) ∑_B P(B|A) ∑_C P(C|B) P(D|C).

The search-based algorithms, described below, start at the leftmost sum and sum the values. Variable elimination [see Section 9.5.2] is the dynamic programming variant that creates factors from the inside-out; in this case creating an explicit representation for ∑_C P(C|B) P(D|C), which is multiplied by P(B|A), and then B is summed out, and so on.

You could choose a different ordering of the variables, which can result in a different factorization. For example, the variable order C,B,A gives

P(D) = ∑_C ∑_B ∑_A P(A, B, C, D)
     = ∑_C ∑_B ∑_A P(A) P(B|A) P(C|B) P(D|C)
     = ∑_C P(D|C) ∑_B P(C|B) ∑_A P(A) P(B|A).
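To make the savings concrete, the following sketch evaluates P(D=true) for a Boolean chain like Figure 9.10, once as the naive triple sum and once in the distributed form of the first factorization; the CPT numbers are invented purely for illustration. The two forms agree, but the distributed form performs fewer multiplications.

```python
# Hypothetical CPTs for a Boolean chain A -> B -> C -> D (numbers invented).
p_a = {True: 0.3, False: 0.7}
p_b_a = {(True, True): 0.9, (True, False): 0.2,
         (False, True): 0.1, (False, False): 0.8}   # p_b_a[(b, a)] = P(B=b | A=a)
p_c_b = {(True, True): 0.6, (True, False): 0.4,
         (False, True): 0.4, (False, False): 0.6}
p_d_c = {(True, True): 0.7, (True, False): 0.5,
         (False, True): 0.3, (False, False): 0.5}

vals = [True, False]
d = True

# Naive form: sum_A sum_B sum_C P(A) P(B|A) P(C|B) P(D|C)
naive = sum(p_a[a] * p_b_a[(b, a)] * p_c_b[(c, b)] * p_d_c[(d, c)]
            for a in vals for b in vals for c in vals)

# Distributed form: sum_A P(A) sum_B P(B|A) sum_C P(C|B) P(D|C)
distributed = sum(p_a[a] * sum(p_b_a[(b, a)] * sum(p_c_b[(c, b)] * p_d_c[(d, c)]
                                                   for c in vals)
                               for b in vals)
                  for a in vals)

assert abs(naive - distributed) < 1e-12
print(naive)
```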
Example 9.23.

Consider the query P(B | D=true) in the simple chain belief network of Figure 9.10. Let's write D=true as d. Equation 9.4 gives

P(B | d) = P(B, d) / ∑_B P(B, d).

Computing P(B,d) with the variable order A,C gives

P(B, d) = ∑_A ∑_C P(A, B, C, d)
        = ∑_A ∑_C P(A) P(B|A) P(C|B) P(d|C).

Distributing the factors out of the sums gives

P(B, d) = (∑_A P(A) P(B|A)) (∑_C P(C|B) P(d|C)).

This gives two independent problems that can be solved separately. Note that, in this case, the variable order C,A gives the same factorization.

9.5.1 Recursive Conditioning

Recursive conditioning is one of the most efficient exact algorithms. It adds two features to the naive search-based inference algorithm of Figure 9.9, namely caching and recognizing subproblems that can be solved independently. The motivation for this can be seen in the following two examples.

Example 9.24.

Consider the chain belief network of Figure 9.10 with the last factorization of Example 9.22. First splitting on D, then C, B, A, results in the tree of Figure 9.11(a). In general, you need to split on the values of the query variable to compute the normalizing constant; while it is not necessary here (because there is no evidence, it must be 1), we show it anyway.

Figure 9.11: Search space of (a) naive search and (c) recursive conditioning, with (b) the factors that can be evaluated

The left branch is for the variable having value false, and the right branch for the variable having value true.

Part (b) of the figure shows the stage where each factor can be evaluated. P(D|C) can be evaluated when D and C are assigned values. P(C|B) can be evaluated when D, C, and B have been assigned values. For this variable ordering, all variables need to be split before P(A) and P(B|A) can be evaluated.

Consider the first recursive call after D and C have been assigned false, and P(D|C) has been evaluated:

prob_dfs({D=false, C=false}, {P(C|B), P(B|A), P(A)}).

Notice that the factors do not involve D, and so this call will have the same value as when the context is {D=true, C=false}.

Recursive conditioning “forgets” D, and stores the computed value for

prob_dfs({C=false}, {P(C|B), P(B|A), P(A)})

in a cache (a dictionary), and then retrieves that value for the D=true case.

Similarly, after D, C, and B have been assigned a value, the factors are {P(B|A), P(A)}, which will give the same values for C=true as for C=false. Again, recursive conditioning saves the result of prob_dfs({B=false}, {P(B|A), P(A)}) in a cache and retrieves that value for the other value of C.

This results in the search space of Figure 9.11(c), where the dashed lines indicate a cache lookup.

Example 9.25.

Consider the query P(B | d) for the belief network in Example 9.23. The assigned value, D=true, is given as the initial context. After B is assigned a value (which is needed for the denominator of Equation 9.4), the naive search algorithm has the recursive call

prob_dfs({B=false, D=true}, {P(D|C), P(C|B), P(B|A), P(A)}).

This can be decomposed into two independent problems, one using the factors P(D|C), P(C|B), which has the same value independently of the assignment to A, and the other for the factors P(B|A), P(A), which has the same value independently of the assignment to C. These factors only have B in common, which has been assigned a value in the context.

Thus, this call can be decomposed into the product:

prob_dfs({B=false, D=true}, {P(D|C), P(C|B)})
    × prob_dfs({B=false, D=true}, {P(B|A), P(A)}).
1: procedure prob_RC(Con, Fs)
2:      Inputs
3:          Con: context
4:          Fs: set of factors
5:      Output
6:          ∑_{v_1} ⋯ ∑_{v_k} ∏_{f ∈ Fs} f_{Con} where {v_1, ..., v_k} = vars(Fs) ∖ vars(Con)
7:      if (Con, Fs) in cache with value v then
8:          return v
9:      else if there is X=v in Con such that X ∉ vars(Fs) then
10:          return prob_RC({X=v ∈ Con : X ∈ vars(Fs)}, Fs)
11:      else if fs = {f ∈ Fs : f can be evaluated in Con} is not empty then
12:          return (∏_{f ∈ fs} eval(f, Con)) × prob_RC(Con, Fs ∖ fs)
13:      else if Fs = Fs_1 ⊔ Fs_2 where the variables in vars(Fs_1) ∩ vars(Fs_2) are all assigned in Con then
14:          return prob_RC(Con, Fs_1) × prob_RC(Con, Fs_2)
15:      else
16:          select variable var in vars(Fs) ∖ vars(Con)
17:          sum := 0
18:          for val in domain(var) do
19:               sum := sum + prob_RC(Con ∪ {var=val}, Fs)
20:          add (Con, Fs) to cache with value sum
21:          return sum
Figure 9.12: Recursive conditioning algorithm

The recursive conditioning algorithm is shown in Figure 9.12.

The function vars returns the set of variables, as described previously for the naive algorithm.

The cache is a global dictionary that can be reused through recursive calls. It remembers what has already been computed. The keys are (Con,Fs) pairs. In order to cover the base case, the cache is initialized to have value 1 for the key ({},{}).

The first condition (line 7) is to check whether the answer has already been computed and stored in the cache. If it has, the stored value is returned. Note that because of the initialization of the cache, this can replace the base case for the naive search.

The second condition (line 9) checks whether there are variables in the context that are not in any factors, and “forgets” them. This is needed so that the caching works properly, as in Example 9.24.

The third condition (line 11) checks whether some factors can be evaluated in the given context. If so, these factors are evaluated and removed from the factors of the recursive call.

The fourth condition (line 13) checks whether the problem can be split into independent parts that can be solved separately and multiplied. Here ⊔ is the disjoint union, where Fs = Fs_1 ⊔ Fs_2 means that the non-empty sets Fs_1 and Fs_2 have no element in common and together contain all of the elements of Fs; thus, Fs_1 and Fs_2 form a partition of Fs. If all of the variables the two parts have in common are assigned in the context, the problem can be decomposed into two independent problems. The algorithm could, alternatively, split Fs into multiple non-empty sets of factors such that no pair of sets shares an unassigned variable. These sets would then be the connected components, where two factors are connected if they both contain a variable that is not assigned in the context. If there are more than two connected components, the algorithm of Figure 9.12 finds them in recursive calls. Note that the subsequent recursive calls typically result in forgetting the variables that appear only in the other partition, so this does not need to be done explicitly here.

If none of the other conditions hold (line 15), the algorithm branches. It selects a variable, and sums over all of the values of this variable. Which variable is selected does not affect the result, but can affect the efficiency. Finding the optimal variable to select is an NP-hard problem. Often the algorithm is given a split-order of variables in the order they should be branched on. The algorithm saves the computed result in the cache.
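Below is a Python sketch of Figure 9.12. It is illustrative only: the (variables, table) factor representation, the frozenset-based cache key, and the helper names are assumptions, not the book's AIPython code. The line numbers in the comments refer to Figure 9.12.

```python
from math import prod

cache = {}   # maps (context, factors) keys to computed values; clear between queries

def _key(context, factors):
    return (frozenset(context.items()), frozenset(id(f) for f in factors))

def _components(factors, context):
    """Group factors that are connected by a shared unassigned variable."""
    parts = []                                  # list of (unassigned-variable set, factor list)
    for f in factors:
        unassigned = {v for v in f[0] if v not in context}
        touching = [p for p in parts if p[0] & unassigned]
        for p in touching:
            parts.remove(p)
        parts.append((unassigned.union(*[p[0] for p in touching]),
                      [f] + [g for p in touching for g in p[1]]))
    return [p[1] for p in parts]

def prob_rc(context, factors, domains):
    factors = list(factors)
    if not factors:                              # empty product
        return 1.0
    k = _key(context, factors)
    if k in cache:                               # line 7: cache hit
        return cache[k]
    fs_vars = {v for vs, _ in factors for v in vs}
    if any(v not in fs_vars for v in context):   # line 9: forget irrelevant assignments
        kept = {v: val for v, val in context.items() if v in fs_vars}
        return prob_rc(kept, factors, domains)
    ready = [f for f in factors if all(v in context for v in f[0])]
    if ready:                                    # line 11: evaluate fully assigned factors
        value = prod(table[tuple(context[v] for v in vs)] for vs, table in ready)
        rest = [f for f in factors if not all(v in context for v in f[0])]
        return value * prob_rc(context, rest, domains)
    parts = _components(factors, context)
    if len(parts) > 1:                           # line 13: independent subproblems
        return prod(prob_rc(context, part, domains) for part in parts)
    var = next(v for v in fs_vars if v not in context)   # line 15: branch and sum
    total = sum(prob_rc({**context, var: val}, factors, domains)
                for val in domains[var])
    cache[k] = total                             # line 20: remember the result
    return total
```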

Examples of the application of recursive conditioning are given in Example 9.24 and Example 9.25.

Example 9.26.

Consider Example 9.13 with the query

P(Tampering | Smoke=true ∧ Report=true).

The initial call (using abbreviated names, with lower-case letters denoting the value true) for the Ta=false case is

prob_RC({sm, re, ¬ta}, {P(Ta), P(Fi), P(Al|Ta,Fi), P(Sm|Fi), P(Le|Al), P(Re|Le)}).

P(Ta) can be evaluated. If it then splits on Le, the call is

prob_RC({sm, re, ¬ta, ¬le}, {P(Fi), P(Al|Ta,Fi), P(Sm|Fi), P(Le|Al), P(Re|Le)}).

At this stage P(Re|Le) can be evaluated, and then Re is not in the remaining factors, so re can be forgotten in the context. If it then splits on Al, the call is

prob_RC({sm, ¬ta, ¬le, ¬al}, {P(Fi), P(Al|Ta,Fi), P(Sm|Fi), P(Le|Al)}).

Then P(Le|Al) can be evaluated, and ¬le forgotten. Splitting on Fi results in the call

prob_RC({sm, ¬ta, ¬al, ¬fi}, {P(Fi), P(Al|Ta,Fi), P(Sm|Fi)}).

All of the factors can be evaluated, and so this call returns the value P(¬fi) × P(¬al | ¬ta, ¬fi) × P(sm | ¬fi).

The search continues; where the variables are forgotten, the cache is used to look up values rather than computing them. This can be summarized as

Variable Split on    Factors Evaluated                  Forgotten
Ta                   P(Ta)
Le                   P(Re|Le)                           Re
Al                   P(Le|Al)                           Le
Fi                   P(Fi), P(Al|Ta,Fi), P(Sm|Fi)

Different variable orderings can result in quite different evaluations; for example, for the same query P(Ta | sm, re), the splitting order Ta, Fi, Al, Le gives:

Variable Split on    Factors Evaluated                  Forgotten
Ta                   P(Ta)
Fi                   P(Fi), P(Sm|Fi)                    Sm
Al                   P(Al|Ta,Fi)                        Fi, Ta
Le                   P(Le|Al), P(Re|Le)

If Al was split before Fi and Le, there are two independent subproblems that can be solved separately; subproblem 1 involving Fi and subproblem 2 involving Le:

Variable Split on    Factors Evaluated                  Forgotten
Ta                   P(Ta)
Al
subproblem 1:
Fi                   P(Fi), P(Sm|Fi), P(Al|Ta,Fi)       Sm, Fi
subproblem 2:
Le                   P(Le|Al), P(Re|Le)                 Le, Re

The solutions to the subproblems are multiplied.

Some of the conditions checked in Figure 9.12 only depend on the variables in the context, and not on the values of the variables. The results of these conditions can be precomputed and looked up during the search, saving a lot of time.

Some representations of factors (e.g., tables or logistic regression) require all variables to be assigned before they can be evaluated, whereas other representations (e.g., decision trees or weighted logical formulas) can often be evaluated before all of the variables involved have been assigned values. To exploit this short-circuiting effectively, there needs to be good indexing to determine which factors can be evaluated.
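As an illustration of a representation that supports such short-circuiting, the sketch below (the class names and structure are assumptions for illustration, not the book's code) represents a factor as a decision tree; whether it can be evaluated depends only on the variables tested along the branch selected by the current context.

```python
# A tree-structured factor that can sometimes be evaluated before all of its
# variables are assigned.

class Leaf:
    def __init__(self, value):
        self.value = value
    def can_evaluate(self, context):
        return True
    def eval(self, context):
        return self.value

class Split:
    """Branch on one variable; children indexed by that variable's value."""
    def __init__(self, variable, children):
        self.variable = variable
        self.children = children          # dict: value -> Leaf or Split
    def can_evaluate(self, context):
        return (self.variable in context and
                self.children[context[self.variable]].can_evaluate(context))
    def eval(self, context):
        return self.children[context[self.variable]].eval(context)

# Hypothetical factor on {X, Y}: when X=False the value is 0.5 regardless of Y,
# so it can be evaluated as soon as X=False is in the context.
f = Split("X", {False: Leaf(0.5),
                True: Split("Y", {False: Leaf(0.1), True: Leaf(0.9)})})
print(f.can_evaluate({"X": False}))   # True: Y is not needed on this branch
print(f.eval({"X": False}))           # 0.5
```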

9.5.2 Variable Elimination for Belief Networks

Variable elimination (VE) is the dynamic programming variant of recursive conditioning. Consider the decomposition of the probability of the query Q and the evidence e (Equation 9.5, reproduced here):

P(Q, e) = ∑_{Z_1} ⋯ ∑_{Z_k} (∏_{i=1}^{n} P(X_i | parents(X_i)))_e.

VE carries out the rightmost sum, here ∑_{Z_k}, first, eliminating Z_k and producing an explicit representation of the resulting factor. The resulting factorization then contains one fewer variable. This can be repeated until there is an explicit representation of P(Q, e). The resulting factor is a function just of Q; given a value for Q, it evaluates to a number that is the probability of the evidence conjoined with that value for Q. The conditional probability can be obtained by normalization (dividing each value by the sum of the values).

To compute the posterior distribution of a query variable given observations:

  1. Construct a factor for each conditional probability distribution.

  2. Eliminate each of the non-query variables:

     • if the variable is observed, its value is set to the observed value in each of the factors in which the variable appears

     • otherwise, the variable is summed out.

     To sum out a variable Z from a product f_1, ..., f_k of factors, first partition the factors into those not containing Z, say f_1, ..., f_i, and those containing Z, f_{i+1}, ..., f_k; then distribute the common factors out of the sum:

     ∑_Z f_1 ⋯ f_k = f_1 ⋯ f_i (∑_Z f_{i+1} ⋯ f_k).

     Variable elimination explicitly constructs a representation (in terms of a multidimensional array, a tree, or a set of rules) of the rightmost factor. We have then simplified the problem to have one fewer variable.

  3. Multiply the remaining factors and normalize.

1: procedure VE_BN(Vs, Ps, e, Q)
2:      Inputs
3:          Vs: set of variables
4:          Ps: set of factors representing the conditional probabilities
5:          e: the evidence, a variable-value assignment to some of the variables
6:          Q: a query variable
7:      Output
8:          posterior distribution on Q
9:      Fs := Ps      ▷ Fs is the current set of factors
10:      for each X ∈ Vs ∖ {Q} using some elimination ordering do
11:          if X is observed then
12:               for each F ∈ Fs that involves X do
13:                   assign X in F to its observed value in e
14:          else
15:               Rs := {F ∈ Fs : F involves X}
16:               let T be the product of the factors in Rs
17:               construct an explicit representation for N = ∑_X T
18:               Fs := (Fs ∖ Rs) ∪ {N}
19:      let T be the product of the factors in Fs
20:      N := ∑_Q T
21:      return T/N
Figure 9.13: Variable elimination for belief networks

Figure 9.13 gives pseudocode for the VE algorithm. The elimination ordering could be given a priori or computed on the fly. This algorithm assigns the values to the observations as part of the elimination. It is worthwhile to select observed variables first in the elimination ordering, because eliminating these simplifies the problem.

This algorithm assumes that the query variable is not observed. If it is observed to have a particular value, its posterior probability is just 1 for the observed value and 0 for the other values.
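The sketch below is a table-based rendering of Figure 9.13 in Python (illustrative; the factor representation, the observe helper, and the calling convention are assumptions, not the book's AIPython code). Factors are (variables, table) pairs as in the earlier sketches; observe sets the observed values, sum_out marginalizes one variable from a product, and ordering is an elimination ordering over the remaining variables. Restricting the tables in observe is equivalent to assigning X in F to its observed value on line 13 of Figure 9.13.

```python
from itertools import product

def multiply(f1, f2, domains):
    vs1, t1 = f1
    vs2, t2 = f2
    vs = vs1 + tuple(v for v in vs2 if v not in vs1)
    table = {}
    for vals in product(*(domains[v] for v in vs)):
        asg = dict(zip(vs, vals))
        table[vals] = (t1[tuple(asg[v] for v in vs1)] *
                       t2[tuple(asg[v] for v in vs2)])
    return (vs, table)

def sum_out(var, factor, domains):
    vs, t = factor
    out_vs = tuple(v for v in vs if v != var)
    table = {}
    for vals, p in t.items():
        asg = dict(zip(vs, vals))
        key = tuple(asg[v] for v in out_vs)
        table[key] = table.get(key, 0.0) + p
    return (out_vs, table)

def observe(factor, evidence):
    vs, t = factor
    out_vs = tuple(v for v in vs if v not in evidence)
    table = {}
    for vals, p in t.items():
        asg = dict(zip(vs, vals))
        if all(asg[v] == val for v, val in evidence.items() if v in vs):
            table[tuple(asg[v] for v in out_vs)] = p
    return (out_vs, table)

def ve(query, factors, evidence, ordering, domains):
    factors = [observe(f, evidence) for f in factors]
    for var in ordering:                 # the non-query, non-observed variables
        related = [f for f in factors if var in f[0]]
        if not related:
            continue
        t = related[0]
        for f in related[1:]:
            t = multiply(t, f, domains)
        factors = [f for f in factors if var not in f[0]] + [sum_out(var, t, domains)]
    t = factors[0]
    for f in factors[1:]:
        t = multiply(t, f, domains)
    total = sum(t[1].values())           # normalize to get the posterior on the query
    return {vals[0]: p / total for vals, p in t[1].items()}
```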

Example 9.27.

Consider Example 9.13 with the query

P(Tampering | Smoke=true ∧ Report=true).

Suppose it first eliminates the observed variables, Smoke and Report. This creates a factor for P(Smoke=true | Fire), which is just a function of Fire; call it f_0(Fire). Given a value for Fire, this evaluates to a number. P(Report=true | Leaving) becomes a factor f_1(Leaving).

Suppose Fire is next in the elimination ordering. To eliminate Fire, collect all of the factors containing Fire, namely P(Fire), P(Alarm | Tampering, Fire), and f_0(Fire), and replace them with an explicit representation of the resulting factor. This factor is on Tampering and Alarm, as these are the other variables in the factors being multiplied. Call it f_2(Tampering, Alarm):

f_2(Tampering, Alarm) = ∑_{Fire} P(Fire) P(Alarm | Tampering, Fire) f_0(Fire).

We explicitly store a number for each value of Tampering and Alarm. For example, f_2(Tampering=t, Alarm=f) has value

∑_{v ∈ domain(Fire)} P(Fire=v) P(Alarm=f | Tampering=t, Fire=v) f_0(Fire=v).

At this stage, Fs contains the factors

P(Tampering), P(Leaving | Alarm), f_1(Leaving), f_2(Tampering, Alarm).

Suppose Alarm is eliminated next. VE multiplies the factors containing Alarm and sums out Alarm from the product, giving a factor, call it f_3:

f_3(Tampering, Leaving) = ∑_{Alarm} P(Leaving | Alarm) f_2(Tampering, Alarm).

Fs then contains the factors

P(Tampering), f_1(Leaving), f_3(Tampering, Leaving).

Eliminating Leaving results in the factor

f_4(Tampering) = ∑_{Leaving} f_1(Leaving) f_3(Tampering, Leaving).

The posterior distribution over Tampering is given by

P(Tampering) f_4(Tampering) / ∑_{Tampering} P(Tampering) f_4(Tampering).

Note that the denominator is the prior probability of the evidence, namely P(Smoke=true ∧ Report=true).

Example 9.28.

Consider the same network as in the previous example but with the query

P(Alarm | Fire=true).

When Fire is eliminated, the factor P(Fire) becomes a factor of no variables; it is just a number, P(Fire=true).

Suppose Report is eliminated next. It is in one factor, which represents P(Report | Leaving). Summing over all of the values of Report gives a factor on Leaving, all of whose values are 1. This is because P(Report=true | Leaving=v) + P(Report=false | Leaving=v) = 1 for any value v of Leaving.

If Leaving is eliminated next, a factor that is all 1 is multiplied by a factor representing P(Leaving | Alarm) and Leaving is summed out. This, again, results in a factor all of whose values are 1.

Similarly, eliminating Smoke results in a factor of no variables, whose value is 1. Note that even if smoke had also been observed, eliminating Smoke would result in a factor of no variables, which would not affect the posterior distribution on Alarm.

Eventually, there is only the factor on Alarm that represents its posterior probability and a constant factor that will cancel in the normalization.

9.5.3 Exploiting Structure and Compilation

When the ordering of the splits in recursive conditioning is the reverse of the elimination ordering of variable elimination, the two algorithms carry out the same multiplications and additions, and the values stored in the factors of variable elimination are the same numbers as stored in the cache of recursive conditioning. The main differences are:

  • Recursive conditioning just evaluates the input factors, whereas variable elimination requires an explicit representation of intermediate factors. This means that recursive conditioning can be used where the conditional probabilities can be treated as black boxes that output a value when enough of their variables are instantiated. Thus, recursive conditioning allows for diverse representations of conditional probabilities, such as those presented in Section 9.3.3. It can also exploit structure in representations that can be evaluated before all of their variables are assigned; for example, in a decision tree representation of a conditional probability, it is possible to evaluate a factor before all values are assigned.

  • A straightforward implementation of variable elimination can use a more space-efficient tabular representation of the intermediate factors than a straightforward implementation of the cache in recursive conditioning.

  • When there are zeros in a product, there is no need to compute the rest of the product. Zeros arise from constraints where some combination of values to variables is impossible. This is called determinism. Determinism can be incorporated into recursive conditioning in a straightforward way.

  • Recursive conditioning can act as an any-space algorithm, which can use any amount of space by just not storing some values in the cache, but instead recomputing them, trading off space for time.

To speed up the inference, variables that are irrelevant to a conditional probability query can be pruned. In particular, any node that has no observed or queried descendants and is itself not observed or queried may be pruned. This may result in a smaller network with fewer factors and variables. For example, to compute P(AlarmFire=true), in the running example, the variables Report, Leaving, and Smoke may be pruned.
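This pruning needs only the graph structure. The sketch below is an illustrative rendering (the dictionary-of-children representation and the function name are assumptions): it repeatedly removes variables that are neither observed nor queried and have no remaining children, which removes exactly the nodes with no observed or queried descendants.

```python
def prune_irrelevant(children, observed, query):
    """children: dict mapping each variable to the set of its child variables.
    Repeatedly removes unobserved, unqueried variables with no remaining children."""
    keep = set(children)
    changed = True
    while changed:
        changed = False
        for v in list(keep):
            has_child = any(c in keep for c in children[v])
            if not has_child and v not in observed and v != query:
                keep.remove(v)
                changed = True
    return keep

# The running example: to compute P(Alarm | Fire=true), Report, Leaving,
# and Smoke are pruned, leaving Tampering, Fire, and Alarm.
g = {"Tampering": {"Alarm"}, "Fire": {"Alarm", "Smoke"}, "Alarm": {"Leaving"},
     "Smoke": set(), "Leaving": {"Report"}, "Report": set()}
print(prune_irrelevant(g, observed={"Fire"}, query="Alarm"))
```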

The complexity of the algorithms depends on a measure of complexity of the network. The size of a tabular representation of a factor is exponential in the number of variables in the factor. The treewidth of a network for an elimination ordering is the maximum number of variables in a factor created by variable elimination summing out the variables in the elimination ordering. The treewidth of a belief network is the minimum treewidth over all elimination orderings. The treewidth depends only on the graph structure and is a measure of the sparseness of the graph. The complexity of recursive conditioning and variable elimination is exponential in the treewidth and linear in the number of variables. Finding the elimination ordering with minimum treewidth is NP-hard, but there are some good elimination ordering heuristics, as discussed for CSP VE.
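One common greedy heuristic is min-fill (mentioned here as an illustration, not as the book's specific recommendation): repeatedly eliminate the variable whose elimination would add the fewest edges between its neighbours in the graph that connects variables sharing a factor. A sketch, with assumed names and representation:

```python
def min_fill_ordering(neighbours, keep=frozenset()):
    """neighbours: dict mapping each variable to the set of variables that share
    a factor with it. Returns a greedy elimination ordering of the variables not
    in keep (e.g., keep = the query and observed variables)."""
    nb = {v: set(ns) for v, ns in neighbours.items()}
    remaining = [v for v in nb if v not in keep]
    order = []
    while remaining:
        def fill(v):
            # number of edges eliminating v would add between its neighbours
            ns = list(nb[v])
            return sum(1 for i, a in enumerate(ns) for b in ns[i + 1:]
                       if b not in nb[a])
        var = min(remaining, key=fill)
        ns = list(nb[var])
        for a in ns:                      # add the fill edges
            for b in ns:
                if a != b:
                    nb[a].add(b)
        for u in nb:                      # remove var from the graph
            nb[u].discard(var)
        del nb[var]
        remaining.remove(var)
        order.append(var)
    return order
```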

Example 9.29.

Consider the belief network of Figure 9.4. To compute the probability of Sneezing, the variables Fever and "Achoo" sound may be pruned, as they have no children and are not observed or queried. Summing out Season involves multiplying the factors

P(Season), P(Pollen | Season), P(Influenza | Season)

and results in a factor on Influenza and Pollen. The treewidth of this belief network is 2; there is an ordering of the variables that only constructs factors of size 1 or 2, and there is no ordering of the variables that has a smaller treewidth.

Many modern exact algorithms compile the network into a secondary structure, by essentially carrying out variable elimination or recursive conditioning symbolically, summing out all of the variables, to produce a probabilistic circuit that just contains the symbolic variables combined by sum and product. The circuit is like Figure 9.11, where the branches in (c) form the sums and the conditional probabilities in (b) are multiplied at the appropriate place, as in the factorization of Example 9.22. The circuit is an arithmetic expression with shared structure. When all variables are summed out, the circuit provides an expensive way to compute the value 1. When some variables are observed, the circuit uses the observed values instead of summing out these variables; the output is the probability of the evidence. When a query variable is then set to a value, the circuit outputs the probability of the evidence and the query variable, from which a conditional probability can be computed. It is possible to compute the posterior probability of all variables with two passes through the circuit, but that is beyond the scope of this book.
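To make the idea concrete, here is a tiny hand-built circuit for a hypothetical two-variable network A → B with invented numbers P(a)=0.3, P(b|a)=0.9, and P(b|¬a)=0.2 (the representation and numbers are assumptions for illustration, not the output of a compiler). Indicator leaves default to 1, so evaluating the circuit with no evidence gives 1; setting the indicators for B to match an observation gives the probability of that evidence.

```python
# Nodes: ("ind", var, val) indicator leaves, ("const", number) parameter leaves,
# ("+", children) sum nodes, and ("*", children) product nodes.

def evaluate(node, indicators):
    kind = node[0]
    if kind == "ind":                      # indicator for Var = val; defaults to 1
        return indicators.get((node[1], node[2]), 1.0)
    if kind == "const":                    # a conditional probability entry
        return node[1]
    child_values = [evaluate(c, indicators) for c in node[1]]
    if kind == "+":
        return sum(child_values)
    result = 1.0                           # "*" node
    for v in child_values:
        result *= v
    return result

def branch(a):                             # the sub-circuit for A = a
    p_a = 0.3 if a else 0.7
    p_b = {True: 0.9, False: 0.1} if a else {True: 0.2, False: 0.8}
    return ("*", [("ind", "A", a), ("const", p_a),
                  ("+", [("*", [("ind", "B", b), ("const", p_b[b])])
                         for b in (True, False)])])

circuit = ("+", [branch(True), branch(False)])

print(evaluate(circuit, {}))                                      # 1.0
print(evaluate(circuit, {("B", True): 1.0, ("B", False): 0.0}))   # 0.41 = P(B=true)
```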

Compilation is appropriate when the same belief network is used for many queries, or when observations are added incrementally. Unfortunately, extensive preprocessing, allowing arbitrary sequences of observations and deriving the posterior on each variable, precludes pruning the network. So for each application you need to choose whether you will save more by pruning irrelevant variables for each query or by preprocessing before there are any observations or queries.