The variable elimination (VE) algorithm, as used for finding solutions to CSPs and for optimization with soft constraints, can be adapted to find the posterior distribution for a variable in a belief network given conjunctive evidence. Many of the efficient exact methods are variants of this algorithm.
The algorithm is based on the notion that a belief network specifies a factorization of the joint probability distribution.
Before we provide the algorithm, we define factors and the operations that will be performed on them. Recall that P(X | Y) is a function from variables (or sets of variables) X and Y into the real numbers that, given a value for X and a value for Y, returns the conditional probability of the value for X, given the value for Y. A function of variables is called a factor. The VE algorithm for belief networks manipulates factors to compute posterior probabilities.
A conditional probability P(Y | X1, …, Xk) is a function from the variables Y, X1, …, Xk into non-negative numbers that satisfies the constraint that, for each assignment of values to X1, …, Xk, the values for Y sum to 1. That is, given values v1, …, vk for X1, …, Xk, the function satisfies:

Σ_{v ∈ domain(Y)} P(Y=v | X1=v1, …, Xk=vk) = 1.    (8.1)
With a finite set of variables with finite domains, conditional probabilities can be implemented as arrays. If there is an ordering of the variables (e.g., alphabetical) and the values in the domains are mapped into non-negative integers, there is a unique representation of each factor as a one-dimensional array that is indexed by natural numbers. This representation for a conditional probability is called a conditional probability table or CPT.
If the child variable is treated the same as the parent variables, the information is redundant; more numbers are specified than are required, and a table could be inconsistent if it does not satisfy the above constraint. Using the redundant representation is common, but the following two methods are also used to specify and store probabilities:
Store unnormalized probabilities, which are non-negative numbers that are proportional to the probability. The probability can be computed by normalizing: dividing each value by the sum of the values, where the sum is over all values in the domain of Y.
Store the probability for all-but-one of the values of Y. In this case, the probability of the remaining value can be computed to obey the constraint above. In particular, if Y is binary, we only need to represent the probability for one value, say Y=true, and the probability for the other value, Y=false, can be computed from this.
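As a small illustration of these two alternatives (the numbers below are made up, not taken from the text), the following Python fragment shows both schemes for a Boolean child variable with fixed parent values:

# Unnormalized probabilities: normalize by dividing by the sum.
unnormalized = [3.0, 1.0]                       # proportional to P(Y=true), P(Y=false)
total = sum(unnormalized)
normalized = [v / total for v in unnormalized]  # [0.75, 0.25]

# All-but-one values: store P(Y=true) and recover P(Y=false) from the constraint.
p_true = 0.9
p_false = 1 - p_true                            # 0.1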
Figure 8.6: Conditional probability tables.

Fire    P(Smoke=true | Fire)
true    0.9
false   0.01

Fire    Tampering    P(Alarm=true | Fire, Tampering)
true    true         0.5
true    false        0.99
false   true         0.85
false   false        0.0001

X       Y       P(Z=true | X, Y)
true    true    0.1
true    false   0.2
false   true    0.4
false   false   0.3
Figure 8.6 shows three conditional probability tables. On the top left is P(Smoke | Fire) and on the top right is P(Alarm | Fire, Tampering), from Example 8.15, which use Boolean variables.
These tables do not specify the probability for the child being false. This can be computed from the given probabilities, for example, P(Smoke=false | Fire=true) = 1 − 0.9 = 0.1.
On the bottom is a simple example over Boolean variables X, Y, and Z, which will be used in the following examples.
Given a total ordering of the parents, such as Fire is before Tampering in the right table, and a total ordering of the values, such as true is before false, the table can be specified by giving the array of numbers in lexicographic order, such as [0.5, 0.99, 0.85, 0.0001].
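To make the flat-array representation concrete, here is a minimal Python sketch (not from the book's code; the helper name lex_index and the dictionaries are our own) using the array for the right table of Figure 8.6:

def lex_index(assignment, variables, domains):
    """Return the position of an assignment in the lexicographically ordered array.

    Variables are taken in the given order; later variables vary fastest.
    """
    index = 0
    for var in variables:
        index = index * len(domains[var]) + domains[var].index(assignment[var])
    return index

# P(Alarm=true | Fire, Tampering), parents ordered Fire before Tampering,
# values ordered true before false, stored in lexicographic order.
domains = {"Fire": [True, False], "Tampering": [True, False]}
alarm_cpt = [0.5, 0.99, 0.85, 0.0001]

i = lex_index({"Fire": False, "Tampering": True}, ["Fire", "Tampering"], domains)
print(alarm_cpt[i])   # 0.85, the probability of an alarm given no fire and tampering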
A factor is a function from a set of random variables into a number. A factor f on variables X1, …, Xj is written as f(X1, …, Xj). The variables X1, …, Xj are the variables of the factor f, and f is a factor on X1, …, Xj.
Conditional probabilities are factors that also obey the constraint of Equation 8.1. This section describes some operations on factors, including conditioning, multiplying factors and summing out variables. The operations can be used for conditional probabilities, but do not necessarily result in conditional probabilities.
Suppose f(X1, …, Xj) is a factor and each vi is an element of the domain of Xi. Then f(X1=v1, X2=v2, …, Xj=vj) is a number that is the value of f when each Xi has value vi. Some of the variables of a factor can be assigned values to make a new factor on the other variables. This operation is called conditioning on the values of the variables that are assigned. For example, f(X1=v1, X2, …, Xj), sometimes written as f(X1, …, Xj)_{X1=v1}, where v1 is an element of the domain of variable X1, is a factor on X2, …, Xj.
Figure 8.7 shows a factor r(X, Y, Z) on variables X, Y, and Z as a table. This assumes that each variable is Boolean with domain {true, false}. This factor could be obtained from the last conditional probability table given in Figure 8.6. Figure 8.7 also gives a table for the factor r(X=true, Y, Z), which is a factor on Y and Z. Similarly, r(X=true, Y, Z=false) is a factor on Y, and r(X=true, Y=true, Z=false) is a number.
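The following Python sketch (our own code, not the book's) represents a factor as a tuple of variable names together with a dictionary mapping value tuples to numbers, and implements the conditioning operation. The values of r follow the bottom table of Figure 8.6, expanded to include the probability of Z being false:

# A factor is a pair: a tuple of variable names and a dictionary from
# value tuples (in the same order) to numbers.
r = (("X", "Y", "Z"),
     {(True, True, True): 0.1,   (True, True, False): 0.9,
      (True, False, True): 0.2,  (True, False, False): 0.8,
      (False, True, True): 0.4,  (False, True, False): 0.6,
      (False, False, True): 0.3, (False, False, False): 0.7})

def assign(factor, var, val):
    """Condition on var=val: keep the matching rows and drop the variable."""
    variables, table = factor
    if var not in variables:
        return factor
    i = variables.index(var)
    new_vars = variables[:i] + variables[i + 1:]
    new_table = {key[:i] + key[i + 1:]: p
                 for key, p in table.items() if key[i] == val}
    return (new_vars, new_table)

r_x_true = assign(r, "X", True)                                   # a factor on Y and Z
number = assign(assign(r_x_true, "Y", True), "Z", False)[1][()]   # 0.9, a single number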
Factors can be multiplied together. Suppose f1 and f2 are factors, where f1 is a factor that contains variables X1, …, Xi and Y1, …, Yj, and f2 is a factor with variables Y1, …, Yj and Z1, …, Zk, where Y1, …, Yj are the variables in common to f1 and f2. The product of f1 and f2, written f1 × f2, is a factor on the union of the variables, namely X1, …, Xi, Y1, …, Yj, Z1, …, Zk, defined by:

(f1 × f2)(X1, …, Xi, Y1, …, Yj, Z1, …, Zk) = f1(X1, …, Xi, Y1, …, Yj) × f2(Y1, …, Yj, Z1, …, Zk).
Figure 8.8: Multiplying factors.

X       Y       f1(X, Y)
true    true    0.1
true    false   0.9
false   true    0.2
false   false   0.8

Y       Z       f2(Y, Z)
true    true    0.3
true    false   0.7
false   true    0.6
false   false   0.4

X       Y       Z       (f1 × f2)(X, Y, Z)
true    true    true    0.03
true    true    false   0.07
true    false   true    0.54
true    false   false   0.36
false   true    true    0.06
false   true    false   0.14
false   false   true    0.48
false   false   false   0.32
Figure 8.8 shows the product of f1(X, Y) and f2(Y, Z), which is a factor on X, Y, Z. Note that each value of the product is obtained by multiplying the corresponding values of the two factors; for example, (f1 × f2)(X=true, Y=false, Z=false) = f1(X=true, Y=false) × f2(Y=false, Z=false) = 0.9 × 0.4 = 0.36.
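Continuing the sketch above (same factor representation; again our own code), factor multiplication can be written as follows, using the numbers of Figure 8.8:

def multiply(f1, f2):
    """Pointwise product of two factors, on the union of their variables."""
    vars1, t1 = f1
    vars2, t2 = f2
    new_vars = vars1 + tuple(v for v in vars2 if v not in vars1)
    new_table = {}
    for key1, p1 in t1.items():
        asg = dict(zip(vars1, key1))
        for key2, p2 in t2.items():
            # Only combine rows that agree on the shared variables.
            if all(asg.get(v, x) == x for v, x in zip(vars2, key2)):
                full = dict(asg, **dict(zip(vars2, key2)))
                new_table[tuple(full[v] for v in new_vars)] = p1 * p2
    return (new_vars, new_table)

f1 = (("X", "Y"), {(True, True): 0.1, (True, False): 0.9,
                   (False, True): 0.2, (False, False): 0.8})
f2 = (("Y", "Z"), {(True, True): 0.3, (True, False): 0.7,
                   (False, True): 0.6, (False, False): 0.4})
product = multiply(f1, f2)
print(product[1][(True, False, False)])   # 0.9 * 0.4 = 0.36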
The remaining operation is to sum out a variable in a factor. Given a factor f(X1, …, Xj), summing out a variable, say X1, results in a factor on the other variables, X2, …, Xj, defined by

(Σ_X1 f)(X2, …, Xj) = f(X1=v1, X2, …, Xj) + ⋯ + f(X1=vk, X2, …, Xj)

where {v1, …, vk} is the set of possible values of variable X1.
Figure 8.9 gives an example of summing out a variable from a factor on three variables, which results in a factor on the other two variables. Notice how each value of the resulting factor is the sum of the values of the original factor over all values of the summed-out variable, with the values of the remaining variables held fixed.
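Summing out a variable, with the same representation and reusing f1, f2, and multiply from the sketch above:

def sum_out(factor, var):
    """Sum var out of a factor, producing a factor on the remaining variables."""
    variables, table = factor
    i = variables.index(var)
    new_vars = variables[:i] + variables[i + 1:]
    new_table = {}
    for key, p in table.items():
        reduced = key[:i] + key[i + 1:]
        new_table[reduced] = new_table.get(reduced, 0.0) + p
    return (new_vars, new_table)

# Summing Y out of the product computed above gives a factor on X and Z; its
# value for X=true, Z=true is 0.1*0.3 + 0.9*0.6 = 0.57.
marginal = sum_out(multiply(f1, f2), "Y")
print(marginal[1][(True, True)])          # 0.57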
Given evidence Y1=v1, …, Yj=vj, and a query variable (or variables) Q, the problem of computing the posterior distribution on Q can be reduced to the problem of computing the probability of conjunctions:

P(Q | Y1=v1, …, Yj=vj) = P(Q ∧ Y1=v1 ∧ ⋯ ∧ Yj=vj) / P(Y1=v1 ∧ ⋯ ∧ Yj=vj) = P(Q ∧ Y1=v1 ∧ ⋯ ∧ Yj=vj) / Σ_Q P(Q ∧ Y1=v1 ∧ ⋯ ∧ Yj=vj).

The algorithm computes the factor P(Q ∧ Y1=v1 ∧ ⋯ ∧ Yj=vj) and normalizes. Note that this is a factor only of Q; given a value for Q, it returns a number that is the probability of the conjunction of the evidence and that value for Q.
Suppose the variables of the belief network are X1, …, Xn. To compute the factor P(Q ∧ Y1=v1 ∧ ⋯ ∧ Yj=vj), sum out the other variables from the joint distribution. Suppose Z1, …, Zk is an enumeration of the other variables in the belief network, that is,

{Z1, …, Zk} = {X1, …, Xn} \ ({Q} ∪ {Y1, …, Yj})

and the variables Zi are ordered according to an elimination ordering.
The probability of Q conjoined with the evidence is

P(Q ∧ Y1=v1 ∧ ⋯ ∧ Yj=vj) = Σ_Zk ⋯ Σ_Z1 ∏_i P(Xi | parents(Xi))_{Y1=v1, …, Yj=vj}

where the subscript denotes setting the observed variables to their observed values in each conditional probability (the conditioning operation defined above).
The belief network inference problem is thus reduced to a problem of summing out a set of variables from a product of factors. The distribution law specifies that a sum of products, such as x·a + x·b, can be simplified by distributing out the common factors (here x), which results in x·(a + b). The resulting form is more efficient to compute. Distributing out common factors is the essence of the VE algorithm. The elements multiplied together are called "factors" because of the use of the term in algebra. Initially, the factors represent the conditional probability distributions, but the intermediate factors are just functions on variables created by adding and multiplying factors.
To compute the posterior distribution of a query variable given observations:
Construct a factor for each conditional probability distribution.
Eliminate each of the non-query variables:
if the variable is observed, its value is set to the observed value in each of the factors in which the variable appears,
otherwise the variable is summed out.
Multiply the remaining factors and normalize.
To sum out a variable Z from a product f1, …, fk of factors, first partition the factors into those not containing Z, say f1, …, fi, and those containing Z, f_{i+1}, …, fk; then distribute the common factors out of the sum:

Σ_Z f1 × ⋯ × fk = f1 × ⋯ × fi × (Σ_Z f_{i+1} × ⋯ × fk).
VE explicitly constructs a representation (in terms of a multidimensional array, a tree, or a set of rules) of the rightmost factor.
Figure 8.10 gives pseudocode for the VE algorithm. The elimination ordering could be given a priori or computed on the fly. It is worthwhile to select observed variables first in the elimination ordering, because eliminating these simplifies the problem.
This algorithm assumes that the query variable is not observed. If it is observed to have a particular value, its posterior probability is just 1 for the observed value and 0 for the other values.
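Putting the pieces together, here is a small, illustrative variable elimination routine in the style of the sketches above, reusing assign, multiply, and sum_out. It follows the three steps just described. The network is that of Example 8.15; only P(Smoke | Fire) and P(Alarm | Fire, Tampering) are taken from Figure 8.6, and the remaining numbers are made-up placeholders:

# Reuses assign, multiply, and sum_out from the sketches above.

def variable_elimination(factors, query, evidence, ordering):
    """Posterior distribution of the query variable given the evidence.

    factors: list of factors, one per conditional probability
    evidence: dict mapping observed variables to their observed values
    ordering: elimination ordering for the non-query, non-observed variables
    """
    # Set observed variables to their observed values in every factor.
    for var, val in evidence.items():
        factors = [assign(f, var, val) for f in factors]
    # Sum out each remaining non-query variable in turn.
    for var in ordering:
        related = [f for f in factors if var in f[0]]
        factors = [f for f in factors if var not in f[0]]
        prod = related[0]
        for f in related[1:]:
            prod = multiply(prod, f)
        factors.append(sum_out(prod, var))
    # Multiply the remaining factors and normalize.
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    total = sum(result[1].values())
    return {key[0]: p / total for key, p in result[1].items()}

T, F = True, False
network = [
    (("Tampering",), {(T,): 0.02, (F,): 0.98}),                      # placeholder numbers
    (("Fire",), {(T,): 0.01, (F,): 0.99}),                           # placeholder numbers
    (("Fire", "Tampering", "Alarm"),                                 # from Figure 8.6
     {(T, T, T): 0.5,    (T, T, F): 0.5,
      (T, F, T): 0.99,   (T, F, F): 0.01,
      (F, T, T): 0.85,   (F, T, F): 0.15,
      (F, F, T): 0.0001, (F, F, F): 0.9999}),
    (("Fire", "Smoke"), {(T, T): 0.9, (T, F): 0.1,                   # from Figure 8.6
                         (F, T): 0.01, (F, F): 0.99}),
    (("Alarm", "Leaving"), {(T, T): 0.88, (T, F): 0.12,              # placeholder numbers
                            (F, T): 0.001, (F, F): 0.999}),
    (("Leaving", "Report"), {(T, T): 0.75, (T, F): 0.25,             # placeholder numbers
                             (F, T): 0.01, (F, F): 0.99}),
]

posterior = variable_elimination(network, "Tampering",
                                 {"Smoke": T, "Report": T},
                                 ["Fire", "Alarm", "Leaving"])
print(posterior)    # distribution over Tampering given the evidence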
Consider Example 8.15 with the query P(Tampering | Smoke=true ∧ Report=true).
Suppose it first eliminates the observed variables, Smoke and Report. After these are eliminated, the following factors remain:
f0(Tampering), representing P(Tampering)
f1(Fire), representing P(Fire)
f2(Tampering, Fire, Alarm), representing P(Alarm | Tampering, Fire)
f3(Fire), representing P(Smoke=true | Fire)
f4(Alarm, Leaving), representing P(Leaving | Alarm)
f5(Leaving), representing P(Report=true | Leaving)
Suppose Fire is next in the elimination ordering. To eliminate Fire, collect all of the factors containing Fire, namely f1(Fire), f2(Tampering, Fire, Alarm), and f3(Fire), multiply them together, and sum out Fire from the resulting factor. Call this factor f6(Tampering, Alarm). At this stage, the set of factors contains:
f0(Tampering), f4(Alarm, Leaving), f5(Leaving), f6(Tampering, Alarm)
Suppose Alarm is eliminated next. VE multiplies the factors containing Alarm and sums out Alarm from the product, giving a factor, call it f7:

f7(Tampering, Leaving) = Σ_Alarm f6(Tampering, Alarm) × f4(Alarm, Leaving)
The set of factors then contains:
f0(Tampering), f5(Leaving), f7(Tampering, Leaving)
Eliminating Leaving results in the factor

f8(Tampering) = Σ_Leaving f7(Tampering, Leaving) × f5(Leaving)
To determine the distribution over Tampering, multiply the remaining factors, giving

f9(Tampering) = f0(Tampering) × f8(Tampering)
The posterior distribution over Tampering is given by

f9(Tampering) / Σ_Tampering f9(Tampering)
Note that the denominator is the prior probability of the evidence, namely P(Smoke=true ∧ Report=true).
Consider the same network as in the previous example but with the following query: P(Alarm | Fire=true).
When Fire is eliminated, the factor P(Fire) becomes a factor of no variables; it is just a number, P(Fire=true).
Suppose Report is eliminated next. It is in one factor, which represents P(Report | Leaving). Summing over all of the values of Report gives a factor on Leaving, all of whose values are 1. This is because P(Report=true | Leaving=v) + P(Report=false | Leaving=v) = 1 for any value v of Leaving.
If Leaving is eliminated next, a factor that is all 1 is multiplied by a factor representing P(Leaving | Alarm), and Leaving is summed out. This, again, results in a factor all of whose values are 1.
Similarly, eliminating Smoke results in a factor of no variables, whose value is 1. Note that even if Smoke had also been observed, eliminating Smoke would result in a factor of no variables, which would not affect the posterior distribution on Alarm.
Eventually, there is only the factor on Alarm that, once normalized, represents its posterior distribution, and a constant factor that will cancel in the normalization.
To speed up the inference, variables that are irrelevant to answering a query given the observations can be pruned. In particular, any node that has no observed or queried descendants and is itself not observed or queried may be pruned. This may result in a smaller network with fewer factors and variables. For example, to compute P(Alarm | Fire=true), the variables Smoke, Leaving, and Report may be pruned.
The complexity of the algorithm depends on a measure of complexity of the network. The size of a tabular representation of a factor is exponential in the number of variables in the factor. The treewidth of a network, given an elimination ordering, is the maximum number of variables in a factor created by summing out a variable when using the elimination ordering. The treewidth of a belief network is the minimum treewidth over all elimination orderings. The treewidth depends only on the graph structure and is a measure of the sparseness of the graph. The complexity of VE is exponential in the treewidth and linear in the number of variables. Finding the elimination ordering with minimum treewidth is NP-hard, but there are some good elimination ordering heuristics, as discussed for CSP VE.
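The width of a particular elimination ordering can be computed from the graph structure alone, by simulating the eliminations on the undirected graph that connects any two variables appearing in the same factor. The sketch below (our own helper names; the graph is built by hand from the factors of Example 8.15) reports the size of the largest factor created by the given ordering:

def ordering_width(neighbors, ordering):
    """Size of the largest factor created when eliminating in the given order.

    neighbors: dict mapping each variable to the set of variables it shares a
    factor with. Eliminating a variable creates a factor on its current
    neighbours, which then all become connected to each other.
    """
    neighbors = {v: set(ns) for v, ns in neighbors.items()}
    width = 0
    for var in ordering:
        ns = neighbors.pop(var)
        width = max(width, len(ns))
        for u in ns:
            neighbors[u] |= ns - {u}
            neighbors[u].discard(var)
    return width

# Variables of Example 8.15 that appear together in some factor.
graph = {
    "Tampering": {"Fire", "Alarm"},
    "Fire":      {"Tampering", "Alarm", "Smoke"},
    "Alarm":     {"Tampering", "Fire", "Leaving"},
    "Smoke":     {"Fire"},
    "Leaving":   {"Alarm", "Report"},
    "Report":    {"Leaving"},
}
print(ordering_width(graph,
                     ["Report", "Smoke", "Leaving", "Fire", "Alarm", "Tampering"]))  # 2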
Consider the belief network of Figure 8.4. To compute the probability of a query variable, the variables that have no children and are neither observed nor queried may be pruned. Summing out a variable involves multiplying all of the factors that contain it, and results in a factor on the other variables that appear in those factors. The treewidth of this belief network is two; there is an ordering of the variables that only constructs factors of size one or two, and there is no ordering of the variables that has a smaller treewidth.
The moral graph of a belief network is the undirected graph where there is an arc between any two nodes that appear in the same initial factor. It is obtained by "marrying" the parents of each node (adding an arc between each pair of parents) and dropping the directions of the arcs. If we prune as outlined in the previous paragraph, moralize the graph, and remove all observed variables, only those variables connected to the query in this graph are relevant to answering the query; the other variables can be pruned.
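A sketch of this relevance test (our own code and function names), which first applies the descendant-based pruning of the previous paragraph and then keeps only the variables connected to the query in the moral graph with observed nodes removed:

def relevant_variables(parents, query, observed):
    """Non-observed variables that remain after the pruning described above.

    parents: dict mapping each variable to the list of its parents.
    """
    # Step 1: repeatedly remove unobserved, unqueried variables with no
    # remaining children, i.e., those with no observed or queried descendants.
    keep = set(parents)
    changed = True
    while changed:
        changed = False
        for v in list(keep):
            has_child = any(v in parents[c] for c in keep if c != v)
            if v != query and v not in observed and not has_child:
                keep.remove(v)
                changed = True
    # Step 2: moralize the remaining graph, remove observed variables, and
    # keep only the variables connected to the query.
    edges = {v: set() for v in keep}
    for child in keep:
        ps = [p for p in parents[child] if p in keep]
        for p in ps:
            edges[child].add(p)
            edges[p].add(child)
        for p1 in ps:
            for p2 in ps:
                if p1 != p2:
                    edges[p1].add(p2)
    relevant, frontier = {query}, [query]
    while frontier:
        for u in edges[frontier.pop()]:
            if u not in observed and u not in relevant:
                relevant.add(u)
                frontier.append(u)
    return relevant

# Fire-alarm network of Example 8.15.
parents = {"Tampering": [], "Fire": [], "Alarm": ["Tampering", "Fire"],
           "Smoke": ["Fire"], "Leaving": ["Alarm"], "Report": ["Leaving"]}
print(relevant_variables(parents, "Alarm", {"Fire"}))
# {'Alarm', 'Tampering'}: Smoke, Leaving, and Report have been pruned.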
Many modern exact algorithms use what is essentially variable elimination, speeding it up by preprocessing as much as possible into a secondary structure before any evidence arrives. This is appropriate when, for example, the same belief network is used for many different queries, or when observations are added incrementally. The algorithms save intermediate results so that evidence can be incorporated incrementally. Unfortunately, extensive preprocessing, allowing arbitrary sequences of observations and deriving the posterior on each variable, precludes pruning the network. So, for each application, you need to choose whether you will save more by pruning irrelevant variables for each query and observation, or by preprocessing before any observations arrive.