
6.3 Belief Networks

The notion of conditional independence can be used to give a concise representation of many domains. The idea is that, given a random variable X, a small set of variables may exist that directly affect the variable's value in the sense that X is conditionally independent of other variables given values for the directly affecting variables. The set of locally affecting variables is called the Markov blanket. This locality is what is exploited in a belief network. A belief network is a directed model of conditional dependence among a set of random variables. The precise statement of conditional independence in a belief network takes into account the directionality.

To define a belief network, start with a set of random variables that represent all of the features of the model. Suppose these variables are {X1,...,Xn}. Next, select a total ordering of the variables, X1,...,Xn.

The chain rule (Proposition 6.3) shows how to decompose a conjunction into conditional probabilities:

P(X1=v1 ∧ X2=v2 ∧ ··· ∧ Xn=vn)
= ∏_{i=1}^{n} P(Xi=vi | X1=v1 ∧ ··· ∧ Xi-1=vi-1).

Or, in terms of random variables and probability distributions,

P(X1, X2, ···, Xn) = ∏_{i=1}^{n} P(Xi | X1, ···, Xi-1).

Define the parents of random variable Xi, written parents(Xi), to be a minimal set of predecessors of Xi in the total ordering such that the other predecessors of Xi are conditionally independent of Xi given parents(Xi). That is, parents(Xi) ⊆ {X1, ..., Xi-1} such that

P(Xi | X1, ..., Xi-1) = P(Xi | parents(Xi)).

If more than one minimal set exists, any minimal set can be chosen to be the parents. There can be more than one minimal set only when some of the predecessors are deterministic functions of others.

We can put the chain rule and the definition of parents together, giving

P(X1, X2, ···, Xn) = ∏_{i=1}^{n} P(Xi | parents(Xi)).

The probability over all of the variables, P(X1, X2,···, Xn), is called the joint probability distribution. A belief network defines a factorization of the joint probability distribution, where the conditional probabilities form factors that are multiplied together.
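To make the factorization concrete, here is a minimal Python sketch (not from the book); the two-variable network, the table layout, and all of the numbers are invented for illustration:

```python
# A minimal sketch of the chain-rule factorization; the two-variable
# "network" and all numbers here are invented for illustration.

parents = {"A": (), "B": ("A",)}   # B's only parent is A

# Tables giving P(X = value | parent values); keys are (value, *parent values).
cpt = {
    "A": {(True,): 0.3, (False,): 0.7},
    "B": {(True, True): 0.9, (False, True): 0.1,
          (True, False): 0.2, (False, False): 0.8},
}

def joint(assignment):
    """P(X1=v1, ..., Xn=vn) as the product of P(Xi=vi | parents(Xi))."""
    p = 1.0
    for var, value in assignment.items():
        key = (value,) + tuple(assignment[q] for q in parents[var])
        p *= cpt[var][key]
    return p

print(joint({"A": True, "B": False}))   # P(A) * P(¬B | A) = 0.3 * 0.1 = 0.03
```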

A belief network, also called a Bayesian network, is a directed acyclic graph (DAG) whose nodes are random variables. There is an arc from each element of parents(Xi) into Xi. Associated with the belief network is a set of conditional probability distributions - the conditional probability of each variable given its parents (which includes the prior probabilities of those variables with no parents).

Thus, a belief network consists of

  • a DAG, where each node is labeled by a random variable;
  • a domain for each random variable; and
  • a set of conditional probability distributions giving P(X|parents(X)) for each variable X.
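One hypothetical way to bundle these three components into a single structure, sketched in Python (the field names are illustrative choices, not standard notation):

```python
from dataclasses import dataclass

@dataclass
class BeliefNetwork:
    """A sketch of the three components; a fuller version would also
    check that the parents relation forms an acyclic graph."""
    parents: dict   # the DAG: maps each variable to the tuple of its parents
    domains: dict   # maps each variable to its domain (e.g., {True, False})
    cpts: dict      # maps each variable X to a table for P(X | parents(X))
```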

A belief network is acyclic by construction. The way the chain rule decomposes the conjunction gives the ordering. A variable can have only predecessors as parents. Different decompositions can result in different belief networks.

Example 6.10: Suppose we want to use the diagnostic assistant to diagnose whether there is a fire in a building based on noisy sensor information and possibly conflicting explanations of what could be going on. The agent receives a report about whether everyone is leaving the building. Suppose the report sensor is noisy: It sometimes reports leaving when there is no exodus (a false positive), and it sometimes does not report when everyone is leaving (a false negative). Suppose the fire alarm going off can cause the leaving, but this is not a deterministic relationship. Either tampering or fire could affect the alarm. Fire also causes smoke to rise from the building.

Suppose we use the following variables, all of which are Boolean, in the following order:

  • Tampering is true when there is tampering with the alarm.
  • Fire is true when there is a fire.
  • Alarm is true when the alarm sounds.
  • Smoke is true when there is smoke.
  • Leaving is true if there are many people leaving the building at once.
  • Report is true if there is a report given by someone of people leaving. Report is false if there is no report of leaving.

The variable Report denotes the sensor report that people are leaving. This information is unreliable because the person issuing such a report could be playing a practical joke, or no one who could have given such a report may have been paying attention. This variable is introduced to allow conditioning on unreliable sensor data. The agent knows what the sensor reports, but it only has unreliable evidence about people leaving the building. As part of the domain, assume the following conditional independencies:

  • Fire is conditionally independent of Tampering (given no other information).
  • Alarm depends on both Fire and Tampering. That is, we are making no independence assumptions about how Alarm depends on its predecessors given this variable ordering.
  • Smoke depends only on Fire and is conditionally independent of Tampering and Alarm given whether there is a Fire.
  • Leaving only depends on Alarm and not directly on Fire or Tampering or Smoke. That is, Leaving is conditionally independent of the other variables given Alarm.
  • Report only directly depends on Leaving.

The belief network of Figure 6.1 expresses these dependencies.


Figure 6.1: Belief network for report of leaving of Example 6.10

This network represents the factorization

P(Tampering, Fire, Alarm, Smoke, Leaving, Report)
= P(Tampering) × P(Fire) × P(Alarm | Tampering, Fire)
× P(Smoke | Fire) × P(Leaving | Alarm) × P(Report | Leaving).

We also must define the domain of each variable. Assume that the variables are Boolean; that is, they have domain {true,false}. We use the lower-case variant of the variable to represent the true value and use negation for the false value. Thus, for example, Tampering=true is written as tampering, and Tampering=false is written as ¬tampering.

The examples that follow assume the following conditional probabilities:

P(tampering) = 0.02
P(fire) = 0.01
P(alarm | fire ∧ tampering) = 0.5
P(alarm | fire ∧ ¬tampering) = 0.99
P(alarm | ¬fire ∧ tampering) = 0.85
P(alarm | ¬fire ∧ ¬tampering) = 0.0001
P(smoke | fire) = 0.9
P(smoke | ¬fire) = 0.01
P(leaving | alarm) = 0.88
P(leaving | ¬alarm) = 0.001
P(report | leaving) = 0.75
P(report | ¬leaving) = 0.01
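These numbers can be written down directly as tables. The following sketch (a variant of the earlier one; the dictionary layout and parent orderings are choices made here, not the book's notation) stores P(X=true | parent values) for each variable, obtains the false case by complement, and implements the factorization above:

```python
# CPTs for Example 6.10, storing only P(X = true | parent values);
# the false case is the complement.

parents = {
    "Tampering": (), "Fire": (),
    "Alarm": ("Tampering", "Fire"),
    "Smoke": ("Fire",),
    "Leaving": ("Alarm",),
    "Report": ("Leaving",),
}

p_true = {
    "Tampering": {(): 0.02},
    "Fire":      {(): 0.01},
    "Alarm":     {(True, True): 0.5, (False, True): 0.99,
                  (True, False): 0.85, (False, False): 0.0001},
    "Smoke":     {(True,): 0.9, (False,): 0.01},
    "Leaving":   {(True,): 0.88, (False,): 0.001},
    "Report":    {(True,): 0.75, (False,): 0.01},
}

def prob(var, value, world):
    """P(var = value | the parent values given in world)."""
    pt = p_true[var][tuple(world[q] for q in parents[var])]
    return pt if value else 1.0 - pt

def joint(world):
    """The factorization above: the product over all variables."""
    p = 1.0
    for var in parents:
        p *= prob(var, world[var], world)
    return p
```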

  • For each wire wi, there is a random variable, Wi, with domain {live,dead}, which denotes whether there is power in wire wi. Wi=live means wire wi has power. Wi=dead means there is no power in wire wi.
  • Outside_power with domain {live,dead} denotes whether there is power coming into the building.
  • For each switch si, variable Si_pos denotes the position of si. It has domain {up,down}.
  • For each switch si, variable Si_st denotes the state of switch si. It has domain {ok,upside_down,short,intermittent,broken}. Si_st=ok means switch si is working normally. Si_st=upside_down means switch si is installed upside-down. Si_st=short means switch si is shorted and acting as a wire. Si_st=intermittent means switch si works intermittently. Si_st=broken means switch si is broken and does not allow electricity to flow.
  • For each circuit breaker cbi, variable Cbi_st has domain {on,off}. Cbi_st=on means power can flow through cbi and Cbi_st=off means that power cannot flow through cbi.
  • For each light li, variable Li_st with domain {ok,intermittent,broken} denotes the state of the light. Li_st=ok means light li will light if powered, Li_st=intermittent means light li intermittently lights if powered, and Li_st=broken means light li does not work.
Figure 6.2: Belief network for the electrical domain of Figure 1.8

Example 6.11: Consider the wiring example of Figure 1.8. Suppose we decide to have variables for whether lights are lit, for the switch positions, for whether lights and switches are faulty or not, and for whether there is power in the wires. The variables are defined in Figure 6.2.

Let's select an ordering where the causes of a variable are before the variable in the ordering. For example, the variable for whether a light is lit comes after variables for whether the light is working and whether there is power coming into the light.

Whether light l1 is lit depends only on whether there is power in wire w0 and whether light l1 is working properly. Other variables, such as the position of switch s1, whether light l2 is lit, or who is the Queen of Canada, are irrelevant. Thus, the parents of L1_lit are W0 and L1_st.

Consider variable W0, which represents whether there is power in wire w0. If we knew whether there was power in wires w1 and w2, and we knew the position of switch s2 and whether the switch was working properly, the value of the other variables (other than L1_lit, which is a descendant of W0) would not affect our belief in whether there is power in wire w0. Thus, the parents of W0 should be S2_pos, S2_st, W1, and W2.

Figure 6.2 shows the resulting belief network after the independence of each variable has been considered. The belief network also contains the domains of the variables, as given in the figure, and conditional probabilities of each variable given its parents.

For the variable W1, the following conditional probabilities must be specified:

P(W1=live | S1_pos=up ∧ S1_st=ok ∧ W3=live)
P(W1=live | S1_pos=up ∧ S1_st=ok ∧ W3=dead)
P(W1=live | S1_pos=up ∧ S1_st=upside_down ∧ W3=live)
...
P(W1=live | S1_pos=down ∧ S1_st=broken ∧ W3=dead).

There are two values for S1_pos, five values for S1_st, and two values for W3, so there are 2 × 5 × 2 = 20 different cases where a value for W1=live must be specified. As far as probability theory is concerned, the probability for W1=live for these 20 cases could be assigned arbitrarily. Of course, knowledge of the domain constrains what values make sense. The values for W1=dead can be computed from the values for W1=live for each of these cases.
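The counting above can be mirrored programmatically; in this sketch the probabilities are placeholders, not values from the book:

```python
import itertools

# The 2 * 5 * 2 = 20 parent contexts for which P(W1 = live | ...) is needed.
contexts = list(itertools.product(
    ("up", "down"),                                            # S1_pos
    ("ok", "upside_down", "short", "intermittent", "broken"),  # S1_st
    ("live", "dead"),                                          # W3
))
assert len(contexts) == 20

# Placeholder table; the single hand-filled entry is illustrative only.
p_w1_live = {c: 0.0 for c in contexts}
p_w1_live[("up", "ok", "live")] = 1.0

# No separate entries are needed for W1 = dead: it is the complement.
p_w1_dead = {c: 1.0 - p for c, p in p_w1_live.items()}
```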

Because the variable S1_st has no parents, it requires a prior distribution, which can be specified as the probabilities for all but one of the values; the remaining value can be derived from the constraint that all of the probabilities sum to 1. Thus, to specify the distribution of S1_st, four of the following five probabilities must be specified:

P(S1_st=ok)
P(S1_st=upside_down)
P(S1_st=short)
P(S1_st=intermittent)
P(S1_st=broken)

The other variables are represented analogously.
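For instance, a prior for S1_st could be written by specifying four probabilities and deriving the fifth from the sum-to-1 constraint; the numbers below are invented for illustration:

```python
# A hypothetical prior for S1_st: specify four values, derive the fifth
# from the constraint that the probabilities sum to 1.
prior_s1_st = {
    "ok": 0.9,
    "upside_down": 0.02,
    "short": 0.01,
    "intermittent": 0.03,
}
prior_s1_st["broken"] = 1.0 - sum(prior_s1_st.values())   # 0.04
assert abs(sum(prior_s1_st.values()) - 1.0) < 1e-9
```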

A belief network is a graphical representation of conditional independence. The independence allows us to depict direct effects within the graph and prescribes which probabilities must be specified. Arbitrary posterior probabilities can be derived from the network.

The independence assumption embedded in a belief network is as follows: Each random variable is conditionally independent of its non-descendants given its parents. That is, if X is a random variable with parents Y1,..., Yn, all random variables that are not descendants of X are conditionally independent of X given Y1 ,..., Yn:

P(X | Y1, ..., Yn, R) = P(X | Y1, ..., Yn),

if R does not involve a descendant of X. For this definition, we include X as a descendant of itself. The right-hand side of this equation is the form of the probabilities that are specified as part of the belief network. R may involve ancestors of X and other nodes as long as they are not descendants of X. The independence assumption states that all of the influence of non-descendant variables is captured by knowing the value of X's parents.
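Using the tables from the Example 6.10 sketch, this assumption can be checked numerically by brute-force enumeration over the joint distribution. For instance, Smoke should be independent of Tampering (a non-descendant) given Fire, so both queries below should return 0.9. This is a sketch for checking, not an efficient inference method:

```python
import itertools

VARS = ("Tampering", "Fire", "Alarm", "Smoke", "Leaving", "Report")

def marginal(fixed):
    """Sum the joint (from the Example 6.10 sketch) over all worlds
    consistent with the fixed variable values."""
    total = 0.0
    for values in itertools.product((True, False), repeat=len(VARS)):
        world = dict(zip(VARS, values))
        if all(world[v] == val for v, val in fixed.items()):
            total += joint(world)
    return total

def cond(target, given):
    """P(target | given), by enumeration."""
    return marginal({**target, **given}) / marginal(given)

# Smoke is independent of Tampering (a non-descendant) given Fire:
print(cond({"Smoke": True}, {"Fire": True}))                     # 0.9
print(cond({"Smoke": True}, {"Fire": True, "Tampering": True}))  # 0.9
```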

Often, we refer to just the labeled DAG as a belief network. When this is done, it is important to remember that a domain for each variable and a set of conditional probability distributions are also part of the network.

The number of probabilities that must be specified for each variable is exponential in the number of parents of the variable. The independence assumption is useful insofar as the number of variables that directly affect another variable is small. You should order the variables so that nodes have as few parents as possible.


Belief Networks and Causality

Belief networks have often been called causal networks and have been claimed to be a good representation of causality. Recall that a causal model predicts the result of interventions. Suppose you have in mind a causal model of a domain, where the domain is specified in terms of a set of random variables. For each pair of random variables X1 and X2, if a direct causal connection exists from X1 to X2 (i.e., intervening to change X1 in some context of other variables affects X2 and this cannot be modeled by having some intervening variable), add an arc from X1 to X2. You would expect that the causal model would obey the independence assumption of the belief network. Thus, all of the conclusions of the belief network would be valid.

You would also expect such a graph to be acyclic; you do not want something eventually causing itself. This assumption is reasonable if you consider that the random variables represent particular events rather than event types. For example, consider a causal chain in which "being stressed" causes you to "work inefficiently," which, in turn, causes you to "be stressed." To break the apparent cycle, represent "being stressed" at different stages as different random variables that refer to different times: being stressed in the past causes you to not work well at the moment, which causes you to be stressed in the future. The variables should satisfy the clarity principle and have a well-defined meaning; they should not be seen as event types.

The belief network itself has nothing to say about causation, and it can represent non-causal independence, but it seems particularly appropriate when there is causality in a domain. Adding arcs that represent local causality tends to produce a small belief network. The belief network of Figure 6.2 shows how this can be done for a simple domain.

A causal network models interventions. If someone were to artificially force a variable to have a particular value, the variable's descendants - but no other nodes - would be affected. Finally, you can see how the causality in belief networks relates to the causal and evidential reasoning discussed in Section 5.7. A causal belief network can be seen as a way of axiomatizing in a causal direction. Reasoning in belief networks corresponds to abducing to causes and then predicting from these. A direct mapping exists between the logic-based abductive view discussed in Section 5.7 and belief networks: Belief networks can be modeled as logic programs with probabilities over possible hypotheses. This is described in Section 14.3.
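On the representation sketched earlier, an intervention can be approximated by cutting the arcs into the intervened variable and replacing its table with a point mass; this is a rough sketch of the idea, not the book's algorithm:

```python
def intervene(parents, p_true, var, value):
    """A rough do-operator sketch: force var to value by cutting the
    arcs into it and giving it a point-mass table. Only the marginals
    of var's descendants can change as a result."""
    new_parents = {**parents, var: ()}
    new_p_true = {**p_true, var: {(): 1.0 if value else 0.0}}
    return new_parents, new_p_true
```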


Note the restriction "each random variable is conditionally independent of its non-descendants given its parents" in the definition of the independence encoded in a belief network. If R contains a descendant of variable X, the independence assumption is not directly applicable.

Example 6.12: In Figure 6.2, variables S3_pos, S3_st, and W3 are the parents of variable W4. If you know the values of S3_pos, S3_st, and W3, knowing whether or not l1 is lit or knowing the value of Cb1_st will not affect your belief in whether there is power in wire w4. However, even if you knew the values of S3_pos, S3_st, and W3, learning whether l2 is lit potentially changes your belief in whether there is power in wire w4, because l2 being lit is evidence about a descendant of W4. The independence assumption is not directly applicable.

The variable S1_pos has no parents. Thus, the independence embedded in the belief network specifies that P(S1_pos=up | A) = P(S1_pos=up) for any A that does not involve a descendant of S1_pos. If A includes a descendant of S1_pos - for example, if A is S2_pos=up ∧ L1_lit=true - the independence assumption cannot be directly applied.

This network can be used in a number of ways:

  • By conditioning on the knowledge that the switches and circuit breakers are ok, and on the values of the outside power and the position of the switches, this network can simulate how the lighting should work.
  • Given values of the outside power and the position of the switches, the network can infer the likelihood of any outcome - for example, how likely it is that l1 is lit.
  • Given values for the switches and whether the lights are lit, the posterior probability that each switch or circuit breaker is in any particular state can be inferred.
  • Given some observations, the network can be used to reason backward to determine the most likely position of switches.
  • Given some switch positions, some outputs, and some intermediate values, the network can be used to determine the probability of any other variable in the network.

A belief network specifies a joint probability distribution from which arbitrary conditional probabilities can be derived. A network can be queried by asking for the conditional probability of any variables conditioned on the values of any other variables. This is typically done by providing observations on some variables and querying another variable.
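Continuing the earlier sketches, such a query can be answered by enumeration, reusing cond() from the independence-check sketch; run on the tables of Example 6.10, this should reproduce the numbers of Example 6.13 below, up to rounding:

```python
def query(var, evidence):
    """P(var = true | evidence), using cond() from the earlier sketch."""
    return cond({var: True}, evidence)

print(query("Tampering", {"Report": True}))               # ≈ 0.399
print(query("Fire", {"Report": True, "Smoke": True}))     # ≈ 0.964
print(query("Fire", {"Report": True, "Smoke": False}))    # ≈ 0.0294
```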

Example 6.13: Consider Example 6.10. The prior probabilities (with no evidence) of each variable can be computed using the methods of the next section. The following conditional probabilities follow from the model of Example 6.10, to about three decimal places:
P(tampering) = 0.02
P(fire) = 0.01
P(report) = 0.028
P(smoke) = 0.0189

Observing the report gives the following:

P(tampering | report) = 0.399
P(fire | report) = 0.2305
P(smoke | report) = 0.215

As expected, the probabilities of both tampering and fire are increased by the report. Because the probability of fire is increased, so is the probability of smoke.

Suppose instead that smoke were observed:

P(tampering | smoke) = 0.02
P(fire | smoke) = 0.476
P(report | smoke) = 0.320

Note that the probability of tampering is not affected by observing smoke; however, the probabilities of report and fire are increased.

Suppose that both report and smoke were observed:

P(tampering | report ∧ smoke) = 0.0284
P(fire | report ∧ smoke) = 0.964

Observing both makes fire even more likely. However, in the context of the report, the presence of smoke makes tampering less likely. This is because the report is explained away by fire, which is now more likely.

Suppose instead that report, but not smoke, was observed:

P(tampering | report ∧ ¬smoke) = 0.501
P(fire | report ∧ ¬smoke) = 0.0294

In the context of the report, fire becomes much less likely and so the probability of tampering increases to explain the report.

This example illustrates how the independence assumption of a belief network gives commonsense conclusions, and it also demonstrates that explaining away is a consequence of that independence assumption.