9.3 Belief Networks

The notion of conditional independence is used to give a concise representation of many domains. The idea is that, given a random variable X, there may be a few variables that directly affect X’s value, in the sense that X is conditionally independent of other variables given these variables. The set of locally affecting variables is called the Markov blanket. This locality is exploited in a belief network.

A belief network is a directed acyclic graph representing conditional dependence among a set of random variables. The random variables are the nodes. The arcs represent direct dependence. The conditional independence implied by a belief network is determined by an ordering of the variables; each variable is independent of its predecessors in the total ordering given a subset of the predecessors called its parents. Independence in the graph is indicated by missing arcs.

To define a belief network on a set of random variables, {X1, …, Xn}, first select a total ordering of the variables, say, X1, …, Xn. The chain rule (Proposition 9.1.2) shows how to decompose a conjunction into conditional probabilities:

P(X1=v1 ∧ X2=v2 ∧ … ∧ Xn=vn)
= ∏_{i=1}^n P(Xi=vi ∣ X1=v1 ∧ … ∧ X_{i-1}=v_{i-1}).

Or, in terms of random variables:

P(X1, X2, …, Xn) = ∏_{i=1}^n P(Xi ∣ X1, …, X_{i-1}).   (9.1)

Define the parents of random variable Xi, written parents(Xi), to be a minimal set of predecessors of Xi in the total ordering such that the other predecessors of Xi are conditionally independent of Xi given parents(Xi). Thus Xi probabilistically depends on each of its parents, but is independent of its other predecessors. That is, parents(Xi) ⊆ {X1, …, X_{i-1}} such that

P(Xi ∣ X1, …, X_{i-1}) = P(Xi ∣ parents(Xi)).

This conditional independence characterizes a belief network.

When there are multiple minimal sets of predecessors satisfying this condition, any minimal set may be chosen to be the parents. There can be more than one minimal set only when some of the predecessors are deterministic functions of others.

Putting the chain rule and the definition of parents together gives

P(X1, X2, …, Xn) = ∏_{i=1}^n P(Xi ∣ parents(Xi)).

The probability over all of the variables, P(X1, X2, …, Xn), is called the joint probability distribution. A belief network defines a factorization of the joint probability distribution into a product of conditional probabilities.
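As a concrete sketch of this factorization, a joint probability is the product of one conditional probability per variable. The three-variable network and all numbers below are invented for illustration, not taken from the text:

```python
# Invented example network: A and B have no parents; C has parents A and B.
parents = {"A": [], "B": [], "C": ["A", "B"]}

# cpt[X] maps (parent values..., value of X) -> P(X = value | parents)
cpt = {
    "A": {(True,): 0.1, (False,): 0.9},
    "B": {(True,): 0.4, (False,): 0.6},
    "C": {(True, True, True): 0.9, (True, True, False): 0.1,
          (True, False, True): 0.5, (True, False, False): 0.5,
          (False, True, True): 0.3, (False, True, False): 0.7,
          (False, False, True): 0.01, (False, False, False): 0.99},
}

def joint_prob(assignment):
    """P(X1=v1, ..., Xn=vn) = product over i of P(Xi=vi | parents(Xi))."""
    p = 1.0
    for var, ps in parents.items():
        key = tuple(assignment[q] for q in ps) + (assignment[var],)
        p *= cpt[var][key]
    return p

joint_prob({"A": True, "B": False, "C": True})  # 0.1 * 0.6 * 0.5 ≈ 0.03
```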

A belief network, also called a Bayesian network, is an acyclic directed graph (DAG), where the nodes are random variables. There is an arc from each element of parents(Xi) into Xi. Associated with the belief network is a set of conditional probability distributions that specify the conditional probability of each variable given its parents (which includes the prior probabilities of those variables with no parents).

Thus, a belief network consists of

  • a DAG, where each node is labeled by a random variable

  • a domain for each random variable, and

  • a set of conditional probability distributions giving P(X ∣ parents(X)) for each variable X.

A belief network is acyclic by construction. How the chain rule decomposes a conjunction depends on the ordering of the variables. Different orderings can result in different belief networks. In particular, which variables are eligible to be parents depends on the ordering, as only predecessors in the ordering can be parents. Some of the orderings may result in networks with fewer arcs than other orderings.

Example 9.11.

Consider the four variables of Example 9.10, with the ordering: Intelligent, Works_hard, Answers, Grade. Consider the variables in order. Intelligent does not have any predecessors in the ordering, so it has no parents, thus parents(Intelligent)={}. Works_hard is independent of Intelligent, and so it too has no parents. Answers depends on both Intelligent and Works_hard, so

parents(Answers)={Intelligent,Works_hard}.

Grade is independent of Intelligent and Works_hard given Answers and so

parents(Grade)={Answers}.

The corresponding belief network is given in Figure 9.2.

This graph defines the decomposition of the joint distribution:

P(Intelligent, Works_hard, Answers, Grade)
= P(Intelligent) P(Works_hard) P(Answers ∣ Intelligent, Works_hard)
   P(Grade ∣ Answers).

In the examples below, the domains of the variables are simple; for example, the domain of Answers may be {insightful, clear, superficial, vacuous}, or it could be the actual text of the answers.

Figure 9.2: Belief network for exam answering of Example 9.11

The independence of a belief network, according to the definition of parents, is that each variable is independent of all of the variables that are not descendants of the variable (its non-descendants) given the variable’s parents.

9.3.1 Observations and Queries

A belief network specifies a joint probability distribution from which arbitrary conditional probabilities can be derived. The most common probabilistic inference task – the task required for decision making – is to compute the posterior distribution of a query variable, or variables, given some evidence, where the evidence is a conjunction of assignments of values to some of the variables.

Example 9.12.

Before there are any observations, the distribution over intelligence is P(Intelligent), which is provided as part of the network. To determine the distribution over grades, P(Grade), requires inference.

If a grade of A is observed, the posterior distribution of Intelligent is given by

P(Intelligent ∣ Grade=A).

If it was also observed that Works_hard is false, the posterior distribution of Intelligent is

P(Intelligent ∣ Grade=A ∧ Works_hard=false).

Although Intelligent and Works_hard are independent given no observations, they are dependent given the grade. This might explain why some people claim they did not work hard to get a good grade; it increases the probability they are intelligent.

9.3.2 Constructing Belief Networks

To represent a domain in a belief network, the designer of a network must consider the following questions.

  • What are the relevant variables? In particular, the designer must consider:

    • What the agent may observe in the domain. Each feature that may be observed should be a variable, because the agent must be able to condition on all of its observations.

    • What information the agent is interested in knowing the posterior probability of. Each of these features should be made into a variable that can be queried.

    • Other hidden variables or latent variables that will not be observed or queried but make the model simpler. These variables either account for dependencies, reduce the size of the specification of the conditional probabilities, or better model how the world is assumed to work.

  • What values should these variables take? This involves considering the level of detail at which the agent should reason to answer the sorts of queries that will be encountered.

    For each variable, the designer should specify what it means to take each value in its domain. What must be true in the world for a (non-hidden) variable to have a particular value should satisfy the clarity principle: an omniscient agent should be able to know the value of a variable. It is a good idea to explicitly document the meaning of all variables and their possible values. The only time the designer may not want to do this is for hidden variables whose values the agent will want to learn from data [see Section 10.4.1].

  • What is the relationship between the variables? This should be expressed by adding arcs in the graph to define the parent relation.

  • How does the distribution of a variable depend on its parents? This is expressed in terms of the conditional probability distributions.

Example 9.13.

Suppose you want to use the diagnostic assistant to diagnose whether there is a fire in a building and whether there has been some tampering with equipment based on noisy sensor information and possibly conflicting explanations of what could be going on. The agent receives a report from Sam about whether everyone is leaving the building. Suppose Sam’s report is noisy: Sam sometimes reports leaving when there is no exodus (a false positive), and sometimes does not report when everyone is leaving (a false negative). Suppose that leaving only depends on the fire alarm going off. Either tampering or fire could affect the alarm. Whether there is smoke only depends on whether there is fire.

Suppose you use the following variables in the following order:

  • Tampering is true when there is tampering with the alarm.

  • Fire is true when there is a fire.

  • Alarm is true when the alarm sounds.

  • Smoke is true when there is smoke.

  • Leaving is true if there are many people leaving the building at once.

  • Report is true if Sam reports people leaving. Report is false if there is no report of leaving.

Assume the following conditional independencies:

  • Fire is conditionally independent of Tampering (given no other information).

  • Alarm depends on both Fire and Tampering. This is making no independence assumptions about how Alarm depends on its predecessors given this variable ordering.

  • Smoke depends only on Fire and is conditionally independent of Tampering and Alarm given whether there is Fire.

  • Leaving only depends on Alarm and not directly on Fire or Tampering or Smoke. That is, Leaving is conditionally independent of the other variables given Alarm.

  • Report only directly depends on Leaving.

The belief network of Figure 9.3 expresses these dependencies.

Figure 9.3: Belief network for report of leaving of Example 9.13

This network represents the factorization

P(Tampering, Fire, Alarm, Smoke, Leaving, Report)
= P(Tampering) P(Fire) P(Alarm ∣ Tampering, Fire)
   P(Smoke ∣ Fire) P(Leaving ∣ Alarm) P(Report ∣ Leaving).

Note that the alarm is not a smoke alarm, which would be affected by the smoke, and not directly by the fire, but rather it is a heat alarm that is directly affected by the fire. This is made explicit in the model in that Alarm is independent of Smoke given Fire.

You also must define the domain of each variable. Assume that the variables are Boolean; that is, they have domain {true,false}. We use the lower-case variant of the variable to represent the true value and use negation for the false value. Thus, for example, Tampering=true is written as tampering, and Tampering=false is written as ¬tampering.

The examples that follow assume the following conditional probabilities:

P(tampering)=0.02

P(fire)=0.01

P(alarm ∣ fire ∧ tampering)=0.5

P(alarm ∣ fire ∧ ¬tampering)=0.99

P(alarm ∣ ¬fire ∧ tampering)=0.85

P(alarm ∣ ¬fire ∧ ¬tampering)=0.0001

P(smoke ∣ fire)=0.9

P(smoke ∣ ¬fire)=0.01

P(leaving ∣ alarm)=0.88

P(leaving ∣ ¬alarm)=0.001

P(report ∣ leaving)=0.75

P(report ∣ ¬leaving)=0.01.

Before any evidence arrives, the probability is given by the priors. The following probabilities follow from the model (all of the numbers here are to about three decimal places):

P(tampering)=0.02
P(fire)=0.01
P(report)=0.028
P(smoke)=0.0189.

Observing a report gives the following:

P(tampering ∣ report)=0.399
P(fire ∣ report)=0.2305
P(smoke ∣ report)=0.215.

As expected, the probabilities of both tampering and fire are increased by the report. Because the probability of fire is increased, so is the probability of smoke.

Suppose instead that smoke alone was observed:

P(tampering ∣ smoke)=0.02
P(fire ∣ smoke)=0.476
P(report ∣ smoke)=0.320.

Note that the probability of tampering is not affected by observing smoke; however, the probabilities of report and fire are increased.

Suppose that both report and smoke were observed:

P(tampering ∣ report ∧ smoke)=0.0284
P(fire ∣ report ∧ smoke)=0.964.

Observing both makes fire even more likely. However, in the context of report, the presence of smoke makes tampering less likely. This is because report is explained away by fire, which is now more likely.

Suppose instead that report, but not smoke, was observed:

P(tampering ∣ report ∧ ¬smoke)=0.501
P(fire ∣ report ∧ ¬smoke)=0.0294.

In the context of report, fire becomes much less likely and so the probability of tampering increases to explain report.

This example illustrates how the belief net independence assumption gives commonsense conclusions and also demonstrates how explaining away is a consequence of the independence assumption of a belief network.
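The posteriors above can be reproduced by brute-force enumeration over the factorized joint: sum the product of conditional probabilities over all worlds consistent with the evidence, then normalize. A minimal sketch using the numbers of Example 9.13 (variables abbreviated to their initial letters; the function names are my own):

```python
from itertools import product

# CPTs from Example 9.13: probability of each variable being true,
# indexed by the values of its parents.
p_tampering = 0.02
p_fire = 0.01
p_alarm = {(True, True): 0.5, (True, False): 0.99,
           (False, True): 0.85, (False, False): 0.0001}  # key: (fire, tampering)
p_smoke = {True: 0.9, False: 0.01}       # key: fire
p_leaving = {True: 0.88, False: 0.001}   # key: alarm
p_report = {True: 0.75, False: 0.01}     # key: leaving

def pr(value, p_true):
    """P(var=value) given P(var=true) = p_true."""
    return p_true if value else 1 - p_true

def joint(t, f, a, s, l, r):
    return (pr(t, p_tampering) * pr(f, p_fire) * pr(a, p_alarm[(f, t)]) *
            pr(s, p_smoke[f]) * pr(l, p_leaving[a]) * pr(r, p_report[l]))

def posterior(query, evidence):
    """P(query=true | evidence) by summing the joint over all worlds."""
    num = den = 0.0
    for world in product([True, False], repeat=6):
        w = dict(zip("tfaslr", world))
        if all(w[v] == val for v, val in evidence.items()):
            p = joint(*world)
            den += p
            if w[query]:
                num += p
    return num / den

posterior("f", {"r": True})             # ≈ 0.2305
posterior("t", {"r": True, "s": True})  # ≈ 0.0284
```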

Figure 9.4: Belief network for Example 9.14
Example 9.14.

Consider the problem of diagnosing why someone is sneezing and perhaps has a fever. Sneezing could be because of influenza or because of hay fever. They are not independent, but are correlated due to the season. Suppose hay fever depends on the season because it depends on the amount of pollen, which in turn depends on the season. The agent does not get to observe sneezing directly, but rather observes just the “Achoo” sound. Suppose fever depends directly on influenza. These dependency considerations lead to the belief network of Figure 9.4.

  • For each wire wi, there is a random variable, Wi, with domain {live,dead}, which denotes whether there is power in wire wi. Wi=live means wire wi has power. Wi=dead means there is no power in wire wi.

  • Outside_power with domain {live,dead} denotes whether there is power coming into the building.

  • For each switch si, variable Si_pos denotes the position of si. It has domain {up,down}.

  • For each switch si, variable Si_st denotes the state of switch si. It has domain {ok,upside_down,short,intermittent,broken}. Si_st=ok means switch si is working normally. Si_st=upside_down means switch si is installed upside-down. Si_st=short means switch si is shorted and acting as a wire. Si_st=intermittent means switch si is working intermittently. Si_st=broken means switch si is broken and does not allow electricity to flow.

  • For each circuit breaker cbi, variable Cbi_st has domain {on,off}. Cbi_st=on means power could flow through cbi and Cbi_st=off means power could not flow through cbi.

  • For each light li, variable Li_st with domain {ok,intermittent,broken} denotes the state of the light. Li_st=ok means light li will light if powered, Li_st=intermittent means light li intermittently lights if powered, and Li_st=broken means light li does not work.

Figure 9.5: Belief network for the electrical domain of Figure 1.6
Example 9.15.

Consider the wiring example of Figure 1.6. Let’s have variables for whether lights are lit, for the switch positions, for whether lights and switches are faulty or not, and for whether there is power in the wires. The variables are defined in Figure 9.5.

Let’s try to order the variables so that each variable has few parents. In this case there seems to be a natural causal order where, for example, the variable for whether a light is lit comes after variables for whether the light is working and whether there is power coming into the light.

Whether light l1 is lit depends only on whether there is power in wire w0 and whether light l1 is working properly. Other variables, such as the position of switch s1, whether light l2 is lit, or who is the Queen of Canada, are irrelevant. Thus, the parents of L1_lit are whether there is power in wire w0 (variable W0), and the status of light l1 (variable L1_st); see Figure 9.5 for the meaning of the variables.

Consider variable W0, which represents whether there is power in wire w0. If you knew whether there was power in wires w1 and w2, and knew the position of switch s2 and whether the switch was working properly, the value of the other variables (other than L1_lit) would not affect the belief in whether there is power in wire w0. Thus, the parents of W0 should be S2_pos, S2_st, W1, and W2.

Figure 9.5 shows the resulting belief network after the independence of each variable has been considered. The belief network also contains the domains of the variables and conditional probabilities of each variable given its parents.

Note the independence assumption embedded in this model. The DAG specifies that the lights, switches, and circuit breakers break independently. To model dependencies among how the switches break, you could add more arcs and perhaps more variables. For example, if some lights do not break independently because they come from the same batch, you could add an extra node modeling the batch, and whether it is a good batch or a bad batch, which is made a parent of the Li_st variables for each light Li from that batch. When you have evidence that one light is broken, the probability that the batch is bad may increase and thus make it more likely that other lights from that batch are broken. If you are not sure whether the lights are indeed from the same batch, you could add variables representing this, too. The important point is that the belief network provides a specification of independence that lets us model dependencies in a natural and direct manner.

The model implies that there is no possibility of shorts in the wires or that the house is wired differently from the diagram. For example, it implies that w0 cannot be shorted to w4 so that wire w0 gets power from wire w4. You could add extra dependencies that let each possible short be modeled. An alternative is to add an extra node that indicates that the model is appropriate. Arcs from this node would lead to each variable representing power in a wire and to each light. When the model is appropriate, you could use the probabilities of Example 9.15. When the model is inappropriate, you could, for example, specify that each wire and light works at random. When there are weird observations that do not fit in with the original model – they are impossible or extremely unlikely given the model – the probability that the model is inappropriate will increase.

9.3.3 Representing Conditional Probabilities and Factors

A factor is a function of a set of variables; the variables on which it depends are the scope of the factor. Given an assignment of a value to each variable in the scope, the factor evaluates to a number.

A conditional probability is a factor representing P(Y ∣ X1, …, Xk), which is a function from the variables Y, X1, …, Xk into nonnegative numbers. It must satisfy the constraint that, for each assignment of values to all of X1, …, Xk, the values for Y sum to 1. That is, given values for all of the variables, the function returns a number that satisfies the constraint

∀x1 … ∀xk: ∑_{y ∈ domain(Y)} P(Y=y ∣ X1=x1, …, Xk=xk) = 1.   (9.2)
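Constraint (9.2) can be checked mechanically for a tabular representation. The sketch below (the function name and storage layout are my own) checks the alarm CPT of Figure 9.6, with the Alarm=false entries filled in as one minus the given values:

```python
from itertools import product

def is_normalized(cpt, parent_domains, y_domain, tol=1e-9):
    # For each assignment to the parents, the values for Y must sum to 1.
    for xs in product(*parent_domains):
        if abs(sum(cpt[xs + (y,)] for y in y_domain) - 1) > tol:
            return False
    return True

# (fire, tampering, alarm) -> P(Alarm=alarm | Fire=fire, Tampering=tampering)
cpt = {(False, False, True): 0.0001, (False, False, False): 0.9999,
       (False, True, True): 0.85,   (False, True, False): 0.15,
       (True, False, True): 0.99,   (True, False, False): 0.01,
       (True, True, True): 0.5,     (True, True, False): 0.5}

is_normalized(cpt, [[False, True], [False, True]], [False, True])  # True
```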

The following gives a number of ways of representing conditional probabilities, and other factors.

Conditional Probability Tables

A representation for a conditional probability that explicitly stores a value for each assignment to the variables is called a conditional probability table or CPT. This can be done whenever there is a finite set of variables with finite domains. The space used is exponential in the number of variables; the size of the table is the product of the sizes of the domains of the variables in the scope.

One such representation is to use a multidimensional table, storing P(Y=y ∣ X1=v1, …, Xk=vk) in p_y[v1]…[vk][y] (using Python’s notation for arrays, and exploiting its ambiguity between arrays and dictionaries).

Fire Tampering P(alarm ∣ Fire, Tampering)
false false 0.0001
false true 0.85
true false 0.99
true true 0.5
Figure 9.6: Conditional probability of Alarm=true given Fire and Tampering
Example 9.16.

Figure 9.6 shows the conditional probabilities for P(alarm ∣ Fire, Tampering). The probability for Alarm being false can be computed from the given probabilities, for example:

P(Alarm=false ∣ Fire=false, Tampering=true) = 1 − 0.85 = 0.15.

Given a total ordering of the variables, [Fire, Tampering, Alarm], the values are mapped to the nonnegative integers: false to 0 and true to 1.

P(Alarm ∣ Fire, Tampering) of Figure 9.6 can be represented as the Python array

cpt=[[[0.9999,0.0001],[0.15,0.85]],[[0.01,0.99],[0.5,0.5]]].

Given particular values of Fire=f, Tampering=t, Alarm=a, where f, t, and a are each 0 or 1, the value can be found at cpt[f][t][a]. If the domain of a variable is not of the form {0, …, n−1}, a dictionary can be used. Other languages have different syntaxes.

There are a number of refinements, which use different space-time tradeoffs, including the following.

  • Tables can be implemented as one-dimensional arrays. Given an ordering of the variables (e.g., alphabetical) and an ordering for the values, and using a mapping from the values into nonnegative integers, there is a unique representation using the lexical ordering of each factor as a one-dimensional array that is indexed by natural numbers. This is a space-efficient way to store a table.

  • If the child variable is treated the same as the parent variables, the information is redundant; more numbers are specified than is required. One way to handle this is to store the probability for all-but-one of the values of the child, Y. The probability of the remaining value can be computed as 1 minus the sum of the stored values, for each assignment of values to the parents. In particular, if Y is Boolean, you only need to represent the probability for one value, say Y=true given the parents; the probability for Y=false can be computed from this.

  • It is also possible to store unnormalized probabilities, which are nonnegative numbers that are proportional to the probability. The probability is computed by dividing each value by the sum of the values. This is common when the unnormalized probabilities are counts [see Section 10.2.1].
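The one-dimensional-array refinement can be sketched as follows, using the alarm CPT of Example 9.16 with variable ordering [Fire, Tampering, Alarm] and the last variable varying fastest (flat and lookup are illustrative names):

```python
# cpt of Example 9.16 flattened in lexical order:
# index = ((fire * 2) + tampering) * 2 + alarm
flat = [0.9999, 0.0001, 0.15, 0.85, 0.01, 0.99, 0.5, 0.5]

def lookup(fire, tampering, alarm):
    """P(Alarm=alarm | Fire=fire, Tampering=tampering), values coded 0/1."""
    return flat[((fire * 2) + tampering) * 2 + alarm]

lookup(0, 1, 1)  # P(alarm | ¬fire ∧ tampering) = 0.85
```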

Decision Trees

The tabular representation of conditional probabilities can be enormous when there are many parents. There is often insufficient evidence – either from data or from experts – to provide all the numbers required. Fortunately, there is often structure in conditional probabilities which can be exploited.

One such structure exploits context-specific independence, where one variable is conditionally independent of another, given particular values for some other variables.

Example 9.17.

Suppose a robot can go outside or get coffee, so Action has domain {go_out,get_coffee}. Whether it gets wet (variable Wet) depends on whether there is rain (variable Rain) in the context that it went out, or on whether the cup was full (variable Full) if it got coffee. Thus Wet is independent of Rain given context Action=get_coffee, but is dependent on Rain given context Action=go_out. Also, Wet is independent of Full given Action=go_out, but is dependent on Full given Action=get_coffee.

A conditional probability table that represents such independencies is shown in Figure 9.7.

Action Rain Full P(Wet=true ∣ Action, Rain, Full)
go_out false false 0.1
go_out false true 0.1
go_out true false 0.8
go_out true true 0.8
get_coffee false false 0.3
get_coffee false true 0.6
get_coffee true false 0.3
get_coffee true true 0.6
Figure 9.7: A conditional probability that a robot will get wet

Context-specific independence may be exploited in a representation by not requiring numbers that are not needed. A simple representation for conditional probabilities that models context-specific independence is a decision tree, where the parents in a belief network correspond to the input features that form the splits, and the child corresponds to the target feature. Each leaf of the decision tree contains a probability distribution over the child variable.

Example 9.18.

The conditional probability P(Wet ∣ Action, Rain, Full) of Figure 9.7 could be represented as a decision tree, where the number at the leaf is the probability for Wet=true:

[Decision tree: split on Action; for go_out, split on Rain (true: 0.8, false: 0.1); for get_coffee, split on Full (true: 0.6, false: 0.3)]

How to learn a decision tree from data was explored in Section 7.3.1. Context-specific independence can be exploited in probabilistic inference because a factor represented by a decision tree can be evaluated in an assignment once the values down a path in the tree hold; you do not need to know the values of all of the parent variables.
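A decision-tree factor can be implemented directly as nested conditionals, so that evaluation only consults the parents along one path. A sketch of P(Wet=true ∣ Action, Rain, Full) from Figure 9.7 (the function name is my own):

```python
def p_wet_true(action, rain=None, full=None):
    # Split on Action first, mirroring the decision tree.
    if action == "go_out":
        # Only Rain matters in this context; Full is never consulted.
        return 0.8 if rain else 0.1
    else:  # action == "get_coffee"
        # Only Full matters in this context; Rain is never consulted.
        return 0.6 if full else 0.3

p_wet_true("go_out", rain=True)       # 0.8, without knowing Full
p_wet_true("get_coffee", full=False)  # 0.3, without knowing Rain
```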

Deterministic System with Noisy Inputs

An alternative representation of conditional distributions is in terms of a deterministic system, with probabilistic inputs. The deterministic system can range from a logical formula to a program. The inputs to the system are the parent variables and stochastic inputs. The stochastic inputs – often called noise variables or exogenous variables – can be considered as random variables that are unconditionally independent of each other. There is a deterministic system that defines the non-leaf variables of the belief network, the endogenous variables, as a deterministic function of the exogenous variables and the parents.

When the deterministic system is Clark’s completion of a logic program, it is known as probabilistic logic programming and when the deterministic system is a program, it is known as probabilistic programming. When the deterministic system is specified by a logical formula, the probabilistic inference is known as weighted model counting.

Example 9.19.

The decision tree of Example 9.18 for the conditional distribution of Figure 9.7 can be represented as the logical formula

wet ↔ ((go_out ∧ rain ∧ n0)
    ∨ (go_out ∧ ¬rain ∧ n1)
    ∨ (¬go_out ∧ full ∧ n2)
    ∨ (¬go_out ∧ ¬full ∧ n3))

where the ni are independent noise variables, with

P(n0)=0.8,P(n1)=0.1,P(n2)=0.6,P(n3)=0.3.

So if, for example, go_out is true and rain is false, then wet is true whenever n1 is true, which occurs with probability 0.1.

This conditional distribution can be represented as a program:

if go_out:
   if rain:
      wet := flip(0.8)
   else:
      wet := flip(0.1)
else:
   if full:
      wet := flip(0.6)
   else:
      wet := flip(0.3)

where flip(x) = (random() < x), and random() returns a random number uniformly in the range [0,1), so flip(x) makes a new independent random variable that is true with probability x. The logical formula gave these random variables names.

Logical formulas allow for much more complicated expressions than presented here. Probabilistic programming allows for the full expressivity of the underlying programming language to express probabilistic models. Typically a single program is used to represent the distribution of all variables, rather than having a separate program for each conditional probability.

Noisy-or

There are many cases where something is true if there is something that makes it true. For example, someone has a symptom if there is a disease that causes that symptom; each of the causes can be probabilistic. In natural language understanding, topics may have words probabilistically associated with them; a word is used if there is a topic that makes that word appear. Each of these examples follows the same pattern, called a noisy-or.

In a noisy-or model of a conditional probability, the child is true if one of the parents is activated and each parent has a probability of activation. So the child is an “or” of the activations of the parents.

If Y has Boolean parents X1, …, Xk, the noisy-or model of probability is defined by k+1 parameters w0, …, wk. Y is defined in terms of a deterministic system with noisy inputs, where

Y ↔ n0 ∨ (n1 ∧ x1) ∨ ⋯ ∨ (nk ∧ xk).

The ni are unconditionally independent noise variables, with P(ni)=wi, and xi means Xi=true.

The same distribution for P(Y ∣ X1, X2, …, Xk) can be defined using k+1 Boolean variables A0, A1, …, Ak, where for each i>0, Ai has Xi as its only parent. See Figure 9.8.

Figure 9.8: Noisy-or P(Y ∣ X1, X2, …, Xk) as a belief network

P(Ai=true ∣ Xi=true)=wi and P(Ai=true ∣ Xi=false)=0. The variable A0 has P(A0=true)=w0. The variables A0, …, Ak are the parents of Y. The conditional probability for Y, P(Y ∣ A0, A1, …, Ak), is 1 if any of the Ai are true and 0 if all of the Ai are false. Thus, w0 is the probability of Y when all of the Xi are false; the probability of Y increases as more of the Xi become true.

Example 9.20.

Suppose the robot could get wet from rain or coffee. There is a probability that it gets wet from rain if it rains, a probability that it gets wet from coffee if it has coffee, and a probability that it gets wet for other reasons. The robot gets wet if it gets wet from one of the reasons, giving the “or”. You could have P(wet_from_rain ∣ rain)=0.3, P(wet_from_coffee ∣ coffee)=0.2, and, for the bias term, P(wet_for_other_reasons)=0.1. The robot is wet if it is wet from rain, wet from coffee, or wet for other reasons.
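The noisy-or probability is one minus the probability that no activation occurs. A sketch with the numbers of Example 9.20 (the function name is my own):

```python
def noisy_or(w0, weights, xs):
    """P(Y=true) = 1 - (1 - w0) * product of (1 - wi) over the true parents."""
    p_no_activation = 1 - w0
    for wi, xi in zip(weights, xs):
        if xi:
            p_no_activation *= 1 - wi
    return 1 - p_no_activation

# Example 9.20: rain activates with 0.3, coffee with 0.2, bias term 0.1.
noisy_or(0.1, [0.3, 0.2], [True, True])    # 1 - 0.9*0.7*0.8 ≈ 0.496
noisy_or(0.1, [0.3, 0.2], [False, False])  # just the bias term ≈ 0.1
```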

Log-linear Models and Logistic Regression

In a log-linear model unnormalized probabilities are specified using a product of terms, and probabilities are inferred by normalization. When the terms being multiplied are all positive, a product can be represented as the exponential of a sum. A sum of terms is often a convenient form to work with.

The simplest case is for a Boolean variable. To represent conditional probabilities of a Boolean variable h:

P(h ∣ e) = P(h ∧ e) / P(e)
= P(h ∧ e) / (P(h ∧ e) + P(¬h ∧ e))
= 1 / (1 + P(¬h ∧ e)/P(h ∧ e))
= 1 / (1 + exp(−log(P(h ∧ e)/P(¬h ∧ e))))
= sigmoid(log odds(h ∣ e)).
  • The sigmoid function, sigmoid(x) = 1/(1+exp(−x)), plotted in Figure 7.11, has been used previously in this book for logistic regression and neural networks.

  • The conditional odds (as often used by bookmakers in gambling)

    odds(h ∣ e) = P(h ∣ e) / P(¬h ∣ e)
    = P(h ∧ e) / P(¬h ∧ e)
    = P(h ∣ e) / (1 − P(h ∣ e)).

    Decomposing P(h ∧ e) into P(e ∣ h) P(h), and analogously for the denominator, gives

    odds(h ∣ e) = (P(e ∣ h) / P(e ∣ ¬h)) · (P(h) / P(¬h))

    where P(h)/P(¬h) = P(h)/(1−P(h)) is the prior odds and P(e ∣ h)/P(e ∣ ¬h) is the likelihood ratio. When P(e ∣ h)/P(e ∣ ¬h) is a product of terms, the log is a sum of terms.

The logistic regression model of a conditional probability P(Y ∣ X1, …, Xk) is of the form

P(Y=true ∣ X1, …, Xk) = sigmoid(∑_i wi·Xi).   (9.3)

Assume a dummy input X0 which is always 1; w0 is the bias. This corresponds to a decomposition of the conditional probability, where the likelihood ratio is a product of terms, one for each Xi.

Note that P(Y ∣ X1=0, …, Xk=0) = sigmoid(w0). Thus, w0 determines the probability when all of the parents are zero. Each wi specifies a value that should be added as Xi changes from 0 to 1: P(Y ∣ X1=0, …, Xi=1, …, Xk=0) = sigmoid(w0 + wi). The logistic regression model makes the independence assumption that the influence of each parent on the child does not depend on the other parents. In particular, it assumes that the odds can be decomposed into a product of terms that each only depend on a single variable. When learning logistic regression models (Section 7.3.2), the training data does not need to obey that independence; rather, the algorithm tries to find a logistic regression model that best predicts the training data (in the sense of having the lowest squared error or log loss).

Example 9.21.

Suppose the probability of wet given whether there is rain, coffee, kids, or whether the robot has a coat, P(wet ∣ Rain, Coffee, Kids, Coat), is

sigmoid(−1.0 + 2.0·Rain + 1.0·Coffee + 0.5·Kids − 1.5·Coat).

This implies the following conditional probabilities:

P(wet ∣ ¬rain ∧ ¬coffee ∧ ¬kids ∧ ¬coat) = sigmoid(−1.0) = 0.27
P(wet ∣ rain ∧ ¬coffee ∧ ¬kids ∧ ¬coat) = sigmoid(1.0) = 0.73
P(wet ∣ rain ∧ ¬coffee ∧ ¬kids ∧ coat) = sigmoid(−0.5) = 0.38.

This requires fewer parameters than the 2^4 = 16 parameters required for a tabular representation, but makes more independence assumptions.
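These numbers can be checked directly (sigmoid and p_wet are my own helper names):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def p_wet(rain, coffee, kids, coat):
    # P(wet | Rain, Coffee, Kids, Coat) from Example 9.21, inputs 0 or 1
    return sigmoid(-1.0 + 2.0*rain + 1.0*coffee + 0.5*kids - 1.5*coat)

round(p_wet(0, 0, 0, 0), 2)  # 0.27
round(p_wet(1, 0, 0, 0), 2)  # 0.73
round(p_wet(1, 0, 0, 1), 2)  # 0.38
```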

Noisy-or and logistic regression models are similar, but different. Noisy-or is typically used when the causal assumption, that a variable is true if it is caused to be true by one of the parents, is appropriate. Logistic regression is used when the various parents add up to influence the child. See the box.

 

Noisy-or Compared to Logistic Regression

Noisy-or and logistic regression when used for Boolean parents are similar in many ways. They both take a parameter for each parent plus another parameter that is used when all parents are false. Logistic regression cannot represent zero probabilities. Noisy-or cannot represent cases where a parent being true makes the child less probable (although the negation can be a parent).

They can both represent the probability exactly when zero or one parent is true (as long as there are no zero probabilities for logistic regression). They are quite different in how they handle cases when multiple parents are true.

To see the difference, consider representing P(Y ∣ X1, X2, X3, X4), using parameters w0, …, w4. Assume the following probabilities when zero or one parent is true:

X1     X2     X3     X4     Prob
False  False  False  False  p0 = 0.01
True   False  False  False  p1 = 0.05
False  True   False  False  p2 = 0.1
False  False  True   False  p3 = 0.2
False  False  False  True   p4 = 0.2

For noisy-or, w0=p0 and 1(1w0)(1wi)=pi. Solving for wi gives wi= 1(1pi)/(1w0).

For logistic regression, sigmoid(w0)=p0 and sigmoid(w0+wi)=pi.

The predictions for Y=true for the other assignments to Xs are:

X1      X2      X3      X4      Noisy-or    Logistic Regression
False   False   True    True    0.353535    0.860870
False   True    False   True    0.272727    0.733333
False   True    True    False   0.272727    0.733333
False   True    True    True    0.412305    0.985520
True    False   False   True    0.232323    0.565714
True    False   True    False   0.232323    0.565714
True    False   True    True    0.379655    0.969916
True    True    False   False   0.136364    0.366667
True    True    False   True    0.302112    0.934764
True    True    True    False   0.302112    0.934764
True    True    True    True    0.436050    0.997188

Logistic regression is much more extreme than noisy-or when multiple Xi are true. With noisy-or, each Xi=true probabilistically forces Y to be true. With logistic regression, each Xi=true provides independent evidence for Y.

If the probability p0 is increased to, say, 0.05, with p1,,p4 fixed, then for assignments with multiple Xi true the probability of Y=true goes up for noisy-or (as there are more ways Y could be true) and goes down for logistic regression (as each Xi being true provides less evidence of exceptionality).
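The parameter fitting and the prediction table in this box can be reproduced with a short script. This is a sketch; the function names are illustrative:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def logit(p):
    """Inverse of the sigmoid: log-odds of p."""
    return math.log(p / (1 - p))

p0 = 0.01
ps = [0.05, 0.1, 0.2, 0.2]  # p1..p4

# Noisy-or: w0 = p0, and each wi solves 1 - (1-w0)(1-wi) = pi.
nor_w0 = p0
nor_w = [1 - (1 - pi) / (1 - nor_w0) for pi in ps]

# Logistic regression: sigmoid(w0) = p0 and sigmoid(w0 + wi) = pi.
lr_w0 = logit(p0)
lr_w = [logit(pi) - lr_w0 for pi in ps]

def noisy_or(xs):
    prod = 1 - nor_w0
    for x, w in zip(xs, nor_w):
        if x:
            prod *= 1 - w
    return 1 - prod

def logistic(xs):
    return sigmoid(lr_w0 + sum(w for x, w in zip(xs, lr_w) if x))

print(noisy_or([0, 0, 1, 1]), logistic([0, 0, 1, 1]))  # ≈ 0.353535, 0.860870
print(noisy_or([1, 1, 1, 1]), logistic([1, 1, 1, 1]))  # ≈ 0.436050, 0.997188
```

Computing all eleven rows this way confirms that logistic regression grows much more extreme than noisy-or as more parents become true.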

 

One way to extend logistic regression to be able to represent more conditional probabilities is to allow weights for conjunctions of Boolean properties. For example, if Boolean Y has Booleans X1 and X2 as parents, four weights can be used to define the conditional distribution:

P(yX1,X2)=sigmoid(w0+w1X1+w2X2+w3X1X2)

where X1X2 is 1 only when both X1 and X2 are true – the product of variables with domain {0,1} corresponds to conjunction. Then

w0 =logit(P(yX1= 0,X2= 0))
w1 =logit(P(yX1= 1,X2= 0))w0
w2 =logit(P(yX1= 0,X2= 1))w0
w3 =logit(P(yX1= 1,X2= 1))w0w1w2

where the logit function is the inverse of the sigmoid.
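This construction can be verified numerically. The sketch below uses a hypothetical conditional probability table, chosen only to illustrate that the four weights reproduce it exactly:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def logit(p):
    """Inverse of the sigmoid."""
    return math.log(p / (1 - p))

# Hypothetical table for P(y | X1, X2), keyed by (X1, X2).
cpt = {(0, 0): 0.1, (1, 0): 0.6, (0, 1): 0.3, (1, 1): 0.9}

w0 = logit(cpt[(0, 0)])
w1 = logit(cpt[(1, 0)]) - w0
w2 = logit(cpt[(0, 1)]) - w0
w3 = logit(cpt[(1, 1)]) - w0 - w1 - w2

# The four weights recover every entry of the table.
for (x1, x2), p in cpt.items():
    assert abs(sigmoid(w0 + w1*x1 + w2*x2 + w3*x1*x2) - p) < 1e-9
```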

In general, for P(YX1,,Xk) there are 2k parameters, one for each subset of {X1,,Xk} being conjoined (or multiplied). This is known as the canonical representation. This has the same number of parameters as a conditional probability table, but can be made simpler by not representing the zero weights. Logistic regression is the extreme form where all of the interaction (product) weights are zero. It is common to start with the logistic regression form and add as few interaction terms as needed. This is the representaion of conditioning probabilities for Boolean outputs learned by gradient-boosted trees, with the decision-tree paths representing conjunctions.

A logistic regression model for Boolean variables can be represented using weighted logical formulas, where a weighted logical formula is a pair of a formula and a weight, such that P(Y=trueX1,,Xk) is proportional to the exponential of the sum of the weights of the formulas that hold when Y is true and the Xi have their given values. The model of Equation 9.3 can be specified by (y,w0) and (yxi,wi) for each i, where xi means Xi=true. For the canonical representation, more complicated formulas can be used; for example, the case in the previous paragraph is represented by adding the weighted formula (yx1x2,w3) to the logistic regression weighted formulas.

The extension of logistic regression to non-binary discrete variables is softmax regression: the softmax of linear functions.
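A minimal sketch of this relationship: the softmax normalizes exponentiated linear scores into a distribution, and with two classes (one score fixed at 0) it reduces to the sigmoid.

```python
import math

def softmax(scores):
    """Normalized exponentials of a list of linear scores."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

x = 1.3  # an arbitrary linear score
p = softmax([x, 0.0])
assert abs(sum(p) - 1.0) < 1e-12          # softmax gives a distribution
assert abs(p[0] - sigmoid(x)) < 1e-12     # two-class softmax is the sigmoid
```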

 

Directed and Undirected Graphical Models

A belief network is a directed model; all of the arcs are directed. The model is defined in terms of conditional probabilities.

An alternative is an undirected model. A factor graph consists of a set of variables and a set of nonnegative factors, each with scope a subset of the variables. There is an arc between each variable and the factors it is in the scope of, similar to a constraint network. It is sometimes written as a Markov random field or Markov network where the factors are implicit: the nodes are the variables, with an edge between any pair of nodes if there is a factor containing both.

The joint probability is defined by

P(X1 =v1X2=v2Xn=vn)FF(XF=vf)

where XF is the tuple of variables in factor F and vF is the tuple of corresponding values.

The constant of proportionality,

X1,,XnFF(XF)

is called the partition function. The exact algorithms of Section 9.5 can be used to compute the partition function; those algorithms just assume a set of factors.
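For small networks the partition function can be computed by brute-force enumeration. The factors below are hypothetical, chosen only to illustrate the definition: f1 has scope {A,B} and f2 has scope {B,C}, all variables Boolean.

```python
import itertools

# Hypothetical nonnegative factors, keyed by the values of their scope.
f1 = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 1.0}  # scope {A, B}
f2 = {(0, 0): 5.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 4.0}  # scope {B, C}

def unnormalized(a, b, c):
    """Product of the factors for one assignment."""
    return f1[(a, b)] * f2[(b, c)]

# The partition function: sum the product of factors over all assignments.
Z = sum(unnormalized(a, b, c)
        for a, b, c in itertools.product([0, 1], repeat=3))

def prob(a, b, c):
    return unnormalized(a, b, c) / Z

total = sum(prob(a, b, c) for a, b, c in itertools.product([0, 1], repeat=3))
assert abs(total - 1.0) < 1e-12  # dividing by Z yields a distribution
```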

Sometimes the factors are defined in terms of weights, so that F(XF1,,XFk)=exp(wF(XF1,,XFk)). This changes the product above to

P(X1 =v1X2=v2Xn=vn)  exp(F wF(XF=vF))

giving a log-linear model, which is useful for learning: the derivative is simpler and the nonnegativity constraint is satisfied automatically, although zero probabilities cannot be represented.

Note that a belief network can also be seen as a factor graph, where the factors are defined in terms of conditional probabilities. The directed and undirected models are collectively called graphical models.

A canonical representation is a representation that has a unique form. One problem with undirected models is that there is no canonical representation for a probability distribution. For example, multiplying the entries of one factor on Xi where Xi=vi by a constant, and dividing the entries of another factor on Xi where Xi=vi by the same constant, gives the same distribution. This means that a factor cannot be learned modularly; each factor depends on the others. A belief network forces a particular factorization, which gives a canonical representation for a distribution.
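This non-uniqueness is easy to demonstrate. The sketch below uses two hypothetical factors sharing the Boolean variable B: rescaling the B=1 entries of one factor and inversely rescaling those of the other leaves the normalized distribution unchanged.

```python
import itertools

# Hypothetical factors sharing variable B.
f1 = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 1.0}  # scope {A, B}
f2 = {(0, 0): 5.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 4.0}  # scope {B, C}

# Multiply f1's entries with B = 1 by a constant, and divide f2's
# entries with B = 1 by the same constant.
k = 7.0
g1 = {(a, b): v * (k if b == 1 else 1.0) for (a, b), v in f1.items()}
g2 = {(b, c): v / (k if b == 1 else 1.0) for (b, c), v in f2.items()}

def dist(fa, fb):
    """Normalized joint distribution defined by two factors."""
    joint = {(a, b, c): fa[(a, b)] * fb[(b, c)]
             for a, b, c in itertools.product([0, 1], repeat=3)}
    z = sum(joint.values())
    return {assignment: v / z for assignment, v in joint.items()}

p, q = dist(f1, f2), dist(g1, g2)
# Different factors, identical distribution:
assert all(abs(p[a] - q[a]) < 1e-12 for a in p)
```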