8.4 Probabilistic Inference


8.4.2 Representing Conditional Probabilities and Factors

A conditional probability distribution is a function on variables; given an assignment to the values of the variables, it gives a number. A factor is a function of a set of variables; the variables it depends on are the scope of the factor. Thus a conditional probability is a factor, as it is a function on variables. This section explores some variants for representing factors and conditional probabilities. Some of the representations are for arbitrary factors and some are specific to conditional probabilities.

Factors do not have to be implemented as conditional probability tables; a tabular representation is often too large when there are many parents. Often, structure in conditional probabilities can be exploited.

One such structure exploits context-specific independence, where one variable is conditionally independent of another, given a particular value of a third variable.

Example 8.25.

Suppose a robot can go outside or get coffee (so the variable Action has domain {go_out, get_coffee}). Whether it gets wet (variable Wet) depends on whether there is rain (variable Rain) in the context that it went out, or on whether the cup was full (variable Full) if it got coffee. Thus Wet is independent of Rain given Action=get_coffee, but is dependent on Rain given Action=go_out. Also, Wet is independent of Full given Action=go_out, but is dependent on Full given Action=get_coffee.

Context-specific independence may be exploited in a representation by not requiring numbers that are not needed. A simple representation for conditional probabilities that models context-specific independence is a decision tree, where the parents in a belief network correspond to the input features and the child corresponds to the target feature. Another representation is in terms of definite clauses with probabilities. Context-specific independence could also be represented as tables that have contexts that specify when they should be used, as in the following example.

Example 8.26.

The conditional probability P(Wet | Action, Rain, Full) could be represented as a decision tree, as definite clauses with probabilities, or as tables with contexts:

Definite clauses with probabilities:

wet ← go_out ∧ rain : 0.8
wet ← go_out ∧ ¬rain : 0.1
wet ← get_coffee ∧ full : 0.6
wet ← get_coffee ∧ ¬full : 0.3

Tables with contexts:

go_out:
Rain  Wet  Prob
t     t    0.8
t     f    0.2
f     t    0.1
f     f    0.9

get_coffee:
Full  Wet  Prob
t     t    0.6
t     f    0.4
f     t    0.3
f     f    0.7
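The tables-with-contexts representation can be implemented directly: the value of Action selects which small table to use, and the parent that is irrelevant in that context is ignored. The following is a minimal Python sketch of this idea for Example 8.26; the dictionary layout and function name are illustrative, not taken from the book's code.

# Tables with contexts for Example 8.26.
# The context (the value of Action) selects which small table applies;
# the parent that is irrelevant in that context is ignored.

# P(Wet = true | relevant parent) for each value of Action.
wet_tables = {
    "go_out":     {"parent": "Rain", True: 0.8, False: 0.1},
    "get_coffee": {"parent": "Full", True: 0.6, False: 0.3},
}

def p_wet(action, rain, full):
    """Return P(Wet = true | Action = action, Rain = rain, Full = full)."""
    table = wet_tables[action]
    parent_value = rain if table["parent"] == "Rain" else full
    return table[parent_value]

# Wet is independent of Full in the context Action = go_out:
assert p_wet("go_out", True, True) == p_wet("go_out", True, False) == 0.8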

Another common representation is a noisy-or, where the child is true if one of the parents is activated, and each parent has a probability of activation. So the child is an "or" of the activations of the parents. The noisy-or is defined as follows. If X has Boolean parents V1, …, Vk, the probability is defined by k+1 parameters p0, …, pk. We invent k+1 new Boolean variables A0, A1, …, Ak, where for each i > 0, Ai has Vi as its only parent. Define P(Ai=true | Vi=true) = pi and P(Ai=true | Vi=false) = 0. The bias term, A0, has P(A0) = p0. The variables A0, …, Ak are the parents of X, and the conditional probability is that P(X | A0, A1, …, Ak) is 1 if any of the Ai are true and is 0 if all of the Ai are false. Thus p0 is the probability of X when all of the Vi are false; the probability of X increases as more of the Vi become true.

Example 8.27.

Suppose the robot could get wet from rain or coffee. There is a probability that it gets wet from rain if it rains, a probability that it gets wet from coffee if it has coffee, and a probability that it gets wet for other reasons. The robot gets wet if it gets wet from one of them, giving the "or". We could have P(wet_from_rain | rain) = 0.3, P(wet_from_coffee | coffee) = 0.2 and, for the bias term, P(wet_for_other_reasons) = 0.1. The robot is wet if it is wet from rain, wet from coffee, or wet for other reasons.
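Under the noisy-or definition above, X is false exactly when none of the Ai is activated, so P(X=true | V1, …, Vk) = 1 - (1 - p0) * ∏{i : Vi=true} (1 - pi). The following is a small Python sketch of this computation using the parameters of Example 8.27; the function and argument names are illustrative.

# Noisy-or: X is false only if no activation variable A_i is true.
# P(X = true | parents) = 1 - (1 - p0) * prod over true parents of (1 - p_i)

def noisy_or(p0, activation_probs, parent_values):
    """p0 is the bias probability; activation_probs[i] pairs with parent_values[i]."""
    p_false = 1.0 - p0
    for p_i, v_i in zip(activation_probs, parent_values):
        if v_i:
            p_false *= 1.0 - p_i
    return 1.0 - p_false

# Example 8.27 parameters: rain -> 0.3, coffee -> 0.2, bias 0.1.
p_wet = noisy_or(0.1, [0.3, 0.2], [True, True])   # it rains and the robot has coffee
print(round(p_wet, 3))   # 0.496 = 1 - 0.9 * 0.7 * 0.8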

A log-linear model is a model where probabilities are specified as a product of terms. When the terms are non-zero (they are all strictly positive), the log of the product is a sum of logs, and a sum of terms is often a convenient form to work with. To see how such a form is used to represent conditional probabilities, we can write the conditional probability in the following way:

P(h | e) = P(h ∧ e) / (P(h ∧ e) + P(¬h ∧ e))
= 1 / (1 + P(¬h ∧ e)/P(h ∧ e))
= 1 / (1 + e^(-log(P(h ∧ e)/P(¬h ∧ e))))
= sigmoid(log odds(h | e))
  • The sigmoid function, sigmoid(x) = 1/(1 + e^(-x)), plotted in Figure 7.9, has been used previously in this book for logistic regression and neural networks.

  • The conditional odds (as often used by bookmakers in gambling) is

    odds(h | e) = P(h | e) / P(¬h | e)
    = (P(e | h) / P(e | ¬h)) * (P(h) / P(¬h))

    where P(h)/P(¬h) = P(h)/(1 - P(h)) is the prior odds and P(e | h)/P(e | ¬h) is the likelihood ratio. For a fixed h, it is often useful to represent P(e | h)/P(e | ¬h) as a product of terms, and so the log is a sum of terms.
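As a sanity check on the identity P(h | e) = sigmoid(log odds(h | e)) derived above, the following short Python snippet compares the direct computation with the sigmoid-of-log-odds form; the joint probabilities are made up for illustration only.

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Illustrative joint probabilities (not from the book):
p_h_and_e = 0.3      # P(h ∧ e)
p_noth_and_e = 0.1   # P(¬h ∧ e)

direct = p_h_and_e / (p_h_and_e + p_noth_and_e)              # P(h | e) = 0.75
via_log_odds = sigmoid(math.log(p_h_and_e / p_noth_and_e))   # sigmoid(log odds)
assert abs(direct - via_log_odds) < 1e-9
print(direct, via_log_odds)   # both approximately 0.75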

The logistic regression model of a conditional probability P(X | Y1, …, Yk) is of the form

P(x | Y1, …, Yk) = sigmoid(∑i wi*Yi)

where Yi is assumed to have domain {0,1}. (Assume a dummy input Y0 which is always 1.) This corresponds to a decomposition of the conditional probability, where the probabilities are a product of terms for each Yi.

Note that P(X | Y1=0, …, Yk=0) = sigmoid(w0). Thus w0 determines the probability when all of the parents are zero. Each wi specifies a value that is added to the argument of the sigmoid as Yi changes from 0 to 1. If Yi is Boolean with values {0,1}, then P(X | Y1=0, …, Yi=1, …, Yk=0) = sigmoid(w0 + wi). The logistic regression model makes the independence assumption that the influence of each parent on the child does not depend on the other parents. Learning logistic regression models was the topic of Section 7.3.2.

Example 8.28.

The probability of wet, given whether there is rain, coffee, kids, and whether the robot has a coat, may be given by:

P(wet | Rain, Coffee, Kids, Coat)
= sigmoid(-1.0 + 2.0*Rain + 1.0*Coffee + 0.5*Kids - 1.5*Coat)

This implies the following conditional probabilities

P(wet | ¬rain ∧ ¬coffee ∧ ¬kids ∧ ¬coat) = sigmoid(-1.0) = 0.27.
P(wet | rain ∧ ¬coffee ∧ ¬kids ∧ ¬coat) = sigmoid(1.0) = 0.73.
P(wet | rain ∧ ¬coffee ∧ ¬kids ∧ coat) = sigmoid(-0.5) = 0.38.

This requires fewer parameters than the 2^4 = 16 parameters required for a tabular representation, but makes more independence assumptions.
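The numbers in Example 8.28 can be reproduced with a few lines of Python; the weights are those given above, and the function name is illustrative.

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def p_wet(rain, coffee, kids, coat):
    """Logistic-regression form of P(wet | Rain, Coffee, Kids, Coat) from Example 8.28."""
    return sigmoid(-1.0 + 2.0 * rain + 1.0 * coffee + 0.5 * kids - 1.5 * coat)

print(round(p_wet(0, 0, 0, 0), 2))   # 0.27
print(round(p_wet(1, 0, 0, 0), 2))   # 0.73
print(round(p_wet(1, 0, 0, 1), 2))   # 0.38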

Noisy-or and logistic regression models are similar, but different. Noisy-or is typically used when it is appropriate to assume that a variable is true if one of its parents causes it to be true. Logistic regression is used when the influences of the various parents add up to affect the child.