
The third edition of Artificial Intelligence: foundations of computational agents, Cambridge University Press, 2023 is now available (including full text).

8.1.3 Conditional Probability

Probability is a measure of belief. Beliefs need to be updated when new evidence is observed.

The measure of belief in proposition $h$ given proposition $e$ is called the conditional probability of $h$ given $e$ , written $P(h\mid e)$ .

A proposition $e$ representing the conjunction of all of the agent’s observations of the world is called evidence. Given evidence $e$ , the conditional probability $P(h\mid e)$ is the agent’s posterior probability of $h$ . The probability $P(h)$ is the prior probability of $h$ and is the same as $P(h\mid true)$ because it is the probability before the agent has observed anything.

The evidence used for the posterior probability is everything the agent observes about a particular situation. Everything observed, and not just a few select observations, must be conditioned on to obtain the correct posterior probability.

Example 8.3.

For the diagnostic assistant, the prior probability distribution over possible diseases is used before the diagnostic agent finds out about the particular patient. Evidence is obtained through discussions with the patient, observing symptoms, and the results of lab tests. Essentially any information that the diagnostic assistant finds out about the patient is evidence. The assistant updates its probability to reflect the new evidence in order to make informed decisions.

Example 8.4.

The information that the delivery robot receives from its sensors is its evidence. When sensors are noisy, the evidence is what is known, such as the particular pattern received by the sensor, not that there is a person in front of the robot. The robot could be mistaken about what is in the world but it knows what information it received.

Semantics of Conditional Probability

Evidence $e$ , where $e$ is a proposition, will rule out all possible worlds that are incompatible with $e$ . Like the definition of logical consequence, the given proposition $e$ selects the possible worlds in which $e$ is true. As in the definition of probability, we first define the conditional probability over worlds, and then use this to define a probability over propositions.

Evidence $e$ induces a new probability $P(w\mid e)$ of world $w$ given $e$ . Any world where $e$ is false has conditional probability $0$ , and the remaining worlds are normalized so that the probabilities of the worlds sum to $1$ :

$\displaystyle P(w\mid e)=\left\{\begin{array}{ll}c*P(w)&\mbox{if }e\mbox{ is true in world }w\\ 0&\mbox{if }e\mbox{ is false in world }w\end{array}\right.$

where $c$ is a constant (that depends on $e$ ) that ensures the posterior probability of all worlds sums to 1.

For $P(w\mid e)$ to be a probability measure over worlds for each $e$ :

	$\displaystyle 1$	$\displaystyle=\sum_{w}P(w\mid e)$
		$\displaystyle=\sum_{w\>:\>e\mbox{\scriptsize{} is true in }w}P(w\mid e)+\sum_{w\>:\>e\mbox{\scriptsize{} is false in }w}P(w\mid e)$
		$\displaystyle=\sum_{w\>:\>e\mbox{\scriptsize{} is true in }w}c*P(w)+0$
		$\displaystyle=c*P(e)$

Therefore, $c=1/P(e)$ . Thus, the conditional probability is only defined if $P(e)>0$ . This is reasonable: if $P(e)=0$ , then $e$ is impossible.

The conditional probability of proposition $h$ given evidence $e$ is the sum of the conditional probabilities of the possible worlds in which $h$ is true. That is,

	$\displaystyle P(h\mid e)=$	$\displaystyle\sum_{w\>:\>h\mbox{\scriptsize{} is true in }w}P(w\mid e)$
	$\displaystyle\mbox{}=$	$\displaystyle\sum_{w\>:\>h\wedge e\mbox{\scriptsize{} is true in }w}P(w\mid e)+\sum_{w\>:\>h\wedge\neg e\mbox{\scriptsize{} is true in }w}P(w\mid e)$
	$\displaystyle\mbox{}=$	$\displaystyle\sum_{w\>:\>h\wedge e\mbox{\scriptsize{} is true in }w}\frac{1}{P(e)}*P(w)+0$
	$\displaystyle\mbox{}=$	$\displaystyle\frac{P(h\wedge e)}{P(e)}.$

The last form above is typically given as the definition of conditional probability. Here we have derived it as a consequence of a more basic definition.
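The derivation above can be sketched in code. The following is a minimal illustration, not the book's implementation: the four worlds, their probabilities, and the variable names `rain` and `wet` are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from fractions import Fraction

# Hypothetical toy model: each world assigns values to two Boolean variables
# and has a prior probability. The numbers are illustrative only.
worlds = [
    ({"rain": True,  "wet": True},  Fraction(3, 10)),
    ({"rain": True,  "wet": False}, Fraction(1, 10)),
    ({"rain": False, "wet": True},  Fraction(2, 10)),
    ({"rain": False, "wet": False}, Fraction(4, 10)),
]

def prob(prop, worlds):
    """P(prop): sum of the probabilities of the worlds in which prop is true."""
    return sum(p for w, p in worlds if prop(w))

def condition(worlds, e):
    """Reweight the worlds by evidence e: 0 where e is false, P(w)/P(e) elsewhere."""
    p_e = prob(e, worlds)
    if p_e == 0:
        raise ValueError("conditional probability is undefined when P(e) = 0")
    return [(w, p / p_e if e(w) else Fraction(0)) for w, p in worlds]

rain = lambda w: w["rain"]
wet = lambda w: w["wet"]

# P(rain | wet), computed from the conditional probabilities of the worlds.
posterior_worlds = condition(worlds, wet)
p_rain_given_wet = prob(rain, posterior_worlds)

# The same quantity computed as P(rain ∧ wet) / P(wet).
ratio = prob(lambda w: rain(w) and wet(w), worlds) / prob(wet, worlds)
```

Under these assumed numbers both routes give $3/5$, and the posterior probabilities of the worlds sum to $1$, as the normalization by $c=1/P(e)$ requires.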

Example 8.5.

As in Example 8.2, consider the worlds of Figure 8.1, each with probability 0.1. Given the evidence $Filled{=}false$ , only 4 worlds have a non-zero posterior probability. $P(Shape{=}circle\mid Filled{=}false)=0.25$ and $P(Shape{=}star\mid Filled{=}false)=0.5$ .

A conditional probability distribution, written $P(X\mid Y)$ where $X$ and $Y$ are variables or sets of variables, is a function of the variables: given a value $x\in domain(X)$ for $X$ and a value $y\in domain(Y)$ for $Y$ , it gives the value $P(X=x\mid Y=y)$ , where the latter is the conditional probability of the propositions.

Background Knowledge and Observation

The difference between background knowledge and observation was described in Section 5.4.1. When reasoning with uncertainty, the background model is described in terms of a probabilistic model, and the observations form evidence that must be conditioned on.

Within probability, there are two ways to state that $a$ is true:

• The first is to state that the probability of $a$ is $1$ by writing $P(a)=1$ .

• The second is to condition on $a$ , which involves using $a$ on the right-hand side of the conditional bar, as in $P(\cdot\mid a)$ .

The first method states that $a$ is true in all possible worlds. The second says that the agent is only interested in worlds where $a$ happens to be true.

Suppose an agent was told about a particular animal:

	$\displaystyle P(flies\mid bird)$	$\displaystyle=0.8,$
	$\displaystyle P(bird\mid emu)$	$\displaystyle=1.0,$
	$\displaystyle P(flies\mid emu)$	$\displaystyle=0.001.$

If the agent determines the animal is an emu, it cannot add the statement $P(emu)=1$ . No probability distribution satisfies these four assertions. If emu were true in all possible worlds, it would not be the case that in 0.8 of the possible worlds, the individual flies. The agent, instead, must condition on the fact that the individual is an emu.
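One way to see the inconsistency, as a sketch that uses only the three statements above and the definition of conditional probability: if $P(emu)=1$ , then $P(\neg emu)=0$ , so for any proposition $x$ ,

	$\displaystyle P(x\mid emu)=\frac{P(x\wedge emu)}{P(emu)}=P(x\wedge emu)=P(x).$

Taking $x=bird$ gives $P(bird)=1$ , and the same argument applied to $bird$ then gives $P(flies)=P(flies\mid bird)=0.8$ . Taking $x=flies$ gives $P(flies)=P(flies\mid emu)=0.001$ . Both cannot hold.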

To build a probability model, a knowledge base designer takes some knowledge into consideration and builds a probability model based on this knowledge. All knowledge acquired subsequently must be treated as observations that are conditioned on.

Suppose proposition $k$ represents an agent’s observations up to some time. The agent’s subsequent belief states can be modeled by either of the following:

• construct a probability model for the agent’s belief before it had observed $k$ and then condition on the evidence $k$ conjoined with the subsequent evidence $e$ (i.e., for each proposition $\alpha$ use $P(\alpha\mid e\wedge k)$ ), or

• construct a probability model, call it $P_{k}$ , which models the agent’s beliefs after observing $k$ , and then condition on subsequent evidence $e$ (i.e., use $P_{k}(\alpha\mid e)$ for proposition $\alpha$ ).

All subsequent probabilities will be identical no matter which construction was used. Building $P_{k}$ directly is sometimes easier because the model does not have to cover the cases of when $k$ is false. Sometimes, however, it is easier to build $P$ and condition on $k$ .
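The agreement between the two constructions can be checked numerically. The sketch below assumes a hypothetical prior over three Boolean variables standing for $k$ , $e$ and $\alpha$ ; the weights are arbitrary and the helper names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Hypothetical prior P over worlds (k, e, a); the weights are arbitrary.
weights = [1, 2, 3, 4, 5, 6, 7, 8]
total = sum(weights)
P = {w: Fraction(wt, total)
     for w, wt in zip(product([True, False], repeat=3), weights)}

def prob(dist, pred):
    """Probability of pred under dist: sum over the worlds where pred holds."""
    return sum(p for w, p in dist.items() if pred(w))

def cond(dist, pred_h, pred_e):
    """Conditional probability of pred_h given pred_e under dist."""
    return prob(dist, lambda w: pred_h(w) and pred_e(w)) / prob(dist, pred_e)

k = lambda w: w[0]
e = lambda w: w[1]
a = lambda w: w[2]

# Construction 1: keep the original model P and condition on e ∧ k.
via_P = cond(P, a, lambda w: e(w) and k(w))

# Construction 2: build P_k, which only covers worlds where k is true,
# then condition on the subsequent evidence e alone.
p_k = prob(P, k)
P_k = {w: p / p_k for w, p in P.items() if k(w)}
via_P_k = cond(P_k, a, e)
```

With these assumed weights both `via_P` and `via_P_k` equal $1/3$ . Note that `P_k` has no entries for worlds where $k$ is false, which is the sense in which building it directly can be easier.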

What is important is that there is a coherent stage where the probability model is reasonable and every subsequent observation is conditioned on.

The definition of conditional probability allows the decomposition of a conjunction into a product of conditional probabilities:

Proposition 8.3.

(Chain rule) For any propositions $\alpha_{1},\dots,\alpha_{n}$ :

	$\displaystyle P(\alpha_{1}\wedge\alpha_{2}\wedge\ldots\wedge\alpha_{n})=$	$\displaystyle P(\alpha_{1})*$
		$\displaystyle P(\alpha_{2}\mid\alpha_{1})*$
		$\displaystyle P(\alpha_{3}\mid\alpha_{1}\wedge\alpha_{2})*$
		$\displaystyle\vdots$
		$\displaystyle P(\alpha_{n}\mid\alpha_{1}\wedge\cdots\wedge\alpha_{n-1})$
	$\displaystyle=$	$\displaystyle\prod_{i=1}^{n}P(\alpha_{i}\mid\alpha_{1}\wedge\cdots\wedge\alpha_{i-1}),$

where the right-hand side is assumed to be zero if any of the products are zero (even if some of them are undefined).
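As an illustration, the chain rule can be checked on a small joint distribution. This is a sketch under assumed numbers: the joint distribution and the three propositions are hypothetical, and the function returns zero as soon as a factor is zero, following the convention stated above.

```python
from fractions import Fraction
from itertools import product

# Hypothetical joint distribution over three Boolean variables; weights are arbitrary.
weights = [1, 2, 3, 4, 5, 6, 7, 8]
total = sum(weights)
joint = {w: Fraction(wt, total)
         for w, wt in zip(product([True, False], repeat=3), weights)}

def prob(pred):
    """Probability of pred: sum over the worlds where pred holds."""
    return sum(p for w, p in joint.items() if pred(w))

def chain_rule(props):
    """Product of P(a_i | a_1 ∧ ... ∧ a_{i-1}), taken to be 0 once a factor is 0."""
    result = Fraction(1)
    for i, a in enumerate(props):
        earlier = props[:i]
        p_earlier = prob(lambda w: all(b(w) for b in earlier))
        if p_earlier == 0:
            # An earlier factor was 0, so the whole product is 0.
            return Fraction(0)
        p_both = prob(lambda w: all(b(w) for b in earlier) and a(w))
        result *= p_both / p_earlier
    return result

props = [lambda w: w[0], lambda w: w[1], lambda w: not w[2]]
lhs = prob(lambda w: all(a(w) for a in props))  # probability of the conjunction
rhs = chain_rule(props)                         # product of conditionals
```

For these assumed weights both sides equal $1/18$ .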

Note that any theorem about unconditional probabilities can be converted into a theorem about conditional probabilities by adding the same evidence to each probability. This is because the conditional probability measure is a probability measure. For example, case (e) of Proposition 8.2 implies $P(\alpha\vee\beta\mid k)=P(\alpha\mid k)+P(\beta\mid k)-P(\alpha\wedge\beta\mid k)$ .

Bayes’ Rule

An agent using probability updates its belief when it observes new evidence. A new piece of evidence is conjoined to the old evidence to form the complete set of evidence. Bayes’ rule specifies how an agent should update its belief in a proposition based on a new piece of evidence.

Suppose an agent has a current belief in proposition $h$ based on evidence $k$ already observed, given by $P(h\mid k)$ , and subsequently observes $e$ . Its new belief in $h$ is $P(h\mid e\wedge k)$ . Bayes’ rule tells us how to update the agent’s belief in hypothesis $h$ as new evidence arrives.

Proposition 8.4.

(Bayes’ rule) As long as $P(e\mid k)\neq 0$ ,

$\displaystyle P(h\mid e\wedge k)=\frac{P(e\mid h\wedge k)*P(h\mid k)}{P(e\mid k)}.$

This is often written with the background knowledge $k$ implicit. In this case, if $P(e)\neq 0$ , then

$\displaystyle P(h\mid e)=\frac{P(e\mid h)*P(h)}{P(e)}.$

$P(e\mid h)$ is the likelihood and $P(h)$ is the prior probability of the hypothesis $h$ . Bayes’ rule states that the posterior probability is proportional to the likelihood times the prior.

Proof.

The commutativity of conjunction means that $h\wedge e$ is equivalent to $e\wedge h$ , and so they have the same probability given $k$ . Using the rule for multiplication in two different ways,

	$\displaystyle P(h\wedge e\mid k)$	$\displaystyle=P(h\mid e\wedge k)*P(e\mid k)$
	$\displaystyle{}=P(e\wedge h\mid k)$	$\displaystyle=P(e\mid h\wedge k)*P(h\mid k).$

The theorem follows from dividing the right-hand sides by $P(e\mid k)$ , which is not 0 by assumption. ∎

Often, Bayes’ rule is used to compare various hypotheses (different $h_{i}$ s). The denominator $P(e\mid k)$ is a constant that does not depend on the particular hypothesis, and so when comparing the relative posterior probabilities of hypotheses, the denominator can be ignored.

To derive the posterior probability, the denominator may be computed by reasoning by cases. If $H$ is an exclusive and covering set of propositions representing all possible hypotheses, then

	$\displaystyle P(e\mid k)$	$\displaystyle=\sum_{h\in H}P(e\wedge h\mid k)$
		$\displaystyle=\sum_{h\in H}P(e\mid h\wedge k)*P(h\mid k).$

Thus, the denominator of Bayes’ rule is obtained by summing the numerators for all the hypotheses. When the hypothesis space is large, computing the denominator is computationally difficult.
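Computing the posterior over an exclusive and covering set of hypotheses can be sketched as follows. The function name `posterior`, the hypothesis labels, and the numbers are assumptions made for illustration; the background knowledge $k$ is left implicit in the inputs.

```python
from fractions import Fraction

def posterior(prior, likelihood):
    """Apply Bayes' rule to every hypothesis at once.

    prior:      dict mapping h to P(h | k)
    likelihood: dict mapping h to P(e | h ∧ k)
    Returns a dict mapping h to P(h | e ∧ k).
    """
    # Numerators of Bayes' rule, one per hypothesis.
    numerators = {h: likelihood[h] * prior[h] for h in prior}
    # The denominator P(e | k) is the sum of the numerators.
    p_e = sum(numerators.values())
    if p_e == 0:
        raise ValueError("Bayes' rule requires P(e | k) != 0")
    return {h: n / p_e for h, n in numerators.items()}

# Hypothetical numbers for three exclusive and covering hypotheses.
prior = {"h1": Fraction(1, 2), "h2": Fraction(3, 10), "h3": Fraction(1, 5)}
likelihood = {"h1": Fraction(1, 10), "h2": Fraction(1, 2), "h3": Fraction(9, 10)}
post = posterior(prior, likelihood)
```

With these assumed inputs the posteriors are $5/38$ , $15/38$ and $18/38$ . Ranking the hypotheses needs only the numerators; the sum over all hypotheses is what becomes expensive when the hypothesis space is large.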

Generally, one of $P(e\mid h\wedge k)$ or $P(h\mid e\wedge k)$ is much easier to estimate than the other. Bayes’ rule is used to compute one from the other.

Example 8.6.

In medical diagnosis, the doctor observes a patient’s symptoms, and would like to know the likely diseases. Thus the doctor would like $P(Disease\mid Symptoms)$ . This is difficult to assess as it depends on the context (e.g., some diseases are more prevalent in hospitals). It is typically easier to assess $P(Symptoms\mid Disease)$ because how the disease gives rise to the symptoms is typically less context dependent. These two are related by Bayes’ rule, where the prior probability of the disease, $P(Disease)$ , reflects the context.

Example 8.7.

The diagnostic assistant may need to know whether the light switch $s_{1}$ of Figure 1.8 is broken or not. You would expect that the electrician who installed the light switch in the past would not know if it is broken now, but would be able to specify how the output of a switch is a function of whether there is power coming into the switch, the switch position, and the status of the switch (whether it is working, shorted, installed upside-down, etc.). The prior probability for the switch being broken depends on the maker of the switch and how old it is. Bayes’ rule lets an agent infer the status of the switch given the prior and the evidence.

Example 8.8.

Suppose an agent has information about the reliability of fire alarms. It may know how likely it is that an alarm will work if there is a fire. To determine the probability that there is a fire, given that there is an alarm, Bayes’ rule gives:

	$\displaystyle P(fire\mid alarm)$	$\displaystyle=\frac{P(alarm\mid fire)*P(fire)}{P(alarm)}$
		$\displaystyle=\frac{P(alarm\mid fire)*P(fire)}{P(alarm\mid fire)*P(fire)+P(alarm\mid\neg fire)*P(\neg fire)}$

where $P(alarm\mid fire)$ is the probability that the alarm worked, assuming that there was a fire. It is a measure of the alarm’s reliability. The expression $P(fire)$ is the probability of a fire given no other information. It is a measure of how fire-prone the building is. $P(alarm)$ is the probability of the alarm sounding, given no other information. $P(fire\mid alarm)$ is more difficult to directly represent because it depends, for example, on how much vandalism there is in the neighborhood.
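To make the formula concrete, here is a small numeric sketch. The three input probabilities are invented for illustration and do not come from the text.

```python
from fractions import Fraction

# Hypothetical inputs (not from the text):
p_fire = Fraction(1, 100)                # P(fire): how fire-prone the building is
p_alarm_given_fire = Fraction(95, 100)   # P(alarm | fire): the alarm's reliability
p_alarm_given_no_fire = Fraction(2, 100) # P(alarm | ¬fire): false-alarm rate

# Denominator P(alarm), obtained by reasoning by cases on fire and ¬fire.
p_alarm = (p_alarm_given_fire * p_fire
           + p_alarm_given_no_fire * (1 - p_fire))

# Bayes' rule.
p_fire_given_alarm = p_alarm_given_fire * p_fire / p_alarm
```

Under these assumed numbers $P(fire\mid alarm)=95/293$ , roughly $0.32$ : even a reliable alarm leaves fire less likely than not when fires are rare and false alarms are not.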

Other Possible Measures of Belief

Justifying other measures of belief is problematic. Consider, for example, the proposal that the belief in $\alpha\wedge\beta$ is some function of the belief in $\alpha$ and the belief in $\beta$ . Such a measure of belief is called compositional. To see why this is not sensible, consider the single toss of a fair coin. Compare the case where $\alpha$ is “the coin will land heads”, $\beta_{1}$ is “the coin will land tails” and $\beta_{2}$ is “the coin will land heads.” The belief in $\beta_{1}$ would be the same as the belief in $\beta_{2}$ . But the belief in $\alpha\wedge\beta_{1}$ , which is impossible, is very different from the belief in $\alpha\wedge\beta_{2}$ , which is the same as $\alpha$ .
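The coin example can be written out as a short sketch; the representation of the two outcomes as a dictionary is an assumption made for illustration.

```python
from fractions import Fraction

# The two possible worlds for a single toss of a fair coin.
coin = {"heads": Fraction(1, 2), "tails": Fraction(1, 2)}

def prob(pred):
    """Probability of pred: sum over the outcomes where pred holds."""
    return sum(p for outcome, p in coin.items() if pred(outcome))

alpha = lambda o: o == "heads"
beta1 = lambda o: o == "tails"
beta2 = lambda o: o == "heads"

# A compositional measure would receive the same pair of inputs in both cases...
same_inputs = (prob(alpha), prob(beta1)) == (prob(alpha), prob(beta2))

# ...yet the two conjunctions have different probabilities.
p_alpha_and_beta1 = prob(lambda o: alpha(o) and beta1(o))
p_alpha_and_beta2 = prob(lambda o: alpha(o) and beta2(o))
```

Here `same_inputs` is true while the conjunctions have probabilities $0$ and $1/2$ , so no function of the two individual beliefs can return both.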

The conditional probability $P(f\mid e)$ is very different from the probability of the implication $P(e\rightarrow f)$ . The latter is the same as $P(\neg e\vee f)$ , which is the measure of the interpretations for which $f$ is true or $e$ is false. For example, suppose there is a domain where birds are relatively rare, and non-flying birds are a small proportion of the birds. Here $P(\neg flies\mid bird)$ would be the proportion of birds that do not fly, which would be low. $P(bird\rightarrow\neg flies)$ is the same as $P(\neg bird\vee\neg flies)$ , which would be dominated by non-birds and so would be high. Similarly, $P(bird\rightarrow flies)$ would also be high, the probability also being dominated by the non-birds. It is difficult to imagine a situation where the probability of an implication is the kind of knowledge that is appropriate or useful.
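The gap between the conditional probability and the probability of the implication can be seen with a small numeric sketch. All three input probabilities are hypothetical, chosen so that birds are rare and most birds fly.

```python
from fractions import Fraction

# Hypothetical inputs (not from the text):
p_bird = Fraction(1, 100)                 # birds are rare
p_noflies_given_bird = Fraction(1, 20)    # few birds do not fly
p_flies_given_notbird = Fraction(1, 10)   # arbitrary; does not affect the results

# Joint distribution over worlds (bird, flies).
joint = {
    (True, True):   p_bird * (1 - p_noflies_given_bird),
    (True, False):  p_bird * p_noflies_given_bird,
    (False, True):  (1 - p_bird) * p_flies_given_notbird,
    (False, False): (1 - p_bird) * (1 - p_flies_given_notbird),
}

def prob(pred):
    """Probability of pred: sum over the worlds where pred holds."""
    return sum(p for w, p in joint.items() if pred(*w))

# P(¬flies | bird)
cond_noflies = prob(lambda b, f: b and not f) / prob(lambda b, f: b)
# P(bird → ¬flies) = P(¬bird ∨ ¬flies)
impl_noflies = prob(lambda b, f: (not b) or (not f))
# P(bird → flies) = P(¬bird ∨ flies)
impl_flies = prob(lambda b, f: (not b) or f)
```

With these assumed numbers $P(\neg flies\mid bird)=1/20$ , while $P(bird\rightarrow\neg flies)=1981/2000$ and $P(bird\rightarrow flies)=1999/2000$ : both implications are nearly certain because the non-birds dominate.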

Artificial Intelligence 2E