8.1 Probability


8.1.1 Semantics of Probability

Probability theory is built on the foundation of worlds and variables. The variables in probability theory are referred to as random variables. The term random variable is somewhat of a misnomer because it is neither random nor variable. As discussed in Section 4.1, worlds could be described in terms of variables; a world is a function that maps each variable to its value. Alternatively, variables could be described in terms of worlds; a variable is a function from worlds into the domain of the variable.

Variables will be written starting with an uppercase letter. Each variable has a domain which is the set of values that the variable can take. A Boolean variable is a variable with domain {true,false}. We will write the assignment of true to a variable as the lower-case variant of the variable, e.g., Happy=true is written as happy and Fire=true is fire. A discrete variable has a domain that is a finite or countable set.

A primitive proposition is an assignment of a value to a variable, or an inequality between variables and values or between variables (e.g., A=true, X<7 or Y>Z). Propositions are built from primitive propositions using logical connectives.

This chapter mainly considers discrete variables with finite domains. The examples will have few variables, but modern applications may have thousands, millions or even billions of variables (or even infinitely many variables). For example, a world could consist of the symptoms, diseases and test results for all of the patients and care providers in a hospital throughout time. The model effectively goes on forever into the future, but we will only ever reason about a finite past and future. We might be able to answer questions about the probability that a patient with a particular combination of symptoms may come into the hospital in the next few years. There are infinitely many worlds whenever some variables have infinite domains or there are infinitely many variables.

We first define a probability over finite sets of worlds with finitely many variables and use this to define the probability of propositions.

A probability measure is a function P from worlds into the non-negative real numbers such that,

$$\sum_{w \in \Omega} P(w) = 1$$

where Ω is the set of all possible worlds.

The use of 1 as the probability of the set of all of the worlds is just by convention. You could just as well use 100.

The definition of P is extended to cover propositions. The probability of proposition α, written P(α), is the sum of the probabilities of possible worlds in which α is true. That is,

$$P(\alpha) = \sum_{\omega \,:\, \alpha \text{ is true in } \omega} P(\omega).$$

Note that this definition is consistent with the probability of worlds, because if proposition α completely describes a single world, the probability of α and the probability of the world are equal.
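The two definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the variables Rain and Wet, the four worlds, and their probabilities are made up for the example, and a proposition is modeled as a Python function from a world to a Boolean.

```python
from fractions import Fraction

# Each world maps every variable to a value (hypothetical variables Rain, Wet).
# The measure pairs each world with a non-negative number; the numbers sum to 1.
measure = [
    ({"Rain": True,  "Wet": True},  Fraction(3, 10)),
    ({"Rain": True,  "Wet": False}, Fraction(1, 10)),
    ({"Rain": False, "Wet": True},  Fraction(2, 10)),
    ({"Rain": False, "Wet": False}, Fraction(4, 10)),
]

def prob(proposition, measure):
    """Sum the probabilities of the worlds in which the proposition is true."""
    return sum((p for world, p in measure if proposition(world)), Fraction(0))

# The proposition that is true everywhere covers all of the worlds.
total = prob(lambda w: True, measure)
# A proposition that is true in two worlds.
p_wet = prob(lambda w: w["Wet"], measure)
# A proposition that completely describes a single world.
p_single = prob(lambda w: w["Rain"] and not w["Wet"], measure)
```

Here `p_single` equals the probability assigned to the one world it describes, which is the consistency property noted above. Exact fractions are used so that the sums are not affected by floating-point rounding.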

Figure 8.1: Ten worlds described by variables Filled and Shape
Example 8.2.

Consider the ten worlds of Figure 8.1, with Boolean variable Filled, and with variable Shape with domain {circle,triangle,star}. Each world is defined by its shape, whether it’s filled and its position. Suppose the probability of each of these 10 worlds is 0.1, and any other worlds have probability 0. Then P(Shape=circle)=0.5 and P(Filled=false)=0.4. P(Shape=circle ∧ Filled=false)=0.1.
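The arithmetic in this example can be checked with a short Python sketch. The figure is not reproduced here, so the layout below is an assumption: only the counts implied by the stated probabilities (five circles, four unfilled shapes, one unfilled circle) come from the example, and the split of the remaining shapes between triangle and star is a guess.

```python
from fractions import Fraction

# (Shape, Filled) for each of the ten positions. ASSUMED layout: only the counts
# of circles, unfilled shapes, and unfilled circles are fixed by the example.
layout = [
    ("circle", True), ("circle", True), ("circle", True), ("circle", True),
    ("circle", False),
    ("triangle", True), ("triangle", False),
    ("star", True), ("star", False), ("star", False),
]
worlds = [{"Position": i, "Shape": s, "Filled": f} for i, (s, f) in enumerate(layout)]

# Each of the ten worlds has probability 0.1, kept exact as 1/10.
P = {w["Position"]: Fraction(1, 10) for w in worlds}

def prob(proposition):
    """Sum P over the worlds in which the proposition is true."""
    return sum((P[w["Position"]] for w in worlds if proposition(w)), Fraction(0))

p_circle = prob(lambda w: w["Shape"] == "circle")
p_unfilled = prob(lambda w: not w["Filled"])
p_both = prob(lambda w: w["Shape"] == "circle" and not w["Filled"])
```

Under this assumed layout the three values are 1/2, 2/5 and 1/10, matching the probabilities stated in the example.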

If X is a random variable, a probability distribution, P(X), over X is a function from the domain of X into the real numbers such that, given a value x ∈ domain(X), P(x) is the probability of the proposition X=x. A probability distribution over a set of variables is a function from the values of those variables into a probability. For example, P(X,Y) is a probability distribution over X and Y such that P(X=x,Y=y), where x ∈ domain(X) and y ∈ domain(Y), has the value P(X=x ∧ Y=y), where X=x ∧ Y=y is a proposition and P is the function on propositions defined above. Whether P refers to a function on propositions or a probability distribution should be clear from context.

If X1, …, Xn are all of the random variables, then an assignment to all of the random variables corresponds to a world, and the probability of the proposition defining a world is equal to the probability of the world. The distribution over all worlds, P(X1, …, Xn), is called the joint probability distribution.
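As a rough sketch of how a joint distribution relates to a distribution over one variable, the Python below stores a joint distribution over two variables as a table and recovers P(X) by summing the worlds in which X=x holds. The variables, their domains, and the numbers are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical joint distribution P(X, Y): with only two variables, each
# assignment (x, y) is a world.
joint = {
    ("a", 0): Fraction(1, 8),
    ("a", 1): Fraction(3, 8),
    ("b", 0): Fraction(1, 4),
    ("b", 1): Fraction(1, 4),
}

def distribution_over_x(joint):
    """P(X): for each x, sum the probabilities of the worlds in which X=x is true."""
    dist = defaultdict(Fraction)  # Fraction() is 0, so missing keys start at 0
    for (x, _y), p in joint.items():
        dist[x] += p
    return dict(dist)

p_x = distribution_over_x(joint)
```

Because every world assigns exactly one value to X, the values of `p_x` sum to 1, just as the joint table does.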

Beyond Finitely Many Worlds

The definition of probability given in this chapter works when there are finitely many worlds.

There are infinitely many worlds when

  • the domain of a variable is infinite; for example, the domain of a variable height might be the set of nonnegative real numbers, or

  • there are infinitely many variables; for example, there might be a variable for the location of a robot for every millisecond from now infinitely into the future.

When there are infinitely many worlds, probability is defined on a measure over sets of worlds. A probability measure is a nonnegative function μ from sets of worlds into the real numbers that satisfies the axioms: μ(S1 ∪ S2)=μ(S1)+μ(S2) if S1 ∩ S2={}, and μ(Ω)=1 where Ω is the set of all worlds. μ does not have to be defined over all sets of worlds, just those defined by logical formulas. The probability of proposition α is defined by P(α)=μ({w:α is true in w}).
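To see that the finite definition given earlier is a special case of this one, take Ω finite and define μ on a set of worlds as the sum of the probabilities of its members. The short derivation below (written with the amsmath `align*` environment, which is an assumption about the typesetting setup) checks both axioms and recovers the earlier formula for P(α).

```latex
\begin{align*}
\mu(S) &= \sum_{w \in S} P(w) \\
\mu(S_1 \cup S_2) &= \sum_{w \in S_1 \cup S_2} P(w)
  = \sum_{w \in S_1} P(w) + \sum_{w \in S_2} P(w)
  = \mu(S_1) + \mu(S_2) \quad \text{if } S_1 \cap S_2 = \{\} \\
\mu(\Omega) &= \sum_{w \in \Omega} P(w) = 1 \\
P(\alpha) &= \mu(\{w : \alpha \text{ is true in } w\})
  = \sum_{w \,:\, \alpha \text{ is true in } w} P(w)
\end{align*}
```

The second line relies on the two sets being disjoint: no world is counted twice when the sum over the union is split.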

Variables with continuous domains typically do not have a probability distribution, because the probability of a set of worlds can be non-zero even though the measure of each individual world is 0. For variables with real-valued domains, a probability density function, written as p, is a function from reals into non-negative reals that integrates to 1. The probability that a real-valued random variable X has value between a and b is given by

$$P(a \le X \le b) = \int_a^b p(X)\,dX.$$

This allows the probability of any formula about intervals and less than to be well defined. It is possible that, for every real number a, P(X=a)=P(a≤X≤a)=0.
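A numerical sketch can make this concrete. The Python below approximates the integral with the midpoint rule; the exponential density is an assumed example chosen because its integral is known in closed form, and the function names and step count are illustrative.

```python
import math

def density(x, rate=1.0):
    """Exponential density on the non-negative reals (an assumed example)."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0

def prob_between(p, a, b, steps=100_000):
    """Approximate the integral of p from a to b with the midpoint rule."""
    width = (b - a) / steps
    return sum(p(a + (i + 0.5) * width) for i in range(steps)) * width

p_interval = prob_between(density, 0.0, 1.0)   # close to 1 - exp(-1)
p_point = prob_between(density, 2.0, 2.0)      # an interval of zero width
p_most = prob_between(density, 0.0, 50.0)      # nearly all of the mass
```

The density is positive at 2.0, yet `p_point` is zero because the interval has zero width. This is the sense in which every individual value can have probability 0 while intervals have non-zero probability.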

A parametric distribution is one where the probability or density function is described by a formula. Although not all distributions can be described by formulas, all of the distributions that can be represented are described by formulas. Sometimes statisticians use the term parametric to mean a distribution described using a fixed, finite number of parameters. A non-parametric distribution is one where the number of parameters is not fixed. (Oddly, non-parametric typically means “many parameters”.)

Another common method is to consider only discretizations of finitely many worlds. For example, only consider height to the nearest centimeter or micron, and only consider heights up to some finite number (e.g., a kilometer). Or only consider the location of the robot for a millennium. While there might be a lot of worlds, there are only finitely many. A challenge is to define representations that work for any (fine enough) discretization.
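The discretization idea can be sketched as follows. The normal model of height (mean 170 cm, standard deviation 10 cm) and the 300 cm bound are assumptions made only for illustration; each bin of the discretization plays the role of one world.

```python
import math

def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def discretize(cdf, low, high, width):
    """Return {bin_start: probability} for bins [start, start + width) covering [low, high)."""
    n_bins = round((high - low) / width)
    bins = {}
    for i in range(n_bins):
        start = low + i * width
        bins[start] = cdf(start + width) - cdf(start)
    return bins

# ASSUMED model of height in centimeters, for illustration only.
def height_cdf(h):
    return normal_cdf(h, mean=170.0, sd=10.0)

coarse = discretize(height_cdf, 0.0, 300.0, 10.0)  # 30 worlds
fine = discretize(height_cdf, 0.0, 300.0, 1.0)     # 300 worlds

# The same proposition, 160 <= Height < 180, evaluated at both resolutions.
p_coarse = sum(p for start, p in coarse.items() if 160.0 <= start < 180.0)
p_fine = sum(p for start, p in fine.items() if 160.0 <= start < 180.0)
```

Both discretizations have finitely many worlds, and a proposition whose boundaries line up with the bins gets the same probability at either resolution (up to floating-point rounding). A proposition with boundaries that fall inside a coarse bin would need the finer discretization, which is why the representation has to work for any fine enough choice.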