foundations of computational agents
The axioms of probability are very weak and provide few constraints on allowable conditional probabilities. For example, if there are $n$ binary variables, there are ${2}^{n}-1$ numbers to be assigned to give a complete probability distribution from which arbitrary conditional probabilities can be derived. To determine any probability, you may have to start with an enormous database of probabilities.
A useful way to limit the amount of information required is to assume that each variable only directly depends on a few other variables. This uses assumptions of conditional independence. Not only does it reduce how many numbers are requires to specify a model, but also the independence structure may be exploited for efficient reasoning.
Reducing the Numbers
Two main approaches are used to overcome the need for so many numbers to specify a probability distribution:
Assume that the knowledge of the truth of one proposition does not affect the agent’s belief in another proposition in the context of other propositions.
Assume that probabilities are as uniform as possible given the available information.
The distinction between allowing representations of independence and using maximum entropy or random worlds highlights an important difference between views of a knowledge representation:
The first view is that a knowledge representation provides a high-level modeling language that lets us model a domain in a reasonably natural way. According to this view, it is expected that knowledge representation designers prescribe how to use the representation language by providing a user manual on how to describe domains of interest.
The second view is that a knowledge representation should allow someone to add whatever knowledge they may have about a domain. The knowledge representation should fill in the rest in a commonsense manner. According to this view, it is unreasonable for a knowledge representation designer to specify how particular knowledge should be encoded.
Judging a knowledge representation by the wrong criteria does not result in a fair assessment.
A belief network is a representation of a particular independence among variables. Belief networks should be viewed as a modeling language. Many domains are concisely and naturally represented by exploiting the independencies that belief networks compactly represent.
Once the network structure and the domains of the variables for a belief network are defined, which numbers are required (the conditional probabilities) are prescribed. The user cannot simply add arbitrary conditional probabilities but must follow the network’s structure. If the numbers required of a belief network are provided and are locally consistent, the whole network will be consistent.
In contrast, the maximum entropy or random worlds approaches infer the most random worlds that are consistent with a probabilistic knowledge base. They form a probabilistic knowledge representation of the second type. For the random worlds approach, any numbers that happen to be available are added and used. However, if you allow someone to add arbitrary probabilities, it is easy for the knowledge to be inconsistent with the axioms of probability. Moreover, it is difficult to justify an answer as correct if the assumptions are not made explicit.
As long as the value of $P(h\mid e)$ is not $0$ or $1$, the value of $P(h\mid e)$ does not constrain the value of $P(h\mid f\wedge e)$. This latter probability could have any value in the range $[0,1]$. It is $1$ when $f$ implies $h$, and it is $0$ if $f$ implies $\mathrm{\neg}h$. A common kind of qualitative knowledge is of the form $P(h\mid e)=P(h\mid f\wedge e)$, which specifies $f$ is irrelevant to the probability of $h$ given that $e$ is observed. This idea applies to random variables, as in the following definition.
Random variable $X$ is conditionally independent of random variable $Y$ given a set of random variables $Zs$ if
$$P(X\mid Y,Zs)=P(X\mid Zs)$$ |
whenever the probabilities are well defined. This means that for all $x\in domain(X)$, for all $y\in domain(Y)$, and for all $z\in domain(Zs)$, if $P(Y=y\wedge Zs=z)>0$,
$$P(X=x\mid Y=y\wedge Zs=z)=P(X=x\mid Zs=z).$$ |
That is, given a value of each variable in $Zs$, knowing $Y$’s value does not affect the belief in the value of $X$.
Consider a probabilistic model of students and exams. It is reasonable to assume that the random variable ${I}{\mathit{}}{n}{\mathit{}}{t}{\mathit{}}{e}{\mathit{}}{l}{\mathit{}}{l}{\mathit{}}{i}{\mathit{}}{g}{\mathit{}}{e}{\mathit{}}{n}{\mathit{}}{c}{\mathit{}}{e}$ is independent of ${W}{\mathit{}}{o}{\mathit{}}{r}{\mathit{}}{k}{\mathit{}}{s}{\mathit{}}{\mathrm{\_}}{\mathit{}}{h}{\mathit{}}{a}{\mathit{}}{r}{\mathit{}}{d}$, given no observations. If you find that a student works hard, it does not tell you anything about their intelligence.
The answers to the exam (the variable ${A}{\mathit{}}{n}{\mathit{}}{s}{\mathit{}}{w}{\mathit{}}{e}{\mathit{}}{r}{\mathit{}}{s}$) would depend on whether the student is intelligent and works hard. Thus, given ${A}{\mathit{}}{n}{\mathit{}}{s}{\mathit{}}{w}{\mathit{}}{e}{\mathit{}}{r}{\mathit{}}{s}$, ${I}{\mathit{}}{n}{\mathit{}}{t}{\mathit{}}{e}{\mathit{}}{l}{\mathit{}}{l}{\mathit{}}{i}{\mathit{}}{g}{\mathit{}}{e}{\mathit{}}{n}{\mathit{}}{t}$ would be dependent on ${W}{\mathit{}}{o}{\mathit{}}{r}{\mathit{}}{k}{\mathit{}}{s}{\mathit{}}{\mathrm{\_}}{\mathit{}}{h}{\mathit{}}{a}{\mathit{}}{r}{\mathit{}}{d}$; if you found someone had insightful answers, and did not work hard, your belief that they are intelligent would go up.
The grade on the exam (variable ${G}{\mathit{}}{r}{\mathit{}}{a}{\mathit{}}{d}{\mathit{}}{e}$) should depend on the student’s answers, not on the intelligence or whether the student worked hard. Thus ${G}{\mathit{}}{r}{\mathit{}}{a}{\mathit{}}{d}{\mathit{}}{e}$ would be independent of ${I}{\mathit{}}{n}{\mathit{}}{t}{\mathit{}}{e}{\mathit{}}{l}{\mathit{}}{l}{\mathit{}}{i}{\mathit{}}{g}{\mathit{}}{e}{\mathit{}}{n}{\mathit{}}{c}{\mathit{}}{e}$ given ${A}{\mathit{}}{n}{\mathit{}}{s}{\mathit{}}{w}{\mathit{}}{e}{\mathit{}}{r}{\mathit{}}{s}$. However, if the answers were not observed, ${I}{\mathit{}}{n}{\mathit{}}{t}{\mathit{}}{e}{\mathit{}}{l}{\mathit{}}{l}{\mathit{}}{i}{\mathit{}}{g}{\mathit{}}{e}{\mathit{}}{n}{\mathit{}}{c}{\mathit{}}{e}$ will affect ${G}{\mathit{}}{r}{\mathit{}}{a}{\mathit{}}{d}{\mathit{}}{e}$ (because highly intelligent students would be expected to have different answers than not so intelligent students); thus ${G}{\mathit{}}{r}{\mathit{}}{a}{\mathit{}}{d}{\mathit{}}{e}$ is dependent on ${I}{\mathit{}}{n}{\mathit{}}{t}{\mathit{}}{e}{\mathit{}}{l}{\mathit{}}{l}{\mathit{}}{i}{\mathit{}}{g}{\mathit{}}{e}{\mathit{}}{n}{\mathit{}}{c}{\mathit{}}{e}$ given no observations.
The following four statements are equivalent, as long as the conditional probabilities are well defined
$X$ is conditionally independent of $Y$ given $Z$.
$Y$ is conditionally independent of $X$ given $Z$.
$P(X=x\mid Y=y\wedge Z=z)=P(X=x\mid Y={y}^{\prime}\wedge Z=z)$ for all values $x$, $y$, ${y}^{\prime}$ and $z$. That is, in the context that you are given a value for $Z$, changing the value of $Y$ does not affect the belief in $X$.
$P(X,Y\mid Z)=P(X\mid Z)P(Y\mid Z)$.
The proof is left as an exercise. See Exercise 3.
Variables $X$ and $Y$ are unconditionally independent if $P(X,Y)=P(X)P(Y)$, that is, if they are conditionally independent given no observations. Note that $X$ and $Y$ being unconditionally independent does not imply they are conditionally independent given some other information $Z$.
Conditional independence is a useful assumption that is often natural to assess and can be exploited in inference. It is very rare that we would have a table of probabilities of worlds and assess independence numerically.