foundations of computational agents
A conditional probability distribution is a function on variables; given an assignment to the values of the variables, it gives a number. A factor is a function of a set of variables; the variables it depends on are the scope of the factor. Thus a conditional probability is a factor, as it is a function on variables. This section explores some variants for representing factors and conditional probabilities. Some of the representations are for arbitrary factors and some are specific to conditional probabilities.
Factors do not have to be implemented as conditional probability tables. The resulting tabular representation is often too large when there are many parents. Often, structure in conditional probabilities can be exploited.
One such structure exploits context-specific independence, where one variable is conditionally independent of another, given a particular value of the third variable.
Suppose a robot can go outside or get coffee (so the ${A}{\mathit{}}{c}{\mathit{}}{t}{\mathit{}}{i}{\mathit{}}{o}{\mathit{}}{n}$ has domain ${\mathrm{\{}}{g}{\mathit{}}{o}{\mathit{}}{\mathrm{\_}}{\mathit{}}{o}{\mathit{}}{u}{\mathit{}}{t}{\mathrm{,}}{g}{\mathit{}}{e}{\mathit{}}{t}{\mathit{}}{\mathrm{\_}}{\mathit{}}{c}{\mathit{}}{o}{\mathit{}}{f}{\mathit{}}{f}{\mathit{}}{e}{\mathit{}}{e}{\mathrm{\}}}$. Whether it gets wet (variable ${W}{\mathit{}}{e}{\mathit{}}{t}$) depends on whether there is rain (variable ${R}{\mathit{}}{a}{\mathit{}}{i}{\mathit{}}{n}$) in the context that it went out or on whether the cup was full (variable ${F}{\mathit{}}{u}{\mathit{}}{l}{\mathit{}}{l}$) if it got coffee. Thus ${W}{\mathit{}}{e}{\mathit{}}{t}$ is independent of ${R}{\mathit{}}{a}{\mathit{}}{i}{\mathit{}}{n}$ given ${A}{\mathit{}}{c}{\mathit{}}{t}{\mathit{}}{i}{\mathit{}}{o}{\mathit{}}{n}{\mathrm{=}}{g}{\mathit{}}{e}{\mathit{}}{t}{\mathit{}}{\mathrm{\_}}{\mathit{}}{c}{\mathit{}}{o}{\mathit{}}{f}{\mathit{}}{f}{\mathit{}}{e}{\mathit{}}{e}$, but is dependent on ${R}{\mathit{}}{a}{\mathit{}}{i}{\mathit{}}{n}$ given ${A}{\mathit{}}{c}{\mathit{}}{t}{\mathit{}}{i}{\mathit{}}{o}{\mathit{}}{n}{\mathrm{=}}{g}{\mathit{}}{o}{\mathit{}}{\mathrm{\_}}{\mathit{}}{o}{\mathit{}}{u}{\mathit{}}{t}$. Also, ${W}{\mathit{}}{e}{\mathit{}}{t}$ is independent of ${F}{\mathit{}}{u}{\mathit{}}{l}{\mathit{}}{l}$ given ${A}{\mathit{}}{c}{\mathit{}}{t}{\mathit{}}{i}{\mathit{}}{o}{\mathit{}}{n}{\mathrm{=}}{g}{\mathit{}}{o}{\mathit{}}{\mathrm{\_}}{\mathit{}}{o}{\mathit{}}{u}{\mathit{}}{t}$, but is dependent on ${F}{\mathit{}}{u}{\mathit{}}{l}{\mathit{}}{l}$ given ${A}{\mathit{}}{c}{\mathit{}}{t}{\mathit{}}{i}{\mathit{}}{o}{\mathit{}}{n}{\mathrm{=}}{g}{\mathit{}}{e}{\mathit{}}{t}{\mathit{}}{\mathrm{\_}}{\mathit{}}{c}{\mathit{}}{o}{\mathit{}}{f}{\mathit{}}{f}{\mathit{}}{e}{\mathit{}}{e}$.
Context-specific independence may be exploited in a representation by not requiring numbers that are not needed. A simple representation for conditional probabilities that models context-specific independence is a decision tree, where the parents in a belief network correspond to the input features and the child corresponds to the target feature. Another representation is in terms of definite clauses with probabilities. Context-specific independence could also be represented as tables that have contexts that specify when they should be used, as in the following example.
The conditional probability ${P}{\mathit{}}{\mathrm{(}}{W}{\mathit{}}{e}{\mathit{}}{t}{\mathrm{\mid}}{A}{\mathit{}}{c}{\mathit{}}{t}{\mathit{}}{i}{\mathit{}}{o}{\mathit{}}{n}{\mathrm{,}}{R}{\mathit{}}{a}{\mathit{}}{i}{\mathit{}}{n}{\mathrm{,}}{F}{\mathit{}}{u}{\mathit{}}{l}{\mathit{}}{l}{\mathrm{)}}$ could be represented as a decision tree, as definite clauses with probabilities, or as tables with contexts:
$$\begin{array}{c}{\text{}}\hfill & {w}{}{e}{}{t}{\leftarrow}{}{g}{}{o}{}{\mathrm{\_}}{}{o}{}{u}{}{t}{\wedge}{}{r}{}{a}{}{i}{}{n}{:}{0.8}\hfill \\ {w}{}{e}{}{t}{\leftarrow}{}{g}{}{o}{}{\mathrm{\_}}{}{o}{}{u}{}{t}{\wedge}{}{\mathrm{\neg}}{}{r}{}{a}{}{i}{}{n}{:}{0.1}\hfill \\ {w}{}{e}{}{t}{\leftarrow}{}{g}{}{e}{}{t}{}{\mathrm{\_}}{}{c}{}{o}{}{f}{}{f}{}{e}{}{e}{\wedge}{}{f}{}{u}{}{l}{}{l}{:}{0.6}\hfill \\ {w}{}{e}{}{t}{\leftarrow}{}{g}{}{e}{}{t}{}{\mathrm{\_}}{}{c}{}{o}{}{f}{}{f}{}{e}{}{e}{\wedge}{}{\mathrm{\neg}}{}{f}{}{u}{}{l}{}{l}{:}{0.3}\hfill \end{array}{}\begin{array}{cc}{g}{}{o}{}{\mathrm{\_}}{}{o}{}{u}{}{t}{:}\hfill & \begin{array}{ccc}{R}{}{a}{}{i}{}{n}\hfill & {W}{}{e}{}{t}\hfill & {P}{}{r}{}{o}{}{b}\hfill \\ {t}\hfill & {t}\hfill & {0.8}\hfill \\ {t}\hfill & {f}\hfill & {0.2}\hfill \\ {f}\hfill & {t}\hfill & {0.1}\hfill \\ {f}\hfill & {f}\hfill & {0.9}\hfill \end{array}\hfill \\ & \\ {g}{}{e}{}{t}{}{\mathrm{\_}}{}{c}{}{o}{}{f}{}{f}{}{e}{}{e}{:}\hfill & \begin{array}{ccc}{F}{}{u}{}{l}{}{l}\hfill & {W}{}{e}{}{t}\hfill & {P}{}{r}{}{o}{}{b}\hfill \\ {t}\hfill & {t}\hfill & {0.6}\hfill \\ {t}\hfill & {f}\hfill & {0.4}\hfill \\ {f}\hfill & {t}\hfill & {0.3}\hfill \\ {f}\hfill & {f}\hfill & {0.7}\hfill \end{array}\hfill \end{array}$$ |
Another common representation is a noisy-or, where the child is true if one of the parents is activated and each parent has a probability of activation. So the child is an “or” of the activations of the parents. The noisy-or is defined as follows. If $X$ has Boolean parents ${V}_{1},\mathrm{\dots},{V}_{k}$, the probability is defined by $k+1$ parameters ${p}_{0},\mathrm{\dots},{p}_{k}$. We invent $k$ new Boolean variables ${A}_{0},{A}_{1},\mathrm{\dots},{A}_{k}$, where for each $i>0$, ${A}_{i}$ has ${V}_{i}$ as its only parent. Define $P({A}_{i}=true\mid {V}_{i}=true)={p}_{i}$ and $P({A}_{i}=true\mid {V}_{i}=false)=0$. The bias term, ${A}_{0}$ has $P({A}_{0})={p}_{0}$. The variables ${A}_{0},\mathrm{\dots},{A}_{k}$ are the parents of $X$, and the conditional probability is that $P(X\mid {A}_{0},{A}_{1},\mathrm{\dots},{A}_{k})$ is 1 if any of the ${A}_{i}$ are true and is 0 if all of the ${A}_{i}$ are false. Thus ${p}_{0}$ is the probability of $X$ when all of ${V}_{i}$ are false; the probability of $X$ increases if more of the ${V}_{i}$ become true.
Suppose the robot could get wet from rain or coffee. There is a probability that it gets wet from rain if it rains, and a probability that it gets wet from coffee if it has coffee, and a probability that it gets wet for other reasons. The robot gets wet if it gets wet from one of them, giving the “or”. We could have, ${P}{\mathit{}}{\mathrm{(}}{w}{\mathit{}}{e}{\mathit{}}{t}{\mathit{}}{\mathrm{\_}}{\mathit{}}{f}{\mathit{}}{r}{\mathit{}}{o}{\mathit{}}{m}{\mathit{}}{\mathrm{\_}}{\mathit{}}{r}{\mathit{}}{a}{\mathit{}}{i}{\mathit{}}{n}{\mathrm{\mid}}{r}{\mathit{}}{a}{\mathit{}}{i}{\mathit{}}{n}{\mathrm{)}}{\mathrm{=}}{\mathrm{0.3}}$, ${P}{\mathit{}}{\mathrm{(}}{w}{\mathit{}}{e}{\mathit{}}{t}{\mathit{}}{\mathrm{\_}}{\mathit{}}{f}{\mathit{}}{r}{\mathit{}}{o}{\mathit{}}{m}{\mathit{}}{\mathrm{\_}}{\mathit{}}{c}{\mathit{}}{o}{\mathit{}}{f}{\mathit{}}{f}{\mathit{}}{e}{\mathit{}}{e}{\mathrm{\mid}}{c}{\mathit{}}{o}{\mathit{}}{f}{\mathit{}}{f}{\mathit{}}{e}{\mathit{}}{e}{\mathrm{)}}{\mathrm{=}}{\mathrm{0.2}}$ and, for the bias term, ${P}{\mathit{}}{\mathrm{(}}{w}{\mathit{}}{e}{\mathit{}}{t}{\mathit{}}{\mathrm{\_}}{\mathit{}}{f}{\mathit{}}{o}{\mathit{}}{r}{\mathit{}}{\mathrm{\_}}{\mathit{}}{o}{\mathit{}}{t}{\mathit{}}{h}{\mathit{}}{e}{\mathit{}}{r}{\mathit{}}{\mathrm{\_}}{\mathit{}}{r}{\mathit{}}{e}{\mathit{}}{a}{\mathit{}}{s}{\mathit{}}{o}{\mathit{}}{n}{\mathit{}}{s}{\mathrm{)}}{\mathrm{=}}{\mathrm{0.1}}$. The robot is wet if it wet from rain, wet from coffee, or wet for other reasons.
A log-linear model is a model where probabilities are specified as a product of terms. When the terms are non-zero (they are all strictly positive), the log of a product is the sum of logs. The sum of terms is often a convenient term to work with. To see how such a form is used to represent conditional probabilities, we can write the conditional probability in the following way:
$P(h\mid e)$ | $={\displaystyle \frac{P(h\wedge e)}{P(h\wedge e)+P(\mathrm{\neg}h\wedge e)}}$ | ||
$={\displaystyle \frac{1}{1+P(\mathrm{\neg}h\wedge e)/P(h\wedge e)}}$ | |||
$={\displaystyle \frac{1}{1+{e}^{-(\mathrm{log}P(h\wedge e)/P(\mathrm{\neg}h\wedge e))}}}$ | |||
$=sigmoid(\mathrm{log}odds(h\mid e))$ |
The sigmoid function, $sigmoid(x)=1/(1+{e}^{-x})$, plotted in Figure 7.9, has been used previously in this book for logistic regression and neural networks.
The conditional odds (as often used by bookmakers in gambling) is
$odds(h\mid e)=$ | $\frac{P(h\wedge e)}{P(\mathrm{\neg}h\wedge e)}$ | ||
$=$ | $\frac{P(e\mid h)}{P(e\mid \mathrm{\neg}h)}}*{\displaystyle \frac{P(h)}{P(\mathrm{\neg}h)}$ |
where $\frac{P(h)}{P(\mathrm{\neg}h)}=\frac{P(h)}{1-P(h)}$ is the prior odds and $\frac{P(e\mid h)}{P(e\mid \mathrm{\neg}h)}$ is the likelihood ratio. For a fixed $h$, it is often useful to represent $P(e\mid h)/P(e\mid \mathrm{\neg}h)$ as a product of terms, and so the log is a sum of terms.
The logistic regression model of a conditional probability $P(X\mid {Y}_{1},\mathrm{\dots},{Y}_{k})$ is of the form
$$P(x\mid {Y}_{1},\mathrm{\dots},{Y}_{k})=sigmoid\left(\sum _{i}{w}_{i}*{Y}_{i}\right)$$ |
where ${Y}_{i}$ is assumed to have domain $\{0,1\}$. (Assume a dummy input ${Y}_{0}$ which is always 1.) This corresponds to a decomposition of the conditional probability, where the probabilities are a product of terms for each ${Y}_{i}$.
Note that $P(X\mid {Y}_{1}=0,\mathrm{\dots},{Y}_{k}=0)=sigmoid({w}_{0})$. Thus ${w}_{0}$ determines the probability when all of the parents are zero. Each ${w}_{i}$ specifies a value that should be added as ${Y}_{i}$ changes. If ${Y}_{i}$ is Boolean with values $\{0,1\}$, then $P(X\mid {Y}_{1}=0,\mathrm{\dots},{Y}_{i}=1,\mathrm{\dots},{Y}_{k}=0)=sigmoid({w}_{0}+{w}_{i})$. The logistic regression model makes the independence assumption that the influence of each parent on the child does not depend on the other parents. Learning logistic regression models was the topic of Section 7.3.2.
To represent the probability of ${w}{\mathit{}}{e}{\mathit{}}{t}$ given whether there is rain, coffee, kids, or whether the robot has a coat may be given by:
${P}{(}{w}{e}{t}$ | ${\mid}{R}{a}{i}{n}{,}{C}{o}{f}{f}{e}{e}{,}{K}{i}{d}{s}{,}{C}{o}{a}{t}{)}$ | ||
${=}$ | ${s}{}{i}{}{g}{}{m}{}{o}{}{i}{}{d}{}{(}{-}{1.0}{+}{2.0}{*}{R}{}{a}{}{i}{}{n}{+}{1.0}{*}{C}{}{o}{}{f}{}{f}{}{e}{}{e}{+}{0.5}{*}{K}{}{i}{}{d}{}{s}{-}{1.5}{*}{C}{}{o}{}{a}{}{t}{)}$ |
This implies the following conditional probabilities
${P}{}{(}{w}{}{e}{}{t}{\mid}{\mathrm{\neg}}{}{r}{}{a}{}{i}{}{n}{\wedge}{\mathrm{\neg}}{}{c}{}{o}{}{f}{}{f}{}{e}{}{e}{\wedge}{\mathrm{\neg}}{}{k}{}{i}{}{d}{}{s}{\wedge}{\mathrm{\neg}}{}{c}{}{o}{}{a}{}{t}{)}{=}{s}{}{i}{}{g}{}{m}{}{o}{}{i}{}{d}{}{(}{-}{1.0}{)}{=}{0.27}{.}$ | ||
${P}{}{(}{w}{}{e}{}{t}{\mid}{r}{}{a}{}{i}{}{n}{\wedge}{\mathrm{\neg}}{}{c}{}{o}{}{f}{}{f}{}{e}{}{e}{\wedge}{\mathrm{\neg}}{}{k}{}{i}{}{d}{}{s}{\wedge}{\mathrm{\neg}}{}{c}{}{o}{}{a}{}{t}{)}{=}{s}{}{i}{}{g}{}{m}{}{o}{}{i}{}{d}{}{(}{1.0}{)}{=}{0.73}{.}$ | ||
${P}{}{(}{w}{}{e}{}{t}{\mid}{r}{}{a}{}{i}{}{n}{\wedge}{\mathrm{\neg}}{}{c}{}{o}{}{f}{}{f}{}{e}{}{e}{\wedge}{\mathrm{\neg}}{}{k}{}{i}{}{d}{}{s}{\wedge}{c}{}{o}{}{a}{}{t}{)}{=}{s}{}{i}{}{g}{}{m}{}{o}{}{i}{}{d}{}{(}{-}{0.5}{)}{=}{0.38}{.}$ |
This requires fewer parameters than the ${{\mathrm{2}}}^{{\mathrm{4}}}{\mathrm{=}}{\mathrm{16}}$ parameters required for a tabular representation, but makes more independence assumptions.
Noisy-or and logistic regression models are similar, but different. Noisy-or is typically used when the causal assumption that a variable is true if it is caused to be true by one of the parents, is appropriate. Logistic regression is used when the various parents add-up to influence the child.