foundations of computational agents
In Example 9.20, the action $\mathit{Check\_smoke}$ provides information about fire. Checking for smoke costs $20$ units and does not provide any direct reward; however, in an optimal policy, it is worthwhile to check for smoke when there is a report because the agent can condition its further actions on the information obtained. Thus, the information about smoke is valuable to the agent, even though smoke only provides imperfect information about whether there is fire.
One of the important lessons from this example is that an information-seeking action, such as $\mathit{Check\_smoke}$, can be treated in the same way as any other action, such as $\mathit{Call}$. An optimal policy often includes actions whose only purpose is to find information, as long as subsequent actions can condition on some effect of the action. Most actions do not just provide information; they also have a more direct effect on the world.
Information is valuable to agents because it helps them make better decisions.
If $X$ is a random variable and $D$ is a decision variable, the value of information about $X$ for decision $D$ is how much extra utility can be obtained by knowing the value for $X$ when decision $D$ is made. This depends on what is controlled and what else is observed for each decision, which is the information provided in a decision network.
The value of information about $X$ for decision $D$ in no-forgetting decision network $N$ is:
the value of decision network $N$ with an arc added from $X$ to $D$, and with arcs added from $X$ to the decisions after $D$ to ensure that the network remains a no-forgetting decision network
minus the value of the decision network $N$ where $D$ does not have information about $X$, and the no-forgetting arcs are not added.
This is only defined when $X$ is not a successor of $D$, because that would cause a cycle. (Something more sophisticated must be done when adding the arc from $X$ to $D$ causes a cycle.)
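The definition above can be written compactly. The notation here is introduced only for illustration and is not from the text: $\mathit{value}(N)$ stands for the expected utility of an optimal policy for $N$, and $N^{+X \to D}$ stands for $N$ with the arc from $X$ to $D$ added, together with the arcs from $X$ to the later decisions that keep the network no-forgetting.

$$\mathit{VOI}_N(X, D) = \mathit{value}\left(N^{+X \to D}\right) - \mathit{value}(N)$$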
In Example 9.13, consider how much it could be worth to get a better forecast. The value of getting perfect information about the weather for the decision about whether to take an umbrella is the difference between the value of the network with an arc from $\mathit{Weather}$ to $\mathit{Umbrella}$, which, as calculated in Example 9.21, is 91, and the value of the original network, which, as computed in Example 9.13, is 77. Thus, the value of information about $\mathit{Weather}$ for the $\mathit{Umbrella}$ decision is $91 - 77 = 14$.
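The subtraction in this example can be sketched in Python by evaluating both networks directly. The probabilities and utilities below are assumptions: they are not stated in this section and were chosen so that the two network values come out as the 77 and 91 quoted above. The names `network_value`, `P_WEATHER`, `P_FORECAST_GIVEN_WEATHER`, and `UTILITY` are likewise hypothetical.

```python
# Assumed parameters for the umbrella network (NOT given in this section);
# they are chosen to reproduce the values 77 and 91 quoted in the text.
P_WEATHER = {"norain": 0.7, "rain": 0.3}
P_FORECAST_GIVEN_WEATHER = {
    "norain": {"sunny": 0.7, "cloudy": 0.2, "rainy": 0.1},
    "rain": {"sunny": 0.15, "cloudy": 0.25, "rainy": 0.6},
}
UTILITY = {
    ("norain", "take"): 20, ("norain", "leave"): 100,
    ("rain", "take"): 70, ("rain", "leave"): 0,
}
ACTIONS = ("take", "leave")


def network_value(umbrella_sees_weather):
    """Expected utility of an optimal policy for the Umbrella decision.

    Umbrella always observes Forecast; it also observes Weather when
    umbrella_sees_weather is True (the network with the extra arc).
    """
    # Group the possible worlds by what the decision gets to observe.
    contexts = {}
    for weather, p_w in P_WEATHER.items():
        for forecast, p_f in P_FORECAST_GIVEN_WEATHER[weather].items():
            observed = (forecast, weather) if umbrella_sees_weather else (forecast,)
            contexts.setdefault(observed, []).append((p_w * p_f, weather))
    # In each observed context choose the best action, then add up the
    # (unnormalized) expected utilities over all contexts.
    return sum(
        max(sum(p * UTILITY[(w, a)] for p, w in worlds) for a in ACTIONS)
        for worlds in contexts.values()
    )


value_original = network_value(umbrella_sees_weather=False)
value_with_arc = network_value(umbrella_sees_weather=True)
value_of_information = value_with_arc - value_original
print(round(value_original, 1), round(value_with_arc, 1), round(value_of_information, 1))
```

Under these assumed parameters the two calls reproduce the network values 77 and 91, so their difference is the value of information, 14.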
The value of information has some interesting properties:
The value of information is never negative. The worst that can happen is that the agent can ignore the information.
If an optimal decision is to do the same thing no matter which value of $X$ is observed, the value of information about $X$ is zero. If the value of information about $X$ is zero, there is an optimal policy that does not depend on the value of $X$ (i.e., the same action can be chosen no matter which value of $X$ is observed).
The value of information is a bound on the amount the agent should be willing to pay (in terms of loss of utility) for information $X$ for decision $D$. It is an upper bound on the amount that imperfect information about the value of $X$ at decision $D$ would be worth. Imperfect information is the information available from a noisy sensor of $X$. It is not worth paying more for a sensor of $X$ than the value of information about $X$ for the earliest decision that could use the information of $X$.
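These three properties can be checked numerically on a small hypothetical model. The sketch below assumes a single decision (take or leave an umbrella), a prior probability of rain of 0.3, the utilities shown, and a binary sensor that reports the true weather with probability `accuracy`. None of these numbers come from this section, and the model deliberately has no forecast, so its values differ from those of the umbrella example.

```python
# Hypothetical single-decision model; all numbers are assumptions.
P_WEATHER = {"rain": 0.3, "norain": 0.7}
UTILITY = {
    ("norain", "take"): 20, ("norain", "leave"): 100,
    ("rain", "take"): 70, ("rain", "leave"): 0,
}
ACTIONS = ("take", "leave")


def value_with_sensor(accuracy):
    """Expected utility of the best policy when the decision observes only a
    binary sensor that reports the true weather with probability `accuracy`."""
    total = 0.0
    for reading in P_WEATHER:
        # Unnormalized expected utility of each action given this reading;
        # the agent picks the better action for each reading.
        total += max(
            sum(
                P_WEATHER[w] * (accuracy if reading == w else 1 - accuracy) * UTILITY[(w, a)]
                for w in P_WEATHER
            )
            for a in ACTIONS
        )
    return total


baseline = value_with_sensor(0.5)  # a 50/50 sensor carries no information


def gain(accuracy):
    """Extra expected utility the sensor provides over having no sensor."""
    return value_with_sensor(accuracy) - baseline


perfect_gain = gain(1.0)  # value of (perfect) information about the weather
for acc in (0.5, 0.6, 0.8, 0.95, 1.0):
    print(acc, round(gain(acc), 2), "<=", round(perfect_gain, 2))
```

In this sketch the gain is never negative, it never exceeds the gain from a perfect sensor, and at `accuracy = 0.6` it is zero because the best action is to leave the umbrella whatever the sensor reports.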
In the fire alarm problem of Example 9.20, the agent may be interested in knowing whether it is worthwhile to try to detect tampering. To determine how much a tampering sensor could be worth, consider the value of information about tampering.
The following are the values (the expected utility of the optimal policy, to one decimal place) for some variants of the network. Let $N_0$ be the original network.
The network $N_0$ has a value of $-22.6$.
Let $N_1$ be the same as $N_0$ but with an arc added from $\mathit{Tampering}$ to $\mathit{Call}$. $N_1$ has a value of $-21.3$.
Let $N_2$ be the same as $N_1$ except that it also has an arc from $\mathit{Tampering}$ to $\mathit{Check\_smoke}$. $N_2$ has a value of $-20.9$.
Let $N_3$ be the same as $N_2$ but without the arc from $\mathit{Report}$ to $\mathit{Check\_smoke}$. $N_3$ has the same value as $N_2$.
The difference in the values of the optimal policies for the first two decision networks, namely $1.3$, is the value of information about $\mathit{Tampering}$ for the decision $\mathit{Call}$ in network $N_0$. The value of information about $\mathit{Tampering}$ for the decision $\mathit{Check\_smoke}$ in network $N_0$ is 1.7. Therefore, installing a tampering sensor could at most give an increase of 1.7 in expected utility.
In the context $N_3$, the value of information about $\mathit{Report}$ for $\mathit{Check\_smoke}$ is 0, because adding the arc from $\mathit{Report}$ to $\mathit{Check\_smoke}$ gives $N_2$, which has the same value as $N_3$. In the optimal policy for the network with both arcs, the information about $\mathit{Alarm}$ is ignored in the optimal decision function for $\mathit{Check\_smoke}$; the agent never checks for smoke when deciding whether to call in the optimal policy when $\mathit{Alarm}$ is a parent of $\mathit{Call}$.
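The differences used in this example are simple subtractions of network values. A minimal sketch follows, using only the four values quoted above; the names `VALUES` and `gain` are hypothetical helpers, and results are rounded to one decimal place to match the precision of the quoted values.

```python
# Expected utility of the optimal policy for each variant, as quoted in the text.
VALUES = {"N0": -22.6, "N1": -21.3, "N2": -20.9, "N3": -20.9}


def gain(with_arcs, without_arcs):
    """Difference in value between two variants, rounded to one decimal place."""
    return round(VALUES[with_arcs] - VALUES[without_arcs], 1)


# N1 adds the arc Tampering -> Call to N0.
voi_tampering_for_call = gain("N1", "N0")
# N2 adds Tampering -> Check_smoke to N0, plus the no-forgetting arc to Call.
voi_tampering_for_check_smoke = gain("N2", "N0")
# N2 is N3 with the arc Report -> Check_smoke restored.
gain_from_report_arc_in_n3 = gain("N2", "N3")

print(voi_tampering_for_call, voi_tampering_for_check_smoke, gain_from_report_arc_in_n3)
```

The largest of these differences, the one for the earliest decision that could use the information, is the most a tampering sensor could be worth.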
The value of control specifies how much it is worth to control a variable. In its simplest form, it is the change in value of a decision network where a random variable is replaced by a decision variable, and arcs are added to make it a no-forgetting network. If this is done, the change in utility is non-negative; the resulting network always has an equal or higher expected utility than the original network.
In the fire alarm decision network of Figure 9.11, you may be interested in the value of controlling tampering. This could, for example, be used to estimate how much it is worth to add security guards to prevent tampering. To compute this, compare the value of the decision network of Figure 9.11 to the decision network where $\mathit{Tampering}$ is a decision node and a parent of the other two decision nodes.
To determine the value of control, turn the $\mathit{Tampering}$ node into a decision node and make it a parent of the other two decisions. The value of the resulting network is $-20.7$. This can be compared to the value of $N_3$ in Example 9.24 (which has the same arcs, and differs in whether $\mathit{Tampering}$ is a decision or random node), which was $-20.9$. Notice that control is more valuable than information.
The previous description assumed the parents of the random variable that is being controlled become parents of the decision variable. In this case, the value of control is never negative. However, if the parents of the decision node do not include all of the parents of the random variable, it is possible that control is less valuable than information. In general, one must be explicit about what information will be available when controlling a variable.
Consider controlling the variable $\mathit{Smoke}$ in Figure 9.11. If $\mathit{Fire}$ is a parent of the decision variable $\mathit{Smoke}$, it has to be a parent of $\mathit{Call}$ to make it a no-forgetting network. The expected utility of the resulting network with $\mathit{Smoke}$ coming before $\mathit{Check\_smoke}$ is $-2.0$. The value of controlling $\mathit{Smoke}$ in this situation is due to observing $\mathit{Fire}$. The resulting optimal decision is to call if there is a fire and not call otherwise.
Suppose the agent were to control $\mathit{Smoke}$ without observing $\mathit{Fire}$. That is, the agent can decide to make smoke or prevent smoke, and $\mathit{Fire}$ is not a parent of any decision. This situation can be modeled by making $\mathit{Smoke}$ a decision variable with no parents. In this case, the expected utility is $-23.20$, which is worse than the initial decision network, because blindly controlling $\mathit{Smoke}$ loses its ability to act as a sensor for $\mathit{Fire}$.
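The control comparisons above can be tabulated the same way. This sketch assumes that the network of Figure 9.11 is the original network $N_0$ of the earlier example, with value $-22.6$; that identification is an assumption, as are the constant names and the helper `change`. All other numbers are the values quoted in the text.

```python
# Assumption: the Figure 9.11 network is N0, whose quoted value is -22.6.
BASELINE = -22.6
INFO_TAMPERING = -20.9             # N3: Tampering observed by both decisions
CONTROL_TAMPERING = -20.7          # Tampering as a decision, parent of both decisions
CONTROL_SMOKE_SEEING_FIRE = -2.0   # Smoke as a decision with Fire as a parent
CONTROL_SMOKE_BLIND = -23.2        # Smoke as a decision with no parents


def change(new_value, old_value=BASELINE):
    """Change in expected utility, rounded to one decimal place."""
    return round(new_value - old_value, 1)


control_over_information = change(CONTROL_TAMPERING, INFO_TAMPERING)
value_of_controlling_tampering = change(CONTROL_TAMPERING)
value_of_controlling_smoke_seeing_fire = change(CONTROL_SMOKE_SEEING_FIRE)
value_of_controlling_smoke_blind = change(CONTROL_SMOKE_BLIND)

print(control_over_information, value_of_controlling_tampering,
      value_of_controlling_smoke_seeing_fire, value_of_controlling_smoke_blind)
```

Only the last change is negative: it is the case where the decision replacing $\mathit{Smoke}$ does not observe $\mathit{Fire}$, the parent of the variable it replaces, so the non-negativity guarantee does not apply.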