foundations of computational agents
A policy specifies what the agent should do under all contingencies. A policy consists of a decision function for each decision variable. A decision function for a decision variable is a function that specifies a value for the decision variable for each assignment of values to its parents. Thus, a policy specifies, for each decision variable, what the agent will do for each of the possible observations.
In Example 9.13, some of the policies are:
Always bring the umbrella.
Bring the umbrella only if the forecast is “rainy.”
Bring the umbrella only if the forecast is “sunny.”
There are eight different policies, because there are three possible forecasts and there are two choices for each forecast.
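The count of eight can be checked by enumerating the decision functions directly. The following is a minimal sketch, assuming the forecast values are named `sunny`, `cloudy`, and `rainy` and the actions `take_it` and `leave_it`; the helper `all_decision_functions` is a hypothetical name introduced here for illustration.

```python
from itertools import product

FORECASTS = ["sunny", "cloudy", "rainy"]  # values of the decision's parent
ACTIONS = ["take_it", "leave_it"]         # values of the decision variable


def all_decision_functions(parent_values, actions):
    """Yield every mapping from a parent value to an action."""
    for choice in product(actions, repeat=len(parent_values)):
        yield dict(zip(parent_values, choice))


# With a single decision variable, each decision function is a policy.
policies = list(all_decision_functions(FORECASTS, ACTIONS))
for policy in policies:
    print(policy)
print(len(policies))  # 2**3 == 8
```

Each printed dictionary is one policy: it says what to do for every possible forecast, not just the one that happens to be observed.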
In Example 9.15, a policy specifies a decision function for $\mathit{Check\_smoke}$ and a decision function for $\mathit{Call}$. Some of the policies are:
Never check for smoke, and call only if there is a report.
Always check for smoke, and call only if it sees smoke.
Check for smoke if there is a report, and call only if there is a report and it sees smoke.
Check for smoke if there is no report, and call when it does not see smoke.
Always check for smoke and never call.
There are four decision functions for $\mathit{Check\_smoke}$. There are $2^{8}$ decision functions for $\mathit{Call}$; for each of the eight assignments of values to the parents of $\mathit{Call}$, the agent can choose to call or not. Thus there are $4 * 2^{8} = 1024$ different policies.
Each policy has an expected utility for an agent that follows the policy. A rational agent should adopt the policy that maximizes its expected utility.
A possible world specifies a value for each random variable and each decision variable. A possible world $\omega $ satisfies policy $\pi $ if for every decision variable $D$, $D(\omega )$ has the value specified by the policy given the values of the parents of $D$ in the possible world.
A possible world corresponds to a complete history and specifies the values of all random and decision variables, including all observed variables. Possible world $\omega $ satisfies policy $\pi $ if $\omega $ is one possible unfolding of history given that the agent follows policy $\pi $.
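The definition of a world satisfying a policy can be written as a short check. This is a sketch under assumed representations: a world is a dictionary from variable names to values, and a policy maps each decision variable to its parents and a decision function stored as a lookup table. The names `satisfies`, `policy`, `w1`, and `w2` are hypothetical.

```python
def satisfies(world, policy):
    """Return True if every decision in `world` is the one `policy` prescribes.

    world:  dict mapping variable name -> value (random and decision variables)
    policy: dict mapping decision variable -> (parents, decision_function),
            where decision_function maps a tuple of parent values to a value
    """
    for decision, (parents, decision_function) in policy.items():
        parent_values = tuple(world[p] for p in parents)
        if world[decision] != decision_function[parent_values]:
            return False
    return True


# Illustrative policy: take the umbrella only when the forecast is cloudy.
policy = {
    "Umbrella": (
        ("Forecast",),
        {("sunny",): "leave_it", ("cloudy",): "take_it", ("rainy",): "leave_it"},
    )
}

w1 = {"Weather": "rain", "Forecast": "cloudy", "Umbrella": "take_it"}
w2 = {"Weather": "rain", "Forecast": "sunny", "Umbrella": "take_it"}
print(satisfies(w1, policy))  # True
print(satisfies(w2, policy))  # False
```

The check ignores the values of the random variables other than the decision's parents, which reflects that a policy constrains only what the agent does, not what nature does.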
The expected utility of policy $\pi $ is
$$\mathcal{E}(u\mid \pi )=\sum _{\omega \text{ satisfies } \pi}u(\omega )*P(\omega ),$$
where $P(\omega )$, the probability of world $\omega $, is the product of the probabilities of the values of the chance nodes given their parents’ values in $\omega $, and $u(\omega )$ is the value of the utility $u$ in world $\omega $.
In Example 9.13, let $\pi_1$ be the policy to take the umbrella if the forecast is cloudy and to leave it at home otherwise. The worlds that satisfy this policy are:
$\mathit{Weather}$ | $\mathit{Forecast}$ | $\mathit{Umbrella}$ |
---|---|---|
$\mathit{norain}$ | $\mathit{sunny}$ | $\mathit{leave\_it}$ |
$\mathit{norain}$ | $\mathit{cloudy}$ | $\mathit{take\_it}$ |
$\mathit{norain}$ | $\mathit{rainy}$ | $\mathit{leave\_it}$ |
$\mathit{rain}$ | $\mathit{sunny}$ | $\mathit{leave\_it}$ |
$\mathit{rain}$ | $\mathit{cloudy}$ | $\mathit{take\_it}$ |
$\mathit{rain}$ | $\mathit{rainy}$ | $\mathit{leave\_it}$ |
Notice how the value for the decision variable is the one chosen by the policy. It only depends on the forecast.
The expected utility of this policy is obtained by summing, over the worlds that satisfy this policy, the utility of each world weighted by its probability:
$$\begin{aligned}
\mathcal{E}(u \mid \pi_1) ={}& P(\mathit{norain}) * P(\mathit{sunny} \mid \mathit{norain}) * u(\mathit{norain}, \mathit{leave\_it}) \\
&+ P(\mathit{norain}) * P(\mathit{cloudy} \mid \mathit{norain}) * u(\mathit{norain}, \mathit{take\_it}) \\
&+ P(\mathit{norain}) * P(\mathit{rainy} \mid \mathit{norain}) * u(\mathit{norain}, \mathit{leave\_it}) \\
&+ P(\mathit{rain}) * P(\mathit{sunny} \mid \mathit{rain}) * u(\mathit{rain}, \mathit{leave\_it}) \\
&+ P(\mathit{rain}) * P(\mathit{cloudy} \mid \mathit{rain}) * u(\mathit{rain}, \mathit{take\_it}) \\
&+ P(\mathit{rain}) * P(\mathit{rainy} \mid \mathit{rain}) * u(\mathit{rain}, \mathit{leave\_it}),
\end{aligned}$$
where $\mathit{norain}$ means $\mathit{Weather} = \mathit{norain}$, $\mathit{sunny}$ means $\mathit{Forecast} = \mathit{sunny}$, and similarly for the other values.
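Once the probabilities and utilities are fixed, this sum can be evaluated mechanically. The sketch below does so in Python. The numbers in `P_WEATHER`, `P_FORECAST`, and `UTILITY` are illustrative assumptions, not values stated in this section, and all names are hypothetical.

```python
# Assumed model parameters, for illustration only.
P_WEATHER = {"norain": 0.7, "rain": 0.3}
P_FORECAST = {  # P(Forecast | Weather)
    "norain": {"sunny": 0.7, "cloudy": 0.2, "rainy": 0.1},
    "rain": {"sunny": 0.15, "cloudy": 0.25, "rainy": 0.6},
}
UTILITY = {  # u(Weather, Umbrella)
    ("norain", "take_it"): 20,
    ("norain", "leave_it"): 100,
    ("rain", "take_it"): 70,
    ("rain", "leave_it"): 0,
}


def expected_utility(policy):
    """Sum u(w) * P(w) over the worlds w that satisfy `policy`.

    `policy` maps each forecast to an action, so for every (weather, forecast)
    pair exactly one world satisfies it: the one with the prescribed action.
    """
    total = 0.0
    for weather, p_weather in P_WEATHER.items():
        for forecast, p_forecast in P_FORECAST[weather].items():
            action = policy[forecast]
            total += p_weather * p_forecast * UTILITY[(weather, action)]
    return total


pi_1 = {"sunny": "leave_it", "cloudy": "take_it", "rainy": "leave_it"}
print(expected_utility(pi_1))  # about 64.05 under the assumed numbers
```

The two nested loops visit the same six worlds as the six terms of the sum, in the same order.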
An optimal policy is a policy ${\pi}^{*}$ such that $\mathcal{E}(u\mid {\pi}^{*})\ge \mathcal{E}(u\mid \pi )$ for all policies $\pi $. That is, an optimal policy is a policy whose expected utility is maximal over all policies.
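For a problem as small as the umbrella example, an optimal policy can be found by evaluating every policy and keeping the best. The following sketch does this by brute force. The probabilities and utilities are illustrative assumptions rather than values given in this section, and the function and variable names are hypothetical.

```python
from itertools import product

# Assumed model parameters, for illustration only.
P_WEATHER = {"norain": 0.7, "rain": 0.3}
P_FORECAST = {  # P(Forecast | Weather)
    "norain": {"sunny": 0.7, "cloudy": 0.2, "rainy": 0.1},
    "rain": {"sunny": 0.15, "cloudy": 0.25, "rainy": 0.6},
}
UTILITY = {  # u(Weather, Umbrella)
    ("norain", "take_it"): 20, ("norain", "leave_it"): 100,
    ("rain", "take_it"): 70, ("rain", "leave_it"): 0,
}
FORECASTS = ["sunny", "cloudy", "rainy"]
ACTIONS = ["take_it", "leave_it"]


def expected_utility(policy):
    """Probability-weighted utility over the worlds that satisfy `policy`."""
    return sum(
        P_WEATHER[w] * P_FORECAST[w][f] * UTILITY[(w, policy[f])]
        for w in P_WEATHER
        for f in FORECASTS
    )


# Enumerate all 2**3 policies and keep one with maximal expected utility.
all_policies = [dict(zip(FORECASTS, choice))
                for choice in product(ACTIONS, repeat=len(FORECASTS))]
best_policy = max(all_policies, key=expected_utility)
best_value = expected_utility(best_policy)

print(best_policy)
print(best_value)
```

Under these assumed numbers, the policy that wins is to take the umbrella only when the forecast is rainy. Ties, if any, would be broken by enumeration order, since an optimal policy need not be unique.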
Suppose a binary decision node has $n$ binary parents. There are ${2}^{n}$ different assignments of values to the parents and, consequently, there are ${2}^{{2}^{n}}$ different possible decision functions for this decision node. The number of policies is the product of the number of decision functions for each of the decision variables. Even small examples can have a huge number of policies. An algorithm that simply enumerates the policies looking for the best one will be very inefficient.
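The growth described here is easy to tabulate. The sketch below counts decision functions and policies for binary variables only; the function names are hypothetical. The final call uses one binary parent for the first decision and three for the second, which matches the four decision functions and eight parent assignments quoted for Example 9.15.

```python
from math import prod


def num_decision_functions(num_parents):
    """Decision functions for a binary decision with `num_parents` binary parents."""
    num_parent_assignments = 2 ** num_parents
    return 2 ** num_parent_assignments


def num_policies(parent_counts):
    """Policies for a list of binary decisions, given each one's parent count."""
    return prod(num_decision_functions(n) for n in parent_counts)


for n in range(6):
    print(n, num_decision_functions(n))
# 0 2
# 1 4
# 2 16
# 3 256
# 4 65536
# 5 4294967296

print(num_policies([1, 3]))  # 4 * 2**8 == 1024
```

With only five binary parents, a single decision already has more than four billion decision functions, which is why exhaustive enumeration does not scale.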