9.3 Sequential Decisions


9.3.2 Policies

A policy specifies what the agent should do under all contingencies. A policy consists of a decision function for each decision variable. A decision function for a decision variable is a function that specifies a value for the decision variable for each assignment of values to its parents. Thus, a policy specifies, for each decision variable, what the agent will do for each of the possible observations.
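As a minimal sketch (not from the text), a decision function can be stored as a table from assignments of the parents to a chosen value, and a policy as one such table per decision variable. The Python below uses the umbrella decision from the examples that follow, where Umbrella has the single parent Forecast.

    # A decision function: one chosen value for each assignment of the parents.
    # Umbrella has the single parent Forecast, so the keys are forecast values.
    bring_if_rainy = {
        "sunny": "leave_it",
        "cloudy": "leave_it",
        "rainy": "take_it",
    }

    # A policy: one decision function per decision variable.
    policy = {"Umbrella": bring_if_rainy}

    # Following the policy after observing the forecast "rainy":
    print(policy["Umbrella"]["rainy"])   # -> take_it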

Example 9.16.

In Example 9.13, some of the policies are:

  • Always bring the umbrella.

  • Bring the umbrella only if the forecast is “rainy.”

  • Bring the umbrella only if the forecast is “sunny.”

There are 2^3 = 8 different policies, because there are two choices for each of the three possible forecasts.
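The count can be checked mechanically: each policy chooses one of the two values of Umbrella for each of the three forecasts. A small sketch, assuming the same names as above:

    from itertools import product

    forecasts = ["sunny", "cloudy", "rainy"]
    choices = ["take_it", "leave_it"]

    # One decision function per way of assigning a choice to every forecast.
    policies = [dict(zip(forecasts, c)) for c in product(choices, repeat=3)]

    print(len(policies))   # -> 8
    print(policies[0])     # {'sunny': 'take_it', 'cloudy': 'take_it', 'rainy': 'take_it'}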

Example 9.17.

In Example 9.15, a policy specifies a decision function for Check_smoke and a decision function for Call. Some of the policies are:

  • Never check for smoke, and call only if there is a report.

  • Always check for smoke, and call only if it sees smoke.

  • Check for smoke if there is a report, and call only if there is a report and it sees smoke.

  • Check for smoke if there is no report, and call when it does not see smoke.

  • Always check for smoke and never call.

There are four decision functions for Check_smoke. There are 2^8 = 256 decision functions for Call; for each of the eight assignments of values to the parents of Call, the agent can choose to call or not. Thus there are 4 * 256 = 1024 different policies.
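These counts follow the general pattern of 2^(2^n) decision functions for a binary decision node with n binary parents, and can be checked directly:

    # Decision functions for a binary decision node with n binary parents: 2**(2**n)
    check_smoke_fns = 2 ** (2 ** 1)     # one binary parent (two assignments) -> 4
    call_fns = 2 ** (2 ** 3)            # three binary parents (eight assignments) -> 256
    print(check_smoke_fns * call_fns)   # -> 1024 policies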

Expected Utility of a Policy

Each policy has an expected utility for an agent that follows the policy. A rational agent should adopt the policy that maximizes its expected utility.

A possible world specifies a value for each random variable and each decision variable. A possible world ω satisfies policy π if for every decision variable D, D(ω) has the value specified by the policy given the values of the parents of D in the possible world.

A possible world corresponds to a complete history and specifies the values of all random and decision variables, including all observed variables. Possible world ω satisfies policy π if ω is one possible unfolding of history given that the agent follows policy π.

The expected utility of policy π is

ℰ(u | π) = ∑_{ω satisfies π} u(ω) × P(ω),

where P(ω), the probability of world ω, is the product of the probabilities of the values of the chance nodes given their parents’ values in ω, and u(ω) is the value of the utility u in world ω.
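A direct sketch of this sum, extending the table representation above to decision variables with several parents; the prob and utility arguments are hypothetical stand-ins for the factors of a particular decision network, not part of the text:

    from itertools import product

    def expected_utility(policy, chance_vars, prob, utility):
        """Sum u(w) * P(w) over the worlds w that satisfy the policy.

        policy: maps each decision variable to (parents, decision function),
                where the decision function maps tuples of parent values to
                a value for the decision variable.
        chance_vars: maps each chance variable to its list of values.
        prob, utility: functions of a world (hypothetical stand-ins for the
                product of the chance nodes' probabilities and for u).
        Assumes the decision variables are listed so that each one's parents
        are already assigned when it is reached.
        """
        total = 0.0
        names = list(chance_vars)
        for values in product(*(chance_vars[v] for v in names)):
            world = dict(zip(names, values))
            for dec, (parents, dec_fn) in policy.items():
                # The decision takes the value the policy chooses, given the
                # values of its parents in this world.
                world[dec] = dec_fn[tuple(world[p] for p in parents)]
            total += utility(world) * prob(world)
        return total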

Example 9.18.

Consider Example 9.13, and let π1 be the policy of taking the umbrella if the forecast is cloudy and leaving it at home otherwise. The worlds that satisfy this policy are:

Weather   Forecast   Umbrella
norain    sunny      leave_it
norain    cloudy     take_it
norain    rainy      leave_it
rain      sunny      leave_it
rain      cloudy     take_it
rain      rainy      leave_it

Notice how the value of the decision variable is the one chosen by the policy; it depends only on the forecast.

The expected utility of this policy is obtained by averaging the utility over the worlds that satisfy this policy:

ℰ(u | π1) = P(norain) × P(sunny | norain) × u(norain, leave_it)
    + P(norain) × P(cloudy | norain) × u(norain, take_it)
    + P(norain) × P(rainy | norain) × u(norain, leave_it)
    + P(rain) × P(sunny | rain) × u(rain, leave_it)
    + P(rain) × P(cloudy | rain) × u(rain, take_it)
    + P(rain) × P(rainy | rain) × u(rain, leave_it),

where norain means Weather=norain, sunny means Forecast=sunny, and similarly for the other values.
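With numbers filled in, the average is a single weighted sum. The probabilities and utilities below are illustrative stand-ins only (Example 9.13's actual numbers are not reproduced in this excerpt); they show the shape of the computation for π1:

    # Illustrative (made-up) numbers; only the structure mirrors the sum above.
    P_weather = {"norain": 0.7, "rain": 0.3}
    P_forecast = {  # P(Forecast | Weather)
        "norain": {"sunny": 0.7, "cloudy": 0.2, "rainy": 0.1},
        "rain":   {"sunny": 0.15, "cloudy": 0.25, "rainy": 0.6},
    }
    u = {  # u(Weather, Umbrella)
        ("norain", "take_it"): 20, ("norain", "leave_it"): 100,
        ("rain", "take_it"): 70,   ("rain", "leave_it"): 0,
    }
    pi1 = {"sunny": "leave_it", "cloudy": "take_it", "rainy": "leave_it"}

    eu = sum(P_weather[w] * P_forecast[w][f] * u[(w, pi1[f])]
             for w in P_weather for f in P_forecast[w])
    print(round(eu, 2))   # -> 64.05 with these made-up numbers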

An optimal policy is a policy π* such that ℰ(u | π*) ≥ ℰ(u | π) for all policies π. That is, an optimal policy is a policy whose expected utility is maximal over all policies.

Suppose a binary decision node has n binary parents. There are 2^n different assignments of values to the parents and, consequently, there are 2^(2^n) different possible decision functions for this decision node. The number of policies is the product of the number of decision functions for each of the decision variables. Even small examples can have a huge number of policies. An algorithm that simply enumerates the policies looking for the best one will be very inefficient.
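To see the blow-up concretely, the naive approach made explicit: enumerate every combination of decision functions and evaluate each one. This is only a sketch; expected_utility stands for an evaluation routine in the style of the sketch above, and the network description is assumed, not taken from the text.

    from itertools import product

    def all_decision_functions(parent_values, decision_values):
        """Every decision function for one decision node: one chosen value
        for each assignment of values to its parents."""
        assignments = list(product(*parent_values))
        for choice in product(decision_values, repeat=len(assignments)):
            yield dict(zip(assignments, choice))

    def best_policy_by_enumeration(decision_nodes, expected_utility):
        """Naive search: try every combination of decision functions.

        decision_nodes maps each decision variable to a triple
        (parents, parent value lists, decision values); expected_utility is
        an evaluation routine taking a policy in the format of the sketch
        above (hypothetical, supplied by the caller).
        """
        names = list(decision_nodes)
        options = [list(all_decision_functions(ps, ds))
                   for (_, ps, ds) in (decision_nodes[n] for n in names)]
        best, best_eu = None, float("-inf")
        for fns in product(*options):
            candidate = {n: (decision_nodes[n][0], fn) for n, fn in zip(names, fns)}
            eu = expected_utility(candidate)
            if eu > best_eu:
                best, best_eu = candidate, eu
        return best, best_eu

The outer loop visits every policy, so its length is the product of the per-node counts (for binary nodes, 2^(2^n) each), which is why enumeration is hopeless beyond tiny networks.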