### 9.5.5 Dynamic Decision Networks

The MDP is a state-based representation. In this
section, we consider a feature-based extension of MDPs, which forms the basis for
what is known as **decision-theoretic planning**.

The representation of a **dynamic decision network** (DDN) can be seen in a number of
different ways:

- a factored representation of MDPs, where the states are described in terms of features;
- an extension of decision networks to allow repeated structure for ongoing actions and state changes;
- an extension of dynamic belief networks to include actions and rewards; and
- an extension of the feature-based representation of actions to allow for uncertainty in the effect of actions.

A fully observable dynamic decision network consists of

- a set of state features, each with a domain;
- a set of possible actions forming a decision node $A$, with domain the set of actions;
- a two-stage belief network with an action node $A$, nodes $F_0$ and $F_1$ for each feature $F$ (for the features at time 0 and time 1, respectively), and a conditional probability $P(F_1 \mid \mathit{parents}(F_1))$ such that the parents of $F_1$ can include $A$ and features at times 0 and 1 as long as the resulting network is acyclic; and
- a reward function that can be a function of the action and any of the features at times 0 or 1.

As in a dynamic belief network, the features at time 1 can be replicated for each subsequent time.
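The components listed above can be sketched in code. The following is a minimal illustration, not a reference implementation: the two Boolean features `X` and `Y`, the actions `stay` and `flip`, and every probability are hypothetical values chosen for the sketch. It shows a two-stage model in which a time-1 feature has another time-1 feature as a parent, and how the same model is reused at every step when the network is replicated over time.

```python
import itertools

# Hypothetical toy DDN: feature names, actions, and numbers are assumptions
# made for illustration only.
FEATURES = {"X": [False, True], "Y": [False, True]}
ACTIONS = ["stay", "flip"]

# Parents of each time-1 feature. A trailing 0 or 1 marks the time step and
# "A" is the action node. Y1 has the time-1 feature X1 as a parent, which is
# allowed because the two-stage network remains acyclic.
PARENTS = {"X1": ["X0", "A"], "Y1": ["Y0", "X1"]}
TIME1_ORDER = ["X1", "Y1"]  # a topological order of the time-1 nodes


def prob_true(var, parent_values):
    """Return P(var = True | parents) for the toy model."""
    if var == "X1":
        x0, action = parent_values
        if action == "flip":
            return 0.1 if x0 else 0.9
        return 1.0 if x0 else 0.0
    if var == "Y1":
        y0, x1 = parent_values
        return 0.8 if (y0 or x1) else 0.05
    raise KeyError(var)


def transition_prob(state0, action, state1):
    """P(state1 | state0, action) as a product of per-feature conditionals."""
    env = {name + "0": value for name, value in state0.items()}
    env["A"] = action
    env.update({name + "1": value for name, value in state1.items()})
    p = 1.0
    for var in TIME1_ORDER:
        p_true = prob_true(var, [env[parent] for parent in PARENTS[var]])
        p *= p_true if env[var] else 1.0 - p_true
    return p


def all_states():
    """Enumerate every assignment of values to the features."""
    names = list(FEATURES)
    return [dict(zip(names, values))
            for values in itertools.product(*FEATURES.values())]


def trajectory_prob(states, actions):
    """P(states[1:] | states[0], actions), reusing the two-stage model at
    each step, which corresponds to replicating the time-1 features."""
    p = 1.0
    for s0, a, s1 in zip(states, actions, states[1:]):
        p *= transition_prob(s0, a, s1)
    return p
```

Enumerating all states is only feasible for tiny models; the point of the factored representation is that each conditional probability mentions only a feature's parents, not the whole state.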

**Example 9.28:** Consider representing a stochastic version of Example 8.1 as a dynamic decision network. We use the same features as in that example.

The parents of $\mathit{RLoc}_1$ are $\mathit{RLoc}_0$ and $A$. The parents of $\mathit{RHC}_1$ are $\mathit{RHC}_0$, $A$, and $\mathit{RLoc}_0$; whether the robot has coffee depends on whether it had coffee before, what action it performed, and its location.

The parents of $\mathit{SWC}_1$ include $\mathit{SWC}_0$, $\mathit{RHC}_0$, $A$, and $\mathit{RLoc}_0$. You would not expect $\mathit{RHC}_1$ and $\mathit{SWC}_1$ to be independent because they both depend on whether or not the coffee was successfully delivered. This could be modeled by having one be a parent of the other. The two-stage belief network of how the state variables at time 1 depend on the action and the other state variables is shown in Figure 9.17. This figure also shows the reward as a function of the action, whether Sam stopped wanting coffee, and whether there is mail waiting.

An alternative way to model the dependence between $\mathit{RHC}_1$ and $\mathit{SWC}_1$ is to introduce a new variable, $\mathit{CSD}_1$, which represents whether coffee was successfully delivered at time 1. This variable is a parent of both $\mathit{RHC}_1$ and $\mathit{SWC}_1$. Whether Sam wants coffee is a function of whether Sam wanted coffee before and whether coffee was successfully delivered. Whether the robot has coffee depends on the action and the location, to model the robot picking up coffee. Similarly, the dependence between $\mathit{MW}_1$ and $\mathit{RHM}_1$ can be modeled by introducing a variable $\mathit{MPU}_1$, which represents whether the mail was successfully picked up. The resulting DDN replicated to a horizon of 2, but omitting the reward, is shown in Figure 9.18.
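The role of the coffee-successfully-delivered variable can be illustrated with a small sketch. Everything numeric or named here is an assumption made for the illustration: the action names `deliver_coffee` and `pick_up_coffee`, the location names `office` and `coffee_shop`, the success probability of 0.8, and the choice that a failed delivery leaves the robot holding the coffee. The sketch sums out the delivery variable to obtain the joint distribution over whether the robot has coffee and whether Sam wants coffee at time 1.

```python
P_DELIVER_OK = 0.8  # assumed probability that a delivery attempt succeeds


def p_csd1(rhc0, rloc0, action):
    """P(CSD_1 = True | RHC_0, RLoc_0, A) under the assumptions above."""
    if action == "deliver_coffee" and rhc0 and rloc0 == "office":
        return P_DELIVER_OK
    return 0.0


def rhc1(rhc0, rloc0, action, csd1):
    """RHC_1 as a deterministic function of its parents (an assumption)."""
    if csd1:
        return False  # the coffee was handed over
    if action == "pick_up_coffee" and rloc0 == "coffee_shop":
        return True
    return rhc0  # otherwise the robot keeps whatever it had


def swc1(swc0, csd1):
    """SWC_1: Sam still wants coffee unless it was successfully delivered."""
    return swc0 and not csd1


def joint_rhc1_swc1(rhc0, swc0, rloc0, action):
    """Joint distribution over (RHC_1, SWC_1), summing out CSD_1."""
    joint = {}
    p_success = p_csd1(rhc0, rloc0, action)
    for csd1, p in ((True, p_success), (False, 1.0 - p_success)):
        if p == 0.0:
            continue
        outcome = (rhc1(rhc0, rloc0, action, csd1), swc1(swc0, csd1))
        joint[outcome] = joint.get(outcome, 0.0) + p
    return joint


joint = joint_rhc1_swc1(rhc0=True, swc0=True, rloc0="office",
                        action="deliver_coffee")
p_rhc1 = sum(p for (rhc, _), p in joint.items() if rhc)
p_swc1 = sum(p for (_, swc), p in joint.items() if swc)
p_both = joint.get((True, True), 0.0)
```

Under these assumed numbers, the probability that the robot has coffee and Sam wants coffee (`p_both`, about 0.2) differs from the product of the two marginals (about 0.04), so the two time-1 features are dependent given the previous state and the action. The shared parent captures that dependence without an arc between them.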

As part of such a decision network, we should also model the information
available to the actions and the rewards. In a **fully
observable dynamic decision network**, the parents of the action are
all the previous state variables. Because this can be inferred, the
arcs are typically not explicitly drawn. If the reward comes only at the
end, variable elimination for decision networks, shown in
Figure 9.11, can be applied directly. Note that we do not
require the no-forgetting condition for this to work; the fully
observable condition suffices. If rewards are accrued at each time
step, the algorithm must be augmented to allow for the
addition of rewards. See Exercise 9.12.
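
To illustrate what adding rewards across time steps involves, here is a minimal sketch of finite-horizon backward induction over explicitly enumerated states. It is not the variable elimination algorithm of Figure 9.11 and does not exploit the factored structure; it only shows the reward for a step being added to the value of the remaining steps. The function name, the two-state model, and all numbers are hypothetical.

```python
def backward_induction(states, actions, trans, reward, horizon):
    """Finite-horizon planning with a reward accrued at every step.

    trans(s, a) returns a dict mapping next states to probabilities, and
    reward(s, a, s1) returns the reward for that step. Returns the value of
    each state with `horizon` steps to go and one decision rule per step.
    """
    value = {s: 0.0 for s in states}  # no steps left, so no future reward
    policies = []
    for _ in range(horizon):
        new_value, policy = {}, {}
        for s in states:
            best_action, best_q = None, float("-inf")
            for a in actions:
                # Expected immediate reward plus value of the remaining steps.
                q = sum(p * (reward(s, a, s1) + value[s1])
                        for s1, p in trans(s, a).items())
                if q > best_q:
                    best_action, best_q = a, q
            new_value[s], policy[s] = best_q, best_action
        value = new_value
        policies.append(policy)
    policies.reverse()  # policies[0] is the rule for the first step
    return value, policies


# Hypothetical two-state model: "go" switches state with probability 0.9,
# "stay" keeps the state, and each step that ends in state 1 earns reward 1.
def toy_trans(s, a):
    if a == "stay":
        return {s: 1.0}
    return {1 - s: 0.9, s: 0.1}


def toy_reward(s, a, s1):
    return 1.0 if s1 == 1 else 0.0


values, policies = backward_induction([0, 1], ["stay", "go"],
                                      toy_trans, toy_reward, horizon=2)
```

With a horizon of 2, this toy model gives a value of about 1.89 for state 0 (go, then act optimally) and 2 for state 1 (stay twice). If the reward came only at the end, the `reward` term would be zero at every step except the last.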