12 Planning with Uncertainty 12.4 The Value of Information and Control 12.6 Social Impact

12.5 Decision Processes

Recall that the planning horizon is how far ahead an agent considers when planning. The decision networks of Section 12.3.1 were for finite-stage, partially observable domains. This section considers indefinite-horizon and infinite-horizon problems.

Often an agent must reason about an ongoing process or it does not know how many actions it will be required to do. These are called infinite-horizon problems when the process may go on forever or indefinite-horizon problems when the agent will eventually stop, but it does not know when it will stop.

For ongoing processes, it may not make sense to consider only the utility at the end, because the agent may never get to the end. Instead, an agent can receive a sequence of rewards. Rewards provide a way to factor utility through time, by having a reward for each time, and accumulating the rewards to determine utility. Rewards can incorporate action costs in addition to any prizes or penalties that may be awarded. Negative rewards are called punishments. Indefinite-horizon problems can be modeled using a stopping state. A stopping state or absorbing state is a state in which all actions have no effect; that is, when the agent is in that state, all actions immediately return to that state with a zero reward. Goal achievement can be modeled by having a reward for entering such a stopping state.

A Markov decision process can be seen as a Markov chain augmented with actions and rewards or as a decision network extended in time. At each stage, the agent decides which action to perform; the reward and the resulting state depend on both the previous state and the action performed.

Unless noted, assume a stationary model, where the state transitions and the rewards do not depend on the time.

A Markov decision process, or MDP, consists of

•

$S$ , a set of states of the world
•

$A$ , a set of actions
•

$P:S\times S\times A\rightarrow[0,1]$ , which specifies the dynamics. This is written as $P(s^{\prime}\mid s,a)$ , the probability of the agent transitioning into state $s^{\prime}$ given that the agent is in state $s$ and does action $a$ . Thus

$\forall s\in S~\forall a\in A~\sum_{s^{\prime}\in S}P(s^{\prime}\mid s,a)=1.$
•

$R:S\times A\times S\rightarrow\Re$ , where $R(s,a,s^{\prime})$ , the reward function, gives the expected immediate reward from doing action $a$ and transitioning to state $s^{\prime}$ from state $s$ . Sometimes it is convenient to use $R(s,a)$ , the expected value of doing $a$ in state $s$ , which is $R(s,a)=\sum_{s^{\prime}}R(s,a,s^{\prime})*P(s^{\prime}\mid s,a)$ .

A finite part of a Markov decision process can be depicted using a decision network as in Figure 12.15.

Refer to caption — Figure 12.15: Decision network representing a finite part of an MDP

Example 12.29.

Suppose Sam wanted to make an informed decision about whether to party or relax over the weekend. Sam prefers to party, but is worried about getting sick. Such a problem can be modeled as an MDP with two states, $h e a l t h y$ and $s i c k$ , and two actions, $r e l a x$ and $p a r t y$ . Thus

	$\displaystyle S$	$\displaystyle=\{healthy,sick\}$
	$\displaystyle A$	$\displaystyle=\{relax,party\}.$

Based on experience, Sam estimates that the dynamics $P(s^{\prime}\mid s,a)$ is given by

$S$	$A$	Probability of $s^{\prime}=healthy$
$h e a l t h y$	$r e l a x$	0.95
$h e a l t h y$	$p a r t y$	0.7
$s i c k$	$r e l a x$	0.5
$s i c k$	$p a r t y$	0.1

So, if Sam is healthy and parties, there is a 30% chance of becoming sick. If Sam is healthy and relaxes, Sam will more likely remain healthy. If Sam is sick and relaxes, there is a 50% chance of getting better. If Sam is sick and parties, there is only a 10% chance of becoming healthy.

Sam estimates the rewards to be the following, irrespective of the resulting state:

$S$	$A$	Reward
$h e a l t h y$	$r e l a x$	7
$h e a l t h y$	$p a r t y$	10
$s i c k$	$r e l a x$	0
$s i c k$	$p a r t y$	2

Thus, Sam always enjoys partying more than relaxing. However, Sam feels much better overall when healthy, and partying results in being sick more than relaxing does.

The problem is to determine what Sam should do each weekend.

Example 12.30.

A grid world is an idealization of a robot in an environment. At each time, the robot is at some location and can move to neighboring locations, collecting rewards and punishments. Suppose that the actions are stochastic, so that there is a probability distribution over the resulting states given the action and the state.

Figure 12.16 shows a $10\times 10$ grid world, where the robot can choose one of four actions: up, down, left, or right. If the agent carries out one of these actions, it has a $0.7$ chance of going one step in the desired direction and a $0.1$ chance of going one step in any of the other three directions. If it bumps into the outside wall (i.e., the location computed is outside the grid), there is a penalty of 1 (i.e., a reward of $-1$ ) and the agent does not actually move. There are four rewarding states (apart from the walls), one worth $+10$ (at position $(9,8)$ ; 9 across and 8 down), one worth $+3$ (at position $(8,3)$ ), one worth $-5$ (at position $(4,5)$ ), and one worth $-10$ (at position $(4,8)$ ). In each of these states, the agent gets the reward after it carries out an action in that state, not when it enters the state. When the agent reaches one of the states with positive reward (either $+3$ or $+10$ ), no matter what action it performs, at the next step it is flung, at random, to one of the four corners of the grid world.

Note that the reward in this example depends on both the initial state and the final state. The agent bumped into the wall, and so received a reward of $-1$ , if and only if the agent remains in the same state. Knowing just the initial state and the action, or just the final state and the action, does not provide enough information to infer the reward.

As with decision networks, the designer also has to consider what information is available to the agent when it decides what to do. There are two variations:

•

In a fully observable Markov decision process (MDP), the agent gets to observe the current state when deciding what to do.
•

A partially observable Markov decision process (POMDP) is a combination of an MDP and a hidden Markov model (HMM). At each time, the agent gets to make some (ambiguous and possibly noisy) observations that depend on the state. The agent only has access to the history of rewards, observations, and previous actions when making a decision. It cannot directly observe the current state.

Rewards

To decide what to do, the agent compares different sequences of rewards. The most common way to do this is to convert a sequence of rewards into a number called the return, the cumulative reward, or the value. This is a number that specifies the utility to an agent of the current and future rewards. To compute the return, the agent combines the current reward with other rewards in the future. Suppose the agent receives the sequence of rewards

r_{1},r_{2},r_{3},r_{4},\ldots.

Three common reward criteria are used to combine rewards into a value $V$ :

Total reward

$V=\sum_{i=1}^{\infty}r_{i}$ . In this case, the value is the sum of all of the rewards. This works when you can guarantee that the sum is finite; but if the sum is infinite, it does not give any opportunity to compare which sequence of rewards is preferable. For example, a sequence of 1 rewards has the same total as a sequence of 100 rewards (both are infinite). One case where the total reward is finite is when there are stopping states and the agent always has a non-zero probability of eventually entering a stopping state.

Average reward

$V=\lim_{n\rightarrow\infty}(r_{1}+\cdots+r_{n})/n$ . In this case, the agent’s value is the average of its rewards, averaged over for each time period. As long as the rewards are finite, this value will also be finite. However, whenever the total reward is finite, the average reward is zero, and so the average reward will fail to allow the agent to choose among different actions that each have a zero average reward. Under this criterion, the only thing that matters is where the agent ends up. Any finite sequence of bad actions does not affect the limit. For example, receiving 1,000,000 followed by rewards of 1 has the same average reward as receiving 0 followed by rewards of 1 (they both have an average reward of 1).

Discounted reward

$V=r_{1}+\gamma r_{2}+\gamma^{2}r_{3}+\cdots+\gamma^{i-1}r_{i}+\cdots$ , where $\gamma$ , the discount factor, is a number in the range $0\leq\gamma<1$ . Under this criterion, future rewards are worth less than the current reward. If $\gamma$ was 1, this would be the same as the total reward. When $\gamma=0$ , the agent ignores all future rewards. Having $0\leq\gamma<1$ guarantees that, whenever the rewards are finite, the total value will also be finite.

The discounted reward can be rewritten as

	$\displaystyle V$	$\displaystyle=\sum_{i=1}^{\infty}\gamma^{i-1}r_{i}$
		$\displaystyle=r_{1}+\gamma r_{2}+\gamma^{2}r_{3}+\cdots+\gamma^{i-1}r_{i}+\cdots$
		$\displaystyle=r_{1}+\gamma(r_{2}+\gamma(r_{3}+\cdots)).$

Suppose $V_{k}$ is the reward accumulated from time $k$ :

	$\displaystyle V_{k}$	$\displaystyle=r_{k}+\gamma(r_{k+1}+\gamma(r_{k+2}+\cdots))$
		$\displaystyle=r_{k}+\gamma V_{k+1}.$

To understand the properties of $V_{k}$ , suppose $S=1+\gamma+\gamma^{2}+\gamma^{3}+\cdots$ , then $S=1+\gamma S$ . Solving for $S$ gives $S=1/(1-\gamma)$ . Thus, with the discounted reward, the value of all of the future is at most $1/(1-\gamma)$ times as much as the maximum reward and at least $1/(1-\gamma)$ times as much as the minimum reward. Therefore, the eternity of time from now only has a finite value compared with the immediate reward, unlike the average reward, in which the immediate reward is dominated by the cumulative reward for the eternity of time.

In economics, $\gamma$ is related to the interest rate: getting $\$1$ now is equivalent to getting $\$(1+i)$ in one year, where $i$ is the interest rate. You could also see the discount rate as the probability that the agent survives; $\gamma$ can be seen as the probability that the agent keeps going.

The rest of this chapter considers discounted rewards, referred to as the return or value.

Foundations of Discounting

You might wonder whether there is something special about discounted rewards, or they are just an arbitrary way to define utility over time. Utility follows from the axioms of Section 12.1.1. Discounting can be proved to follow from a set of assumptions, as follows.

With an infinite sequence of outcomes $\left<o_{1},o_{2},o_{3},\dots\right>$ the following assumptions hold

•

the first time period matters, so there exist $o_{1},o_{2},o_{3},\dots$ and $o_{1}^{\prime}$ such that

$\left<o_{1},o_{2},o_{3},\dots\right>\succ\left<o_{1}^{\prime},o_{2},o_{3},% \dots\right>$
•

a form of additive independence, where preferences for the first two time periods do not depend on the future:

$\displaystyle\left<x_{1},x_{2},o_{3},o_{4}\dots\right>\succ\left<y_{1},y_{2},o% _{3},o_{4}\dots\right>$

if and only if

$\displaystyle\left<x_{1},x_{2},o_{3}^{\prime},o_{4}^{\prime}\dots\right>\succ% \left<y_{1},y_{2},o_{3}^{\prime},o_{4}^{\prime}\dots\right>$
•

time stationarity, where if the first outcome is the same, preference depends on the remainder:

$\displaystyle\left<o_{1},o_{2},o_{3},\dots\right>\succ\left<o_{1},o_{2}^{% \prime},o_{3}^{\prime},\dots\right>$

if and only if

$\displaystyle\left<o_{2},o_{3},\dots\right>\succ\left<o_{2}^{\prime},o_{3}^{% \prime},\dots\right>$
•

some extra technical conditions specifying that the agent is only concerned about finite subspaces of infinite time

if and only if there exists a discount factor $\gamma$ and function $r$ such that

utility(\left<o_{1},o_{2},o_{3},\dots\right>)=\sum_{i}\gamma^{i-1}r(o_{i})

where $r$ does not depend on the time, only the outcome. These are quite strong assumptions, for example, disallowing complements and substitutes. It is standard to engineer the rewards to make them true.

12.5.1 Policies

In a fully-observable Markov decision process, the agent gets to observe its current state before deciding which action to carry out. For now, assume that the Markov decision process is fully observable. A policy for an MDP specifies what the agent should do as a function of the state it is in. A stationary policy is a function $\pi:S\rightarrow A$ . In a non-stationary policy the action is a function of the state and the time; we assume policies are stationary.

Given a reward criterion, a policy has an expected return, often referred to as the expected value, for every state. Let $V^{\pi}(s)$ be the expected value of following $\pi$ in state $s$ . This is the utility an agent that is in state $s$ and following policy $\pi$ receives, on average. Policy $\pi$ is an optimal policy if for every policy $\pi^{\prime}$ and state $s$ , $V^{\pi}(s)\geq V^{\pi^{\prime}}(s)$ . That is, an optimal policy is at least as good as any other policy for every state.

Example 12.31.

For Example 12.29, with two states and two actions, there are $2^{2}=4$ policies:

•

Always relax.
•

Always party.
•

Relax if healthy and party if sick.
•

Party if healthy and relax if sick.

The total reward for all of these is infinite because the agent never stops, and can never continually get a reward of 0. To determine the average reward is left as an exercise (Exercise 12.15). How to compute the discounted reward is discussed in the next section.

Example 12.32.

In the grid-world MDP of Example 12.30, there are 100 states and 4 actions, therefore there are $4^{100}\approx 10^{60}$ stationary policies. Each policy specifies an action for each state.

For infinite-horizon problems, a stationary MDP always has an optimal stationary policy. However, for finite-stage problems, a non-stationary policy might be better than all stationary policies. For example, if the agent had to stop at time $n$ , for the last decision in some state, the agent would act to get the largest immediate reward without considering the future actions, but for earlier decisions it may decide to get a lower reward immediately to obtain a larger reward later.

Value of a Policy

Consider how to compute the expected value, using the discounted reward of a policy, given a discount factor of $\gamma$ . The value is defined in terms of two interrelated functions:

•

$V^{\pi}(s)$ is the expected value for an agent that is in state $s$ and following policy $\pi$ .
•

$Q^{\pi}(s,a)$ is the expected value for an agent that is starting in state $s$ , then doing action $a$ , then following policy $\pi$ . This is called the Q-value of policy $\pi$ .

$Q^{\pi}$ and $V^{\pi}$ are defined recursively in terms of each other. If the agent is in state $s$ , performs action $a$ , and arrives in state $s^{\prime}$ , it gets the immediate reward of $R(s,a,s^{\prime})$ plus the discounted future return, $\gamma V^{\pi}(s^{\prime})$ . When the agent is planning it does not know the actual resulting state, so it uses the expected value, averaged over the possible resulting states:

	$\displaystyle Q^{\pi}(s,a)$	$\displaystyle=\sum_{s^{\prime}}P(s^{\prime}\mid s,a)(R(s,a,s^{\prime})+\gamma V% ^{\pi}(s^{\prime}))$
		$\displaystyle=R(s,a)+\gamma\sum_{s^{\prime}}P(s^{\prime}\mid s,a)V^{\pi}(s^{% \prime})$		(12.2)

where $R(s,a)=\sum_{s^{\prime}}P(s^{\prime}\mid s,a)R(s,a,s^{\prime})$ .

$V^{\pi}(s)$ is obtained by doing the action specified by $\pi$ and then following $\pi$ :

\displaystyle V^{\pi}(s)

\displaystyle=Q^{\pi}(s,\pi(s)).

Value of an Optimal Policy

Let $Q^{*}(s,a)$ , where $s$ is a state and $a$ is an action, be the expected value of doing $a$ in state $s$ and then following the optimal policy. Let $V^{*}(s)$ , where $s$ is a state, be the expected value of following an optimal policy from state $s$ .

$Q^{*}$ can be defined analogously to $Q^{\pi}$ :

	$\displaystyle Q^{*}(s,a)$	$\displaystyle=\sum_{s^{\prime}}P(s^{\prime}\mid s,a)(R(s,a,s^{\prime})+\gamma V% ^{*}(s^{\prime}))$
		$\displaystyle=R(s,a)+\gamma\sum_{s^{\prime}}P(s^{\prime}\mid s,a)\gamma V^{*}(% s^{\prime}).$

$V^{*}(s)$ is obtained by performing the action that gives the best value in each state:

\displaystyle V^{*}(s)

\displaystyle=\max_{a}Q^{*}(s,a).

An optimal policy $\pi^{*}$ is one of the policies that gives the best value for each state:

\displaystyle\pi^{*}(s)

\displaystyle=\arg\max_{a}Q^{*}(s,a)

where $\arg\max_{a}Q^{*}(s,a)$ is a function of state $s$ , and its value is an action $a$ that results in the maximum value of $Q^{*}(s,a)$ .

12.5.2 Value Iteration

Value iteration is a method of computing an optimal policy for an MDP and its value.

Value iteration starts at the “end” and then works backward, refining an estimate of either $Q^{*}$ or $V^{*}$ . There is really no end, so it uses an arbitrary end point. Let $V_{k}$ be the value function assuming there are $k$ stages to go, and let $Q_{k}$ be the $Q$ -function assuming there are $k$ stages to go. These can be defined recursively. Value iteration starts with an arbitrary function $V_{0}$ . For subsequent stages, it uses the following equations to get the functions for $k+1$ stages to go from the functions for $k$ stages to go:

	$\displaystyle Q_{k+1}(s,a)$	$\displaystyle=R(s,a)+\gamma\sum_{s^{\prime}}P(s^{\prime}\mid s,a)V_{k}(s^{% \prime})$
	$\displaystyle V_{k}(s)$	$\displaystyle=\max_{a}Q_{k}(s,a).$

Value iteration can either save a $V[S]$ array or a $Q[S,A]$ array. Saving the $V$ array results in less storage, but it is more difficult to determine an optimal action, and one more iteration is needed to determine which action results in the greatest value.

1: procedure Value_iteration(

S, A, P, R

)

2: Inputs

S

is the set of all states

A

is the set of all actions

P

is the state transition function specifying

P(s^{\prime}\mid s,a)

R

is a reward function

R(s,a)

7: Output

\pi[S]

approximately optimal policy

V[S]

value function

10: Local

11: real array

V_{k}[S]

is a sequence of value functions

12: action array

\pi[S]

13: assign

V_{0}[S]

arbitrarily

14:

k\;{:}{=}\;\mbox{}0

15: repeat

16:

k\;{:}{=}\;\mbox{}k+1

17: for each state

s

18:

V_{k}[s]=\max_{a}~R(s,a)+\gamma*\sum_{s^{\prime}}P(s^{\prime}\mid s,a)*V_{k-1}% [s^{\prime}]

19: until termination

20: for each state

s

21:

\pi[s]=\arg\max_{a}~R(s,a)+\gamma*\sum_{s^{\prime}}P(s^{\prime}\mid s,a)*V_{k}% [s^{\prime}]

22: return

\pi,V_{k}

Figure 12.17: Value iteration for MDPs, storing

V

Figure 12.17 shows the value iteration algorithm when the $V$ array is stored. This procedure converges no matter what the initial value function $V_{0}$ is. An initial value function that approximates $V^{*}$ converges quicker than one that does not. The basis for many abstraction techniques for MDPs is to use some heuristic method to approximate $V^{*}$ and to use this as an initial seed for value iteration.

Example 12.33.

Consider the two-state MDP of Example 12.29 with discount $\gamma=0.8$ . We write the value function as $[healthy\_value,sick\_value]$ and the Q-function as $[[healthy\_relax,healthy\_party],[sick\_relax,sick\_party]]$ . Suppose initially the value function is $[0,0]$ . The next Q-value is $[[7,10],[0,2]]$ , so the next value function is $[10,2]$ (obtained by Sam partying). The next Q-value is then

$S t a t e$	$A c t i o n$	Value
$h e a l t h y$	$r e l a x$	$7+0.8(0.9510+0.05*2)$	=	14.68
$h e a l t h y$	$p a r t y$	$10+0.8(0.710+0.3*2)$	=	16.08
$s i c k$	$r e l a x$	$0+0.8(0.510+0.5*2)$	=	4.8
$s i c k$	$p a r t y$	$2+0.8(0.110+0.9*2)$	=	4.24

So the next value function is $[16.08,4.8]$ . After 1000 iterations, the value function is $[35.71,23.81]$ . So the $Q$ -function is $[[35.10,35.71],[23.81,22.0]]$ . Therefore, the optimal policy is to party when healthy and relax when sick.

Example 12.34.

Consider the nine squares around the $+10$ reward of Example 12.30. The discount is $\gamma=0.9$ . Suppose the algorithm starts with $V_{0}[s]=0$ for all states $s$ .

The values of $V_{1}$ , $V_{2}$ , and $V_{3}$ (to one decimal point) for these nine cells are

0	0	$-0.1$
0	10	$-0.1$
0	0	$-0.1$
$V_{1}$

0	$6.3$	$-0.1$
$6.3$	$9.8$	$6.2$
0	$6.3$	$-0.1$
$V_{2}$

$4.5$	$6.2$	$4.4$
$6.2$	$9.7$	$6.6$
$4.5$	$6.1$	$4.4$
$V_{3}$

After the first step of value iteration (in $V_{1}$ ), the nodes get their immediate expected reward. The center node in this figure is the $+10$ reward state. The right nodes have a value of $-0.1$ , with the optimal actions being up, left, and down; each of these has a $0.1$ chance of crashing into the wall for an immediate expected reward of $-1$ .

$V_{2}$ are the values after the second step of value iteration. Consider the node that is immediately to the left of the $+10$ reward state. Its optimal value is to go to the right; it has a 0.7 chance of getting a reward of 10 in the following state, so that is worth 9 (10 times the discount of $0.9$ ) to it now. The expected reward for the other possible resulting states is $0$ . Thus, the value of this state is $0.7*9=6.3$ .

Consider the node immediately to the right of the $+10$ reward state after the second step of value iteration. The agent’s optimal action in this state is to go left. The value of this state is

\begin{array}[]{rllcll}&\mbox{Prob}&\mbox{Reward}&&\mbox{Future Value }&\\ \hline\cr&0.7*(&0&+&0.9*10)&\mbox{\emph{Agent goes left}}\\ +&0.1*(&0&+&0.9*-0.1)&\mbox{\emph{Agent goes up}}\\ +&0.1*(&-1&+&0.9*-0.1)&\mbox{\emph{Agent goes right}}\\ +&0.1*(&0&+&0.9*-0.1)&\mbox{\emph{Agent goes down}}\end{array}

which evaluates to 6.173, which is approximated to $6.2$ in $V_{2}$ above.

The $+10$ reward state has a value less than 10 in $V_{2}$ because the agent gets flung to one of the corners and these corners look bad at this stage.

After the next step of value iteration, shown on the right-hand side of the figure, the effect of the +10 reward has progressed one more step. In particular, the corners shown get values that indicate a reward in three steps.

The value iteration algorithm of Figure 12.17 has an array for each stage, but it really only needs to store the current and previous arrays. It can update one array based on values from the other.

A common refinement of this algorithm is asynchronous value iteration. Rather than sweeping through the states to create a new value function, asynchronous value iteration updates the states one at a time, in any order, and stores the values in a single array. Asynchronous value iteration can store either the $Q[s,a]$ array or the $V[s]$ array. Figure 12.18 shows asynchronous value iteration when the $Q$ -array is stored. It converges faster than value iteration and is the basis of some of the algorithms for reinforcement learning. Termination can be difficult to determine if the agent must guarantee a particular error, unless it is careful about how the actions and states are selected. Often, this procedure is run indefinitely as an anytime algorithm, where it is always prepared to give its best estimate of the optimal action in a state when asked.

1: procedure Asynchronous_value_iteration(

S, A, P, R

)

2: Inputs

S

is the set of all states

A

is the set of all actions

P

is the state transition function specifying

P(s^{\prime}\mid s,a)

R

is a reward function

R(s,a)

7: Output

\pi[s]

policy

Q[S,A]

value function

10: Local

11: real array

Q[S,A]

12: action array

\pi[S]

13: assign

Q[S,A]

arbitrarily

14: repeat

15: select a state

s

16: select an action

a

17:

Q[s,a]=R(s,a)+\gamma*\sum_{s^{\prime}}P(s^{\prime}\mid s,a)*\max_{a^{\prime}}Q% [s^{\prime},a^{\prime}]

18: until termination

19: for each state

s

20:

\pi[s]=\arg\max_{a}Q[s,a]

21: return

\pi,Q

Figure 12.18: Asynchronous value iteration for MDPs

Asynchronous value iteration could also be implemented by storing just the $V[s]$ array. In that case, the algorithm selects a state $s$ and carries out the update

V[s]\;{:}{=}\;\mbox{}\max_{a}R(s,a)+\gamma*\sum_{s^{\prime}}P(s^{\prime}\mid s% ,a)*V[s^{\prime}].

Although this variant stores less information, it is more difficult to extract the policy. It requires one extra backup to determine which action $a$ results in the maximum value. This can be done using

\pi[s]\;{:}{=}\;\mbox{}\arg\max_{a}R(s,a)+\gamma*\sum_{s^{\prime}}P(s^{\prime}% \mid s,a)*V[s^{\prime}].

Example 12.35.

In Example 12.34, the state one step up and one step to the left of the +10 reward state only had its value updated after three value iterations, in which each iteration involved a sweep through all of the states.

In asynchronous value iteration, the $+10$ reward state can be chosen first. Next, the node to its left can be chosen, and its value will be $0.7*0.9*10=6.3$ . Next, the node above that node could be chosen, and its value would become $0.7*0.9*6.3=3.969$ . Note that it has a value that reflects that it is close to a $+10$ reward after considering three states, not 300 states, as does value iteration.

12.5.3 Policy Iteration

Policy iteration starts with a policy and iteratively improves it. It starts with an arbitrary policy $\pi_{0}$ (an approximation to the optimal policy works best) and carries out the following steps, starting from $i=0$ .

•

Policy evaluation: determine $V^{\pi_{i}}(S)$ . The definition of $V^{\pi}$ is a set of $|S|$ linear equations in $|S|$ unknowns. The unknowns are the values of $V^{\pi_{i}}(S)$ . There is an equation for each state. These equations can be solved by a linear equation solution method (such as Gaussian elimination) or they can be solved iteratively.
•

Policy improvement: choose $\pi_{i+1}(s)=\arg\max_{a}Q^{\pi_{i}}(s,a)$ , where the $Q$ -value can be obtained from $V$ using Equation 12.2. To detect when the algorithm has converged, it should only change the policy if the new action for some state improves the expected value; that is, it should set $\pi_{i+1}(s)$ to be $\pi_{i}(s)$ if $\pi_{i}(s)$ is one of the actions that maximizes $Q^{\pi_{i}}(s,a)$ .
•

Stop if there is no change in the policy, if $\pi_{i+1}=\pi_{i}$ , otherwise increment $i$ and repeat.

1: procedure Policy_iteration(

S, A, P, R

)

2: Inputs

S

is the set of all states

A

is the set of all actions

P

is the state transition function specifying

P(s^{\prime}\mid s,a)

R

is a reward function

R(s,a)

7: Output

8: optimal policy

\pi

9: Local

10: action array

\pi[S]

11: Boolean variable

n o C h a n g e

12: real array

V[S]

13: set

\pi

arbitrarily

14: repeat

15:

noChange\;{:}{=}\;\mbox{}true

16: Solve

V[s]=R(s,a)+\gamma*\sum_{s^{\prime}\in S}P(s^{\prime}\mid s,\pi[s])*V[s^{% \prime}]

17: for each

s\in S

18:

QBest\;{:}{=}\;\mbox{}V[s]

19: for each

a\in A

20:

Qsa\;{:}{=}\;\mbox{}R(s,a)+\gamma*\sum_{s^{\prime}\in S}P(s^{\prime}\mid s,a)*% V[s^{\prime}]

21: if

Qsa>QBest

then

22:

\pi[s]\;{:}{=}\;\mbox{}a

23:

QBest\;{:}{=}\;\mbox{}Qsa

24:

noChange\;{:}{=}\;\mbox{}false

25: until

n o C h a n g e

26: return

\pi

Figure 12.19: Policy iteration for MDPs

The algorithm is shown in Figure 12.19. Note that it only keeps the latest policy and notices if it has changed. This algorithm always halts, usually in a small number of iterations. Unfortunately, solving the set of linear equations is often time consuming.

A variant of policy iteration, called modified policy iteration, is obtained by noticing that the agent is not required to evaluate the policy to improve it; it can just carry out a number of backup steps using Equation 12.2 and then do an improvement.

Policy iteration is useful for systems that are too big to be represented explicitly as MDPs. One case is when there is a large action space, and the agent does not want to enumerate all actions at each time. The algorithm also works as long as an improving action is found, and it only needs to find an improving action probabilistically, for example, by testing some promising actions, rather than all.

Suppose a controller has some parameters that can be varied. An estimate of the derivative of the cumulative discounted reward of a parameter $a$ in some context $s$ , which corresponds to the derivative of $Q(a,s)$ , can be used to improve the parameter. Such an iteratively improving controller can get into a local maximum that is not a global maximum. Policy iteration for state-based MDPs does not result in non-optimal local maxima, because it is possible to improve an action for a state without affecting other states, whereas updating parameters can affect many states at once.

12.5.4 Dynamic Decision Networks

A Markov decision process is a state-based representation. Just as in classical planning, where reasoning in terms of features can allow for more straightforward representations and more efficient algorithms, planning under uncertainty can also take advantage of reasoning in term of features. This forms the basis for decision-theoretic planning.

A dynamic decision network (DDN) can be seen in a number of different ways:

•

a factored representation of MDPs, where the states are described in terms of features
•

an extension of decision networks to allow repeated structure for indefinite or infinite-horizon problems
•

an extension of dynamic belief networks to include actions and rewards
•

an extension of the feature-based representation of actions or the CSP representation of planning to allow for rewards and for uncertainty in the effect of actions.

A dynamic decision network consists of

•

a set of state features
•

a set of possible actions
•
a two-stage decision network with chance nodes $F_{0}$ and $F_{1}$ for each feature $F$ (for the features at time 0 and time 1, respectively) and decision node $A_{0}$ , such that
- –
  
  the domain of $A_{0}$ is the set of all actions
- –
  
  the parents of $A_{0}$ are the set of time 0 features (these arcs are often not shown explicitly)
- –
  
  the parents of time 0 features do not include $A_{0}$ or time 1 features, but can include other time 0 features as long as the resulting network is acyclic
- –
  
  the parents of time 1 features can contain $A_{0}$ and other time 0 or time 1 features as long as the graph is acyclic
- –
  
  there are probability distributions for $P(F_{0}\mid parents(F_{0}))$ and $P(F_{1}\mid parents(F_{1}))$ for each feature $F$
- –
  
  the reward function depends on any subset of the action and the features at times 0 or 1.

As in a dynamic belief network, a dynamic decision network can be unfolded into a decision network by replicating the features and the action for each subsequent time. For a time horizon of $n$ , there is a variable $F_{i}$ for each feature $F$ and for each time $i$ for $0\leq i\leq n$ . For a time horizon of $n$ , there is a variable $A_{i}$ for each time $i$ for $0\leq i<n$ . The horizon, $n$ , can be unbounded, which allows us to model processes that do not halt.

Thus, if there are $k$ features for a time horizon of $n$ , there are $k*(n+1)$ chance nodes (each representing a random variable) and $k$ decision nodes in the unfolded network.

The parents of $A_{i}$ are random variables $F_{i}$ (so that the agent can observe the state). Each $F_{i+1}$ depends on the action $A_{i}$ and the features at time $i$ and $i+1$ in the same way, with the same conditional probabilities, as $F_{1}$ depends on the action $A_{0}$ and the features at time $0$ and $1$ . The $F_{0}$ variables are modeled directly in the two-stage decision network.

Example 12.36.

Example 6.1 models a robot that can deliver coffee and mail in a simple environment with four locations. Consider representing a stochastic version of Example 6.1 as a dynamic decision network. We use the same features as in that example.

Feature $R L o c$ models the robot’s location. The parents of variables $RLoc_{1}$ are $Rloc_{0}$ and $A$ .

Feature $R H C$ is true when the robot has coffee. The parents of $RHC_{1}$ are $RHC_{0}$ , $A_{0}$ , and $RLoc_{0}$ ; whether the robot has coffee depends on whether it had coffee before, what action it performed, and its location. The probabilities can encode the possibilities that the robot does not succeed in picking up or delivering the coffee, that it drops the coffee, or that someone gives it coffee in some other state (which we may not want to say is impossible).

Variable $S W C$ is true when Sam wants coffee. The parents of $SWC_{1}$ include $SWC_{0}$ , $RHC_{0}$ , $A_{0}$ , and $RLoc_{0}$ . You would not expect $RHC_{1}$ and $SWC_{1}$ to be independent because they both depend on whether or not the coffee was successfully delivered. This could be modeled by having one be a parent of the other.

The two-stage belief network representing how the state variables at time 1 depend on the action and the other state variables is shown in Figure 12.20. This figure also shows the reward as a function of the action, whether Sam stopped wanting coffee, and whether there is mail waiting.

Figure 12.21 shows the unfolded decision network for a horizon of 3.

Example 12.37.

An alternate way to model the dependence between $RHC_{1}$ and $SWC_{1}$ is to introduce a new variable, $CSD_{1}$ , which represents whether coffee was successfully delivered at time $1$ . This variable is a parent of both $RHC_{1}$ and $SWC_{1}$ . Whether Sam wants coffee is a function of whether Sam wanted coffee before and whether coffee was successfully delivered. Whether the robot has coffee depends on the action and the location, to model the robot picking up coffee. Similarly, the dependence between $MW_{1}$ and $RHM_{1}$ can be modeled by introducing a variable $MPU_{1}$ , which represents whether the mail was successfully picked up. The resulting DDN unfolded to a horizon of 2, but omitting the reward, is shown in Figure 12.22.

If the reward comes only at the end, variable elimination for decision networks, shown in Figure 12.14, can be applied directly. Variable elimination for decision networks corresponds to value iteration. Note that in fully observable decision networks, variable elimination does not require the no-forgetting condition. Once the agent knows the state, all previous decisions are irrelevant. If rewards are accrued at each time step, the algorithm must be augmented to allow for the addition (and discounting) of rewards. See Exercise 12.19.

12.5.5 Partially Observable Decision Processes

A partially observable Markov decision process (POMDP) is a combination of an MDP and a hidden Markov model (HMM). Whereas the state in an MPD is assumed to be fully observable, the environment state in a POMDP is partially observable, which means the agent receives partial and/or noisy observations of the environment before it has to act.

A POMDP can be expressed as the infinite extension of the decision network of Figure 12.23, which explicitly shows the belief state of the agent. The network extends indefinitely to the right.

A POMDP consists of the following variables and factors defining the external behavior of the agent:

•

$S$ , a set of states of the world
•

$A$ , a set of actions
•

$O$ , a set of possible observations
•

$P(S_{0})$ , the probability distribution of the starting state
•

$P(S^{\prime}\mid S,A)$ , the dynamics of the environment, is the probability of getting to state $S^{\prime}$ by doing action $A$ from state $S$
•

$R(S,A)$ , the expected reward of starting in state $S$ , doing action $A$
•

$P(O\mid S,A,R)$ , the probability of observing $O$ given the state is $S$ , the previous action is $A$ , and the reward is $R$ .

$P(S^{\prime}\mid S,A)$ and $R(S,A)$ are the same as in an MDP. The arc from $S_{i}$ to $O_{i}$ means that what the agent observes can depend on the state. The arc from $A_{i-1}$ to $O_{i}$ allows the agent to have actions that don’t affect the environment, but affect its observations, such as paying attention to some part of the environment. The arc from $R_{i-1}$ to $O_{i}$ indicates that the agent’s observation can depend on the reward received; the agent is not assumed to observe the reward directly, as sometimes the rewards are only seen in retrospect. Observing the reward can often provide hints about the state that the agent might not actually be able to observe.

Internally, an agent has

•

$B$ , the set of possible belief states
•

$T(B,A,O)$ , a belief state transition function, which specifies the agent’s new belief state given the previous belief state $B$ , the action the agent did, $A$ , and the observation, $O$ ; the belief state at stage $i$ is $B_{i}=T(B_{i-1},A_{i-1},O_{i})$ .
•

$\pi(A_{i}\mid B_{i},O_{i})$ , a command function or policy, which specifies a conditional plan defining what the agent will do as a function of the belief state and the observation.

A policy might be stochastic to allow for exploration or to confound other agents. The belief-state transition function is typically deterministic, representing probability distributions, that are updated based on the action and new observations.

Planning in a POMDP involves creating both a belief state transition function and a command function. The $B_{i}$ variables are special in that the world does not specify the domain or structure of these variables; an agent or its designer gets to choose the structure of a belief state, and how the agent acts based on its belief state, previous action, and latest observations. The belief state encodes all that the agent has remembered about its history.

There are a number of ways to represent a belief state and to find the optimal policy:

•

The decision network of Figure 12.23 can be solved using variable elimination for decision networks, shown in Figure 12.14, extended to include discounted rewards. Adding the no-forgetting arcs is equivalent to a belief state being the sequence of observations and actions; $B_{i}$ is $A_{i-1}$ and $O_{i}$ appended to the sequence $B_{i-1}$ . The problem with this approach is that the history is unbounded, and the size of a policy is exponential in the length of the history. This is only practical when the history is short or is deliberately cut short.
•

The belief state can be a probability distribution over the states of the environment. Maintaining the belief state is then the problem of filtering in the associated hidden Markov model. The problem with this approach is that, with $n$ world states, the set of belief states is an $(n-1)$ -dimensional real space. The agent then needs to find the optimal function from this multidimensional real space into the actions. The value of a sequence of actions only depends on the world states; as the belief state is a probability distribution over world states, the expected value of a belief state is a linear function of the values of the world states. Not all points in $\Re^{n-1}$ are possible belief states, only those resulting from sequences of observations are possible. Policies can be represented as a decision tree with conditions being (functions of) the observations. The observations dictate which path in the tree the agent will follow. The value function for an agent choosing the best policy for any finite look-ahead is piecewise linear and convex. Although this is simpler than a general $\Re^{n-1}$ function, the number of conditional policies grows like $O^{t}$ , where $O$ is the set of possible observations, and $t$ is the stage.
•

In general, the belief state can be anything. A belief state can include probability distributions over some of the variables, and remembered observations and actions. The agent can search over the space of controllers for the best controller. Thus, the agent searches over what to remember (the belief state) and what to do based on its belief state. Note that the first two proposals are instances of this approach: the agent’s belief state is all of its history, or the agent’s belief state is a probability distribution over world states, both of which are intractable. Because the general case is unconstrained over what to remember, the search space is enormous.

	$\displaystyle\left<x_{1},x_{2},o_{3},o_{4}\dots\right>\succ\left<y_{1},y_{2},o% _{3},o_{4}\dots\right>$
	if and only if
	$\displaystyle\left<x_{1},x_{2},o_{3}^{\prime},o_{4}^{\prime}\dots\right>\succ% \left<y_{1},y_{2},o_{3}^{\prime},o_{4}^{\prime}\dots\right>$

	$\displaystyle\left<o_{1},o_{2},o_{3},\dots\right>\succ\left<o_{1},o_{2}^{% \prime},o_{3}^{\prime},\dots\right>$
	if and only if
	$\displaystyle\left<o_{2},o_{3},\dots\right>\succ\left<o_{2}^{\prime},o_{3}^{% \prime},\dots\right>$

Artificial Intelligence 3E

12.5 Decision Processes

Example 12.29.

Example 12.30.

Rewards

12.5.1 Policies

Example 12.31.

Example 12.32.

Value of a Policy

Value of an Optimal Policy

12.5.2 Value Iteration

Example 12.33.

Example 12.34.

Example 12.35.

12.5.3 Policy Iteration

12.5.4 Dynamic Decision Networks

Example 12.36.

Example 12.37.

12.5.5 Partially Observable Decision Processes

Artificial
Intelligence 3E