9.5 Decision Processes

9.5.2 Value Iteration

Value iteration is a method of computing an optimal policy for an MDP and its value.

Value iteration starts at the “end” and then works backward, refining an estimate of either Q* or V*. There is really no end, so it uses an arbitrary end point. Let Vk be the value function assuming there are k stages to go, and let Qk be the Q-function assuming there are k stages to go. These can be defined recursively. Value iteration starts with an arbitrary function V0. For subsequent stages, it uses the following equations to get the functions for k+1 stages to go from the functions for k stages to go

Qk+1(s,a) =R(s,a)+γ*sP(ss,a)*Vk(s)
Vk(s) =maxaQk(s,a)

It can either save the V[S] array or the Q[S,A] array. Saving the V array results in less storage, but it is more difficult to determine an optimal action, and one more iteration is needed to determine which action results in the greatest value.

1: procedure Value_iteration(S,A,P,R)
2:      Inputs
3:          S is the set of all states
4:          A is the set of all actions
5:          P is state transition function specifying P(ss,a)
6:          R is a reward function R(s,a)      
7:      Output
8:          π[S] approximately optimal policy
9:          V[S] value function      
10:      Local
11:          real array Vk[S] is a sequence of value functions
12:          action array π[S]      
13:      assign V0[S] arbitrarily
14:      k:=0
15:      repeat
16:          k:=k+1
17:          for each state s do
18:               Vk[s]=maxaR(s,a)+γ*sP(ss,a)*Vk-1[s]          
19:      until termination
20:      for each state s do
21:          π[s]=argmaxaR(s,a)+γ*sP(ss,a)*Vk[s]      
22:      return π,Vk
Figure 9.16: Value iteration for MDPs, storing V

Figure 9.16 shows the value iteration algorithm when the V array is stored. This procedure converges no matter what the initial value function V0 is. An initial value function that approximates V* converges quicker than one that does not. The basis for many abstraction techniques for MDPs is to use some heuristic method to approximate V* and to use this as an initial seed for value iteration.

Example 9.31.

Consider the two-state MDP of Example 9.27 with discount γ=0.8. We write the value function as [healthy_value,sick_value], and the Q-function as [[healthy_relax,healthy_party],[sick_relax,sick_party]]. Suppose initially the value function is [0,0]. The next Q-value is [[7,10],[0,2]], so the next value function is [10,2] (obtained by Sam partying). The next Q-value is then

State Action Value
healthy relax 7+0.8*(0.95*10+0.05*2) = 14.68
healthy party 10+0.8*(0.7*10+0.3*2) = 16.08
sick relax 0+0.8*(0.5*10+0.5*2) = 4.8
sick party 2+0.8*(0.1*10+0.9*2) = 4.24

So the next value function is [16.08,4.8]. After 1000 iterations, the value function is [35.71,23.81]. So the Q function is [[35.10,35.71],[23.81,22.0]]. Therefore, the optimal policy is to party when healthy and relax when sick.

Example 9.32.

Consider the nine squares around the +10 reward of Example 9.28. The discount is γ=0.9. Suppose the algorithm starts with V0[s]=0 for all states s.

The values of V1, V2, and V3 (to one decimal point) for these nine cells are

0 0 -0.1
0 10 -0.1
0 0 -0.1
0 6.3 -0.1
6.3 9.8 6.2
0 6.3 -0.1
4.5 6.2 4.4
6.2 9.7 6.6
4.5 6.1 4.4

After the first step of value iteration (in V1) the nodes get their immediate expected reward. The center node in this figure is the +10 reward state. The right nodes have a value of -0.1, with the optimal actions being up, left, and down; each of these has a 0.1 chance of crashing into the wall for an immediate expected reward of -1.

V2 are the values after the second step of value iteration. Consider the node that is immediately to the left of the +10 rewarding state. Its optimal value is to go to the right; it has a 0.7 chance of getting a reward of 10 in the following state, so that is worth 9 (10 times the discount of 0.9) to it now. The expected reward for the other possible resulting states is 0. Thus, the value of this state is 0.7*9=6.3.

Consider the node immediately to the right of the +10 rewarding state after the second step of value iteration. The agent’s optimal action in this state is to go left. The value of this state is

𝑃𝑟𝑜𝑏𝑅𝑒𝑤𝑎𝑟𝑑Future Value 0.7*(0+0.9*10)Agent goes left+0.1*(0+0.9*-0.1)Agent goes up+0.1*(-1+0.9*-0.1)Agent goes right+0.1*(0+0.9*-0.1)Agent goes down

which evaluates to 6.173, which is approximated to 6.2 in V2 above.

The +10 reward state has a value less than 10 in V2 because the agent gets flung to one of the corners and these corners look bad at this stage.

After the next step of value iteration, shown on the right-hand side of the figure, the effect of the +10 reward has progressed one more step. In particular, the corners shown get values that indicate a reward in 3 steps.

An applet is available on the book website showing the details of value iteration for this example.

The value iteration algorithm of Figure 9.16 has an array for each stage, but it really only needs to store the current and the previous arrays. It can update one array based on values from the other.

A common refinement of this algorithm is asynchronous value iteration. Rather than sweeping through the states to create a new value function, asynchronous value iteration updates the states one at a time, in any order, and stores the values in a single array. Asynchronous value iteration can store either the Q[s,a] array or the V[s] array. Figure 9.17 shows asynchronous value iteration when the Q array is stored. It converges faster than value iteration and is the basis of some of the algorithms for reinforcement learning. Termination can be difficult to determine if the agent must guarantee a particular error, unless it is careful about how the actions and states are selected. Often, this procedure is run indefinitely as an anytime algorithm where it is always prepared to give its best estimate of the optimal action in a state when asked.

1: procedure Asynchronous_value_iteration(S,A,P,R)
2:      Inputs
3:          S is the set of all states
4:          A is the set of all actions
5:          P is state transition function specifying P(ss,a)
6:          R is a reward function R(s,a)      
7:      Output
8:          π[s] approximately optimal policy
9:          Q[S,A] value function      
10:      Local
11:          real array Q[S,A]
12:          action array π[S]      
13:      assign Q[S,A] arbitrarily
14:      repeat
15:          select a state s
16:          select an action a
17:          Q[s,a]=R(s,a)+γ*sP(ss,a)*maxaQ[s,a]
18:      until termination
19:      for each state s do
20:          π[s]=argmaxaQ[s,a]      
21:      return π,Q
Figure 9.17: Asynchronous value iteration for MDPs

Asynchronous value iteration could also be implemented by storing just the V[s] array. In that case, the algorithm selects a state s and carries out the update:


Although this variant stores less information, it is more difficult to extract the policy. It requires one extra backup to determine which action a results in the maximum value. This can be done using

Example 9.33.

In Example 9.32, the state one step up and one step to the left of the +10 reward state only had its value updated after three value iterations, in which each iteration involved a sweep through all of the states.

In asynchronous value iteration, the +10 reward state can be chosen first. Next, the node to its left can be chosen, and its value will be 0.7*0.9*10=6.3. Next, the node above that node could be chosen, and its value would become 0.7*0.9*6.3=3.969. Note that it has a value that reflects that it is close to a +10 reward after considering 3 states, not 300 states, as does value iteration.