### 9.5.4 Policy Iteration

**Policy iteration** starts with a policy and iteratively
improves it. It starts with an arbitrary policy *π_0* (an approximation to the
optimal policy works best) and carries out the following steps,
starting from *i = 0*.

- Policy evaluation: determine *V^{π_i}(S)*. The definition of *V^π* is a set of *|S|* linear equations in *|S|* unknowns. The unknowns are the values of *V^{π_i}(S)*. There is an equation for each state. These equations can be solved by a linear equation solution method (such as Gaussian elimination) or they can be solved iteratively.
- Policy improvement: choose *π_{i+1}(s) = argmax_a Q^{π_i}(s, a)*, where the *Q*-value can be obtained from *V* using Equation (9.5.1). To detect when the algorithm has converged, it should only change the policy if the new action for some state improves the expected value; that is, it should set *π_{i+1}(s)* to be *π_i(s)* if *π_i(s)* is one of the actions that maximizes *Q^{π_i}(s, a)*.
- Stop if there is no change in the policy, that is, if *π_{i+1} = π_i*; otherwise increment *i* and repeat.
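The policy-evaluation step can be done as a single linear solve. The sketch below assumes a tabular policy-specific transition matrix `P_pi` and expected-reward vector `r_pi` (these names and the example numbers are illustrative, not from the text):

```python
import numpy as np

def evaluate_policy(P_pi, r_pi, gamma):
    """Solve V = r_pi + gamma * P_pi @ V, i.e. (I - gamma * P_pi) V = r_pi.

    P_pi[s, s'] is P(s' | s, pi(s)); r_pi[s] is the expected immediate
    reward in state s under pi. One exact solve replaces iterative backups.
    """
    n = P_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

# Tiny two-state example (made-up numbers):
P_pi = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
r_pi = np.array([1.0, 0.0])
V = evaluate_policy(P_pi, r_pi, gamma=0.9)
```

The solve costs *O(|S|³)* with Gaussian elimination, which is why iterative evaluation is attractive for large state spaces.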

```
procedure Policy_Iteration(S, A, P, R)
    Inputs:
        S is the set of all states
        A is the set of all actions
        P is the state transition function specifying P(s'|s,a)
        R is a reward function R(s,a,s')
    Output: optimal policy π
    Local:
        action array π[S]
        Boolean variable noChange
        real array V[S]

    set π arbitrarily
    repeat
        noChange ← true
        solve V[s] = ∑_{s'∈S} P(s'|s,π[s]) (R(s,π[s],s') + γV[s'])
        for each s ∈ S do
            QBest ← V[s]
            for each a ∈ A do
                Qsa ← ∑_{s'∈S} P(s'|s,a) (R(s,a,s') + γV[s'])
                if Qsa > QBest then
                    π[s] ← a
                    QBest ← Qsa
                    noChange ← false
    until noChange
    return π
```

The algorithm is shown in Figure 9.16. Note that it keeps only the latest policy and notices whether it has changed. This algorithm always halts, usually after a small number of iterations. Unfortunately, solving the set of linear equations is often time consuming.
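A minimal Python rendering of the algorithm can be sketched as follows, assuming the MDP is given as dense arrays `P[s, a, s']` and `R[s, a, s']` (the array encoding and tolerance are my assumptions, not the book's):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Policy iteration for a tabular MDP.

    P[s, a, s'] = P(s' | s, a); R[s, a, s'] = R(s, a, s').
    Returns an optimal policy (array of action indices) and its value V.
    """
    n_s, n_a, _ = P.shape
    pi = np.zeros(n_s, dtype=int)          # arbitrary initial policy
    while True:
        # Policy evaluation: solve the |S| linear equations exactly.
        P_pi = P[np.arange(n_s), pi]                              # (n_s, n_s)
        r_pi = np.sum(P[np.arange(n_s), pi] * R[np.arange(n_s), pi], axis=1)
        V = np.linalg.solve(np.eye(n_s) - gamma * P_pi, r_pi)
        # Policy improvement: switch only on a strict improvement,
        # so the algorithm halts when the policy is already optimal.
        Q = np.sum(P * (R + gamma * V), axis=2)                   # (n_s, n_a)
        no_change = True
        for s in range(n_s):
            best_a = np.argmax(Q[s])
            if Q[s, best_a] > Q[s, pi[s]] + 1e-12:
                pi[s] = best_a
                no_change = False
        if no_change:
            return pi, V
```

The strict-improvement test mirrors the convergence-detection rule in the text: a state keeps its old action whenever that action already maximizes the *Q*-value.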

A variant of policy iteration, called **modified policy iteration**,
is obtained by noticing that the agent is not required to evaluate the policy to improve it; it can just carry out a number of backup steps
[using Equation (9.5.1)] and then do an
improvement.
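Modified policy iteration can be sketched by replacing the exact linear solve with a few backup sweeps under the current policy; the number of sweeps `k` and the stopping tolerance below are assumed tuning knobs, not values from the text:

```python
import numpy as np

def modified_policy_iteration(P, R, gamma=0.9, k=5, tol=1e-8):
    """Like policy iteration, but policy evaluation is approximated by
    k sweeps of the backup V <- r_pi + gamma * P_pi @ V instead of an
    exact linear solve.
    """
    n_s, n_a, _ = P.shape
    pi = np.zeros(n_s, dtype=int)
    V = np.zeros(n_s)
    while True:
        P_pi = P[np.arange(n_s), pi]
        r_pi = np.sum(P[np.arange(n_s), pi] * R[np.arange(n_s), pi], axis=1)
        for _ in range(k):                       # partial policy evaluation
            V = r_pi + gamma * P_pi @ V
        Q = np.sum(P * (R + gamma * V), axis=2)  # policy improvement
        new_pi = Q.argmax(axis=1)
        if np.all(new_pi == pi) and np.max(np.abs(Q.max(axis=1) - V)) < tol:
            return pi, V
        pi = new_pi
```

With `k = 1` this behaves like value iteration; with `k` large it approaches full policy iteration, trading solve cost against the number of improvement rounds.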

The idea behind policy iteration is also useful for systems that are too big to
be represented directly as MDPs. Suppose a controller has
some parameters that can be varied. An estimate of the derivative of
the cumulative discounted reward with respect to a
parameter *a* in some context *s*, which corresponds to
the derivative of *Q(s,a)*, can be used to improve the parameter. Such an iteratively improving controller can
get into a local maximum that is not a global maximum. Policy iteration
for MDPs does not result in non-optimal local maxima, because it is possible
to improve an action for a state without affecting other
states, whereas updating parameters can affect many states at once.
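This parameter-improvement idea can be sketched with a finite-difference derivative estimate; the controller, its single parameter, the step sizes, and the reward surface below are all illustrative assumptions:

```python
def improve_parameter(cumulative_reward, theta, eps=1e-4, step=0.1, iters=100):
    """Hill-climb one controller parameter theta using a central-difference
    estimate of d(cumulative reward)/d(theta). As noted in the text, this
    may stop at a local maximum that is not a global maximum.
    """
    for _ in range(iters):
        grad = (cumulative_reward(theta + eps)
                - cumulative_reward(theta - eps)) / (2 * eps)
        theta += step * grad
    return theta

# Illustrative reward surface with a single maximum at theta = 2:
theta_star = improve_parameter(lambda t: -(t - 2.0) ** 2, theta=0.0)
```

On a surface with several local maxima, the same procedure would converge to whichever maximum is nearest the starting point, which is exactly the failure mode the paragraph describes.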