# 9.5.3 Policy Iteration

Policy iteration starts with an arbitrary policy $\pi_{0}$ (an approximation to the optimal policy works best) and iteratively improves it, carrying out the following steps starting from $i=0$:

• Policy evaluation: determine $V^{\pi_{i}}(s)$ for each state $s$. The definition of $V^{\pi}$ gives a set of $|S|$ linear equations in $|S|$ unknowns, one equation per state; the unknowns are the values $V^{\pi_{i}}(s)$. These equations can be solved by a linear equation solution method (such as Gaussian elimination) or they can be solved iteratively.

• Policy improvement: choose $\pi_{i+1}(s)=\arg\max_{a}Q^{\pi_{i}}(s,a)$, where the $Q$-value can be obtained from $V$ using Equation 9.2. So that the algorithm can detect convergence, the policy should only change if the new action for some state improves the expected value; that is, $\pi_{i+1}(s)$ should be set to $\pi_{i}(s)$ whenever $\pi_{i}(s)$ is already one of the actions that maximize $Q^{\pi_{i}}(s,a)$.

• Stop if the policy is unchanged, that is, if $\pi_{i+1}=\pi_{i}$; otherwise increment $i$ and repeat.
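The steps above can be sketched in code. The following is a minimal illustration on a hypothetical two-state, two-action MDP (the states, actions, transition probabilities, and rewards are invented for this sketch, not taken from the text), assuming Equation 9.2 has the standard form $Q^{\pi}(s,a)=R(s,a)+\gamma\sum_{s'}P(s'\mid s,a)\,V^{\pi}(s')$. Policy evaluation is done iteratively here; a direct linear solve would work equally well.

```python
GAMMA = 0.9                       # discount factor (illustrative)
STATES = ["healthy", "sick"]      # hypothetical MDP for this sketch
ACTIONS = ["relax", "party"]

# P[s][a] = list of (probability, next_state); R[s][a] = immediate reward.
P = {
    "healthy": {"relax": [(0.95, "healthy"), (0.05, "sick")],
                "party": [(0.7, "healthy"), (0.3, "sick")]},
    "sick":    {"relax": [(0.5, "healthy"), (0.5, "sick")],
                "party": [(0.1, "healthy"), (0.9, "sick")]},
}
R = {
    "healthy": {"relax": 7.0, "party": 10.0},
    "sick":    {"relax": 0.0, "party": 2.0},
}

def q_value(s, a, V):
    """Assumed form of Equation 9.2: Q(s,a) = R(s,a) + gamma * E[V(s')]."""
    return R[s][a] + GAMMA * sum(p * V[s2] for p, s2 in P[s][a])

def evaluate(pi, eps=1e-10):
    """Policy evaluation: solve V(s) = Q(s, pi(s)) for all s, iteratively."""
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            v = q_value(s, pi[s], V)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < eps:
            return V

def policy_iteration(pi0):
    """Alternate evaluation and improvement until the policy is unchanged."""
    pi = dict(pi0)
    while True:
        V = evaluate(pi)
        changed = False
        for s in STATES:
            best = max(ACTIONS, key=lambda a: q_value(s, a, V))
            # Only change the action if it strictly improves the value,
            # so ties keep the current action and convergence is detectable.
            if q_value(s, best, V) > q_value(s, pi[s], V) + 1e-9:
                pi[s] = best
                changed = True
        if not changed:
            return pi, V
```

Starting from any initial policy, for example `policy_iteration({s: "relax" for s in STATES})`, the loop halts with a policy that is greedy with respect to its own value function.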

The algorithm is shown in Figure 9.18. Note that it only keeps the latest policy and checks whether it has changed. This algorithm always halts, usually in a small number of iterations. Unfortunately, solving the set of linear equations is often time consuming.

A variant of policy iteration, called modified policy iteration, is obtained by noticing that the agent does not need to evaluate the policy exactly before improving it; it can instead carry out a number of backup steps using Equation 9.2 and then do an improvement.
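A compact sketch of this variant, again on a hypothetical two-state MDP with invented numbers and assuming the standard backup $Q(s,a)=R(s,a)+\gamma\sum_{s'}P(s'\mid s,a)V(s')$ for Equation 9.2: instead of solving the linear system exactly, each outer iteration performs only $k$ backup sweeps before improving the policy.

```python
GAMMA = 0.9          # discount factor (illustrative)
S = [0, 1]           # hypothetical states
A = [0, 1]           # hypothetical actions

# P[s][a][s2] = transition probability; R[s][a] = immediate reward.
P = {0: {0: {0: 0.95, 1: 0.05}, 1: {0: 0.7, 1: 0.3}},
     1: {0: {0: 0.5, 1: 0.5},  1: {0: 0.1, 1: 0.9}}}
R = {0: {0: 7.0, 1: 10.0}, 1: {0: 0.0, 1: 2.0}}

def q(s, a, V):
    # Assumed form of Equation 9.2.
    return R[s][a] + GAMMA * sum(P[s][a][s2] * V[s2] for s2 in S)

def modified_policy_iteration(k=5, outer=200):
    V = {s: 0.0 for s in S}
    pi = {s: 0 for s in S}
    for _ in range(outer):
        for _ in range(k):
            # Partial evaluation: only k backup steps, not an exact solve.
            V = {s: q(s, pi[s], V) for s in S}
        # Improvement with respect to the (approximate) values.
        pi = {s: max(A, key=lambda a: q(s, a, V)) for s in S}
    return pi, V
```

With $k=1$ this reduces to value iteration with an explicit policy, and as $k\to\infty$ each evaluation becomes exact, recovering policy iteration; intermediate $k$ trades evaluation accuracy for cheaper iterations.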

Policy iteration is useful for systems that are too big to be represented directly as MDPs. Suppose a controller has some parameters that can be varied. An estimate of the derivative of the cumulative discounted reward with respect to a parameter $a$ in some context $s$, which corresponds to the derivative of $Q(s,a)$, can be used to improve the parameter. Such an iteratively improving controller can become stuck in a local maximum that is not a global maximum. Policy iteration for state-based MDPs does not result in non-optimal local maxima, because it is possible to improve an action for a state without affecting other states, whereas updating parameters can affect many states at once.
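The local-maximum phenomenon can be illustrated with a generic sketch that is not from the text: a single controller parameter is improved by gradient ascent on a hypothetical reward surface `J` with two bumps, using a finite-difference estimate of the derivative. Depending on the starting point, the same improvement procedure converges to either the local or the global maximum.

```python
import math

def J(theta):
    # Hypothetical cumulative-reward surface (invented for illustration):
    # a local maximum near theta = -1 and a global maximum near theta = 2.
    return math.exp(-(theta + 1) ** 2) + 2 * math.exp(-(theta - 2) ** 2)

def improve(theta, step=0.05, h=1e-4, iters=2000):
    """Iteratively improve the parameter by finite-difference gradient ascent."""
    for _ in range(iters):
        grad = (J(theta + h) - J(theta - h)) / (2 * h)  # derivative estimate
        theta += step * grad
    return theta
```

Starting near the smaller bump, for example `improve(-1.5)`, the parameter climbs to the local maximum near $-1$ and never reaches the global maximum near $2$, whereas `improve(1.0)` converges to the global one. Per-state policy improvement has no analogous failure mode.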