Policy iteration starts with a policy and iteratively improves it. It starts with an arbitrary policy $\pi_0$ (an approximation to the optimal policy works best) and carries out the following steps, starting from $i=0$:
Policy evaluation: determine $V^{\pi_i}(S)$. The definition of $V^{\pi}$ is a set of $|S|$ linear equations in $|S|$ unknowns. The unknowns are the values of $V^{\pi_i}(S)$. There is an equation for each state. These equations can be solved by a linear equation solution method (such as Gaussian elimination) or they can be solved iteratively.
Policy improvement: choose $\pi_{i+1}(s) = \operatorname{argmax}_a Q^{\pi_i}(s,a)$, where the $Q$-value can be obtained from $V^{\pi_i}$ using Equation 9.2. To detect when the algorithm has converged, it should only change the policy if the new action for some state improves the expected value; that is, it should set $\pi_{i+1}(s)$ to be $\pi_i(s)$ if $\pi_i(s)$ is one of the actions that maximizes $Q^{\pi_i}(s,a)$.
Stop if there is no change in the policy, that is, if $\pi_{i+1} = \pi_i$; otherwise increment $i$ and repeat.
The algorithm is shown in Figure 9.18. Note that it only keeps the latest policy and notices if it has changed. This algorithm always halts, usually in a small number of iterations. Unfortunately, solving the set of linear equations is often time consuming.
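As an illustration, the following is a minimal sketch of policy iteration (not the book's pseudocode of Figure 9.18), assuming a small discounted MDP represented by arrays P[a][s][s'] of transition probabilities and R[a][s][s'] of rewards with discount factor gamma; these names and the array representation are illustrative assumptions:

```python
# A minimal sketch of policy iteration for a small discounted MDP.
# P[a][s][s'] are transition probabilities, R[a][s][s'] are rewards,
# gamma is the discount factor; these names are illustrative.
import numpy as np

def policy_iteration(P, R, gamma):
    n_actions, n_states, _ = P.shape
    pi = np.zeros(n_states, dtype=int)            # arbitrary initial policy pi_0
    while True:
        # Policy evaluation: solve the |S| linear equations
        #   V(s) = sum_s' P(s' | s, pi(s)) * (R(s, pi(s), s') + gamma * V(s'))
        P_pi = P[pi, np.arange(n_states)]         # |S| x |S| transitions under pi
        r_pi = np.sum(P_pi * R[pi, np.arange(n_states)], axis=1)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

        # Policy improvement: compute Q from V (as in Equation 9.2) and choose
        # a greedy action, keeping the old action unless it is strictly improved
        Q = np.einsum('asn,asn->as', P, R + gamma * V)
        new_pi = pi.copy()
        for s in range(n_states):
            best = int(np.argmax(Q[:, s]))
            if Q[best, s] > Q[pi[s], s]:
                new_pi[s] = best
        if np.array_equal(new_pi, pi):            # no change in the policy: stop
            return pi, V
        pi = new_pi
```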
A variant of policy iteration, called modified policy iteration, is obtained by noticing that the agent is not required to evaluate the policy to improve it; it can just carry out a number of backup steps using Equation 9.2 and then do an improvement.
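Under the same illustrative assumptions as the sketch above, the modified variant replaces the exact linear solve with a fixed number of backup steps before each improvement; the backup count k and the fixed number of sweeps below are arbitrary illustrative choices, since with approximate values an unchanged policy is not a reliable stopping test:

```python
def modified_policy_iteration(P, R, gamma, k=10, n_sweeps=100):
    n_actions, n_states, _ = P.shape
    pi = np.zeros(n_states, dtype=int)
    V = np.zeros(n_states)
    for _ in range(n_sweeps):
        # Partial policy evaluation: k backup steps instead of an exact solve
        P_pi = P[pi, np.arange(n_states)]
        r_pi = np.sum(P_pi * R[pi, np.arange(n_states)], axis=1)
        for _ in range(k):
            V = r_pi + gamma * P_pi @ V
        # Policy improvement, as before
        Q = np.einsum('asn,asn->as', P, R + gamma * V)
        pi = np.argmax(Q, axis=0)
    return pi, V
```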
Policy iteration is useful for systems that are too big to be represented directly as MDPs. Suppose a controller has some parameters that can be varied. An estimate of the derivative of the cumulative discounted reward of a parameter $a$ in some context $s$, which corresponds to the derivative of $Q(s,a)$, can be used to improve the parameter. Such an iteratively improving controller can get into a local maximum that is not a global maximum. Policy iteration for state-based MDPs does not result in non-optimal local maxima, because it is possible to improve an action for a state without affecting other states, whereas updating parameters can affect many states at once.
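To illustrate the parameter-improvement idea (a hypothetical sketch, not the book's method): if evaluate_return(theta) estimates the cumulative discounted reward of running the controller with parameter value theta, a finite-difference estimate of its derivative can drive one improvement step, which may settle at a local rather than a global maximum:

```python
def improve_parameter(evaluate_return, theta, step=0.01, eps=1e-3):
    # evaluate_return, step, and eps are illustrative assumptions; the
    # derivative estimate plays the role of the derivative of Q(s, a)
    grad = (evaluate_return(theta + eps) - evaluate_return(theta - eps)) / (2 * eps)
    return theta + step * grad    # may converge to a non-optimal local maximum
```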