9.5 Decision Processes

9.5.3 Policy Iteration

Policy iteration starts with a policy and iteratively improves it. Starting with an arbitrary policy π0 (an approximation to the optimal policy works best), it carries out the following steps from i=0:

  • Policy evaluation: determine Vπi(s) for each state s. The definition of Vπi gives a set of |S| linear equations in |S| unknowns, one equation for each state; the unknowns are the values Vπi(s). These equations can be solved by a linear equation solution method (such as Gaussian elimination) or they can be solved iteratively; a code sketch of the direct solution is given after this list.

  • Policy improvement: choose πi+1(s) = argmax_a Qπi(s,a), where the Q-value can be obtained from V using Equation 9.2. To detect when the algorithm has converged, it should only change the policy if the new action for some state improves the expected value; that is, it should set πi+1(s) to be πi(s) if πi(s) is one of the actions that maximizes Qπi(s,a).

  • Stop if there is no change in the policy, that is, if πi+1=πi; otherwise increment i and repeat.
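
To make the policy evaluation step concrete, here is a minimal Python sketch (an illustrative assumption about the representation, not the book's code). It assumes P is stored as an array P[s, a, s'] of transition probabilities and R as an array R[s, a] of expected rewards, and solves the |S| linear equations directly using numpy.

import numpy as np

def evaluate_policy(P, R, pi, gamma=0.9):
    """Exact policy evaluation for a stationary policy pi.

    P[s, a, s1] is P(s1|s,a), R[s, a] is the expected reward, and
    pi[s] is the action the policy chooses in state s.  V^pi satisfies
    V = R_pi + gamma * P_pi V, a system of |S| linear equations in
    |S| unknowns, solved here by a direct method.
    """
    n_states = len(pi)
    P_pi = P[np.arange(n_states), pi]   # |S| x |S| transition matrix under pi
    R_pi = R[np.arange(n_states), pi]   # expected reward vector under pi
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)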

1: procedure Policy_iteration(S,A,P,R)
2:      Inputs
3:          S is the set of all states
4:          A is the set of all actions
5:          P is the state transition function specifying P(s'|s,a)
6:          R is a reward function R(s,a)      
7:      Output
8:          optimal policy π
9:      Local
10:          action array π[S]
11:          Boolean variable noChange
12:          real array V[S]      
13:      set π arbitrarily
14:      repeat
15:          noChange:=true
16:          Solve V[s]=R(s,π[s])+γ*∑_{s'∈S}P(s'|s,π[s])*V[s']
17:          for each s∈S do
18:               QBest:=V[s]
19:               for each a∈A do
20:                   Qsa:=R(s,a)+γ*∑_{s'∈S}P(s'|s,a)*V[s']
21:                   if Qsa>QBest then
22:                        π[s]:=a
23:                        QBest:=Qsa
24:                        noChange:=false                                          
25:      until noChange
26:      return π
Figure 9.18: Policy iteration for MDPs

The algorithm is shown in Figure 9.18. Note that it only keeps the latest policy and notices if it has changed. This algorithm always halts, usually in a small number of iterations. Unfortunately, solving the set of linear equations is often time consuming.
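
Under the same assumed array representation, a direct Python rendering of Figure 9.18 might look as follows (a sketch, not the book's code). It reuses the evaluate_policy function sketched above, and it only changes an action when that strictly improves the Q-value, so ties keep the current policy and the noChange test detects convergence.

import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Policy iteration as in Figure 9.18, using the array
    representation assumed in evaluate_policy above."""
    n_states = P.shape[0]
    pi = np.zeros(n_states, dtype=int)        # set pi arbitrarily
    while True:
        V = evaluate_policy(P, R, pi, gamma)  # policy evaluation
        Q = R + gamma * P @ V                 # Q[s, a] from V (Equation 9.2)
        no_change = True
        for s in range(n_states):
            a = int(np.argmax(Q[s]))
            if Q[s, a] > Q[s, pi[s]]:         # only change on strict improvement
                pi[s] = a
                no_change = False
        if no_change:
            return pi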

A variant of policy iteration, called modified policy iteration, is obtained by noticing that the agent does not need to evaluate the policy exactly in order to improve it; it can carry out a number of backup steps using Equation 9.2 and then do an improvement, as sketched below.
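
Under the same assumptions as the sketches above, the evaluation step of modified policy iteration might be approximated as follows: instead of solving the linear system exactly, it starts from the current value estimate and applies a fixed number of backups restricted to the current policy (the names approx_evaluate_policy and n_backups are illustrative, not from the book).

import numpy as np

def approx_evaluate_policy(P, R, pi, V, gamma=0.9, n_backups=5):
    """Approximate policy evaluation by repeated backups.

    Rather than solving the |S| linear equations exactly, apply
    n_backups sweeps of
        V[s] := R(s,pi[s]) + gamma * sum_{s'} P(s'|s,pi[s]) * V[s'],
    starting from the current estimate V.
    """
    n_states = len(pi)
    P_pi = P[np.arange(n_states), pi]   # transitions under the current policy
    R_pi = R[np.arange(n_states), pi]   # rewards under the current policy
    for _ in range(n_backups):
        V = R_pi + gamma * P_pi @ V
    return V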

Policy iteration is useful for systems that are too big to be represented directly as MDPs. Suppose a controller has some parameters that can be varied. An estimate of the derivative of the cumulative discounted reward of a parameter a in some context s, which corresponds to the derivative of Q(a,s), can be used to improve the parameter. Such an iteratively improving controller can get into a local maximum that is not a global maximum. Policy iteration for state-based MDPs does not result in non-optimal local maxima, because it is possible to improve an action for a state without affecting other states, whereas updating parameters can affect many states at once.
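
As a loose illustration of this parameter-improvement idea (every name here, such as estimate_return, step_size, and delta, is an assumption for the sketch, not from the text), a single controller parameter can be nudged uphill using a finite-difference estimate of the derivative of the estimated cumulative discounted reward. Such an update can converge to a local maximum, as noted above.

def improve_parameter(theta, estimate_return, step_size=0.01, delta=0.01):
    """One gradient-ascent step on a controller parameter.

    estimate_return(theta) is assumed to estimate the cumulative
    discounted reward of running the controller with parameter theta
    (for example, by averaging simulated runs).  The derivative is
    approximated by a central finite difference.
    """
    grad = (estimate_return(theta + delta) - estimate_return(theta - delta)) / (2 * delta)
    return theta + step_size * grad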