9.5.2 Value of an Optimal Policy

Let Q*(s,a), where s is a state and a is an action, be the expected value of doing a in state s and then following the optimal policy. Let V*(s), where s is a state, be the expected value of following an optimal policy from state s.

Q* can be defined analogously to Q^π:

Q*(s,a) = ∑_{s'} P(s'|s,a) (R(s,a,s') + γV*(s')).
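As a concrete reading of this equation, here is a minimal Python sketch of the one-step lookahead it describes. The representation is an assumption for illustration, not the book's code: P[s][a][s2] holds the transition probability P(s'|s,a), R(s,a,s2) is the reward function, and V maps each state to an estimate of V*.

```python
def q_from_v(s, a, states, P, R, V, gamma):
    """Expected value of doing a in state s and then acting optimally,
    computed by a one-step lookahead over successor states s2.

    Assumed representation (hypothetical, for illustration):
    P[s][a][s2] = P(s2 | s, a), R(s, a, s2) = immediate reward,
    V[s2] = current estimate of V*(s2)."""
    return sum(P[s][a][s2] * (R(s, a, s2) + gamma * V[s2]) for s2 in states)
```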

V*(s) is obtained by performing the action that gives the best value in each state:

V*(s) = max_a Q*(s,a).
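The two equations above are mutually recursive: Q* is defined in terms of V*, and V* in terms of Q*. Iterating them from an arbitrary initial estimate is one way to approximate V* (this iteration is known as value iteration). The sketch below does exactly that, reusing the hypothetical q_from_v and representation from the previous sketch.

```python
def compute_v_star(states, actions, P, R, gamma, theta=1e-6):
    """Approximate V* by repeatedly applying
    V(s) <- max_a sum_{s'} P(s'|s,a) (R(s,a,s') + gamma*V(s'))
    until the largest change in any state's value is below theta."""
    V = {s: 0.0 for s in states}  # arbitrary initial estimate
    while True:
        delta = 0.0
        for s in states:
            v_new = max(q_from_v(s, a, states, P, R, V, gamma) for a in actions)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:  # estimates have (approximately) converged
            return V
```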

An optimal policy π* is one of the policies that gives the best value for each state:

π*(s) = argmax_a Q*(s,a).

Note that argmax_a Q*(s,a) is a function of state s, and its value is one of the actions a that results in the maximum value of Q*(s,a).
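In code, extracting such a policy is a one-step argmax over Q*; when several actions tie for the maximum, any one of them may be chosen. This sketch again reuses the hypothetical q_from_v and representation from above.

```python
def extract_policy(states, actions, P, R, V, gamma):
    """Greedy policy with respect to V*: for each state, pick an action
    that maximizes Q*(s,a). Python's max returns the first action
    achieving the maximum, which is one valid choice among any ties."""
    return {s: max(actions, key=lambda a: q_from_v(s, a, states, P, R, V, gamma))
            for s in states}
```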