9.5.2 Value of an Optimal Policy

Let Q*(s,a), where s is a state and a is an action, be the expected value of doing a in state s and then following the optimal policy. Let V*(s), where s is a state, be the expected value of following an optimal policy from state s.

Q* can be defined analogously to Q^π:

Q*(s,a) = ∑_{s'} P(s'|s,a) (R(s,a,s') + γV*(s')).
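As a concrete reading of this equation, here is a minimal Python sketch of the one-step lookahead it describes. The representation is an assumption for illustration, not the book's code: P[s][a][s2] holds the transition probability P(s'|s,a), R(s,a,s2) is the reward function, and V maps each state to an estimate of V*.

```python
def q_from_v(s, a, states, P, R, V, gamma):
    """Expected value of doing a in state s and then acting optimally,
    computed by a one-step lookahead over successor states s2.

    Assumed representation (hypothetical, for illustration):
    P[s][a][s2] = P(s2 | s, a), R(s, a, s2) = immediate reward,
    V[s2] = current estimate of V*(s2)."""
    return sum(P[s][a][s2] * (R(s, a, s2) + gamma * V[s2]) for s2 in states)
```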

V*(s) is obtained by performing the action that gives the best value in each state:

V*(s) = max_a Q*(s,a).
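The two equations above are mutually recursive: Q* is defined in terms of V*, and V* in terms of Q*. Iterating them from an arbitrary initial estimate is one way to approximate V* (this iteration is known as value iteration). The sketch below does exactly that, reusing the hypothetical q_from_v and representation from the previous sketch.

```python
def compute_v_star(states, actions, P, R, gamma, theta=1e-6):
    """Approximate V* by repeatedly applying
    V(s) <- max_a sum_{s'} P(s'|s,a) (R(s,a,s') + gamma*V(s'))
    until the largest change in any state's value is below theta."""
    V = {s: 0.0 for s in states}  # arbitrary initial estimate
    while True:
        delta = 0.0
        for s in states:
            v_new = max(q_from_v(s, a, states, P, R, V, gamma) for a in actions)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:  # estimates have (approximately) converged
            return V
```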

An optimal policy π* is one of the policies that gives the best value for each state:

π*(s) = argmax_a Q*(s,a).

Note that argmax_a Q*(s,a) is a function of state s, and its value is one of the actions a that results in the maximum value of Q*(s,a).
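In code, extracting such a policy is a one-step argmax over Q*; when several actions tie for the maximum, any one of them may be chosen. This sketch again reuses the hypothetical q_from_v and representation from above.

```python
def extract_policy(states, actions, P, R, V, gamma):
    """Greedy policy with respect to V*: for each state, pick an action
    that maximizes Q*(s,a). Python's max returns the first action
    achieving the maximum, which is one valid choice among any ties."""
    return {s: max(actions, key=lambda a: q_from_v(s, a, states, P, R, V, gamma))
            for s in states}
```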