
9.5.1 Value of a Policy

The expected value of a policy π for the discounted reward, with discount γ, is defined in terms of two interrelated functions, Vπ and Qπ.

Let Qπ(s,a), where s is a state and a is an action, be the expected value of doing a in state s and then following policy π. Recall that Vπ(s), where s is a state, is the expected value of following policy π in state s.

Qπ and Vπ can be defined recursively in terms of each other. If the agent is in state s, performs action a, and arrives in state s', it gets the immediate reward R(s,a,s') plus the discounted future reward, γVπ(s'). When the agent is planning, it does not know the actual resulting state, so it uses the expected value, averaged over the possible resulting states:

Qπ(s,a) = ∑s' P(s'|s,a) (R(s,a,s') + γVπ(s')).
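
The following is a minimal sketch of this equation in Python (it is not the book's code). It assumes hypothetical representations: P[(s,a)] is a dictionary mapping each possible next state s' to P(s'|s,a), R(s,a,s') is a function returning the immediate reward, and V is a dictionary holding the current estimate of Vπ:

def q_value(s, a, P, R, V, gamma):
    """Qπ(s,a) = sum over s' of P(s'|s,a) * (R(s,a,s') + gamma * Vπ(s'))."""
    # Average the immediate reward plus discounted future value
    # over the possible resulting states s'.
    return sum(p * (R(s, a, s2) + gamma * V[s2])
               for s2, p in P[(s, a)].items())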

Vπ(s) is obtained by doing the action specified by π and then continuing to follow π:

Vπ(s) = Qπ(s,π(s)).
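
Substituting the second equation into the first gives a fixed-point equation for Vπ, which can be approximated by repeatedly applying the update Vπ(s) := Qπ(s,π(s)). Below is a hedged sketch of such an iterative policy evaluation, reusing the q_value function above; states, pi (a dictionary giving π(s) for each state), and the number of iterations are assumed names for illustration, not the book's API:

def evaluate_policy(states, pi, P, R, gamma, iterations=100):
    """Approximate Vπ by repeatedly applying Vπ(s) := Qπ(s, π(s))."""
    V = {s: 0.0 for s in states}   # initial estimate of Vπ
    for _ in range(iterations):
        # One sweep: update every state using the current estimate V.
        V = {s: q_value(s, pi[s], P, R, V, gamma) for s in states}
    return V

With 0 ≤ γ < 1, each sweep brings the estimate closer to the true Vπ, so a fixed number of iterations gives an approximation whose error shrinks geometrically.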