9.5.1 Value of a Policy

The expected value of a policy π for the discounted reward, with discount γ, is defined in terms of two interrelated functions, Vπ and Qπ.

Let Qπ(s,a), where s is a state and a is an action, be the expected value of doing a in state s and then following policy π. Recall that Vπ(s), where s is a state, is the expected value of following policy π in state s.

Qπ and Vπ can be defined recursively in terms of each other. If the agent is in state s, performs action a, and arrives in state s', it gets the immediate reward R(s,a,s') plus the discounted future reward, γVπ(s'). When the agent is planning, it does not know the actual resulting state, so it uses the expected value, averaged over the possible resulting states, weighted by their probabilities:

Qπ(s,a) = ∑_{s'} P(s'|s,a) (R(s,a,s') + γVπ(s')).
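
As a minimal sketch of this equation, the Python function below computes Qπ(s,a) from a transition model, a reward function, and given values of Vπ. The dictionary layout, the names q_value, P, R, and V, and the two-state example with its numbers are all assumptions made for illustration; none of them come from the text.

```python
def q_value(s, a, P, R, V, gamma):
    """Expected value of doing a in s and then following the policy valued by V.

    Assumed layout (hypothetical, for illustration only):
      P[(s, a)]   dict mapping each next state s2 to P(s2 | s, a)
      R(s, a, s2) function returning the immediate reward
      V[s2]       value of following the policy from s2
    """
    return sum(p * (R(s, a, s2) + gamma * V[s2])
               for s2, p in P[(s, a)].items())


def reward(s, a, s2):
    # Made-up reward: the same value whatever the outcome.
    return 10.0


# Made-up transition model: one state-action pair with two possible outcomes.
P = {("healthy", "party"): {"healthy": 0.7, "sick": 0.3}}
V = {"healthy": 20.0, "sick": 5.0}  # made-up values of the policy

q = q_value("healthy", "party", P, reward, V, gamma=0.8)
# 0.7 * (10 + 0.8 * 20) + 0.3 * (10 + 0.8 * 5) = 18.2 + 4.2 = 22.4
```

Each term of the sum pairs one possible resulting state with its probability, which is the averaging described above.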

Vπ(s) is obtained by doing the action specified by π in state s and then following π thereafter:

Vπ(s) = Qπ(s,π(s)).
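
Substituting the definition of Qπ into this equation gives Vπ in terms of itself, which suggests one way to compute it numerically. The sketch below repeatedly applies the two equations until the values stop changing. This iterative scheme is an assumption here (the text defines the functions but does not prescribe how to solve for them), and it relies on γ < 1 for convergence. The names evaluate_policy, pi, P, and R, the data layout, and the two-state example are hypothetical.

```python
def evaluate_policy(states, pi, P, R, gamma, tol=1e-10, max_iters=10_000):
    """Approximate V_pi by repeated substitution (assumes 0 <= gamma < 1).

    Assumed layout (hypothetical, for illustration only):
      pi[s]       action the policy selects in state s
      P[(s, a)]   dict mapping each next state s2 to P(s2 | s, a)
      R(s, a, s2) function returning the immediate reward
    """
    V = {s: 0.0 for s in states}  # arbitrary starting estimate
    for _ in range(max_iters):
        new_V = {}
        for s in states:
            a = pi[s]
            # V_pi(s) = Q_pi(s, pi(s)), with Q_pi expanded as an expectation
            new_V[s] = sum(p * (R(s, a, s2) + gamma * V[s2])
                           for s2, p in P[(s, a)].items())
        delta = max(abs(new_V[s] - V[s]) for s in states)
        V = new_V
        if delta < tol:
            break
    return V


def reward(s, a, s2):
    # Made-up reward: 1 for every transition.
    return 1.0


# Made-up deterministic example: "a" moves to "b", and "b" stays in "b".
states = ["a", "b"]
pi = {"a": "go", "b": "stay"}
P = {("a", "go"): {"b": 1.0}, ("b", "stay"): {"b": 1.0}}

V = evaluate_policy(states, pi, P, reward, gamma=0.5)
# Exact solution: V(b) = 1 + 0.5 * V(b), so V(b) = 2; V(a) = 1 + 0.5 * 2 = 2.
```

For a small state space, the same values could instead be found by solving the linear system that the two equations define; the iterative form is used here only because it mirrors the recursive definition directly.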