### 9.5.1 Value of a Policy

The expected value of a policy $\pi$ for the discounted reward, with discount $\gamma$, is defined in terms of two interrelated functions, $V^{\pi}$ and $Q^{\pi}$.

Let $Q^{\pi}(s,a)$, where $s$ is a state and $a$ is an action, be the expected value of doing $a$ in state $s$ and then following policy $\pi$. Recall that $V^{\pi}(s)$, where $s$ is a state, is the expected value of following policy $\pi$ in state $s$.

$Q^{\pi}$ and $V^{\pi}$ can be defined recursively in terms of each other. If the agent is in state $s$, performs action $a$, and arrives in state $s'$, it gets the immediate reward of $R(s,a,s')$ plus the discounted future reward, $\gamma V^{\pi}(s')$. When the agent is planning, it does not know the actual resulting state, so it uses the expected value, averaged over the possible resulting states:

$$Q^{\pi}(s,a) = \sum_{s'} P(s' \mid s,a) \left( R(s,a,s') + \gamma V^{\pi}(s') \right).$$
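As a minimal sketch, the expectation over resulting states can be written directly as a weighted sum. The dictionary layout for the transition model, the function name `q_value`, and the toy numbers below are assumptions made for illustration; they do not come from the text.

```python
def q_value(P, R, V, gamma, s, a):
    """Expected value of doing a in s and then following the policy.

    P[s][a] is assumed to map each next state s' to P(s'|s,a),
    R is a function R(s, a, s'), and V maps states to V^pi values.
    """
    return sum(prob * (R(s, a, s_next) + gamma * V[s_next])
               for s_next, prob in P[s][a].items())


# Hypothetical toy problem: from s0, action "go" reaches s1 half the time.
P = {"s0": {"go": {"s0": 0.5, "s1": 0.5}}}
V = {"s0": 0.0, "s1": 10.0}  # assumed values of following the policy


def R(s, a, s_next):
    # Assumed reward: 1 for arriving in s1, 0 otherwise.
    return 1.0 if s_next == "s1" else 0.0


q = q_value(P, R, V, gamma=0.9, s="s0", a="go")
print(q)
```

By hand, the same sum is 0.5 · (0 + 0.9 · 0) + 0.5 · (1 + 0.9 · 10) = 5, which is the value the sketch should reproduce up to floating-point rounding.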

$V^{\pi}(s)$ is obtained by doing the action specified by $\pi$ in state $s$ and then continuing to follow $\pi$:

$$V^{\pi}(s) = Q^{\pi}(s, \pi(s)).$$
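Because the two definitions refer to each other, one common way to compute $V^{\pi}$ numerically is to start from an arbitrary guess and apply both equations repeatedly until the values stop changing. The text above does not prescribe this procedure; the sketch below is one possible approach, assuming a finite state space, a deterministic policy stored as a dictionary, and $\gamma < 1$ so that the iteration converges. All names and the two-state example are hypothetical.

```python
def q_value(P, R, V, gamma, s, a):
    # Q^pi(s,a) = sum over s' of P(s'|s,a) * (R(s,a,s') + gamma * V^pi(s'))
    return sum(prob * (R(s, a, s_next) + gamma * V[s_next])
               for s_next, prob in P[s][a].items())


def evaluate_policy(P, R, pi, gamma, tol=1e-10, max_iters=100_000):
    """Approximate V^pi by iterating V(s) <- Q(s, pi(s)) to a fixed point."""
    V = {s: 0.0 for s in P}  # arbitrary initial guess
    for _ in range(max_iters):
        new_V = {s: q_value(P, R, V, gamma, s, pi[s]) for s in P}
        delta = max(abs(new_V[s] - V[s]) for s in P)
        V = new_V
        if delta < tol:  # values have effectively stopped changing
            break
    return V


# Hypothetical two-state problem: "b" is absorbing and pays 1 per step.
P = {
    "a": {"go": {"a": 0.5, "b": 0.5}},
    "b": {"stay": {"b": 1.0}},
}
pi = {"a": "go", "b": "stay"}


def R(s, a, s_next):
    # Assumed reward: 1 for any step taken from b, 0 otherwise.
    return 1.0 if s == "b" else 0.0


V = evaluate_policy(P, R, pi, gamma=0.5)
print(V)
```

For this example the equations can also be solved by hand: $V^{\pi}(b) = 1 + 0.5\,V^{\pi}(b)$ gives $V^{\pi}(b) = 2$, and $V^{\pi}(a) = 0.25\,V^{\pi}(a) + 0.5$ gives $V^{\pi}(a) = 2/3$, so the iteration should approach those values.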