13.4 Learning from Experiences

In reinforcement learning, an agent tries to learn the optimal policy from its history of interaction with the environment. A history of an agent is a sequence of state–action–rewards:

s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, a3, r4, s4, …

which means that the agent was in state s0 and did action a0, which resulted in it receiving reward r1 and being in state s1; then it did action a1, received reward r2, and ended up in state s2; then it did action a2, received reward r3, and ended up in state s3; and so on.

We treat this history of interaction as a sequence of experiences, where an experience is a tuple

⟨s, a, r, s′⟩

which means that the agent was in state s, it did action a, it received reward r, and it ended up in state s′. These experiences will be the data from which the agent can learn what to do. As in decision-theoretic planning, the aim is for the agent to maximize its value, which is usually the discounted reward.
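To make the data format concrete, the following is a minimal sketch (not from the text) of turning a history into ⟨s, a, r, s′⟩ experience tuples in Python; the function name and list layout are illustrative assumptions.

def history_to_experiences(states, actions, rewards):
    """Slice a history s0, a0, r1, s1, a1, r2, ... into (s, a, r, s') tuples.

    states  = [s0, s1, s2, ...]
    actions = [a0, a1, ...]   # one action per state except the last
    rewards = [r1, r2, ...]   # reward r_{i+1} follows action a_i
    """
    return [(states[i], actions[i], rewards[i], states[i + 1])
            for i in range(len(actions))]

# For example, a short history in the healthy/sick domain of Example 13.3 below:
print(history_to_experiences(["healthy", "healthy", "sick"],
                             ["relax", "party"],
                             [7, 10]))
# [('healthy', 'relax', 7, 'healthy'), ('healthy', 'party', 10, 'sick')]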

13.4.1 Q-learning

Recall that Q(s,a), where a is an action and s is a state, is the expected value (cumulative discounted reward) of doing a in state s and then following the optimal policy.

Q-learning uses temporal differences to estimate the value of Q(s,a). In Q-learning, the agent maintains a table of Q[S,A], where S is the set of states and A is the set of actions. Q[s,a] represents its current estimate of Q(s,a).

An experience ⟨s, a, r, s′⟩ provides one data point for the value of Q(s,a). The data point is that the agent received the future value r + γV(s′), where V(s′) = max_{a′} Q(s′,a′); this is the actual current reward plus the discounted estimated future value. This new data point is called a return. The agent can use the temporal difference equation (13.1) to update its estimate for Q(s,a):

Q[s,a] := Q[s,a] + α·(r + γ·max_{a′} Q[s′,a′] − Q[s,a])

or, equivalently:

Q[s,a] := (1 − α)·Q[s,a] + α·(r + γ·max_{a′} Q[s′,a′]).
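For concreteness, this update might be implemented as follows. This is an illustrative sketch rather than the book's code; Q is assumed to be a dictionary keyed by (state, action) pairs.

def q_update(Q, s, a, r, s_prime, actions, alpha, gamma):
    """One Q-learning update from a single experience (s, a, r, s')."""
    # Value of the next state: the best current estimate over actions a'.
    v_next = max(Q[(s_prime, a2)] for a2 in actions)
    # Move Q[s, a] a fraction alpha toward the return r + gamma * v_next.
    Q[(s, a)] += alpha * (r + gamma * v_next - Q[(s, a)])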

Figure 13.3 shows a Q-learning controller, where the agent is acting and learning at the same time. The do command, do(a), on line 17 specifies that the action a is the command the controller sends to the body. The reward and the resulting state are the percepts the controller receives from the body.

1: controller Q-learning(S,A,γ,alpha_fun)
2:      Inputs
3:          S is a set of states
4:          A is a set of actions
5:          γ the discount
6:          alpha_fun is a function to compute step size from counts      
7:      Local
8:          real array Q[S,A]
9:          integer array N[S,A]
10:          states s, s′
11:          action a      
12:      initialize Q[S,A] arbitrarily
13:      initialize N[S,A] to 0
14:      observe current state s
15:      repeat
16:          select an action a
17:          do(a)
18:          N[s,a]:=N[s,a]+1
19:          α:=alpha_fun(N[s,a])
20:          observe reward r and state s′
21:          Q[s,a]:=Q[s,a]+α·(r+γ·max_{a′}Q[s′,a′]−Q[s,a])
22:          s:=s′
23:      until termination
Figure 13.3: Q-learning controller

The algorithm of Figure 13.3 also maintains an array N[s,a], which counts the number of times action a was performed in state s. The function alpha_fun computes α from the count. alpha_fun(c)=10/(9+c) often works well; see Exercise 13.6. When α is fixed, the N array does not need to be maintained (but it is also used for some exploration strategies; see below).
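The controller of Figure 13.3 might be sketched in Python as follows. This is only an illustrative sketch: the environment interface (an env object with a do(action) method returning a reward and the next state, and a state attribute), the fixed step budget, and the ε-greedy action selection are assumptions not specified in the figure.

import random
from collections import defaultdict

def alpha_fun(count):
    # The step-size schedule suggested in the text: 10 / (9 + count).
    return 10 / (9 + count)

def q_learning(env, actions, gamma, alpha_fun, num_steps=1000, epsilon=0.1):
    """Q-learning controller, following Figure 13.3.

    Assumes env.do(a) carries out action a and returns (reward, next_state),
    and that env.state is the current state; this interface is an assumption.
    """
    Q = defaultdict(float)   # Q[(s, a)]; defaults to 0 on first access
    N = defaultdict(int)     # N[(s, a)]: count of doing a in s
    s = env.state            # observe current state
    for _ in range(num_steps):           # "until termination", simplified to a step budget
        # Select an action; epsilon-greedy is one simple exploration strategy.
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[(s, act)])
        r, s_prime = env.do(a)           # do(a); observe reward and resulting state
        N[(s, a)] += 1
        alpha = alpha_fun(N[(s, a)])
        target = r + gamma * max(Q[(s_prime, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s_prime
    return Q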

The Q-learner learns (an approximation of) the optimal Q-function as long as the agent explores enough; in particular, there must be no bound on the number of times it tries each action in each state (i.e., it does not eventually settle on always doing the same subset of actions in a state).

Example 13.3.

Consider the two-state MDP of Example 12.29. The agent knows there are two states {healthy, sick} and two actions {relax, party}. It does not know the model and it learns from the ⟨s, a, r, s′⟩ experiences. With discount γ = 0.8, step size α = 0.3, and Q initially 0, the following is a possible trace (to a few significant digits and with the states and actions abbreviated):

s    a    r    s′    Update = (1−α)·Q[s,a] + α·(r + γ·max_{a′} Q[s′,a′])
he   re   7    he    Q[he,re] = 0.7·0 + 0.3·(7 + 0.8·0) = 2.1
he   re   7    he    Q[he,re] = 0.7·2.1 + 0.3·(7 + 0.8·2.1) = 4.07
he   pa   10   he    Q[he,pa] = 0.7·0 + 0.3·(10 + 0.8·4.07) = 3.98
he   pa   10   si    Q[he,pa] = 0.7·3.98 + 0.3·(10 + 0.8·0) = 5.79
si   pa   2    si    Q[si,pa] = 0.7·0 + 0.3·(2 + 0.8·0) = 0.6
si   re   0    si    Q[si,re] = 0.7·0 + 0.3·(0 + 0.8·0.6) = 0.14
si   re   0    he    Q[si,re] = 0.7·0.14 + 0.3·(0 + 0.8·5.79) = 1.49

With α fixed, the Q-values will approximate, but not converge to, the values obtained with value iteration in Example 12.33. The smaller α is, the closer the estimates get to the actual Q-values, but the slower they approach them.
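The arithmetic in the trace can be checked mechanically. The following short script (a sketch, using the same abbreviations as the table) replays the seven experiences; its output agrees with the table up to the rounding used there, since the script carries full precision between steps.

# Replay the experiences from the trace above and print each update.
gamma, alpha = 0.8, 0.3
Q = {(s, a): 0.0 for s in ("he", "si") for a in ("re", "pa")}

experiences = [
    ("he", "re", 7, "he"),
    ("he", "re", 7, "he"),
    ("he", "pa", 10, "he"),
    ("he", "pa", 10, "si"),
    ("si", "pa", 2, "si"),
    ("si", "re", 0, "si"),
    ("si", "re", 0, "he"),
]

for s, a, r, s_prime in experiences:
    v_next = max(Q[(s_prime, a2)] for a2 in ("re", "pa"))
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * v_next)
    print(f"Q[{s},{a}] = {Q[(s, a)]:.3f}")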