11.3.9.1 SARSA with Linear Function Approximation

You can use a linear function of features to approximate the Q-function in SARSA. This algorithm uses the on-policy method SARSA because the agent's experiences sample the reward from the policy the agent is actually following, rather than from an optimal policy.

There are a number of ways to construct a feature-based representation of the Q-function. In this section, we use features of both the state and the action as the inputs to a linear function.

Suppose F1,...,Fn are numerical features of the state and the action. Thus, Fi(s,a) provides the value of the ith feature for state s and action a. These features are typically binary, with domain {0,1}, but they can also be other numerical features. These features are used to represent the Q-function:

Qw(s,a) = w0 + w1*F1(s,a) + ... + wn*Fn(s,a)

for some tuple of weights, w=⟨w0,w1,...,wn⟩. Assume that there is an extra feature F0 whose value is always 1, so that w0 does not have to be a special case.
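
As a concrete illustration, Qw(s,a) is just a dot product of the weight tuple with the feature values, with the constant feature F0=1 prepended. Here is a minimal sketch in Python, where features(s, a) is a hypothetical function returning the list [F1(s,a), ..., Fn(s,a)]:

def q_value(w, s, a, features):
    # Linear approximation Qw(s,a) = w0 + w1*F1(s,a) + ... + wn*Fn(s,a).
    # features(s, a) is assumed to return [F1(s,a), ..., Fn(s,a)];
    # the constant feature F0 = 1 is prepended so w0 needs no special case.
    f = [1.0] + list(features(s, a))
    return sum(wi * fi for wi, fi in zip(w, f))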

Example 11.14: In the grid game of Example 11.8, some possible features are the following (a code sketch of a few of them is given after this list):
  • F1(s,a) has value 1 if action a would most likely take the agent from state s into a location where a monster could appear and has value 0 otherwise.
  • F2(s,a) has value 1 if action a would most likely take the agent into a wall and has value 0 otherwise.
  • F3(s,a) has value 1 if action a would most likely take the agent toward a prize.
  • F4(s,a) has value 1 if the agent is damaged in state s and action a takes it toward the repair station.
  • F5(s,a) has value 1 if the agent is damaged and action a would most likely take the agent into a location where a monster could appear and has value 0 otherwise. That is, it is the same as F1(s,a) but is only applicable when the agent is damaged.
  • F6(s,a) has value 1 if the agent is damaged in state s and has value 0 otherwise.
  • F7(s,a) has value 1 if the agent is not damaged in state s and has value 0 otherwise.
  • F8(s,a) has value 1 if the agent is damaged and there is a prize ahead in direction a.
  • F9(s,a) has value 1 if the agent is not damaged and there is a prize ahead in direction a.
  • F10(s,a) has the value of the x-value in state s if there is a prize at location P0 in state s. That is, it is the distance from the left wall if there is a prize at location P0.
  • F11(s,a) has the value 4-x, where x is the horizontal position in state s if there is a prize at location P0 in state s. That is, it is the distance from the right wall if there is a prize at location P0.
  • F12(s,a) to F29(s,a) are like F10 and F11 for different combinations of the prize location and the distance from each of the four walls. For the case where the prize is at location P0, the y-distance could take into account the wall.
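
As a sketch of how such indicator features might be written, assume hypothetical helper predicates would_enter_monster_cell(s, a) and would_hit_wall(s, a) supplied by the game; these names are illustrative, not part of the AIspace implementation:

def f1(s, a):
    # 1 if action a would most likely take the agent into a location
    # where a monster could appear, 0 otherwise (hypothetical predicate).
    return 1.0 if would_enter_monster_cell(s, a) else 0.0

def f2(s, a):
    # 1 if action a would most likely take the agent into a wall,
    # 0 otherwise (hypothetical predicate).
    return 1.0 if would_hit_wall(s, a) else 0.0

def features(s, a):
    # The feature vector [F1(s,a), F2(s,a), ...]; the constant F0 = 1
    # is added when computing Qw.
    return [f1(s, a), f2(s, a)]   # extend with the remaining features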

An example linear function is

Q(s,a) = 2.0 - 1.0*F1(s,a) - 0.4*F2(s,a) - 1.3*F3(s,a)
         - 0.5*F4(s,a) - 1.2*F5(s,a) - 1.6*F6(s,a) + 3.5*F7(s,a)
         + 0.6*F8(s,a) + 0.6*F9(s,a) - 0.0*F10(s,a) + 1.0*F11(s,a) + ...

These are the learned values (to one decimal place) for one run of the algorithm that follows.

An experience in SARSA of the form ⟨s,a,r,s',a'⟩ (the agent was in state s, did action a, received reward r, and ended up in state s', in which it decided to do action a') provides the new estimate r+γQw(s',a') to update Qw(s,a). This experience can be used as a data point for linear regression. Let δ = r+γQw(s',a') - Qw(s,a). Using Equation (7.3.2), weight wi is updated by

wi←wi+ηδFi(s,a).
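
For example, with illustrative numbers (not taken from the run above): if γ=0.9 and η=0.01, and an experience gives r=-1, Qw(s',a')=2.0, and Qw(s,a)=3.0, then δ = -1 + 0.9*2.0 - 3.0 = -2.2, so each weight wi whose feature has Fi(s,a)=1 is changed by ηδ = -0.022, and the weights of features with value 0 are left unchanged.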

This update can then be incorporated into SARSA, giving the algorithm shown in Figure 11.16.


controller SARSA-FA(F, γ, η)
    Inputs
        F = ⟨F1,...,Fn⟩: a set of features
        γ ∈ [0,1]: discount factor
        η > 0: step size for gradient descent
    Local
        weights w = ⟨w0,...,wn⟩, initialized arbitrarily
    observe current state s
    select action a
    repeat
        carry out action a
        observe reward r and state s'
        select action a' (using a policy based on Qw)
        let δ = r + γQw(s',a') - Qw(s,a)
        for i = 0 to n do
            wi ← wi + ηδFi(s,a)
        s ← s'
        a ← a'
    until termination
Figure 11.16: SARSA with linear function approximation

Selecting an action a could be done using an ε-greedy function: with probability ε, an agent selects a random action and otherwise it selects an action that maximizes Qw(s,a).
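
As a concrete sketch of Figure 11.16 with ε-greedy action selection, here is a minimal Python version. It assumes a hypothetical environment object env with a current state env.state and a step(a) method that carries out action a and returns (reward, next_state), a list of actions, and a features(s, a) function like the one sketched earlier; it is an illustrative reimplementation, not the AIspace code.

import random

def q_value(w, s, a, features):
    # Qw(s,a) with the constant feature F0 = 1 prepended (as in the earlier sketch).
    f = [1.0] + list(features(s, a))
    return sum(wi * fi for wi, fi in zip(w, f))

def epsilon_greedy(w, s, actions, features, epsilon):
    # With probability epsilon select a random action;
    # otherwise select an action that maximizes Qw(s,a).
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_value(w, s, a, features))

def sarsa_fa(env, actions, features, n_features,
             gamma=0.9, eta=0.01, epsilon=0.1, steps=10000):
    # SARSA with linear function approximation (cf. Figure 11.16).
    w = [0.0] * (n_features + 1)                           # weights w0,...,wn
    s = env.state                                          # observe current state s
    a = epsilon_greedy(w, s, actions, features, epsilon)   # select action a
    for _ in range(steps):                                 # repeat ... until termination
        r, s_next = env.step(a)                            # carry out a; observe r and s'
        a_next = epsilon_greedy(w, s_next, actions, features, epsilon)
        delta = (r + gamma * q_value(w, s_next, a_next, features)
                 - q_value(w, s, a, features))
        f = [1.0] + list(features(s, a))
        for i in range(len(w)):                            # wi <- wi + eta*delta*Fi(s,a)
            w[i] += eta * delta * f[i]
        s, a = s_next, a_next
    return w

A typical call for the features of Example 11.14 would be w = sarsa_fa(env, actions, features, n_features=29), given such an environment.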

Although this program is simple to implement, feature engineering (choosing which features to include) is non-trivial. The linear function must not only convey the best action to carry out, it must also convey information about which future states are useful.

Many variations of this algorithm exist. Different function approximations, such as a neural network or a decision tree with a linear function at the leaves, could be used. A common variant is to have a separate function for each action. This is equivalent to having the Q-function approximated by a decision tree that splits on actions and then has a linear function. It is also possible to split on other features.
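
For instance, the separate-function-per-action variant can be sketched by keeping one weight vector per action over state-only features; state_features(s) below is a hypothetical function returning [F1(s), ..., Fk(s)]:

def init_per_action_weights(actions, k):
    # One weight vector (k state features plus the constant) per action.
    return {a: [0.0] * (k + 1) for a in actions}

def q_per_action(w, s, a, state_features):
    # Qw(s,a) using the weight vector associated with action a.
    f = [1.0] + list(state_features(s))
    return sum(wi * fi for wi, fi in zip(w[a], f))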

A linear function approximation can also be combined with other methods such as SARSA(λ), Q-learning, or model-based methods. Note that some of these methods have different convergence guarantees and different levels of performance.

Example 11.15: On the AIspace website, there is an open-source implementation of this algorithm for the game of Example 11.8 with the features of Example 11.14. Step through the algorithm one step at a time, and try to understand how each step updates each parameter. Then run it for a number of steps. Consider the performance using the evaluation measures of Section 11.3.5. Try to make sense of the values of the parameters learned.