#### 11.3.9.1 SARSA with Linear Function Approximation

You can use a linear function of features to approximate the *Q*-function in SARSA. This algorithm uses the on-policy
method SARSA, because the agent's experiences sample the reward from the policy
the agent is actually following, rather than sampling an optimum
policy.

A number of ways are available to get a feature-based representation of
the *Q*-function. In this section, we
use features of both the state and the action to provide features for
the linear function.

Suppose *F _{1},...,F_{n}* are numerical features of the state and the action.
Thus,

*F*provides the value for the

_{i}(s,a)*i*th feature for state

*s*and action

*a*. These features are typically binary, with domain

*{0,1}*, but they can also be other numerical features. These features will be used to represent the

*Q*-function.

Q_{w}(s,a) = w_{0}+w_{1}F_{1}(s,a) + ...+ w_{n}F_{n}(s,a)

for some tuple of weights,
*w=⟨w _{0},w_{1},...,w_{n}⟩*. Assume that there is an
extra feature

*F*whose value is always 1, so that

_{0}*w*does not have to be a special case.

_{0}**Example 11.14:**In the grid game of Example 11.8, some possible features are the following:

*F*has value 1 if action_{1}(s,a)*a*would most likely take the agent from state*s*into a location where a monster could appear and has value 0 otherwise.*F*has value 1 if action_{2}(s,a)*a*would most likely take the agent into a wall and has value 0 otherwise.*F*has value 1 if step_{3}(s,a)*a*would most likely take the agent toward a prize.*F*has value 1 if the agent is damaged in state_{4}(s,a)*s*and action*a*takes it toward the repair station.*F*has value_{5}(s,a)*1*if the agent is damaged and action*a*would most likely take the agent into a location where a monster could appear and has value 0 otherwise. That is, it is the same as*F*but is only applicable when the agent is damaged._{1}(s,a)*F*has value 1 if the agent is damaged in state_{6}(s,a)*s*and has value 0 otherwise.*F*has value 1 if the agent is not damaged in state_{7}(s,a)*s*and has value 0 otherwise.*F*has value 1 if the agent is damaged and there is a prize ahead in direction_{8}(s,a)*a*.*F*has value 1 if the agent is not damaged and there is a prize ahead in direction_{9}(s,a)*a*.*F*has the value of the_{10}(s,a)*x*-value in state*s*if there is a prize at location*P*in state_{0}*s*. That is, it is the distance from the left wall if there is a prize at location*P*._{0}*F*has the value_{11}(s,a)*4-x*, where*x*is the horizontal position in state*s*if there is a prize at location*P*in state_{0}*s*. That is, it is the distance from the right wall if there is a prize at location*P*._{0}*F*to_{12}(s,a)*F*are like_{29}(s,a)*F*and_{10}*F*for different combinations of the prize location and the distance from each of the four walls. For the case where the prize is at location_{11}*P*, the_{0}*y*-distance could take into account the wall.

An example linear function is

Q(s,a) = 2.0 -1.0*F _{1}(s,a) -0.4* F_{2}(s,a) -1.3*F_{3}(s,a)-0.5*F _{4}(s,a)-1.2 * F_{5}(s,a)-1.6 * F_{6}(s,a)+3.5 * F_{7}(s,a)+0.6*F _{8}(s,a)+0.6 * F_{9}(s,a)-0.0 * F_{10}(s,a)+1.0 * F_{11}(s,a)+....

These are the learned values (to one decimal place) for one run of the algorithm that follows.

An experience in SARSA of the form *⟨s,a,r,s',a'⟩* (the agent was in state *s*, did action *a*, and received reward *r* and
ended up in state *s'*, in which it decided to do action *a'*) provides
the new estimate of *r+γQ(s',a')* to update *Q(s,a)*.
This experience can be used as a data point for **linear regression**. Let
*δ=r+γQ(s',a')-Q(s,a)*. Using
Equation (7.3.2), weight
*w _{i}* is updated by

w_{i}←w_{i}+ηδF_{i}(s,a).

This update can then be incorporated into SARSA, giving the algorithm shown in Figure 11.16.

**controller**

*SARSA-FA*(

*F,γ,η*)

2:

**Inputs**

3:

*F=⟨F*: a set of features

_{1},...,F_{n}⟩4:

*γ∈[0,1]*: discount factor

5:

*η>0*: step size for gradient descent

6:

**Local**

7: weights

*w=⟨w*, initialized arbitrarily

_{0},...,w_{n}⟩8: observe current state

*s*

9: select action

*a*

10:

**repeat**

11: carry out action

*a*

12: observe reward

*r*and state

*s'*

13: select action

*a'*(using a policy based on

*Q*)

_{w}14: let

*δ= r+γQ*

_{w}(s',a')-Q_{w}(s,a)15:

**for**

*i=0*to

*n*

**do**

16:

*w*

_{i}←w_{i}+ ηδF_{i}(s,a)17:

18:

*s ←s'*

19:

*a ←a'*

20:

**until**termination

Selecting an action *a* could be done using an
*ε*-greedy function: with
probability *ε*, an agent selects a random action and otherwise it
selects an action that maximizes *Q _{w}(s,a)*.

Although this program is simple to implement, **feature
engineering** - choosing what features to include - is non-trivial. The
linear function must not only convey the best action to carry out, it
must also convey the information about what future states are useful.

Many variations of this algorithm exist. Different function approximations, such as a neural network or a
decision tree with a linear function at the leaves, could be used. A common variant
is to have a separate function for each action. This is equivalent to
having the *Q*-function approximated by a decision tree that splits on
actions and then has a linear function. It is also possible to split on
other features.

A linear function approximation can also be combined with other methods such as
SARSA*(λ)*, Q-learning, or model-based methods. Note that
some of these methods have different convergence guarantees and different
levels of performance.

**Example 11.15:**On the AIspace web site, there is an open-source implementation of this algorithm for the game of Example 11.8 with the features of Example 11.14. Try stepping through the algorithm for individual steps, trying to understand how each step updates each parameter. Now run it for a number of steps. Consider the performance using the evaluation measures of Section 11.3.5. Try to make sense of the values of the parameters learned.