# 12.9.1 SARSA with Linear Function Approximation

SARSA with Linear Function Approximation, SARSA_LFA, uses a linear function of features to approximate the $Q$-function. This algorithm uses the on-policy method SARSA, because the agent’s experiences sample the reward from the policy the agent is actually following, rather than sampling from an optimum policy.

SARSA_LFA uses features of both the state and the action. Suppose $F_{1},\dots,F_{n}$ are numerical features of the state and the action. Thus, $F_{i}(s,a)$ provides the value for the $i$th feature for state $s$ and action $a$. These features can be binary, with domain $\{0,1\}$, or other numerical features. These features will be used to represent the linear $Q$-function:

 $Q_{\overline{w}}(s,a)=w_{0}+w_{1}F_{1}(s,a)+\dots+w_{n}F_{n}(s,a)$

for some tuple of weights, $\overline{w}=\left$ that have to be learned. Assume that there is an extra feature $F_{0}(s,a)$ whose value is always 1, so that $w_{0}$ is not a special case.

###### Example 12.6.

Consider the grid game of Example 12.2. From understanding the domain, and not just treating it as a black box, some possible features that can be computed and might be useful are:

• $F_{1}(s,a)$ has value 1 if action $a$ would most likely take the agent from state $s$ into a location where a monster could appear and has value 0 otherwise.

• $F_{2}(s,a)$ has value 1 if action $a$ would most likely take the agent into a wall and has value 0 otherwise.

• $F_{3}(s,a)$ has value 1 if step $a$ would most likely take the agent toward a prize.

• $F_{4}(s,a)$ has value 1 if the agent is damaged in state $s$ and action $a$ takes it toward the repair station.

• $F_{5}(s,a)$ has value $1$ if the agent is damaged and action $a$ would most likely take the agent into a location where a monster could appear and has value 0 otherwise. That is, it is the same as $F_{1}(s,a)$ but is only applicable when the agent is damaged.

• $F_{6}(s,a)$ has value 1 if the agent is damaged in state $s$ and has value 0 otherwise.

• $F_{7}(s,a)$ has value 1 if the agent is not damaged in state $s$ and has value 0 otherwise.

• $F_{8}(s,a)$ has value 1 if the agent is damaged and there is a prize ahead in direction $a$.

• $F_{9}(s,a)$ has value 1 if the agent is not damaged and there is a prize ahead in direction $a$.

• $F_{10}(s,a)$ has the value of the $x$-value in state $s$ if there is a prize at location $P_{0}$ in state $s$. That is, it is the distance from the left wall if there is a prize at location $P_{0}$.

• $F_{11}(s,a)$ has the value $4-x$, where $x$ is the horizontal position in state $s$ if there is a prize at location $P_{0}$ in state $s$. That is, it is the distance from the right wall if there is a prize at location $P_{0}$.

• $F_{12}(s,a)$ to $F_{29}(s,a)$ are like $F_{10}$ and $F_{11}$ for different combinations of the prize location and the distance from each of the four walls. For the case where the prize is at location $P_{0}$, the $y$-distance could take into account the wall.

An example linear function is

 $\displaystyle{Q(s,a)}$ $\displaystyle=$ $\displaystyle 2.0-1.0*F_{1}(s,a)-0.4*F_{2}(s,a)-1.3*F_{3}(s,a)$ $\displaystyle\mbox{}-0.5*F_{4}(s,a)-1.2*F_{5}(s,a)-1.6*F_{6}(s,a)+3.5*F_{7}(s,a)$ $\displaystyle\mbox{}+0.6*F_{8}(s,a)+0.6*F_{9}(s,a)-0.0*F_{10}(s,a)+1.0*F_{11}(% s,a)+\cdots\,.$

These are the learned values (to one decimal place) for one run of $SARSA\_LFA$ algorithm in Figure 12.7.

An experience in SARSA of the form $\left$ (the agent was in state $s$, did action $a$, and received reward $r$ and ended up in state $s^{\prime}$, in which it decided to do action $a^{\prime}$) provides the new estimate of $r+\gamma Q(s^{\prime},a^{\prime})$ to update $Q(s,a)$. This experience can be used as a data point for linear regression. Let $\delta=r+\gamma Q(s^{\prime},a^{\prime})-Q(s,a)$. Using Equation 7.3, weight $w_{i}$ is updated by

 $w_{i}\;{:}{=}\;\mbox{}w_{i}+\eta*\delta*F_{i}(s,a).$

This update can then be incorporated into SARSA, giving the algorithm shown in Figure 12.7.

Although this program is simple to implement, feature engineering – choosing what features to include – is non-trivial. The linear function must not only convey the best action to carry out, it must also convey the information about what future states are useful.

###### Example 12.7.

On the AIspace website, there is an open-source implementation of this algorithm for the game of Example 12.2 with the features of Example 12.6. Try stepping through the algorithm for individual steps, trying to understand how each step updates each parameter. Now run it for a number of steps. Consider the performance using the evaluation measures of Section 12.6. Try to make sense of the values of the parameters learned.

Many variations of this algorithm exist:

• This algorithm tends to overfit to current experiences, and to forget about old experiences, so that when it goes back to a part of the state-space it has not visited recently, it will have to relearn. One modification is to remember old experiences ($\left$ tuples) and to carry out some steps of action replay, by doing some weight updates based on random previous experiences. Updating the weights requires the use of the next action $a^{\prime}$, which should be chosen according to the current policy, not the policy that was under effect when the experience occurred. If memory becomes an issue, some of the old experiences can be discarded.

• Different function approximations, such as a decision tree with a linear function at the leaves, could be used.

• A common variant is to have a separate function for each action. This is equivalent to the $Q$-function approximated by a decision tree that splits on actions and then has a linear function. It is also possible to split on other features.

• A linear function approximation can also be combined with other methods such as Q-learning, or model-based methods.

• In deep reinforcement learning, a deep learner is used instead of the linear function approximation. This means that the features do not need to be engineered, but can be learned. Deep learning requires a large amount of data, and many iterations to learn, and can be sensitive to the architecture provided. A way to handle overfitting, such a regularization, is also required.