# 12.1 Reinforcement Learning Problem

A reinforcement learning (RL) agent acts in an environment, observing its state and receiving rewards. From its perceptual and reward information, it must determine what to do. This chapter mainly considers fully observable, single-agent reinforcement learning. Section 12.10.2 describes a simple form of multiagent reinforcement learning.

A reinforcement learning agent is characterized as follows:

• The learning agent is given the possible states and the set of actions it can carry out.

• At each time the agent observes the state of the environment and the reward received. We are assuming the environment is fully observable.

• At each time, after observing the state and reward, the agent carries out an action.

• The goal of the agent is to maximize its discounted reward, for some discount factor $\gamma$.

Reinforcement learning can be formalized in terms of Markov decision processes, in which the agent initially only knows the set of possible states and the set of possible actions. The dynamics, $P(s^{\prime}\mid a,s)$, and the reward function, $R(s,a)$, are not given to the agent. As in an MDP, after each action, the agent observes the state it is in and receives a reward.

###### Example 12.1.

Consider the domain shown in Figure 12.1. There are six states the agent could be in, labeled $s_{0},\dots,s_{5}$. The agent can observe what state it is in at any time. The agent has four actions: $upR$, $upC$, $left$, $right$. That is all the agent knows before it starts. It does not know how the states are configured, what the actions do, or how rewards are earned.

Figure 12.1 shows the configuration of the six states. Suppose the actions work as follows:

right

The agent moves to the right in states $s_{0},s_{2},s_{4}$ with a reward of 0 and stays still in the other states, with a reward of $-1$.

left

The agent moves one state to the left in states $s_{1},s_{3},s_{5}$, with a reward of 0. In state $s_{0}$, it stays in state $s_{0}$ and has a reward of $-1$. In state $s_{2}$, it has a reward of $-100$ and stays in state $s_{2}$. In state $s_{4}$, it receives a reward of $10$ and moves to state $s_{0}$.

upC

(for “up carefully”) The agent goes up, except in states $s_{4}$ and $s_{5}$, where the agent crashes and stays still. It receives a reward of $-1$, except when it crashes, in which cases there is a reward of $-2$.

upR

(for “up risky”) With a probability of $0.8$ it acts like $upC$, except the reward is $-1$ when it crashes, and is 0 otherwise. With probability $0.1$ it acts as a $left$, and with probability $0.1$ it acts as a $right$.

There is a discounted reward with a discount of $\gamma=0.9$. This can be translated as having a 0.1 chance of the agent leaving the game at any step, or as a way to encode that the agent prefers immediate rewards over future rewards.

The agent should try to go left from $s_{4}$ as often as possible to collect the reward of $+10$. Getting from $s_{0}$ to $s_{4}$, it can go past a dangerous cliff at $s_{2}$ where there is a risk of falling off the cliff and getting a large negative reward, or going the longer and safer way around. Initially, it does not know this, but this is what it needs to learn.

###### Example 12.2.

Figure 12.2 shows the domain of a more complex game. There are 25 grid locations the agent could be in. A prize could be on one of the corners, or there could be no prize. When the agent lands on a prize, it receives a reward of $10$ and the prize disappears. When there is no prize, for each time step there is a probability that a prize appears on any one of the corners. Monsters can appear at any time on one of the locations marked $M$. The agent gets damaged if a monster appears on the square the agent is on. If the agent is already damaged, it receives a reward of $-10$. The agent can get repaired (so it is no longer damaged) by visiting the repair station marked $R$.

In this example, the state consists of four components: $\left$, where $X$ is the $X$-coordinate of the agent, $Y$ is the $Y$-coordinate of the agent, $D$ is Boolean and is true when the agent is damaged, and $P$ is the position of the prize ($P=0$ if there is no prize, $P=i$ if there is a prize at position $P_{i}$). Because the monsters are transient – knowing whether a monster appeared at a time does not provide any information about the future – it is not necessary to include them as part of the state. There are therefore $5*5*2*5=250$ states. The environment is fully observable, so the agent knows what state it is in. The agent does not know the meaning of the states; it doesn’t know about the four components and it has no idea initially about being damaged or what a prize is.

The agent has four actions: $up$, $down$, $left$, and $right$. These move the agent one step – usually one step in the direction indicated by the name, but sometimes in one of the other directions. If the agent crashes into an outside wall or one of the interior walls (the thick lines near the location $R$), it remains where is was and receives a reward of $-1$.

The agent does not know any of the story given here. It just knows there are 250 states and 4 actions, which state it is in at each time, and what reward was received each time.

This game is simple, but it is surprisingly difficult to write a good controller for it. There are implementations available on the book website that you can play with and modify. Try to write a controller by hand for it; it is possible to write a controller that accumulates a reward of about 500 for each 1000 steps. This game is also difficult to learn, because visiting $R$ is seemingly useless until the agent eventually learns that being damaged is bad, and that visiting $R$ makes it not damaged. It must stumble on this while trying to collect the prizes. The states where there is no prize available do not last very long. Moreover, it has to learn this without being given the concept of damaged; all it knows, initially, is that there are 250 states and 4 actions.

Reinforcement learning is difficult for a number of reasons:

• The credit assignment problem or the blame attribution problem is the problem of determining which action was responsible for a reward or punishment. The action responsible may have occurred a long time before the reward was received. Moreover, not just a single action but rather a combination of actions carried out in the appropriate circumstances may be responsible for a reward. For example, you could teach an agent to play a game by rewarding it when it wins or loses; it must determine the brilliant moves, which usually occur long before the end, that were needed to win. As another example, you may try to train a dog by saying “bad dog” when you come home and find a mess. The dog has to determine, out of all of the actions it did, which of them were the actions that were responsible for the reprimand.

• Even if the dynamics of the world does not change, the effect of an action of the agent depends on what the agent will do in the future. What may initially seem like a bad thing for the agent to do may end up being an optimal action because of what the agent does in the future. This is common among planning problems, but it is complicated in the reinforcement learning context because the agent does not know, a priori, the effects of its actions.

• The explore–exploit dilemma: if an agent has worked out a good course of actions, should it continue to follow these actions (exploiting what it has determined) or should it explore to find better actions? An agent that never explores may act forever in a way that could have been much better if it had explored earlier. An agent that always explores will never use what it has learned. This dilemma is discussed further in Section 12.5.