# 13.5 Exploration and Exploitation

An estimate of a Q-function is not enough to determine what the agent should actually do (which action is selected in line 16 of Figure 13.3). There are two competing goals for what an agent should do:

• It could exploit the knowledge that it has found to get higher rewards by, in state $s$, doing one of the actions $a$ that maximizes $Q[s,a]$.

• It could explore to build a better estimate of the $Q$-function, by, for example, selecting an action at random at each time.

An agent only gets a good estimate of the Q-values for states and actions it has tried many times. If an agent exploits all of the time, it will keep trying the same action in a state, and never explore enough to discover that an action with a lower current $Q$-value might actually be better than the action with the current highest $Q$-value.

Random exploration will not allow the agent to exploit what it has found until after it has finished learning, which might never occur. Note that $Q$-learning will converge eventually to the optimal $Q$-function even with random exploration. Random exploration can take an inordinate amount of time to find interesting parts of the state space – the parts of the state space the agent will try to be in – and so even if the aim is to learn then exploit, it might be better to have more purposeful exploration.

There are a number of ways to combine exploitation and exploration:

• In optimism in the face of uncertainty, the $Q$-values are initialized to values that encourage exploration. If the $Q$-values are initialized to high values, the unexplored areas will look good, so that pure exploitation will tend to explore. This does encourage exploration; however, the agent can hallucinate that some state–action pairs are good for a long time, even though there is no real evidence for them being good. A state only gets to look bad when all its actions look bad; but when all of these actions lead to states that look good, it takes a long time to get a realistic view of the actual values. This is a case where old estimates of the $Q$-values can be quite bad estimates of the actual $Q$-value, and these can remain bad estimates for a long time. To get fast convergence, the initial values should be as close as possible to the final values; trying to make them an overestimate will make convergence slower. In noisy environments, where the effect of an action has some randomness, optimism in the face of uncertainty with no other mechanism for exploration can mean that a good action never gets explored more because, by random chance, it gets a low Q-value from which it cannot recover. Optimism in the face of uncertainty can be used in conjunction with other methods.
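Optimistic initialization can be sketched in a few lines. The dictionary-based $Q$-representation, the environment-specific actions, and the value 10.0 are illustrative assumptions; the initial value just needs to exceed any achievable $Q$-value:

```python
import random
from collections import defaultdict

OPTIMISTIC_INIT = 10.0   # assumed to exceed any achievable Q-value

def make_optimistic_q():
    # every unseen (state, action) pair starts out looking good,
    # so pure exploitation is drawn towards unexplored actions
    return defaultdict(lambda: OPTIMISTIC_INIT)

def greedy_action(Q, s, actions):
    # pure exploitation: pick a Q-maximizing action, ties broken at random
    best = max(Q[(s, a)] for a in actions)
    return random.choice([a for a in actions if Q[(s, a)] == best])
```

Once an action has been tried and its estimate has fallen towards a realistic value, the greedy agent moves on to the still-optimistic untried actions.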

• In an $\epsilon$-greedy exploration strategy, where $0\leq\epsilon\leq 1$, the agent selects a random action a fraction $\epsilon$ of the time, and otherwise an action that maximizes $Q[s,a]$. It is possible to change $\epsilon$ through time. Intuitively, early in the life of the agent, it should act more randomly to encourage initial exploration and, as time progresses, it should act more greedily (reducing $\epsilon$).
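A minimal sketch of $\epsilon$-greedy selection, assuming $Q$ is a dictionary mapping (state, action) pairs to values (an interface chosen for the example, not prescribed by the text):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon):
    # with probability epsilon, explore: pick any action uniformly at random
    if random.random() < epsilon:
        return random.choice(actions)
    # otherwise exploit: pick a Q-maximizing action, ties broken at random
    best = max(Q[(s, a)] for a in actions)
    return random.choice([a for a in actions if Q[(s, a)] == best])
```

Decaying $\epsilon$ over time, for example $\epsilon = 1/\sqrt{t}$, shifts the agent from exploration towards exploitation as suggested above.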

• One problem with an $\epsilon$-greedy strategy is that it treats all of the actions, apart from the best action, equivalently. If there are a few seemingly good actions and other actions that look much less promising, it may be more sensible to select among the good actions: putting more effort towards determining which of the promising actions is best, and less effort towards exploring the actions that look worse. One way to do that is to select action $a$ with a probability depending on the value of $Q[s,a]$. This is known as a softmax action selection. A common method is to use a Gibbs or Boltzmann distribution, where the probability of selecting action $a$ in state $s$ is proportional to $e^{{Q[s,a]}/{\tau}}$. Thus, in state $s$, the agent selects action $a$ with probability

 $\frac{e^{{Q[s,a]}/{\tau}}}{\sum_{a'}e^{{Q[s,a']}/{\tau}}}$

where $\tau>0$ is the temperature specifying how randomly actions should be chosen. How the temperature affects the probability of an action being selected is shown in Figure 4.1; for example, at a temperature of 1, if action $a_{1}$ has a utility two less than action $a_{2}$, it will be chosen approximately 14% as often as $a_{2}$ (since $e^{-2}\approx 0.14$). When $\tau$ is high, the actions are chosen in almost equal proportions. As the temperature is reduced, the higher-valued actions are more likely to be chosen and, in the limit as $\tau\rightarrow 0$, the best action is always chosen. Often the temperature is reduced as the search progresses.
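Softmax (Boltzmann) selection can be sketched as below; subtracting the maximum $Q$-value before exponentiating is a standard numerical trick that avoids overflow without changing the distribution. The dictionary interface is again an assumption for the example:

```python
import math
import random

def softmax_action(Q, s, actions, tau):
    # choose a with probability proportional to exp(Q[s,a] / tau);
    # shifting by the max Q-value leaves the distribution unchanged
    qmax = max(Q[(s, a)] for a in actions)
    weights = [math.exp((Q[(s, a)] - qmax) / tau) for a in actions]
    return random.choices(actions, weights=weights)[0]
```

At $\tau=1$ with $Q$-values 0 and 2, the lower-valued action is selected with probability $e^{-2}/(1+e^{-2})\approx 0.12$, matching the roughly 14%-as-often ratio described above.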

• A problem with the previous methods is that they do not distinguish the actions that the agent has tried many times, for which there is a good estimate of the actual $Q$-value, from those actions that have not been tried much, for which the estimate is very poor. Another group of methods models a distribution of the uncertainty of the estimate of the expected values, not just the current expected value. As the agent gets more information, the uncertainty about the expected value reduces, and it can become more sure of the expected value. The upper confidence bound is an upper estimate of the expected value; the aim is to pick a value such that it will be very unlikely that the actual value is greater than this. The upper confidence bound is the sum of two terms, the $Q$ estimate and a confidence bound that depends on the number of times the action has been chosen. Suppose $N[s,a]$ is the number of times action $a$ has been selected for state $s$, and $N(s)=\sum_{a}N[s,a]$ is the number of times state $s$ has been visited. $N(s)$ can be stored or computed on demand. The upper confidence bound UCB1 (where the name is from Auer et al. [2002], who analyze a number of algorithms) is

 $UCB1(s,a)=Q[s,a]+C*\sqrt{\frac{\log N(s)}{N[s,a]}}$

where $C$ is a constant that depends on the magnitude of the $Q$-values; if the values are all in range $[0,1]$ and the samples are independent, then $C=\sqrt{2}$ has good theoretical properties. In reinforcement learning the samples are rarely independent. In a state $s$, the agent can balance exploration and exploitation by choosing action $a$ that has the highest $UCB1(s,a)$ value.
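The UCB1 rule can be sketched as follows, computing $N(s)$ on demand from the per-action counts. Treating an untried action's bound as infinite, so that every action is tried at least once, is a common convention assumed here:

```python
import math

def ucb1_action(Q, N, s, actions, C=math.sqrt(2)):
    # N maps (state, action) to the number of times a was tried in s;
    # N(s) is computed on demand as the sum over actions
    n_s = sum(N[(s, a)] for a in actions)

    def ucb(a):
        if N[(s, a)] == 0:
            return float("inf")  # untried actions are selected first
        return Q[(s, a)] + C * math.sqrt(math.log(n_s) / N[(s, a)])

    return max(actions, key=ucb)
```

Note how a rarely tried action with a modest estimate can outrank a heavily tried action with a higher estimate, which is exactly the intended exploration bonus.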

• The successive values of $Q(s,a)$ computed provide increasingly refined estimates of its mean value. Those estimates can be used to create a distribution over the mean. An alternative to choosing the upper confidence bound is to sample from this distribution. In Thompson sampling, given a state, for each action $a$, a value $v_{a}$ is selected from the distribution of the mean; the action with the highest $v_{a}$ is chosen to be carried out.

If the values are all 0 or 1 (for example, if the reward is just a win or loss at the end), the beta distribution, represented by counts of the number of 0s and the number of 1s, is appropriate.

If the return is a real number, it is common to assume it follows a Gaussian distribution, which is parameterized by the mean and the variance. For each state, choose the action $a$ that maximizes

 $Q[s,a]+C*\frac{randn()}{\sqrt{N[s,a]}}$

where $randn()$ returns a random number using the standard normal distribution (mean 0, variance 1). $C$ is chosen to reflect the scale of the $Q$-values.

Sometimes neither the beta nor normal distributions are appropriate, in which case there is a large literature of other methods that are available.
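Both Thompson-sampling variants above can be sketched as follows. The dictionary-based counts, the uniform Beta(1,1) prior, and the choice $C=1$ are assumptions made for the example:

```python
import math
import random

def thompson_beta(wins, losses, s, actions):
    # 0/1 returns: sample each action's mean from a Beta posterior
    # (counts of 1s and 0s, plus a uniform Beta(1,1) prior) and take the argmax
    samples = {a: random.betavariate(wins[(s, a)] + 1, losses[(s, a)] + 1)
               for a in actions}
    return max(samples, key=samples.get)

def thompson_gaussian(Q, N, s, actions, C=1.0):
    # Gaussian variant from the text: maximize Q[s,a] + C*randn()/sqrt(N[s,a])
    def score(a):
        return Q[(s, a)] + C * random.gauss(0, 1) / math.sqrt(N[(s, a)])
    return max(actions, key=score)
```

In both cases the noise shrinks as an action accumulates samples, so well-explored actions are judged almost entirely by their estimated mean, while poorly explored actions still get occasional optimistic draws.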

Much of the theoretical work on the exploration–exploitation tradeoff has been for the single-state case, called the multi-armed bandit problem; a “bandit” is slang for a slot machine that gives rewards randomly and independently each time. The explore–exploit trade-off in reinforcement learning is more complicated than in the bandit setting, because rewards may be delayed, and so the initial estimates of a $Q$-value may not be a random sample, and because the estimates are not independent. This means that we typically have to resort to testing what works in practice; see, for example, Exercise 13.4.