### 11.3.1 Evolutionary Algorithms

One way to solve a reinforcement learning problem is to treat it as an
optimization problem, with the aim of selecting a policy that maximizes
the expected reward collected. One way to do this is via **policy
search**: searching through the space of all policies to find the best
one. A policy is a controller that can be evaluated by running the
agent that uses it in the environment.

Policy search is often carried out by stochastic local search in the space of policies, where each candidate policy is evaluated by running it in the environment a number of times.
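As an illustration, the following sketch runs stochastic local search over tabular policies, evaluating each candidate by simulated rollouts. The toy environment, the names, and all parameter values here are assumptions for illustration, not from the text:

```python
import random

# Stochastic local search over tabular policies (illustrative sketch).
# A policy maps each of N_STATES states to one of N_ACTIONS actions;
# it is scored by running it in the environment for several episodes.

N_STATES, N_ACTIONS = 6, 4

def run_episode(policy, steps=50, seed=None):
    """Toy stochastic environment (an assumption standing in for any
    black-box simulator): each action earns reward 1 in the 'right'
    state, plus noise, and the next state is random."""
    rng = random.Random(seed)
    s, total = 0, 0.0
    for _ in range(steps):
        a = policy[s]
        total += (1.0 if a == s % N_ACTIONS else 0.0) + rng.gauss(0, 0.1)
        s = rng.randrange(N_STATES)
    return total

def evaluate(policy, episodes=20):
    """Average reward over several runs, since episodes are noisy."""
    return sum(run_episode(policy, seed=i) for i in range(episodes)) / episodes

def local_search(iterations=500, seed=0):
    rng = random.Random(seed)
    policy = [rng.randrange(N_ACTIONS) for _ in range(N_STATES)]
    best = evaluate(policy)
    for _ in range(iterations):
        neighbour = policy[:]                 # change one state's action
        neighbour[rng.randrange(N_STATES)] = rng.randrange(N_ACTIONS)
        score = evaluate(neighbour)
        if score > best:                      # keep the change if it helps
            policy, best = neighbour, score
    return policy, best

policy, score = local_search()
```

Note that each evaluation needs whole episodes; the search only ever compares complete policies, which is exactly the inefficiency discussed below.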

One of the difficulties is in choosing a representation of the
policy. Starting
from an initial policy, the policy can be repeatedly evaluated in the
environment and iteratively improved. This process is called an
**evolutionary algorithm** because the agent, as a whole, is evaluated
on how well it survives. This is often combined with genetic
algorithms, which take us one step closer to the
biological analogy of competing agents mutating genes. The idea is that
crossover provides a way to combine the best features of policies.
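A minimal genetic-algorithm sketch of this idea follows; the fitness function is a stand-in for episode rollouts, and all names, operators, and parameter values are illustrative assumptions:

```python
import random

# Genetic algorithm over tabular policies (illustrative sketch):
# crossover splices two parent policies, mutation perturbs single
# states, and selection keeps the fitter half of the population.

N_STATES, N_ACTIONS = 6, 4
rng = random.Random(1)

def fitness(policy):
    # Stand-in for running the policy in the environment: counts
    # states whose action matches a hypothetical best action.
    return sum(1 for s, a in enumerate(policy) if a == s % N_ACTIONS)

def crossover(p1, p2):
    cut = rng.randrange(1, N_STATES)          # one-point crossover
    return p1[:cut] + p2[cut:]

def mutate(policy, rate=0.1):
    return [rng.randrange(N_ACTIONS) if rng.random() < rate else a
            for a in policy]

def evolve(pop_size=30, generations=40):
    pop = [[rng.randrange(N_ACTIONS) for _ in range(N_STATES)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:pop_size // 2]         # truncation selection
        children = [mutate(crossover(*rng.sample(parents, 2)))
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
```

One-point crossover only helps here if states whose actions work well together sit near each other in the list, which is why the representation matters so much.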

Evolutionary algorithms have a number of issues. The first
is the size of the state space. If there are *n* states and *m*
actions, there are *m^n* policies. For example, for the game described in
Example 11.7, there are *4^6 = 4,096* different policies. For the game of
Example 11.8, there are 250 states, and so *4^250 ≈ 10^150* policies. This
is a very small game, but it has more policies than there are particles
in the universe.

Second, evolutionary algorithms use experiences very wastefully. If
an agent was in state *s_2* of Example 11.7 and it moved left, you would
like it to learn that it is bad to go left from state *s_2*. But
evolutionary algorithms wait until the agent has finished and judge the
policy as a whole. Stochastic local search will randomly try doing
something else in state *s_2* and so may eventually determine that that
action was not good, but this is very indirect. Genetic algorithms are
slightly better in that the policies that have the agent going left in
state *s_2* will die off, but again this is very indirect.

Third, the performance of evolutionary algorithms can be very sensitive to the representation of the policy. The representation for a genetic algorithm should be such that crossover preserves the good parts of the policy. The representations are often tuned for the particular domain.
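The policy counts in the first issue can be checked directly, since Python's integers are exact at any size:

```python
# With m actions and n states there are m**n deterministic policies.
m = 4

print(m ** 6)               # 4096 policies for the 6-state game

digits = len(str(m ** 250))
print(digits)               # 151: 4**250 is about 10**150
```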

An alternative, pursued in the rest of this chapter, is to learn after every action. The components of the policy are learned, rather than the policy as a whole. By learning what to do in each state, the problem becomes linear in the number of states rather than exponential in the number of states.
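As a preview of that alternative, a table with one value per state–action pair has only *n × m* entries (1,000 for a 250-state, 4-action game, versus *4^250* whole policies), and it can be adjusted after every single action. The sketch below uses a one-step Q-learning-style update; the toy environment and parameter values are assumptions for illustration:

```python
import random

# Learning after every action (illustrative sketch): a table q[s][a]
# with N_STATES * N_ACTIONS entries is nudged toward the observed
# reward plus the discounted estimate of the next state, one step at
# a time -- no need to wait for an episode to finish.

N_STATES, N_ACTIONS = 6, 4
ALPHA, GAMMA = 0.1, 0.9            # step size and discount (assumed)
rng = random.Random(0)

q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

s = 0
for _ in range(5000):
    a = rng.randrange(N_ACTIONS)               # explore randomly
    r = 1.0 if a == s % N_ACTIONS else 0.0     # stand-in reward
    s2 = rng.randrange(N_STATES)               # stand-in transition
    # per-action update toward the one-step lookahead estimate
    q[s][a] += ALPHA * (r + GAMMA * max(q[s2]) - q[s][a])
    s = s2

# the greedy policy read off the learned table
greedy = [max(range(N_ACTIONS), key=lambda a: q[s][a])
          for s in range(N_STATES)]
```

Here the experience of one bad action in one state immediately lowers that single table entry, rather than being averaged into the score of an entire policy.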