### 11.3.1 Evolutionary Algorithms

One way to solve a reinforcement learning problem is to treat it as an optimization problem, with the aim of selecting a policy that maximizes the expected reward collected. One way to do this is via policy search: the aim is to search through the space of all policies to find the best policy. A policy is a controller that can be evaluated by running the agent, acting under that policy, in the environment.

Policy search is often carried out as stochastic local search in the space of policies. A policy can be evaluated by running it in the environment a number of times and averaging the reward collected.

One of the difficulties is choosing a representation of the policy. Starting from an initial policy, the policy can be repeatedly evaluated in the environment and iteratively improved. This process is called an evolutionary algorithm because the agent, as a whole, is evaluated on how well it survives. It is often combined with genetic algorithms, which take us one step closer to the biological analogy of competing agents mutating genes. The idea is that crossover provides a way to combine the best features of policies.

Evolutionary algorithms have a number of issues. The first is the size of the policy space, which is exponential in the number of states. If there are n states and m actions, there are m^n policies. For example, for the game described in Example 11.7, there are 4^6 = 4,096 different policies. For the game of Example 11.8, there are 250 states, and so 4^250 ≈ 10^150 policies. This is a very small game, but it has more policies than there are particles in the universe.
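Python's arbitrary-precision integers make this count easy to check directly (the state and action counts below are taken from the examples cited in the text):

```python
# Number of deterministic policies: one of m actions for each of n states.
n_states, n_actions = 6, 4
print(n_actions ** n_states)  # 4096 policies for the six-state game

# For the 250-state game with 4 actions, 4**250 has 151 decimal digits,
# i.e., roughly 10**150 policies.
print(len(str(4 ** 250)))  # 151
```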

Second, evolutionary algorithms use experiences very wastefully. If an agent was in state s2 of Example 11.7 and it moved left, you would like it to learn that it is bad to go left from state s2. But evolutionary algorithms wait until the agent has finished and judge the policy as a whole. Stochastic local search will randomly try doing something else in state s2 and so may eventually determine that the action was not good, but this is very indirect. Genetic algorithms are slightly better in that the policies that have the agent going left in state s2 will die off, but again this is very indirect.

Third, the performance of evolutionary algorithms can be very sensitive to the representation of the policy. The representation for a genetic algorithm should be such that crossover preserves the good parts of the policy. The representations are often tuned for the particular domain.
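To illustrate why the representation matters, here is a sketch of one-point crossover on the per-state action-list representation used above (this representation and the example policies are illustrative assumptions, not from the text):

```python
import random

random.seed(1)


def one_point_crossover(parent_a, parent_b):
    """Combine two policies (action lists indexed by state) by splicing
    a prefix of one parent with the suffix of the other."""
    point = random.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:]


# Two hypothetical parent policies for a six-state problem.
a = [0, 0, 1, 1, 1, 1]
b = [1, 1, 0, 0, 0, 0]
child = one_point_crossover(a, b)
print(child)
```

Crossover only preserves "good parts" of a policy if states whose actions work well together sit near each other in the list; if the ordering of states in the representation is arbitrary, splicing two policies is as likely to break a coordinated block of actions as to keep it, which is why representations are often tuned for the particular domain.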

An alternative pursued in the rest of this chapter is to learn after every action. The components of the policy are learned, rather than the policy as a whole. By learning what to do in each state, the problem becomes linear in the number of states rather than exponential in the number of states.