One way to solve a reinforcement learning problem is to treat it as an optimization problem, with the aim of selecting a policy that maximizes the expected discounted reward. This can be done with a policy search through the space of policies to find the best policy. A policy is a controller that can be evaluated by running the agent, following that policy, in the environment.
Policy search is often carried out as a stochastic local search through the space of policies. A policy can be evaluated by running it in the environment a number of times.
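The following is a minimal sketch of this idea, assuming a policy represented as a dictionary mapping states to actions and a hypothetical environment object with reset() and step(action) methods that return the next state, the reward, and whether the episode has finished. It is one way the idea could be realized, not a prescribed implementation.

```python
import random

def evaluate(policy, env, episodes=20, discount=0.9, horizon=100):
    """Estimate a policy's expected discounted reward by running it
    in the environment a number of times and averaging the returns."""
    total = 0.0
    for _ in range(episodes):
        state = env.reset()
        ret, factor = 0.0, 1.0
        for _ in range(horizon):
            state, reward, done = env.step(policy[state])
            ret += factor * reward
            factor *= discount
            if done:
                break
        total += ret
    return total / episodes

def stochastic_local_search(env, states, actions, steps=1000):
    """Hill climbing in policy space: change the action in one state
    and keep the change if the estimated value does not decrease."""
    policy = {s: random.choice(actions) for s in states}
    value = evaluate(policy, env)
    for _ in range(steps):
        candidate = dict(policy)
        candidate[random.choice(states)] = random.choice(actions)
        candidate_value = evaluate(candidate, env)
        if candidate_value >= value:
            policy, value = candidate, candidate_value
    return policy, value
```

Because each evaluation is only a noisy estimate of the expected discounted reward, a single comparison can accept a worse policy; more careful implementations average over more episodes or re-evaluate the current best policy.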
A diverse set of initial policies can be repeatedly evaluated in the environment and iteratively improved. This process is called an evolutionary algorithm because each policy, considered as an agent, is evaluated on how well it survives. It is often combined with genetic algorithms, which allow components of different policies to be combined, with the aim of maintaining a diverse collection of controllers. Often these controllers are represented as neural networks, which is the foundation of neuroevolution.
These algorithms have the advantage of being able to explore a diverse set of controllers.
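As an illustration, here is a sketch of a genetic algorithm over tabular policies, reusing the evaluate function from the previous sketch. The population size, selection rule, crossover, and mutation scheme shown are illustrative choices; neuroevolution systems instead evolve the weights (and sometimes the structure) of neural-network controllers.

```python
import random

def genetic_policy_search(env, states, actions, pop_size=50,
                          generations=100, mutation_rate=0.05):
    """Each policy is judged as a whole by its estimated return; the
    fitter half survives, and children combine the per-state actions of
    two parents (crossover) with occasional random changes (mutation)."""
    population = [{s: random.choice(actions) for s in states}
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=lambda p: evaluate(p, env),
                        reverse=True)
        survivors = ranked[:pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            child = {s: random.choice((a[s], b[s])) for s in states}
            for s in states:
                if random.random() < mutation_rate:
                    child[s] = random.choice(actions)
            children.append(child)
        population = survivors + children
    return max(population, key=lambda p: evaluate(p, env))
```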
Algorithms based on evaluating policies as a whole have a number of issues. The first is the size of the state space. If there are $|S|$ states and $|A|$ actions, there are $|A|^{|S|}$ policies. For example, for the game described in Example 13.1, there are $4^6 = 4096$ different policies. For the game of Example 13.2, there are 250 states, and so $4^{250} \approx 10^{150}$ policies. This is a very small game, but it has more policies than there are particles in the universe.
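To see the scale of the second number, note that

$$4^{250} = 2^{500}, \qquad \log_{10} 2^{500} = 500 \log_{10} 2 \approx 150.5,$$

so there are roughly $3 \times 10^{150}$ policies, whereas the number of particles in the observable universe is commonly estimated at around $10^{80}$.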
Second, evolutionary algorithms use experiences very wastefully. Suppose an agent in some state of Example 13.1 moved left and this turned out badly; you would like it to learn that it is bad to go left from that state. But evolutionary algorithms, as presented, wait until the agent has finished and judge the policy as a whole. Stochastic local search will randomly try doing something else in that state and so may eventually determine that that action was not good, but it is very indirect. Genetic algorithms are slightly better in that the policies that have the agent going left in that state will die off, but again this is very indirect.
Third, the performance of evolutionary algorithms can be very sensitive to the representation of the policy. The representation for a genetic algorithm should be such that crossover preserves the good parts of the policy. The representations are often tuned for the particular domain.
An alternative pursued in the rest of this chapter is to learn after every action. The components of the policy are learned, rather than the policy as a whole. By learning what to do in each state, learning can have linear or polynomial time and space complexity in the number of states, rather than exponential in the number of states. However, evolutionary algorithms can sometimes find creative solutions to problems that elude other methods; see Section 13.9.2.