13.12 References and Further Reading

For an introduction to reinforcement learning, see Sutton and Barto [2018], Szepesvári [2010], Kochenderfer et al. [2022], and Powell [2022]. Silver et al. [2021] argue that many problems can be formulated in terms of reward maximization and reinforcement learning.

Langton [1997], De Jong [2006], Salimans et al. [2017], and Such et al. [2017] provide overviews of evolutionary computation, including its use in reinforcement learning. Stanley et al. [2019] review neuroevolution. Such et al. [2017] show that genetic algorithms can be competitive on hard reinforcement learning problems. Lehman et al. [2018] provide many examples of the creativity of evolutionary algorithms.

Temporal-difference learning is due to Sutton [1988], and Q-learning to Watkins and Dayan [1992].
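The Q-learning update cited above can be sketched in a few lines. This is a minimal illustration, not a complete agent; the states, actions, and parameter values below are hypothetical:

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step:
    Q[s][a] += alpha * (r + gamma * max_a' Q[s'][a'] - Q[s][a])."""
    best_next = max(Q[s_next][a2] for a2 in actions)
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
    return Q

# Illustrative use: one observed transition (s0, right, reward 1.0, s1).
Q = defaultdict(lambda: defaultdict(float))
actions = ["left", "right"]
q_update(Q, "s0", "right", 1.0, "s1", actions)
```

The temporal-difference error, the bracketed term inside the update, is the difference between the bootstrapped estimate `r + gamma * max Q[s'][.]` and the current estimate `Q[s][a]`.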

The use of upper confidence bounds for bandit problems was analyzed by Auer et al. [2002]. Russo et al. [2018] provide a tutorial on Thompson [1933] sampling. Kearns et al. [2002] analyze sampling-based tree search, and Kocsis and Szepesvári [2006] combine tree search with upper-confidence-bound exploration.
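The upper-confidence-bound idea analyzed by Auer et al. [2002] picks the arm whose empirical mean plus an exploration bonus is largest, so rarely tried arms get inflated scores. A minimal sketch (the arm statistics and the constant `c` below are illustrative, following the common UCB1 form):

```python
import math

def ucb1(counts, means, t, c=math.sqrt(2)):
    """Select the arm maximizing mean + c*sqrt(ln t / n).

    counts[i]: times arm i was pulled; means[i]: its empirical mean reward;
    t: total pulls so far. Untried arms are selected first.
    """
    for i, n in enumerate(counts):
        if n == 0:
            return i
    scores = [means[i] + c * math.sqrt(math.log(t) / counts[i])
              for i in range(len(counts))]
    return max(range(len(counts)), key=lambda i: scores[i])

# Arm 1 has the higher mean, but arm 0 has far fewer pulls,
# so its exploration bonus dominates and it is chosen.
arm = ucb1(counts=[2, 50], means=[0.4, 0.5], t=52)
```

As an arm's count grows, its bonus shrinks, so exploration concentrates on arms that are either promising or under-sampled.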

Schmidhuber [1990] shows how neural networks can simultaneously learn a model and a policy. Mnih et al. [2015] describe how reinforcement learning combined with neural networks was used to play classic Atari video games. Silver et al. [2016] show how learning can be used for the game of Go, and Silver et al. [2017] describe AlphaZero, which also achieves superhuman performance in the games of chess and shogi. François-Lavet et al. [2018] and Li [2018] survey deep reinforcement learning.

The social impact section is based on Amodei et al. [2016].