8.9 References and Further Reading

Goodfellow et al. [2016] provide an overview of neural networks and deep learning. Schmidhuber [2015] provides a comprehensive history and extensive references to the literature. Chollet [2021] provides a readable, intuitive overview of deep learning with code (using Python and the Keras library).

McCulloch and Pitts [1943] define a formal neuron. Minsky [1952] showed how such representations can be learned from data. Rosenblatt [1958] introduced the perceptron. Minsky and Papert [1988] is a classic work that analyzed the limitations of the neural networks of the time.

Backpropagation is introduced in Rumelhart et al. [1986]. LeCun et al. [1998b] describe how to effectively implement backpropagation. Ng [2018] provides practical advice on how to build deep learning applications.

LeCun et al. [2015] review how multilayer neural networks have been used for deep learning in many applications. Hinton et al. [2012a] review neural networks for speech recognition, Goldberg [2016] for natural language processing, and Lakshmanan et al. [2021] for vision.

Different activation functions, including ReLU, are investigated in Jarrett et al. [2009] and Glorot et al. [2011]. Ruder [2016] gives an overview of many variants of gradient descent. Nocedal and Wright [2006] provide practical advice on gradient descent and related methods. Karimi et al. [2016] analyze how many iterations of stochastic gradient descent are needed. The Glorot uniform initializer is by Glorot and Bengio [2010]. Dropout is described by Hinton et al. [2012b].

Convolutional neural networks and the MNIST dataset are by LeCun et al. [1998a]. Krizhevsky et al. [2012] describe AlexNet, which used convolutional neural networks to significantly beat the state of the art on ImageNet [Russakovsky et al., 2014], a dataset for predicting which of 1000 categories an image contains. Residual networks are described by He et al. [2015].

Jurafsky and Martin [2023] provide a textbook introduction to speech and language processing, which includes more detail on some of the language models presented here. LSTMs were invented by Hochreiter and Schmidhuber [1997]. Gers et al. [2000] introduced the forget gate to LSTMs. Olah [2015] presents a tutorial introduction to LSTMs. Word embeddings were pioneered by Bengio et al. [2003]. The CBOW and Skip-gram models, collectively known as Word2vec, are by Mikolov et al. [2013].

Attention for machine translation was pioneered by Bahdanau et al. [2015]. Transformers are due to Vaswani et al. [2017]. Alammar [2018] provides a tutorial introduction, and Phuong and Hutter [2022] provide a self-contained introduction, with pseudocode, to transformers. Tay et al. [2022] survey the time and memory complexity of transformer variants. AlphaFold [Senior et al., 2020; Jumper et al., 2021] and RoseTTAFold [Baek et al., 2021] used transformers and other deep learning techniques for protein folding.

Large pre-trained language models are surveyed by Qiu et al. [2020] and Minaee et al. [2021]. Bommasani et al. [2021], calling them foundation models, outlined a research program for large pre-trained models of language, vision, science, and other domains. The language models in Figure 8.15 are ELMo [Peters et al., 2018], BERT [Devlin et al., 2019], Megatron-LM [Shoeybi et al., 2019], GPT-3 [Brown et al., 2020], GShard [Lepikhin et al., 2021], Switch-C [Fedus et al., 2021], Gopher [Rae et al., 2021], and PaLM [Chowdhery et al., 2022]. Shanahan [2022] and Zhang et al. [2022b] discuss what large language models actually learn and what they do not learn.

Srivastava et al. [2022] provide challenge benchmarks of 204 diverse tasks that are more precisely specified than the Turing test and are beyond the capabilities of current language models. Lertvittayakumjorn and Toni [2021] and Qian et al. [2021] survey explainability in natural language systems.

Generative adversarial networks were invented by Goodfellow et al. [2014]. Adversarial debiasing is based on Zhang et al. [2018]. Sohl-Dickstein et al. [2015] and Ho et al. [2020] present diffusion probabilistic models.

Bender et al. [2021] and Weidinger et al. [2021] discuss issues with large pre-trained models, and include a diverse collection of references.