8.8 Review

  • Artificial neural networks are parameterized models for prediction, typically made of multiple layers of parameterized linear functions interleaved with nonlinear activation functions.

  • The output is typically a linear function for a real-valued prediction, a sigmoid for a Boolean prediction, or a softmax for a categorical prediction. Other outputs, such as sequences or structured predictions, use specialized methods. (The first sketch following this list shows these three output functions.)

  • Neural networks that use ReLU for all hidden units define piecewise linear functions if they have a linear output, or piecewise linear separators if they have a sigmoid output; the second sketch following this list illustrates the piecewise linear case.

  • Backpropagation can be used to train the parameters of functions that are differentiable almost everywhere.

  • Gradient descent trains by taking steps proportional to the negative of the gradient; many variants improve on the basic algorithm by adapting the step size and adding momentum. (The third sketch following this list combines backpropagation and gradient descent with momentum for a small network.)

  • Convolutional neural networks apply the same learnable filters at multiple positions on a grid, such as an image; the fourth sketch following this list applies one filter across a small image.

  • Recurrent neural networks can be used for sequences. An LSTM is a type of RNN designed to mitigate the vanishing gradient problem (see the fifth sketch following this list).

  • For text, attention computes a new embedding for each word as an expectation over the embeddings of other words, weighted by how strongly the words are related. Attention is also used for speech, vision, and other tasks. (The last sketch following this list implements self-attention.)

  • Transformers, built from layers of attention and linear transformations, are the workhorse of modern language processing, computer vision, and biology.

  • Neural networks underlie generative AI, producing images, text, code, molecules, and other structured output.

  • Neural networks are very successful for applications where there are large training sets, or where training data can be generated from a model.

  • It can be dangerous to make decisions based on data of dubious quality; large quantity and high quality are difficult to achieve together.
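
The sketches that follow illustrate several of the points above. They use Python with NumPy, and all function names, network sizes, weights, and hyperparameter values are illustrative choices rather than code from the chapter. The first sketch shows the three standard output functions: linear for a real-valued prediction, sigmoid for a Boolean prediction, and softmax for a categorical prediction.

    import numpy as np

    def linear_output(z):
        # Real-valued prediction: the activation is returned unchanged.
        return z

    def sigmoid_output(z):
        # Boolean prediction: squash the activation to a probability in (0, 1).
        return 1.0 / (1.0 + np.exp(-z))

    def softmax_output(z):
        # Categorical prediction: one probability per class, summing to 1.
        z = z - np.max(z)      # subtract the maximum for numerical stability
        e = np.exp(z)
        return e / np.sum(e)

    z = np.array([2.0, -1.0, 0.5])
    print(linear_output(z[0]), sigmoid_output(z[0]))   # real-valued and Boolean outputs
    print(softmax_output(z))                           # distribution over three classes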
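
The second sketch evaluates a network with one ReLU hidden layer and a linear output at several inputs. With the arbitrary weights below it computes a piecewise linear function of its input; the linear pieces meet where hidden units switch between off and on.

    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    # Three ReLU hidden units and a linear output; the weights are arbitrary.
    W1 = np.array([[1.0], [-1.0], [0.5]]); b1 = np.array([0.0, 1.0, -0.5])
    W2 = np.array([1.0, 2.0, -1.0]);       b2 = 0.3

    def f(x):
        h = relu(W1 @ np.array([x]) + b1)   # each hidden unit is on or off
        return W2 @ h + b2                  # linear output

    for x in np.linspace(-3.0, 3.0, 7):
        print(round(x, 1), round(float(f(x)), 3))   # linear pieces joined at kinks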
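
The third sketch trains a one-hidden-layer regression network on toy data: the gradients are computed by backpropagation and the parameters are updated by gradient descent with momentum. The learning rate, momentum parameter, data, and network size are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy regression data (the target function is an arbitrary example).
    X = rng.uniform(-1, 1, size=(200, 1))
    y = np.sin(3 * X) + 0.1 * rng.normal(size=X.shape)

    # One ReLU hidden layer, linear output.
    W1 = rng.normal(0, 0.5, size=(1, 16)); b1 = np.zeros(16)
    W2 = rng.normal(0, 0.5, size=(16, 1)); b2 = np.zeros(1)
    params = [W1, b1, W2, b2]
    velocity = [np.zeros_like(p) for p in params]   # momentum buffers

    lr, beta = 0.05, 0.9                            # step size and momentum

    for step in range(500):
        # Forward pass.
        h_pre = X @ W1 + b1
        h = np.maximum(0.0, h_pre)                  # ReLU hidden layer
        pred = h @ W2 + b2
        err = pred - y
        loss = np.mean(err ** 2)

        # Backward pass: backpropagate the mean squared error.
        d_pred = 2 * err / len(X)
        dW2 = h.T @ d_pred
        db2 = d_pred.sum(axis=0)
        d_h = d_pred @ W2.T
        d_h_pre = d_h * (h_pre > 0)                 # derivative of ReLU
        dW1 = X.T @ d_h_pre
        db1 = d_h_pre.sum(axis=0)

        # Gradient descent with momentum: step against the gradient.
        for p, v, g in zip(params, velocity, [dW1, db1, dW2, db2]):
            v *= beta
            v -= lr * g
            p += v

    print(f"final training loss: {loss:.4f}")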
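
The fourth sketch applies one filter at every valid position of a small image, the basic operation of a convolutional layer. Here the filter is written by hand; in a convolutional neural network the filter weights are learned.

    import numpy as np

    def conv2d(image, kernel):
        # Slide one filter over every valid position of a 2-D grid
        # (no padding, stride 1); the same weights are reused everywhere.
        H, W = image.shape
        kh, kw = kernel.shape
        out = np.zeros((H - kh + 1, W - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    image = np.arange(36, dtype=float).reshape(6, 6)
    edge_filter = np.array([[1.0, 0.0, -1.0]] * 3)   # hand-made vertical-edge filter
    print(conv2d(image, edge_filter))                # a 4 x 4 grid of filter responses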
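
The fifth sketch steps an LSTM cell over a short sequence. The forget, input, and output gates control what is written to and read from the cell state, and the additive cell-state update is what lets gradients flow over long sequences; the random weights are stand-ins for learned parameters.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step(x, h, c, W, U, b):
        # One LSTM step: x is the input, h the hidden state, c the cell state.
        n = h.shape[0]
        z = W @ x + U @ h + b          # all four gate pre-activations at once
        f = sigmoid(z[:n])             # forget gate
        i = sigmoid(z[n:2 * n])        # input gate
        o = sigmoid(z[2 * n:3 * n])    # output gate
        g = np.tanh(z[3 * n:])         # candidate values
        c_new = f * c + i * g          # additive cell-state update
        h_new = o * np.tanh(c_new)
        return h_new, c_new

    rng = np.random.default_rng(1)
    d_in, d_hid = 3, 4
    W = rng.normal(size=(4 * d_hid, d_in))
    U = rng.normal(size=(4 * d_hid, d_hid))
    b = np.zeros(4 * d_hid)

    h, c = np.zeros(d_hid), np.zeros(d_hid)
    for x in rng.normal(size=(5, d_in)):   # run the cell over a length-5 sequence
        h, c = lstm_step(x, h, c, W, U, b)
    print(h)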
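
The last sketch implements scaled dot-product self-attention: each word's new embedding is an expectation over value vectors, weighted by how well its query matches every key. Stacking such attention layers with position-wise linear transformations, residual connections, and normalization gives the layers of a transformer. The embeddings and projection matrices below are random stand-ins for learned parameters.

    import numpy as np

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        # Each output row is an expectation over the rows of V, weighted by
        # how well the corresponding query matches each key.
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)        # pairwise query-key compatibility
        weights = softmax(scores, axis=-1)   # each row sums to 1
        return weights @ V

    rng = np.random.default_rng(2)
    n_words, d_model = 5, 8
    E = rng.normal(size=(n_words, d_model))  # toy word embeddings

    # In self-attention, queries, keys, and values are linear transformations
    # of the same embeddings.
    Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
    out = attention(E @ Wq, E @ Wk, E @ Wv)
    print(out.shape)                         # (5, 8): one new embedding per word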