8.6 Other Neural Network Models

8.6.1 Autoencoders

An encoder takes an input and maps it into a vector that can be used for prediction. The dimension of a vector is the number of values in the vector. Figure 8.10 shows a network that embeds words into a vector of smaller dimension than the one-hot embedding on the input. Figure 8.13 shows an encoder for sequences. An encoder carries out dimensionality reduction when the encoding has a smaller dimension than the input.

An autoencoder is an encoder where the inputs and the outputs are the same. For images, an autoencoder can take an image, map it to a vector, and then map that vector back into the same image. The vector forms a representation of the image, or an encoding of the image. The network learns both to compress (the part of the network between the input and the representation) and to decompress (the part of the network between the representation and the output) images.
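
As a concrete illustration, here is a minimal autoencoder sketch in PyTorch, assuming 28x28 grayscale images flattened to vectors and an invented 32-dimensional representation; the layer sizes and training details are illustrative choices, not prescribed by the text.

    import torch
    import torch.nn as nn

    # Encoder (compressor): maps a flattened 28*28 image to a 32-dimensional vector.
    encoder = nn.Sequential(
        nn.Linear(28 * 28, 256), nn.ReLU(),
        nn.Linear(256, 32),
    )
    # Decoder (decompressor): maps the representation back to an image.
    decoder = nn.Sequential(
        nn.Linear(32, 256), nn.ReLU(),
        nn.Linear(256, 28 * 28), nn.Sigmoid(),   # pixel values in [0, 1]
    )
    autoencoder = nn.Sequential(encoder, decoder)
    optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)

    def train_step(images):      # images: tensor of shape (batch, 28*28), values in [0, 1]
        reconstruction = autoencoder(images)
        loss = nn.functional.mse_loss(reconstruction, images)  # the target is the input itself
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()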

An autoencoder can be used as a generative image model, to generate random images or images from text. Consider an autoencoder for images, where the vector representation is small enough so there is little redundant information. An image is mapped to an encoding, which is mapped back to the same image. As the dimensionality of the encoding increases, more detail can be represented. If a random bit vector is used as the embedding, the decoder can be used to produce images, similar to how a decoder produced text in Example 8.11. When the network producing them is a deep network, the generated images are deep fakes. This method by itself does not produce very good images; see Section 8.6.2 below.
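
Continuing the sketch above, generating an image is a single forward pass through the decoder on a random bit vector used as the encoding:

    z = torch.randint(0, 2, (1, 32)).float()   # a random bit vector as the encoding
    image = decoder(z).view(28, 28)            # decode it into a (not very realistic) image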

An autoencoder can also be used to generate images from text. Given an autoencoder for images, you can train a network to predict the embedding of an image from its caption; the embedding is treated as the target for the language model. Then, given a caption, the language model can produce a vector representation of an image, and the decoder part of the autoencoder can be used to produce an image. The autoencoder is useful here because it allows for a form of semi-supervised learning where not all of the images have captions: all of the images are used to train the autoencoder, and the images with captions are used to train the caption-to-embedding mapping. It is easier to get images without captions than images with captions.
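
A sketch of the caption-to-embedding training, reusing the trained encoder and decoder above and assuming a hypothetical text_model that maps a batch of captions to 32-dimensional vectors (for instance, an encoder for sequences as in Figure 8.13):

    text_optimizer = torch.optim.Adam(text_model.parameters(), lr=1e-3)

    def caption_train_step(captions, images):
        with torch.no_grad():                  # the autoencoder is already trained
            target = encoder(images)           # the image embedding is the target
        loss = nn.functional.mse_loss(text_model(captions), target)
        text_optimizer.zero_grad()
        loss.backward()
        text_optimizer.step()

    # At generation time: image = decoder(text_model(caption)).view(28, 28)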

8.6.2 Adversarial Networks

An adversarial network, such as a generative adversarial network (GAN), is trained both so that some output can be predicted and so that some other output cannot be predicted.

One example is in adversarial debiasing for recruitment, where you want to predict whether someone is suitable for a job, but to be blind to race or gender. It is not enough to just remove race and gender from the inputs, because other inputs are correlated with race and gender (e.g., postcodes, poverty, current occupation). One way to make the network blind to race and gender is to create a layer from which race or gender cannot be predicted. The idea is to train the network to predict suitability, but have an adversary that adjusts the embedding of an intermediate layer so that race and gender cannot be predicted from it.
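
One common way to implement this idea, though the text does not commit to a particular mechanism, is a gradient-reversal layer: the adversary learns to predict the protected attribute from the intermediate layer, while the reversed gradient pushes the shared layers to remove that information. A minimal sketch with invented layer sizes:

    import torch
    import torch.nn as nn

    class GradientReversal(torch.autograd.Function):
        """Identity on the forward pass; negates the gradient on the backward pass."""
        @staticmethod
        def forward(ctx, x):
            return x.view_as(x)
        @staticmethod
        def backward(ctx, grad_output):
            return -grad_output

    trunk = nn.Sequential(nn.Linear(20, 64), nn.ReLU())   # shared intermediate layer
    suitability_head = nn.Linear(64, 1)                   # predicts job suitability
    adversary_head = nn.Linear(64, 1)                     # tries to predict the protected attribute

    def losses(x, suitability_target, protected_target):  # targets: float tensors of shape (batch, 1)
        h = trunk(x)
        main_loss = nn.functional.binary_cross_entropy_with_logits(
            suitability_head(h), suitability_target)
        # The adversary sees h through the reversal layer: its own weights learn to
        # predict the protected attribute, but the trunk receives the negated gradient,
        # making h less predictive of race or gender.
        adv_loss = nn.functional.binary_cross_entropy_with_logits(
            adversary_head(GradientReversal.apply(h)), protected_target)
        return main_loss + adv_loss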

As another example, suppose there is a deep fake model able to generate images from vector representations, as in the decoder of the last section, for which the images are not very realistic. They can be made much more realistic by also training a network N that, given an image, predicts whether the image is real or fake. The image generator can then be trained so that N cannot determine if its output is real or fake. This might produce details that make the output look realistic, even though the details might be just fiction.
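
A sketch of this adversarial training loop, reusing the decoder above as the image generator and an invented discriminator playing the role of N; the optimizers and sizes are illustrative:

    discriminator = nn.Sequential(             # the network N: real vs. fake
        nn.Linear(28 * 28, 256), nn.ReLU(),
        nn.Linear(256, 1),
    )
    g_opt = torch.optim.Adam(decoder.parameters(), lr=2e-4)
    d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
    bce = nn.functional.binary_cross_entropy_with_logits

    def gan_step(real_images):                 # real_images: (batch, 28*28)
        batch = real_images.shape[0]
        fake = decoder(torch.rand(batch, 32))
        # Train N to tell real (label 1) from generated (label 0) images.
        d_loss = (bce(discriminator(real_images), torch.ones(batch, 1))
                  + bce(discriminator(fake.detach()), torch.zeros(batch, 1)))
        d_opt.zero_grad(); d_loss.backward(); d_opt.step()
        # Train the generator so that N classifies its output as real.
        g_loss = bce(discriminator(fake), torch.ones(batch, 1))
        g_opt.zero_grad(); g_loss.backward(); g_opt.step()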

8.6.3 Diffusion Models

Diffusion models are effective methods for generative AI, particularly for image generation. The idea is to build a sequence of increasingly noisy images: starting with the data, noise is repeatedly added until the result is indistinguishable from pure noise, and then the inverse of this process is learned.

Suppose an input image is x_0, and x_t is produced from x_{t-1} by adding noise, until x_T, for some T (say 1000), is indistinguishable from noise. T neural networks are trained, where network N_t is trained with input x_t and target x_{t-1}. Thus each network learns to denoise – reduce the amount of noise in – an image. An image can be generated by starting with random noise, y_T, and running it through the networks N_T to N_1 to produce an image y_0. Different noise inputs produce different images.
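
A toy sketch of this scheme, using one small network per step exactly as described above (practical diffusion models instead share one network conditioned on t, among other refinements); T, the noise schedule, and the sizes are invented for illustration:

    import torch
    import torch.nn as nn

    T = 10                                       # a small T for illustration
    beta = 0.1                                   # assumed fixed noise level per step

    def add_noise(x):                            # one forward (noising) step
        return (1 - beta) ** 0.5 * x + beta ** 0.5 * torch.randn_like(x)

    # One denoising network per step; nets[0] is unused so indices match subscripts.
    nets = [nn.Sequential(nn.Linear(28 * 28, 256), nn.ReLU(),
                          nn.Linear(256, 28 * 28)) for _ in range(T + 1)]
    opts = [torch.optim.Adam(n.parameters(), lr=1e-3) for n in nets]

    def diffusion_train_step(x0):                # x0: batch of clean images
        xs = [x0]
        for t in range(1, T + 1):                # build x_1 ... x_T by adding noise
            xs.append(add_noise(xs[-1]))
        for t in range(1, T + 1):                # N_t: input x_t, target x_{t-1}
            loss = nn.functional.mse_loss(nets[t](xs[t]), xs[t - 1])
            opts[t].zero_grad(); loss.backward(); opts[t].step()

    def generate():                              # start from pure noise and denoise
        y = torch.randn(1, 28 * 28)
        for t in range(T, 0, -1):                # run through N_T down to N_1
            y = nets[t](y)
        return y.view(28, 28)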

The details of diffusion models are beyond the scope of this book, with a theory and practice more sophisticated than the characterization here.