Even after the observation of the frequent or constant conjunction of objects, we have no reason to draw any inference concerning any object beyond those of which we have had experience.
– Hume [1739–40], Book I, Part 3, Section 12
One might think that one can learn just from the data, without any need to appeal to a bias or other extra information. This is not possible. A number of results show why preferences over hypotheses are needed.
The no-free-lunch theorem for machine learning specifies that no matter what the training set, for any two definitive predictors A and B, there are as many functions from the input features to the target features that are consistent with the evidence for which A is better than B on the off-training set (the examples not in the training set) as there are for which B is better than A on the off-training set. This includes the case where B is chosen randomly, or is even the worst predictor on the training set. Thus, if the functions are all equally likely to be the ground truth, no predictor is better than any other predictor.
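The symmetry in the theorem can be verified by brute force for tiny cases. The following sketch enumerates every Boolean function of two input features as a candidate ground truth and counts how often each predictor is more accurate off-training; the training set and the two predictors A and B are arbitrary choices for illustration, not taken from the text:

    from itertools import product

    # A minimal brute-force check of the no-free-lunch symmetry for two
    # Boolean input features. The training set and the predictors A and B
    # are arbitrary illustrative choices.
    inputs = list(product([0, 1], repeat=2))     # the 2^n = 4 assignments
    train = {(0, 0): 0, (0, 1): 1}               # a training set of size 2
    off_training = [x for x in inputs if x not in train]

    def A(x):   # fits the training set; predicts 0 on off-training examples
        return train.get(x, 0)

    def B(x):   # fits the training set; predicts xor on off-training examples
        return train.get(x, x[0] ^ x[1])

    def off_accuracy(predict, truth):
        return sum(predict(x) == truth[x] for x in off_training)

    a_better = b_better = ties = 0
    # Enumerate all 2^(2^n) = 16 functions as candidate ground truths,
    # keeping only those consistent with the training set.
    for bits in product([0, 1], repeat=len(inputs)):
        truth = dict(zip(inputs, bits))
        if any(truth[x] != y for x, y in train.items()):
            continue
        a, b = off_accuracy(A, truth), off_accuracy(B, truth)
        a_better += a > b
        b_better += b > a
        ties += a == b

    print(a_better, b_better, ties)              # prints: 2 2 0

Whichever training set and pair of predictors are substituted, the two counts come out equal.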
As a simple illustration of the theorem, consider n Boolean input features, where the aim is a Boolean classification. There are 2^n assignments of the input features, and so 2^(2^n) functions from assignments into {false, true}. With no probability to exploit (or assuming a uniform distribution over functions), the best one can do is to use 2^n bits to represent a function – one bit for each assignment – which is effectively just memorizing the training data. A training set of size m will set m of these bits, but the remaining 2^n − m bits – those that are used for off-training examples – are free to be assigned in any way.
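The arithmetic of this counting argument is easy to spot-check; here is a minimal sketch, with n = 3 and m = 5 as example values of my choosing:

    # Spot-check of the counting argument, with n = 3 input features and
    # a training set of size m = 5 (example values).
    n, m = 3, 5
    assignments = 2 ** n                  # 2^n = 8 rows in the truth table
    functions = 2 ** assignments          # 2^(2^n) = 256 Boolean functions
    consistent = 2 ** (assignments - m)   # 2^(2^n - m) = 8 fit the training set
    print(assignments, functions, consistent)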
This result does not mean that learning is impossible. Rather, learning is possible because not all functions are equally likely. Put another way, learning would be impossible if the world were random; because we can learn, the world cannot be random. The theorem does mean that you cannot learn from data alone, without any extra information; you need a bias.
The most common bias is to choose the simplest hypothesis that fits the data, appealing to Ockham’s razor. Which hypothesis is simplest depends on the language used to represent hypotheses and on how the data is represented, as shown in the following example.
Consider defining propositions using three Boolean variables, x, y, and z, as in Example 7.11. The set of assignments is shown in Figure 7.22(a). Suppose, instead, that the space of assignments was represented using variables a, b, and c, where a ≡ x ⊕ y ⊕ z, b ≡ x ⊕ y, and c ≡ y ⊕ z, where ⊕ is exclusive-or (Figure 5.1). The exclusive-or of propositions is true whenever an odd number of them are true. Figure 7.22(b) shows the truth values for a, b, and c given the values of x, y, and z.
x | y | z || a | b | c || if x then y else z
0 | 0 | 0 || 0 | 0 | 0 || 0
0 | 0 | 1 || 1 | 0 | 1 || 1
0 | 1 | 0 || 1 | 1 | 1 || 0
0 | 1 | 1 || 0 | 1 | 0 || 1
1 | 0 | 0 || 1 | 1 | 0 || 0
1 | 0 | 1 || 0 | 1 | 1 || 0
1 | 1 | 0 || 0 | 0 | 1 || 1
1 | 1 | 1 || 1 | 0 | 0 || 1
   (a)          (b)          (c)
Figure 7.22: (a) assignments to x, y, z; (b) the corresponding values of a, b, c; (c) the value of "if x then y else z"
As can be verified from the figure, the same space can be described using the features a, b, and c as using x, y, and z. What is simple in one representation may be complicated in the other. For example, x ⊕ y ⊕ z is complicated to represent using x, y, and z – it requires a big decision tree and is not linearly separable – but is trivial in the other representation; it is just a. Similarly, a ⊕ b ⊕ c is complicated to represent using a, b, and c, but is trivial in the other; it is just y. The function "if x then y else z" – see Figure 7.22(c) – is also more complicated in terms of one representation (a, b, and c) than the other.
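These identities can be checked mechanically. The following sketch (mine, not from the text) rebuilds Figure 7.22 from the definitions of a, b, and c and verifies that a ⊕ b ⊕ c collapses to y:

    from itertools import product

    # Rebuild Figure 7.22 from the definitions of a, b, and c, and verify
    # that a xor b xor c is just y: the x's and z's cancel, and y ^ y ^ y = y.
    for x, y, z in product([0, 1], repeat=3):
        a, b, c = x ^ y ^ z, x ^ y, y ^ z        # the re-representation
        assert a ^ b ^ c == y                    # complicated in a, b, c; trivial in y
        print(x, y, z, '|', a, b, c, '|', y if x else z)   # panels (a), (b), (c)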
One might think that x, y, and z form the simpler vocabulary because a, b, and c have a complex description in terms of x, y, and z. However, x, y, and z also have a complex description in terms of a, b, and c; see Exercise 7.15.
It might appear problematic that simplicity depends on the language and the vocabulary. However, language has evolved to be useful, so this dependence should not, by itself, be cause for suspicion. Data is often represented in a way that makes learning feasible and that makes a bias towards simplicity appropriate. For example, there are many ways to represent an image; representing an image as a grid of pixel values enables some learning algorithms to work well, whereas a representation that allows the image to be incrementally improved as it is delivered over time may cause the same algorithms to fail to learn.