Third edition of Artificial Intelligence: foundations of computational agents, Cambridge University Press, 2023 is now available (including the full text).

7.5.2 Cross Validation

The problem with the previous methods is that they require a notion of simplicity to be known before the agent has seen any data. It would seem as though an agent should be able to determine, from the data, how complicated a model needs to be. Such a method could be used when the learning agent has no prior information about the world.

The idea of cross validation is to split the training set into two: a set of examples to train with, and a validation set. The agent trains using the new training set. Prediction on the validation set is used to determine which model to use.

Consider a graph such as the one in Figure 7.13. The error of the training set gets smaller as the size of the tree grows. The idea of cross validation is to choose the representation in which the error of the validation set is a minimum. In these cases, learning can continue until the error of the validation set starts to increase.

The validation set that is used as part of training is not the same as the test set. The test set is used to evaluate how well the learning algorithm works as a whole. It is cheating to use the test set as part of learning. Remember that the aim is to predict examples that the agent has not seen. The test set acts as a surrogate for these unseen examples, and so it cannot be used for training or validation.

Typically, we want to train on as many examples as possible, because then we get better models. However, having a small validation set means that the validation set may fit well, or not fit well, just by luck. There are various methods that have been used to reuse examples for both training and validation.

One method, k-fold cross validation, is used to determine the best model complexity, such as the depth of a decision tree or the number of hidden units in a neural network. The method of k-fold cross validation partitions the training set into k sets. For each model complexity, the learner trains k times, each time using one of the sets as the validation set and the remaining sets as the training set. It then selects the model complexity that has the smallest average error on the validation set (averaging over the k runs). It can return the model with that complexity, trained on all of the data.