
7.5.1.2 Description Length

The negative of the logarithm (base 2) of Formula (7.5.1) is

(−log₂ P(data | model)) + (−log₂ P(model)).

This can be interpreted in terms of information theory. The first term is the number of bits it takes to describe the data given the model. The second term is the number of bits it takes to describe the model. A model that minimizes this sum is a minimum description length (MDL) model. The MDL principle is to choose the model that minimizes the number of bits it takes to describe both the model and the data given the model.

One way to think about the MDL principle is that the aim is to communicate the data as succinctly as possible, and the role of the model is to make that communication shorter. To communicate the data, first communicate the model, then communicate the data in terms of the model. The number of bits it takes to communicate the data using a model is thus the number of bits it takes to communicate the model plus the number of bits it takes to communicate the data in terms of the model. The MDL principle chooses the model that lets the data be communicated in as few bits as possible.
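For example, suppose the data are coin flips and the candidate models are Bernoulli models whose parameter is described with k bits of precision (the data, the model family, and the discretization here are assumptions made purely for illustration). A minimal Python sketch of the two-part code length, and of choosing the model that minimizes it, is:

    import math

    def description_length(data, theta, model_bits):
        # Two-part code: bits to describe the model plus bits to describe
        # the data given the model.  Each observation x in {0, 1} costs
        # -log2 P(x | theta) bits under a Bernoulli model with parameter theta.
        data_bits = sum(-math.log2(theta if x == 1 else 1 - theta) for x in data)
        return model_bits + data_bits

    data = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]

    # Candidate models: theta on a grid with k bits of precision, so
    # describing the model costs (approximately) k bits.
    candidates = [(description_length(data, i / 2**k, k), k, i / 2**k)
                  for k in range(1, 8)
                  for i in range(1, 2**k)]

    bits, k, theta = min(candidates)
    print(f"MDL model: theta = {theta} ({k}-bit code), {bits:.2f} bits in total")

On this data the minimum is attained by the coarse 2-bit description theta = 0.75; a finer parameter value would fit the data slightly better, but not by enough bits to pay for its longer description.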

As the logarithm function is monotonically increasing, maximizing Formula (7.5.1) is the same as minimizing its negative logarithm, so the MAP model is the same as the MDL model. The idea of choosing a model with the highest posterior probability is the same as choosing a model with the minimum description length.
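Written as one equation (in LaTeX notation), the equivalence is just this monotone transformation of Formula (7.5.1):

    \operatorname*{arg\,max}_{\mathit{model}} \, P(\mathit{data} \mid \mathit{model})\, P(\mathit{model})
      \;=\;
    \operatorname*{arg\,min}_{\mathit{model}} \, \Bigl[ \bigl(-\log_2 P(\mathit{data} \mid \mathit{model})\bigr) + \bigl(-\log_2 P(\mathit{model})\bigr) \Bigr]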

Example 7.19: In Example 7.18, the definition of the priors on decision trees was left unspecified. The notion of a description length provides a basis for assigning priors to decision trees; consider how many bits it takes to describe a decision tree [see Exercise 7.11]. One must be careful in defining the codes, because each code should describe a decision tree, and each decision tree should be described by a code.
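One possible scheme (a sketch only: the particular prefix code, the tree representation, and the function below are illustrative assumptions, not the code asked for in Exercise 7.11) uses one bit to mark each node as a leaf or a split, log₂ of the number of features to name the feature split on, and log₂ of the number of classes to name the classification at a leaf:

    import math

    def tree_description_bits(tree, num_features, num_classes):
        # Prefix code for a binary decision tree: 1 bit marks leaf vs. split,
        # log2(num_features) bits name the feature split on, and
        # log2(num_classes) bits name the class at a leaf.  A decoder reading
        # the bits left to right recovers exactly one tree, and each tree has
        # exactly one code.
        if tree[0] == 'leaf':                      # tree = ('leaf', class)
            return 1 + math.log2(num_classes)
        _, feature, left, right = tree             # tree = ('split', feature, left, right)
        return (1 + math.log2(num_features)
                + tree_description_bits(left, num_features, num_classes)
                + tree_description_bits(right, num_features, num_classes))

    # A small tree over four input features and two classes:
    tree = ('split', 0, ('leaf', 1), ('split', 2, ('leaf', 0), ('leaf', 1)))
    print(tree_description_bits(tree, num_features=4, num_classes=2))   # 12.0

A tree whose description takes d bits can then be given prior probability 2^(-d), so larger trees get lower priors, which is what the description-length view suggests.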