10.1.4 Description Length

The negative of the logarithm (base 2) of Formula (10.2) is

 $(-\log_{2}P(E\mid m))+(-\log_{2}P(m)).$

This can be interpreted in terms of information theory. The first term is the number of bits it takes to describe the data given the model $m$, and the second term is the number of bits it takes to describe the model. A model that minimizes this sum is a minimum description length (MDL) model. The MDL principle is to choose the model that minimizes the number of bits it takes to describe both the model and the data given the model.

One way to think about the MDL principle is that the aim is to communicate the data as succinctly as possible. The model makes communication shorter. To communicate the data, first communicate the model, then communicate the data in terms of the model. The number of bits it takes to communicate the data using a model is the number of bits it takes to communicate the model plus the number of bits it takes to communicate the data in terms of the model.

As the logarithm function is monotonically increasing, the MAP model is the same the MDL model. Choosing a model with the highest posterior probability is the same as choosing a model with a minimum description length.

The description length gives us a way to have common units between probabilities and model complexity. They can both be described in terms of bits.

Example 10.7.

In Example 10.6, the definition of the priors on decision trees was left unspecified. The notion of a description length provides a basis for assigning priors to decision trees; consider how many bits it takes to describe a decision tree (see Exercise 7). One must be careful defining the codes, because each code should describe a unique decision tree, and each decision tree should be described by a unique code.

Defining a code is difficult. It is often useful to approximate the description length of the model. One way to approximate the description length is to consider just representing the probabilistic parameters of the model. Let $|m|$ be the number of probabilistic parameters of the model. Suppose $|E|$ is the number of training examples. There are at most $|E|+1$ different probabilities the model needs to distinguish. It takes $\log_{2}(|E|+1)$ bits to distinguish these probabilities. Thus, the problem of finding the MDL model can be approximated by minimizing

 $-\log_{2}P(E\mid m)+|m|*\log_{2}(|E|)$

This value is the Bayesian information criteria (BIC) score.

For a decision tree with probabilities at the leaves $|m|$ is the number of leaves. For a linear function or a neural network it is the number of numerical parameters.