7.5.1 Maximum A Posteriori Probability and Minimum Description Length

One way to trade off model complexity and fit to the data is to choose the model that is most likely, given the data. That is, choose the model that maximizes the probability of the model given the data, P(model|data). The model that maximizes P(model|data) is called the maximum a posteriori probability model, or the MAP model.

The probability of a model (or a hypothesis) given some data is obtained by using Bayes' rule:

P(model|data) = (P(data|model)×P(model))/(P(data)) .

The likelihood, P(data|model), is the probability that this model would have produced this data set. It is high when the model is a good fit to the data, and it is low when the model would have predicted different data. The prior probability of the model, P(model), encodes the learning bias and specifies which models are a priori more likely. It is needed to bias the learning toward simpler models; typically, simpler models are given a higher prior probability. The denominator, P(data), is a normalizing constant that ensures the posterior probabilities of the models sum to 1.

Because the denominator of Bayes' rule above, P(data), does not depend on the model, it can be ignored when choosing the most likely model. Thus, the MAP model is the model that maximizes

P(data|model)×P(model) .
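As an illustration of maximizing P(data|model)×P(model), the following minimal sketch selects a MAP model from a small, hypothetical model space: three candidate values for the probability that a coin lands heads. The candidate values, the prior, and the function names are assumptions made for this example only. The sketch works with logarithms, which leaves the maximizing model unchanged and avoids numerical underflow on larger data sets.

```python
import math

def log_likelihood(p_heads, heads, tails):
    """log P(data | model) for a coin whose probability of heads is p_heads."""
    return heads * math.log(p_heads) + tails * math.log(1 - p_heads)

def ml_model(models, heads, tails):
    """Model that maximizes P(data | model) alone."""
    return max(models, key=lambda m: log_likelihood(m, heads, tails))

def map_model(models, prior, heads, tails):
    """Model that maximizes P(data | model) * P(model), computed in log space."""
    return max(
        models,
        key=lambda m: log_likelihood(m, heads, tails) + math.log(prior[m]),
    )

# Hypothetical model space: three candidate values of P(heads).
models = [0.5, 0.7, 0.9]
# Hypothetical prior that favours the fair coin (the "simpler" model here).
prior = {0.5: 0.8, 0.7: 0.15, 0.9: 0.05}

small_sample = (7, 3)    # 7 heads, 3 tails
large_sample = (70, 30)  # same proportion, ten times as much data

map_small = map_model(models, prior, *small_sample)
ml_small = ml_model(models, *small_sample)
map_large = map_model(models, prior, *large_sample)
print(map_small, ml_small, map_large)
```

With only ten observations the prior dominates and the MAP model is the fair coin, even though 0.7 fits the data better; with one hundred observations in the same proportion, the likelihood outweighs the prior and the MAP model becomes 0.7.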

One alternative is to choose the maximum likelihood model, the model that maximizes P(data|model). The problem with choosing the maximum likelihood model is that, if the space of models is rich enough, a model exists that specifies that this particular data set will be produced, which has P(data|model)=1. Such a model may be a priori very unlikely. However, we do not want to exclude it, because it may be the true model. Choosing the maximum likelihood model is equivalent to choosing the maximum a posteriori model with a uniform prior over hypotheses.
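The following sketch illustrates this problem under stated assumptions. The model space contains two coin models and a hypothetical "memorizer" model that assigns probability 1 to exactly the observed sequence. The data, the model names, and both priors are invented for the example. With a uniform prior, the selection coincides with maximum likelihood and picks the memorizer; with a prior that makes the memorizer very unlikely (but not impossible), a coin model is selected instead.

```python
import math

# Hypothetical observed sequence of coin flips: 7 heads, 3 tails.
data = "HHTHHTHHHT"

def bernoulli_likelihood(p):
    """Return a function computing P(sequence | coin with P(H) = p)."""
    def likelihood(seq):
        return math.prod(p if flip == "H" else 1 - p for flip in seq)
    return likelihood

def memorizer_likelihood(memorized):
    """Return a model that predicts one specific sequence with probability 1."""
    return lambda seq: 1.0 if seq == memorized else 0.0

# Each model is represented by its likelihood function.
models = {
    "fair coin": bernoulli_likelihood(0.5),
    "biased coin (0.7)": bernoulli_likelihood(0.7),
    "memorizer": memorizer_likelihood(data),
}

def best_model(prior):
    """Name of the model that maximizes P(data | model) * P(model)."""
    return max(models, key=lambda name: models[name](data) * prior[name])

# A uniform prior makes the MAP choice identical to the maximum likelihood choice.
uniform_prior = {name: 1 / len(models) for name in models}

# Hypothetical prior: the memorizer is very unlikely a priori, but not excluded.
skeptical_prior = {
    "fair coin": 0.6,
    "biased coin (0.7)": 0.4 - 1e-6,
    "memorizer": 1e-6,
}

print(best_model(uniform_prior))
print(best_model(skeptical_prior))
```

Because the memorizer keeps a nonzero prior, it is not ruled out; it would still be selected if every other model assigned the data a sufficiently small probability.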