### 7.5.1 Maximum A Posteriori Probability and Minimum Description Length

One way to trade off model complexity and fit to the data
is to choose the model that is most likely, given the data. That is,
choose the model that maximizes the probability of the model
given the data, *P(model|data)*. The model that
maximizes *P(model|data)* is called the **maximum a
posteriori probability** model, or the **MAP model**.

The probability of a model (or a hypothesis) given some data is obtained by using Bayes' rule:

P(model|data) = (P(data|model)×P(model))/(P(data)) .

The likelihood, *P(data|model)*, is the probability that this model would have
produced this data set. It is high when the model is a good fit to the
data, and it is low when the model would have predicted different data.
The prior, *P(model)*, encodes the **learning
bias**: it specifies which models are
a priori more likely. A prior is
required to bias the learning toward simpler models, so
simpler models are typically given a higher prior probability. The
denominator, *P(data)*, is a normalizing constant that ensures the
posterior probabilities of the models sum to 1.
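As a minimal sketch of how the three terms of Bayes' rule fit together, the following Python computes a posterior over a small, finite set of models. The coin models, the prior values, and the data are all hypothetical and chosen only for illustration; the sketch also assumes the candidate models are exhaustive, so that *P(data)* can be computed by summing over them.

```python
from math import prod

# Hypothetical candidate models: each is a coin with a fixed P(heads).
models = {"fair": 0.5, "biased_heads": 0.8, "biased_tails": 0.2}

# Hypothetical prior P(model); the values are assumptions that sum to 1.
prior = {"fair": 0.6, "biased_heads": 0.2, "biased_tails": 0.2}

# Hypothetical observed data: a sequence of coin flips.
data = ["H", "H", "T", "H", "H"]


def likelihood(p_heads, flips):
    """P(data|model) for independent flips of a coin with P(heads) = p_heads."""
    return prod(p_heads if flip == "H" else 1 - p_heads for flip in flips)


def posterior(models, prior, flips):
    """P(model|data) for every model, obtained with Bayes' rule."""
    # Numerator of Bayes' rule: P(data|model) * P(model).
    unnormalized = {m: likelihood(p, flips) * prior[m] for m, p in models.items()}
    # Denominator P(data): the sum of the numerators over all models.
    p_data = sum(unnormalized.values())
    return {m: value / p_data for m, value in unnormalized.items()}


post = posterior(models, prior, data)
for name, probability in post.items():
    print(f"P({name}|data) = {probability:.3f}")
```

Dividing by `p_data` is what makes the printed posterior values sum to 1; it does not change their relative order.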

Because the denominator of Equation (7.5.1) is independent of the model, it can be ignored when choosing the most likely model. Thus, the MAP model is the model that maximizes

P(data|model)×P(model) .
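The selection rule above can be sketched in a few lines of Python. The model names and the numbers are hypothetical. The sketch works with sums of logarithms rather than products, a common choice when likelihoods are very small, and it assumes every likelihood and prior is strictly positive so that the logarithms are defined.

```python
import math


def map_model(likelihood, prior):
    """Return the model that maximizes P(data|model) * P(model).

    The comparison uses log P(data|model) + log P(model). Because log is
    monotonically increasing, this gives the same maximizer as the product.
    """
    return max(likelihood, key=lambda m: math.log(likelihood[m]) + math.log(prior[m]))


# Hypothetical values of P(data|model): the more complex model fits better.
likelihood = {"simple": 0.02, "medium": 0.05, "complex": 0.20}

# Hypothetical prior P(model) that favors simpler models.
prior = {"simple": 0.70, "medium": 0.25, "complex": 0.05}

best = map_model(likelihood, prior)
print(best)  # simple

# Normalizing by P(data) rescales every score by the same constant,
# so it does not change which model is selected.
p_data = sum(likelihood[m] * prior[m] for m in likelihood)
posterior = {m: likelihood[m] * prior[m] / p_data for m in likelihood}
print(max(posterior, key=posterior.get) == best)  # True
```

With these numbers the complex model has the highest likelihood, but its low prior means the simple model has the highest product.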

One alternative is to choose the **maximum likelihood model**, the model that maximizes *P(data|model)*. The problem with choosing
the maximum likelihood model is
that, if the space of models is rich enough, a model exists that
specifies that this particular data set will be produced, which has
*P(data|model)=1*. Such a model may be a priori very
unlikely. However, we do not want to exclude it, because it may be
the true model. Choosing the maximum likelihood model is equivalent
to choosing the maximum a posteriori model with a uniform prior over hypotheses.
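To make the contrast concrete, here is a small, hypothetical Python sketch with two models: a fair coin, and a "memorizer" that predicts exactly the observed sequence of five flips with probability 1. The names and prior values are assumptions made for illustration. The memorizer is given a small but non-zero prior, so it is not excluded.

```python
def select(likelihood, prior):
    """Return the model that maximizes P(data|model) * P(model)."""
    return max(likelihood, key=lambda m: likelihood[m] * prior[m])


# P(data|model) for one particular sequence of five coin flips.
likelihood = {
    "fair_coin": 0.5 ** 5,  # every specific sequence of 5 flips has probability 1/32
    "memorizer": 1.0,       # predicts exactly the observed sequence
}

# Maximum likelihood ignores the prior entirely.
ml_choice = max(likelihood, key=likelihood.get)

# A uniform prior multiplies every likelihood by the same constant,
# so the MAP choice under it matches the maximum likelihood choice.
uniform_prior = {m: 1 / len(likelihood) for m in likelihood}
map_uniform = select(likelihood, uniform_prior)

# A hypothetical prior that considers the memorizer a priori very unlikely.
skeptical_prior = {"fair_coin": 0.99, "memorizer": 0.01}
map_skeptical = select(likelihood, skeptical_prior)

print(ml_choice)      # memorizer
print(map_uniform)    # memorizer
print(map_skeptical)  # fair_coin
```

Under the skeptical prior the memorizer still receives a non-zero posterior probability; it simply is not the maximum for this data set.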