10.1 Probabilistic Learning

One principled way to choose a model is choose a model that is most likely, given the data. That is, given a data set $E$, choose a model $m$ that maximizes the probability of the model given the data, $P(m\mid E)$. The model that maximizes $P(m\mid E)$ is called the maximum a posteriori probability model, or the MAP model.

The probability of model $m$ given examples $E$ is obtained by using Bayes’ rule:

 $\displaystyle P(m\mid E)$ $\displaystyle=\frac{P(E\mid m)*P(m)}{P(E)}~{}.$ (10.1)

The likelihood, $P(E\mid m)$, is the probability that this model would have produced this data set. It is high when the model is a good fit to the data, and it is low when the model would have predicted different data. The prior probability $P(m)$ encodes the learning bias and specifies which models are a priori more likely. The prior probability of the model, $P(m)$, is used to bias the learning toward simpler models. Typically, simpler models have a higher prior probability. Using the prior is a form of regularization. The denominator $P(E)$, called the partition function, is a normalizing constant to make sure that the probabilities sum to 1.

Because the denominator of Equation 10.1 is independent of the model, it may be ignored when choosing the most likely model. Thus, the MAP model is the model that maximizes

 ${P(E\mid m)*P(m)~{}.}$ (10.2)

One alternative is to choose the maximum likelihood model – the model that maximizes $P(E\mid m)$. The problem with choosing the maximum likelihood model is that, if the space of models is rich enough, a model exists that specifies that this particular data set will be produced, which has $P(E\mid m)=1$. Such a model may be a priori very unlikely. However, we may not want to exclude such a model, because it may be the true model, and, given enough data it might be the best model. Choosing the maximum likelihood model is equivalent to choosing the maximum a posteriori model with a uniform prior over hypotheses. Ockham’s razor suggests instead that we should prefer simpler hypotheses over more complex hypotheses.