10.1 Probabilistic Learning

Training examples provide evidence that can be conditioned on. Bayes’ rule specifies how to determine the probability of model m given examples E:

$$P(m \mid E) = \frac{P(E \mid m)\,P(m)}{P(E)}. \tag{10.1}$$

The likelihood, $P(E \mid m)$, is the probability that this model would have produced this dataset. It is high when the model is a good fit to the data. The prior probability, $P(m)$, encodes a learning bias: it specifies which models are a priori more likely, and can be used to bias the learning toward simpler models. The denominator, $P(E)$, is the partition function, a normalizing constant that ensures the posterior probabilities of the models sum to 1.
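As a minimal sketch of Bayes' rule over models, the following computes $P(m \mid E)$ for a small, finite set of candidate models. The coin-flip setting, the three candidate biases, the prior values, and the function names are all assumptions made for illustration; computing $P(E)$ by summing over models also assumes the candidates are mutually exclusive and exhaustive.

```python
from math import prod

def posterior(models, prior, likelihood, examples):
    """Return P(m | E) for each model m in a finite set of models."""
    # Numerator of Bayes' rule for each model: P(E | m) * P(m).
    joint = {m: likelihood(m, examples) * prior[m] for m in models}
    # P(E), the normalizing constant: sum of the numerators over all models.
    evidence = sum(joint.values())
    return {m: joint[m] / evidence for m in models}

def coin_likelihood(theta, flips):
    """P(E | m) for a coin whose probability of heads is theta.

    flips is a list of 1 (heads) and 0 (tails), assumed independent.
    """
    return prod(theta if f == 1 else 1 - theta for f in flips)

# Hypothetical models: each one is a candidate probability of heads.
models = [0.2, 0.5, 0.8]
# Hypothetical prior that favors the fair coin.
prior = {0.2: 0.25, 0.5: 0.5, 0.8: 0.25}
# Hypothetical examples E: three heads and one tail.
flips = [1, 1, 0, 1]

post = posterior(models, prior, coin_likelihood, flips)
for m in models:
    print(f"P(theta={m} | E) = {post[m]:.3f}")
```

In this example the model with bias 0.8 has the highest likelihood, but the prior keeps the fair coin as the most probable model after only four flips.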

In Chapter 7, the aim was to fit the data as well as possible, using the maximum likelihood model (the model that maximizes $P(E \mid m)$), but we then had to use seemingly ad hoc regularization to avoid overfitting and to fit the test data better. One problem with choosing the maximum likelihood model is that, if the space of models is rich enough, a model exists that specifies that this particular dataset will be produced, which has $P(E \mid m)=1$. For example, a decision tree can represent any discrete function, but can overfit the training data.
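The problem with a rich model space can be illustrated with a small sketch. It assumes a toy dataset of Boolean inputs and labels and two hypothetical models: one that memorizes the training examples and one that ignores the input. The likelihood here is the probability the model assigns to the observed labels given the inputs, treating the examples as independent.

```python
from math import prod

# Hypothetical training examples: (input, label) pairs.
examples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

def likelihood(label_prob, examples):
    """P(E | m): product of the probabilities the model gives each observed label."""
    return prod(label_prob(x, y) for x, y in examples)

# Model 1: a lookup table built from the training data itself.
table = dict(examples)

def memorizer(x, y):
    if x in table:
        # Puts all probability on the label seen in training.
        return 1.0 if table[x] == y else 0.0
    # Arbitrary guess for inputs never seen in training.
    return 0.5

# Model 2: ignores the input and predicts label 1 with probability 0.75.
def constant(x, y):
    return 0.75 if y == 1 else 0.25

print(likelihood(memorizer, examples))
print(likelihood(constant, examples))
```

The memorizing model attains a likelihood of 1 on the training data, so maximum likelihood always prefers it, even though it says nothing useful about inputs it has not seen.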

The model that maximizes $P(m \mid E)$ is called the maximum a posteriori probability model, or MAP model. Because the denominator of Equation 10.1 is independent of the model, it may be ignored when choosing the most likely model. Thus, the MAP model is the model that maximizes

$$P(E \mid m)\,P(m). \tag{10.2}$$

It takes into account both the likelihood (fit to the data) and the prior, which can be used as a learning bias, such as a preference for simpler models.
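A minimal sketch of the difference between the two selection criteria follows. The candidate names, likelihoods, and priors are made-up numbers chosen only to illustrate the trade-off. The MAP choice is computed in log space, which gives the same maximizer as the product in Equation 10.2 and makes the log prior appear as an additive term alongside the log likelihood, much like a regularization penalty.

```python
from math import log

# Hypothetical candidates: (name, P(E | m), P(m)). All numbers are illustrative.
candidates = [
    ("memorizing tree", 1.0, 0.001),
    ("small tree", 0.2, 0.6),
    ("constant predictor", 0.05, 0.399),
]

def ml_choice(cands):
    """Maximum likelihood: pick the model with the largest P(E | m)."""
    return max(cands, key=lambda c: c[1])[0]

def map_choice(cands):
    """MAP: pick the model with the largest log P(E | m) + log P(m).

    Taking logs does not change which model wins, because log is increasing.
    Assumes every likelihood and prior is strictly positive.
    """
    return max(cands, key=lambda c: log(c[1]) + log(c[2]))[0]

print("ML model: ", ml_choice(candidates))
print("MAP model:", map_choice(candidates))
```

With these numbers, maximum likelihood selects the memorizing tree, while the prior's preference for simpler models shifts the MAP choice to the small tree.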