Intelligence 2E

foundations of computational agents

The previous examples did not need the prior on the structure of models, as all the models were equally complex. However, for learning decision trees, you need a bias, typically in favor of smaller decision trees. The prior probability provides this bias.

If no two examples have the same values for the input features but different values for the target feature, there is always a decision tree that fits the data perfectly. If, in addition, the training examples do not cover every assignment to the input variables, multiple trees will fit the data perfectly. Moreover, for every unobserved assignment of values, there are decision trees that perfectly fit the training set and make opposite predictions on that assignment.

If there is a possibility of noise, it may be that none of the trees that perfectly fit the training set is the best model. We want to compare not only the models that fit the data perfectly, but also those models with models that do not fit the data perfectly. MAP learning provides a way to compare these models.

Suppose there are multiple decision trees that accurately fit the data. If $m$ denotes one of those decision trees, $P(E\mid m)=1$. The preference for one decision tree over another then depends only on the prior probabilities of the decision trees; the prior probability encodes the learning bias. The preference for simpler decision trees over more complicated decision trees reflects the fact that simpler decision trees have a higher prior probability.

Bayes’ rule gives a way to trade off simplicity and the ability to handle noise. Decision trees handle noisy data by having probabilities at the leaves. When there is noise, larger decision trees fit the training data better, because the tree can model random regularities (noise) in the training data. In decision tree learning, the likelihood favors bigger decision trees; the more complicated the tree, the better it can fit the data. The prior distribution typically favors smaller decision trees. When there is a prior distribution over decision trees, Bayes’ rule specifies how to trade off model complexity and accuracy. The posterior probability of the model given the data is proportional to the product of the likelihood and the prior.
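The trade-off described above can be written out as a short worked equation. This is a sketch in the notation already used here ($m$ for a model, $E$ for the training examples); the logarithmic form is an equivalent restatement added for illustration, not a formula taken from the text.

$$P(m\mid E)=\frac{P(E\mid m)\,P(m)}{P(E)}\propto P(E\mid m)\,P(m)$$

$$\arg\max_{m} P(m\mid E)=\arg\max_{m}\bigl[\log P(E\mid m)+\log P(m)\bigr]$$

Because $P(E)$ does not depend on $m$, it can be ignored when comparing models. In the log form, the first term rewards fit to the data and the second term rewards models with a high prior, such as smaller trees.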

Consider the data of Figure 7.1, where the learner is required to predict the user’s actions.

One possible decision tree is the one given on the left of Figure 7.6. Call this decision tree $d_2$. The likelihood of the data is $P(E\mid d_2)=1$. That is, $d_2$ accurately fits the data.

Another possible decision tree is one with no internal nodes, and a single leaf that predicts $\mathit{reads}$ with probability $\frac{1}{2}$. This is the most likely tree with no internal nodes, given the data. Call this decision tree $d_0$. The likelihood of the data given this model is

$$P(E\mid d_0)=\left(\frac{1}{2}\right)^{9}*\left(\frac{1}{2}\right)^{9}\approx 0.00000381.$$

Another possible decision tree is the one on the right of Figure 7.6, with one split on $\mathit{Length}$, and with probabilities on the leaves given by $P(\mathit{reads}\mid \mathit{Length}=\mathit{long})=0$ and $P(\mathit{reads}\mid \mathit{Length}=\mathit{short})=\frac{9}{11}$. Note that $\frac{9}{11}$ is the empirical frequency of $\mathit{reads}$ among the training examples with $\mathit{Length}=\mathit{short}$. Call this decision tree $d_{1a}$. The likelihood of the data given this model is

$$P(E\mid d_{1a})=1^{7}*\left(\frac{9}{11}\right)^{9}*\left(\frac{2}{11}\right)^{2}\approx 0.00543.$$

Another possible decision tree is one that just splits on $\mathit{Thread}$, with probabilities on the leaves given by $P(\mathit{reads}\mid \mathit{Thread}=\mathit{new})=\frac{7}{10}$ (as 7 out of the 10 examples with $\mathit{Thread}=\mathit{new}$ have $\mathit{User\_action}=\mathit{reads}$), and $P(\mathit{reads}\mid \mathit{Thread}=\mathit{follow\_up})=\frac{2}{8}$. Call this decision tree $d_{1t}$. The likelihood of the data given $d_{1t}$ is

$$P(E\mid d_{1t})=\left(\frac{7}{10}\right)^{7}*\left(\frac{3}{10}\right)^{3}*\left(\frac{6}{8}\right)^{6}*\left(\frac{2}{8}\right)^{2}\approx 0.000025.$$
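The likelihoods of the three probabilistic trees can be checked numerically. The following is a minimal sketch, assuming each tree is summarized by its leaves as (probability of reads, number of reads examples, number of skips examples); the function name `likelihood` and this leaf representation are illustrative choices, not notation from the text. The counts are those stated in the worked example.

```python
from fractions import Fraction
from math import prod


def likelihood(leaves):
    """Likelihood of the training data given a tree with probabilistic leaves.

    leaves is a list of (p_reads, n_reads, n_skips) tuples, one per leaf,
    where p_reads is the leaf's predicted probability of reads and the
    counts are the training examples that reach that leaf.
    """
    return prod(p ** n_reads * (1 - p) ** n_skips
                for p, n_reads, n_skips in leaves)


# Leaf summaries taken from the worked example (18 examples in total).
d0 = [(Fraction(1, 2), 9, 9)]                             # single leaf
d1a = [(Fraction(0), 0, 7), (Fraction(9, 11), 9, 2)]      # Length: long, short
d1t = [(Fraction(7, 10), 7, 3), (Fraction(2, 8), 2, 6)]   # Thread: new, follow_up

for name, tree in [("d0", d0), ("d1a", d1a), ("d1t", d1t)]:
    print(name, float(likelihood(tree)))
```

Using `Fraction` keeps the products exact until the final conversion to a float. The tree $d_2$ is omitted because its leaves are deterministic and fit every example, so its likelihood is 1 by construction.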

These are just four of the possible decision trees. Which is best depends on the prior on trees. The likelihood of the data is multiplied by the prior probability of the decision trees to determine the posterior probability of the decision tree.
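To illustrate how the choice of prior changes which tree is preferred, here is a hedged sketch that multiplies each likelihood by a hypothetical prior of the form `penalty ** splits`, where `splits` is the number of internal nodes. This prior, the names `posterior` and `map_tree`, and the split count used for $d_2$ are all assumptions made for illustration; the text does not specify a prior or the size of $d_2$. The result is normalized over these four trees only, not over all possible trees.

```python
# Likelihoods P(E | d), recomputed from the leaf counts in the worked example.
likelihoods = {
    "d2": 1.0,
    "d0": 0.5 ** 18,
    "d1a": (9 / 11) ** 9 * (2 / 11) ** 2,
    "d1t": 0.7 ** 7 * 0.3 ** 3 * 0.75 ** 6 * 0.25 ** 2,
}

# Number of splits (internal nodes) per tree. The value for d2 is an
# ASSUMED placeholder: the text does not state the size of that tree.
splits = {"d2": 3, "d0": 0, "d1a": 1, "d1t": 1}


def posterior(penalty):
    """Posterior over the four trees under the hypothetical prior
    P(d) proportional to penalty ** splits[d], with 0 < penalty <= 1."""
    unnormalized = {d: likelihoods[d] * penalty ** splits[d]
                    for d in likelihoods}
    total = sum(unnormalized.values())
    return {d: value / total for d, value in unnormalized.items()}


def map_tree(penalty):
    """Name of the tree with the highest posterior probability."""
    post = posterior(penalty)
    return max(post, key=post.get)


for penalty in (0.5, 0.05):
    print(penalty, map_tree(penalty))
```

Under these assumptions, a mild penalty per split leaves the perfectly fitting tree $d_2$ as the most probable, whereas a severe penalty shifts the preference to the one-split tree $d_{1a}$, even though it fits the data less well.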