Artificial Intelligence - foundations of computational agents -- 7.5.1.1 MAP Learning of Decision Trees

Third edition of Artificial Intelligence: foundations of computational agents, Cambridge University Press, 2023 is now available (including the full text).

7.5.1.1 MAP Learning of Decision Trees

To understand MAP learning, consider how it can be used to learn decision trees. If there are no examples with the same values for the input features and different values for the target features, there are always decision trees that fit the data perfectly. If the training examples do not cover all of the assignments to the input variables, multiple trees will fit the data perfectly. However, with noise, none of these may be the best model. Not only do we want to compare the models that fit the data perfectly; we also want to compare those models with the models that do not necessarily fit the data perfectly. MAP learning provides a way to compare these models.

Suppose there are multiple decision trees that accurately fit the data. If model denotes one of those decision trees, P(data|model)=1. The preference for one decision tree over another depends on the prior probabilities of the decision trees; the prior probability encodes the learning bias. The preference for simpler decision trees over more complicated decision trees occurs because simpler decision trees have a higher prior probability.

Bayes' rule gives a way to trade off simplicity and ability to handle noise. Decision trees can handle noisy data by having probabilities at the leaves. When there is noise, larger decision trees fit the training data better, because the tree can account for random regularities (noise) in the training data. In decision-tree learning, the likelihood favors bigger decision trees; the more complicated the tree, the better it can fit the data. The prior distribution can favor smaller decision trees. When there is a prior distribution over decision trees, Bayes' rule specifies how to trade off model complexity and accuracy: The posterior probability of the model given the data is proportional to the product of the likelihood and the prior.

Example 7.18: Consider the data of Figure 7.1, where the learner is to predict the user's actions.

One possible decision tree is the one given on the left of Figure 7.4. Call this decision tree d₂. The likelihood of the data is P(data|d₂)=1. That is, d₂ accurately fits the data.

Another possible decision tree is one with no internal nodes, and a leaf that says to predict reads with probability (1)/(2). This is the most likely tree with no internal nodes, given the data. Call this decision tree d₀. The likelihood of the data given this model is

P(data|d₀) = ((1)/(2))⁹ ×((1)/(2))⁹ approx 0.00000149.

Another possible decision tree is one on the right of Figure 7.4, which just splits on Length, and with probabilities on the leaves given by P(reads|Length=long)=0 and P(reads|Length=short)=(9)/(11). Note that (9)/(11) is the empirical frequency of reads among the training set with Length=short. Call this decision tree d_1a. The likelihood of the data given this model is

P(data|d_1a) = 1⁷ ×((9)/(11))⁹ ×((2)/(11))² approx 0.0543.

Another possible decision tree is one that just splits on Thread, and with probabilities on the leaves given by P(reads|Thread=new)=(7)/(10) (as 7 out of the 10 examples with Thread=new have User Action=reads), and P(reads|Thread=follow Up)=(2)/(8). Call this decision tree d_1t. The likelihood of the data given d_1t is

P(data|d_1t) = ((7)/(10))⁷ ×((3)/(10))³ ×((6)/(8))⁶ ×((2)/(8))² approx 0.000025.

These are just four of the possible decision trees. Which is best depends on the prior on trees. The likelihood of the data is multiplied by the prior probability of the decision trees to determine the posterior probability of the decision tree.