#### 7.3.1.1 Searching for a Good Decision Tree

**Procedure**DecisionTreeLearner(

*X,Y,E*)

2:

**Inputs**

3:

*X*: set of input features,

*X={X*

_{1},...,X_{n}}4:

*Y*: target feature

5:

*E*: set of training examples

6:

**Output**

7: decision tree

8:

**if**stopping criterion is true

**then**

9:

**return**

*pointEstimate(Y,E)*

10:

**else**

11: Select feature

*X*, with domain

_{i}∈X*{v*

_{1},v_{2}}12: let

*E*

_{1}={e∈E: val(e,X_{i})=v_{1}}13: let

*T*

_{1}=DecisionTreeLearner(X \ {X_{i}},Y,E_{1})14: let

*E*

_{2}={e∈E: val(e,X_{i})=v_{2}}15: let

*T*

_{2}=DecisionTreeLearner(X \ {X_{i}},Y,E_{2})16:

**return**

*⟨X*

_{i}=v_{1}, T_{1}, T_{2}⟩17:

18:

**Procedure**DecisionTreeClassify(

*e,X,Y,DT*)

19:

**Inputs**

20:

*X*: set of input features,

*X={X*

_{1},...,X_{n}}21:

*Y*: target feature

22:

*e*: example to classify

23:

*DT*: decision tree

24:

**Output**

25: prediction on

*Y*for example

*e*

26:

**Local**

27:

*S*subtree of

*DT*

28:

*S←DT*

29:

**while**

*S*is an internal node of the form

*⟨X*

_{i}=v,T_{1},T_{2}⟩**do**

30:

**if**

*val(e,X*

_{i})=v**then**

31:

*S←T*

_{1}32:

**else**

33:

*S←T*

_{2}34:

35:

36:

**return**

*S*

A decision tree can be incrementally built from the top down by recursively
selecting a feature to split on and partitioning the training examples with
respect to that feature. In Figure 7.5, the procedure
*DecisionTreeLearner* learns a decision
tree for binary attributes. The decisions
regarding when to stop and which feature to split on are left
undefined. The procedure *DecisionTreeClassify* takes
in a decision tree produced by the learner and makes predictions for a new example.

The algorithm *DecisionTreeLearner* builds a decision tree from the top down as follows: The
input to the algorithm is a set of input features, a target feature,
and a set of examples. The learner first tests if some stopping criterion is
true. If the stopping criterion is true, it returns a point estimate
for *Y*, which is either a value for *Y*
or a probability distribution over the values for *Y*. If the
stopping criterion is not true, the learner
selects a feature *X _{i}* to split on, and for each value

*v*of this feature, it recursively builds a subtree for those examples with

*X*. The returned tree is represented here in terms of triples representing an if-then-else structure.

_{i}=v**Example 7.7:**Consider applying

*DecisionTreeLearner*to the classification data of Figure 7.1. The initial call is

*decisionTreeLearner([Author,Thread,Length,Where Read],User Action,*

*[e*

_{1},e_{2},...,e_{18}]).Suppose the stopping criterion is not true and the algorithm selects the feature
*Length* to split on. It then calls

*decisionTreeLearner([Where Read,Thread,Author],User Action,*

*[e*

_{1},e_{3},e_{4},e_{6},e_{9},e_{10},e_{12}]).All of these examples agree on the user action; therefore, the algorithm returns the prediction *skips*.
The second step of the recursive call is

*decisionTreeLearner([Where Read,Thread,Author],User Action,*

*[e*

_{2},e_{5},e_{7},e_{8},e_{11},e_{13},e_{14},e_{15},e_{16},e_{17},e_{18}]).Not all of the examples agree on the user action, so the algorithm selects a
feature to split on. Suppose it selects *Thread*.
Eventually, this recursive call returns the subtree for the case when *Length*
is *short*, such as

⟨Thread=new,reads,⟨Author=unknown,skips,reads⟩⟩.

The final result is

*<Length=long,skips,*

*⟨Thread=new,reads,⟨Author=unknown,skips,reads⟩⟩>*

The learning algorithm of Figure 7.5 leaves three choices unspecified:

- The stopping criterion is not defined. The learner should stop when there are no input features, when all of the examples have the same classification, or when no splitting would improve the classification ability of the resulting tree. The last is the most difficult criterion to test for; see below.
- What should be returned at the leaves is not defined. This is a point estimate because, at this step, all of the other input features are ignored. This prediction is typically the most likely classification, the median or mean value, or a probability distribution over the classifications. [See Exercise 7.9.]
- Which feature to select to split on is not defined. The aim is
to choose the feature that will result in the smallest tree. The
standard way to do this is to choose the
**myopically optimal**split: if the learner were only allowed one split, which single split would result in the best classification? With the sum-of-squares error, for each feature, determine the error of the resulting tree based on a single split on that feature. For the likelihood or the entropy, the myopically optimal split is the one that gives the maximum**information gain**. Sometimes information gain is used even when the optimality criterion is the sum-of-squares error. An alternative, the**Gini index**, is investigated in Exercise 7.10.

**Example 7.8:**Consider learning the user action from the data of Figure 7.1, where we split on the feature with the maximum information gain or we myopically choose the split that minimizes the entropy or maximizes the likelihood of the data. See Section 6.1.5 for the definition of information used here.

The information content of all examples with respect to
feature *User Action* is 1.0, because there are *9* examples with
*User Action=reads* and *9* examples with *User Action=skips*.

Splitting on *Author* partitions the examples into
*[e _{1}*,

*e*,

_{4}*e*,

_{5}*e*,

_{6}*e*,

_{9}*e*,

_{10}*e*,

_{12}*e*,

_{13}*e*,

_{14}*e*,

_{15}*e*,

_{16}*e*with

_{17}]*Author=known*and

*[e*,

_{2}*e*,

_{3}*e*,

_{7}*e*,

_{8}*e*,

_{11}*e*with

_{18}]*Author=unknown*, each of which is evenly split between the different user actions. The information gain for the test

*Author*is zero. In this case, finding out whether the author is known, by itself, provides no information about what the user action will be.

Splitting on *Thread* partitions the examples into
*[e _{1}*,

*e*,

_{2}*e*,

_{5}*e*,

_{8}*e*,

_{10}*e*,

_{12}*e*,

_{14}*e*,

_{15}*e*,

_{17}*e*and

_{18}]*[e*,

_{3}*e*,

_{4}*e*,

_{6}*e*,

_{7}*e*,

_{9}*e*,

_{11}*e*,

_{13}*e*. The first set of examples, those with

_{16}]*Thread=new*, contains

*3*examples with

*User Action=skips*and

*7*examples with

*User Action=reads*; thus, the information content of this set with respect to the user action is

-0.3 × log_{2}0.3 - 0.7 × log_{2}0.7 = 0.881

and so the information gain is 0.119.

Similarly, the examples with *Thread=old* divide up *6:2* according to
the user action and thus have information content *0.811*. The
expected information gain is thus *1.0 - [(10/18)×0.881 + (8/18)
×0.811] = 0.150*.

The test *Length* divides the examples into *[e _{1}*,

*e*,

_{3}*e*,

_{4}*e*,

_{6}*e*,

_{9}*e*,

_{10}*e*and

_{12}]*[e*,

_{2}*e*,

_{5}*e*,

_{7}*e*,

_{8}*e*,

_{11}*e*,

_{13}*e*,

_{14}*e*,

_{15}*e*,

_{16}*e*,

_{17}*e*. The former all agree on the value of

_{18}]*User Action*and so have information content zero. The user action divides the second set

*9:2*, and so the information is

*0.684*. Thus, the expected information gain by the test

*length*is

*1.0 - 11/18 ×0.684 = 0.582*. This is the highest information gain of any test and so

*Length*is chosen to split on.

In choosing which feature to split on, the information content before the test is the same for all tests, and so the learning agent can choose the test that results in the minimum expected information after the test.

The algorithm of Figure 7.5 assumes each input feature has only two values. This restriction can be lifted in two ways:

- Allow a multiway split. To split on a multivalued variable, there would be a child for each value in the domain of the variable. This means that the representation of the decision tree becomes more complicated than the simple if-then-else form used for binary features. There are two main problems with this approach. The first is what to do with values for which there are no training examples. The second is that, for most myopic splitting heuristics, including information gain, it is generally better to split on a variable with a larger domain because it produces more children and so can fit the data better than splitting on a feature with a smaller domain. [See Exercise 7.8.] However, splitting on a feature with a smaller domain keeps the representation more compact.
- Partition the domain into two disjoint subsets. When the
domain is totally ordered, such as if the domain is a
subset of the real numbers, the domain can be split into values
less than some threshold and those greater than the threshold. For
example, the children could correspond to
*X<v*and*X ≥ v*for some value*v*in the domain of*X*. A myopically optimal value for*v*can be chosen in one sweep through the data by sorting the data on the value of*X*and considering each split that partitions the values. When the domain does not have a natural ordering, a split can be performed on arbitrary subsets of the domain. In this case, the myopically optimal split can be found by sorting values that appear in the data on the probability of classification.

If there is noise in the data, a major problem of the preceding
algorithm is **overfitting** the data.
Overfitting occurs when the algorithm tries to fit
distinctions that appear in
the training data but do not appear in the unseen examples.
This occurs when random correlations exist in the training
data that are not reflected in the data set as a whole.
Section 7.5 discusses ways to detect overfitting.
There are two ways to overcome
the problem of overfitting in decision trees:

- Restrict the splitting to split only when the split is useful.
- Allow unrestricted splitting and then prune the resulting tree where it makes unwarranted distinctions.

The second method seems to work better in practice. One reason is that it is possible that two features together predict well but one of them, by itself, is not very useful, as shown in the following example.

**Example 7.9:**Suppose the aim is to predict whether a game of matching pennies is won or not. The input features are

*A*, whether the first coin is heads or tails;

*B*, whether the second coin is heads or tails; and

*C*, whether there is cheering. The target feature,

*W*, is true when there is a win, which occurs when both coins are heads or both coins are tails. Suppose cheering is correlated with winning. This example is tricky because

*A*by itself provides no information about

*W*, and

*B*by itself provides no information about

*W*. However, together they perfectly predict

*W*. A myopic split may first split on

*C*, because this provides the most myopic information. If all the agent is told is

*C*, this is much more useful than

*A*or

*B*. However, if the tree eventually splits on

*A*and

*B*, the split on

*C*is not needed. Pruning can remove

*C*as being useful, whereas stopping early will keep the split on

*C*.

A discussion of how to trade off model complexity and fit to the data is presented in Section 7.5.