7.11 Exercises

Exercise 7.1:
The aim of this exercise is to fill in the table of Figure 7.3.
1. Prove the optimal prediction for training data. To do this, find the minimum value of the absolute error, the sum-of-squares error, the entropy, and the value that gives the maximum likelihood. The maximum or minimum value is either an end point or where the derivative is zero.
2. To determine the best prediction for the test data, assume that the data cases are generated stochastically according to some true parameter p0. Try the following for a number of different values for p0 ∈[0,1]. Generate k training examples (try various values for k, some small, say 5, and some large, say 1,000) by sampling with probability p0; from these generate n0 and n1. Generate a test set that contains many test cases using the same parameter p0. For each of the optimality criteria - sum of absolute values, sum of squares, and likelihood (or entropy) - which of the following gives a lower error on the test set:
1. the mode
2. n1/(n0+n1)
3. if n1=0, use 0.001, if n0=0, use 0.999, else use n1/(n0+n1). (Try this for different numbers when the counts are zero.)
4. (n1+1)/(n0+n1+2)
5. (n1+α)/(n0+n1+2α) for different values of α>0
6. another predictor that is a function of n0 and n1.

You may have to generate many different training sets for each parameter. (For the mathematically sophisticated, can you prove what the optimal predictor is for each criterion?)

Exercise 7.2:
In the context of a point estimate of a feature with domain {0,1} with no inputs, it is possible for an agent to make a stochastic prediction with a parameter p∈[0,1] such that the agent predicts 1 with probability p and predicts 0 otherwise. For each of the following error measures, give the expected error on a training set with n0 occurrences of 0 and n1 occurrences of 1 (as a function of p). What is the value of p that minimizes the error? Is this worse or better than the prediction of Figure 7.3?
1. sum of absolute errors
2. sum-of-squares error
3. worst-case error
Exercise 7.3:
Suppose we have a system that observes a person's TV watching habits in order to recommend other TV shows the person may like. Suppose that we have characterized each show by whether it is a comedy, whether it features doctors, whether it features lawyers, and whether it has guns. Suppose we are given the examples of Figure 7.18 about whether the person likes various TV shows.
 Example Comedy Doctors Lawyers Guns Likes e1 false true false false false e2 true false true false true e3 false false true true true e4 false false true false false e5 false false false true false e6 true false false true false e7 true false false false true e8 false true true true true e9 false true true false false e10 true true true false true e11 true true false true false e12 false false false false false
Figure 7.18: Training examples for Exercise 7.3

We want to use this data set to learn the value of Likes (i.e., to predict which TV shows the person would like based on the attributes of the TV show).

You may find the AISpace.org applets useful for this assignment. (Before you start, see if you can see the pattern in what shows the person likes.)

1. Suppose the error is the sum of absolute errors. Give the optimal decision tree with only one node (i.e., with no splits). What is the error of this tree?
2. Do the same as in part (a), but with the sum-of-squares error.
3. Suppose the error is the sum of absolute errors. Give the optimal decision tree of depth 2 (i.e., the root node is the only node with children). For each leaf in the tree, give the examples that are filtered to that node. What is the error of this tree?
4. Do the same as in part (c) but with the sum-of-squares error.
5. What is the smallest tree that correctly classifies all training examples? Does a top-down decision tree that optimizes the information gain at each step represent the same function?
6. Give two instances that do not appear in the examples of Figure 7.18 and show how they are classified using the smallest decision tree. Use this to explain the bias inherent in the tree. (How does the bias give you these particular predictions?)
7. Is this data set linearly separable? Explain why or why not.
Exercise 7.4:
Consider the decision-tree learning algorithm of Figure 7.5 and the data of Figure 7.1. Suppose, for this question, the stopping criterion is that all of the examples have the same classification. The tree of Figure 7.4 was built by selecting a feature that gives the maximum information gain. This question considers what happens when a different feature is selected.
1. Suppose you change the algorithm to always select the first element of the list of features. What tree is found when the features are in the order [Author, Thread, Length, Where Read]? Does this tree represent a different function than that found with the maximum information gain split? Explain.
2. What tree is found when the features are in the order [Where Read, Thread, Length, Author]? Does this tree represent a different function than that found with the maximum information gain split or the one given for the preceding part? Explain.
3. Is there a tree that correctly classifies the training examples but represents a different function than those found by the preceding algorithms? If so, give it. If not, explain why.
Exercise 7.5:
Consider Equation (7.3.2), which gives the error of a linear prediction.
1. Give a formula for the weights that minimize the error for the case where n=1 (i.e., when there is only one input feature). [Hint: For each weight, differentiate with respect to that weight and set to zero.]
2. Give a set of equations for the weights that minimize the error for arbitrary n.
3. Why is it hard to minimize the error analytically when using a sigmoid linear function (i.e., a squashed linear function when the activation function is a sigmoid or logistic function)?
Exercise 7.6:
Suppose that, in the output of a neural network, we assign any value greater than 0.5 to be true and any less than 0.5 to be false (i.e., any positive value before the activation function is true, and a negative value is false).

Run the AISpace.org neural network learning applet on the data of Figure 7.9 for a neural network with two hidden nodes. Given the final parameter settings found, give a logical formula (or a decision tree or a set of rules) that represents the Boolean function that is the value for the hidden units and the output units. This formula or set of rules should not refer to any real numbers.

[Hint: One brute-force method is to go through the 16 combinations of values for the inputs to each hidden unit and determine the truth value of the output. A better method is to try to understand the functions themselves.]

Does the neural network learn the same function as the decision tree?

Exercise 7.7:
The aim of this exercise is to determine the size of the space of decision trees. Suppose there are n binary features in a learning problem. How many different decision trees are there? How many different functions are represented by these decision trees? Is it possible that two different decision trees give rise to the same function?
Exercise 7.8:
Extend the decision-tree learning algorithm of Figure 7.5 so that multivalued features can be represented. Make it so that the rule form of the decision tree is returned.

One problem that must be overcome is when no examples correspond to one particular value of a chosen feature. You must make a reasonable prediction for this case.

Exercise 7.9:
The decision-tree learning algorithm of Figure 7.5 has to stop if it runs out of features and not all examples agree.

Suppose that you are building a decision tree and you have come to the stage where there are no remaining features to split on and there are examples in the training set, n1 of which are positive and n0 of which are negative. Three strategies have been suggested:

1. Return whichever value has the most examples - return true if n1>n0, false if n1<n0, and either if n1=n0.
2. Return the empirical frequency, n1/(n0+n1).
3. Return (n1+1)/(n0+n1+2).

Which of the following strategies has the least error on the training set?

1. The error is defined as the sum of the absolute differences between the value of the example (1=true and 0=false) and the predicted values in the tree (either 1=true and 0=false or the probability).
2. The error is defined as the sum of the squares of the differences in values.
3. The error is the entropy of the data.

Explain how you derived this answer.

Exercise 7.10:
In choosing which feature to split on in decision-tree search, an alternative heuristic to the max information split of Section 7.3.1.1 is to use the Gini index.

The Gini index of a set of examples (with respect to target feature Y) is a measure of the impurity of the examples:

giniY(Examples) = 1-∑Val ((|{e∈Examples : val(e,Y)=Val}|)/(|Examples|))2

where |{e∈Examples : val(e,Y)=Val}| is the number of examples with value Val of feature Y, and |Examples| is the total number of examples. The Gini index is always non-negative and has value zero only if all of the examples have the same value on the feature. The Gini index reaches its maximum value when the examples are evenly distributed among the values of the features.

One heuristic for choosing which property to split on is to choose the split that minimizes the total impurity of the training examples on the target feature, summed over all of the leaves.

1. Implement a decision-tree search algorithm that uses the Gini index.
2. Try both the Gini index algorithm and the maximum information split algorithm on some databases and see which results in better performance.
3. Find an example database where the Gini index finds a different tree than the maximum information gain heuristic. Which heuristic seems to be better for this example? Consider which heuristic seems more sensible for the data at hand.
4. Try to find an example database where the maximum information split seems more sensible than the Gini index, and try to find another example for which the Gini index seems better. [Hint: Try extreme distributions.]
Exercise 7.11:
As outlined in Example 7.18, define a code for describing decision trees. Make sure that each code corresponds to a decision tree (for every sufficiently long sequence of bits, the initial segment of the sequence will describe a unique decision tree), and each decision tree has a code. How does this code translate into a prior distribution on trees? In particular, how much does the likelihood of introducing a new split have to increase to offset the reduction in prior probability of the split (assuming that smaller trees are easier to describe than large trees in your code)?
Exercise 7.12:
Show how gradient descent can be used for learning a linear function that minimizes the absolute error. [Hint: Do a case analysis of the error. The error is differentiable at every point except when the error is zero, in which case it does not need to be updated.]
Exercise 7.13:
Give an example where a naive Bayesian classifier can give inconsistent results when using empirical frequencies as probabilities. [Hint: You require two features, say A and B, and a binary classification, say C, that has domain {0,1}. Construct a data set where the empirical probabilities give P(a|C=0)=0 and P(b|C=1)=0.] What observation is inconsistent with the model?
Exercise 7.14:
Run the AISpace.org neural network learner on the data of Figure 7.1.
1. Suppose that you decide to use any predicted value from the neural network greater than 0.5 as true, and any value less than 0.5 as false. How many examples are misclassified initially? How many examples are misclassified after 40 iterations? How many examples are misclassified after 80 iterations?
2. Try the same example and the same initial values, with different step sizes for the gradient descent. Try at least η=0.1, η=1.0, and η=5.0. Comment on the relationship between step size and convergence.
3. Given the final parameter values you found, give a logical formula for what each of the units is computing. You can do this by considering, for each of the units, the truth tables for the input values and by determining the output for each combination, then reducing this formula. Is it always possible to find such a formula?
4. All of the parameters were set to different initial values. What happens if the parameter values were all set to the same (random) value? Test it out for this example, and hypothesize what occurs in general.
5. For the neural network algorithm, comment on the following stopping criteria:
1. Learn for a limited number of iterations, where the limit is set initially.
2. Stop when the sum-of-squares error is less than 0.25. Explain why 0.25 may be an appropriate number.
3. Stop when the derivatives all become within some ε of zero.
4. Split the data into training data and test data, and stop when the error on the test data increases.

Which would you expect to better handle overfitting? Which criteria guarantee the gradient descent will stop? Which criteria would guarantee that, if it stops, the network can be used to predict the test data accurately?

Exercise 7.15:
In the neural net learning algorithm, the parameters are updated for each example. To compute the derivative accurately, the parameters should be updated only after all examples have been seen. Implement such a learning algorithm and compare it to the incremental algorithm, with respect to both rate of convergence and to speed of the algorithm.
Exercise 7.16:
1. Draw a kd-tree for the data of Figure 7.1. The topmost feature to split on should be the one that most divides the examples into two equal classes. Assume that you know that the User Action feature does not appear in subsequent queries, and so it should not be split on. Show which training examples are at which leaf nodes.
2. Show the locations in this tree that contain the closest training examples to a new case where the author is unknown, the thread is new, the length is long, and it was read at work.
3. Based on this example, discuss which examples should be returned from a lookup on a kd-tree. Why is this different from a lookup on a decision tree?
Exercise 7.17:
Implement a nearest-neighbor learning system that stores the training examples in a kd-tree and uses the neighbors that differ in the fewest number of features, weighted evenly. How well does this work in practice?