7.11 Exercises
 Exercise 7.1:

The aim of this exercise is to fill in the table of
Figure 7.3.
 Prove the optimal prediction for training data. To do this, find the minimum value of the absolute error, the sumofsquares error, the entropy, and the value that gives the maximum likelihood. The maximum or minimum value is either an end point or where the derivative is zero.
 To determine the best prediction for the test data, assume that
the data cases are generated stochastically according to some true parameter
p_{0}. Try the following for a number of different values for p_{0}
∈[0,1]. Generate k training examples (try various values for
k, some small, say 5, and some large, say 1,000) by sampling with
probability p_{0}; from these
generate n_{0} and n_{1}. Generate a test set that contains many
test cases using the same parameter p_{0}. For each of the optimality
criteria  sum of absolute values, sum of squares, and likelihood (or entropy)  which of the following gives
a lower error on the test set:
 the mode
 n_{1}/(n_{0}+n_{1})
 if n_{1}=0, use 0.001, if n_{0}=0, use 0.999, else use n_{1}/(n_{0}+n_{1}). (Try this for different numbers when the counts are zero.)
 (n_{1}+1)/(n_{0}+n_{1}+2)
 (n_{1}+α)/(n_{0}+n_{1}+2α) for different values of α>0
 another predictor that is a function of n_{0} and n_{1}.
You may have to generate many different training sets for each parameter. (For the mathematically sophisticated, can you prove what the optimal predictor is for each criterion?)
 Exercise 7.2:

In the context of a point estimate of a feature with domain {0,1} with no inputs,
it is possible for an agent to make a stochastic prediction with a
parameter p∈[0,1] such that the agent predicts 1 with probability
p and predicts 0 otherwise. For each of the following error measures,
give the expected error on a training set with n_{0} occurrences of 0
and n_{1} occurrences of 1 (as a function of p). What is the value of
p that minimizes the error? Is this worse or
better than the prediction of Figure 7.3?
 sum of absolute errors
 sumofsquares error
 worstcase error
 Exercise 7.3:

Suppose we have a system that observes a person's TV watching habits in
order to recommend other TV shows the person may like. Suppose that we
have characterized each show by whether it is a comedy, whether it
features doctors, whether it features lawyers, and whether it has guns.
Suppose we are given the examples of Figure 7.18 about
whether the person likes various
TV shows.
Example Comedy Doctors Lawyers Guns Likes e_{1} false true false false false e_{2} true false true false true e_{3} false false true true true e_{4} false false true false false e_{5} false false false true false e_{6} true false false true false e_{7} true false false false true e_{8} false true true true true e_{9} false true true false false e_{10} true true true false true e_{11} true true false true false e_{12} false false false false false We want to use this data set to learn the value of Likes (i.e., to predict which TV shows the person would like based on the attributes of the TV show).
You may find the AISpace.org applets useful for this assignment. (Before you start, see if you can see the pattern in what shows the person likes.)
 Suppose the error is the sum of absolute errors. Give the optimal decision tree with only one node (i.e., with no splits). What is the error of this tree?
 Do the same as in part (a), but with the sumofsquares error.
 Suppose the error is the sum of absolute errors. Give the optimal decision tree of depth 2 (i.e., the root node is the only node with children). For each leaf in the tree, give the examples that are filtered to that node. What is the error of this tree?
 Do the same as in part (c) but with the sumofsquares error.
 What is the smallest tree that correctly classifies all training examples? Does a topdown decision tree that optimizes the information gain at each step represent the same function?
 Give two instances that do not appear in the examples of Figure 7.18 and show how they are classified using the smallest decision tree. Use this to explain the bias inherent in the tree. (How does the bias give you these particular predictions?)
 Is this data set linearly separable? Explain why or why not.
 Exercise 7.4:
 Consider the decisiontree learning algorithm of
Figure 7.5 and the data of Figure 7.1. Suppose, for this question, the stopping criterion is that all
of the examples have the same classification.
The tree of Figure 7.4 was built by selecting a feature
that gives the maximum information gain.
This question considers what happens
when a different feature is selected.
 Suppose you change the algorithm to always select the first element of the list of features. What tree is found when the features are in the order [Author, Thread, Length, Where Read]? Does this tree represent a different function than that found with the maximum information gain split? Explain.
 What tree is found when the features are in the order [Where Read, Thread, Length, Author]? Does this tree represent a different function than that found with the maximum information gain split or the one given for the preceding part? Explain.
 Is there a tree that correctly classifies the training examples but represents a different function than those found by the preceding algorithms? If so, give it. If not, explain why.
 Exercise 7.5:

Consider
Equation (7.3.2),
which gives the error of a linear prediction.
 Give a formula for the weights that minimize the error for the case where n=1 (i.e., when there is only one input feature). [Hint: For each weight, differentiate with respect to that weight and set to zero.]
 Give a set of equations for the weights that minimize the error for arbitrary n.
 Why is it hard to minimize the error analytically when using a sigmoid linear function (i.e., a squashed linear function when the activation function is a sigmoid or logistic function)?
 Exercise 7.6:
 Suppose that, in the output of a neural network, we assign any value
greater than 0.5 to be true and any less than 0.5 to be false (i.e., any
positive value before the activation function is true, and a negative
value is false).
Run the AISpace.org neural network learning applet on the data of Figure 7.9 for a neural network with two hidden nodes. Given the final parameter settings found, give a logical formula (or a decision tree or a set of rules) that represents the Boolean function that is the value for the hidden units and the output units. This formula or set of rules should not refer to any real numbers.
[Hint: One bruteforce method is to go through the 16 combinations of values for the inputs to each hidden unit and determine the truth value of the output. A better method is to try to understand the functions themselves.]
Does the neural network learn the same function as the decision tree?
 Exercise 7.7:
 The aim of this exercise is to determine the size of the space of decision trees. Suppose there are n binary features in a learning problem. How many different decision trees are there? How many different functions are represented by these decision trees? Is it possible that two different decision trees give rise to the same function?
 Exercise 7.8:

Extend the decisiontree learning algorithm of Figure 7.5 so that
multivalued features can be represented. Make it so that the
rule form of the decision tree is returned.
One problem that must be overcome is when no examples correspond to one particular value of a chosen feature. You must make a reasonable prediction for this case.
 Exercise 7.9:

The decisiontree learning algorithm of Figure 7.5 has
to stop if it
runs out of features and not all examples agree.
Suppose that you are building a decision tree and you have come to the stage where there are no remaining features to split on and there are examples in the training set, n_{1} of which are positive and n_{0} of which are negative. Three strategies have been suggested:
 Return whichever value has the most examples  return true if n_{1}>n_{0}, false if n_{1}<n_{0}, and either if n_{1}=n_{0}.
 Return the empirical frequency, n_{1}/(n_{0}+n_{1}).
 Return (n_{1}+1)/(n_{0}+n_{1}+2).
Which of the following strategies has the least error on the training set?
 The error is defined as the sum of the absolute differences between the value of the example (1=true and 0=false) and the predicted values in the tree (either 1=true and 0=false or the probability).
 The error is defined as the sum of the squares of the differences in values.
 The error is the entropy of the data.
Explain how you derived this answer.
 Exercise 7.10:
 In choosing which feature to split on in decisiontree search,
an alternative heuristic to the max information split of
Section 7.3.1.1 is to use the Gini index.
The Gini index of a set of examples (with respect to target feature Y) is a measure of the impurity of the examples:
gini_{Y}(Examples) = 1∑_{Val} (({e∈Examples : val(e,Y)=Val})/(Examples))^{2}
where {e∈Examples : val(e,Y)=Val} is the number of examples with value Val of feature Y, and Examples is the total number of examples. The Gini index is always nonnegative and has value zero only if all of the examples have the same value on the feature. The Gini index reaches its maximum value when the examples are evenly distributed among the values of the features.
One heuristic for choosing which property to split on is to choose the split that minimizes the total impurity of the training examples on the target feature, summed over all of the leaves.
 Implement a decisiontree search algorithm that uses the Gini index.
 Try both the Gini index algorithm and the maximum information split algorithm on some databases and see which results in better performance.
 Find an example database where the Gini index finds a different tree than the maximum information gain heuristic. Which heuristic seems to be better for this example? Consider which heuristic seems more sensible for the data at hand.
 Try to find an example database where the maximum information split seems more sensible than the Gini index, and try to find another example for which the Gini index seems better. [Hint: Try extreme distributions.]
 Exercise 7.11:
 As outlined in Example 7.18, define a code for describing decision trees. Make sure that each code corresponds to a decision tree (for every sufficiently long sequence of bits, the initial segment of the sequence will describe a unique decision tree), and each decision tree has a code. How does this code translate into a prior distribution on trees? In particular, how much does the likelihood of introducing a new split have to increase to offset the reduction in prior probability of the split (assuming that smaller trees are easier to describe than large trees in your code)?
 Exercise 7.12:
 Show how gradient descent can be used for learning a linear function that minimizes the absolute error. [Hint: Do a case analysis of the error. The error is differentiable at every point except when the error is zero, in which case it does not need to be updated.]
 Exercise 7.13:
 Give an example where a naive Bayesian classifier can give inconsistent results when using empirical frequencies as probabilities. [Hint: You require two features, say A and B, and a binary classification, say C, that has domain {0,1}. Construct a data set where the empirical probabilities give P(aC=0)=0 and P(bC=1)=0.] What observation is inconsistent with the model?
 Exercise 7.14:
 Run the AISpace.org neural network learner on the data of Figure 7.1.
 Suppose that you decide to use any predicted value from the neural network greater than 0.5 as true, and any value less than 0.5 as false. How many examples are misclassified initially? How many examples are misclassified after 40 iterations? How many examples are misclassified after 80 iterations?
 Try the same example and the same initial values, with different step sizes for the gradient descent. Try at least η=0.1, η=1.0, and η=5.0. Comment on the relationship between step size and convergence.
 Given the final parameter values you found, give a logical formula for what each of the units is computing. You can do this by considering, for each of the units, the truth tables for the input values and by determining the output for each combination, then reducing this formula. Is it always possible to find such a formula?
 All of the parameters were set to different initial values. What happens if the parameter values were all set to the same (random) value? Test it out for this example, and hypothesize what occurs in general.
 For the neural network algorithm, comment on the following stopping criteria:
 Learn for a limited number of iterations, where the limit is set initially.
 Stop when the sumofsquares error is less than 0.25. Explain why 0.25 may be an appropriate number.
 Stop when the derivatives all become within some ε of zero.
 Split the data into training data and test data, and stop when the error on the test data increases.
Which would you expect to better handle overfitting? Which criteria guarantee the gradient descent will stop? Which criteria would guarantee that, if it stops, the network can be used to predict the test data accurately?
 Exercise 7.15:
 In the neural net learning algorithm, the parameters are updated for each example. To compute the derivative accurately, the parameters should be updated only after all examples have been seen. Implement such a learning algorithm and compare it to the incremental algorithm, with respect to both rate of convergence and to speed of the algorithm.
 Exercise 7.16:

 Draw a kdtree for the data of Figure 7.1. The topmost feature to split on should be the one that most divides the examples into two equal classes. Assume that you know that the User Action feature does not appear in subsequent queries, and so it should not be split on. Show which training examples are at which leaf nodes.
 Show the locations in this tree that contain the closest training examples to a new case where the author is unknown, the thread is new, the length is long, and it was read at work.
 Based on this example, discuss which examples should be returned from a lookup on a kdtree. Why is this different from a lookup on a decision tree?
 Exercise 7.17:
 Implement a nearestneighbor learning system that stores the training examples in a kdtree and uses the neighbors that differ in the fewest number of features, weighted evenly. How well does this work in practice?