7 Supervised Machine Learning

7.11 Exercises

  1. 1.

    The aim of this exercise is to prove and extend the table of Figure 7.5.

    1. (a)

      Prove the optimal predictions for training data of Figure 7.5. To do this, find the minimum value of the absolute error, the sum-of-squares error, the entropy, and the value that gives the maximum likelihood. The maximum or minimum value is either an end point or where the derivative is zero.

    2. (b)

      To determine the best prediction for the test data, assume that the data cases are generated stochastically according to some true parameter p0. Try the following for a number of different values for p0[0,1]. Generate k training examples (try various values for k, some small, say 2, 3 or 5, and some large, say 1000) by sampling with probability p0; from these training examples generate n0 and n1. Generate a test set that contains many test cases using the same parameter p0. Which of the following gives a lower error on the test set for each of the optimality criteria: sum of absolute values, sum of squares, and likelihood (or log loss)

      1. i.

        the mode

      2. ii.

        n1/(n0+n1)

      3. iii.

        if n1=0, use 0.001, if n0=0, use 0.999, else use n1/(n0+n1). Try this for different numbers when the counts are zero.

      4. iv.

        (n1+1)/(n0+n1+2)

      5. v.

        (n1+α)/(n0+n1+2α) for different values of α>0

      6. vi.

        another predictor that is a function of n0 and n1.

      You may have to generate many different training sets for each parameter. (For the mathematically sophisticated, try to prove what the optimal predictor is for each criterion.)

  2. 2.

    In the context of a point estimate of a feature with domain {0,1} with no inputs, it is possible for an agent to make a stochastic prediction with a parameter p[0,1] such that the agent predicts 1 with probability p and predicts 0 otherwise. For each of the following error measures, give the expected error on a training set with n0 occurrences of 0 and n1 occurrences of 1 (as a function of p). What is the value of p that minimizes the error? Is this worse or better than the prediction of Figure 7.5?

    1. (a)

      Sum of absolute errors

    2. (b)

      Sum-of-squares error

    3. (c)

      Worst-case error

  3. 3.

    Suppose we have a system that observes a person’s TV watching habits in order to recommend other TV shows the person may like. Suppose that we have characterized each show by whether it is a comedy, whether it features doctors, whether it features lawyers, and whether it has guns. Suppose we are given the examples of Figure 7.22 about whether the person likes various TV shows.

    Example Comedy Doctors Lawyers Guns Likes
    e1 false true false false false
    e2 true false true false true
    e3 false false true true true
    e4 false false true false false
    e5 false false false true false
    e6 true false false true false
    e7 true false false false true
    e8 false true true true true
    e9 false true true false false
    e10 true true true false true
    e11 true true false true false
    e12 false false false false false
    Figure 7.22: Training examples for Exercise 3

    We want to use this data set to learn the value of Likes (i.e., to predict which TV shows the person would like based on the attributes of the TV show).

    You may find the AIspace.org applets useful for this assignment. (Before you start, see if you can see the pattern in what shows the person likes.)

    1. (a)

      Suppose the error is the sum of absolute errors. Give the optimal decision tree with only one node (i.e., with no splits). What is the error of this tree?

    2. (b)

      Do the same as in part (a), but with the sum-of-squares error.

    3. (c)

      Suppose the error is the sum of absolute errors. Give the optimal decision tree of depth 2 (i.e., the root node is the only node with children). For each leaf in the tree, give the examples that are filtered to that node. What is the error of this tree?

    4. (d)

      Do the same as in part (c) but with the sum-of-squares error.

    5. (e)

      What is the smallest tree that correctly classifies all training examples? Does a top-down decision tree that optimizes the information gain at each step represent the same function?

    6. (f)

      Give two instances not appearing in the examples of Figure 7.22 and show how they are classified using the smallest decision tree. Use this to explain the bias inherent in the tree. (How does the bias give you these particular predictions?)

    7. (g)

      Is this data set linearly separable? Explain why or why not.

  4. 4.

    Consider the decision tree learning algorithm of Figure 7.7 and the data of Figure 7.1. Suppose, for this question, the stopping criterion is that all of the examples have the same classification. The tree of Figure 7.6 was built by selecting a feature that gives the maximum information gain. This question considers what happens when a different feature is selected.

    1. (a)

      Suppose you change the algorithm to always select the first element of the list of features. What tree is found when the features are in the order [Author, Thread, Length, WhereRead]? Does this tree represent a different function than that found with the maximum information gain split? Explain.

    2. (b)

      What tree is found when the features are in the order [WhereRead, Thread, Length, Author]? Does this tree represent a different function than that found with the maximum information gain split or the one given for the preceding part? Explain.

    3. (c)

      Is there a tree that correctly classifies the training examples but represents a different function than those found by the preceding algorithms? If so, give it. If not, explain why.

  5. 5.

    The aim of this exercise is to determine the size of the space of decision trees. Suppose there are n binary features in a learning problem. How many different decision trees are there? How many different functions are represented by these decision trees? Is it possible that two different decision trees give rise to the same function?

  6. 6.

    Extend the decision tree learning algorithm of Figure 7.7 to allow for multiway splits for discrete variables. Assume that the inputs are the input features, the target feature and the training examples. A split is on the values of a feature.

    One problem that must be overcome is when no examples correspond to one particular value of a selected feature. You must make a reasonable prediction for this case.

  7. 7.

    Implement a decision tree learner that handles input features with ordered domains. You can assume that any numerical feature is ordered. The condition should be a cut on a single variable, such as Xv, which partitions the training examples according to the value v. A cut-value can be chosen for a feature X by sorting the examples on the value of X, and sweeping through the examples in order. While sweeping through the examples, the evaluation of each partition should be computed from the evaluation of the previous partition. Does this work better than, for example, selecting the cuts arbitrarily?

  8. 8.

    The decision tree learning algorithm of Figure 7.7 has to stop if it runs out of features and not all examples agree.

    Suppose that you are building a decision tree for a Boolean target feature and you have come to the stage where there are no remaining input features to split on and there are examples in the training set, n1 of which are positive and n0 of which are negative. Consider the strategies

    1. (a)

      Return whichever value has the most examples – return true if n1>n0, false if n1<n0, and either if n1=n0.

    2. (b)

      Return the empirical frequency, n1/(n0+n1).

    3. (c)

      Return (n1+1)/(n0+n1+2).

    For each of the following objectives predict which of the strategies will have the smallest error on a test set.

    1. (a)

      Minimize the sum of the absolute differences between the value of the example (1=true and 0=false) and the predicted values in the tree (either 1=true and 0=false or the probability).

    2. (b)

      Minimize the sum of the squares of the differences in values.

    3. (c)

      Maximize the log-likelihood of the data.

    Explain your predictions, test it on some data sets, and report as to whether your prediction holds.

  9. 9.

    Show how gradient descent can be used for learning a linear function that minimizes the absolute error. [Hint: Do a case analysis of the error; for each example the absolute value is either the positive or the negative of the value. What is appropriate when the value is zero?]

  10. 10.

    Consider Equation 7.1, which gives the error of a linear prediction.

    1. (a)

      Give a formula for the weights that minimize the error for the case where n=2 (i.e., when there are only two input feature). [Hint: For each weight, differentiate with respect to that weight and set to zero.]

    2. (b)

      Why is it hard to minimize the error analytically when using a sigmoid function as an activation function, for n=2? (Why doesn’t the same method as in part (a) work?)

  11. 11.

    Suppose you want to optimize the sum-of-squares error for the sigmoid of a linear function.

    1. (a)

      Modify the algorithm of Figure 7.8 so that the update is proportional to the gradient of the sum-of-suares error. Note that this question assumes you know differential calculus, in particular, the chain rule for differentiation.

    2. (b)

      Does this work better than the algorithm that minimizes log loss when evaluated according to the sum-of-squares error?

  12. 12.

    Consider how to estimate the quality of restaurant from the ratings of 1 to 5 stars as in Example 7.17.

    1. (a)

      What would this predict for a restaurant that has two 5-star ratings? How would you test from the data whether this is a reasonable prediction?

    2. (b)

      Suppose you wanted not to optimize just for the 5-star restaurants, but for all restaurants. How can this be done?

    3. (c)

      Can c, as computed in Example 7.17, be negative? Give a scenario where this might occur.

    4. (d)

      Why might we choose not to use the average rating for a0? What else might you use? [Hint: A new restaurant may be quite different from a well-established restaurant. Hint: Picking a restaurant at random, and then a rating at random will have a different average than picking a rating at random.]

    It might be useful to try this on some real data. For example, Movielens makes a data set of movie ratings available, which can be used to test such hypotheses (albeit on movies, not restaurants).

  13. 13.

    Consider the update step for the L2 regularizer for linear or logistic regression. It is possible to update the weights due to the regularization inside the “for each” loops in Figure 7.8. Does this actually minimize the ridge regression formula? How must λ be modified? Does this work better in practice?

  14. 14.

    Suggest how the update for the L1 regularizer could be carried out once per example rather than after all examples have been considered. Which do you think would work better, and why? Does this work better or worse in practice than updating once per example?

  15. 15.

    It is possible to define a regularizer to minimize e(errorh(e)+λ*regularizerh) rather than Formula 7.4. How is this different than the existing regularizer? [Hint: Think about how this affects multiple data sets or for cross validation.]

    Suppose λ is set by k-fold cross validation, and then the model is learned for the whole data set. How would the algorithm be different for the original way(s) of defining a regularizer and this alternative way? [Hint: There is a different number of examples used for the regularization than there is the full data set; does this matter?] Which works better in practice?

  16. 16.

    Consider the parameters learned for the neural network in Example 7.20. Give a logical formula (or a decision tree) representing the Boolean function that is the value for the hidden units and the output units. This formula should not refer to any real numbers. [Suppose that, in the output of a neural network, any value greater than 0.5 is considered true and any less than 0.5 is false (i.e., any positive value before the activation function is true, and a negative value is false). Also consider whether intermediate values of the hidden units matter.]

    [Hint: One brute-force method is to go through the 16 combinations of values for the inputs to each hidden unit and determine the truth value of the output. A better method is to try to understand the functions themselves.]

    Does the neural network learn the same function as a decision tree with classifications at the leaves? If not, what is the smallest decision tree that represents the same function?

  17. 17.

    Run the AIspace.org neural network learner on the data of Figure 7.1.

    1. (a)

      Suppose that you decide to use any predicted value from the neural network greater than 0.5 as true, and any value less than 0.5 as false. How many examples are misclassified initially? How many examples are misclassified after 40 iterations? How many examples are misclassified after 80 iterations?

    2. (b)

      Try the same example and the same initial values, with different step sizes for the gradient descent. Try at least η=0.1, η=1.0, and η=5.0. Comment on the relationship between step size and convergence.

    3. (c)

      Given the final parameter values you found, give a logical formula for what each of the units is computing. [Hint: As a brute-force method, for each of the units, build the truth tables for the input values and determine the output for each combination, then simplify the resulting formula]. Is it always possible to find such a formula?

    4. (d)

      All of the parameters were set to different initial values. What happens if the parameter values are all set to the same (random) value? Test it out for this example, and hypothesize what occurs in general.

    5. (e)

      For the neural network algorithm, comment on the following stopping criteria.

      1. i.

        Learn for a limited number of iterations, where the limit is set initially.

      2. ii.

        Stop when the sum-of-squares error is less than 0.25. Explain why 0.25 may be an appropriate number.

      3. iii.

        Stop when the derivatives all become within some ϵ of zero.

      4. iv.

        Split the data into training data and validation data, train on the training data and stop when the error on the validation data increases.

      Which would you expect to better handle overfitting? Which criteria guarantee the gradient descent will stop? Which criteria would guarantee that, if it stops, the network can be used to predict the test data accurately?

  18. 18.

    In the neural net learning algorithm, the parameters are updated for each example. To compute the derivative accurately, the parameters should be updated only after all examples have been seen. Implement such a learning algorithm and compare it to the incremental algorithm, with respect to both rate of convergence and to speed of the algorithm.

  19. 19.

    How does a neural network with hidden units as rectified linear units (f(z)=max(0,z)) compare to a neural network with sigmoid hidden units? This should be tested on more than one data set. Make sure that the output unit is appropriate for the data set(s).

  20. 20.
    1. (a)

      Draw a kd-tree for the data of Figure 7.1. The topmost feature to split on should be the one that most divides the examples into two equal classes. (Do not split on UserAction.) Show which training examples are at which leaf nodes.

    2. (b)

      Show the locations in this tree that contain the closest training examples to a new case where the author is unknown, the thread is new, the length is long, and it was read at work.

    3. (c)

      Based on this example, discuss which examples should be returned from a lookup on a kd-tree. Why is this different from a lookup on a decision tree?

  21. 21.

    Implement a nearest-neighbor learning system that stores the training examples in a kd-tree and uses the neighbors that differ in the fewest number of features, weighted evenly. How well does this work in practice?