foundations of computational agents
The aim of this exercise is to prove and extend the table of Figure 7.5.
Prove the optimal predictions for the training data of Figure 7.5. To do this, find the prediction that minimizes the absolute error, the prediction that minimizes the squared error, the prediction that minimizes the log loss, and the prediction that gives the maximum likelihood. The maximum or minimum is either at an endpoint or where the derivative is zero. [Hints: For squared error and log loss, take the derivative and set it to zero. For the absolute error, consider how the error changes when moving from one data point to the next (in order). It might help to consider first the case where the data points are at consecutive integers, and then generalize.]
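As a starting point, suppose the training set has ${n}_{0}$ examples with value 0 and ${n}_{1}$ examples with value 1 (matching the counts used in the next part); each objective, as a function of a point prediction $p$, is then

$$\text{absolute error: } {n}_{0}\,p + {n}_{1}\,(1-p) \qquad \text{squared error: } {n}_{0}\,{p}^{2} + {n}_{1}\,{(1-p)}^{2}$$

$$\text{log loss: } -{n}_{0}\log (1-p) - {n}_{1}\log p \qquad \text{likelihood: } {p}^{{n}_{1}}{(1-p)}^{{n}_{0}}$$

Maximizing the likelihood and minimizing the log loss are equivalent, since the log loss is the negative log-likelihood.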
To determine the best prediction for the test data, assume that the data cases are generated stochastically according to some true parameter $p$. [See the thought experiment.] Try the following for a number of different values of $p$ selected uniformly from the range $[0,1]$. Generate $n$ training examples (try various values for $n$, some small, say 2, 3, or 5, and some large, say 1000) by sampling with probability $p$. Let ${n}_{0}$ be the number of false examples and ${n}_{1}$ the number of true examples in the training set (so ${n}_{0}+{n}_{1}=n$). Generate a test set (of size, say, 20) using the same parameter $p$. Repeat this 1000 times to get a reasonable average. For each of the optimality criteria (I) absolute error, (II) squared error, and (III) log loss, which of the following predictors gives the lowest error on the test set?
The mode.
${n}_{1}/n$.
If ${n}_{1}=0$, use $0.001$; if ${n}_{0}=0$, use $0.999$; otherwise use ${n}_{1}/n$. Try this with different numbers in place of $0.001$ and $0.999$ when the counts are zero.
$({n}_{1}+1)/(n+2)$
$({n}_{1}+\alpha )/(n+2\alpha )$ for different values of $\alpha >0$.
Another predictor that is a function of ${n}_{0}$ and ${n}_{1}$.
(For the mathematically sophisticated, try to prove what the optimal predictor is for each criterion.)
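The experiment above can be sketched as follows (a minimal simulation; the predictor names, the function structure, and the tie-breaking for the mode are my own choices, not from AIPython):

```python
import math
import random

def predictors(n0, n1, alpha=2.0):
    """Candidate point estimates built from training counts n0 (false)
    and n1 (true); the entries correspond to parts (a)-(e)."""
    n = n0 + n1
    mode = 1.0 if n1 >= n0 else 0.0        # ties broken toward 1 (an arbitrary choice)
    clipped = 0.001 if n1 == 0 else 0.999 if n0 == 0 else n1 / n
    return {"mode": mode,
            "n1/n": n1 / n,
            "clipped": clipped,
            "laplace": (n1 + 1) / (n + 2),
            "alpha": (n1 + alpha) / (n + 2 * alpha)}

def losses(p, y):
    """Absolute, squared, and log loss of prediction p on target y in {0, 1}."""
    eps = 1e-12                             # keep the log finite for p in {0, 1}
    q = min(max(p, eps), 1 - eps)
    return (abs(p - y),
            (p - y) ** 2,
            -(y * math.log(q) + (1 - y) * math.log(1 - q)))

def experiment(n=5, test_size=20, repeats=1000, seed=0):
    """Sum each predictor's test losses over randomly drawn true parameters."""
    rng = random.Random(seed)
    totals = {}                             # name -> [abs, squared, log] totals
    for _ in range(repeats):
        p_true = rng.random()               # true parameter, uniform in [0, 1]
        n1 = sum(rng.random() < p_true for _ in range(n))
        test = [int(rng.random() < p_true) for _ in range(test_size)]
        for name, pred in predictors(n - n1, n1).items():
            acc = totals.setdefault(name, [0.0, 0.0, 0.0])
            for y in test:
                for i, l in enumerate(losses(pred, y)):
                    acc[i] += l
    return totals
```

Comparing, say, `experiment(n=3)` against `experiment(n=1000)` shows how the ranking of the predictors depends on the training-set size; lower totals are better under each criterion.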
Consider a point estimate for a feature with domain $\{0,1\}$ and no input features. It is possible for an agent to make a stochastic prediction with a parameter $p\in [0,1]$ such that the agent predicts 1 with probability $p$ and predicts 0 otherwise. For each of the following error measures, give the expected error on a training set with ${n}_{0}$ occurrences of 0 and ${n}_{1}$ occurrences of 1 (as a function of $p$). What value of $p$ minimizes the error? Is this worse or better than the prediction of Figure 7.5?
Mean absolute loss.
Mean squared loss.
Worst-case error.
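A Monte Carlo estimate of these expected errors can be used to check the analytic answers (a sketch; the function name and `trials` parameter are mine):

```python
import random

def stochastic_losses(p, n0, n1, trials=100000, seed=1):
    """Estimate the expected mean absolute loss, mean squared loss, and
    worst-case (maximum per-example) loss of an agent that predicts 1 with
    probability p, on a training set with n0 zeros and n1 ones."""
    rng = random.Random(seed)
    data = [0] * n0 + [1] * n1
    n = len(data)
    tot_abs = tot_sq = tot_worst = 0.0
    for _ in range(trials):
        errs = [abs(int(rng.random() < p) - y) for y in data]
        tot_abs += sum(errs) / n
        tot_sq += sum(e * e for e in errs) / n  # 0/1 errors: equals absolute loss
        tot_worst += max(errs)
    return tot_abs / trials, tot_sq / trials, tot_worst / trials
```

Note that, because each prediction is 0 or 1, the per-example absolute and squared losses coincide, which is worth keeping in mind when comparing with the deterministic prediction of Figure 7.5.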
Prove that for any two predictors $A$ and $B$ on the same dataset, if $A$ dominates $B$, that is, $A$ has a higher true-positive rate and a lower false-positive rate than $B$, then $A$ has a lower cost (and so is better) than $B$ for all cost functions that depend only on the number of false positives and false negatives (assuming the costs to be minimized are non-negative). [Hint: Prove that the conditions of the statement imply that the number of false negatives and the number of false positives are each less (or, perhaps, equal), which in turn implies the conclusion.]
Consider the predictors (a) and (c) in Figure 7.7.
Which of (a) and (c) has a better recall (true-positive rate)? (See Example 7.7.)
What is the precision of (a)? (Give both a ratio and the decimal value to 3 decimal points.)
What is the precision of (c)?
Which of (a) and (c) is better in terms of precision?
Suppose false positives were 1000 times worse than false negatives; which of (a) and (c) would have a lower cost? (Take the cost of a false negative to be 1 and the cost of a false positive to be 1000.)
Suppose false negatives were 1000 times worse than false positives; which of (a) and (c) would have a lower cost?
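Since these cost functions depend only on the counts of false positives and false negatives, the comparison can be organized with a small helper (the counts below are placeholders for illustration; the real counts must be read off Figure 7.7):

```python
def cost(fp, fn, fp_cost=1.0, fn_cost=1.0):
    """Total cost of a predictor summarized by its false-positive and
    false-negative counts."""
    return fp_cost * fp + fn_cost * fn

# hypothetical counts, for illustration only
a_fp, a_fn = 10, 5
c_fp, c_fn = 2, 20
cheaper = ("(a)" if cost(a_fp, a_fn, fp_cost=1000, fn_cost=1)
                  < cost(c_fp, c_fn, fp_cost=1000, fn_cost=1) else "(c)")
print(cheaper)  # → (c)
```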
Suppose you need to define a system that, given data about a person’s TV-watching likes, recommends other TV shows the person may like. Each show has features specifying whether it is a comedy, whether it features doctors, whether it features lawyers, and whether it has guns. You are given the fictitious examples of Figure 7.23 about whether the person likes various TV shows.
Example | $Comedy$ | $Doctors$ | $Lawyers$ | $Guns$ | $Likes$ |
---|---|---|---|---|---|
${e}_{1}$ | $false$ | $true$ | $false$ | $false$ | $false$ |
${e}_{2}$ | $true$ | $false$ | $true$ | $false$ | $true$ |
${e}_{3}$ | $false$ | $false$ | $true$ | $true$ | $true$ |
${e}_{4}$ | $false$ | $false$ | $true$ | $false$ | $false$ |
${e}_{5}$ | $false$ | $false$ | $false$ | $true$ | $false$ |
${e}_{6}$ | $true$ | $false$ | $false$ | $true$ | $false$ |
${e}_{7}$ | $true$ | $false$ | $false$ | $false$ | $true$ |
${e}_{8}$ | $false$ | $true$ | $true$ | $true$ | $true$ |
${e}_{9}$ | $false$ | $true$ | $true$ | $false$ | $false$ |
${e}_{10}$ | $true$ | $true$ | $true$ | $false$ | $true$ |
${e}_{11}$ | $true$ | $true$ | $false$ | $true$ | $false$ |
${e}_{12}$ | $false$ | $false$ | $false$ | $false$ | $false$ |
We want to use this dataset to learn the value of $Likes$ (i.e., to predict which TV shows the person would like based on the attributes of the TV show).
This dataset is designed to be small enough to work with manually; however, you may find the AIPython (aipython.org) code useful for checking your answers.
Suppose the error is the sum of absolute errors. Give the optimal decision tree with only one node (i.e., with no splits). What is the error of this tree?
Do the same as in part (a), but with the squared error.
Suppose the error is the sum of absolute errors. Give the optimal decision tree of depth 2 (i.e., the root node is the only node with children). For each leaf in the tree, give the examples that are filtered to that node. What is the error of this tree?
Do the same as in part (c) but with the squared error.
What is the smallest tree that correctly classifies all of the training examples? Does a top-down decision tree learner that optimizes the information gain at each step represent the same function?
Give two instances not appearing in the examples of Figure 7.23 and show how they are classified using the smallest decision tree. Use this to explain the bias inherent in the tree. (How does the bias give you these particular predictions?)
Is this dataset linearly separable? Explain why or why not.
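The AIPython code mentioned above can check full trees; for parts (a) and (b), even a few lines suffice (encoding $true$ as 1 and $false$ as 0, with the examples of Figure 7.23 in order):

```python
# Figure 7.23 encoded as (Comedy, Doctors, Lawyers, Guns, Likes), true = 1
examples = [
    (0, 1, 0, 0, 0), (1, 0, 1, 0, 1), (0, 0, 1, 1, 1), (0, 0, 1, 0, 0),
    (0, 0, 0, 1, 0), (1, 0, 0, 1, 0), (1, 0, 0, 0, 1), (0, 1, 1, 1, 1),
    (0, 1, 1, 0, 0), (1, 1, 1, 0, 1), (1, 1, 0, 1, 0), (0, 0, 0, 0, 0),
]
targets = [e[4] for e in examples]

def sum_abs(pred, ys):
    return sum(abs(pred - y) for y in ys)

def sum_sq(pred, ys):
    return sum((pred - y) ** 2 for y in ys)

# grid search over constant predictions in [0, 1]
grid = [i / 1000 for i in range(1001)]
best_abs = min(grid, key=lambda p: sum_abs(p, targets))
best_sq = min(grid, key=lambda p: sum_sq(p, targets))
print(best_abs, sum_abs(best_abs, targets))
print(best_sq, sum_sq(best_sq, targets))
```

The same grid search, applied to the examples filtered to each child of a candidate root split, also checks parts (c) and (d).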
Consider the decision tree learning algorithm of Figure 7.9 and the data of Figure 7.1. Suppose, for this question, the stopping criterion is that all of the examples have the same classification. The tree of Figure 7.8 was built by selecting a feature that gives the maximum information gain. This question considers what happens when a different feature is selected.
Suppose you change the algorithm to always select the first element of the list of features. What tree is found when the features are in the order $[Author$, $Thread$, $Length$, $WhereRead]$? Does this tree represent a different function than that found with the maximum information gain split? Explain.
What tree is found when the features are in the order $[WhereRead$, $Thread$, $Length$, $Author]$? Does this tree represent a different function than that found with the maximum information gain split or the one given for the preceding part? Explain.
Is there a tree that correctly classifies the training examples but represents a different function than those found by the preceding algorithms? If so, give it. If not, explain why.
In the decision tree learner of Figure 7.9, it is possible to mix the leaf predictions (what is returned by $leaf\mathrm{\_}value$) and which loss is used in $sum\mathrm{\_}loss$. For each loss in the set $\{$0–1 loss, absolute loss, squared loss, log loss$\}$, and for each leaf choice in $\{$empirical distribution, mode, median$\}$, build a tree to greedily optimize the loss when choosing the split, and use the leaf choice. For each tree, give the number of leaves and the evaluation of a test set for each loss. Try this for at least two datasets.
Which split choice gives the smallest tree?
Is there a loss that is consistently improved by optimizing a different loss when greedily choosing a split?
Try to find a different leaf choice that would be better for some optimization criterion.
For each optimization criterion, which combination of split choice and leaf choice has the best performance on the test set?
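One way to set up the pieces being mixed (a sketch; the names mirror Figure 7.9's $leaf\mathrm{\_}value$ and $sum\mathrm{\_}loss$, but the tie-breaking and clipping choices are mine):

```python
import math
from statistics import median

def leaf_value(ys, choice):
    """Possible leaf predictions for targets ys (values in {0, 1})."""
    if choice == "empirical":
        return sum(ys) / len(ys)
    if choice == "mode":
        return 1 if 2 * sum(ys) >= len(ys) else 0  # ties broken toward 1
    if choice == "median":
        return median(ys)
    raise ValueError(choice)

def sum_loss(pred, ys, loss):
    """Sum of the given loss of a constant prediction pred on targets ys."""
    eps = 1e-9
    q = min(max(pred, eps), 1 - eps)        # keep the log finite
    if loss == "0-1":
        return sum(int((pred >= 0.5) != bool(y)) for y in ys)
    if loss == "absolute":
        return sum(abs(pred - y) for y in ys)
    if loss == "squared":
        return sum((pred - y) ** 2 for y in ys)
    if loss == "log":
        return sum(-(y * math.log(q) + (1 - y) * math.log(1 - q)) for y in ys)
    raise ValueError(loss)
```

With these in hand, a split can be scored by the chosen loss of the best leaf value on each side, and the resulting trees compared across the loss/leaf combinations.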
The aim of this exercise is to determine the size of the space of decision trees. Suppose there are $n$ binary features in a learning problem. How many different decision trees are there? How many different functions are represented by these decision trees? Is it possible that two different decision trees give rise to the same function?
Implement a decision tree learner that handles input features with ordered domains. You can assume that any numerical feature is ordered. The condition should be a cut on a single variable, such as $X\le v$, which partitions the training examples according to the value $v$. A cut-value can be chosen for a feature $X$ by sorting the examples on the value of $X$, and sweeping through the examples in order. While sweeping through the examples, the evaluation of each partition should be computed from the evaluation of the previous partition. Does this work better than, for example, selecting the cuts arbitrarily?
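The incremental sweep described above might look like the following for a squared-error evaluation (a sketch; the representation of examples as dicts and the function name are assumptions):

```python
def best_cut(examples, feature, target):
    """Find the cut value v minimizing the sum of squared errors of the
    two-leaf split X <= v vs X > v, in one sweep through sorted examples.
    examples: list of dicts; feature, target: keys into those dicts."""
    xs = sorted(examples, key=lambda e: e[feature])
    n = len(xs)
    total = sum(e[target] for e in xs)
    total_sq = sum(e[target] ** 2 for e in xs)
    left_n = left_sum = left_sq = 0.0
    best = (float("inf"), None)
    for i in range(n - 1):
        y = xs[i][target]
        left_n += 1; left_sum += y; left_sq += y * y
        if xs[i][feature] == xs[i + 1][feature]:
            continue  # cannot cut between equal feature values
        right_n = n - left_n
        right_sum = total - left_sum
        right_sq = total_sq - left_sq
        # SSE of predicting the mean on each side, via sum(y^2) - (sum y)^2 / n
        sse = (left_sq - left_sum ** 2 / left_n) + (right_sq - right_sum ** 2 / right_n)
        cut = (xs[i][feature] + xs[i + 1][feature]) / 2
        if sse < best[0]:
            best = (sse, cut)
    return best  # (sse, cut); cut is None if the feature is constant
```

The key point is that each candidate partition's evaluation is an O(1) update of running sums, so the whole sweep is dominated by the initial sort.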
Give the weights for a logistic regression model that can approximate the following logical operations. Assume true is represented as 1 and false as 0. Assume that $sigmoid(5)$ is a close enough approximation to 1 and that $sigmoid(-5)$ is close enough to 0.
and: ${x}_{1}\wedge {x}_{2}$
or: ${x}_{1}\vee {x}_{2}$
negation: $\neg {x}_{1}$
nand: $\neg ({x}_{1}\wedge {x}_{2})$. Note that nand (not and) is of interest because all Boolean functions can be built using nand.
Use ${w}_{0}$ for the bias term (the weight multiplied by 1) and ${w}_{i}$ for the weight of ${x}_{i}$.
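A candidate answer can be checked mechanically against its truth table, using the tolerance stated above (this checker is mine, not part of the exercise):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def implements(weights, table):
    """Check a candidate (w0, w1, w2) against a truth table on (x1, x2):
    the activation must be at least 5 where the output should be true and
    at most -5 where it should be false, so sigmoid is close enough to 1 or 0."""
    w0, w1, w2 = weights
    for (x1, x2), out in table.items():
        z = w0 + w1 * x1 + w2 * x2
        if (out and z < 5) or (not out and z > -5):
            return False
    return True

# for example, one candidate for "or"
or_table = {(0, 0): False, (0, 1): True, (1, 0): True, (1, 1): True}
print(implements((-5, 10, 10), or_table))  # → True
```

A one-input operation like negation can be checked by fixing ${w}_{2}=0$ and making the table ignore ${x}_{2}$.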
Show how gradient descent can be used for learning a linear function that minimizes the absolute error. [Hint: Do a case analysis of the error; for each example the absolute value is either the positive or the negative of the value. What is appropriate when the value is zero?]
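The case analysis in the hint amounts to using the sign of the error as a (sub)gradient; a minimal sketch, with a fixed step size and taking the derivative of the absolute value to be 0 when the error is exactly zero:

```python
def abs_error_gd(examples, num_features, step=0.01, epochs=1000):
    """(Sub)gradient descent on the sum of absolute errors of a linear
    predictor. examples: list of (inputs, target) pairs with
    len(inputs) == num_features."""
    w = [0.0] * (num_features + 1)   # w[0] is the bias weight
    for _ in range(epochs):
        for xs, y in examples:
            pred = w[0] + sum(wi * xi for wi, xi in zip(w[1:], xs))
            err = pred - y
            s = (err > 0) - (err < 0)  # sign of the error; 0 when err == 0
            w[0] -= step * s
            for i, xi in enumerate(xs):
                w[i + 1] -= step * s * xi
    return w
```

With a fixed step size the weights oscillate in a band around the optimum whose width is proportional to the step; a decreasing step size gives convergence.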
Consider minimizing Equation 7.1, which gives the error of a linear prediction. This can be solved by finding the zero(s) of its derivative. The general case involves solving linear equations, for which there are many techniques, but it is instructive to do a simple case by hand.
Give a formula for the weights that minimize the error for the case where $n=2$ (i.e., when there are only two input features). [Hint: For each weight, differentiate with respect to that weight and set to zero. Solve the resulting equations.]
Write pseudocode to implement this.
Why is it hard to minimize the error analytically when using a sigmoid function as an activation function, for $n=2$? (Why doesn’t the same method as in part (a) work?)
Suppose you want to optimize the mean squared loss for the sigmoid of a linear function.
Modify the algorithm of Figure 7.12 so that the update is proportional to the gradient of the squared error. Note that this question assumes you know differential calculus, in particular, the chain rule for differentiation.
Does this work better than the algorithm that minimizes log loss when evaluated according to the squared error?
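For reference, with $\widehat{Y}(e)=sigmoid({\sum}_{i}{w}_{i}{X}_{i}(e))$, the chain rule and the identity $sigmoid'(z)=sigmoid(z)(1-sigmoid(z))$ give

$$\frac{\partial }{\partial {w}_{i}}{\sum}_{e}{(\widehat{Y}(e)-Y(e))}^{2} = {\sum}_{e}2\,(\widehat{Y}(e)-Y(e))\,\widehat{Y}(e)\,(1-\widehat{Y}(e))\,{X}_{i}(e)$$

so the update for ${w}_{i}$ differs from the log-loss update by the extra factor $2\,\widehat{Y}(e)(1-\widehat{Y}(e))$.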
Consider how to estimate the quality of a restaurant from the ratings of 1 to 5 stars as in Example 7.18.
What would this predict for a restaurant that has two 5-star ratings? How would you test from the data whether this is a reasonable prediction?
Suppose you wanted not to optimize just for the 5-star restaurants, but for all restaurants. How can this be done?
Can $c$, as computed in Example 7.18, be negative? Give a scenario where this might occur.
Why might we choose not to use the average rating for ${a}_{0}$? What else might you use? [Hints: A new restaurant may be quite different from a well-established restaurant. Picking a restaurant at random, and then a rating at random, will have a different average than picking a rating at random.]
It might be useful to try this on some real data. For example, MovieLens makes a dataset of movie ratings available, which can be used to test such hypotheses (albeit on movies, not restaurants).
It is possible to define a regularizer to minimize ${\sum}_{e}(loss(\widehat{Y}(e),Y(e))+\lambda \ast regularizer(\widehat{Y}))$ rather than Formula 7.5. How is this different from the existing regularizer? [Hint: Think about how this affects multiple datasets or cross validation.]
Suppose $\lambda $ is set by $k$-fold cross validation and then the model is learned for the whole dataset. How would the algorithm differ for the original way(s) of defining a regularizer and this alternative way? [Hint: The number of examples used during cross validation differs from the size of the full dataset; does this matter?] Which works better in practice?
Given the parameterizations of Example 7.22:
When the features are $a$, $b$, and $c$, what decision tree will the decision-tree learning algorithm find to represent $t$ (assuming it maximizes information gain and only stops when all examples at a leaf agree)?
When the features are $a$, $b$, and $c$, what is a smallest decision tree that can represent $t$? How does this compare to using $x$, $y$, and $z$?
How can $x$, $y$, and $z$ be defined in terms of $a$, $b$, and $c$?