foundations of computational agents
Try to construct an artificial example where a naive Bayes classifier can give divide-by-zero error in test cases when using empirical frequencies as probabilities. Specify the network and the (non-empty) training examples. [Hint: You can do it with two features, say $A$ and $B$, and a binary classification, say $C$, that has domain $\{0,1\}$. Construct a dataset where the empirical probabilities give $P(a|C=0)=0$ and $P(b|C=1)=0$.] What observation is inconsistent with the model?
Consider designing a help system based on Example 10.5. Discuss how your implementation can handle the following issues, and if it cannot, whether it is a major problem.
What should be the initial ${u}_{ij}$ counts? Where might this information be obtained?
What if the most likely page is not the correct page?
What if users cannot find the correct page?
What if users mistakenly think they have the correct page?
Can some pages never be found?
What should it do with common words that are independent of the help page?
What about words that affect the meaning, such as “not”?
How should it handle words it has never seen before?
How can new help pages be incorporated?
Consider the help system of Example 10.5.
Using the ${c}_{1}$ and ${w}_{ij}$ counts in that example, give the probability of $P(H\mid q)$, the distribution over help pages given $q$, the set of words in a user query. Note that this probability needs to also depend on the words not in $q$.
How can this be computed in time proportional to the number of words in $q$, rather than the size of the vocabulary. [Hint: Consider the probability of $H$ given no words were in the query – which is not the prior of $H$ – and then adjust for the words in the query.]
Suppose you have designed a help system based on Example 10.5 and much of the underlying system which the help pages are about has changed. You are now very unsure about which help pages will be requested, but you may have a good model of which words will be used given the help page. How can the help system be changed to take this into account? [Hint: You may need different counts for $P({h}_{i})$ and $P({w}_{j}\mid {h}_{i})$.]
Consider the unsupervised data of Figure 10.5.
How many different stable assignments of examples to classes does the $k$-means algorithm find when $k=2$? [Hint: Try running the algorithm on the data with a number of different starting points, but also think about what assignments of examples to classes are stable.] Do not count permutations of the labels as different assignments.
Estimate how many different stable assignments there are when $k=3$.
Estimate many different stable assignments there are when $k=4$.
Why might someone suggest that three is the natural number of classes in this example? Give a definition for “natural” number of classes, and use this data to justify the definition.
Suppose the $k$-means algorithm is run for an increasing sequence of values for $k$, and that it is run for a number of times for each $k$ to find the assignment with a global minimum error. Is it possible that a number of values of $k$ exist for which the error plateaus and then has a large improvement (e.g., when the errors for $k=3$, $k=4$, and $k=5$ are about the same, but the error for $k=6$ is much lower)? If so, give an example. If not, explain why.
To initialize the EM algorithm in Figure 10.8 consider two alternatives:
allow $P$ to return a random distribution the first time through the loop
initialize $cc$ and $fc$ to random values.
By running the algorithm on some datasets, determine which, if any, of these alternatives is better in terms of log loss of the training data, as a function of the number of loops through the dataset. Does it matter if $cc$ and $fc$ are not consistent with the semantics (counts that should be equal are not)?
Consider the code for decision trees in Example 10.7, and the Bayesian information criteria (BIC) for decision trees. Consider the three cases: the BIC, the decision tree code with a 32-bit representation for probabilities, and the decision tree code that uses ${\mathrm{log}}_{2}(|E|)$ bits to represent a probability.
For each case, how many extra bits does introducing a split incur?
Which method has the biggest preference for smaller trees?
For each of the three methods, is there a value of $\gamma $ in the decision tree learner (Figure 7.9) that corresponds to that method? If so, give it, if not, why not?