10 Learning with Uncertainty


10.7 Exercises

  1. Try to construct an artificial example where a naive Bayes classifier can give a divide-by-zero error on a test case when empirical frequencies are used as probabilities. Specify the network and the (non-empty) training examples. [Hint: You can do it with two features, say A and B, and a binary classification, say C, that has domain {0,1}. Construct a data set where the empirical probabilities give P(a|C=0)=0 and P(b|C=1)=0.] What observation is inconsistent with the model?
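
     A minimal sketch of the kind of construction asked for, assuming a hypothetical four-example data set (none of these numbers come from the text): with empirical frequencies, both class scores are zero for the observation where both a and b are true, so normalizing the posterior divides by zero.

     ```python
     # Hypothetical training set over features A, B and class C (domain {0, 1}),
     # chosen so the empirical estimates give P(a | C=0) = 0 and P(b | C=1) = 0.
     train = [(0, 1, 0),   # (a, b, c)
              (0, 1, 0),
              (1, 0, 1),
              (1, 0, 1)]

     def empirical(examples):
         """Empirical P(C=c), P(A=1 | C=c), and P(B=1 | C=c) with no pseudocounts."""
         n = len(examples)
         p_c, p_a, p_b = {}, {}, {}
         for c in (0, 1):
             in_c = [(a, b) for (a, b, cls) in examples if cls == c]
             p_c[c] = len(in_c) / n
             p_a[c] = sum(a for a, _ in in_c) / len(in_c)
             p_b[c] = sum(b for _, b in in_c) / len(in_c)
         return p_c, p_a, p_b

     def posterior(a, b, p_c, p_a, p_b):
         """Naive Bayes P(C | A=a, B=b); the normalization is where division happens."""
         score = {c: p_c[c]
                     * (p_a[c] if a else 1 - p_a[c])
                     * (p_b[c] if b else 1 - p_b[c]) for c in (0, 1)}
         total = sum(score.values())          # 0 for the observation a=1, b=1
         return {c: s / total for c, s in score.items()}

     p_c, p_a, p_b = empirical(train)
     print(posterior(1, 1, p_c, p_a, p_b))    # raises ZeroDivisionError
     ```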

  2. Consider designing a help system based on Example 10.5. Discuss how your implementation can handle the following issues and, if it cannot, whether that is a major problem; a sketch of one possible set of counts follows the list.

     (a) What should be the initial u_ij counts? Where might this information be obtained?

     (b) What if the most likely page is not the correct page?

     (c) What if users cannot find the correct page?

     (d) What if users mistakenly think they have the correct page?

     (e) Can some pages never be found?

     (f) What should it do with common words that are independent of the help page?

     (g) What about words that affect the meaning, such as “not”?

     (h) How should it handle words it has never seen before?

     (i) How can new help pages be incorporated?
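
     One way to ground parts (a) and (h) is to keep per-page word counts that start from pseudocounts rather than zero. The sketch below is a hypothetical naive-Bayes-style implementation; the page names, query words, and pseudocount values are assumptions, not from the text.

     ```python
     # Hedged sketch: counts start at assumed pseudocounts, so the initial counts
     # are nonzero (part (a)) and words never seen before still get nonzero
     # probability (part (h)).
     from collections import defaultdict

     WORD_PSEUDOCOUNT = 1.0   # assumed prior "uses of word w when seeking page h"
     PAGE_PSEUDOCOUNT = 2.0   # assumed prior "times page h was sought"; P(w|h) starts at 0.5

     class HelpSystem:
         def __init__(self, pages):
             self.page_count = {p: PAGE_PSEUDOCOUNT for p in pages}
             self.word_count = {p: defaultdict(lambda: WORD_PSEUDOCOUNT) for p in pages}

         def observe(self, page, query_words):
             """Record that a user who typed query_words turned out to want `page`."""
             self.page_count[page] += 1
             for w in set(query_words):
                 self.word_count[page][w] += 1

         def rank(self, query_words):
             """Posterior over pages given the words used (naive Bayes over the query)."""
             total = sum(self.page_count.values())
             score = {}
             for p, n in self.page_count.items():
                 s = n / total                              # P(h)
                 for w in set(query_words):
                     s *= self.word_count[p][w] / n         # P(w | h), never exactly 0
                 score[p] = s
             z = sum(score.values())
             return sorted(((s / z, p) for p, s in score.items()), reverse=True)

     hs = HelpSystem(["printing", "saving", "fonts"])       # hypothetical pages
     hs.observe("printing", ["print", "paper", "jam"])
     hs.observe("fonts", ["bold", "print"])
     print(hs.rank(["print", "jam"]))
     ```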

  3. Suppose you have designed a help system based on Example 10.5 and much of the underlying system that the help pages are about has changed. You are now very unsure about which help pages will be requested, but you may still have a good model of which words will be used given the help page. How can the help system be changed to take this into account? [Hint: You may need different counts for P(h_i) and P(w_j | h_i).]
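
     A sketch of the hint with hypothetical numbers: store the counts behind P(h_i) separately from the counts behind P(w_j | h_i), so that after the change only the former need to be reinitialized.

     ```python
     # Hypothetical counts; only the page-request counts are reset after the change.

     # Counts behind P(h_i): how often each help page was the one the user wanted.
     page_requested = {"printing": 120.0, "saving": 15.0, "fonts": 40.0}

     # Counts behind P(w_j | h_i): (times word used when seeking page, times page sought).
     word_used = {("printing", "print"): (90.0, 120.0),
                  ("printing", "jam"):   (30.0, 120.0),
                  ("fonts", "bold"):     (25.0,  40.0)}

     # After the underlying system changes: near-uniform P(h_i), word model untouched.
     page_requested = {p: 1.0 for p in page_requested}

     def p_page(h):                    # P(h_i) from its own counts
         return page_requested[h] / sum(page_requested.values())

     def p_word_given_page(w, h):      # P(w_j | h_i) from its own (unchanged) counts
         used, sought = word_used.get((h, w), (1.0, 2.0))   # pseudocounts for unseen pairs
         return used / sought

     print(p_page("printing"), p_word_given_page("print", "printing"))
     ```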

  4. Consider the unsupervised data of Figure 10.3; a sketch for experimenting with parts (a)–(c) follows the list.

     (a) How many different stable assignments of examples to classes does the k-means algorithm find when k=2? [Hint: Try running the algorithm on the data with a number of different starting points, but also think about what assignments of examples to classes are stable.] Do not count permutations of the labels as different assignments.

     (b) Estimate how many different stable assignments there are when k=3.

     (c) Estimate how many different stable assignments there are when k=4.

     (d) Why might someone suggest that three is the natural number of classes in this example? Give a definition for the “natural” number of classes, and use this data to justify the definition.
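
     A sketch for the experiments in (a)–(c), using a hand-rolled k-means with random restarts; `data` is only a placeholder, so substitute the points from Figure 10.3 before drawing any conclusions.

     ```python
     import random

     # Placeholder points; replace with the data of Figure 10.3.
     data = [(0.7, 5.1), (1.5, 6.0), (2.1, 4.5), (2.4, 5.5), (3.0, 4.4),
             (3.5, 5.0), (4.5, 1.5), (5.2, 0.7), (5.3, 1.8), (6.2, 1.7),
             (6.7, 2.5), (8.5, 9.2), (9.1, 9.7), (9.5, 8.5)]

     def kmeans(points, k, rng):
         """Run k-means from a random start; return the stable assignment reached."""
         centers = rng.sample(points, k)
         assignment = None
         while True:
             new = [min(range(k),
                        key=lambda c: (p[0] - centers[c][0]) ** 2 +
                                      (p[1] - centers[c][1]) ** 2)
                    for p in points]
             if new == assignment:
                 return tuple(new)
             assignment = new
             for c in range(k):
                 members = [p for p, a in zip(points, assignment) if a == c]
                 if members:                  # keep the old center if a class empties
                     centers[c] = (sum(x for x, _ in members) / len(members),
                                   sum(y for _, y in members) / len(members))

     def canonical(assignment):
         """Relabel classes by first appearance so label permutations compare equal."""
         relabel, out = {}, []
         for a in assignment:
             relabel.setdefault(a, len(relabel))
             out.append(relabel[a])
         return tuple(out)

     rng = random.Random(0)
     for k in (2, 3, 4):
         stable = {canonical(kmeans(data, k, rng)) for _ in range(200)}
         print(k, "->", len(stable), "distinct stable assignment(s) found")
     ```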

  5. Suppose the k-means algorithm is run for an increasing sequence of values for k, and that for each k it is run a number of times to find the assignment with the global minimum error. Is it possible that there are values of k for which the error plateaus and then has a large improvement (e.g., the errors for k=3, k=4, and k=5 are about the same, but the error for k=6 is much lower)? If so, give an example. If not, explain why.
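
     One way to look for such a plateau-then-drop pattern, sketched with scikit-learn's KMeans (many restarts per k via n_init); the data below is only a placeholder, and constructing points deliberately is part of the exercise.

     ```python
     import numpy as np
     from sklearn.cluster import KMeans   # assumes scikit-learn is installed

     # Placeholder data; choose the points yourself if trying to produce a plateau.
     data = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9],
                      [10.0, 0.0], [10.2, 0.3], [0.1, 9.9], [0.0, 10.1],
                      [10.0, 10.0], [9.8, 10.2], [5.0, 0.1], [4.9, 9.8]])

     for k in range(1, 9):
         # Best (minimum) sum-of-squares error over 50 random restarts for this k.
         err = KMeans(n_clusters=k, n_init=50, random_state=0).fit(data).inertia_
         print(f"k={k}: minimum error over restarts = {err:.2f}")
     ```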

  6. To initialize the EM algorithm in Figure 10.6, consider two alternatives:

     (a) allow P to return a random distribution the first time through the loop

     (b) initialize cc and fc to random values

     By running the algorithm on some data sets, determine which, if any, of these alternatives is better in terms of the log loss on the training data, as a function of the number of loops through the data set. Does it matter if cc and fc are not consistent with the semantics (i.e., counts that should be equal are not)? A sketch of such an experiment is given below.
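
     The sketch below is in the spirit of the soft-clustering EM loop of Figure 10.6 but is not a transcription of it: binary features, a made-up data set, and the two initializations compared by the training log loss after each loop.

     ```python
     import math
     import random

     # Hypothetical binary-feature examples; K classes.
     DATA = [(1, 1, 0), (1, 0, 0), (1, 1, 1), (0, 0, 1),
             (0, 1, 1), (0, 0, 0), (1, 1, 0), (0, 0, 1)]
     K, NF, N = 2, 3, len(DATA)
     EPS = 1e-9                  # tiny smoothing so no probability is exactly 0 or 1

     def params(cc, fc):
         """Model parameters implied by the expected counts cc (class) and fc (feature)."""
         pc = [(cc[c] + EPS) / (sum(cc) + K * EPS) for c in range(K)]
         pf = [[(fc[c][j] + EPS) / (cc[c] + 2 * EPS) for j in range(NF)] for c in range(K)]
         return pc, pf

     def lik(ex, c, pc, pf):
         """P(class=c) * P(example | class=c) under the naive Bayes model."""
         l = pc[c]
         for j in range(NF):
             l *= pf[c][j] if ex[j] else 1 - pf[c][j]
         return l

     def log_loss(cc, fc):
         pc, pf = params(cc, fc)
         return -sum(math.log(sum(lik(ex, c, pc, pf) for c in range(K)))
                     for ex in DATA) / N

     def e_step(cc, fc):
         """Responsibilities P(class | example) for every example."""
         pc, pf = params(cc, fc)
         resp = []
         for ex in DATA:
             like = [lik(ex, c, pc, pf) for c in range(K)]
             z = sum(like)
             resp.append([l / z for l in like])
         return resp

     def m_step(resp):
         """Recompute cc and fc from the responsibilities."""
         cc = [sum(resp[i][c] for i in range(N)) for c in range(K)]
         fc = [[sum(resp[i][c] for i in range(N) if DATA[i][j]) for j in range(NF)]
               for c in range(K)]
         return cc, fc

     def run(init, loops=15, seed=1):
         rng = random.Random(seed)
         if init == "(a) random first distribution":
             resp = [[rng.random() for _ in range(K)] for _ in range(N)]
             cc, fc = m_step([[r / sum(row) for r in row] for row in resp])
         else:   # "(b) random cc and fc" -- need not be consistent as counts
             cc = [rng.uniform(1.0, N) for _ in range(K)]
             fc = [[rng.uniform(0.0, cc[c]) for _ in range(NF)] for c in range(K)]
         losses = []
         for _ in range(loops):
             cc, fc = m_step(e_step(cc, fc))
             losses.append(log_loss(cc, fc))
         return losses

     for init in ("(a) random first distribution", "(b) random cc and fc"):
         print(init, ["%.3f" % loss for loss in run(init)])
     ```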

  7. As outlined in Example 10.7, define a code for describing decision trees. Make sure that each code corresponds to a decision tree (for every sufficiently long sequence of bits, the initial segment of the sequence describes a unique decision tree) and that each decision tree has a code. How does this code translate into a prior distribution on trees? In particular, how much does the likelihood of introducing a new split have to increase to offset the reduction in the prior probability of the split (assuming that smaller trees are easier to describe than larger trees in your code)? One possible encoding is sketched below.
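
     The sketch below is one possible encoding consistent with the exercise, not the one intended by the text: 0 plus a class bit for a leaf, 1 plus a fixed-width feature index plus the two subtrees for a split. The prior is proportional to 2 to the minus code length, and the last line computes the factor by which the likelihood must grow to pay for a new split under this particular code.

     ```python
     import math

     # A tree is either ("leaf", cls) or ("split", feature, left, right),
     # over NF binary input features and a binary class.
     NF = 4                                   # assumed number of input features
     FEAT_BITS = math.ceil(math.log2(NF))     # fixed-width code for the split feature

     def encode(tree):
         """0 + class bit for a leaf; 1 + feature bits + left code + right code for a split."""
         if tree[0] == "leaf":
             return "0" + str(tree[1])
         _, f, left, right = tree
         return "1" + format(f, f"0{FEAT_BITS}b") + encode(left) + encode(right)

     def decode(bits, pos=0):
         """Inverse of encode; any sufficiently long bit string has a unique tree prefix."""
         if bits[pos] == "0":
             return ("leaf", int(bits[pos + 1])), pos + 2
         f = int(bits[pos + 1:pos + 1 + FEAT_BITS], 2)
         left, pos = decode(bits, pos + 1 + FEAT_BITS)
         right, pos = decode(bits, pos)
         return ("split", f, left, right), pos

     def prior(tree):
         """Prior proportional to 2^(-code length), an MDL-style prior."""
         return 2.0 ** -len(encode(tree))

     split = ("split", 2, ("leaf", 0), ("leaf", 1))
     print(encode(split), decode(encode(split))[0] == split, prior(split))

     # Replacing a leaf (2 bits) by a split with two leaves (1 + FEAT_BITS + 4 bits)
     # lengthens the code by 3 + FEAT_BITS bits, so the likelihood must grow by a
     # factor of at least 2 ** (3 + FEAT_BITS) to offset the drop in the prior.
     print("required likelihood factor:", 2 ** (3 + FEAT_BITS))
     ```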