11.6 Exercises
 Exercise 11.1:
 Consider the unsupervised data of Figure 11.1.
 How many different stable assignments of examples to classes does the kmeans algorithm find when k=2? [Hint: Try running the algorithm on the data with a number of different starting points, but also think about what assignments of examples to classes are stable.] Do not count permutations of the labels as different assignments.
 How many different stable assignments are there when k=3?
 How many different stable assignments are there when k=4?
 Why might someone suggest that three is the natural number of classes in this example? Give a definition for "natural" number of classes, and use this data to justify the definition.
 Exercise 11.2:
 Suppose the kmeans algorithm is run for an increasing sequence of values for k, and that it is run for a number of times for each k to find the assignment with a global minimum error. Is it possible that a number of values of k exist for which the error plateaus and then has a large improvement (e.g., when the error for k=3, k=4, and k=5 are about the same, but the error for k=6 is much lower)? If so, give an example. If not, explain why.
 Exercise 11.3:
 Give an algorithm for EM for unsupervised learning [Figure 11.4] that does not store an A array, but rather recomputes the appropriate value for the M step. Each iteration should only involve one sweep through the data set. [Hint: For each tuple in the data set, update all of the relevant M_{i}values.]
 Exercise 11.4:
 Suppose a Qlearning agent, with fixed α and discount γ, was in state 34, did action 7, received reward 3, and ended up in state 65. What value(s) get updated? Give an expression for the new value. (Be as specific as possible.)
 Exercise 11.5:
 Explain what happens in reinforcement learning if the agent always chooses the action that maximizes the Qvalue. Suggest two ways to force the agent to explore.
 Exercise 11.6:
 Explain how Qlearning fits in with the agent
architecture of Section 2.2.1. Suppose that the Qlearning agent has discount factor γ,
a step size of α, and is carrying out an εgreedy
exploration strategy.
 What are the components of the belief state of the Qlearning agent?
 What are the percepts?
 What is the command function of the Qlearning agent?
 What is the beliefstate transition function of the Qlearning agent?
 Exercise 11.7:
 For the plot of the total reward as a function of time as in Figure 11.12, the minimum and zero crossing are only meaningful statistics when balancing positive and negative rewards is reasonable behavior. Suggest what should replace these statistics when zero is not an appropriate definition of reasonable behavior. [Hint: Think about the cases that have only positive reward or only negative reward.]
 Exercise 11.8:

Compare the different parameter settings for the game of
Example 11.8.
In particular compare the following situations:
 α varies, and the Qvalues are initialized to 0.0.
 α varies, and the Qvalues are initialized to 5.0.
 α is fixed to 0.1, and the Qvalues are initialized to 0.0.
 α is fixed to 0.1, and the Qvalues are initialized to 5.0.
 Some other parameter settings.
For each of these, carry out multiple runs and compare the distributions of minimum values, zero crossing, the asymptotic slope for the policy that includes exploration, and the asymptotic slope for the policy that does not include exploration. To do the last task, after the algorithm has converged, set the exploitation parameter to 100% and run a large number of additional steps.
 Exercise 11.9:
 Consider four different ways to derive the value of α_{k}
from k in Qlearning (note that for Qlearning with varying
α_{k}, there must be a different count
k for each stateaction pair).
 Let α_{k} = 1/k.
 Let α_{k} = 10/(9+k).
 Let α_{k} = 0.1.
 Let α_{k} = 0.1 for the first 10,000 steps, α_{k} = 0.01 for the next 10,000 steps, α_{k} = 0.001 for the next 10,000 steps, α_{k} = 0.0001 for the next 10,000 steps, and so on.
 Which of these will converge to the true Qvalue in theory?
 Which converges to the true Qvalue in practice (i.e., in a reasonable number of steps)? Try it for more than one domain.
 Which can adapt when the environment adapts slowly?
 Exercise 11.10:
 Suppose your friend presented you with the following example where
SARSA(λ) seems to give unintuitive results.
There are two states, A and B. There is a reward of
10 coming into state A and no other rewards or penalties. There are
two actions: left and right. These actions only make a difference in
state B. Going left in state B goes directly to state A, but
going right has a low probability of going into state A. In particular:
 P(AB, left) = 1; reward is 10.
 P(AB, right) = 0.01; reward is 10. P(BB, right) = 0.99; reward is 0.
 P(AA, left) = P(AA, right) = 0.999 and P(BA, left) = P(BA, right) = 0.001. This is small enough that the eligibility traces will be close enough to zero when state B is entered.
 γ and λ are 0.9 and α is 0.4.
Suppose that your friend claimed that that Q(λ) does not work in this example, because the eligibility trace for the action right in state B ends up being bigger than the eligibility trace for action left in state B and the rewards and all of the parameters are the same. In particular, the eligibility trace for action right will be about 5 when it ends up entering state A, but it be 1 for action left. Therefore, the best action will be to go right in state B, which is not correct.
What is wrong with your friend's argument? What does this example show?
 Exercise 11.11:
 In SARSA with linear function approximators, if you use linear regression to minimize r+γQ_{w}(s',a')Q_{w}(s,a), you get a different result than we have here. Explain what you get and why what is described in the text may be preferable (or not).
 Exercise 11.12:
 In Example 11.14, some of the features are perfectly correlated (e.g., F_{6} and F_{7}). Does having such correlated features affect what functions can be represented? Does it help or hurt the speed at which learning occurs?