## 11.6 Exercises

Exercise 11.1:
Consider the unsupervised data of Figure 11.1.
1. How many different stable assignments of examples to classes does the k-means algorithm find when k=2? [Hint: Try running the algorithm on the data with a number of different starting points, but also think about what assignments of examples to classes are stable.] Do not count permutations of the labels as different assignments.
2. How many different stable assignments are there when k=3?
3. How many different stable assignments are there when k=4?
4. Why might someone suggest that three is the natural number of classes in this example? Give a definition for "natural" number of classes, and use this data to justify the definition.
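As one way to experiment with the hint, here is a minimal k-means sketch that can be run from different starting points to look for distinct stable assignments. The 2-D data set below is an invented stand-in, not the data of Figure 11.1, and the function names are illustrative.

```python
import random

def kmeans(points, k, seed=0):
    """Plain k-means on 2-D points; returns a stable assignment of points to classes.

    An assignment is stable when no center moves in the update step."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    while True:
        # Assignment step: each point goes to its nearest center.
        assign = [min(range(k),
                      key=lambda c: (p[0] - centers[c][0]) ** 2
                                    + (p[1] - centers[c][1]) ** 2)
                  for p in points]
        # Update step: each center moves to the mean of its assigned points.
        new_centers = []
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                new_centers.append((sum(p[0] for p in members) / len(members),
                                    sum(p[1] for p in members) / len(members)))
            else:
                new_centers.append(centers[c])
        if new_centers == centers:   # stable: no center moved
            return assign
        centers = new_centers

# Different starting points can converge to different stable assignments.
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (5, 5)]
stable = {tuple(kmeans(points, 2, seed=s)) for s in range(20)}
```

Note that the set `stable` may still count label permutations as different; the exercise asks you to identify assignments up to relabeling.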
Exercise 11.2:
Suppose the k-means algorithm is run for an increasing sequence of values for k, and that it is run a number of times for each k to find the assignment with a global minimum error. Is it possible that there are values of k for which the error plateaus and then has a large improvement (e.g., the error for k=3, k=4, and k=5 is about the same, but the error for k=6 is much lower)? If so, give an example. If not, explain why.
Exercise 11.3:
Give an algorithm for EM for unsupervised learning [Figure 11.4] that does not store an A array, but rather recomputes the appropriate value for the M step. Each iteration should only involve one sweep through the data set. [Hint: For each tuple in the data set, update all of the relevant Mi-values.]
Exercise 11.4:
Suppose a Q-learning agent, with fixed α and discount γ, was in state 34, did action 7, received reward 3, and ended up in state 65. What value(s) get updated? Give an expression for the new value. (Be as specific as possible.)
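For reference, the generic Q-learning update this exercise is about can be sketched as follows. The transition (state 34, action 7, reward 3, next state 65) comes from the exercise; the zero-initialized Q-table and the values α = 0.2, γ = 0.9 are illustrative assumptions, since the exercise leaves α and γ symbolic.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, alpha, gamma):
    """One Q-learning update:
    Q[s][a] += alpha * (r + gamma * max_a' Q[s'][a'] - Q[s][a])."""
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])
    return Q[s][a]

Q = defaultdict(lambda: defaultdict(float))  # Q-values default to 0 (an assumption)
Q[65][0] = 0.0  # ensure state 65 has at least one action value to maximize over
# The transition from the exercise: state 34, action 7, reward 3, next state 65.
new_val = q_update(Q, 34, 7, r=3, s_next=65, alpha=0.2, gamma=0.9)
```

Only the single entry `Q[34][7]` changes; with these assumed values the update gives `0 + 0.2 * (3 + 0.9 * 0 - 0) = 0.6`.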
Exercise 11.5:
Explain what happens in reinforcement learning if the agent always chooses the action that maximizes the Q-value. Suggest two ways to force the agent to explore.
Exercise 11.6:
Explain how Q-learning fits in with the agent architecture of Section 2.2.1. Suppose that the Q-learning agent has discount factor γ, a step size of α, and is carrying out an ε-greedy exploration strategy.
1. What are the components of the belief state of the Q-learning agent?
2. What are the percepts?
3. What is the command function of the Q-learning agent?
4. What is the belief-state transition function of the Q-learning agent?
Exercise 11.7:
For the plot of the total reward as a function of time as in Figure 11.12, the minimum and zero crossing are only meaningful statistics when balancing positive and negative rewards is reasonable behavior. Suggest what should replace these statistics when zero is not an appropriate definition of reasonable behavior. [Hint: Think about the cases that have only positive reward or only negative reward.]
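As a concrete reading of these statistics, the following sketch computes the minimum of the cumulative reward and the point after which the cumulative reward stays non-negative. The reward sequence and the function name are invented for illustration.

```python
def reward_statistics(rewards):
    """Minimum of the cumulative reward, and the step after which the
    cumulative reward last stops being negative (0 if it never is)."""
    total, totals = 0, []
    for r in rewards:
        total += r
        totals.append(total)
    min_total = min(totals)
    for i in range(len(totals) - 1, -1, -1):
        if totals[i] < 0:            # last time the agent was "in the red"
            zero_crossing = i + 1
            break
    else:
        zero_crossing = 0
    return min_total, zero_crossing

# Early exploration loses reward; later exploitation recovers it.
rewards = [-1, -2, -1, 1, 2, 2, 2]
stats = reward_statistics(rewards)
```

With rewards that are, say, all negative, the cumulative total never crosses zero, which is what makes these statistics uninformative in that case.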
Exercise 11.8:
Compare the different parameter settings for the game of Example 11.8. In particular, compare the following situations:
1. α varies, and the Q-values are initialized to 0.0.
2. α varies, and the Q-values are initialized to 5.0.
3. α is fixed to 0.1, and the Q-values are initialized to 0.0.
4. α is fixed to 0.1, and the Q-values are initialized to 5.0.
5. Some other parameter settings.

For each of these, carry out multiple runs and compare the distributions of minimum values, zero crossing, the asymptotic slope for the policy that includes exploration, and the asymptotic slope for the policy that does not include exploration. To do the last task, after the algorithm has converged, set the exploitation parameter to 100% and run a large number of additional steps.

Exercise 11.9:
Consider four different ways to derive the value of αk from k in Q-learning (note that for Q-learning with varying αk, there must be a different count k for each state-action pair).
1. Let αk = 1/k.
2. Let αk = 10/(9+k).
3. Let αk = 0.1.
4. Let αk = 0.1 for the first 10,000 steps, αk = 0.01 for the next 10,000 steps, αk = 0.001 for the next 10,000 steps, αk = 0.0001 for the next 10,000 steps, and so on.
1. Which of these will converge to the true Q-value in theory?
2. Which converges to the true Q-value in practice (i.e., in a reasonable number of steps)? Try it for more than one domain.
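The four schedules can be written down directly as functions of the count k; this sketch is just a transcription of the list above (the function name is made up).

```python
def alpha_schedules(k):
    """The four step-size schedules from the exercise, as functions of the
    count k (k >= 1) for a given state-action pair."""
    inv = 1 / k                                   # (1) alpha_k = 1/k
    shifted = 10 / (9 + k)                        # (2) alpha_k = 10/(9+k)
    constant = 0.1                                # (3) fixed alpha_k = 0.1
    staged = 0.1 / (10 ** ((k - 1) // 10_000))    # (4) 0.1, then 0.01, ... every 10,000 steps
    return inv, shifted, constant, staged
```

For part 1, it may help to recall the standard convergence conditions on step sizes: the sum of the αk must diverge while the sum of their squares converges.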
Exercise 11.10:
Suppose your friend presented you with the following example where SARSA(λ) seems to give unintuitive results. There are two states, A and B. There is a reward of 10 coming into state A and no other rewards or penalties. There are two actions: left and right. These actions only make a difference in state B. Going left in state B goes directly to state A, but going right has a low probability of going into state A. In particular:
• P(A|B, left) = 1; reward is 10.
• P(A|B, right) = 0.01; reward is 10. P(B|B, right) = 0.99; reward is 0.
• P(A|A, left) = P(A|A, right) = 0.999 and P(B|A, left) = P(B|A, right) = 0.001. This is small enough that the eligibility traces will be close enough to zero when state B is entered.
• γ and λ are 0.9 and α is 0.4.

Suppose that your friend claimed that SARSA(λ) does not work in this example, because the eligibility trace for the action right in state B ends up being bigger than the eligibility trace for action left in state B, even though the rewards and all of the parameters are the same. In particular, the eligibility trace for action right will be about 5 when the agent ends up entering state A, but it will be only 1 for action left. Therefore, the best action will be to go right in state B, which is not correct.


What is wrong with your friend's argument? What does this example show?
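To see where the "about 5" figure comes from, here is a minimal sketch of an accumulating eligibility trace for a state-action pair that is repeated many times in a row, using the exercise's γ = 0.9 and λ = 0.9 (the function name is an invention for illustration).

```python
def trace_after_repeats(n, gamma=0.9, lam=0.9):
    """Accumulating eligibility trace for an action taken n times in a row,
    decayed by gamma * lam between visits, as in SARSA(lambda)."""
    e = 0.0
    for _ in range(n):
        e = gamma * lam * e + 1   # decay the trace, then accumulate on each visit
    return e
```

With γλ = 0.81, repeated visits drive the trace toward 1/(1 − 0.81) ≈ 5.26, which is consistent with the value of about 5 that the friend's argument cites, whereas a single visit gives a trace of 1.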

Exercise 11.11:
In SARSA with linear function approximators, if you use linear regression to minimize r + γ·Q_w(s', a') − Q_w(s, a), you get a different result than we have here. Explain what you get and why what is described in the text may be preferable (or not).
Exercise 11.12:
In Example 11.14, some of the features are perfectly correlated (e.g., F6 and F7). Does having such correlated features affect what functions can be represented? Does it help or hurt the speed at which learning occurs?