foundations of computational agents
Explain how Q-learning fits in with the agent architecture of Section 2.2.1. Suppose that the Q-learning agent has discount factor $\gamma $, a step size of $\alpha $, and is carrying out an $\epsilon$-greedy exploration strategy.
What are the components of the belief state of the Q-learning agent?
What are the percepts?
What is the command function of the Q-learning agent?
What is the belief-state transition function of the Q-learning agent?
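As a starting point for answering these questions, here is one possible sketch of a tabular $\epsilon$-greedy Q-learning agent in Python. The class and method names are illustrative (not from the book's code); the comments indicate which parts play the roles of belief state, command function, and belief-state transition function.

```python
import random

class QLearningAgent:
    """A minimal tabular Q-learning agent with epsilon-greedy exploration.

    Belief state: the Q-value estimates (plus visit counts, if the step
    size depends on experience).  Percepts: the reward and the resulting
    state observed after each action.
    """

    def __init__(self, actions, gamma=0.9, alpha=0.1, epsilon=0.1, q_init=0.0):
        self.actions = actions
        self.gamma = gamma
        self.alpha = alpha
        self.epsilon = epsilon
        self.q_init = q_init
        self.q = {}   # belief state: Q[(state, action)]

    def q_value(self, s, a):
        return self.q.get((s, a), self.q_init)

    def command(self, s):
        """Command function: map the belief state and the current state to
        an action, exploring with probability epsilon."""
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q_value(s, a))

    def update(self, s, a, reward, s_next):
        """Belief-state transition function: fold the percept
        (reward, next state) into the Q-value estimates."""
        best_next = max(self.q_value(s_next, a2) for a2 in self.actions)
        td_error = reward + self.gamma * best_next - self.q_value(s, a)
        self.q[(s, a)] = self.q_value(s, a) + self.alpha * td_error
```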
For the plot of the total reward as a function of time as in Figure 12.4, the minimum and zero crossing are only meaningful statistics when balancing positive and negative rewards is reasonable behavior. Suggest what should replace these statistics when zero reward is not an appropriate definition of reasonable behavior. [Hint: Think about the cases that have only positive reward or only negative reward.]
Compare the different parameter settings for the game of Example 12.2. In particular, compare the following situations:
$\alpha $ varies, and the $Q$-values are initialized to 0.0
$\alpha $ varies, and the $Q$-values are initialized to 5.0
$\alpha $ is fixed to 0.1, and the $Q$-values are initialized to 0.0
$\alpha $ is fixed to 0.1, and the $Q$-values are initialized to 5.0
Some other parameter settings.
For each of these, carry out multiple runs and compare
the distributions of minimum values
the zero crossing
the asymptotic slope for the policy that includes exploration
the asymptotic slope for the policy that does not include exploration. To test this, after the algorithm has explored, set the exploitation parameter to 100% and run additional steps.
Which of these settings would you recommend? Why?
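When comparing runs, it helps to compute the statistics above from the trace of per-step rewards. The following helper is a hypothetical sketch (the function name, tail fraction, and the exact definition of the zero crossing as the first return to zero after the minimum are assumptions, not the book's definitions):

```python
def reward_statistics(rewards, tail_fraction=0.5):
    """Summary statistics of a run: the minimum of the cumulative reward,
    the first step (at or after the minimum) where the cumulative reward
    reaches zero again (None if it never does), and the asymptotic slope
    estimated over the final tail of the run."""
    cumulative = []
    total = 0.0
    for r in rewards:
        total += r
        cumulative.append(total)
    minimum = min(cumulative)
    min_index = cumulative.index(minimum)
    zero_crossing = next(
        (i for i in range(min_index, len(cumulative)) if cumulative[i] >= 0),
        None)
    # slope of the cumulative reward over the last tail_fraction of steps
    k = int(len(cumulative) * (1 - tail_fraction))
    slope = (cumulative[-1] - cumulative[k]) / (len(cumulative) - 1 - k)
    return minimum, zero_crossing, slope
```

To compare the exploring and non-exploring asymptotic slopes, the same function can be applied to the reward trace before and after the exploitation parameter is set to 100%.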
For the following reinforcement learning algorithms:
Q-learning with fixed $\alpha $ and $80\%$ exploitation.
Q-learning with ${\alpha}_{k}=1/k$ and $80\%$ exploitation.
Q-learning with ${\alpha}_{k}=1/k$ and $100\%$ exploitation.
SARSA learning with ${\alpha}_{k}=1/k$ and $80\%$ exploitation.
SARSA learning with ${\alpha}_{k}=1/k$ and $100\%$ exploitation.
Feature-based SARSA learning with soft-max action selection.
A model-based reinforcement learner with $50\%$ exploitation.
Which of the reinforcement learning algorithms will find the optimal policy, given enough time?
Which ones will actually follow the optimal policy?
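The key difference between the Q-learning and SARSA variants above lies in the bootstrapped target. A minimal sketch of the two updates (illustrative names; `Q` is a dictionary mapping (state, action) pairs to estimates, defaulting to 0):

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """Off-policy target: bootstrap from the best action in s_next,
    regardless of which action the (exploring) policy actually takes."""
    target = r + gamma * max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy target: bootstrap from the action a_next that the
    current (possibly exploring) policy actually chose in s_next."""
    target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
```

This distinction is what drives the answers: an off-policy learner can estimate the optimal Q-function while exploring, but whether it *follows* the optimal policy depends on how much exploration remains.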
Consider four different ways to derive the value of ${\alpha}_{k}$ from $k$ in Q-learning (note that for Q-learning with varying ${\alpha}_{k}$, there must be a different count $k$ for each state–action pair).
Let ${\alpha}_{k}=1/k$.
Let ${\alpha}_{k}=10/(9+k)$.
Let ${\alpha}_{k}=0.1$.
Let ${\alpha}_{k}=0.1$ for the first 10,000 steps, ${\alpha}_{k}=0.01$ for the next 10,000 steps, ${\alpha}_{k}=0.001$ for the next 10,000 steps, ${\alpha}_{k}=0.0001$ for the next 10,000 steps, and so on.
Which of these will converge to the true $Q$-value in theory?
Which converges to the true $Q$-value in practice (i.e., in a reasonable number of steps)? Try it for more than one domain.
Which are able to adapt if the environment changes slowly?
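The four schedules can be written directly as functions of the per-(state, action) count $k$ (the staircase indexing below is one possible reading of the fourth schedule). Recall the standard stochastic-approximation conditions for convergence: $\sum_{k}\alpha_{k}=\infty$ and $\sum_{k}\alpha_{k}^{2}<\infty$.

```python
def alpha_schedules():
    """The four step-size schedules from the exercise, as functions of the
    per-(state, action) count k (k >= 1).  The first two satisfy
    sum(alpha_k) = infinity and sum(alpha_k^2) < infinity; the constant
    schedule fails the second condition, and the staircase schedule has a
    finite sum of step sizes."""
    return {
        "1/k": lambda k: 1.0 / k,
        "10/(9+k)": lambda k: 10.0 / (9.0 + k),
        "0.1": lambda k: 0.1,
        # 0.1 for k = 1..10000, 0.01 for the next 10000 steps, and so on
        "staircase": lambda k: 0.1 / (10 ** ((k - 1) // 10000)),
    }
```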
The model-based reinforcement learner allows for a different form of optimism in the face of uncertainty. The algorithm can be started with each state having a transition to a “nirvana” state, which has very high $Q$-value (but which will never be reached in practice, and so the probability will shrink to zero).
Does this perform differently from initializing all $Q$-values to a high value? Does it work better, worse, or the same?
How high does the Q-value for the nirvana state need to be to work most effectively? Suggest a reason why one value might be good, and test it.
Could this method be used for the other RL algorithms? Explain how or why not.
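One possible way to set up the nirvana-state model, as a hedged sketch (the names, the pseudo-count mechanism, and the count representation are assumptions about how one might implement the exercise, not a prescribed design): each (state, action) pair starts with a pseudo-count transition to an absorbing high-value state, so real experience gradually drives the estimated probability of reaching it toward zero.

```python
NIRVANA = "nirvana"

def init_model(states, actions, q_nirvana=10.0, pseudo_count=1.0):
    """Optimistic model initialization: every (state, action) pair starts
    with a pseudo-count transition to an absorbing 'nirvana' state of high
    value.  As real transition counts accumulate, the estimated probability
    of reaching nirvana shrinks toward zero."""
    counts = {(s, a): {NIRVANA: pseudo_count} for s in states for a in actions}
    values = {s: 0.0 for s in states}
    values[NIRVANA] = q_nirvana
    return counts, values

def transition_probs(counts, s, a):
    """Empirical transition distribution, including the fading nirvana mass."""
    c = counts[(s, a)]
    total = sum(c.values())
    return {s2: n / total for s2, n in c.items()}
```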
Included in the features for the grid game of Example 12.6 are features giving the $x$-distance to the current treasure and features giving the $y$-distance to the current treasure. Chris thought that these were not useful because they do not depend on the action. Do these features make a difference? Explain why they might or might not. Do they make a difference in practice?
In SARSA with linear function approximation, using linear regression to minimize the squared error ${(r+\gamma {Q}_{\overline{w}}({s}^{\prime},{a}^{\prime})-{Q}_{\overline{w}}(s,a))}^{2}$ gives a different algorithm from the one in Figure 12.7. Explain what you get and why what is described in the text may be preferable (or not).
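For contrast, here is a sketch of the semi-gradient style of update (illustrative names; this treats the bootstrapped target as a constant and only differentiates ${Q}_{\overline{w}}(s,a)$, whereas full regression on the squared TD error would also differentiate through the target, yielding a residual-gradient-style algorithm):

```python
def sarsa_linear_update(w, features, s, a, r, s_next, a_next, alpha, gamma):
    """Semi-gradient SARSA for Q_w(s, a) = w . features(s, a): the target
    r + gamma * Q_w(s', a') is held fixed, so the update moves w along the
    gradient of Q_w(s, a) only."""
    f = features(s, a)
    f_next = features(s_next, a_next)
    q = sum(wi * fi for wi, fi in zip(w, f))
    q_next = sum(wi * fi for wi, fi in zip(w, f_next))
    delta = r + gamma * q_next - q           # TD error
    return [wi + alpha * delta * fi for wi, fi in zip(w, f)]
```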
In Example 12.6, some of the features are perfectly correlated (e.g., ${F}_{6}$ and ${F}_{7}$). Does having such correlated features affect which functions can be represented? Does it help or hurt the speed at which learning occurs? Test this empirically on some examples.
Consider the policy improvement algorithm. At equilibrium the values of the most-preferred actions should be equal. Propose, implement and evaluate an algorithm where the policy does not change very much when the values of the most-preferred actions are close. [Hint: Consider having the probability of all actions change in proportion to the distance from the best action and use a temperature parameter in the definition of distance.]
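Following the hint, a soft-max distribution with a temperature parameter is one way to make the policy change smoothly when the most-preferred actions have nearly equal values. A minimal sketch (function name and the max-subtraction for numerical stability are implementation choices, not part of the exercise):

```python
import math

def soft_policy(q_values, temperature=1.0):
    """Soft-max distribution over actions given their Q-values.  Actions
    whose values are close to the best get nearly equal probability; as the
    temperature falls, the policy approaches the greedy one."""
    m = max(q_values.values())  # subtract the max for numerical stability
    exps = {a: math.exp((q - m) / temperature) for a, q in q_values.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}
```

An algorithm for the exercise could update the policy toward `soft_policy` of the current action values, so equal-valued best actions yield an (almost) unchanged policy.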