foundations of computational agents
Explain how Q-learning fits in with the agent architecture of Section 2.1.1. Suppose that the Q-learning agent has discount factor $\gamma $, a step size of $\alpha $, and is carrying out an $\epsilon$-greedy exploration strategy.
What are the components of the belief state of the Q-learning agent?
What are the percepts?
What is the command function of the Q-learning agent?
What is the belief-state transition function of the Q-learning agent?
Suppose a Q-learning agent, with fixed $\alpha $ and discount $\gamma $, was in state 34, did action 7, received reward 3, and ended up in state 65. What value(s) get updated? Give an expression for the new value. (Be as specific as possible.)
For the plot of the total reward as a function of time as in Figure 13.4, the minimum and zero crossing are only meaningful statistics when balancing positive and negative rewards is reasonable behavior. Suggest what should replace these statistics when zero reward is not an appropriate definition of reasonable behavior. [Hint: Think about the cases that have only positive reward or only negative reward.]
Compare the different parameter settings for $Q$-learning for the game of Example 13.2 (the “monster game” in AIPython (aipython.org)). In particular, compare the following situations:
$step\mathrm{\_}size(c)=1/c$ and the $Q$-values are initialized to 0.0.
$step\mathrm{\_}size(c)=10/(9+c)$ and the $Q$-values are initialized to 0.0.
$\alpha $ varies (using whichever of (i) and (ii) is better) and the $Q$-values are initialized to 5.0.
$\alpha $ is fixed to 0.1 and the $Q$-values are initialized to 0.0.
$\alpha $ is fixed to 0.1 and the $Q$-values are initialized to 5.0.
Some other parameter settings.
For each of these, carry out multiple runs and compare
the distributions of minimum values
the zero crossing
the asymptotic slope for the policy that includes exploration
the asymptotic slope for the policy that does not include exploration (to test this, after the algorithm has explored, set the exploitation parameter to 100% and run additional steps).
Which of these settings would you recommend? Why?
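The statistics compared above can be computed from a logged run. The following is a minimal sketch, assuming each run is recorded as a Python list `cumulative`, where `cumulative[t]` is the total reward accumulated after step `t`. The helper names `minimum_value`, `zero_crossing`, and `asymptotic_slope`, and the `tail` parameter, are hypothetical and are not part of AIPython.

```python
def minimum_value(cumulative):
    """Lowest point reached by the cumulative reward."""
    return min(cumulative)


def zero_crossing(cumulative):
    """First step at or after the minimum where the cumulative reward is >= 0.

    Returns None if the total never recovers to zero within the run.
    """
    start = cumulative.index(min(cumulative))
    for t in range(start, len(cumulative)):
        if cumulative[t] >= 0:
            return t
    return None


def asymptotic_slope(cumulative, tail=0.2):
    """Average reward per step over the final `tail` fraction of the run.

    Assumption: the run is long enough that this tail is past the
    initial learning phase.
    """
    n = len(cumulative)
    if n < 2:
        raise ValueError("need at least two points to estimate a slope")
    first = min(n - 2, int(n * (1 - tail)))
    return (cumulative[-1] - cumulative[first]) / (n - 1 - first)


# Example with a made-up run: early losses, then a steady gain of 2 per step.
run = [-1, -3, -4, -2, 0, 2, 4, 6, 8, 10]
stats = (minimum_value(run), zero_crossing(run), asymptotic_slope(run, tail=0.5))
```

Collecting these three numbers over multiple runs for each parameter setting gives the distributions to compare.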
For the following reinforcement learning algorithms:
Q-learning with fixed $\alpha $ and $80\%$ exploitation.
Q-learning with ${\alpha}_{k}=1/k$ and $80\%$ exploitation.
Q-learning with ${\alpha}_{k}=1/k$ and $100\%$ exploitation.
SARSA learning with ${\alpha}_{k}=1/k$ and $80\%$ exploitation.
SARSA learning with ${\alpha}_{k}=1/k$ and $100\%$ exploitation.
Feature-based SARSA learning with softmax action selection.
A model-based reinforcement learner with $50\%$ exploitation.
Which of the reinforcement learning algorithms will find the optimal policy, given enough time?
Which ones will actually follow the optimal policy?
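To make the exploitation percentages in the list above concrete, here is a minimal sketch of action selection, assuming that "$80\%$ exploitation" means choosing a greedy action with probability 0.8 and a uniformly random action otherwise. The function name `select_action` and the dictionary representation of the $Q$-values are assumptions for illustration, not the AIPython interface.

```python
import random


def select_action(q_values, exploit=0.8, rng=random):
    """Choose an action for the current state.

    q_values: dict mapping each available action to its estimated Q-value.
    exploit:  probability of choosing a greedy (highest-valued) action.
    """
    actions = list(q_values)
    if rng.random() < exploit:
        best = max(q_values.values())
        # Break ties among equally good actions at random.
        return rng.choice([a for a in actions if q_values[a] == best])
    return rng.choice(actions)


# With exploit=1.0 the agent always acts greedily.
greedy_choice = select_action({"up": 1.0, "down": 0.0}, exploit=1.0)
```

With `exploit` below 1.0, the actions the agent carries out differ from those of its greedy policy on some steps, which is the distinction the two questions above ask about.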
Consider four different ways to derive the value of ${\alpha}_{k}$ from $k$ in Q-learning (note that for Q-learning with varying ${\alpha}_{k}$, there must be a different count $k$ for each state–action pair).
Let ${\alpha}_{k}=1/k$.
Let ${\alpha}_{k}=10/(9+k)$.
Let ${\alpha}_{k}=0.1$.
Let ${\alpha}_{k}=0.1$ for the first 10,000 steps, ${\alpha}_{k}=0.01$ for the next 10,000 steps, ${\alpha}_{k}=0.001$ for the next 10,000 steps, ${\alpha}_{k}=0.0001$ for the next 10,000 steps, and so on.
Which of these will converge to the true $Q$-value in theory?
Which converges to the true $Q$-value in practice (i.e., in a reasonable number of steps)? Try it for more than one domain.
Which are able to adapt if the environment changes slowly?
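The four step-size schedules can be written as functions of the count $k$ for experimentation. This is a minimal sketch: the function names and the `visits` table are hypothetical, and it assumes $k$ starts at 1 and is counted separately for each state–action pair, as the exercise notes. Whether the 10,000-step stages of the fourth schedule should use a per-pair count or a global step count is left as an assumption here (per-pair).

```python
from collections import defaultdict


def alpha_inverse(k):
    """alpha_k = 1/k."""
    return 1 / k


def alpha_shifted(k):
    """alpha_k = 10/(9+k)."""
    return 10 / (9 + k)


def alpha_constant(k):
    """alpha_k = 0.1 regardless of k."""
    return 0.1


def alpha_staged(k):
    """0.1 for k in 1..10,000, 0.01 for k in 10,001..20,000, and so on."""
    return 10.0 ** -(1 + (k - 1) // 10_000)


# One visit count per (state, action) pair.
visits = defaultdict(int)


def next_alpha(state, action, schedule):
    """Increment the pair's count and return the step size for this update."""
    visits[(state, action)] += 1
    return schedule(visits[(state, action)])
```

For example, two successive calls of `next_alpha(34, 7, alpha_inverse)` return 1.0 and then 0.5, while a first call for a different pair again returns 1.0.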
The model-based reinforcement learner allows for a different form of optimism in the face of uncertainty. The algorithm can be started with each state having a transition to a “nirvana” state, which has very high $Q$-value (but which will never be reached in practice, and so the probability will shrink to zero).
Does this perform differently than initializing all $Q$-values to a high value? Does it work better, worse, or the same?
How high does the Q-value for the nirvana state need to be to work most effectively? Suggest a reason why one value might be good, and test it.
Could this method be used for the other RL algorithms? Explain how or why not.
The grid game of Example 13.6 included features for the $x$-distance to the current treasure and the $y$-distance to the current treasure. Chris thought that these were not useful as they do not depend on the action. Do these features make a difference? Explain why they might or might not. Do they make a difference in practice?
In SARSA with linear function approximation, using linear regression to minimize $r+\gamma {Q}_{\overline{w}}({s}^{\prime},{a}^{\prime})-{Q}_{\overline{w}}(s,a)$ gives a different algorithm than Figure 13.8. Explain what you get and why what is described in the text may be preferable (or not). [Hint: what should the weights be adjusted to better estimate?]
In Example 13.6, some of the features are perfectly correlated (e.g., ${F}_{6}$ and ${F}_{7}$). Does having such correlated features affect what functions are able to be represented? Does it help or hurt the speed at which learning occurs? Test this empirically on some examples.