foundations of computational agents
Modify Figure 14.5 to include nature moves. Test it on a (simple) perfect-information game that includes randomized moves (e.g., a coin toss or the roll of a die). Recall that in the extensive form of a game, each internal node labeled with nature has a probability distribution over its children.
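Handling a nature move amounts to backing up an expected value rather than a max or min. The sketch below is not the book's Figure 14.5 code: it assumes a simple explicit `Node` representation (an assumption for illustration), and uses a coin-toss game as the test case.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                  # "max", "min", "nature", or "leaf"
    value: float = 0.0         # payoff, used only at leaves
    children: list = field(default_factory=list)
    probs: list = field(default_factory=list)  # distribution at nature nodes

def expectimax(node):
    """Back up values through a perfect-information game tree with chance nodes."""
    if node.kind == "leaf":
        return node.value
    vals = [expectimax(c) for c in node.children]
    if node.kind == "max":
        return max(vals)
    if node.kind == "min":
        return min(vals)
    # nature: expected value under the node's distribution over its children
    return sum(p * v for p, v in zip(node.probs, vals))

# Coin toss: nature flips a fair coin, then the maximizer picks a payoff.
game = Node("nature", probs=[0.5, 0.5], children=[
    Node("max", children=[Node("leaf", value=1.0), Node("leaf", value=0.0)]),
    Node("max", children=[Node("leaf", value=-1.0), Node("leaf", value=-2.0)]),
])
print(expectimax(game))  # 0.5 * 1.0 + 0.5 * (-1.0) = 0.0
```

Combining this recursion with $\alpha$–$\beta$ bounds takes more care at nature nodes, since the expected value is only partially constrained by the children examined so far.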
Consider the game of Tic-Tac-Toe (also called noughts and crosses), played by two players, an “X” player and an “O” player, who alternate putting their symbol in a blank space on a $3\times 3$ game board. A player wins by placing three of their symbols in a row, column, or diagonal; the game ends when a player wins or the board is full. In the game shown below, player O has just made its third move, and it is X’s turn to make its fourth. The playing agent must decide which of the three available moves X should choose next: X1, X2, or X3. We have started the search tree, with three branches for the three possible moves for X:
Draw the rest of the game tree. Assume the value of the tree is $+1$ for an X win, $-1$ for an O win, and 0 for a draw. Show how the values are backed up to give a value for each of the nodes. What should $X$ do? Is it likely to win, lose, or draw?
Could $\alpha $–$\beta $ pruning prune any of the tree?
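To check the backed-up values by machine, a brute-force minimax for Tic-Tac-Toe fits in a few lines. This is a generic sketch (not the book's Figure 14.5 code); boards are 9-character strings read row by row, with `' '` for blanks.

```python
def winner(board):
    """Return 'X' or 'O' if that player has three in a line, else None."""
    lines = [(0, 1, 2), (3, 4, 5), (6, 7, 8),      # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),      # columns
             (0, 4, 8), (2, 4, 6)]                 # diagonals
    for a, b, c in lines:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def minimax(board, to_move):
    """Back up +1 for an X win, -1 for an O win, 0 for a draw."""
    w = winner(board)
    if w == "X":
        return 1
    if w == "O":
        return -1
    moves = [i for i in range(9) if board[i] == " "]
    if not moves:
        return 0
    nxt = "O" if to_move == "X" else "X"
    vals = [minimax(board[:i] + to_move + board[i + 1:], nxt) for i in moves]
    return max(vals) if to_move == "X" else min(vals)

# X about to complete the top row: the backed-up value is +1 (X wins).
print(minimax("XX OO    ", "X"))  # 1
# From the empty board, perfect play by both sides gives a draw.
print(minimax(" " * 9, "X"))  # 0
```

The hand-drawn tree for the exercise can be checked by encoding the position after O's third move as such a string and calling `minimax` on it.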
For the hawk–dove game of Example 14.11, where $D>0$ and $R>0$, each agent is trying to maximize its utility. Is there a Nash equilibrium with a randomized strategy? What are the probabilities? What is the expected payoff to each agent? (These should be expressed as functions of $R$ and $D$.) Show your calculations.
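The indifference condition behind a randomized equilibrium can also be checked numerically. The solver below works for any symmetric $2\times 2$ game; the hawk–dove payoff matrix fed into it is an assumed form (hawk vs. hawk gives $(R-D)/2$, hawk vs. dove gives $R$ vs. $0$, dove vs. dove splits $R$) and may differ from Example 14.11's exact matrix.

```python
from fractions import Fraction

def symmetric_mixed_eq(u):
    """For a symmetric 2x2 game, u[i][j] is the row player's payoff when
    row plays action i and column plays action j.  Return the probability p
    of action 0 that makes the opponent indifferent between its two
    actions (an interior equilibrium), or None if there is none."""
    num = u[1][1] - u[0][1]
    den = (u[0][0] - u[1][0]) + (u[1][1] - u[0][1])
    if den == 0:
        return None
    p = Fraction(num) / Fraction(den)
    return p if 0 <= p <= 1 else None

# Assumed hawk-dove payoffs (action 0 = hawk, action 1 = dove),
# for illustrative values R = 2, D = 3.
R, D = 2, 3
u = [[Fraction(R - D, 2), R],   # hawk vs hawk, hawk vs dove
     [0, Fraction(R, 2)]]       # dove vs hawk, dove vs dove
print(symmetric_mixed_eq(u))    # 2/3: probability of playing hawk
```

For the exercise itself, the same indifference condition should be solved symbolically in $R$ and $D$ using the payoff matrix of Example 14.11.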
Which of the following games in normal form have a Nash equilibrium made up of pure strategies? For those that do, specify the pure strategy Nash equilibria. For those that do not, explain how you know there is no pure strategy Nash equilibrium.
| Player 1 \ Player 2 | a2 | b2 |
|---|---|---|
| a1 | 10, 10 | 110, 0 |
| b1 | 0, 110 | 100, 100 |
| Player 1 \ Player 2 | a2 | b2 |
|---|---|---|
| a1 | 10, 10 | 11, 20 |
| b1 | 0, 11 | 20, 1 |
| Player 1 \ Player 2 | a2 | b2 |
|---|---|---|
| a1 | 10, 20 | 5, 10 |
| b1 | 7, 11 | 20, 12 |
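A profile of pure strategies is a Nash equilibrium exactly when each player's action is a best response to the other's, which can be checked mechanically. A small checker, run here on the first game above (the other two tables can be fed in the same way):

```python
def pure_nash(row_payoff, col_payoff):
    """Return all pure-strategy Nash equilibria (i, j), where i indexes
    the row player's actions and j the column player's."""
    nrows, ncols = len(row_payoff), len(row_payoff[0])
    eqs = []
    for i in range(nrows):
        for j in range(ncols):
            # i must be a best response to j, and j a best response to i
            row_best = all(row_payoff[i][j] >= row_payoff[k][j]
                           for k in range(nrows))
            col_best = all(col_payoff[i][j] >= col_payoff[i][l]
                           for l in range(ncols))
            if row_best and col_best:
                eqs.append((i, j))
    return eqs

# First game: actions (a1, b1) for player 1 and (a2, b2) for player 2.
row = [[10, 110], [0, 100]]   # player 1's payoffs
col = [[10, 0], [110, 100]]   # player 2's payoffs
print(pure_nash(row, col))    # [(0, 0)], i.e., (a1, a2)
```

The exercise still asks for the *reasoning*: when the checker returns an empty list, explain which best-response condition fails at each profile.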
In Example 14.12, what is the Nash equilibrium with randomized strategies? What is the expected value for each agent in this equilibrium?
Consider the following normal-form game, where the row player can choose action $A$, $B$, or $C$ and the column player can choose action $D$, $E$, or $F$:
|  | $D$ | $E$ | $F$ |
|---|---|---|---|
| $A$ | 40, 40 | 120, 10 | 60, 30 |
| $B$ | 30, 60 | 110, 60 | 90, 90 |
| $C$ | 30, 110 | 100, 100 | 70, 120 |
where the pairs give the value of the outcome for the row player followed by the value for the column player.
When eliminating dominated strategies, what strategies (if any) can be eliminated? Explain what is eliminated, and what cannot be eliminated.
Specify a Nash equilibrium for this game. (For a randomized strategy, give just the actions that are randomized; you do not need to give the probabilities.) Explain why it is a Nash equilibrium.
Is there more than one Nash equilibrium? If so, give another one. If not, explain why there is no other.
If the agents could coordinate, could they get a better outcome than in a Nash equilibrium? Explain why or why not.
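The eliminations can also be checked mechanically. The helper below removes only *strictly* dominated pure strategies (whether weak dominance should also be applied is part of the exercise); the payoffs are those of the table above.

```python
def iterated_elimination(payoff, rows, cols):
    """payoff[(r, c)] = (row player's value, column player's value).
    Repeatedly delete strictly dominated pure strategies; return survivors."""
    rows, cols = list(rows), list(cols)
    changed = True
    while changed:
        changed = False
        # r is strictly dominated if some r2 beats it against every surviving column
        for r in list(rows):
            if any(all(payoff[r2, c][0] > payoff[r, c][0] for c in cols)
                   for r2 in rows if r2 != r):
                rows.remove(r)
                changed = True
        # symmetrically for the column player
        for c in list(cols):
            if any(all(payoff[r, c2][1] > payoff[r, c][1] for r in rows)
                   for c2 in cols if c2 != c):
                cols.remove(c)
                changed = True
    return rows, cols

payoff = {
    ("A", "D"): (40, 40),  ("A", "E"): (120, 10),  ("A", "F"): (60, 30),
    ("B", "D"): (30, 60),  ("B", "E"): (110, 60),  ("B", "F"): (90, 90),
    ("C", "D"): (30, 110), ("C", "E"): (100, 100), ("C", "F"): (70, 120),
}
print(iterated_elimination(payoff, "ABC", "DEF"))
```

Under strict dominance, running this removes only $E$ (which $F$ strictly dominates for the column player); the written answer should say why nothing else can be eliminated this way.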
Answer the same questions as in the previous exercise for the following games:
(i)

|  | $D$ | $E$ | $F$ |
|---|---|---|---|
| $A$ | 2, 11 | 10, 10 | 3, 12 |
| $B$ | 5, 7 | 12, 1 | 6, 5 |
| $C$ | 6, 5 | 13, 2 | 4, 6 |

(ii)

|  | $D$ | $E$ | $F$ |
|---|---|---|---|
| $A$ | 80, 130 | 20, 10 | 130, 80 |
| $B$ | 130, 80 | 30, 20 | 80, 130 |
| $C$ | 20, 10 | 100, 100 | 30, 20 |
Consider the sequential prisoner’s dilemma.
Suppose the agents play for a fixed number of times (say, three times). Give two equilibria if there are two or more; otherwise, give the unique equilibrium and explain why there is only one. [Hint: Consider the last time first.]
Suppose there is a discount factor of $\gamma $, which means there is a probability $\gamma $ of continuing (and $1-\gamma $ of stopping) at each stage. Is tit-for-tat a Nash equilibrium for all values of $\gamma $? If so, prove it. If not, for which values of $\gamma $ is it a Nash equilibrium?
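To build intuition for the threshold, expected payoffs against tit-for-tat can be written in closed form, treating $\gamma$ as the per-stage continuation probability. The prisoner's dilemma payoffs below are assumed for illustration and are not necessarily the book's numbers.

```python
# Assumed payoffs: both cooperate -> 3 each; defect on a cooperator -> 5
# (the sucker gets 0); both defect -> 1 each.
BOTH_C, TEMPT, BOTH_D = 3, 5, 1

def v_cooperate(gamma):
    """Tit-for-tat against tit-for-tat: cooperate every round.  The
    expected number of rounds is the geometric series 1 + g + g^2 + ..."""
    return BOTH_C / (1 - gamma)

def v_always_defect(gamma):
    """Always-defect against tit-for-tat: exploit once, then mutual defection."""
    return TEMPT + gamma * BOTH_D / (1 - gamma)

# Deviating is unprofitable iff v_cooperate(g) >= v_always_defect(g);
# with these payoffs that simplifies to g >= 1/2.
for g in (0.3, 0.5, 0.7):
    print(g, v_cooperate(g) >= v_always_defect(g))
```

This only checks one deviation (always-defect); a full equilibrium argument must show no deviation from tit-for-tat is profitable.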
Consider the following alternative ways to update the probability $P$ in the stochastic policy iteration algorithm of Figure 14.10.
Make more recent experiences have more weight by multiplying the counts in $P$ by $(1-\beta )$, for small $\beta $ (such as 0.01), before adding 1 to the best action.
Add some small value, $\epsilon$ (such as 0.01 or 0.001), to the probability of the best action and subtract values from the other actions to make sure the probabilities are non-negative and sum to 1.
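The two alternatives can be sketched as small update functions. Figure 14.10's actual representation of $P$ is not reproduced here, so the signatures are assumptions; in (ii), how to subtract from the other actions is left open by the text, and taking the mass evenly is just one choice.

```python
def update_counts(counts, best, beta=0.01):
    """Variant (i): decay every count by (1 - beta), then add 1 to the
    best action's count, so recent experience carries more weight."""
    counts = [(1 - beta) * n for n in counts]
    counts[best] += 1
    return counts

def update_probs(probs, best, eps=0.01):
    """Variant (ii): move eps of probability mass onto the best action,
    taking it evenly from the others and clipping at zero so the result
    stays a probability distribution."""
    n = len(probs)
    probs = [max(0.0, p - eps / (n - 1)) if i != best else p
             for i, p in enumerate(probs)]
    probs[best] = 1.0 - sum(probs[i] for i in range(n) if i != best)
    return probs

# Example: shift mass toward action 0.
print(update_probs([0.5, 0.3, 0.2], 0))   # roughly [0.51, 0.295, 0.195]
```

Either update can be dropped into the algorithm in place of its original probability update when running the comparisons below.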
Which of the original, (i), or (ii) has the best payoffs for the game of Example 14.15, where there is a unique Nash equilibrium but another strategy profile has a better payoff for both agents?
Which one has the best payoffs in the penalty kick game of Example 14.9 when played against the others?
Which of the original and the alternatives, if any, converge to a Nash equilibrium in the strategies played (averaging over all of the actions played)? (Do this experimentally, creating hypotheses from your observations, and then try to prove your hypotheses.)
The stochastic policy iteration algorithm of Figure 14.10 is based on SARSA. How could it be modified to be off-policy, as in $Q$-learning? [Hint: $Q$-learning updates using the best next action, SARSA updates using the action chosen by the policy, and stochastic policy iteration updates using the expected value.] Does this work better for the example domains?
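The hint contrasts three backup targets, which can be written side by side (the helper names are hypothetical; `Q_next` maps next-state actions to estimated values):

```python
def q_learning_target(Q_next, reward, discount):
    """Q-learning (off-policy): back up the value of the best next action."""
    return reward + discount * max(Q_next.values())

def sarsa_target(Q_next, reward, discount, next_action):
    """SARSA (on-policy): back up the value of the action actually taken."""
    return reward + discount * Q_next[next_action]

def expected_target(Q_next, reward, discount, policy):
    """Stochastic policy iteration, per the hint: back up the expected
    value of the next action under the current stochastic policy."""
    return reward + discount * sum(policy[a] * q for a, q in Q_next.items())

Q_next = {"a": 1.0, "b": 3.0}
policy = {"a": 0.25, "b": 0.75}
t_q = q_learning_target(Q_next, 0.0, 0.9)        # 0.9 * 3.0
t_e = expected_target(Q_next, 0.0, 0.9, policy)  # 0.9 * (0.25*1 + 0.75*3)
```

Swapping the expected-value target for the max target is one way to read the off-policy modification the exercise asks about.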