foundations of computational agents
In an imperfectinformation game, or a partially observable game, an agent does not fully know the state of the world or the agents act simultaneously.
Partial observability for the multiagent case is more complicated than the fully observable multiagent case or the partially observable singleagent case. The following simple examples show some important issues that arise even in the case of two agents, each with a few choices.
Consider the case of a penalty kick in soccer as depicted in Figure 11.7.

If the kicker kicks to his right and the goalkeeper jumps to her right, the probability of a goal is ${\mathrm{0.9}}$, and similarly for the other combinations of actions, as given in the figure.
What should the kicker do, given that he wants to maximize the probability of a goal and that the goalkeeper wants to minimize the probability of a goal? The kicker could think that it is better kicking to his right, because the pair of numbers for his right kick is higher than the pair for the left. The goalkeeper could then think that if the kicker will kick right, then she should jump left. However, if the kicker thinks that the goalkeeper will jump left, he should then kick left. But then, the goalkeeper should jump right. Then the kicker should kick right…
Each agent is potentially faced with an infinite regression of reasoning about what the other agent will do. At each stage in their reasoning, the agents reverse their decision. One could imagine cutting this off at some depth; however, the actions then are purely a function of the arbitrary depth. Even worse, if the goalkeeper knew the depth limit of reasoning for the kicker, she could exploit this knowledge to determine what the kicker will do and choose her action appropriately.
An alternative is for the agents to choose actions stochastically. Imagine that the kicker and the goalkeeper each secretly toss a coin to decide what to do. Consider whether the coins should be biased. Suppose that the kicker decides to kick to his right with probability ${{p}}_{{k}}$ and that the goalkeeper decides to jump to her right with probability ${{p}}_{{g}}$. The probability of a goal is then
$${P}{}{(}{g}{}{o}{}{a}{}{l}{)}{=}{0.9}{}{{p}}_{{k}}{}{{p}}_{{g}}{+}{0.3}{}{{p}}_{{k}}{}{(}{1}{}{{p}}_{{g}}{)}{+}{0.2}{}{(}{1}{}{{p}}_{{k}}{)}{}{{p}}_{{g}}{+}{0.6}{}{(}{1}{}{{p}}_{{k}}{)}{}{(}{1}{}{{p}}_{{g}}{)}$$ 
where the numbers (0.9, 0.3, etc.) come from Figure 11.7.
Figure 11.8 shows the probability of a goal as a function of ${{p}}_{{k}}$. The different lines correspond to different values of ${{p}}_{{g}}$.
There is something special about the value ${{p}}_{{k}}{\mathrm{=}}{\mathrm{0.4}}$. At this value, the probability of a goal is ${\mathrm{0.48}}$, independent of the value of ${{p}}_{{g}}$. That is, no matter what the goalkeeper does, the kicker expects to get a goal with probability ${\mathrm{0.48}}$. If the kicker deviates from ${{p}}_{{k}}{\mathrm{=}}{\mathrm{0.4}}$, he could do better or he could do worse, depending on what the goalkeeper does.
The plot for ${{p}}_{{g}}$ is similar, with all of the lines meeting at ${{p}}_{{g}}{\mathrm{=}}{\mathrm{0.3}}$. Again, when ${{p}}_{{g}}{\mathrm{=}}{\mathrm{0.3}}$, the probability of a goal is ${\mathrm{0.48}}$.
The strategy with ${{p}}_{{k}}{\mathrm{=}}{\mathrm{0.4}}$ and ${{p}}_{{g}}{\mathrm{=}}{\mathrm{0.3}}$ is special in the sense that neither agent can do better by unilaterally deviating from the strategy. However, this does not mean that they cannot do better; if one of the agents deviates from this equilibrium, the other agent could do better by also deviating from the equilibrium. However, this equilibrium is safe for an agent in the sense that, even if the other agent knew the agent’s strategy, the other agent cannot force a worse outcome for the agent. Playing this strategy means that an agent does not have to worry about doubleguessing the other agent. In this game, each agent will get the best payoff it could guarantee to obtain.
Dimensions: flat, states, finite horizon, partially observable, stochastic, utility, nonlearning, multiple agents, offline, perfect rationality
So let us now extend the definition of a strategy to include randomized strategies.
Consider the normal form of a game where each agent chooses an action simultaneously. Each agent chooses an action without knowing what the other agents choose.
A strategy for an agent is a probability distribution over the actions for this agent. In a pure strategy one of the probabilities will be 1 and the rest will be 0. Thus an agent following a pure strategy is acting deterministically. The alternative to a pure strategy is a stochastic strategy where none of the probabilities will be 1, and so more than one action will have a nonzero probability. The set of actions with a nonzero probability in a strategy is called the support set of the strategy.
A strategy profile is an assignment of a strategy to each agent. If $\sigma $ is a strategy profile, let ${\sigma}_{i}$ be the strategy of agent $i$ in $\sigma $, and let ${\sigma}_{i}$ be the strategies of the other agents. Then $\sigma $ is ${\sigma}_{i}{\sigma}_{i}$. If the strategy profile is made up of pure strategies, it is often called an action profile, because each agent is playing a particular action.
A strategy profile $\sigma $ has an expected utility for each agent. Let $utility(\sigma ,i)$ be the expected utility of strategy profile $\sigma $ for agent $i$. The utility of a stochastic strategy profile can be computed by averaging the utilities of the basic actions that make up the profile given the probabilities of the actions.
A best response for an agent $i$ to the strategies ${\sigma}_{i}$ of the other agents is a strategy that has maximal utility for that agent. That is, ${\sigma}_{i}$ is a best response to ${\sigma}_{i}$ if, for all other strategies ${\sigma}_{i}^{\prime}$ for agent $i$,
$$utility({\sigma}_{i}{\sigma}_{i},i)\ge utility({\sigma}_{i}^{\prime}{\sigma}_{i},i).$$ 
A strategy profile $\sigma $ is a Nash equilibrium if, for each agent $i$, strategy ${\sigma}_{i}$ is a best response to ${\sigma}_{i}$. That is, a Nash equilibrium is a strategy profile such that no agent can do better by unilaterally deviating from that profile.
One of the great results of game theory, proved by Nash [1950], is that every finite game has at least one Nash equilibrium.
In Example 11.9, there is a unique Nash equilibrium where ${{p}}_{{k}}{\mathrm{=}}{\mathrm{0.4}}$ and ${{p}}_{{g}}{\mathrm{=}}{\mathrm{0.3}}$. This has the property that, if the kicker is playing ${{p}}_{{k}}{\mathrm{=}}{\mathrm{0.4}}$, it does not matter what the goalkeeper does; the goalkeeper will have the same payoff, and so ${{p}}_{{g}}{\mathrm{=}}{\mathrm{0.3}}$ is a best response (as is any other strategy). Similarly, if the goalkeeper is playing ${{p}}_{{g}}{\mathrm{=}}{\mathrm{0.3}}$, it does not matter what the kicker does; and so every strategy, including ${{p}}_{{k}}{\mathrm{=}}{\mathrm{0.4}}$, is a best response.
The only reason an agent would consider randomizing between two actions is if the actions have the same expected utility. All probabilistic mixtures of the two actions have the same utility. The reason to choose a particular value for the probability of the mixture is to prevent the other agent from exploiting a deviation.
Games can have multiple Nash equilibria. Consider the following twoagent, twoaction game.
Suppose there is a resource that two agents may want to fight over. Each agent chooses to act as a hawk or as a dove. Suppose the resource is worth ${R}$ units, where ${R}{\mathrm{>}}{\mathrm{0}}$. If both agents act as doves, they share the resource. If one agent acts as a hawk and the other as a dove, the hawk agent gets the resource and the dove agent gets nothing. If they both act like hawks, there is destruction of the resource and the reward to both is ${\mathrm{}}{D}$, where ${D}{\mathrm{>}}{\mathrm{0}}$. This is depicted by the following payoff matrix:
Agent 2  

dove  hawk  
Agent 1  dove  ${R}{/}{2}{,}{R}{/}{2}$  ${0}{,}{R}$ 
hawk  ${R}{,}{0}$  ${}{D}{,}{}{D}$ 
In this matrix, Agent 1 gets to choose the row, Agent 2 gets to choose the column, and the payoff in the cell is a pair consisting of the reward to Agent 1 and the reward to Agent 2. Each agent is trying to maximize its own reward.
In this game there are three Nash equilibria:
In one equilibrium, Agent 1 acts as a hawk and Agent 2 as a dove. Agent 1 does not want to deviate because then they have to share the resource. Agent 2 does not want to deviate because then there is destruction.
In the second equilibrium, Agent 1 acts as a dove and Agent 2 as a hawk.
In the third equilibrium, both agents act stochastically. In this equilibrium, there is some chance of destruction. The probability of acting like a hawk goes up with the value ${R}$ of the resource and goes down as the value ${D}$ of destruction increases. See Exercise 2.
In this example, you could imagine each agent doing some posturing to try to indicate what it will do to try to force an equilibrium that is advantageous to it.
Having multiple Nash equilibria does not come from being adversaries, as the following example shows.
Suppose there are two people who want to be together. Agent 1 prefers they both go to the football game and Agent 2 prefers they both go shopping. They both would be unhappy if they are not together. Suppose they both have to choose simultaneously what activity to do. This is depicted by the following payoff matrix:
Agent 2  

football  shopping  
Agent 1  football  ${2}{,}{1}$  ${0}{,}{0}$ 
shopping  ${0}{,}{0}$  ${1}{,}{2}$ 
In this matrix, Agent 1 chooses the row, and Agent 2 chooses the column.
In this game, there are three Nash equilibria. One equilibrium is where they both go shopping, one is where they both go to the football game, and one is a randomized strategy.
This is a coordination problem. Knowing the set of equilibria does not actually tell either agent what to do, because what an agent should do depends on what the other agent will do. In this example, you could imagine conversations to determine which equilibrium they would choose.
Even when there is a unique Nash equilibrium, that Nash equilibrium does not guarantee the maximum payoff to each agent. The following example is a variant of what is known as the prisoner’s dilemma.
Imagine you are on a game show with a stranger who you will never see again. You each have the choice of
taking $100 for yourself or
giving $1000 to the other person.
This is depicted as the following payoff matrix:
Player 2  

take  give  
Player 1  take  ${100}{,}{100}$  ${1100}{,}{0}$ 
give  ${0}{,}{1100}$  ${1000}{,}{1000}$ 
No matter what the other agent does, each agent is better off if it takes rather than gives. However, both agents are better off if they both give rather than if they both take.
Thus, there is a unique Nash equilibrium, where both agents take. This strategy profile results in each player receiving $100. The strategy profile where both players give results in each player receiving $1000. However, in this strategy profile, each agent is rewarded for deviating.
There is a large body of research on the prisoner’s dilemma, because it does not seem to be so rational to be greedy, where each agent tries to do the best for itself, resulting in everyone being worse off. One case where giving becomes preferred is when the game is played a number of times. This is known as the sequential prisoner’s dilemma. One strategy for the sequential prisoner’s dilemma is titfortat: each player gives initially, then does the other agent’s previous action at each step. This strategy is a Nash equilibrium as long as there is no last action that both players know about. See Exercise 7.
Having multiple Nash equilibria arises not just from partial observability. It is possible to have multiple equilibria with a perfectinformation game, and it is even possible to have infinitely many Nash equilibria, as the following example shows.
Consider the sharing game of Example 11.2. In this game there are infinitely many Nash equilibria. There is a set of equilibria where Andy shares, and Barb says ${y}{\mathit{}}{e}{\mathit{}}{s}$ to sharing for the center choice and can randomize between the other choices, as long as the probability of saying ${y}{\mathit{}}{e}{\mathit{}}{s}$ in the lefthand choice is less than or equal to 0.5. In these Nash equilibria, they both get 1. There is another set of Nash equilibria where Andy keeps, and Barb randomizes among her choices so that the probability of saying ${y}{\mathit{}}{e}{\mathit{}}{s}$ in the left branch is greater than or equal to 0.5. In these equilibria, Barb gets 0, and Andy gets some value in range ${\mathrm{[}}{\mathrm{1}}{\mathrm{,}}{\mathrm{2}}{\mathrm{]}}$ depending on Barb’s probability. There is a third set of Nash equilibria where Barb has a 0.5 probability of selecting ${y}{\mathit{}}{e}{\mathit{}}{s}$ at the leftmost node, selects ${y}{\mathit{}}{e}{\mathit{}}{s}$ at the center node, and Andy randomizes between ${k}{\mathit{}}{e}{\mathit{}}{e}{\mathit{}}{p}$ and ${s}{\mathit{}}{h}{\mathit{}}{a}{\mathit{}}{r}{\mathit{}}{e}$ with any probability.
Suppose the sharing game were modified slightly so that Andy offered a small bribe for Barb to say ${y}{\mathit{}}{e}{\mathit{}}{s}$. This could be done by changing the ${\mathrm{2}}{\mathrm{,}}{\mathrm{0}}$ payoff to be ${\mathrm{1.9}}{\mathrm{,}}{\mathrm{0.1}}$. Andy may think, “Given the choice between getting 0.1 or 0, Barb will choose to get 0.1, so then I should ${k}{\mathit{}}{e}{\mathit{}}{e}{\mathit{}}{p}$.” But Barb could think, “I should say ${n}{\mathit{}}{o}$ to 0.1, so that Andy shares and I get 1.” In this example (even ignoring the rightmost branch), there are multiple pure Nash equilibria, one where Andy keeps, and Barb says ${y}{\mathit{}}{e}{\mathit{}}{s}$ at the leftmost branch. In this equilibrium, Andy gets 1.9 and Barb gets 0.1. There is another Nash equilibrium where Barb says ${n}{\mathit{}}{o}$ at the leftmost choice node and ${y}{\mathit{}}{e}{\mathit{}}{s}$ at the center branch and Andy chooses ${s}{\mathit{}}{h}{\mathit{}}{a}{\mathit{}}{r}{\mathit{}}{e}$. In this equilibrium, they both get 1. It would seem that this is the one preferred by Barb. However, Andy could think that Barb is making an empty threat. If he actually decided to ${k}{\mathit{}}{e}{\mathit{}}{e}{\mathit{}}{p}$, Barb, acting to maximize her utility, would not actually say ${n}{\mathit{}}{o}$.
The backward induction algorithm only finds one of the equilibria in the modified sharing game of the previous example. It computes a subgameperfect equilibrium, where it is assumed that the agents choose the action with greatest utility for them at every node where they get to choose. It assumes that agents do not carry out threats that it is not in their interest to carry out at the time. In the modified sharing game of the previous example, it assumes that Barb will say $yes$ to the small bribe. However, when dealing with real opponents, we must be aware that they may follow through with threats that we may not think rational. Indeed, it might be better for an agent not to (appear to) be rational!