foundations of computational agents
In an imperfect-information game, or a partially observable game, an agent does not fully know the state of the world, or the agents act simultaneously.
Partial observability for the multiagent case is more complicated than the fully observable multiagent case or the partially observable single-agent case. The following simple examples show some important issues that arise even in the case of two agents, each with a few choices.
Consider the case of a penalty kick in soccer as depicted in Figure 14.8.

If the kicker kicks to their right and the goalkeeper jumps to their right, the probability of a goal is $0.9$, and similarly for the other combinations of actions, as given in the figure.
What should the kicker do, given that they want to maximize the probability of a goal and the goalkeeper wants to minimize the probability of a goal? The kicker could think that it is better to kick to their right, because the pair of numbers for their right kick is higher than the pair for the left. The goalkeeper could then think that if the kicker will kick right, they should jump left. However, if the kicker thinks that the goalkeeper will jump left, they should then kick left. But then, the goalkeeper should jump right. Then the kicker should kick right…
Each agent is potentially faced with an infinite regression of reasoning about what the other agent will do. At each stage in their reasoning, the agents reverse their decision. One could imagine cutting this off at some depth; however, the actions then are purely a function of the arbitrarily chosen depth. Even worse, if the goalkeeper knew the depth limit of reasoning for the kicker, they could exploit this knowledge to determine what the kicker will do and choose their action appropriately.
An alternative is for the agents to choose actions stochastically. Imagine that the kicker and the goalkeeper each secretly toss a coin to decide what to do. Consider whether the coins should be biased. Suppose that the kicker decides to kick to their right with probability ${p}_{k}$ and that the goalkeeper decides to jump to their right with probability ${p}_{g}$. The probability of a goal is then
$$P(goal)=0.9{p}_{k}{p}_{g}+0.3{p}_{k}(1-{p}_{g})+0.2(1-{p}_{k}){p}_{g}+0.6(1-{p}_{k})(1-{p}_{g})$$ 
where the numbers (0.9, 0.3, etc.) come from Figure 14.8.
Figure 14.9 shows the probability of a goal as a function of ${p}_{k}$. The different lines correspond to different values of ${p}_{g}$.
There is something special about the value ${p}_{k}=0.4$. At this value, the probability of a goal is $0.48$, independent of the value of ${p}_{g}$. That is, no matter what the goalkeeper does, the kicker expects to get a goal with probability $0.48$. If the kicker deviates from ${p}_{k}=0.4$, they could do better or could do worse, depending on what the goalkeeper does.
The plot for ${p}_{g}$ is similar, with all of the lines meeting at ${p}_{g}=0.3$. Again, when ${p}_{g}=0.3$, the probability of a goal is $0.48$.
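This independence at ${p}_{k}=0.4$ can be verified numerically; a minimal sketch (the function name is ours):

```python
def p_goal(pk, pg):
    """Probability of a goal (numbers from Figure 14.8) when the kicker
    kicks to their right with probability pk and the goalkeeper jumps
    to their right with probability pg."""
    return (0.9 * pk * pg + 0.3 * pk * (1 - pg)
            + 0.2 * (1 - pk) * pg + 0.6 * (1 - pk) * (1 - pg))

# At pk = 0.4, every line in Figure 14.9 meets: the goal probability
# is 0.48 no matter what the goalkeeper does.
for pg in (0.0, 0.25, 0.5, 0.75, 1.0):
    assert abs(p_goal(0.4, pg) - 0.48) < 1e-9
```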
The strategy with ${p}_{k}=0.4$ and ${p}_{g}=0.3$ is special in the sense that neither agent can do better by unilaterally deviating from the strategy. However, this does not mean that they cannot do better; if one of the agents deviates from this equilibrium, the other agent could do better by also deviating from the equilibrium. However, this equilibrium is safe for an agent in the sense that, even if the other agent knew the agent’s strategy, the other agent cannot force a worse outcome for the agent. Playing this strategy means that an agent does not have to worry about double-guessing the other agent. In this game, each agent will get the best payoff it could guarantee to obtain.
So let us now extend the definition of a strategy to include randomized strategies.
Consider the normal form of a game where each agent chooses an action simultaneously. Each agent chooses an action without knowing what the other agents choose.
A strategy for an agent is a probability distribution over the actions for this agent. In a pure strategy, one of the probabilities will be 1 and the rest will be 0. Thus, an agent following a pure strategy is acting deterministically. The alternative to a pure strategy is a stochastic strategy, where none of the probabilities are 1, and so more than one action has a nonzero probability. The set of actions with a nonzero probability in a strategy is the support set of the strategy.
A strategy profile is an assignment of a strategy to each agent. If $\sigma $ is a strategy profile, let ${\sigma}_{i}$ be the strategy of agent $i$ in $\sigma $, and let ${\sigma}_{-i}$ be the strategies of the other agents. Then $\sigma $ is ${\sigma}_{i}{\sigma}_{-i}$. If the strategy profile is made up of pure strategies, it is often called an action profile, because each agent is playing a particular action.
A strategy profile $\sigma $ has an expected utility for each agent. Let $utility(\sigma ,i)$ be the expected utility of strategy profile $\sigma $ for agent $i$. The utility of a stochastic strategy profile can be computed by averaging the utilities of the basic actions that make up the profile given the probabilities of the actions.
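This averaging can be written out directly. A sketch of such a computation, assuming strategies and payoffs are stored as dictionaries (a representation we chose for illustration):

```python
from itertools import product

def utility(strategies, payoff, agent):
    """Expected utility for `agent` of a stochastic strategy profile.
    `strategies` maps each agent to {action: probability}; `payoff` maps
    an action profile (a tuple with one action per agent, in sorted
    agent order) to a tuple of utilities in the same order."""
    agents = sorted(strategies)
    total = 0.0
    for profile in product(*(strategies[a] for a in agents)):
        prob = 1.0   # probability the agents jointly choose this profile
        for a, action in zip(agents, profile):
            prob *= strategies[a][action]
        total += prob * payoff[profile][agents.index(agent)]
    return total
```

For the penalty kick, taking the kicker's utility to be the goal probability and the goalkeeper's to be its negation, this reproduces the value 0.48 at the strategies of Example 14.9.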
A best response for an agent $i$ to the strategies ${\sigma}_{-i}$ of the other agents is a strategy that has maximal utility for that agent. That is, ${\sigma}_{i}$ is a best response to ${\sigma}_{-i}$ if, for all other strategies ${\sigma}_{i}^{\prime}$ for agent $i$:
$$utility({\sigma}_{i}{\sigma}_{-i},i)\ge utility({\sigma}_{i}^{\prime}{\sigma}_{-i},i).$$ 
A strategy profile $\sigma $ is a Nash equilibrium if, for each agent $i$, strategy ${\sigma}_{i}$ is a best response to ${\sigma}_{-i}$. That is, a Nash equilibrium is a strategy profile such that no agent can do better by unilaterally deviating from that profile.
One of the great results of game theory, proved by Nash [1950], is that every finite game has at least one Nash equilibrium.
In Example 14.9, there is a unique Nash equilibrium where ${p}_{k}=0.4$ and ${p}_{g}=0.3$. This has the property that, if the kicker is playing ${p}_{k}=0.4$, it does not matter what the goalkeeper does; the goalkeeper will have the same payoff, and so ${p}_{g}=0.3$ is a best response (as is any other strategy). Similarly, if the goalkeeper is playing ${p}_{g}=0.3$, it does not matter what the kicker does; and so every strategy, including ${p}_{k}=0.4$, is a best response.
The only reason an agent would consider randomizing between two actions is if the actions have the same expected utility. All probabilistic mixtures of the two actions have the same utility. The reason to choose a particular value for the probability of the mixture is to prevent the other agent from exploiting a deviation.
Games can have multiple Nash equilibria. Consider the following two-agent, two-action game.
Suppose there is a resource that two agents may want to fight over. Each agent chooses to act as a hawk or as a dove. Suppose the resource is worth $R$ units, where $R>0$. If both agents act as doves, they share the resource. If one agent acts as a hawk and the other as a dove, the hawk agent gets the resource and the dove agent gets nothing. If they both act like hawks, there is destruction of the resource and the reward to both is $-D$, where $D>0$. This is depicted by the following payoff matrix:
Agent 2  
dove  hawk  
Agent 1  dove  $R/2,R/2$  $0,R$ 
hawk  $R,0$  $-D,-D$ 
In this matrix, Agent 1 gets to choose the row, Agent 2 gets to choose the column, and the payoff in the cell is a pair consisting of the reward to Agent 1 and the reward to Agent 2. Each agent is trying to maximize its own reward.
In this game there are three Nash equilibria:
In one equilibrium, Agent 1 acts as a hawk and Agent 2 as a dove. Agent 1 does not want to deviate because then they have to share the resource. Agent 2 does not want to deviate because then there is destruction.
In the second equilibrium, Agent 1 acts as a dove and Agent 2 as a hawk.
In the third equilibrium, both agents act stochastically. In this equilibrium, there is some chance of destruction. The probability of acting like a hawk goes up with the value $R$ of the resource and goes down as the value $D$ of destruction increases. See Exercise 14.3.
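The indifference condition for this game can be solved in closed form: taking the hawk/hawk payoff to be $-D$, the symmetric equilibrium plays hawk with probability $R/(R+2D)$. A sketch checking this (function names are ours):

```python
def hawk_prob(R, D):
    """Probability of playing hawk in the symmetric mixed equilibrium,
    derived by hand from the payoff matrix above (R > 0 is the resource
    value; -D < 0 is each agent's payoff when both play hawk)."""
    return R / (R + 2 * D)

def expected_payoff(my_action, opp_hawk_prob, R, D):
    """Expected payoff of a pure action against an opponent who plays
    hawk with probability opp_hawk_prob."""
    q = opp_hawk_prob
    if my_action == 'hawk':
        return q * (-D) + (1 - q) * R
    return (1 - q) * (R / 2)

R, D = 10.0, 15.0
q = hawk_prob(R, D)   # 0.25
# At the equilibrium mixture, hawk and dove are equally good, so any
# mixture of them is a best response.
assert abs(expected_payoff('hawk', q, R, D)
           - expected_payoff('dove', q, R, D)) < 1e-9
```

As the text notes, this probability goes up with $R$ and down with $D$.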
In this example, you could imagine each agent doing some posturing to try to indicate what it will do to try to force an equilibrium that is advantageous to it.
Having multiple Nash equilibria does not come from being adversaries, as the following example shows.
Suppose there are two people who want to be together. Agent 1 prefers they both go to the football game and Agent 2 prefers they both go shopping. They both would be unhappy if they are not together. Suppose they both have to choose simultaneously what activity to do. This is depicted by the following payoff matrix:
Agent 2  
football  shopping  
Agent 1  football  $2,1$  $0,0$ 
shopping  $0,0$  $1,2$ 
In this matrix, Agent 1 chooses the row, and Agent 2 chooses the column.
In this game, there are three Nash equilibria. One equilibrium is where they both go shopping, one is where they both go to the football game, and one is a randomized strategy.
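Solving the indifference conditions by hand for the randomized equilibrium gives Agent 1 choosing football with probability $2/3$ and Agent 2 with probability $1/3$; the check below verifies the indifference (variable names are ours):

```python
# Payoffs from the matrix above, indexed by (Agent 1's action, Agent 2's action).
payoff = {('football', 'football'): (2, 1), ('football', 'shopping'): (0, 0),
          ('shopping', 'football'): (0, 0), ('shopping', 'shopping'): (1, 2)}

p1, p2 = 2 / 3, 1 / 3   # P(football) for Agent 1 and for Agent 2

# Agent 1 is indifferent between its two actions against Agent 2's mixture ...
u1 = {a1: sum(prob * payoff[(a1, a2)][0]
              for a2, prob in (('football', p2), ('shopping', 1 - p2)))
      for a1 in ('football', 'shopping')}
assert abs(u1['football'] - u1['shopping']) < 1e-9

# ... and Agent 2 is indifferent against Agent 1's mixture.
u2 = {a2: sum(prob * payoff[(a1, a2)][1]
              for a1, prob in (('football', p1), ('shopping', 1 - p1)))
      for a2 in ('football', 'shopping')}
assert abs(u2['football'] - u2['shopping']) < 1e-9
```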
This is a coordination problem. Knowing the set of equilibria does not actually tell either agent what to do, because what an agent should do depends on what the other agent will do. In this example, you could imagine conversations to determine which equilibrium they would choose.
Even when there is a unique Nash equilibrium, that Nash equilibrium does not guarantee the maximum payoff to each agent. The following example is a variant of what is known as the prisoner’s dilemma.
Imagine you are on a game show with a stranger who you will never see again. You each have the choice of
taking $100 for yourself, or
giving $1000 to the other person.
This is depicted as the following payoff matrix:
Player 2  
take  give  
Player 1  take  $100,100$  $1100,0$ 
give  $0,1100$  $1000,1000$ 
No matter what the other agent does, each agent is better off if it takes rather than gives. However, both agents are better off if they both give rather than if they both take.
Thus, there is a unique Nash equilibrium, where both agents take. This strategy profile results in each player receiving $100. The strategy profile where both players give results in each player receiving $1000. However, in this strategy profile, each agent is rewarded for deviating.
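The dominance argument behind this equilibrium can be checked directly; a sketch using one player's rewards (the dictionary layout is ours):

```python
# One agent's rewards from the matrix above:
# payoff[(my_action, other_action)] = my reward.
payoff = {('take', 'take'): 100, ('take', 'give'): 1100,
          ('give', 'take'): 0, ('give', 'give'): 1000}

# Whatever the other player does, taking is strictly better for you ...
for other in ('take', 'give'):
    assert payoff[('take', other)] > payoff[('give', other)]

# ... yet both players do better under mutual giving than mutual taking.
assert payoff[('give', 'give')] > payoff[('take', 'take')]
```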
There is a large body of research on the prisoner’s dilemma, because it does not seem rational to be greedy, where each agent tries to do the best for itself, resulting in everyone being worse off. One case where giving becomes preferred is when the game is played a number of times. This is known as the sequential prisoner’s dilemma. One strategy for the sequential prisoner’s dilemma is tit-for-tat: each player gives initially, then does the other agent’s previous action at each step. This strategy is a Nash equilibrium as long as there is no last action that both players know about. See Exercise 14.8.
Having multiple Nash equilibria arises not just from partial observability. It is possible to have multiple equilibria with a perfect-information game, and it is even possible to have infinitely many Nash equilibria, as the following example shows.
Consider the sharing game of Example 14.2. In this game there are infinitely many Nash equilibria. There is a set of equilibria where Andy shares, and Barb says $yes$ to sharing for the center choice and can randomize between the other choices, as long as the probability of saying $yes$ in the left-hand choice is less than or equal to 0.5. In these Nash equilibria, they both get 1. There is another set of Nash equilibria where Andy keeps, and Barb randomizes among her choices, so that the probability of saying $yes$ in the left branch is greater than or equal to 0.5. In these equilibria, Barb gets 0 and Andy gets some value in the range $[1,2]$ depending on Barb’s probability. There is a third set of Nash equilibria where Barb has a 0.5 probability of selecting $yes$ at the leftmost node, selects $yes$ at the center node, and Andy randomizes between $keep$ and $share$ with any probability.
Suppose the sharing game were modified slightly so that Andy offered a small bribe for Barb to say $yes$. This could be done by changing the $2,0$ payoff to be $1.9,0.1$. Andy may think, “Given the choice between getting 0.1 or 0, Barb will choose to get 0.1, so then I should $keep$.” But Barb could think, “I should say $no$ to 0.1, so that Andy shares and I get 1.” In this example (even ignoring the rightmost branch), there are multiple pure Nash equilibria. One is where Andy keeps and Barb says $yes$ at the leftmost branch. In this equilibrium, Andy gets 1.9 and Barb gets 0.1. There is another Nash equilibrium where Barb says $no$ at the leftmost choice node and $yes$ at the center branch and Andy chooses $share$. In this equilibrium, they both get 1. It would seem that this is the one preferred by Barb. However, Andy could think that Barb is making an empty threat. If he actually decided to $keep$, Barb, acting to maximize her utility, would not actually say $no$.
The backward-induction algorithm only finds one of the equilibria in the modified sharing game of the previous example. It computes a subgame-perfect equilibrium, where it is assumed that the agents choose the action with greatest utility for them at every node where they get to choose. It assumes that agents do not carry out threats that are not in their interest to carry out at the time. In the modified sharing game of the previous example, it assumes that Barb will say $yes$ to the small bribe. However, when dealing with real opponents, you should be aware that they may follow through with threats that you may not think rational. Indeed, it might be better for an agent not to (appear to) be rational!
To compute a Nash equilibrium for a game in normal form, there are three steps:
Eliminate dominated strategies.
Determine the support set, the set of actions that have nonzero probability.
Determine the probability for the actions in the support set.
It turns out that the second of these is the most difficult.
A strategy ${s}_{1}$ for agent $A$ is dominated by strategy ${s}_{2}$ for $A$ if, for every action of the other agents, the utility of ${s}_{1}$ for agent $A$ is lower than the utility of ${s}_{2}$ for agent $A$. This is formalized below. Any pure strategy dominated by another strategy can be eliminated from consideration. The dominating strategy could be a randomized strategy. Removing dominated strategies can be done repeatedly.
Consider the following payoff matrix, where the first agent chooses the row and the second agent chooses the column. In each cell is a pair of payoffs: the payoff for Agent 1 and the payoff for Agent 2. Agent 1 has actions $\{{a}_{1},{b}_{1},{c}_{1}\}$. Agent 2 has actions $\{{d}_{2},{e}_{2},{f}_{2}\}$.
Agent 2  
${d}_{2}$  ${e}_{2}$  ${f}_{2}$  
${a}_{1}$  $3,5$  $5,1$  $1,2$  
Agent 1  ${b}_{1}$  $1,1$  $2,9$  $6,4$ 
${c}_{1}$  $2,6$  $4,7$  $0,8$ 
(Before looking at the solution, try to work out what each agent should do.)
Action ${c}_{1}$ can be removed because it is dominated by action ${a}_{1}$: Agent 1 will never do ${c}_{1}$ if action ${a}_{1}$ is available to it. Notice how the payoff for Agent 1 is greater doing ${a}_{1}$ than doing ${c}_{1}$, no matter what the other agent does.
Once action ${c}_{1}$ is eliminated, action ${f}_{2}$ can be eliminated because it is dominated for Agent 2 by the randomized strategy $0.5\ast {d}_{2}+0.5\ast {e}_{2}$.
Once ${c}_{1}$ and ${f}_{2}$ have been eliminated, ${b}_{1}$ is dominated by ${a}_{1}$, and so Agent 1 will do action ${a}_{1}$. Given that Agent 1 will do ${a}_{1}$, Agent 2 will do ${d}_{2}$. Thus the unique Nash equilibrium has Agent 1 doing ${a}_{1}$ and Agent 2 doing ${d}_{2}$, with a payoff of 3 for Agent 1 and 5 for Agent 2.
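These elimination steps can be checked mechanically against the payoff matrix; a sketch (the dictionary layout is ours):

```python
# Payoff matrix from the example: payoffs[(row, col)] = (Agent 1, Agent 2).
payoffs = {('a1', 'd2'): (3, 5), ('a1', 'e2'): (5, 1), ('a1', 'f2'): (1, 2),
           ('b1', 'd2'): (1, 1), ('b1', 'e2'): (2, 9), ('b1', 'f2'): (6, 4),
           ('c1', 'd2'): (2, 6), ('c1', 'e2'): (4, 7), ('c1', 'f2'): (0, 8)}

# Step 1: a1 strictly dominates c1 for Agent 1 (compare row by row).
assert all(payoffs[('a1', c)][0] > payoffs[('c1', c)][0]
           for c in ('d2', 'e2', 'f2'))

# Step 2: with c1 gone, the mixture 0.5*d2 + 0.5*e2 strictly dominates f2
# for Agent 2 on the remaining rows.
for r in ('a1', 'b1'):
    mix = 0.5 * payoffs[(r, 'd2')][1] + 0.5 * payoffs[(r, 'e2')][1]
    assert mix > payoffs[(r, 'f2')][1]

# Step 3: with f2 gone, a1 strictly dominates b1, and against a1
# Agent 2 prefers d2, so (a1, d2) is the unique Nash equilibrium.
assert all(payoffs[('a1', c)][0] > payoffs[('b1', c)][0] for c in ('d2', 'e2'))
assert payoffs[('a1', 'd2')][1] > payoffs[('a1', 'e2')][1]
```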
Strategy ${s}_{1}$ strictly dominates strategy ${s}_{2}$ for Agent $i$ if, for all action profiles ${\sigma}_{-i}$ of the other agents:
$$utility({s}_{1}{\sigma}_{-i},i)>utility({s}_{2}{\sigma}_{-i},i)$$ 
in which case ${s}_{2}$ is strictly dominated by ${s}_{1}$. If ${s}_{2}$ is a pure strategy that is strictly dominated by some strategy ${s}_{1}$, then ${s}_{2}$ can never be in the support set of any Nash equilibrium. This holds even if ${s}_{1}$ is a stochastic strategy. Repeated elimination of strictly dominated strategies gives the same result, regardless of the order in which the strictly dominated strategies are removed.
There are also weaker notions of domination, where the greater than symbol in the preceding formula is replaced by greater than or equal. If the weaker notion of domination is used, there is always a Nash equilibrium with support of the nondominated strategies. However, some Nash equilibria may be lost. Which equilibria are lost can depend on the order in which the dominated strategies are removed.
An agent will only randomize among actions if the actions all have the same utility to the agent, given the strategies of the other agents. This idea leads to a set of constraints that can be solved to compute a Nash equilibrium. If these constraints can be solved with numbers in the range $(0,1)$, and the mixed strategies computed for each agent are not dominated by another strategy for the agent, then this strategy profile is a Nash equilibrium.
Recall that a support set is a set of pure strategies that each have nonzero probability in a Nash equilibrium.
Once dominated strategies have been eliminated, an agent can search over support sets to determine whether the support sets form a Nash equilibrium. Note that, if there are $n$ actions available to an agent, there are ${2}^{n}-1$ nonempty subsets, and we have to search over combinations of support sets for the various agents. This is not feasible unless there are only a few nondominated actions or there are Nash equilibria with small support sets. To find simple (in terms of the number of actions in the support set) equilibria, an agent can search from smaller support sets to larger sets.
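The size of this search can be made concrete. The sketch below only enumerates candidate support-set pairs for two agents, smallest first; checking the indifference constraints for each pair is the remaining, harder step (function name is ours):

```python
from itertools import combinations, product

def candidate_supports(actions1, actions2):
    """Enumerate all pairs of nonempty support sets for a two-player
    game, smallest total size first. With n actions per agent there are
    2**n - 1 nonempty subsets each, so this blows up quickly."""
    subsets1 = [set(s) for k in range(1, len(actions1) + 1)
                for s in combinations(actions1, k)]
    subsets2 = [set(s) for k in range(1, len(actions2) + 1)
                for s in combinations(actions2, k)]
    return sorted(product(subsets1, subsets2),
                  key=lambda pair: len(pair[0]) + len(pair[1]))

pairs = candidate_supports(['left', 'right'], ['left', 'right'])
assert len(pairs) == 9   # (2**2 - 1) * (2**2 - 1)
```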
Suppose agent $i$ is randomizing actions ${a}_{i}^{1},\mathrm{\dots},{a}_{i}^{{k}_{i}}$ in a Nash equilibrium, each with a nonzero probability. Let ${p}_{i}^{j}$ be the probability that agent $i$ does action ${a}_{i}^{j}$. Let ${\sigma}_{-i}$ be the strategies for the other agents, which is a function of their probabilities. The fact that this is a Nash equilibrium gives the following constraints: ${p}_{i}^{j}>0$, ${\sum}_{j=1}^{{k}_{i}}{p}_{i}^{j}=1$, and, for all $j,{j}^{\prime}$
$$utility({a}_{i}^{j}{\sigma}_{-i},i)=utility({a}_{i}^{{j}^{\prime}}{\sigma}_{-i},i).$$ 
You also require that the utility of doing ${a}_{i}^{j}$ is not less than the utility of doing an action outside of the support set. Thus, for all ${a}^{\prime}\notin \{{a}_{i}^{1},\mathrm{\dots},{a}_{i}^{{k}_{i}}\}$:
$$utility({a}_{i}^{j}{\sigma}_{-i},i)\ge utility({a}^{\prime}{\sigma}_{-i},i).$$ 
In Example 14.9, suppose the goalkeeper jumps right with probability ${p}_{g}$ and the kicker kicks right with probability ${p}_{k}$.
If the goalkeeper jumps right, the probability of a goal is
$$0.9{p}_{k}+0.2(1-{p}_{k}).$$ 
If the goalkeeper jumps left, the probability of a goal is
$$0.3{p}_{k}+0.6(1-{p}_{k}).$$ 
The only time the goalkeeper would randomize is if these are equal; that is, if
$$0.9{p}_{k}+0.2(1-{p}_{k})=0.3{p}_{k}+0.6(1-{p}_{k}).$$ 
Solving for ${p}_{k}$ gives ${p}_{k}=0.4$.
Similarly, for the kicker to randomize, the probability of a goal must be the same whether the kicker kicks left or right:
$$0.2{p}_{g}+0.6(1-{p}_{g})=0.9{p}_{g}+0.3(1-{p}_{g}).$$ 
Solving for ${p}_{g}$ gives ${p}_{g}=0.3$.
Thus, the only Nash equilibrium has ${p}_{k}=0.4$ and ${p}_{g}=0.3$.
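Both indifference equations are linear, so they have a closed-form solution for any game with this $2\times 2$ shape; a sketch (function name and parameter labels are ours):

```python
def mixed_equilibrium_2x2(a, b, c, d):
    """Solve the two indifference equations for a 2x2 game where `a` is
    the goal probability for (kick right, jump right), `b` for (right,
    left), `c` for (left, right), and `d` for (left, left).
    Returns (pk, pg), each agent's probability of choosing right."""
    denom = a - b - c + d
    pk = (d - c) / denom   # makes the goalkeeper indifferent
    pg = (d - b) / denom   # makes the kicker indifferent
    return pk, pg

pk, pg = mixed_equilibrium_2x2(0.9, 0.3, 0.2, 0.6)
assert abs(pk - 0.4) < 1e-9 and abs(pg - 0.3) < 1e-9
```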