12.10 Multiagent Reinforcement Learning

12.10.1 Perfect-Information Games

In a perfect-information game, where agents take turns, observe the state of the world before acting, and each act to maximize their own utility, the reinforcement learning algorithms described above can work unchanged. An agent can treat the other agents as part of the environment. This works whether the opponent is playing its optimal strategy or is also learning. The reason is that there is a unique Nash equilibrium value, which is the value for the agent of the current node in the game tree, and the strategy that achieves it is a best response to the other agents.
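As a minimal sketch of treating the opponent as part of the environment, consider ordinary tabular Q-learning on a small take-away game. The game (remove 1 to 3 stones, whoever takes the last stone wins), the function names, the fixed opponent policy, and the hyperparameters are all assumptions made for illustration; none of them come from the text. The learner never models the opponent explicitly: the opponent's reply is folded into the environment step.

```python
import random

ACTIONS = (1, 2, 3)  # stones a player may remove on one turn (assumed game)

def optimal_opponent(stones):
    """Hypothetical fixed opponent: leave a multiple of 4 if possible, else take 1."""
    return stones % 4 or 1

def env_step(stones, action, opponent):
    """Apply the learner's move, then the opponent's reply.

    Returns (next_stones, reward, done) from the learner's point of view.
    """
    stones -= action
    if stones == 0:
        return 0, 1.0, True       # learner took the last stone and wins
    stones -= opponent(stones)
    if stones == 0:
        return 0, -1.0, True      # opponent took the last stone
    return stones, 0.0, False

def q_learn_vs_opponent(opponent, max_stones=9, episodes=5000, alpha=0.5, seed=0):
    """Plain Q-learning (no discounting); the opponent is just environment dynamics."""
    rng = random.Random(seed)
    q = {}  # (stones, action) -> estimated return for the learner
    for _ in range(episodes):
        stones, done = rng.randint(1, max_stones), False
        while not done:
            # Purely random exploration keeps the sketch short.
            action = rng.choice([a for a in ACTIONS if a <= stones])
            nxt, reward, done = env_step(stones, action, opponent)
            future = 0.0 if done else max(
                q.get((nxt, a), 0.0) for a in ACTIONS if a <= nxt)
            old = q.get((stones, action), 0.0)
            q[(stones, action)] = old + alpha * (reward + future - old)
            stones = nxt
    return q

def greedy(q, stones):
    """Action with the highest learned Q-value in the given state."""
    return max((a for a in ACTIONS if a <= stones),
               key=lambda a: q.get((stones, a), 0.0))
```

Calling `q_learn_vs_opponent(optimal_opponent)` and then `greedy(q, stones)` gives the learned move for a pile size. Passing a different opponent function changes only the environment dynamics, not the learning rule.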

If the opponent is not playing its optimal strategy or converging to an optimal strategy, a learning agent could converge to a non-optimal strategy. It is possible for an opponent to train a learning agent to carry out a non-optimal strategy by playing badly, and then for the opponent to change to another strategy in order to exploit the agent’s sub-optimal strategy. However, the learning agent could then learn from the (now) better opponent.

Reinforcement learning can also be used to simulate both players in a game and to learn for both. A two-player, zero-sum, perfect-information game can, as in minimax, be characterized by a single value that one agent is trying to maximize and the other is trying to minimize. In that case, an agent would learn Q(s,a), an estimate of this value for being in state s and carrying out action a. The algorithms remain essentially the same, but need to know which player’s turn it is: the Q-value is updated by maximizing or minimizing over the actions available in the next state, depending on which player moves there.
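The self-play idea can be sketched with a single Q-table that stores the game value from the maximizing player's perspective. As before, the take-away game (remove 1 to 3 stones, last stone wins), the names, and the hyperparameters are illustrative assumptions rather than anything specified in the text. The only change from ordinary Q-learning is that the backup uses max when the maximizer moves next and min when the minimizer moves next.

```python
import random

MOVES = (1, 2, 3)  # stones a player may remove on one turn (assumed game)

def legal(stones):
    return [m for m in MOVES if m <= stones]

def state_value(q, stones, player):
    """Value of a state: player 0 maximizes, player 1 minimizes."""
    vals = [q.get((stones, player, m), 0.0) for m in legal(stones)]
    return max(vals) if player == 0 else min(vals)

def self_play(max_stones=9, episodes=20000, alpha=0.5, seed=0):
    """Learn Q(s, a) for both players at once; values are for player 0."""
    rng = random.Random(seed)
    q = {}  # (stones, player_to_move, move) -> value for player 0
    for _ in range(episodes):
        stones, player = rng.randint(1, max_stones), rng.randint(0, 1)
        while stones > 0:
            move = rng.choice(legal(stones))  # random exploration
            left = stones - move
            if left == 0:
                # The player who takes the last stone wins.
                target = 1.0 if player == 0 else -1.0
            else:
                # Back up with max or min according to who moves next.
                target = state_value(q, left, 1 - player)
            old = q.get((stones, player, move), 0.0)
            q[(stones, player, move)] = old + alpha * (target - old)
            stones, player = left, 1 - player
    return q

def greedy_move(q, stones, player):
    """Best learned move for the player to move (max for 0, min for 1)."""
    pick = max if player == 0 else min
    return pick(legal(stones), key=lambda m: q.get((stones, player, m), 0.0))
```

After `q = self_play()`, `greedy_move(q, stones, player)` returns the learned move for either player, and `state_value(q, stones, player)` gives the learned estimate of the game value at that position. Storing one value per position, rather than one table per player, relies on the zero-sum assumption.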