Multi-Agent Dynamical Systems (2/3)

This second blog post of the multi-agent series is dedicated to solution concepts: how we can characterize and compare different joint policies. We will cover Pareto optimality, the definition and existence of the Nash equilibrium, as well as the minimax theorem in two-agent zero-sum games.

In stark contrast with the single-agent case ($n=1$), calling some joint policy “optimal” in a multi-agent context ($n>1$) is ambiguous at best. In a multi-agent setting, an outside observer considers the whole joint policy; that is, the collection of marginal agent-centric controllers. While each agent tries to maximise its own return or utility, asking said observer (having no a-priori preference over agents) to rank different joint policies is an ill-defined problem.

$\quad$ We will adopt the notations of the first post and some notations from our MDP series. Further, for the sake of simplicity, we assume finite games. That is, the state, action and observation spaces are all discrete and finite: $\vert\mathcal{S}\vert \times \vert\mathcal{A}\vert \times \vert\Omega\vert < \infty$.

Ranking joint policies

Consider a general-sum partially observable stochastic game $\mathcal{M} = (n, \mathcal{S}, \mathcal{A}, \Omega, p, q, \bm{r})$. For the sake of simplicity, we assume here that agents pursue a discounted-reward objective. Formally, this means that the value of some joint policy $\bm{\pi}$ in the eyes of some agent $i$ is: $$ v_i(\bm{\pi}) := \mathbb{E}^{\bm{\pi}}_\nu\left[\sum_{t\geq 1} \lambda^{t-1} r^i(s_t, \bm{a}_t)\right]\; , $$ where $\nu$ is some initial state distribution. The joint value function $\bm{v}(\bm{\pi})$ is simply the collection of the agents' marginal values: $ \bm{v}(\bm{\pi}) := (v_1(\bm{\pi}), \ldots, v_n(\bm{\pi}))\in\mathbb{R}^n\; . $ For two given joint policies $\bm{\pi}_1, \bm{\pi}_2$, there is no a-priori way to establish a total order between the two: $\bm{v}(\bm{\pi}_1)$ and $\bm{v}(\bm{\pi}_2)$ are, in general, not comparable.
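To make this concrete, here is a minimal sketch of joint-value evaluation in a one-shot, two-player matrix game, where the discounted sum reduces to a single expected reward. The payoff tables are hypothetical, chosen purely for illustration.

```python
# Sketch: evaluating the joint value v(pi) in a one-shot, two-player
# matrix game, where the discounted sum reduces to a single expected
# reward. The payoff tables below are hypothetical.
R = [
    [[3, 0], [5, 1]],   # agent 1's rewards, indexed by (a1, a2)
    [[3, 5], [0, 1]],   # agent 2's rewards
]

def joint_value(pi1, pi2, R):
    """Return (v_1, v_2): each agent's expected reward under the
    mixed strategies pi1 and pi2 (probability vectors over actions)."""
    values = []
    for r_i in R:
        v = sum(pi1[a1] * pi2[a2] * r_i[a1][a2]
                for a1 in range(len(pi1)) for a2 in range(len(pi2)))
        values.append(v)
    return tuple(values)

print(joint_value([1.0, 0.0], [0.0, 1.0], R))  # (0.0, 5.0)
print(joint_value([0.5, 0.5], [0.5, 0.5], R))  # (2.25, 2.25)
```

Note how the two deterministic policies above yield incomparable value vectors for the two agents: neither joint policy is unambiguously "better".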

The only non-ambiguous case concerns cooperative games (shared reward), where, by construction, for every joint policy $\bm{\pi}$ one has: $ v_1(\bm{\pi}) = \ldots = v_n(\bm{\pi})\; . $ The problem of ranking policies collapses back to the single-agent case, since one can now look for an optimal policy w.r.t. the coalition’s value.

Pareto Optimality

Even if, in general, we cannot provide a total order over joint values, we can still establish some partial ordering. Concretely, this means that some situations can clearly be preferred over others: for instance, when one joint policy comes with a joint improvement over another. This is called Pareto domination.

$\qquad\qquad\qquad\qquad \;\text{ Let } \bm{\pi}_1 \text{ and }\bm{\pi}_2\text{ be two joint policies. We say that } \bm{\pi}_1\text{ Pareto-dominates } \bm{\pi}_2\text{ if:}$ $$ \forall i \in \{1, \ldots, n\}, \; v_i(\pmb{\pi}_1) \geq v_i(\pmb{\pi}_2)\;, \quad \text{and} \quad \exists i \in \{1, \ldots, n\}, \; v_i(\pmb{\pi}_1) > v_i(\pmb{\pi}_2)\; . $$
Pareto dominance

A Pareto-dominated policy can be jointly improved: some agent’s return can be increased without hindering the return of the others. Any policy that is not Pareto-dominated by another is called Pareto-optimal. For any such policy, the return of an agent can no longer be improved without having another agent see its value drop.
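Pareto dominance is straightforward to check over a finite collection of joint values. A minimal sketch (the joint value vectors below are made up for illustration):

```python
# Sketch: partial ordering of joint values via Pareto dominance.
# The joint values below are made-up vectors in R^2 for illustration.
def pareto_dominates(v1, v2):
    """True if v1 Pareto-dominates v2: no agent is worse off, and at
    least one agent is strictly better off."""
    return (all(a >= b for a, b in zip(v1, v2))
            and any(a > b for a, b in zip(v1, v2)))

def pareto_optimal(values):
    """Keep the joint values not Pareto-dominated by any other."""
    return [v for v in values
            if not any(pareto_dominates(w, v) for w in values)]

vals = [(3, 3), (0, 5), (5, 0), (1, 1)]
print(pareto_optimal(vals))  # (1, 1) is dominated by (3, 3)
```

Observe that three of the four joint values survive: the partial order leaves most pairs incomparable.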

It is easy to check that every finite game admits a Pareto-optimal policy. Often, there will exist several Pareto-optimal policies, not all of them useful or relevant. For instance, in a zero-sum game, every joint policy is Pareto-optimal! (This is a direct result of the reward structure: any gain for one agent is exactly offset by a loss for the other.) At the opposite end of the spectrum lie cooperative games, where all Pareto-optimal policies have the same value (and are therefore completely equivalent). This variability in the interpretation of Pareto-optimality testifies to the concept’s relative weakness. It is useful for trimming out some dull policies, but not much more. Game theorists have therefore somewhat let go of the concept in favour of stronger, agent-centric solution concepts.


Best Responses

A best-response policy is not a solution concept per se, as it only concerns marginal policies. The idea will, however, prove quite useful in introducing the Nash equilibrium, a proper solution concept. To understand what a best response is, let’s fix some joint policy $\bm{\pi}$ and put ourselves in the shoes of some agent $i$. If the joint policy $\bm{\pi}_{-i}$ of the other agents is known and fixed, then agent $i$’s path is straightforward. To maximize its return, it should follow a best response to $\bm{\pi}_{-i}$: a marginal policy $\pi_i$ which maximizes $v_i$, assuming $\bm{\pi}_{-i}$ will not change. Below, we will explicitly split $\bm{\pi} = (\pi_i, \bm{\pi}_{-i})$.

$\qquad \qquad \qquad \quad \text{Fix } i\in\{1, \ldots, n\} \text{ and let }\bm{\pi} = (\pi_i, \bm{\pi}_{-i})$. $\text{The best-response operator } \mathcal{T}_i^{\text{br}}\text{ is defined as:}$ $$ \mathcal{T}_i^{\text{br}}(\bm{\pi}_{-i}) = \argmax_{\pi_i} v_i(\pi_i, \bm{\pi}_{-i}) $$
Observe that a policy $\bm{\pi}_{-i}$ can admit several best-responses. The best-response operator $\mathcal{T}_i^\text{br}$ is therefore a set-valued function. Given some joint policy $\bm{\pi}$ we define the joint best-response set as: $$ \mathcal{T}^{\text{br}}(\bm{\pi}) := \mathcal{T}_1^{\text{br}}(\bm{\pi}_{-1}) \times \ldots \times \mathcal{T}_n^{\text{br}}(\bm{\pi}_{-n})\; . $$
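In a two-player matrix game, the deterministic best responses of agent 1 to a fixed opponent strategy can be enumerated directly. A small sketch with hypothetical payoffs (ties return several actions, illustrating that $\mathcal{T}_i^\text{br}$ is set-valued):

```python
# Sketch: enumerating agent 1's deterministic best responses to a
# fixed opponent mixed strategy pi2 in a two-player matrix game.
# Payoffs are hypothetical; ties return several actions.
def best_responses(r1, pi2, tol=1e-12):
    """Indices of agent 1's actions maximizing its expected payoff
    against the mixed strategy pi2."""
    expected = [sum(p * r for p, r in zip(pi2, row)) for row in r1]
    best = max(expected)
    return [a for a, v in enumerate(expected) if v >= best - tol]

print(best_responses([[3, 0], [5, 1]], [0.5, 0.5]))  # [1]
print(best_responses([[1, 0], [0, 1]], [0.5, 0.5]))  # [0, 1]
```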

Nash Equilibrium

The concept of Nash equilibrium is relatively straightforward, yet arguably one of the most influential in game theory. A joint policy $\bm{\pi}$ is a Nash equilibrium if each marginal policy $\pi_i$ is a best response to the remaining $\bm{\pi}_{-i}$. Intuitively, this comes with plenty of stability: even if an agent knew the strategies of the others, it would gain nothing by changing its own. As a result, no agent has an a-priori interest in deviating from the current joint strategy.

$\qquad \qquad \qquad \qquad \; \bm{\pi}\text{ is a Nash-equilibrium if:}$ $$ \forall i\in\{1, \ldots, n\}, \; \pi_{i} \in \mathcal{T}^\text{br}_i(\bm{\pi}_{-i})\; . $$
Nash Equilibrium

In other words, a Nash-equilibrium policy $\bm{\pi}$ is such that $ \bm{\pi} \in \mathcal{T}^\text{br}(\bm{\pi}) \; . $

$\quad$ To ease exposition, we will now restrict our attention to simple matrix games $\mathcal{M} = (n, \mathcal{A}, \bm{r})$. The Nash equilibrium is a fairly portable concept, and most of the results below can be generalised to more complex games (although, often, at the price of some simplifying hypothesis).

The first question that pops into our mind is whether such policies exist. That’s the point of Nash’s theorem, which we will state and prove shortly. Before that, let’s establish a few useful results. Below, we denote $\Pi_i = \Delta(\mathcal{A}_i)$ the policy space for agent $i$, and $\bm{\Pi} = \prod_{i=1}^n \Pi_i$ the joint-policy space. Our first claim establishes that there always exists a deterministic best-response policy.
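As a side note, the fixed-point condition $\bm{\pi} \in \mathcal{T}^\text{br}(\bm{\pi})$ can be checked numerically in a small bimatrix game by testing deterministic deviations only (which suffices in finite games). A sketch with hypothetical prisoner's-dilemma-style payoffs:

```python
# Sketch: checking the fixed-point condition pi in T^br(pi) for a
# two-player matrix game by testing deterministic deviations only
# (sufficient in finite games). Payoffs are hypothetical, in the
# style of a prisoner's dilemma (action 0 = cooperate, 1 = defect).
def is_nash(r1, r2, pi1, pi2, tol=1e-9):
    """True if (pi1, pi2) is a Nash equilibrium of the bimatrix game
    (r1, r2): no agent gains by deviating to a deterministic policy."""
    v1 = sum(pi1[a] * pi2[b] * r1[a][b]
             for a in range(len(pi1)) for b in range(len(pi2)))
    v2 = sum(pi1[a] * pi2[b] * r2[a][b]
             for a in range(len(pi1)) for b in range(len(pi2)))
    best_dev1 = max(sum(pi2[b] * r1[a][b] for b in range(len(pi2)))
                    for a in range(len(pi1)))
    best_dev2 = max(sum(pi1[a] * r2[a][b] for a in range(len(pi1)))
                    for b in range(len(pi2)))
    return best_dev1 <= v1 + tol and best_dev2 <= v2 + tol

r1 = [[3, 0], [5, 1]]
r2 = [[3, 5], [0, 1]]
print(is_nash(r1, r2, [0, 1], [0, 1]))  # True: mutual defection
print(is_nash(r1, r2, [1, 0], [1, 0]))  # False: each agent wants to defect
```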

Claim 1. For every $\bm{\pi}\in\bm{\Pi}$ and $i\in\{1, \ldots, n\}$, there exists a deterministic policy in $\mathcal{T}^\text{br}_i(\bm{\pi}_{-i})$.


Our second claim gives us a tool to identify some best-response stochastic policies. More precisely, it establishes that if some policy $\pi_i$ cannot be improved by any deterministic policy, then it is a best-response. To formalise this, let us introduce the coefficients: $$ \alpha_{ij}(\bm{\pi}) = \max(0, v_i(a_j, \bm{\pi}_{-i}) - v_i(\bm{\pi})), \qquad j\in\{1, \ldots, \vert \mathcal{A}_i\vert\}\; . $$

Claim 2. For every $i\in\{1, \ldots, n\}$ and $\bm{\pi}=(\pi_i, \bm{\pi}_{-i})\in\bm{\Pi}$, if $\max_{j\in\{1, \ldots, \vert\mathcal{A}_i\vert\}} \alpha_{ij}(\bm{\pi}) = 0$ then $\pi_i \in\mathcal{T}_i^\text{br}(\bm{\pi}_{-i})$.
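The coefficients $\alpha_{ij}$ are easy to compute in a matrix game, turning Claim 2 into a numerical best-response certificate. A sketch on matching pennies, a standard zero-sum example:

```python
# Sketch: the coefficients alpha_{ij} as a best-response certificate
# in a two-player matrix game (illustrative payoffs: matching pennies).
def value(r1, pi1, pi2):
    """Expected payoff of agent 1 under mixed strategies (pi1, pi2)."""
    return sum(pi1[a] * pi2[b] * r1[a][b]
               for a in range(len(pi1)) for b in range(len(pi2)))

def alphas(r1, pi1, pi2):
    """alpha_j = max(0, v_1(a_j, pi2) - v_1(pi1, pi2)), one per action."""
    v = value(r1, pi1, pi2)
    return [max(0.0, sum(p * r for p, r in zip(pi2, row)) - v)
            for row in r1]

r1 = [[1, -1], [-1, 1]]                    # matching pennies, agent 1
print(alphas(r1, [0.5, 0.5], [0.5, 0.5]))  # [0.0, 0.0] => best response
```

All coefficients vanish, so by Claim 2 the uniform policy is a best response to uniform play.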


We are now ready to state and prove Nash’s theorem.

$\qquad \qquad \qquad \quad \text{For every game with a finite number of players ($n<\infty$), each with a finite number of }$ $\text{ actions } (\vert \mathcal{A}_i\vert < \infty) \text{, there exists at least one Nash equilibrium.}$
Nash Theorem

A game can have several distinct Nash equilibria, each with different expected returns for the agents. There exist many variations of this solution concept (e.g., $\varepsilon$-Nash, correlated equilibrium) which we will not cover here. Their point is to describe equilibria that are compatible with rounding errors, extra information, and so on.

Maximin and Minimax

This section is concerned with two-player ($n=2$) zero-sum games. The maximin and minimax concepts seamlessly generalise to other games, but we will keep the focus on two-player zero-sum games for simplicity.

  • The maximin strategy for some agent $i$ consists in the following rationale: follow a best-response policy, assuming that the other agent is acting to minimize my return. In other words, play a policy that maximizes agent $i$’s pay-off, assuming the other agent is out to cause it the greatest harm. The maximin policy for agent $i$ is formally defined as: $$ \bar{\pi}_i \in \argmax_{\pi_i} \min_{\pi_{-i}} v_i(\pi_i, \pi_{-i})\;, $$ and the maximin value as $\max_{\pi_i} \min_{\pi_{-i}} v_i(\pi_i, \pi_{-i})$. The maximin value is also sometimes referred to as the security level. It is the value agent $i$ can guarantee itself, regardless of its opponent’s strategy.

  • The minimax strategy sees things the other way around. When agent $i$ follows a minimax strategy, it abandons the maximisation of its own return in order to minimize its opponent’s value. Formally, the minimax policy for agent $i$ is defined as: $$ \underline{\pi}_i \in \argmin_{\pi_i} \max_{\pi_{-i}} v_{-i}(\pi_i, \pi_{-i})\;, $$ and the minimax value as $\min_{\pi_i} \max_{\pi_{-i}} v_{-i}(\pi_i, \pi_{-i})$. The minimax value is the maximum return agent $i$ can force on its opponent.
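As an illustration, the security level of agent 1 in a $2\times 2$ zero-sum matrix game can be approximated by brute force over a grid of mixed strategies (a linear program would give the exact value; the payoffs below are matching pennies):

```python
# Sketch: the security level (maximin value) of agent 1 in a 2x2
# zero-sum matrix game, approximated by brute force over a grid of
# mixed strategies. Payoffs are matching pennies, for illustration;
# a linear program would yield the exact value in general.
def maximin_value(r1, grid=1001):
    """max over pi_1 = (p, 1-p) of the worst-case expected payoff.
    The inner minimum is attained at a deterministic opponent action,
    so enumerating columns suffices."""
    best = float("-inf")
    for k in range(grid):
        p = k / (grid - 1)
        worst = min(p * r1[0][b] + (1 - p) * r1[1][b] for b in range(2))
        best = max(best, worst)
    return best

print(maximin_value([[1, -1], [-1, 1]]))  # 0.0: play uniformly at random
```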

A maximin (resp. minimax) solution concept is a joint policy $\bm\pi$ where each player follows a maximin (resp. minimax) strategy. In two-player zero-sum games, the two concepts blend with one another to yield the so-called minimax solution.

$\qquad \qquad \qquad \qquad \; \text{In a zero-sum game, a joint policy }\bm\pi=(\pi_1, \pi_2) \text{ is a minimax solution if:}$ $$ \begin{aligned} v_1(\bm\pi) &= \max_{\pi_1} \min_{\pi_2} v_1(\pi_1, \pi_2)\;, \\ &= \min_{\pi_2}\max_{\pi_1}v_1(\pi_1, \pi_2) \;, \\&= -v_2(\bm\pi) \; . \end{aligned} $$
Minimax solution

In a minimax solution, the minimax and maximin values of each agent coincide. Such a solution always exists in finite games. Below, we state and prove the Minimax Theorem, which establishes this result for matrix games. The minimax solution concept has tight links with the Nash equilibrium: in fact, in a two-player zero-sum game, every Nash equilibrium is a minimax solution (see the proof).
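The theorem can also be sanity-checked numerically: for an illustrative $2\times 2$ zero-sum game, the maximin and minimax values computed by brute force over a strategy grid coincide.

```python
# Sketch: a numerical check of the Minimax Theorem on an illustrative
# 2x2 zero-sum matrix game (brute force over a strategy grid; in
# practice the exact value is obtained via linear programming).
R = [[4, 1], [2, 3]]   # agent 1's payoffs; agent 2 receives -R

grid = [k / 1000 for k in range(1001)]

# max_{pi_1} min_{a_2} v_1: the inner minimum over mixed strategies
# is always attained at a deterministic action, so columns suffice.
maximin = max(min(p * R[0][b] + (1 - p) * R[1][b] for b in range(2))
              for p in grid)
# min_{pi_2} max_{a_1} v_1
minimax = min(max(q * R[a][0] + (1 - q) * R[a][1] for a in range(2))
              for q in grid)

print(maximin, minimax)  # both 2.5: the game's value
```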

$\qquad \qquad \qquad \qquad \quad \text{Any finite zero-sum matrix game with two agents admits a minimax solution.}$
Minimax Theorem


The material in this blog post was taken from [Shoham and Leyton-Brown, Chapter 3].