Distributional Dynamic Programming
Control in MDPs is often focused on the expected return criterion, which has the good taste to follow the Bellman dynamic programming equations we all love and cherish. In this post, we see how those equations generalise beyond the expected return to the full return distribution. We will cover the distributional Bellman equations for policy evaluation, and see them in action in a practical distributional dynamic programming algorithm based on quantiles.
Let us consider a finite MDP $\mathcal{M} = (\mathcal{S}, \mathcal{A}, r, p)$ and let $\pi:\mathcal{S}\mapsto\Delta(\mathcal{A})$ be some stationary policy. We denote by $R = \sum_{t\geq 1} \lambda^{t-1} r(s_t, a_t)$ the return associated with the stochastic process obtained when running $\pi$ on $\mathcal{M}$, where $\lambda\in[0,1)$ is the discount factor. When evaluating $\pi$ under a discounted objective, we score $\pi$ via the expected value of $R$ – a.k.a. the value function: $$ \tag{1} \begin{aligned} v^\pi(s) := \mathbb{E}^\pi\left[R\,\middle\vert\, s_1=s\right]\; . \end{aligned} $$ The expectation is convenient – most control and RL algorithms heavily rely on its underlying properties, captured by the classical Bellman equations. It does, however, average out many subtleties of the reward process collected under $\pi$. Such nuances can be quite relevant when one decides to e.g. adopt a risk-averse profile. In this post, we follow the distributional RL literature and set out to characterise and model the entire distribution $\mu_\lambda^\pi$ of the return.
Preliminaries
Definitions
For any stationary $\pi$, the return distribution $\mu_\lambda^\pi$ maps $\mathcal{S}$ to $\Delta(\mathbb{R})$, the space of distributions over the real line. For concreteness, fix $s\in\mathcal{S}$ and denote for an instant $\mu = \mu_\lambda^\pi(s)$; then for any (Borel) $\mathcal{R}\subseteq\mathbb{R}$: $$ \tag{2} \mu(\mathcal{R}) := \mathbb{P}^\pi\left(R \in \mathcal{R} \, \middle\vert \, s_1=s\right)\; . $$
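To fix ideas with a trivial example (not from the reference, just an illustration): if running $\pi$ from $s$ yields a constant, deterministic reward $c$ at every step, then $$ R(s) = \sum_{t\geq 1} \lambda^{t-1} c = \frac{c}{1-\lambda} \quad \text{almost surely}, \qquad \text{so that} \quad \mu_\lambda^\pi(s) = \delta_{c/(1-\lambda)}\;, $$ a single Dirac mass sitting exactly at the value function $v^\pi(s)$; any randomness in $\pi$ or $p$ spreads this mass into a genuine distribution.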
We will denote by $R(s)$ the return obtained by conditioning on $s_1=s$ and following the reward process induced by $\pi$ (we drop the dependency on $\pi$ to reduce clutter). A simple decomposition of the return tells us that for any transition $(s, a, s^\prime)$: $$ \tag{3} R(s) \overset{d}{=} r(s,a) + \lambda R(s^\prime)\;, $$ where the notation $\overset{d}{=}$ indicates an equality of distribution between the left and right hand sides. This is an equality (in distribution) between two random variables; it does not get us across the finish line, as it is not yet a Bellman-like equation for the return distribution $\mu_\lambda^\pi$. To get there, we will need a few extra tools that we detail below—without much context for now.
Push-forward operator
Let $f:\mathbb{R}\mapsto \mathbb{R}$ be a Borel-measurable function. The notation $f_\sharp$ denotes the push-forward operator by $f$: $$ \begin{aligned} f_\sharp : \Delta(\mathbb{R}) &\mapsto \Delta(\mathbb{R}) \;,\\ \mu &\mapsto f_\sharp\mu \end{aligned} $$ where $f_\sharp\mu(\mathcal{R}) = \mu(f^{-1}(\mathcal{R}))$ for any (Borel) $\mathcal{R}\subseteq\mathbb{R}$.
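As a minimal illustration (a sketch of mine, not from the reference), here is how the push-forward acts on a finitely supported distribution represented by atom locations and weights – a representation we will reuse later in the post:

```python
import numpy as np

def pushforward(f, atoms, probs):
    """Push the empirical distribution sum_k probs[k] * delta_{atoms[k]} through f.

    Since (f_sharp mu)(R) = mu(f^{-1}(R)), the weights are untouched and the
    atoms are simply relocated by f.
    """
    return f(atoms), probs

atoms = np.array([-1.0, 0.0, 2.0])
probs = np.array([0.2, 0.5, 0.3])

# Push-forward through the affine map z -> 1.0 + 0.9 * z (the kind of map used below).
new_atoms, new_probs = pushforward(lambda z: 1.0 + 0.9 * z, atoms, probs)
print(new_atoms, new_probs)   # [0.1 1.  2.8] [0.2 0.5 0.3]
```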
Wasserstein distance
We will equip $\Delta(\mathbb{R})$ with some Wasserstein metric. For $p\geq 1$, the $p$-Wasserstein distance between $\mu$ and $\nu$ is generically defined via the set of couplings $\Gamma(\mu, \nu)$ – the joint distributions whose marginals are $\mu$ and $\nu$: $$ \omega_p(\mu, \nu) = \inf_{\gamma\in\Gamma(\mu, \nu)} \mathbb{E}_{(X,Y)\sim\gamma} \left[\vert X-Y\vert^p\right]^{1/p}\; . $$ For distributions on the real line, this simplifies to involve only the (inverse) cumulative distribution functions: $$ \omega_p(\mu, \nu) = \left(\int_{0}^1 \left\vert F_{\mu}^{-1}(z) - F_{\nu}^{-1}(z)\right\vert ^pdz\right)^{1/p}\; . $$
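For intuition, here is a small sketch (restricted, for simplicity, to two empirical distributions with the same number of equally weighted atoms, in which case the quantile-function formula reduces to matching sorted atoms); `scipy.stats.wasserstein_distance` is used purely as a cross-check for the $p=1$ case:

```python
import numpy as np
from scipy.stats import wasserstein_distance  # 1-Wasserstein, used as a cross-check

def w_p(x_atoms, y_atoms, p=2):
    """omega_p between two empirical distributions with equally weighted atoms.

    Both inverse CDFs are step functions, so the integral over [0, 1] reduces to
    averaging |x_(k) - y_(k)|^p over sorted atoms.
    """
    x, y = np.sort(x_atoms), np.sort(y_atoms)
    return float(np.mean(np.abs(x - y) ** p) ** (1.0 / p))

x = np.array([0.0, 1.0, 3.0])
y = np.array([0.5, 2.0, 2.5])
print(w_p(x, y, p=1), wasserstein_distance(x, y))   # both ~0.667
```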
Distributional Bellman equations
Our road to devising algorithms that can compute $\mu_\lambda^\pi$ is going to resemble the “classical” one we follow to compute the optimal value function for an MDP – the tools will be quite different, though. Up to some hurdles we will have to jump through along the way, the protocol will be to
- show that $\mu_\lambda^\pi$ is the fixed-point to some contractive operator,
- compute it via some fixed-point iterations.
A fixed-point equation
The fixed-point characterisation of $\mu_\lambda^\pi$ is the distributional equivalent of (3). For any $(s, a)\in\mathcal{S}\times\mathcal{A}$, let us introduce a map whose push-forward will prove useful: $$ \begin{aligned} T_\lambda^{s, a} : \mathbb{R} &\mapsto \mathbb{R} \;, \\ z &\mapsto r(s, a) + \lambda z \; . \end{aligned} $$ The push-forward $(T_\lambda^{s, a})_\sharp$ manifestly allows us to represent (distributionally) taking one step of the reward process. Without further ado, here is the desired result: $$ \tag{4} \mu_\lambda^\pi(s) = \sum_{a\in\mathcal{A}}\sum_{s^\prime\in\mathcal{S}} \pi(a\vert s)\, p(s^\prime\vert s, a)\, (T_\lambda^{s, a})_\sharp\, \mu_\lambda^\pi(s^\prime)\;, \quad \forall s\in\mathcal{S}\; . $$
We will sometimes write (4) as $\mu_\lambda^\pi(s) = \mathbb{E}^\pi_s\left[(T_\lambda^{s, a})_\sharp \mu_\lambda^\pi(s^\prime)\right]$, where the expectation is over $a\sim\pi(\cdot\vert s)$ and $s^\prime\sim p(\cdot\vert s, a)$, or equivalently use the distributional operator $\mathcal{T}_\text{d}^\pi:\Delta(\mathbb{R})^\mathcal{S} \mapsto \Delta(\mathbb{R})^\mathcal{S}$ defined by $\mathcal{T}_\text{d}^\pi(\mu)(s)=\sum_{a\in\mathcal{A}}\sum_{s^\prime\in\mathcal{S}} \pi(a\vert s)p(s^\prime\vert s, a)(T_\lambda^{s, a})_\sharp \mu(s^\prime)$ for any $s\in\mathcal{S}$, so that: $$ \tag{5} \mu_\lambda^\pi = \mathcal{T}_\text{d}^\pi(\mu_\lambda^\pi)\; . $$
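As a quick sanity check, taking means on both sides of (4) recovers the classical Bellman equation for $v^\pi$: the mean of $(T_\lambda^{s, a})_\sharp\, \mu_\lambda^\pi(s^\prime)$ is $r(s, a) + \lambda v^\pi(s^\prime)$, hence $$ v^\pi(s) = \sum_{a\in\mathcal{A}}\pi(a\vert s)\sum_{s^\prime\in\mathcal{S}} p(s^\prime\vert s, a)\left(r(s, a) + \lambda\, v^\pi(s^\prime)\right)\; . $$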
The distributional Bellman operator
Beyond the intellectual satisfaction, the identities (4–5) are only useful if they can be turned into an algorithm to compute $\mu_\lambda^\pi$. In this section, we take the first step towards this goal by characterising the contractive properties of the distributional Bellman operator $\mathcal{T}_\text{d}^\pi$. Contraction will happen under a metric that extends $\omega_p$ to $\Delta(\mathbb{R})^\mathcal{S}$. Concretely, we define: $$ \begin{aligned} W_p : \Delta(\mathbb{R})^\mathcal{S}\times \Delta(\mathbb{R})^\mathcal{S} &\mapsto\mathbb{R} \;, \\ (\mu, \nu) &\mapsto \max_{s\in\mathcal{S}} \omega_p(\mu(s), \nu(s)) \; , \end{aligned} $$ which is easily proven to be a valid metric over $ \Delta(\mathbb{R})^\mathcal{S}$. The key claim is that $\mathcal{T}_\text{d}^\pi $ is a $\lambda$-contraction under $W_p$: for any $\mu, \nu\in\Delta(\mathbb{R})^\mathcal{S}$, $$ W_p\big(\mathcal{T}_\text{d}^\pi\mu,\, \mathcal{T}_\text{d}^\pi\nu\big) \leq \lambda\, W_p(\mu, \nu)\; . $$ It is then a classical fixed-point result that $\mu_\lambda^\pi$ is the unique solution to (5).
It also promises us an algorithmic way forward by ensuring the convergence of the fixed-point iteration $\mu_{t+1} = \mathcal{T}_\text{d}^\pi \mu_t$ towards $\mu_\lambda^\pi$, in the $p$-Wasserstein sense at least, since $ \lim_{t\to\infty} W_p(\mu_t, \mu_\lambda^\pi) = 0\; . $
Distributional Dynamic Programming
Empirical representation
The previous section might convince us that we can naively follow the fixed-point iteration $ \mu_{t+1} = \mathcal{T}_\text{d}^\pi \mu_t $ to compute some good approximation of $\mu_\lambda^\pi$. This is conceptually true, but concretely it is not a trivial operation since $\mu_t$, a member of $\Delta(\mathbb{R})^\mathcal{S}$, cannot a priori be represented within the finite memory of a computer. One way forward is to maintain distributions that are weighted Dirac combs – in other words, members of the family $\mathcal{F}^\mathcal{S}$ of empirical distributions: $$ \mathcal{F}^\mathcal{S} := \Big\{ \mu\in\Delta(\mathbb{R})^\mathcal{S} \, : \, \mu(s) = \sum_{k=1}^{K(s)} p_k(s) \delta_{r_k(s)}, \; \sum_{k=1}^{K(s)} p_k(s) = 1, \; r_k(s)\in\mathbb{R}, \, K(s)\in\mathbb{N}, \; \forall s\in\mathcal{S}\Big\} \subset \Delta(\mathbb{R})^\mathcal{S}\; . $$ Observe that for any $\mu\in\mathcal{F}^\mathcal{S}$ we have $\mathcal{T}_\text{d}^\pi\mu\in\mathcal{F}^\mathcal{S}$. In other words, $\mathcal{F}^\mathcal{S}$ is closed under $\mathcal{T}_\text{d}^\pi$ – any fixed-point iteration starting in $\mathcal{F}^\mathcal{S}$ will not stray away from it. We give below some pseudocode for a single fixed-point iteration when maintaining empirical distributions, represented via the pair $(\mathbf{r}, \mathbf{p})$: $$ \begin{aligned} \mathbf{r} := \left[\{r_k(s)\}_{k=1}^{K(s)}\right]_{s=1}^{S}\;, \\ \mathbf{p} := \left[\{p_k(s)\}_{k=1}^{K(s)}\right]_{s=1}^{S}\;. \end{aligned} $$
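Here is one possible rendering of that pseudocode as a plain NumPy sketch (my own, not the book's); the reward table `r`, transition kernel `p`, policy `pi` and discount `lam` are assumed to be given as arrays:

```python
import numpy as np

# One fixed-point iteration mu_{t+1} = T_d^pi(mu_t) on empirical distributions.
# Assumed inputs:
#   atoms[s], probs[s] : np.arrays holding {r_k(s)}_k and {p_k(s)}_k,
#   r[s, a]            : reward table,
#   p[s, a, s2]        : transition kernel,
#   pi[s, a]           : policy,
#   lam                : discount factor in [0, 1).

def bellman_step(atoms, probs, r, p, pi, lam):
    S, A = pi.shape
    new_atoms, new_probs = [], []
    for s in range(S):
        atom_chunks, prob_chunks = [], []
        for a in range(A):
            for s2 in range(S):
                w = pi[s, a] * p[s, a, s2]
                if w == 0.0:
                    continue
                # Push mu_t(s2) through z -> r(s, a) + lam * z and weight by pi * p.
                atom_chunks.append(r[s, a] + lam * atoms[s2])
                prob_chunks.append(w * probs[s2])
        new_atoms.append(np.concatenate(atom_chunks))
        new_probs.append(np.concatenate(prob_chunks))
    return new_atoms, new_probs
```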
Repeating this protocol ensures convergence (in the $W_p$ sense) to $\mu_\lambda^\pi$. However, this comes at a cost: the memory needed to maintain $\mu_t$ grows as $\mathcal{O}((SA)^t)$! This is clearly unacceptable for all but prohibitively small values of $t$ – rendering the practical relevance of this approach, well, void.
Quantile representation
In this section we study an alternative representation allowing for constant memory usage—at the price, inevitably, of some precision. Concretely, for some fixed $K\in\mathbb{N}$, we aim to maintain return distributions $\mu$ in: $$ \mathcal{F}_q^\mathcal{S} := \Big\{\mu\in\Delta(\mathbb{R})^\mathcal{S} \, : \, \mu(s) = \frac{1}{K} \sum_{k=1}^K \delta_{r_k(s)},\; r_k(s)\in\mathbb{R} \;\forall k, \; \forall s\in\mathcal{S}\Big\}\;. $$ This quantile representation essentially indexes $K$ atoms of equal weight for each state, therefore upholding a more reasonable $\mathcal{O}(KS)$ memory cost. This comes at a price: in particular, $\mathcal{F}_q^\mathcal{S}$ is no longer closed under $\mathcal{T}_\text{d}^\pi$. To maintain fixed-point iterates in $\mathcal{F}_q^\mathcal{S}$ we must go through a projection step $\mathcal{P}_q$ after each application of the distributional Bellman operator. It makes sense for $\mathcal{P}_q$ to be a projection mapping according to some well-chosen metric. Below, we choose the $\omega_1$ distance; formally, for $\mu\in\Delta(\mathbb{R})^\mathcal{S}$ and any $s\in\mathcal{S}$: $$ (\mathcal{P}_q \mu)(s) \in \text{arg min}_{\zeta\in \mathcal{F}_q} \omega_1(\mu(s), \zeta)\; , $$ where $\mathcal{F}_q$ denotes the per-state family of distributions with $K$ equally weighted atoms.
Conveniently enough, this projection happens to have a closed form. For $\mu\in\Delta(\mathbb{R})^\mathcal{S}$ and $s\in\mathcal{S}$: $$ \tag{6} (\mathcal{P}_q \mu)(s) = \frac{1}{K} \sum_{k=1}^K \delta_{F_{\mu(s)}^{-1}\left(\frac{2k-1}{2K}\right)}\; . $$ In plain words, the atoms of the projected distribution are given by evenly spaced quantiles of $\mu(s)$.
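A possible NumPy sketch of (6), assuming the input distribution is itself empirical (weighted atoms), in which case the inverse CDF is a step function obtained by cumulating the weights:

```python
import numpy as np

def quantile_projection(atoms, probs, K):
    """Project a weighted empirical distribution onto K equally weighted atoms."""
    order = np.argsort(atoms)
    atoms, probs = atoms[order], probs[order]
    cdf = np.cumsum(probs)
    taus = (2 * np.arange(1, K + 1) - 1) / (2 * K)   # midpoint levels (2k - 1) / (2K)
    idx = np.searchsorted(cdf, taus, side="left")    # F^{-1}(tau): first atom with CDF >= tau
    idx = np.minimum(idx, len(atoms) - 1)            # guard against floating-point rounding
    return atoms[idx]                                # K atoms, each carrying weight 1 / K

# Example: project three weighted atoms onto K = 4 equally weighted ones.
print(quantile_projection(np.array([0.0, 1.0, 2.0]), np.array([0.2, 0.5, 0.3]), K=4))
# -> [0. 1. 1. 2.]
```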
Quantile Dynamic Programming
The Quantile Dynamic Programming procedure is essentially a combination of the distributional Bellman operator with the quantile projection. Concretely, it follows the iterative protocol: $$ \tag{7} \mu_{t+1} = \mathcal{P}_q \mathcal{T}_\text{d}^\pi\mu_t\; , $$ where $\mathcal{P}_q$ is defined in (6). In plain words, after each application of the distributional Bellman operator, we project the resulting distribution back to $\mathcal{F}_q^\mathcal{S}$. It is straightforward to adapt the pseudocode given above to obtain a working and computationally feasible algorithm (see the sketch at the end of this section). We still have two questions to answer, however.
- Does the sequence $\{\mu_t\}_t$ produced by (7) converge?
- If yes, what is the approximation error to $\mu_\lambda^\pi$?
The first question is answered in the affirmative. Indeed, one can show that $\mathcal{P}_q$ is a non-expansion in the $W_\infty$ sense. As a result, $\mathcal{P}_q \circ \mathcal{T}_\text{d}^\pi$ is a contraction, and convergence of (7) to the unique fixed point $\mu_\infty$ of $\mathcal{P}_q \circ \mathcal{T}_\text{d}^\pi$ is guaranteed.
The second question basically asks whether said fixed point $\mu_\infty$ is anywhere near what we actually want to compute, which is $\mu_\lambda^\pi$. The following result, which we will not prove here (but check the reference below!), has a very similar flavour to the ones we get when studying approximate dynamic programming: $$ W_\infty(\mu_\infty, \mu_\lambda^\pi) \leq \frac{W_\infty(\mathcal{P}_q \mu_\lambda^\pi, \mu_\lambda^\pi)}{1-\lambda}\;. $$ Essentially, the error bound depends on how well $\mu_\lambda^\pi$ is approximated by its projection on $\mathcal{F}_q^\mathcal{S}$.
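Putting the pieces together, here is a sketch of the full QDP loop (7), reusing the hypothetical `bellman_step` and `quantile_projection` helpers from the earlier sketches; the inputs are the same assumed arrays as before, and `K` is the number of quantile atoms per state:

```python
import numpy as np

def quantile_dp(r, p, pi, lam, K, n_iters=200):
    S, _ = pi.shape
    atoms = [np.zeros(K) for _ in range(S)]              # mu_0: all mass at 0
    probs = [np.full(K, 1.0 / K) for _ in range(S)]
    for _ in range(n_iters):
        atoms, probs = bellman_step(atoms, probs, r, p, pi, lam)    # apply T_d^pi
        atoms = [quantile_projection(a, q, K) for a, q in zip(atoms, probs)]
        probs = [np.full(K, 1.0 / K) for _ in range(S)]             # back in F_q^S
    return atoms   # approximate quantile atoms of mu_lambda^pi at each state
```

Each iteration creates at most $SAK$ atoms per state before projecting back down to $K$, so the memory footprint stays bounded – in contrast with the exponential blow-up of the unprojected iteration.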
References
This post is basically a condensed version of chapters 1–5 of the excellent book:
[1] Distributional Reinforcement Learning. Bellemare M, Dabney W, Rowland M, 2024.