Control in (Generally) Regularised MDPs

This post explores the theory of regularised MDPs beyond entropic regularisation (which we covered in an older post). We will introduce convex regularisation of the classical Bellman operators and study the induced regularised policy iteration algorithms. Along the way, we will draw connections with several popular algorithms. This post is mostly a good excuse to refresh some convex optimisation classics.

Warm-up

We covered entropic regularisation in this post. It was motivated by a modification of the discounted return, involving a policy’s entropy in its valuation. There, we introduced the Bellman operators associated with this objective, along with the value/policy iteration scheme that ensues. Below, we repeat this operation by working directly at the operator level. We will cover a collection of convex-regularised Bellman operators, each inducing its own set of control algorithms, entropic regularisation being a special case.

$\quad$ We will use indexed and vectorial notations interchangeably– see here for some background.

Let $\mathcal{M}=(\mathcal{S}, \mathcal{A}, p, r)$ be some finite MDP that we study under a discounted criterion. For any $v\in\mathbb{R}^\mathrm{S}$, denote $q\in\mathbb{R}^{\mathrm{S}\times\mathrm{A}}$ the associated $q$-values, $q(s, a) = r(s, a) + \lambda\sum_{s^\prime\in\mathcal{S}} p(s^\prime\vert s, a)v(s^\prime)$. We will denote $q(s)$ the associated vector in $\mathbb{R}^\mathrm{A}$, whose entry $a$ is $q(s, a)$. Further, let $\Delta_\mathrm{A}$ be the simplex over $\mathcal{A}$. For any stationary policy $\pi\in\Delta_{\mathrm{A}}^\mathrm{S}$ and $v\in\mathbb{R}^\mathrm{S}$, the Bellman evaluation and optimality operators write, respectively, $ \mathcal{T}_\lambda^\pi(v) := r_\pi + \lambda \mathbf{P}_\pi v $ and $ \mathcal{T}_\lambda^\star(v) := \max_\pi\mathcal{T}_\lambda^\pi(v). $ Rewriting the former using state-indexed notations reveals a linear structure w.r.t the policy and the $q$-values: $$ \mathcal{T}_\lambda^\pi(v)(s) = \sum_{a} \pi(a\vert s)\big[r(s, a) + \lambda\sum_{s^\prime\in\mathcal{S}} p(s^\prime\vert s, a)v(s^\prime)\big] = \pi(s)^\top q(s)\;.\tag{1} $$ Similarly, the optimality operator is obtained as the solution of a linear program over the simplex: $\mathcal{T}_\lambda^\star(v)(s) = \max_{\pi} \pi(s)^\top q(s)$.
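To keep things concrete, here is a minimal tabular NumPy sketch of these operators. The reward array `r[s, a]`, transition tensor `P[s, a, s']` and policy array `pi[s, a]` are illustrative placeholders, not notation from the post.

```python
import numpy as np

def q_values(r, P, v, lam):
    """q(s, a) = r(s, a) + lam * sum_s' p(s' | s, a) v(s')."""
    return r + lam * (P @ v)

def bellman_evaluation(r, P, v, pi, lam):
    """Equation (1): T_lambda^pi(v)(s) = pi(s)^T q(s)."""
    return np.sum(pi * q_values(r, P, v, lam), axis=1)

def bellman_optimality(r, P, v, lam):
    """T_lambda^*(v)(s) = max_a q(s, a): the linear program over the simplex
    is attained at a vertex, i.e. at a deterministic greedy action."""
    return q_values(r, P, v, lam).max(axis=1)
```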

Regularised MDPs

Throughout, $\Omega:\Delta_\mathrm{A}\mapsto\mathbb{R}$ refers to some strongly convex function, which we assume to be smooth for simplicity. Concretely, $\Omega$ dominates its first-order expansion by a quadratic term; there exists some $\alpha>0$ such that for any $x, y\in\Delta_\mathrm{A}$ we have $ \Omega(y) \geq \Omega(x) + \nabla\Omega(x)^\top(y-x) + \alpha\|y-x\|^2\;. $

Convex conjugate

We will start with some convex optimisation refreshers. The convex conjugate $\Omega^\star:\mathbb{R}^\mathrm{A}\mapsto\mathbb{R}$ is defined as: $$ \Omega^\star(y) := \max_{x\in\Delta_\mathrm{A}} \left\{ x^\top y - \Omega(x)\right\}\;. \tag{2} $$

The convex conjugate is sometimes called the Legendre-Fenchel transformation. There are several great resources online to gain intuition on this object; here, we will focus directly on some key properties. First, observe that the definition (2) is well-posed thanks to $\Omega$’s strong convexity: the maximiser exists (continuity over a compact set) and is unique (strong concavity of the objective). Second, $\Omega^\star$ is convex and differentiable. Finally, we have a useful characterisation of $\nabla\Omega^\star$; for any $y\in\mathbb{R}^\mathrm{A}$ : $$ \nabla\Omega^\star(y) = \argmax_{x\in\Delta_\mathrm{A}} \left\{ x^\top y - \Omega(x)\right\}\;. \tag{3} $$

Proof
$\quad$ There are countless excellent resources proving these properties of the convex conjugate. We do not reproduce the arguments here, in an effort to keep the focus on how we will use them.
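As a quick numerical sanity check of (2) and (3), here is a small NumPy sketch with the negative entropy $\Omega(x)=\sum_a x_a\log x_a$ (one of the regularisers appearing later), whose conjugate is the log-sum-exp and whose conjugate gradient is the softmax:

```python
import numpy as np

def omega(x):
    """Negative entropy over the simplex."""
    return np.sum(x * np.log(x))

def omega_star(y):
    """Its convex conjugate: the log-sum-exp."""
    return np.log(np.sum(np.exp(y)))

def grad_omega_star(y):
    """Its gradient: the softmax, cf. (3)."""
    z = np.exp(y - y.max())
    return z / z.sum()

rng = np.random.default_rng(0)
y = rng.normal(size=4)
x_star = grad_omega_star(y)

# (3): the softmax attains the maximum of x^T y - Omega(x) over the simplex...
for _ in range(1_000):
    x = rng.dirichlet(np.ones(4))
    assert x @ y - omega(x) <= x_star @ y - omega(x_star) + 1e-9

# ...and (2): the attained value is exactly Omega^*(y).
assert np.isclose(x_star @ y - omega(x_star), omega_star(y))
```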

Regularised operators

We can now introduce a regularised Bellman operator involving $\Omega$. This might seem somewhat arbitrary for now; nonetheless, for $\pi\in\Delta_\mathrm{A}^\mathrm{S}$ some stationary policy and $v\in\mathbb{R}^\mathrm{S}$, define: $$ \mathcal{T}_\Omega^\pi(v)(s) := \pi(s)^\top q(s)-\Omega(\pi(s))\;. \tag{4} $$

This operator retains the original’s desirable properties, such as its affine structure in $v$, monotonicity, and, most importantly, contraction. It therefore admits a unique fixed point, which can be retrieved via fixed-point iterations. We denote this fixed point $v_\Omega^\pi$; from now on, it will stand as the regularised value function of $\pi$.

Proof
Examples

The regularised Bellman optimality operator naturally follows, and it involves the convex conjugate $\Omega^\star$. It is defined by analogy with the unregularised case, where $\mathcal{T}_\lambda^\star(v) = \max_\pi \mathcal{T}_\lambda^\pi(v)$. For any $v\in\mathbb{R}^\mathrm{S}$ and $s\in\mathcal{S}$: $$ \tag{5} \mathcal{T}_\Omega^\star(v)(s) := \max_{\pi} \big\{ \pi(s)^\top q(s) - \Omega(\pi(s)) \big\} = \Omega^\star(q(s))\;. $$ Because $\mathcal{T}_\Omega^\pi$ has the same properties as $\mathcal{T}_\lambda^\pi$, we can repeat most of the argument developed here to prove that $\mathcal{T}_\Omega^\star$ also is a contraction mapping. We will denote $v_\Omega^\star$ its unique fixed point, and call that object the optimal regularised value. Thanks to (3), given some $v\in\mathbb{R}^\mathrm{S}$ (and its associated $q\in\mathbb{R}^{\mathrm{S}\times\mathrm{A}}$), the regularised greedy policy, i.e. the one which attains the maximum in (5), is provided by the convex conjugate’s gradient. Generalising from the non-regularised case, the optimal $\Omega$-regularised policy is defined as the unique $v_\Omega^\star$-greedy policy; concretely, $\pi_\Omega^\star(s) := \nabla\Omega^\star(q_\Omega^\star(s))$ for every $s\in\mathcal{S}$.
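For a generic regulariser the maximum in (5) rarely admits a closed form, but it can be approximated numerically. Below is a naive, purely illustrative sketch (the helper `regularised_greedy` and its arguments are placeholders) that recovers both $\Omega^\star(q(s))$ and the greedy regularised policy at a single state by solving the maximisation over the simplex:

```python
import numpy as np
from scipy.optimize import minimize

def regularised_greedy(q_s, omega):
    """Return (Omega^*(q(s)), grad Omega^*(q(s))) for a single state s."""
    n = len(q_s)
    res = minimize(
        lambda p: omega(p) - p @ q_s,           # minimise the negated objective of (5)
        x0=np.full(n, 1.0 / n),                 # start from the uniform policy
        bounds=[(1e-9, 1.0)] * n,               # stay (strictly) inside the simplex
        constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1.0}],
    )
    return -res.fun, res.x

# With the entropic regulariser this recovers the log-sum-exp and the softmax.
q_s = np.array([1.0, 0.5, -0.2])
value, policy = regularised_greedy(q_s, lambda p: np.sum(p * np.log(p)))
```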

Finally, it is reasonable to wonder about the impact of regularisation on actual performance, as measured by the discounted return. This can be directly quantified; where $u_\Omega$ (resp. $\ell_\Omega$) denotes an upper (resp. lower) bound of $\Omega$ over $\Delta_\mathrm{A}$: $$ v_\lambda^\star - \frac{u_\Omega - \ell_\Omega}{1-\lambda} \leq v_\lambda^{\pi^\star_\Omega} \leq v_\lambda^\star \;. $$

Proof
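To fix ideas, consider the entropic regulariser $\Omega(p) = \sum_a p_a \log p_a$ (covered in detail below): it is upper-bounded by $0$ (attained by deterministic policies) and lower-bounded by $-\log\vert\mathcal{A}\vert$ (attained by the uniform policy), so the bound specialises to $$ v_\lambda^\star - \frac{\log\vert\mathcal{A}\vert}{1-\lambda} \leq v_\lambda^{\pi^\star_\Omega} \leq v_\lambda^\star\;. $$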

Algorithms

$\quad$ A refresher on control algorithms in MDPs can be found here.

Regularised Policy Iteration

Armed with regularised Bellman operators, we can retrace the route towards computing $v_\Omega^\star$ and/or $\pi_\Omega^\star$. By the contractive nature of such operators, the story is going to be quite similar to the non-regularised case. For instance, a regularised value iteration writes $v_{\Omega}^{k+1} = \mathcal{T}_\Omega^\star(v_{\Omega}^k)$ and this process will converge to $v_\Omega^\star$. Similarly, a regularised policy iteration scheme executes $\pi_{k+1}(s) = \nabla\Omega^\star\big(q_\Omega^{\pi_k}(s)\big)$, where $q_\Omega^{\pi_k}$ are the $q$-values associated with $v_\Omega^{\pi_k}$; it will converge to $\pi_\Omega^\star$. We do not provide proofs for these claims; one can directly extend those developed in the unregularised setting. Finally, a regularised generalised policy iteration protocol will look like, for some $n\in\mathbb{N}$: $$ \left\{ \begin{aligned} v_{k+1} &= \underbrace{\mathcal{T}_\Omega^{\pi_k}\circ\mathcal{T}_\Omega^{\pi_k} \circ \ldots \circ \mathcal{T}_\Omega^{\pi_k}}_{n}(v_k)\;,\\ \pi_{k+1}(s) &= \nabla\Omega^\star\big(q_{k+1}(s)\big)\;, \end{aligned}\right. $$ where $q_{k+1}$ are the $q$-values associated with $v_{k+1}$. A concrete sketch for the entropic case follows the note below.

Note
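To make the value iteration scheme above concrete, here is a minimal NumPy sketch for the entropic regulariser (anticipating the next section), where $\mathcal{T}_\Omega^\star$ reduces to a log-sum-exp and the greedy step to a softmax. It reuses the illustrative `r`, `P`, `lam` arrays from the earlier sketch.

```python
import numpy as np
from scipy.special import logsumexp

def soft_value_iteration(r, P, lam, n_iters=1000):
    """Regularised value iteration with the entropic regulariser."""
    v = np.zeros(r.shape[0])
    for _ in range(n_iters):
        q = r + lam * (P @ v)            # q-values associated with the current v
        v = logsumexp(q, axis=1)         # T_Omega^*(v)(s) = Omega^*(q(s))
    pi = np.exp(q - v[:, None])          # greedy policy: row-wise softmax = grad Omega^*
    return v, pi
```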

With approximations

Beyond classical control algorithms, some modern approaches can be reformulated to fall within the family of regularised algorithms. The most famous examples are derived from entropic regularisation, that is, $\Omega(p) = \sum_{a\in\mathcal{A}} p_a\log p_a$ with $\Omega^\star(y) = \log[\sum_{a\in\mathcal{A}}\exp(y_a)]$ being the smoothed maximum. For instance, the soft q-learning of [2] can be seen as an extension of the $\Omega$-regularised value iteration algorithm (extended to support stochastic approximation and function approximation). There, the main difficulty comes from the evaluation of $\Omega^\star$ for continuous action spaces (it involves an integral over $\mathcal{A}$). Similarly, the soft actor-critic of [3] can be seen as an instance of the regularised generalised policy iteration scheme with the entropic regulariser.

Extension

Mirror descent

Let us take a short break to go through some additional convex optimisation refreshers. In what follows, for any $x, y\in\Delta_\mathrm{A}$ we denote $d_\Omega(x \|\ y)$ the Bregman divergence of $x$ w.r.t. $y$ associated with $\Omega$. That is: $$ d_\Omega(x \|\ y) = \Omega(x) - \Omega(y) - \nabla\Omega(y)^\top(x-y)\;, $$ which is the difference between the value of $\Omega$ at $x$ and its first-order approximation around $y$. Observe that by strong convexity of $\Omega$, the Bregman divergence is a non-negative, strongly convex function of $x$. It equals 0 if and only if $x=y$, but (in general) it is not symmetric and breaks the triangle inequality, so it is not a distance.

Examples
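For instance, taking $\Omega$ to be the negative entropy, $\Omega(x) = \sum_a x_a\log x_a$, and $x, y\in\Delta_\mathrm{A}$, the Bregman divergence reduces to the Kullback-Leibler divergence: $$ d_\Omega(x \|\ y) = \sum_a x_a\log x_a - \sum_a y_a\log y_a - \sum_a (\log y_a + 1)(x_a - y_a) = \sum_a x_a\log\frac{x_a}{y_a} = \mathrm{KL}(x\,\|\,y)\;, $$ where the last equality uses $\sum_a x_a = \sum_a y_a = 1$.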

The mirror descent algorithm is a generalisation of the gradient descent algorithm: instead of penalising movement away from the current iterate with a squared Euclidean distance, it leverages a Bregman divergence surrogate. Concretely, the mirror descent update trades off between a linear approximation of some objective function $f:\mathbb{R}^\mathrm{A}\mapsto\mathbb{R}$ around the current iterate $x_t$ and the Bregman divergence associated with $\Omega$ taken w.r.t. $x_t$: $$ \tag{6} x_{t+1} := \argmin_x \big\{ f(x_t) + \nabla f(x_t)^\top(x-x_t) + d_\Omega(x\|x_t)\big\}\;. $$ Observe that when $\Omega = \frac{1}{2}\|\cdot\|_2^2$ we retrieve the gradient descent update (the step size being folded into the divergence). The point of (6) is to allow the update’s regulariser to better capture the geometry induced by $f$ and its feasible set. A concrete instantiation on the simplex is sketched after the note below.

Note
$\quad$ This is called the proximal view of mirror descent. There is also a fascinating mirror-map view which generalises gradient descent beyond Hilbert spaces (for which the representer theorem holds).
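As a concrete sketch of (6), with an explicit step size $\eta$ scaling the divergence (which (6) keeps implicit), the negative-entropy regulariser turns the update into the exponentiated-gradient rule on the simplex. The objective and the names below are illustrative placeholders.

```python
import numpy as np

def mirror_descent_step(x, grad, eta):
    """argmin over the simplex of <grad, x> + (1/eta) * KL(x || x_t),
    i.e. the exponentiated-gradient / multiplicative-weights update."""
    y = x * np.exp(-eta * grad)
    return y / y.sum()

# Example: minimise f(x) = ||x - c||^2 over the simplex (c lies inside it).
c = np.array([0.7, 0.2, 0.1])
x = np.ones(3) / 3
for _ in range(200):
    x = mirror_descent_step(x, grad=2 * (x - c), eta=0.5)

assert np.allclose(x, c, atol=1e-2)
```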

Dynamic regularisation

So far we have pursued static regularisers $\Omega(\pi)$ which, in spite of potential benefits (e.g. additional stability when applied to policy optimisation), do bias the fixed point away from the original discounted objective’s solution. In an effort to maintain satisfying asymptotic performance, it is tempting to adjust the regularisation dynamically. We can, for instance, encourage the next iterate of a (generalised) policy improvement step to remain ‘close’ to the current policy. Proximity can be measured by, e.g., the Bregman divergence $d_\Omega(\cdot\|\pi_t)$, and plugged into the regularised operators introduced in the previous section. The improvement step then resembles a mirror descent update: for all $s\in\mathcal{S}$, $$ \pi_{t+1}(s) \in \argmax_\pi \big\{\pi(s)^\top q(s) - d_\Omega(\pi(s)\|\pi_t(s))\big\}\;. $$
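For instance, when $\Omega$ is the negative entropy, $d_\Omega$ is the KL divergence (see above) and the improvement step admits a closed form, a multiplicative reweighting of the current policy: $$ \pi_{t+1}(a\vert s) = \frac{\pi_t(a\vert s)\,\exp\big(q(s, a)\big)}{\sum_{a^\prime}\pi_t(a^\prime\vert s)\,\exp\big(q(s, a^\prime)\big)}\;. $$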

Several modern algorithms fall into this family of dynamically regularised approaches, such as PPO and MPO, which both use a Kullback-Leibler divergence as a regulariser (and, of course, diverge from each other on many other aspects, such as policy evaluation).

Reference

This entire blog post is essentially my transcription of the excellent paper:

[1] A Theory of Regularized Markov Decision Processes. Geist et al., 2019.

Other references include:

[2] Reinforcement Learning with Deep Energy-Based Policies. Haarnoja et al., 2017.
[3] Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. Haarnoja et al., 2018.