Control in Entropy Regularized MDPs
This post introduces the Maximum Entropy (MaxEnt) framework for Reinforcement Learning. The ambition is to see how “standard” control algorithms like value or policy iteration translate to the MaxEnt setting, and describe their convergence properties. We will also discuss how said control approach influenced the design of some popular modern algorithms such as the Soft Actor Critic (SAC).
Problem formulation
Let $\mathcal{M} = (\mathcal{S}, \mathcal{A}, p, r)$ be an MDP. For simplicity we will assume throughout that $\mathcal{M}$ is finite: $\vert \mathcal{S} \vert \times \vert \mathcal{A}\vert <+\infty$. Let’s briefly remind ourselves of the “classical” discounted objective for solving MDPs (see this post for a complete refresher). With $\lambda \in [0, 1)$, recall that the discounted cumulative reward objective writes: $$ \tag{1} v_\lambda^\star := \max_{\pi\in\mathcal{S}^{\text{MD}}}\mathbb{E}^\pi\left[\sum_{t\geq 1} \lambda^{t-1} r(s_t, a_t) \right] \; . $$ Above, $\mathcal{S}^\text{MD}$ denotes the set of stationary, Markovian, deterministic policies. In this post we will only be concerned with stationary policies; we will freely conflate the notions of policy and decision rule.
Entropy regularisation
For entropy-regularised MDPs the objective is augmented with an entropy penalty. Formally, with $\alpha \in \mathbb{R}^+$: $$ \tag{2} v_{\lambda, \alpha}^\star := \max_{\pi \in \mathcal{S}^\text{MR}}\mathbb{E}^\pi\left[\sum_{t\geq 1} \lambda^{t-1} \{r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot\vert s_t))\}\right]\;, $$ where $\mathcal{H}(\pi(\cdot\vert s_t)) = -\sum_{a\in\mathcal{A}} \pi(a\vert s_t)\log\pi(a\vert s_t)$ is the entropy of $\pi(\cdot\vert s_t)$. Notice how we now optimise over the set of randomised stationary policies $\mathcal{S}^\text{MR}$; unlike the classical MDP case, we can no longer anticipate that the optimal policy is deterministic. Below, we will denote by $\pi^\star_\alpha$ the maximum-entropy optimal policy. Notice that since $\mathcal{H}(\pi(\cdot\vert s_t)) = \mathbb{E}_{a_t\sim\pi(\cdot\vert s_t)}[-\log\pi(a_t\vert s_t)]$, the objective (2) can be exactly rewritten as an optimisation over a new MDP with a modified reward function: $$ \tag{3} v_{\lambda, \alpha}^\star := \max_{\pi \in \mathcal{S}^\text{MR}}\mathbb{E}^\pi\left[\sum_{t\geq 1} \lambda^{t-1} \{r(s_t, a_t) - \alpha \log\pi(a_t\vert s_t)\}\right]\;. $$ The entropy-regularised objective promotes policies that trade off visiting highly rewarding (state, action) pairs against visiting states where the policy has kept its fair share of randomness. It is natural to ask why we would be more interested in solving (2) than (1). Ultimately, we are only concerned about the cumulative reward captured by (1). However, entropy regularisation in MDPs comes with several practical benefits. The obvious one is that it creates a natural framework to promote (shallow) exploration by encouraging the visit of states for which the policy is still fairly random. Further, a max-entropy policy can also keep track of several alternative ways to maximise the discounted reward; this is often argued to make it more robust than its deterministic counterpart. Several research groups have observed that such policies generalise better under environmental shift [1] and are easier to fine-tune to new environments or to cross a sim-to-real gap [2]. There is also an argument to be made for stability; by keeping policies off the edges of the simplex $\Delta(\mathcal{A})$, policy improvement objectives remain smoother, which eases optimisation from a numerical standpoint.
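To see concretely why (2) and (3) coincide, recall that the entropy is simply the expectation of $-\log\pi$ under the policy itself. Below is a tiny numerical check of this identity at a single state, with an arbitrary random policy; all names are illustrative, this is only a sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
pi_s = rng.dirichlet(np.ones(4))   # some randomised policy pi(.|s) over 4 actions

# Entropy bonus as it appears in (2).
entropy = -np.sum(pi_s * np.log(pi_s))

# Monte-Carlo estimate of E_{a ~ pi(.|s)}[-log pi(a|s)], the modified reward of (3).
actions = rng.choice(len(pi_s), size=100_000, p=pi_s)
mc_estimate = np.mean(-np.log(pi_s[actions]))

print(entropy, mc_estimate)   # the two values agree up to Monte-Carlo noise
```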
Additional notations
Our objective in this blog-post is to understand what control algorithms (similar to value and policy iteration in MDPs) look like in entropy-regularised MDPs. We will need a couple of additional notations. For any policy $\pi\in\mathcal{S}^\text{MR}$ we will write its soft value function, for any $s\in\mathcal{S}$: $$ v_{\lambda, \alpha}^\pi(s) := \mathbb{E}_s^\pi\left[\sum_{t\geq 1} \lambda^{t-1} \{r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot\vert s_t))\}\right]\; . $$ As in classical MDPs, the state-action value function writes $q_{\lambda, \alpha}^\pi(s, a) = r(s, a) + \lambda\mathbb{E}_{s^\prime\sim p(\cdot\vert s, a)}[v_{\lambda, \alpha}^\pi(s^\prime)]$, the optimal soft q-value function $q_{\lambda, \alpha}^\star$ being naturally defined via this identity.
Soft Value Iteration
To mirror our MDP control post, we will start by deriving the entropy-regularised version of the Value Iteration (VI) algorithm – called Soft Value Iteration (SVI). We will now go through the usual “dance”: introduce an operator $\mathcal{T}_{\lambda, \alpha}^\star$, show that the soft-optimal q-function is a fixed-point of said operator, and finally show that this operator is contracting. As in the “classical” VI case, this will be enough to prove convergence of the sequence $q_{t+1} = \mathcal{T}_{\lambda, \alpha}^\star(q_{t})$ to $q_{\lambda, \alpha}^\star$.
Soft-Bellman Operator
The soft Bellman operator $\mathcal{T}_{\lambda, \alpha}^\star: \mathbb{R}^{\mathcal{S}\times\mathcal{A}} \mapsto \mathbb{R}^{\mathcal{S}\times\mathcal{A}}$ writes for any $q\in\mathbb{R}^{\mathcal{S}\times\mathcal{A}}$ and $s, a \in \mathcal{S}\times\mathcal{A}$:
$$ \begin{aligned} \mathcal{T}_{\lambda, \alpha}^\star(q)(s, a) = r(s, a) + \lambda\alpha \mathbb{E}_{s^\prime \sim p(\cdot\vert s, a)} \left[\log \sum_{a^\prime \in \mathcal{A}} \exp(q(s^\prime, a^\prime)/\alpha)\right]\; . \end{aligned} $$ To make sense of this expression it is useful to compare it to the “hard” Bellman operator: $\mathcal{T}_{\lambda}^\star(q)(s, a) = r(s, a) + \lambda \mathbb{E}_{s^\prime} \left[\max_{a^\prime\in\mathcal{A}} q(s^\prime, a^\prime)\right]$. This is where the hard-soft distinction appears. It is easy to see that as $\alpha\to 0$, $\alpha \log \sum_{a^\prime \in \mathcal{A}} \exp(q(s^\prime, a^\prime)/\alpha) \to \max_{a^\prime\in\mathcal{A}} q(s^\prime, a^\prime)$. On the other hand, as $\alpha\to \infty$ we end up averaging the q-values over the action space (up to an additive $\alpha\log\vert\mathcal{A}\vert$ offset). In other words, $\mathcal{T}_{\lambda, \alpha}^\star$ performs a soft-maximum over the actions, with temperature $\alpha$.
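As a concrete illustration, here is a minimal tabular sketch of $\mathcal{T}_{\lambda, \alpha}^\star$; the array shapes and function names are mine, not canonical.

```python
import numpy as np
from scipy.special import logsumexp

def soft_bellman_optimality_operator(q, r, p, lam, alpha):
    """Apply T*_{lam, alpha} to a tabular q-function.

    Assumed shapes: q, r of shape (S, A); p of shape (S, A, S) with p[s, a, s'] = p(s'|s, a).
    """
    # Soft-maximum over next actions: alpha * log-sum-exp of q/alpha, one value per state.
    v_soft = alpha * logsumexp(q / alpha, axis=1)        # shape (S,)
    # Expected soft-maximum at the next state, discounted and added to the reward.
    return r + lam * p @ v_soft                          # shape (S, A)
```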
Algorithm
As in the non-regularised setting, the algorithm is a simple fixed-point iteration: starting from any $q_0\in\mathbb{R}^{\mathcal{S}\times\mathcal{A}}$, repeatedly apply $q_{t+1} = \mathcal{T}_{\lambda, \alpha}^\star(q_t)$ until the iterates stabilise.
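A minimal tabular sketch of Soft qVI, reusing the soft_bellman_optimality_operator helper sketched above; the stopping criterion and names are illustrative.

```python
def soft_q_value_iteration(r, p, lam, alpha, tol=1e-8, max_iters=10_000):
    """Iterate q <- T*_{lam, alpha}(q) until the sup-norm residual falls below tol."""
    q = np.zeros_like(r)
    for _ in range(max_iters):
        q_next = soft_bellman_optimality_operator(q, r, p, lam, alpha)
        if np.max(np.abs(q_next - q)) < tol:
            return q_next
        q = q_next
    return q
```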
Convergence
Proving convergence of the Soft qVI algorithm to the soft-optimal q-function follows the proof mechanism of the non-regularised case. It starts by claiming (and proving) that $q_{\lambda, \alpha}^\star$ is a fixed-point of $\mathcal{T}_{\lambda, \alpha}^\star$.
An important tool that we will use throughout this post is the soft-improvement update. Consider some policy $\pi\in\mathcal{S}^\text{MR}$ and define $\forall s, a \in\mathcal{S}\times\mathcal{A}$: $$ \tag{4} \pi^\prime(a\vert s) = \exp(q_{\lambda, \alpha}^\pi(s, a) /\alpha - z_{\lambda, \alpha}^\pi(s)/\alpha) \;, $$ where $ z_{\lambda, \alpha}^\pi(s) = \alpha \log \sum_{a\in\mathcal{A}} \exp(q_{\lambda, \alpha}^\pi(s, a)/\alpha)$ only plays the part of a normalising constant. You can indeed also write (4) through the more concise $ \pi^\prime(\cdot\vert s) \propto \exp(q_{\lambda, \alpha}^\pi(s, \cdot) / \alpha)\; . $ Through those equivalent definitions of $\pi^\prime$, we have introduced a policy soft-improvement operator. Indeed, for all $s, a \in\mathcal{S}\times\mathcal{A}$ we have: $$\tag{5} q_{\lambda, \alpha}^{\pi^\prime}(s, a) \geq q_{\lambda, \alpha}^\pi(s, a)\;.$$
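In code, the soft-improvement step (4) is just a temperature-scaled softmax of the q-values; a sketch in the tabular setting, with illustrative names:

```python
from scipy.special import softmax

def soft_improvement(q, alpha):
    """Soft policy improvement (4): pi'(a|s) proportional to exp(q(s, a) / alpha).

    q is a (S, A) array; the normaliser z(s) is handled implicitly by the softmax.
    """
    return softmax(q / alpha, axis=1)
```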
This tool is useful to prove the announced claim. Let $\pi_\alpha^\star$ be the soft-optimal policy. Since it cannot be soft-improved (by definition), it must be a fixed point of (4). Hence, for all $s, a\in\mathcal{S}\times\mathcal{A}$: $$ \tag{6} \pi_\alpha^\star(a\vert s) = \exp(q_{\lambda, \alpha}^{\pi_\alpha^\star}(s, a) /\alpha - z_{\lambda, \alpha}^{\pi_\alpha^\star}(s)/\alpha)\;. $$ This is enough to yield: $$ \begin{aligned} q_{\lambda, \alpha}^{\pi_\alpha^\star}(s, a) &= r(s, a) + \lambda \mathbb{E}_{s^\prime\sim p(\cdot \vert s, a)}\mathbb{E}_{a^\prime\sim \pi_{\alpha}^\star(\cdot\vert s^\prime)} \left[q_{\lambda, \alpha}^{\pi_\alpha^\star}(s^\prime, a^\prime)- \alpha\log \pi_{\alpha}^\star(a^\prime\vert s^\prime)\right]\;, &(\text{by def.})\\ &= r(s, a) + \lambda \mathbb{E}_{s^\prime\sim p(\cdot \vert s, a)} [z_{\lambda, \alpha}^{\pi_\alpha^\star}(s^\prime)] \;, &(\text{by (6)})\\ &= r(s, a) + \lambda\alpha \mathbb{E}_{s^\prime\sim p(\cdot \vert s, a)} \log \sum_{a^\prime\in\mathcal{A}} \exp(q_{\lambda, \alpha}^{\pi_\alpha^\star}(s^\prime, a^\prime)/\alpha)\; . \end{aligned} $$ In other words, $q_{\lambda, \alpha}^{\pi_\alpha^\star}$ is a fixed-point of $\mathcal{T}_{\lambda, \alpha}^{\star}$; for all $s, a\in\mathcal{S}\times\mathcal{A}$: $$ q_{\lambda, \alpha}^{\pi_\alpha^\star}(s, a) = \mathcal{T}_{\lambda, \alpha}^{\star}(q_{\lambda, \alpha}^{\pi_\alpha^\star})(s, a) \; . $$
As in the classical control case, what we have left to do is show that $\mathcal{T}_{\lambda, \alpha}^{\star}$ is a contracting operator. This is actually quite straightforward: we can replicate the classical VI proof by incorporating the entropy inside the reward function – see (3). The convergence of Soft-qVI to $q_{\lambda, \alpha}^\star$ is then a formality (a replication of the Banach fixed-point theorem’s proof): the sequence $q_{t+1} = \mathcal{T}_{\lambda, \alpha}^\star(q_t)$ converges to $q_{\lambda, \alpha}^\star$ from any initialisation, at a linear rate governed by $\lambda$.
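Alternatively, a direct argument only needs the fact that the $\alpha$-scaled log-sum-exp is 1-Lipschitz for the sup-norm; for any $q, q^\prime \in \mathbb{R}^{\mathcal{S}\times\mathcal{A}}$ and $s, a\in\mathcal{S}\times\mathcal{A}$: $$ \begin{aligned} \Big\vert \alpha \log \sum_{a\in\mathcal{A}} e^{q(s,a)/\alpha} - \alpha \log \sum_{a\in\mathcal{A}} e^{q^\prime(s,a)/\alpha} \Big\vert &\leq \max_{a\in\mathcal{A}} \vert q(s,a) - q^\prime(s,a)\vert \;, \\ \text{hence } \quad \Vert \mathcal{T}_{\lambda, \alpha}^\star(q) - \mathcal{T}_{\lambda, \alpha}^\star(q^\prime) \Vert_\infty &\leq \lambda \Vert q - q^\prime \Vert_\infty \; . \end{aligned} $$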
Soft Policy Iteration
Most of the groundwork needed to introduce and analyse Soft Policy Iteration (SPI) is now already done! Mirroring the Policy Iteration algorithm, we will alternate between soft policy evaluation and soft policy improvement. Soft-policy evaluation is fairly straightforward. By the identity: $$ q_{\lambda, \alpha}^\pi(s, a) = r(s, a) + \lambda\mathbb{E}_{s^\prime\sim p(\cdot\vert s, a)} \mathbb{E}_{a^\prime\sim \pi(\cdot\vert s^\prime)}\left[q_{\lambda, \alpha}^\pi(s^\prime, a^\prime) - \alpha \log \pi(a^\prime\vert s^\prime) \right] \; , $$ we can fall back to the classical policy evaluation framework and compute $q_{\lambda, \alpha}^\pi$ by solving a linear system. Alternatively, one can write the equivalent soft-policy operator $\mathcal{T}_{\lambda, \alpha}^\pi$ and go through a fixed-point algorithm to compute $q_{\lambda, \alpha}^\pi$ (after showing that $\mathcal{T}_{\lambda, \alpha}^\pi$ is contracting and that $q_{\lambda, \alpha}^\pi$ is its fixed-point). Soft policy improvement was already covered in (4); any stationary policy $\pi$ can be soft-improved into a policy $\pi^\prime$ where: $$ \tag{7} \pi^\prime(\cdot\vert s) \propto \exp(q_{\lambda, \alpha}^\pi(s, \cdot)/\alpha) \; . $$
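A sketch of the linear-system route in the tabular case; shapes and names are illustrative, and the policy's entropy term is folded into the right-hand side of the system.

```python
def soft_policy_evaluation(pi, r, p, lam, alpha):
    """Compute q^pi_{lam, alpha} exactly by solving an (S*A) x (S*A) linear system.

    Assumed shapes: pi, r of shape (S, A); p of shape (S, A, S).
    """
    S, A = r.shape
    # Per-state policy entropy H(pi(.|s)); the small constant guards against log(0).
    entropy = -np.sum(pi * np.log(pi + 1e-12), axis=1)                   # (S,)
    # State-action transition matrix: P[(s, a), (s', a')] = p(s'|s, a) * pi(a'|s').
    P = (p[:, :, :, None] * pi[None, None, :, :]).reshape(S * A, S * A)
    # Rearranging the identity: (I - lam * P) q = r + lam * alpha * E_{s'}[H(pi(.|s'))].
    b = (r + lam * alpha * (p @ entropy)).reshape(S * A)
    q = np.linalg.solve(np.eye(S * A) - lam * P, b)
    return q.reshape(S, A)
```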
Algorithm
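Putting the two steps together, here is a minimal tabular sketch of SPI, reusing the soft_policy_evaluation and soft_improvement helpers sketched above (initialisation and stopping rule are illustrative):

```python
def soft_policy_iteration(r, p, lam, alpha, max_iters=100):
    """Alternate soft policy evaluation and soft policy improvement."""
    S, A = r.shape
    pi = np.full((S, A), 1.0 / A)                              # start from the uniform policy
    q = soft_policy_evaluation(pi, r, p, lam, alpha)
    for _ in range(max_iters):
        pi_next = soft_improvement(q, alpha)                   # improvement step, eq. (4)
        q = soft_policy_evaluation(pi_next, r, p, lam, alpha)  # evaluation step
        if np.allclose(pi_next, pi):
            break
        pi = pi_next
    return pi, q
```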
Convergence
Similarly to the classical PI, proving the convergence of SPI uses a monotonicity argument.
The key monotonicity property is the soft-improvement guarantee (5), which we already relied on when proving the convergence of the Soft-qVI algorithm! Hence $\{q_{\lambda, \alpha}^{\pi_{t}}\}_t$ is an increasing, bounded sequence. It therefore converges; further, it must converge to $q_{\lambda, \alpha}^\star$ – or else it could be further improved. That does it for the convergence of SPI!
RL algorithms
Just like qVI can be turned into the practical RL algorithm that is Q-learning, Soft qVI can go through similar stochastic and functional approximation hoops (see here) to move from the control to the RL realm. The most salient difficulty comes from the inner sum over the action space $\sum_a \exp (q_t(s, a)/\alpha)$ present in the soft Bellman operator $\mathcal{T}_{\lambda, \alpha}^\star$. To avoid the computational cost of computing this term when the action space gets large, [1] suggest approximating it via importance sampling. Together with the design of a sampling network to approximately sample $a \sim \exp(q_t(s, \cdot)/\alpha)/\eta$, this yields their Soft Q-learning algorithm.
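For intuition, in the discrete-action case the regression target only differs from the usual Q-learning target through the soft-maximum; a sketch (batch shapes and names are illustrative, and it is precisely this log-sum-exp term that [1] approximate by importance sampling when actions are continuous):

```python
def soft_q_target(r_batch, q_next, lam, alpha):
    """Soft Q-learning regression target for a batch of transitions.

    r_batch: (B,) rewards; q_next: (B, A) q-values at the sampled next states
    (e.g. from a target network), for a discrete action space.
    """
    # Soft state value of s': the same alpha-scaled log-sum-exp as in T*_{lam, alpha}.
    v_next = alpha * logsumexp(q_next / alpha, axis=1)
    return r_batch + lam * v_next
```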
Similarly, the Soft Actor Critic [2] is the RL counterpart of the control algorithm that is Soft (Generalised) Policy Iteration. Apart from many tricks designed to stabilise RL training with neural networks, the main idea is to “project” the policy improvement step back onto the parametric class $\Pi$ covered by the chosen neural network architecture: $$ \pi_{t+1}(\cdot\vert s) \in \argmin_{\pi\in\Pi} \text{KL} \left(\pi(\cdot\vert s) \,\vert\vert\, \exp(q_{\lambda, \alpha}^{\pi_t}(s, \cdot) /\alpha)/\eta\right)\; . $$
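Expanding the KL divergence shows why this projection is tractable even though the normaliser $\eta$ is unknown; since $\eta$ does not depend on $\pi$, we have: $$ \text{KL}\left(\pi(\cdot\vert s) \,\vert\vert\, \exp(q_{\lambda, \alpha}^{\pi_t}(s, \cdot)/\alpha)/\eta\right) = \mathbb{E}_{a\sim\pi(\cdot\vert s)}\left[\log\pi(a\vert s) - q_{\lambda, \alpha}^{\pi_t}(s, a)/\alpha\right] + \log\eta \;, $$ so minimising over $\pi\in\Pi$ amounts to minimising $\mathbb{E}_{a\sim\pi(\cdot\vert s)}[\alpha\log\pi(a\vert s) - q_{\lambda, \alpha}^{\pi_t}(s, a)]$, which is (up to scaling) the policy loss optimised in SAC.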
Finally, the entropy-regularised objective (2) can easily be turned into an entropy-constrained one: $$ \begin{aligned} v_{\lambda, \beta}^\star := \max_{\pi \in \mathcal{S}^\text{MR}}&\mathbb{E}^\pi\left[\sum_{t\geq 1} \lambda^{t-1} r(s_t, a_t)\right]\;,\\ &\text{s.t. } \mathbb{E}^\pi[\mathcal{H}(\pi(\cdot\vert s_t))] \geq \beta\; . \end{aligned} $$ Because it is set through direct considerations on a physical quantity (the entropy itself), the parameter $\beta$ is easier to decide on (and does not have to be chosen depending on the reward’s scale). Concretely, training does not change much, except that the temperature $\alpha$ is now updated dynamically, based on the current entropy of the policy.
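A rough sketch of what such a dynamic temperature update can look like, as a simplified version of the dual update proposed in [2]; the names and the plain gradient step are illustrative (SAC optimises the temperature with a stochastic gradient optimiser):

```python
import numpy as np

def temperature_update(alpha, log_pi_batch, target_entropy, lr=1e-3):
    """One dual-style gradient step on the temperature alpha.

    log_pi_batch   : (B,) log pi(a|s) for actions sampled from the current policy
    target_entropy : the constraint level beta in the constrained objective
    """
    # Gradient of E[-alpha * (log pi(a|s) + beta)] with respect to alpha.
    grad = -np.mean(log_pi_batch + target_entropy)
    # Increases alpha when the policy's entropy drops below beta, decreases it otherwise.
    return max(alpha - lr * grad, 0.0)
```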
References
[1] Haarnoja et al., 2017. Reinforcement Learning with Deep Energy-Based Policies.
[2] Haarnoja et al., 2019. Soft Actor-Critic Algorithms and Applications.
[3] Geist et al., 2019. A Theory of Regularized Markov Decision Processes.
[4] Vieillard et al., 2021. Leverage the Average: an Analysis of KL Regularization in Reinforcement Learning.