A PPO Saga
For better or for worse, proximal policy optimisation (PPO) algorithms and their variants dominate the RL landscape these days. This post aims to retrace their journey, from foundational concepts to LLM-savvy innovations. We will start this saga on the theoretical trail, which we will progressively abandon to pay closer attention to algorithmic aspects.
Warm-up
Preliminaries
Let $\mathcal{M} = (\mathcal{S}, \mathcal{A}, p, r)$ be the finite MDP of interest. The finiteness assumption is without loss of generality and adopted for the sake of simplicity; all arguments hold when replacing sums with integrals (up to a dominated convergence theorem application here and there). We study $\mathcal{M}$ through a discounted objective: $$ \tag{1} v_\pi := \mathbb{E}^\pi_\nu\Big[\sum_{t\geq 1}\lambda^{t-1}r(s_t, a_t)\Big]\;, $$ where $\nu$ is the starting-state distribution and $\pi$ is a stationary policy. The state-action values will be denoted $q_{\pi}$, and the advantage function $\text{a}_\pi$. We use state-indexed notation like $v_\pi(s)$ when conditioning on the starting state $s_1=s$. Throughout this post, we will alternate between time-indexed definitions and their equivalent state-occupancy counterparts. For instance, by defining the discounted state-occupancy measure $ d_\pi(s) := \sum_{t\geq 1} \lambda^{t-1} \mathbb{P}^\pi_\nu(s_t=s) $, we can re-write the value of $\pi$ as:
$$ \tag{2} v_\pi = \sum_{s}d_\pi(s)\sum_{a} \pi(a\vert s)r(s, a)\;. $$
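For completeness, this identity follows from exchanging the sums over time and states (licit by finiteness, or by dominated convergence in the general case): $$ \mathbb{E}^\pi_\nu\Big[\sum_{t\geq 1}\lambda^{t-1}r(s_t, a_t)\Big] = \sum_{t\geq 1}\lambda^{t-1}\sum_{s}\mathbb{P}^\pi_\nu(s_t=s)\sum_{a}\pi(a\vert s)\,r(s, a) = \sum_{s}d_\pi(s)\sum_{a}\pi(a\vert s)\,r(s, a)\;. $$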
Policy Iteration
Let’s start with a refresher on the policy iteration (PI) algorithm. At every round $t$, PI starts with a policy evaluation step. It consists of computing $v_{\pi_t}(s)$ for all $s\in\mathcal{S}$ (in finite MDPs, this amounts to solving a linear system), which in turn directly gives us access to the state-action values $q_{\pi_t}(s, a)$. PI then proceeds with a policy improvement step, which creates a new deterministic policy by greedily following the best action prescribed by the state-action values of the previous iterate: $$ \tag{3} \pi_{t+1}(s) = \argmax_{a\in\mathcal{A}} q_{\pi_t}(s, a)\;. $$
One particularly useful property of this algorithm is monotonic improvement. Concretely, if $(\pi_1, \ldots, \pi_t)$ is a sequence of stationary policies generated by PI, then it is guaranteed that $v_{\pi_1} \leq \ldots \leq v_{\pi_t}$. In particular, this property ensures that PI converges in finite time in finite MDPs.
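To make this concrete, here is a minimal sketch of PI on a tabular MDP, assuming access to the full transition kernel and reward table (the array names `P` and `r` are illustrative):

```python
import numpy as np

def policy_iteration(P, r, lam):
    """Exact policy iteration on a finite MDP.

    P:   transition kernel, shape (S, A, S), with P[s, a, s'] = p(s' | s, a)
    r:   reward table, shape (S, A)
    lam: discount factor in [0, 1)
    """
    S, A, _ = P.shape
    pi = np.zeros(S, dtype=int)            # deterministic policy: one action per state
    while True:
        # Policy evaluation: solve the linear system (I - lam * P_pi) v = r_pi.
        P_pi = P[np.arange(S), pi]          # (S, S)
        r_pi = r[np.arange(S), pi]          # (S,)
        v = np.linalg.solve(np.eye(S) - lam * P_pi, r_pi)
        # Policy improvement: act greedily w.r.t. the state-action values.
        q = r + lam * P @ v                 # (S, A)
        new_pi = q.argmax(axis=1)
        if np.array_equal(new_pi, pi):      # no state changes its action: done
            return pi, v
        pi = new_pi
```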
Policy iteration is a foundational concept for countless RL algorithms; virtually all actor-critic methods build on it. Thanks to its monotonic improvement, however, it enjoys something deep RL approaches envy: stability.
Improving Policies
When parametrising policies via deep neural networks, one cannot directly perform the policy improvement step (3) and must rely on policy optimisation instead. Under function approximation, expecting (3) to hold at every state is unrealistic. Instead, we set out here for an alternative way to capture monotonic improvement. To this end, consider the following way to express the value difference between two stationary policies $\pi, \, \pi^\prime$:
$$ \tag{4} \begin{aligned} v_{\pi^\prime} - v_{\pi} &= \mathbb{E}^{\pi^\prime}_\nu\Big[\sum_{t \geq 1}\lambda^{t-1} \text{a}_{\pi}(s_t, a_t)\Big]\;,\\ &= \sum_{s}d_{\pi^\prime}(s) \sum_{a} \pi^\prime(a\vert s) \,\text{a}_{\pi}(s, a)\;. \end{aligned} $$
(4) reveals that a sufficient condition for policy improvement (i.e. $v_{\pi^\prime}\geq v_\pi$) is that $\pi^\prime$, on average, selects actions that come with a positive advantage under $\pi$. In other words, it suffices that for every state $s\in\mathcal{S}$: $$ \sum_{a} \pi^\prime(a\vert s) \,\text{a}_\pi(s, a) \geq 0 \;. $$
It is tempting to use (4) as a basis for algorithmic design. By maximising the r.h.s, we get a recipe for monotonic improvement. However, $d_{\pi^\prime}$ depends on $\pi^\prime$ in intricate ways, which is not ideal for optimisation. Also, it would be more economical to reuse whatever samples were drawn from $\pi$ to estimate its advantage $\text{a}_\pi$ during a policy evaluation step. To meet this objective, we can decide to optimise the following alternative: $$ \tag{5} \ell ({\pi^\prime}\vert {\pi}) := \sum_{s}d_{\pi}(s) \sum_{a} {\pi^\prime}(a\vert s) \,\text{a}_{\pi}(s, a)\;. $$
Observe that we now sample under the state-occupancy distribution induced by $\pi$, easing estimation and optimisation. Optimising (5) sure is handy, but as we moved away from (4) there is no longer any improvement guarantee, at least at first glance. In the following section, we cover a lower bound that motivates maximising $\ell(\pi^\prime\vert\pi)$ over a trust region, which will give us back improvement guarantees.
Performance Bound

Statement
Define the maximum total-variation distance between $\pi$ and $\pi^\prime$ as $ \|\pi-\pi^\prime\|_\infty:=\max_{s\in\mathcal{S}} d_\text{TV}(\pi(\cdot\vert s),\pi^\prime(\cdot\vert s))\;. $ We will be shamelessly hiding constants and expressions behind the symbol $\small\blacksquare$; not only universal constants but also functions of the discount, the policies, etc., in an effort to reduce clutter and ease comprehension. The following bound establishes that the performance gap between $\pi^\prime$ and $\pi$ can indeed be explained by $\ell(\pi^\prime\vert \pi)$, but gets looser as policies become too dissimilar – as measured by $\|\pi-\pi^\prime\|_\infty$: $$ \tag{6} v_{\pi^\prime} - v_\pi \geq \ell(\pi^\prime\vert \pi) - \blacksquare\,\|\pi-\pi^\prime\|_\infty^2\;. $$
This bound is tight at $\pi^\prime=\pi$: both sides vanish when the policies match, since $\ell(\pi\vert\pi)=0$. This gives us back guaranteed improvement: if $\pi^\star$ maximises the r.h.s of (6), that maximum is at least the value reached at $\pi^\prime=\pi$, namely zero, so $v_{\pi^\star}\geq v_\pi$. We illustrate this statement in Fig. 1 above. This motivates a new policy improvement step. With $\eta$ some pre-specified hyperparameter, at every round $t$ we compute a new, improved policy via: $$ \tag{7} \pi_{t+1} \in\argmax_{\pi^\prime} \left\{\ell(\pi^\prime\vert\pi_t) - \eta \| \pi_t- \pi^\prime\|_\infty^2\right\}\;. $$
Proof
Let’s now move on to proving (6). The first step is to relate $\ell(\pi^\prime\vert \pi)$ to the state occupancy measures.
$$ \tag{8} \ell(\pi^\prime\vert \pi) = v_{\pi^\prime} - v_\pi + \sum_{s}(d_\pi(s)-d_{\pi^\prime}(s))\sum_{a}(\pi^\prime(a\vert s)-\pi(a\vert s))\,\text{a}_\pi(s, a) \;. $$
This identity can be turned into the following lower bound by controlling its last term (two applications of Hölder’s inequality): $$ \tag{9} v_{\pi^\prime} - v_\pi \geq \ell(\pi^\prime\vert \pi) - \blacksquare\|\pi-\pi^\prime\|_\infty\|d_\pi-d_{\pi^\prime}\|_1 \;. $$
To finish our bound, we must therefore bound the $\ell_1$ distance between the two state occupancy measures. Somewhat naturally, the distance between $d_\pi$ and $d_{\pi^\prime}$ can be bounded by the distance between $\pi$ and ${\pi^\prime}$: $$ \tag{10} \| d_\pi - d_{\pi^\prime}\|_1 \leq \blacksquare\|\pi-\pi^\prime\|_\infty\;. $$
Putting everything together yields the announced bound.
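Explicitly, chaining (9) and (10): $$ v_{\pi^\prime} - v_\pi \;\geq\; \ell(\pi^\prime\vert \pi) - \blacksquare\,\|\pi-\pi^\prime\|_\infty\,\|d_\pi-d_{\pi^\prime}\|_1 \;\geq\; \ell(\pi^\prime\vert \pi) - \blacksquare\,\|\pi-\pi^\prime\|_\infty^2\;, $$ which is exactly (6).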
Algorithms
The last section motivated a new policy improvement protocol; inspired by (6), it prescribes that at round $t$ we maximise $\ell(\pi^\prime\vert \pi_t) -\eta \|\pi^\prime-\pi_t\|_\infty^2$ in order to produce the improved $\pi_{t+1}$, where $\eta$ is a user-specified hyperparameter. We are getting closer to practical algorithms, but we need some extra steps.
Evaluating $\ell(\pi^\prime\vert \pi)$ involves a sum over all actions, which can itself be quite expensive. Ideally, we could write it as an expectation under $\pi$ only; this would allow us to resort to sample-average estimators of $\ell(\pi^\prime\vert \pi)$ (concretely, just sample from $\pi$). This is actually relatively easy, thanks to importance sampling: $$ \ell(\pi^\prime\vert \pi) = \mathbb{E}_{s\sim d_\pi}\mathbb{E}_{a\sim\pi(s)}\Big[\frac{\pi^\prime(a\vert s)}{\pi(a\vert s)}\, \text{a}_\pi(s, a)\Big]\;. $$
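To see how cheap the resulting estimator is: given a batch of pairs sampled under $\pi$, together with pre-computed log-probabilities and advantage estimates, the sample-average estimate of $\ell(\pi^\prime\vert\pi)$ is essentially a one-liner (a sketch; array names are illustrative):

```python
import numpy as np

def surrogate_estimate(logp_new, logp_old, advantages):
    """Sample-average, importance-sampled estimate of l(pi' | pi).

    logp_new:   log pi'(a_i | s_i) on the sampled pairs, shape (N,)
    logp_old:   log pi(a_i | s_i) under the behaviour policy, shape (N,)
    advantages: advantage estimates a_pi(s_i, a_i), shape (N,)
    """
    ratios = np.exp(logp_new - logp_old)    # importance weights pi' / pi
    return np.mean(ratios * advantages)
```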
The maximum total variation $\|\pi^\prime-\pi_t\|_\infty$ is also tricky to compute without enumerating all states. It can be replaced by an expected total variation under $d_\pi$, that is $\mathbb{E}_{s\sim d_\pi}[d_\text{TV}(\pi(s), \pi^\prime(s))]$, which can directly be estimated by sampling from trajectories generated by $\pi$. Turns out, this is not just a convenient approximation; it is also motivated by a variant of (6) that includes the expected total variation distance [4]. That’s it, we have a practical algorithm! It repeatedly goes through the improvement step: $$ \tag{11} \pi_{t+1} \in \argmax_{\pi} \mathbb{E}_{s\sim d_{\pi_t}}\Big[\mathbb{E}_{a\sim\pi_t(s)}\Big[\frac{\pi(a\vert s)}{\pi_t(a\vert s)}\, \text{a}_{\pi_t}(s, a)\Big]- \eta\, d_\text{TV}(\pi(s), \pi_t(s))\Big] $$
which is amenable to stochastic optimisation. The protocol is simple: generate data from $\pi_t$, use it to build sample-average estimators of the objective in (11), compute $\pi_{t+1}$, repeat. Several algorithms spawn from this approach; they differ mostly in how they control the distance between two iterates $\pi_{t}$ and $\pi_{t+1}$.
TRPO
The total variation distance is not necessarily great to handle in numerical optimisation schemes; beyond not being smooth, it does not always have a closed form – even for standard distributions. The authors of the TRPO paper [3] replace it with the KL divergence. Observe that this can also be justified by a lower bound, thanks to Pinsker’s inequality. TRPO also turns the unconstrained (11) into a constrained program: $$ \tag{12} \begin{aligned} \pi_{t+1} \in &\argmax_{\pi}& &\mathbb{E}_{s\sim d_{\pi_t}}\mathbb{E}_{a\sim\pi_t(s)}\Big[\frac{\pi(a\vert s)}{\pi_t(a\vert s)}\, \text{a}_{\pi_t}(s, a)\Big]\;, \\ & & &\text{s.t. }\mathbb{E}_{s\sim d_{\pi_t}}\Big[d_\text{KL}(\pi(s), \pi_t(s))\Big] \leq \delta\;, \end{aligned} $$ where $\delta$ is a hyperparameter. A search direction is produced by a second-order method: the inverse Hessian of the constraint (which is, here, the Fisher information matrix) is multiplied by the objective’s gradient. A line search is then performed along the search direction to maximise the objective while satisfying the constraint.
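As a rough sketch, here are the two quantities (12) trades off for a discrete action space, assuming we have the logits of both policies on a batch of sampled states (names are illustrative; the actual TRPO machinery, conjugate gradient on the Fisher matrix followed by a backtracking line search, is not reproduced here):

```python
import numpy as np

def trpo_ingredients(logits_new, logits_old, actions, advantages):
    """Surrogate objective and averaged KL for categorical policies.

    logits_new, logits_old: action logits on sampled states, shape (N, A)
    actions:                actions sampled under the old policy, shape (N,)
    advantages:             advantage estimates of the old policy, shape (N,)
    """
    def log_softmax(x):
        x = x - x.max(axis=1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

    logp_new, logp_old = log_softmax(logits_new), log_softmax(logits_old)
    idx = np.arange(len(actions))
    ratios = np.exp(logp_new[idx, actions] - logp_old[idx, actions])
    surrogate = np.mean(ratios * advantages)
    # Averaged KL between the old and new policies (old-to-new direction).
    kl = np.mean(np.sum(np.exp(logp_old) * (logp_old - logp_new), axis=1))
    return surrogate, kl
```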
PPO
TRPO’s implementation does not stray too far from our principled arguments. However, solving (12) is relatively expensive, as it goes through a second-order method. The authors of PPO [5] introduced a rough but effective approximation. To understand it, observe that, with the convention $d_\text{TV}(p,q)=\tfrac{1}{2}\|p-q\|_1$, the expected total variation can be written as: $$ \mathbb{E}_{s\sim d_{\pi_t}}\Big[d_\text{TV}(\pi(s), \pi_t(s))\Big] = \frac{1}{2}\,\mathbb{E}_{s\sim d_{\pi_t}}\mathbb{E}_{a\sim\pi_t(s)}\Big[\Big\vert \frac{\pi(a\vert s)}{\pi_t(a\vert s)}-1\Big\vert\Big]\;. $$
In other words, the (expected) total variation will remain small as long as the ratio $\pi(a\vert s)/\pi_t(a\vert s)$ remains close to 1 for state-action pairs generated by $\pi_t$. This motivated an improvement step which prevents the importance weights from straying too far from 1. Concretely, denote $\vert \cdot\vert_a^b$ the clipping operator: for any $a\leq b$ and $x\in\mathbb{R}$ we have $|x|_a^b = \min(b, \max(a, x))$. Then, for $\varepsilon$ a small (close to 0) hyperparameter, the PPO objective reads: $$ \tag{13} \pi_{t+1} \in \argmax_{\pi}\mathbb{E}_{s\sim d_{\pi_t}}\mathbb{E}_{a\sim\pi_t(s)}\Big[\Big\vert\frac{\pi(a\vert s)}{\pi_t(a\vert s)}\Big\vert_{1-\varepsilon}^{1+\varepsilon}\, \text{a}_{\pi_t}(s, a)\Big]\;. $$ (Strictly speaking, the objective of [5] takes a minimum between the clipped and unclipped terms; we keep the clipped-only form above to lighten notation.) It can be optimised using first-order stochastic optimisation methods, at a much lower cost than the TRPO objective. Modern implementations of PPO combine several other tricks and code-level optimisations; the main idea remains the same, and powers many recent RL successes.
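For reference, here is a minimal PyTorch sketch of the clipped surrogate, written to be maximised (negate it to use as a loss). It includes the minimum between clipped and unclipped terms from the original paper; tensor names are illustrative:

```python
import torch

def ppo_clip_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Clipped PPO surrogate, averaged over a batch of (s, a) pairs from pi_t.

    logp_new:   log pi(a|s) under the policy being optimised (requires grad), shape (N,)
    logp_old:   log pi_t(a|s) under the data-generating policy, shape (N,)
    advantages: advantage estimates of pi_t, shape (N,)
    """
    ratios = torch.exp(logp_new - logp_old.detach())
    clipped = torch.clamp(ratios, 1.0 - eps, 1.0 + eps)
    # Pessimistic (minimum) combination of the clipped and unclipped terms.
    return torch.min(ratios * advantages, clipped * advantages).mean()
```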
SPO
One advantage that is often claimed for the PPO objective (13) is its simplicity. It does have a few holes, however. It was first demonstrated by [6] that importance weight clipping is a poor proxy for controlling the size of the policy update. There is a simple reason: assume that for some given state-action pair $(\tilde{s}, \tilde{a})$ we have $\pi(\tilde a\vert \tilde s) > (1+\varepsilon)\pi_t(\tilde a\vert \tilde s)$. For such a point, the gradient of (13) w.r.t $\pi$ is zero. In other words, there is no mechanism keeping the importance weight at $(\tilde{s}, \tilde{a})$ close to the boundary, at least not while using first-order methods. The influence of other state-action pairs can push it arbitrarily far away.

Below, we denote the importance sampling weights as $\omega_\pi(s, a) := \pi(a\vert s) / \pi_{t}(a\vert s)$. The authors of SPO [7] proposed a simple fix to the original PPO idea, one that offers better control over the size of the policy update. Concretely, they propose the following policy improvement step: $$ \tag{14} \pi_{t+1} \in \argmax _\pi \mathbb{E}_{s\sim d_{\pi_t}}\mathbb{E}_{a\sim\pi_t(s)} \Big[\omega_\pi(s, a) \text{a}_{\pi_t}(s, a) - \frac{\vert \text{a}_{\pi_t}(s, a)\vert}{2\varepsilon}(\omega_\pi(s, a)-1)^2\Big]\;. $$ Perhaps the cleanest intuition as to why this constitutes a promising update rule is the following: for a single state-action pair $(s, a)$, the importance weight maximising this objective lands exactly on the boundary – and without going through any zero-gradient phase, whatever the weight’s initial value: $$ \argmax_\omega\big\{\omega\, \text{a}_{\pi_t}(s, a) - \frac{\vert \text{a}_{\pi_t}(s, a)\vert}{2\varepsilon}(\omega-1)^2\big\} = 1 + \text{sign}({\text{a}_{\pi_t}(s, a)})\,\varepsilon\;. $$
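Under the same conventions as the PPO snippet above, a sketch of the surrogate in (14) could look as follows (an illustration, not the authors’ reference implementation):

```python
import torch

def spo_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Quadratic-penalty surrogate of (14), averaged over a batch (to be maximised)."""
    omega = torch.exp(logp_new - logp_old.detach())          # importance weights
    # The penalty keeps pulling omega back towards 1, with no zero-gradient region.
    penalty = advantages.abs() / (2.0 * eps) * (omega - 1.0) ** 2
    return (omega * advantages - penalty).mean()
```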
Leftovers
Today, PPO is the most widespread implementation of (11). Modern implementations, especially in the world of LLMs, also add a KL penalty to further constrain the size of the policy update, or to avoid large deviations from an initial model. From a policy improvement perspective, that’s roughly how far it goes.
We have left the topic of advantage estimation on the side until now. Most practical implementations rely on Generalised Advantage Estimation (GAE) [8]. It has close ties with TD($\lambda$), and allows one to strike a fine bias-variance trade-off in advantage estimation. Roughly, it relies on maintaining a model for the state value function $v_\pi$, and computes the $k$-step advantage estimators $ \hat{\text{a}}_\pi^k(s_t, a_t) := \sum_{i=0}^{k-1} \lambda^i r(s_{t+i}, a_{t+i}) + \lambda^k v_\pi(s_{t+k}) - v_\pi(s_t) \;. $ A final advantage estimate is obtained via an exponentially weighted sum of these estimators.
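In practice, the weighted sum is computed through a backward recursion over TD residuals; here is a minimal sketch over a single trajectory (the discount is spelled `discount` to avoid clashing with this post’s use of $\lambda$, and `gae_lambda` denotes GAE’s weighting parameter):

```python
import numpy as np

def gae_advantages(rewards, values, discount, gae_lambda):
    """Generalised Advantage Estimation on one trajectory.

    rewards: r(s_t, a_t) for t = 0..T-1, shape (T,)
    values:  value estimates v(s_t) for t = 0..T (includes a bootstrap value), shape (T+1,)
    """
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        # TD residual: the 1-step advantage estimate.
        delta = rewards[t] + discount * values[t + 1] - values[t]
        # Exponentially weighted sum of the k-step advantage estimates.
        running = delta + discount * gae_lambda * running
        advantages[t] = running
    return advantages
```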
Maintaining a state value function can be expensive, and in applications like LLM alignment its cost is not always justified. LLM trainers seem to have fallen back to pure Monte-Carlo advantage estimation. For instance, the authors of [9] rely on sampling several trials from the same state to form advantages for each action.
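For illustration, here is a minimal sketch of such a group-based Monte-Carlo estimate, using a group-mean baseline (a simplified variant, not necessarily the exact recipe of [9]):

```python
import numpy as np

def group_advantages(returns):
    """Monte-Carlo advantages for G rollouts sharing the same start state.

    returns: the total return collected by each of the G trials, shape (G,)
    Each trial is scored relative to the group average; normalising by the
    group standard deviation is another common variant.
    """
    return returns - returns.mean()
```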