A Tale of Policy Gradients

This post covers the derivation of several policy gradient expressions. It walks through the vanilla policy gradient and the policy gradient theorem, and finishes with deterministic policy gradients. No attention is given to optimisation aspects (whether and how policy gradients can find locally optimal policies): the focus is on the gradients’ (semi-)formal derivation.


We’ll consider a finite, stationary Markov Decision Process (MDP) $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{P}_{s}^a, r)$ under an infinite-horizon, discounted criterion. Recall the classical definitions for the state / state-action value functions: $$ \begin{aligned} v_\lambda^\pi(s) &:= \mathbb{E}_s^\pi \left[ \sum_{t=1}^\infty \lambda^{t-1}r(s_t, a_t)\right]\; ,\\ q_\lambda^\pi(s, a) &:= r(s, a) + \lambda\, \mathbb{E}_{s'\sim \mathcal{P}_s^a}\left[v_\lambda^\pi(s')\right] \; . \end{aligned} $$

for any $s\in\mathcal{S}$, $a\in\mathcal{A}$, and $\pi$ some stationary policy. The notation $\mathbb{E}_s^\pi$ is fairly standard: it averages out every source of randomness (from the MDP or the policy), starting at state $s_1 = s$ and following $\pi$.


Since classical MDP theory [Puterman, 6.2.4] tells us that there exists an optimal stationary policy, we will stick to this policy class in this post. Also, in contrast with value-based methods, we will consider stochastic policies (more on that later). Finally, policies will be parametrized by some parameter $\theta$ living in some Euclidean space (typically the weights of a neural network).

Before moving forward, we must define a notion of utility for a given policy $\pi_\theta$ – that is, the criterion that we will effectively optimize. Assuming that the initial state of the MDP is always some fixed state $s^0$, a natural definition of utility is the policy’s value at $s^0$: $$ U(\theta) := v_\lambda^{\pi_\theta}(s^0)\; . $$


Vanilla Policy Gradient

As we shall see shortly, the Vanilla Policy Gradient (VPG) is deeply rooted in Monte-Carlo estimation. It is therefore best suited to episodic settings: in what follows, we assume there is an absorbing state that every policy eventually reaches during the first $T$ rounds.

The interaction between the environment transition kernel $\mathcal{P}_s^a$ and the policy $\pi_\theta$ induces a probability measure $\mathcal{T}_\theta$ over trajectories $\tau = (s_1, a_1, \ldots, a_{T-1}, s_T)$. Notice that denoting $R_\tau:= \sum_{t=1}^{T-1} \lambda^{t-1}r(s_t, a_t)$ the total return of a trajectory $\tau$, we can rewrite $U(\theta)$ as: $$ U(\theta) = \mathbb{E}_{\tau\sim \mathcal{T}_\theta}[R_\tau] \; . $$ Our goal is to express $\nabla_\theta U(\theta)$ as an expectation under $\mathcal{T}_\theta$; this will allow us to efficiently compute stochastic gradients by sampling trajectories. The resulting estimate can then be fed to your favorite stochastic-gradient-based optimiser to generate better policies – according to $U(\theta)$.

Likelihood Ratio

The likelihood ratio is a neat little trick that allows us to do just that. In short, it asserts that: $$ \nabla_\theta U(\theta) = \mathbb{E}_{\tau\sim \mathcal{T}_\theta}\left[R_\tau \nabla_\theta \log\mathcal{T}_\theta(\tau)\right] \; . $$ This follows from the identity $\nabla_\theta \mathcal{T}_\theta(\tau) = \mathcal{T}_\theta(\tau)\nabla_\theta \log\mathcal{T}_\theta(\tau)$, after swapping the gradient and the sum over trajectories.


Analyzing $\mathcal{T}_\theta$ shows that this can be further simplified. Indeed: $$ \begin{aligned} \mathcal{T}_\theta(\tau) &= \mathbb{P}_s^\pi(s_1, a_1, \ldots, a_{T-1}, s_T)\; ,\\ &\overset{(i)}{=} \mathbb{P}_s^\pi(s_T\vert s_1, a_1, \ldots, a_{T-1})\mathbb{P}_s^\pi(a_{T-1}\vert s_1, a_1, \ldots, s_{T-1})\mathbb{P}_s^\pi(s_1, a_1, \ldots, s_{T-1}) \; ,\\ &\overset{(ii)}{=} \mathcal{P}_{s_{_{T-1}}}^{a_{_{T-1}}}(s_T) \pi_\theta(a_{T-1}\vert s_{T-1})\mathbb{P}_s^\pi(s_1, a_1, \ldots, s_{T-1})\; ,\\ &\overset{(iii)}{=} \mathbf{1}[s_1 = s^0]\prod_{t=1}^{T-1}\mathcal{P}_{s_{_{t}}}^{a_{_{t}}}(s_{t+1})\pi_\theta(a_t \vert s_t )\; , \end{aligned} $$ where $(i)$ is given by the chain rule of probability, $(ii)$ uses the environment’s and policy’s Markovian nature and $(iii)$ unrolls the recursion until $t=1$. Therefore, whenever $s_1 = s^0$: $$ \begin{aligned} \nabla_\theta \log\mathcal{T}_\theta(\tau) &= \nabla_\theta \sum_{t=1}^{T-1} \left[\log\mathcal{P}_{s_{_{t}}}^{a_{_{t}}}(s_{t+1})+ \log\pi_\theta(a_t \vert s_t )\right] \; ,\\ &= \sum_{t=1}^{T-1} \nabla_\theta\log\pi_\theta(a_t \vert s_t ) \; . \end{aligned} $$ Notice how we therefore do not need explicit knowledge of the transition kernel to compute this gradient. All in all, we are left with the following expression: $$ \nabla_\theta U(\theta) = \mathbb{E}_{\tau\sim \mathcal{T}_\theta}\left[R_\tau\sum_{t=1}^{T-1} \nabla_\theta\log\pi_\theta(a_t \vert s_t )\right]\; , $$ for which unbiased estimators are easily computed.
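To see the estimator in action, here is a minimal numpy sketch on a hypothetical one-step problem (three actions, fixed rewards, softmax policy – all made up for illustration). The likelihood-ratio estimate is compared against the exact gradient of the expected reward:

```python
import numpy as np

rng = np.random.default_rng(0)
r = np.array([1.0, 2.0, 3.0])          # reward for each of the 3 actions
theta = np.array([0.1, -0.2, 0.3])     # softmax policy parameters

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

pi = softmax(theta)

# Exact gradient of U(theta) = sum_a pi(a) r(a).
# For a softmax, d pi(a) / d theta = pi(a) * (e_a - pi).
exact = sum(r[a] * pi[a] * (np.eye(3)[a] - pi) for a in range(3))

# Likelihood-ratio estimate: average of r(a_n) * grad log pi(a_n),
# where grad log pi(a) = e_a - pi for a softmax policy.
N = 200_000
actions = rng.choice(3, size=N, p=pi)
scores = np.eye(3)[actions] - pi       # one score vector per sample
estimate = (r[actions][:, None] * scores).mean(axis=0)

print(exact, estimate)                 # the two should be close
```

Only samples from the policy and the observed rewards enter the estimate, exactly as the derivation promises.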

Reducing Variance

As usual with Monte-Carlo methods, variance can be a painful thorn in our side. It turns out that stochastic estimators of $\nabla_\theta U(\theta)$ computed according to the above formula come with substantial variance. Thankfully, we have a couple of tricks up our sleeves to trim it, without having to introduce any bias.

Temporal Structure

Our current gradient expression can be further simplified by observing that it contains a bunch of terms that do not contribute in expectation. Indeed, note that if $k<t$: $$ \begin{aligned} \mathbb{E}_{\tau\sim \mathcal{T}_\theta}\left[r(s_k, a_k) \nabla_\theta\log\pi_\theta(a_t \vert s_t )\right] = 0 \, . \end{aligned} $$ (To see this, condition on the history $(s_1, a_1, \ldots, s_t)$: the past reward $r(s_k, a_k)$ is then fixed, while the score $\nabla_\theta\log\pi_\theta(a_t\vert s_t)$ has zero conditional mean.)


Denoting $G_\tau^t := \sum_{k=t}^{T-1} \lambda^{k-t}r(s_k, a_k)$ we obtain a new identity: $$ \nabla_\theta U(\theta) = \mathbb{E}_{\tau\sim \mathcal{T}_\theta}\left[\sum_{t=1}^{T-1} \lambda^{t-1}G_{\tau}^t\nabla_\theta\log\pi_\theta(a_t \vert s_t )\right]\; . $$


This makes a lot of intuitive sense: actions should be reinforced only based on future rewards (what they actually impact), not past ones.
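The rewards-to-go $G_\tau^t$ are cheap to compute with a single backward pass over a trajectory’s rewards, using the recursion $G^t = r_t + \lambda G^{t+1}$. A small sketch (function name is illustrative):

```python
import numpy as np

def rewards_to_go(rewards, lam):
    """G_t = sum_{k >= t} lam^(k-t) r_k, via the backward recursion
    G_t = r_t + lam * G_{t+1}."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + lam * running
        G[t] = running
    return G

print(rewards_to_go([1.0, 0.0, 2.0], lam=0.5))  # [1.5, 1.0, 2.0]
```

The backward pass costs $O(T)$, versus $O(T^2)$ for a naive per-step summation.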



Baselines

Another typical way of reducing variance in Monte-Carlo estimation is to introduce baselines, a.k.a. control variates in the Monte-Carlo literature. The main idea is to add a disturbance which does not contribute in expectation to the estimation problem. When this disturbance is correlated with the random variable whose mean we are trying to approximate, the Monte-Carlo estimator's variance is reduced (see [Owen, 8.9] for a detailed treatment).

For our purposes, the main point is first to show that for any deterministic function $b:\mathcal{S}\mapsto \mathbb{R}$, the following identity holds: $$ \nabla_\theta U(\theta) = \mathbb{E}_{\tau\sim \mathcal{T}_\theta}\left[\sum_{t=1}^{T-1} \lambda^{t-1}\left(G_{\tau}^t-b(s_t)\right)\nabla_\theta\log\pi_\theta(a_t \vert s_t )\right]\; . $$
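To see why the baseline introduces no bias, condition on $s_t$ and use the fact that the score has zero conditional mean: $$ \begin{aligned} \mathbb{E}_{\tau\sim \mathcal{T}_\theta}\left[b(s_t)\nabla_\theta\log\pi_\theta(a_t \vert s_t )\right] &= \mathbb{E}\left[b(s_t)\, \mathbb{E}_{a_t\sim\pi_\theta(\cdot\vert s_t)}\left[\nabla_\theta\log\pi_\theta(a_t \vert s_t )\right]\right]\; ,\\ &= \mathbb{E}\left[b(s_t)\sum_{a\in\mathcal{A}}\pi_\theta(a\vert s_t)\frac{\nabla_\theta\pi_\theta(a\vert s_t)}{\pi_\theta(a\vert s_t)}\right]\; ,\\ &= \mathbb{E}\Big[b(s_t)\,\nabla_\theta \underbrace{\textstyle\sum_{a\in\mathcal{A}}\pi_\theta(a\vert s_t)}_{=\,1}\Big] = 0\; . \end{aligned} $$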

(The above result is still true when replacing $b(s_t)$ by $b_t$, where $b_t$ is a random variable with appropriate measurability – it must not depend on $a_t$.)


It is now reasonable to wonder which mapping $b$ could be useful to reduce variance. Because we want to reinforce (i.e. make more likely) rewarding actions, it is natural to subtract a baseline that represents our current expectation of an action’s return. An action associated with a higher-than-expected return will be reinforced, and conversely. As a result, taking $b(s) = v_\lambda^{\pi_\theta}(s)$ is a rather natural choice. Of course, the value function is itself unknown, but it can also be approximated from samples (see below). The proposed estimator therefore draws from the identity: $$ \nabla_\theta U(\theta) = \mathbb{E}_{\tau\sim \mathcal{T}_\theta}\left[\sum_{t=1}^{T-1} \lambda^{t-1}\left(G_{\tau}^t-v_\lambda^{\pi_\theta}(s_t)\right)\nabla_\theta\log\pi_\theta(a_t \vert s_t )\right]\; . $$

Implementation example

To make things concrete, let’s have a look at some pseudo-code using the vanilla policy gradient described above – an algorithm often called REINFORCE. The baseline is chosen to be the state value function, evaluated in a Monte-Carlo fashion. The pseudo-code illustrates one step of gradient ascent using the VPG estimator, after $N$ roll-outs using $\pi_\theta$. The baseline $b(\cdot)$ is parametrized by $\omega$, living in some Euclidean space.

fit the baseline: $$\hat{\omega} \leftarrow \argmin_\omega \sum_{n=1}^N \sum_{t=1}^{T-1} (b_\omega(s_t^n) - G_n^t)^2$$ compute the stochastic gradient: $$ \hat{g} \leftarrow \frac{1}{N}\sum_{n=1}^N \sum_{t=1}^{T-1} \lambda^{t-1}\left(G_n^t - b_{\hat\omega}(s_t^n)\right)\nabla_\theta \log\pi_\theta(a_t^n\vert s_t^n) $$ update the parameter: $$ \theta \leftarrow \theta + \eta \cdot \hat{g} $$
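A rough Python transcription of these three steps, on a hypothetical toy chain MDP with a tabular softmax policy (the environment, reward structure and step sizes below are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, lam, eta = 3, 2, 0.9, 0.5

# Tabular softmax policy: theta[s, a] is the logit of action a in state s.
theta = np.zeros((n_states, n_actions))

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def rollout(T=10):
    """Toy chain: action 1 moves right, action 0 left; reward for
    picking action 1 in the last state. Purely illustrative."""
    s, states, actions, rewards = 0, [], [], []
    for _ in range(T):
        a = rng.choice(n_actions, p=softmax(theta[s]))
        rewards.append(1.0 if (s == n_states - 1 and a == 1) else 0.0)
        states.append(s); actions.append(a)
        s = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return states, actions, rewards

def returns_to_go(rewards):
    G, running = np.zeros(len(rewards)), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + lam * running
        G[t] = running
    return G

# 1) collect N roll-outs and fit the baseline: for a tabular baseline,
#    the least-squares fit is just the per-state mean return.
N = 50
batches = [rollout() for _ in range(N)]
per_state = {s: [] for s in range(n_states)}
for states, _, rewards in batches:
    for s, g in zip(states, returns_to_go(rewards)):
        per_state[s].append(g)
b = np.array([np.mean(per_state[s]) if per_state[s] else 0.0
              for s in range(n_states)])

# 2) compute the stochastic gradient, 3) ascend.
grad = np.zeros_like(theta)
for states, actions, rewards in batches:
    G = returns_to_go(rewards)
    for t, (s, a) in enumerate(zip(states, actions)):
        score = -softmax(theta[s])
        score[a] += 1.0                  # grad log pi(a|s) for a softmax
        grad[s] += lam**t * (G[t] - b[s]) * score / N
theta += eta * grad
```

Note that `lam**t` with $t$ starting at $0$ matches the $\lambda^{t-1}$ factor of the one-indexed pseudo-code.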


Policy Gradient Theorem

We now detail the Policy Gradient Theorem (PGT) [Sutton et al]. It is a result somewhat equivalent to VPG; however, it holds even in continuing environments. It also holds for the average-reward criterion, under minor adaptations. (Recall that VPG relied on a Monte-Carlo strategy, which made sense only in episodic settings.) In what follows, let $$d_\lambda^{\pi_\theta}(s) := (1-\lambda)\sum_{t=1}^\infty \lambda^{t-1} \mathbb{P}_{s^0}^{\pi_\theta} \left(s_t = s\right)$$ be the discounted state-occupancy measure (it is a valid probability measure over $\mathcal{S}$). The Policy Gradient Theorem states that: $$ \nabla_\theta U(\theta) = (1-\lambda)^{-1}\sum_{s\in\mathcal{S}} d_\lambda^{\pi_\theta}(s) \sum_{a\in\mathcal{A}} q_\lambda^{\pi_\theta}(s, a) \nabla_\theta \pi_\theta(a\vert s) \; . $$


The main point is that $\nabla_\theta U(\theta)$ depends on neither $\nabla_\theta q_\lambda^{\pi_\theta}$ nor $\nabla_\theta d_\lambda^{\pi_\theta}$ – which would be a serious pain to compute. It allows us to write: $$ \nabla_\theta U(\theta) \propto \mathbb{E}_{s\sim d_\lambda^{\pi_\theta}} \mathbb{E}_{a\sim\pi_\theta(s)}\left[q_\lambda^{\pi_\theta}(s, a)\nabla_\theta\log\pi_\theta(a\vert s)\right] \; , $$ justifying the computation of associated stochastic gradients. The original PGT paper [Sutton et al] goes even further and justifies using function approximation for $q_\lambda^{\pi_\theta}$ without biasing the gradient (we won’t dive into the details: if you are interested in learning more, the keyword you are looking for is compatible function approximation). In practice, modern (read: Deep Learning) implementations do not really care about bias; neural networks are used to approximate the state-action value function. It can be learned from full traces (Monte-Carlo style), by using Bellman back-ups, or by a clever mix of the two – the Generalized Advantage Estimation paper by [Schulman et al] gives a pretty neat summary of the different techniques.
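As a sketch, a single on-policy sample $(s, a)$ contributes $\hat q(s,a)\,\nabla_\theta\log\pi_\theta(a\vert s)$ to the PGT gradient estimate. For a tabular softmax policy this is a one-liner (the critic estimate `q_hat` is assumed given, e.g. by some learned approximator):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def pgt_sample_gradient(theta_s, a, q_hat):
    """One-sample PGT estimate for a tabular softmax policy:
    q_hat * grad_theta log pi_theta(a | s), restricted to the logits
    theta_s of the visited state s (all other entries are zero)."""
    score = -softmax(theta_s)
    score[a] += 1.0                 # grad log pi(a|s) = e_a - pi(.|s)
    return q_hat * score

g = pgt_sample_gradient(np.zeros(3), a=2, q_hat=1.5)
print(g)    # 1.5 * ([0, 0, 1] - [1/3, 1/3, 1/3]) = [-0.5, -0.5, 1.0]
```

Averaging such contributions over states drawn from $d_\lambda^{\pi_\theta}$ and actions drawn from $\pi_\theta$ recovers the expectation above, up to the proportionality constant.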


Deterministic Policy Gradients

We have so far been concerned with optimizing stochastic policies (notice how, due to the $\log(\cdot)$ operator, the stochastic policy gradient might be undefined for deterministic policies). This is not a real issue per se, as we can expect the optimized policy to converge to a deterministic one after sufficiently many gradient updates. Unfortunately, as the policy loses entropy, the associated stochastic policy gradient’s variance grows.


So, what happens if we were to directly optimize deterministic policies? This is where the Deterministic Policy Gradient (DPG) of [Silver et al.] comes into play. It is mostly useful for continuous action sets, i.e. $\mathcal{A} = \mathbb{R}^A$. In what follows, let $\pi_\theta: \mathcal{S}\mapsto \mathbb{R}^A$ be a deterministic policy. Under adequate smoothness requirements (on the reward function and the transition probability kernel) the DPG theorem claims that: $$ \nabla_\theta U(\theta) \propto \mathbb{E}_{s\sim d_\lambda^{\pi_\theta}} \left[\nabla_\theta \pi_\theta(s) \cdot \nabla_a \left.q_\lambda^{\pi_\theta}(s, a)\right|_{a=\pi_\theta(s)}\right] \; , $$ where $\nabla_\theta \pi_\theta(s)$ is the Jacobian of the application $\pi_\theta(s)$ mapping $\theta$ to some action in $\mathbb{R}^A$ (hence if $\theta$ lives in some $\mathbb{R}^d$, $\nabla_\theta \pi_\theta(s)\in \mathbb{R}^{d\times A}$).
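A minimal numeric sketch of the DPG update, assuming a linear deterministic policy $\pi_\theta(s) = \theta^\top \phi(s)$ with a scalar action ($A = 1$, so the Jacobian is just $\phi(s)$) and a made-up quadratic critic, so that $\nabla_a q$ is available in closed form:

```python
import numpy as np

phi = np.array([1.0, 0.5])        # feature vector of the sampled state s
a_star = 2.0                      # action maximising the toy critic at s
theta = np.zeros(2)               # linear policy: pi_theta(s) = theta . phi

def dq_da(a):
    """Gradient of the toy critic q(s, a) = -(a - a_star)^2 in the action."""
    return -2.0 * (a - a_star)

# DPG ascent: Jacobian of pi wrt theta (= phi here) times grad_a q at a = pi(s).
alpha = 0.1
for _ in range(200):
    a = theta @ phi
    theta += alpha * phi * dq_da(a)

print(theta @ phi)                # pi(s) converges towards a_star = 2.0
```

Each update is a contraction here (the action error shrinks by a constant factor per step), so the policy's action converges to the critic's maximiser, as the DPG expression suggests.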


It turns out that the DPG is actually the limiting version of the PGT as the entropy of the stochastic policies tends to $0$ [Silver et al., Theorem 2]. As a result, the different ideas regarding compatible function approximation remain valid.

DPG as approximate Policy Improvement

DPG also has nice intuitive ties to the generic Policy Iteration (PI) algorithm. Recall the policy improvement step of PI at round $t$: $$ \pi_{t+1}(s) \in \argmax_{a\in\mathcal{A}} q_\lambda^{\pi_t} (s, a) \; . $$ This requires solving a global maximisation problem, which is quite burdensome whenever $\mathcal{A}$ is not finite. Instead, we can hope to retain some of the nice properties of PI by following the gradient of $q_\lambda$ (w.r.t the action). More precisely, once the policy is parametrized, we can look for a direction $\delta\theta$ such that: $$ q_\lambda^{\pi_{_{\theta_t}}} (s, \pi_{\theta_t+\delta\theta}(s)) > q_\lambda^{\pi_{_{\theta_t}}} (s, \pi_{\theta_t}(s)) \; . $$ For a small enough $\alpha$, this improvement is guaranteed by the choice: $$ \begin{aligned} \delta\theta &= \alpha \nabla_\theta\left.\left(q_\lambda^{\pi_{_{\theta_t}}} (s, \pi_{\theta}(s))\right)\right\vert_{\theta=\theta_t}\;, \\ &= \alpha\, \nabla_\theta\left.\pi_{\theta}(s)\right|_{\theta=\theta_t}\cdot \nabla_a \left.q_\lambda^{\pi_{_{\theta_t}}}(s, a)\right\vert_{a=\pi_{\theta_t}(s)}\; . \end{aligned} $$ Averaging over states, we retrieve the DPG identity.
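We can check this single-step improvement numerically on a toy example (a linear policy at one fixed state, with a made-up quadratic critic; all values are illustrative):

```python
import numpy as np

phi = np.array([1.0, -0.5])             # feature vector of a fixed state s
theta = np.array([0.3, 0.2])            # current policy parameter theta_t

def pi(theta):
    """Deterministic linear policy evaluated at s: pi_theta(s) = theta . phi."""
    return theta @ phi

def q(a):
    """Toy critic q(s, a), peaked at a = 1."""
    return -(a - 1.0) ** 2

# delta_theta = alpha * grad_theta pi(s) * grad_a q |_{a = pi(s)}
alpha = 0.01
grad_a = -2.0 * (pi(theta) - 1.0)
delta = alpha * phi * grad_a

improved = q(pi(theta + delta)) > q(pi(theta))
print(improved)   # a small step along the chain-rule direction improves q
```

The inequality holds for any sufficiently small $\alpha$ as long as the gradient is non-zero, mirroring the guarantee above.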