Successor States and Representations (1/2)

The main promise of unsupervised RL is test-time adaptation to newly specified reward functions. This requires a systematic untangling of reward and dynamics in traditional RL tools. In this first post of a short series, we see how this can be done via the concepts of successor states and successor representations. We will focus on policy evaluation, leaving control for a follow-up.

One over-arching theme in unsupervised RL is the identification of a device that compactly summarises a dynamical system, to be used downstream for generating near-optimal policies for rewards that are specified a posteriori. It can be argued that said device can simply be a model of the dynamical system (although such models are seldom compact). In this post, and in contrast with model-based approaches, we will focus on test-time adaptation to new rewards without any state synthesis or explicit planning.

Preliminaries

$\quad$ We will heavily use the MDP notations defined in this post.

We consider a stationary Markov Decision Process $\mathcal{M}=(\mathcal{S}, \mathcal{A}, p, r)$. We will assume that $\mathcal{M}$ is finite, i.e. that $\mathrm{S}\times\mathrm{A}<\infty$ where $\mathrm{S} = \vert \mathcal{S} \vert$ and $\mathrm{A} = \vert \mathcal{A} \vert$. We solve $\mathcal{M}$ under a discounted reward criterion; for some $\lambda\in[0, 1)$: $$ v_\lambda^\pi(s) := \mathbb{E}_s^\pi\left[\sum_{t\geq 1}\lambda^{t-1}r(s_t, a_t)\right]\;. $$

Throughout this post, we will focus on policy evaluation more than control. Hence, we index everything under some stationary policy $\pi\in\mathcal{S}^\text{MR}$ at hand to focus on the specific induced Markov reward process. For instance, we will write $r_\pi(s) := \sum_{a\in\mathcal{A}} r(s, a)\pi(a\vert s)$ and $p_\pi(s^\prime\vert s) := \sum_{a\in\mathcal{A}}p(s^\prime\vert s, a)\pi(a\vert s)$ for the expected reward and transition kernel under $\pi$, respectively. Given the MDP’s finite nature, we will make heavy use of vectorial notations. Formally, for any $f:\mathcal{S}\mapsto\mathbb{R}$ and $F:\mathcal{S}\times\mathcal{S}\mapsto\mathbb{R}$ we use $\mathbf{f}\in\mathbb{R}^\mathrm{S}$ and $\mathbf{F}\in\mathbb{R}^{\mathrm{S}\times\mathrm{S}}$, respectively, for the associated vector notations. For instance, for any $s, s^\prime\in\mathcal{S}$ we have $[\mathbf{r}_\pi]_s=r_\pi(s)$ and $[\mathbf{P}_\pi]_{ss^\prime}=p_\pi(s^\prime\vert s)$. In a previous post we showed that this allows for the compact identity: $$ \tag{1} \mathbf{v}_\lambda^\pi = \sum_{t\geq 1} \lambda^{t-1}\mathbf{P}_\pi^{t-1} \mathbf{r}_\pi\;. $$
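To fix ideas, here is a minimal numpy sketch of these vectorised objects (the kernel, rewards and policy below are arbitrary placeholders). It checks that a truncated version of (1) matches the familiar linear-system solution of policy evaluation.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, lam = 6, 3, 0.9

# A random finite MDP and a random stationary policy (placeholders).
p = rng.dirichlet(np.ones(S), size=(A, S))      # p[a, s] is the row p(. | s, a)
r = rng.uniform(size=(S, A))                    # r[s, a]
pi = rng.dirichlet(np.ones(A), size=S)          # pi[s, a] = pi(a | s)

# Policy-averaged reward vector r_pi and transition matrix P_pi.
r_pi = np.einsum("sa,sa->s", r, pi)
P_pi = np.einsum("sa,ast->st", pi, p)

# Value function via a truncated version of the series (1) ...
v_series = sum(lam ** (t - 1) * np.linalg.matrix_power(P_pi, t - 1) @ r_pi
               for t in range(1, 200))
# ... versus the usual linear-system solution of policy evaluation.
v_solve = np.linalg.solve(np.eye(S) - lam * P_pi, r_pi)
print(np.abs(v_series - v_solve).max())         # tiny truncation error
```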

$\quad$ The finiteness assumption is made for building intuition. We will relax it by the end of this post.

Successor states

Successor measure

To compactly represent the dynamics $p$ of $\mathcal{M}$, we first need to disentangle it from the rewards in the value function $v_\lambda^\pi$. This can be done via the introduction of the discounted state occupancy measure, which represents the cumulative discounted time spent in some state when starting from another; $\forall s, s^\prime\in\mathcal{S}$: $$ m_\lambda^\pi(s^\prime\vert s) := \sum_{t\geq 1} \lambda^{t-1} \mathbb{P}^\pi(s_t=s^\prime \vert s_1 = s)\;. $$ The value function indeed writes as $v_\lambda^\pi(s) = \sum_{s^\prime\in\mathcal{S}} m_\lambda^\pi(s^\prime\vert s)r_\pi(s^\prime)$. The discounted state occupancy measure $m_\lambda^\pi$ captures the dynamics of the stochastic process driven by $\pi$, while the reward function is isolated in $r_\pi$.

Proof

This result is actually immediate: it is exactly (1) rewritten. Introducing the matrix version of $m_\lambda^\pi$: $$ \tag{2} \mathbf{M}_\lambda^\pi := \sum_{t\geq 1} \lambda^{t-1}\mathbf{P}_\pi^{t-1}\;, $$ which checks $[\mathbf{M}_\lambda^\pi]_{ss^\prime}=m_\lambda^\pi(s^\prime\vert s)$ for any $s, s^\prime\in\mathcal{S}$, we have $\mathbf{v}_\lambda^\pi = \mathbf{M}_\lambda^\pi \mathbf{r}_\pi$. In other words, policy evaluation is as simple as materialising the matrix $\mathbf{M}_\lambda^\pi$ and the reward vector $\mathbf{r}_\pi$, and going through a simple matrix-vector multiplication. There is nothing very deep about this; we already know that policy evaluation in a finite MDP is not much more than solving a linear system, since $ \mathbf{v}_\lambda^\pi = (\mathbf{I}_{d} - \lambda\mathbf{P}_\pi)^{-1} \mathbf{r}_\pi\;. $

Below, we will ignore the triviality of policy evaluation in finite MDPs and stick with our finiteness assumption to gain some intuition. The identity: $$ \mathbf{v}_\lambda^\pi = \mathbf{M}_\lambda^\pi \mathbf{r}_\pi $$ has an interesting ring to it: given $\mathbf{M}_\lambda^\pi$, we can evaluate $\pi$ on the fly, for any new reward function, via a simple matrix-vector product. That is a promising feature on our test-time reward-specification trip!
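Concretely, here is a small numpy illustration of this test-time behaviour (the kernel and rewards are random placeholders): once $\mathbf{M}_\lambda^\pi$ is materialised, each newly specified reward is evaluated with a single matrix-vector product.

```python
import numpy as np

rng = np.random.default_rng(1)
S, lam = 6, 0.9
P_pi = rng.dirichlet(np.ones(S), size=S)    # placeholder policy-induced kernel

# Successor-state matrix (2): M = sum_t lam^(t-1) P^(t-1) = (I - lam P)^(-1).
M = np.linalg.inv(np.eye(S) - lam * P_pi)

# Test-time adaptation: the same policy is evaluated against freshly
# specified rewards with one matrix-vector product each.
for _ in range(3):
    r_pi = rng.uniform(size=S)      # a new reward specification
    print((M @ r_pi).round(3))      # value of pi for this reward, no re-planning
```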

Successor features

In a similar fashion, some work [1] aims to represent successor features rather than successor states. Concretely, let’s assume that there exist $\phi:\mathcal{S} \mapsto \mathbb{R}^d$ and $\theta\in\mathbb{R}^d$ such that for all $(s, a)\in\mathcal{S}\times\mathcal{A}$ we have $r(s, a) = r(s) = \theta^\top\phi(s)$. One can therefore re-write the value function at $s\in\mathcal{S}$ of any stationary $\pi$ as: $$ \begin{aligned} v_\lambda^\pi(s) &= \mathbb{E}_s^\pi\left[\sum_{t\geq 1} \lambda^{t-1}\phi(s_t)\right]^\top \theta\; , \\ \end{aligned} $$ or, equivalently, $ \mathbf{v}_\lambda ^\pi = \Phi_\lambda^\pi\theta $ where $[\Phi_\lambda^\pi]_s=\sum_{s^\prime\in\mathcal{S}} \phi(s^\prime) \sum_{t\geq 1} \lambda^{t-1}\mathbb{P}^\pi(s_t=s^\prime\vert s_1=s)$ represents the discounted mean feature visited by the stochastic process starting at $s$. Observe how the successor state representation subsumes this approach, as the discounted mean features can simply be retrieved via a matrix-matrix multiplication: $ \Phi_\lambda^\pi = \mathbf{M}_\lambda^\pi \Phi $ where $\Phi = (\phi(s_1), \phi(s_2), \ldots)^\top$. In summary, we’ll retrieve a successor feature representation via: $$ \mathbf{v}_\lambda ^\pi = \mathbf{M}_\lambda^\pi \Phi\theta\; . $$
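As a quick sanity check, here is a numpy sketch of the identity $\mathbf{v}_\lambda^\pi = \mathbf{M}_\lambda^\pi\Phi\theta$, using an arbitrary random feature matrix $\Phi$ and weight vector $\theta$ (both placeholders).

```python
import numpy as np

rng = np.random.default_rng(2)
S, d, lam = 6, 3, 0.9
P_pi = rng.dirichlet(np.ones(S), size=S)        # placeholder kernel
M = np.linalg.inv(np.eye(S) - lam * P_pi)       # successor-state matrix

Phi = rng.normal(size=(S, d))                   # row s holds the feature phi(s)
Psi = M @ Phi                                   # successor features Phi_lambda^pi

# For any reward that is linear in the features, r_pi = Phi @ theta,
# evaluation boils down to a d-dimensional dot product per state.
theta = rng.normal(size=d)
v_sf = Psi @ theta
v_direct = np.linalg.solve(np.eye(S) - lam * P_pi, Phi @ theta)
print(np.allclose(v_sf, v_direct))              # True
```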

Learning

Bellman equations

Let’s forget for a moment the obvious equality $\mathbf{M}_\lambda^\pi=(\mathbf{I}_{d} - \lambda\mathbf{P}_\pi)^{-1}$ and consider how we could characterise the discounted occupancy measure $\mathbf{M}_\lambda^\pi$ in a fashion that would allow for some efficient learning. That could be useful for large state-spaces, where matrix inversion would turn out too expensive. We clearly have in mind finding some Bellman-like equations for $\mathbf{M}_\lambda^\pi$. Turns out, such equations exist and are quite similar to the ones we write for value functions:

$$ \tag{3} \mathbf{M}_\lambda^\pi = \mathbf{I}_d + \lambda \mathbf{P}_\pi \mathbf{M}_\lambda^\pi\; . $$

In state-indexed notations, this writes, for any $s, s^\prime\in\mathcal{S}$: $$ \tag{4} m_\lambda^\pi(s^\prime\vert s) = 1[s=s^\prime] + \lambda \sum_{s^{\prime\prime}\in\mathcal{S}}m_\lambda^\pi(s^\prime\vert s^{\prime\prime})p_\pi(s^{\prime\prime}\vert s)\; . $$

Proof

We can rewrite (3) using the forward successor Bellman operator: $$ \begin{aligned} \mathcal{T}_\lambda^{\pi, \rightarrow} : \mathbb{R}^{\mathrm{S}\times\mathrm{S}} &\mapsto \mathbb{R}^{\mathrm{S}\times\mathrm{S}}\;,\\ \mathbf{M} &\mapsto \mathbf{I}_d + \lambda \mathbf{P}_\pi\mathbf{M} \;. \end{aligned} $$ $\mathbf{M}_\lambda^\pi$ is a fixed-point of $\mathcal{T}_\lambda^{\pi, \rightarrow}$; we have $\mathcal{T}_\lambda^{\pi, \rightarrow}(\mathbf{M}_\lambda^\pi)=\mathbf{M}_\lambda^\pi$. Resorting to the now usual dance, we can guarantee that it is actually the unique fixed-point, thanks to the contractive property of $\mathcal{T}_\lambda^{\pi, \rightarrow}$ with respect to the operator norm $|||\cdot|||_\infty$ (where $|||\mathbf{M}|||_\infty = \sup_{\|\mathbf{v}\|_\infty=1} \|\mathbf{M}\mathbf{v}\|_\infty$). This property also opens the path for learning $\mathbf{M}_\lambda^\pi$ via a fixed-point iteration algorithm.
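Both facts are easily checked numerically; a small sketch on a random kernel (a placeholder) is given below.

```python
import numpy as np

rng = np.random.default_rng(3)
S, lam = 6, 0.9
P_pi = rng.dirichlet(np.ones(S), size=S)            # placeholder kernel
M = np.linalg.inv(np.eye(S) - lam * P_pi)

# M is a fixed point of the forward successor Bellman operator (3).
T_fwd = lambda X: np.eye(S) + lam * P_pi @ X
print(np.allclose(T_fwd(M), M))                     # True

# The operator is a lam-contraction for the infinity operator norm
# (which, for matrices, is the maximum absolute row sum).
op_norm = lambda X: np.abs(X).sum(axis=1).max()
X, Y = rng.normal(size=(S, S)), rng.normal(size=(S, S))
print(op_norm(T_fwd(X) - T_fwd(Y)) <= lam * op_norm(X - Y))   # True
```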


It is clear from (2) that $\mathbf{M}_\lambda^\pi$ and $\mathbf{P}_\pi$ commute. Looking at the forward successor Bellman operator, we can immediately deduce that $\mathbf{M}_\lambda^\pi$ is also a fixed-point of a backward Bellman operator: $$ \begin{aligned} \mathcal{T}_\lambda^{\pi, \leftarrow} : \mathbb{R}^{\mathrm{S}\times\mathrm{S}} &\mapsto \mathbb{R}^{\mathrm{S}\times\mathrm{S}}\;,\\ \mathbf{M} &\mapsto \mathbf{I}_d + \lambda \mathbf{M}\mathbf{P}_\pi \;. \end{aligned} $$ In other words $ \mathbf{M}_\lambda^\pi = \mathbf{I}_d + \lambda \mathbf{M}_\lambda^\pi\mathbf{P}_\pi\; . $ Similarly to the forward case, we can show that the backward successor Bellman operator $\mathcal{T}_\lambda^{\pi, \leftarrow}$ contracts under $|||\cdot|||_\infty$.


Fixed-point iterates

The fixed-point characterisations of $\mathbf{M}_\lambda^\pi$ via contractive maps allow us to write fixed-point iterations for computing $\mathbf{M}_\lambda^\pi$. Concretely, we can use two different protocols, relying on $\mathcal{T}_\lambda^{\pi, \rightarrow}$ and $\mathcal{T}_\lambda^{\pi, \leftarrow}$, respectively: $$ \begin{aligned} \mathbf{M}_{k+1}^{\rightarrow} &= \mathcal{T}_\lambda^{\pi, \rightarrow}(\mathbf{M}_k^{\rightarrow})=\mathbf{I}_d + \lambda\mathbf{P}_\pi \mathbf{M}_k^{\rightarrow}\;, \\ \mathbf{M}_{k+1}^{\leftarrow} &= \mathcal{T}_\lambda^{\pi, \leftarrow}(\mathbf{M}_k^{\leftarrow})=\mathbf{I}_d + \lambda\mathbf{M}_k^{\leftarrow}\mathbf{P}_\pi \;. \\ \end{aligned} $$ By the contractive nature of each operator, both $\{\mathbf{M}_{k}^{\rightarrow}\}_k$ and $\{\mathbf{M}_{k}^{\leftarrow}\}_k$ converge to $\mathbf{M}_\lambda^\pi$.
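Both protocols are easily verified on a small random kernel (again, a placeholder):

```python
import numpy as np

rng = np.random.default_rng(4)
S, lam = 6, 0.9
P_pi = rng.dirichlet(np.ones(S), size=S)            # placeholder kernel
M_true = np.linalg.inv(np.eye(S) - lam * P_pi)

M_fwd, M_bwd = np.zeros((S, S)), np.zeros((S, S))   # arbitrary initialisations
for _ in range(200):
    M_fwd = np.eye(S) + lam * P_pi @ M_fwd          # forward protocol
    M_bwd = np.eye(S) + lam * M_bwd @ P_pi          # backward protocol

# Both iterates converge to the successor-state matrix.
print(np.abs(M_fwd - M_true).max(), np.abs(M_bwd - M_true).max())
```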

$\quad$ The inherent value of those updates in the finite case is limited. They, however, provide the necessary ground for stochastic approximation variants, which will be useful in continuous conditions.

One can alternate between $\mathcal{T}_\lambda^{\pi, \rightarrow}$ and $\mathcal{T}_\lambda^{\pi, \leftarrow}$ during the fixed-point iterations. This has been shown to accelerate convergence to $\mathbf{M}_\lambda^\pi$ [2] by reducing the dimension of the subspace where convergence is at its slowest.

Factorisation

Forward-Backward

Learning $\mathbf{M}_\lambda^\pi$ involves estimating $\mathrm{S}^2$ parameters, and in its current form, does not allow for much interpretation or generalisation across states. As a first step to alleviate those issues, we can learn a factorised version of $\mathbf{M}_\lambda^\pi$. Concretely, we assume a low-rank structure for the state occupancy matrix; let $\mathbf{F}_\lambda^\pi$ and $\mathbf{B}_\lambda^\pi \in\mathbb{R}^{d\times \mathrm{S}}$ with $d<\mathrm{S}$ such that: $$ \tag{5} \mathbf{M}_\lambda^\pi = (\mathbf{F}_\lambda^\pi)^\top \mathbf{B}_\lambda^\pi\; . $$

We can interpret $\mathbf{F}_\lambda^\pi$ and $\mathbf{B}_\lambda^\pi$ as mappings from $\mathcal{S}$ to $\mathbb{R}^d$, effectively embedding each state into a more compact representation. Each entry of $\mathbf{M}_\lambda^\pi$ is now obtained as the scalar product between a column of the forward matrix $\mathbf{F}_\lambda^\pi$ and a column of the backward matrix $\mathbf{B}_\lambda^\pi$. Beyond providing us with tools to compactly represent states, this paves the way for using successor states in a control fashion (for a later post).
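Note that since $\mathbf{M}_\lambda^\pi=(\mathbf{I}-\lambda\mathbf{P}_\pi)^{-1}$ is full-rank, (5) can in general only hold approximately. As a simple illustration (this is not the learning scheme of [2], which fits $\mathbf{F}$ and $\mathbf{B}$ from transitions), a truncated SVD provides one valid rank-$d$ pair:

```python
import numpy as np

rng = np.random.default_rng(5)
S, d, lam = 20, 5, 0.9
P_pi = rng.dirichlet(np.ones(S), size=S)            # placeholder kernel
M = np.linalg.inv(np.eye(S) - lam * P_pi)

# A truncated SVD yields one possible rank-d pair (F, B); by Eckart-Young,
# it is the best rank-d approximation of M in Frobenius norm.
U, sv, Vt = np.linalg.svd(M)
F = (U[:, :d] * np.sqrt(sv[:d])).T                  # d x S, column s embeds state s
B = np.sqrt(sv[:d])[:, None] * Vt[:d]               # d x S
rel_err = np.linalg.norm(F.T @ B - M) / np.linalg.norm(M)
print(rel_err)    # how much of M the rank-d factorisation captures
```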

The forward and backward matrices laid out in (5) also check some fixed-point equations (obtained by simply replacing them in $\mathbf{M}_\lambda^\pi$’s fixed-point characterisation). However, hoping to approximate them via fixed-point iterations is somewhat futile (for instance, it is clear that their definition via (5) is not unique). Below, we let go of theoretical guarantees and follow [2]; we will instead illustrate how to compute gradient updates for $\mathbf{F}_\lambda^\pi$ based on Bellman residual minimisation (similar computations hold for $\mathbf{B}_\lambda^\pi$). To reduce clutter, we will now use the shorthands $\mathbf{F}:=\mathbf{F}_\lambda^\pi$ and $\mathbf{B}:=\mathbf{B}_\lambda^\pi$.

TD-like updates

Assume $\mathbf{B}$ is fixed. The forward successor Bellman equation for $\mathbf{F}$ writes $ \mathbf{F}^\top\mathbf{B} = \mathbf{I}_d + \lambda\mathbf{P}_\pi\mathbf{F}^\top\mathbf{B}\; . $ A reasonable value for $\mathbf{F}$ minimises the Bellman residual: $$ \min_{\mathbf{F}\in\mathbb{R}^{d\times\mathrm{S}}} \left\| \mathbf{F}^\top\mathbf{B} - (\mathbf{I}_d + \lambda\mathbf{P}_\pi\mathbf{F}^\top\mathbf{B})\right\|_2^2\; , $$ where the matrix $\ell_2$-norm is $\|\mathbf{A}\|_2^2 = \text{Tr}(\mathbf{A}\mathbf{A}^\top)$. Rather than solving this program head-on, we can follow some now well-established ideas, and update $\mathbf{F}$ along the gradients of the objective: $ J(\mathbf{F}) := \left\| \mathbf{F}^\top\mathbf{B} - (\mathbf{I}_d + \lambda\mathbf{P}_\pi\mathbf{F}^\top\mathbf{B})_\bot\right\|_2^2\;, $ where $\bot$ is the stop-gradient operator (à la target networks). This yields the update rule: $$ \tag{6} \delta \mathbf{F} \propto \mathbf{B} - \mathbf{B}\mathbf{B}^\top\mathbf{F}( \mathbf{I}_d- \lambda\mathbf{P}_\pi^\top)\; . $$ Turns out, (6) provides the basis for learning $\mathbf{F}$ under more intricate conditions, such as e.g. continuous states.
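As a sanity check, here is a numpy sketch of update (6) on a random kernel, with $\mathbf{B}$ drawn at random and held fixed (all sizes and step sizes below are arbitrary choices). The Bellman residual decreases toward a floor imposed by the rank-$d$ constraint on $\mathbf{F}^\top\mathbf{B}$.

```python
import numpy as np

rng = np.random.default_rng(6)
S, d, lam, lr = 12, 8, 0.9, 0.05
P_pi = rng.dirichlet(np.ones(S), size=S)            # placeholder kernel
B = rng.normal(size=(d, S)) / np.sqrt(S)            # backward matrix, held fixed
F = np.zeros((d, S))                                # forward matrix, to be learned

def bellman_residual(F):
    # Forward successor Bellman residual of the factorised model F^T B.
    return np.linalg.norm(F.T @ B - (np.eye(S) + lam * P_pi @ F.T @ B))

for k in range(10001):
    if k % 2500 == 0:
        print(k, bellman_residual(F).round(3))
    # Update rule (6), with the target term treated as a constant:
    F += lr * (B - B @ B.T @ F @ (np.eye(S) - lam * P_pi.T))
```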


A similar update rule can be computed for $\mathbf{B}$; likewise, their backward counterparts can be obtained by minimising the backward successor Bellman residuals. This yields four different update rules for alternately updating $\mathbf{F}$ and $\mathbf{B}$, which can be used interchangeably [2].

Generic state-spaces

Let us now revisit our finiteness assumption and consider a continuous state-space $\mathcal{S}=\mathbb{R}^{\mathrm{S}}$. Representing the discounted state occupancy measure as a matrix is no longer on the table. Let us re-define this measure for this new reality. For any measurable subset $S\subseteq\mathcal{S}$ and starting point $s\in\mathcal{S}$, the occupancy measure $m_\lambda^\pi$ writes: $$ m_\lambda^\pi(S\vert s) := \sum_{t\geq 1} \lambda^{t-1} \mathbb{P}^\pi(s_t\in S \vert s_1 = s)\;. $$ Learning a measure directly is non-trivial. Instead, we can learn its density (or Radon-Nikodym derivative) with respect to a reference probability measure $\rho$. Using infinitesimal notations: $$ m_\lambda^\pi(ds^\prime\vert s) := \tilde m_\lambda^\pi(s^\prime\vert s)\rho(ds^\prime) \;. $$ We can learn $\tilde m_\lambda^\pi(s^\prime\vert s)$ directly, which can be interpreted as a similarity measure between states. It will be assumed, though, that one can sample from $\rho$ – for instance, if $\rho$ is the steady-state distribution induced by $\pi$. The forward-backward factorisation will now write $ \tilde m_\lambda^\pi(s^\prime\vert s) = f_\lambda^\pi(s)^\top b_\lambda^\pi(s^\prime) $ where $f_\lambda^\pi$ and $b_\lambda^\pi$ map $\mathcal{S}$ to $\mathbb{R}^d$. Most results detailed for the finite case can be generalised, under this set-up, to the continuous setting. Learning rules for $\tilde m_\lambda^\pi$, $f_\lambda^\pi$ and $b_\lambda^\pi$ can be retrieved by following a similar logic.
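To make this slightly more concrete, here is a minimal sketch of what a sample-based learning rule can look like, in the spirit of [2] (an illustration, not a faithful reproduction of their algorithm). Everything below is an assumption made for the sake of the example: a one-dimensional toy chain, a fixed random feature map, a linear parametrisation $\tilde m(s^\prime\vert s) = f(s)^\top b(s^\prime)$ with $b$ held fixed, $\rho$ approximated by a replay buffer of visited states, and a TD-style update mirroring (4) where the Dirac term contributes a gradient at $(s, s)$ while the remaining Bellman error is evaluated at an independently sampled reference state.

```python
import numpy as np

rng = np.random.default_rng(7)
lam, d, p, lr, n_steps = 0.9, 8, 16, 0.01, 20000

# Toy one-dimensional dynamics under some fixed policy: an AR(1) chain.
def step(s):
    return 0.9 * s + 0.3 * rng.normal()

# Fixed random cosine features phi: R -> R^p (an arbitrary illustrative choice).
w_phi, b_phi = rng.normal(size=p), rng.uniform(0.0, 2.0 * np.pi, size=p)
def phi(s):
    return np.cos(w_phi * s + b_phi)

W_b = rng.normal(size=(d, p)) / np.sqrt(p)   # backward map b(s') = W_b phi(s'), fixed
W_f = np.zeros((d, p))                       # forward map  f(s)  = W_f phi(s), learned

def m_tilde(s2, s1):
    # Model of the successor-measure density w.r.t. rho: f(s1)^T b(s2).
    return (W_f @ phi(s1)) @ (W_b @ phi(s2))

s, buffer = 0.0, [0.0]
for _ in range(n_steps):
    s_next = step(s)
    s_ref = buffer[rng.integers(len(buffer))]        # reference state s'' ~ rho
    # TD-style update mirroring (4): the Dirac term gives a gradient at (s, s);
    # the error lam * m(s''|s') - m(s''|s) weighs a gradient at (s'', s).
    td = lam * m_tilde(s_ref, s_next) - m_tilde(s_ref, s)
    W_f += lr * (np.outer(W_b @ phi(s), phi(s))
                 + td * np.outer(W_b @ phi(s_ref), phi(s)))
    buffer.append(s_next)
    s = s_next

# Probe the learned state-similarity pattern between a few states.
probe = [-1.0, 0.0, 1.0]
print(np.round([[m_tilde(s2, s1) for s2 in probe] for s1 in probe], 2))
```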

References

I got introduced to successor features reading:

[1] Successor features for transfer in reinforcement learning. Barreto, André, et al., 2017.

which led me to the following, which this post is a condensed summary of:

[2] Learning successor states and goal-dependent values. Blier, Léonard, et al., 2021.