Successor States and Representations (2/3)

In this second post of the series, we take a break from successor measures to focus on successor features. We will first review a generalised policy improvement mechanism that efficiently leverages the successor features of existing policies to enable zero-shot transfer to new tasks. We will then discuss the generalisation to universal successor features approximators, which allow for direct zero-shot control.

$\quad$ The first post of this series can be found here.

We have introduced in an earlier post the successor measure $m_\lambda^\pi$ of a stationary policy $\pi$, as $\forall s, s^\prime\in\mathcal{S}$:

$$ \tag{1} m_\lambda^\pi(s^\prime\vert s) = \sum_{t\geq 1} \lambda^{t-1}\mathbb{P}^\pi(s_t=s^\prime\vert s_1=s)\;. $$ We focused our original discussion on the efficient estimation and learning of $m_\lambda^\pi$ in a reward-free MDP, and on how it could be used for cheap evaluation of $\pi$ under any a-posteriori specified reward. We postponed any control consideration; for now, it is probably still rather unclear how successor measures help us find optimal policies in a zero-shot manner. This post aims to provide a smooth introduction to this question by focusing on the less general (but still quite interesting!) successor features.

As usual, we will work with finite MDPs for simplicity. We will write $\mathcal{M} = (\mathcal{S}, \mathcal{A}, p)$ for some reward-free MDP, with $\mathrm{S}=\vert\mathcal{S}\vert$ and $\mathrm{A}=\vert\mathcal{A}\vert$. It can be further specified by any reward function $r\in\mathbb{R}^{\mathrm{S}\times\mathrm{A}}$.

Successor Features

$\quad$ Many results in this section are consequences of more general properties inherited from successor measures. For the sake of completeness, we will derive them from first principles.

Policy evaluation

Let $\phi : \mathcal{S}\times\mathcal{A}\mapsto\mathbb{R}^d$ be some given feature map. For any stationary policy $\pi$, we define its successor features: $$ \tag{2} \psi_\lambda^\pi(s, a) := \mathbb{E}_{s, a}^\pi\Big[\sum_{t\geq 1} \lambda^{t-1}\phi(s_t, a_t)\Big]\;, $$ for any state-action pair $(s, a)\in\mathcal{S}\times\mathcal{A}$. Observe that successor features are only concerned with the dynamics $p$ of $\mathcal{M}$. They become useful when considering rewards that are linear in $\phi$. Concretely, we will now consider reward signals $r\in\mathbb{R}^{\mathrm{S}\times\mathrm{A}}$ where for any $(s, a)\in \mathcal{S}\times\mathcal{A}$: $$ \tag{3} r(s, a) = \theta^\top\phi(s, a)\;, $$ for some $\theta\in\mathbb{R}^d$. For any MDP $\mathcal{M}_\theta$ yielded by the specification of $\mathcal{M}$ with rewards parametrised by $\theta$, there is a direct link between the values of policies and their successor features.

$\qquad\qquad\qquad\qquad\quad \text{The value of any policy } \pi \text{ is linear in its successor features; for all } (s, a)\in\mathcal{S}\times\mathcal{A}$: $$ \tag{4} q_\lambda^\pi(s, a)= \theta^\top \psi_\lambda^\pi(s, a)\;. $$
Successor features
Proof

This has direct consequences for policy evaluation. Successor features can be used to evaluate $\pi$ under any reward signal satisfying (3) without having to experience said rewards; the only cost is a scalar product.
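
To make this concrete, here is a minimal numpy sketch on a hypothetical random finite MDP (all names and dimensions are illustrative). It computes $\psi_\lambda^\pi$ in closed form via the Neumann series implied by (2), then checks equation (4) against direct policy evaluation on the reward $r = \Phi\theta$.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, d, lam = 5, 3, 4, 0.9

p = rng.random((S, A, S)); p /= p.sum(-1, keepdims=True)   # dynamics p(s'|s,a)
pi = rng.random((S, A)); pi /= pi.sum(-1, keepdims=True)   # stationary policy pi(a|s)
Phi = rng.standard_normal((S * A, d))                      # feature map, rows indexed by (s,a)
theta = rng.standard_normal(d)                             # reward parameters

# State-action transition matrix P[(s,a),(s',a')] = p(s'|s,a) * pi(a'|s')
P = np.einsum("ijk,kl->ijkl", p, pi).reshape(S * A, S * A)

# Successor features: Psi = sum_t lam^{t-1} P^{t-1} Phi = (I - lam P)^{-1} Phi
Psi = np.linalg.solve(np.eye(S * A) - lam * P, Phi)

# Direct policy evaluation on the reward r = Phi @ theta
q_direct = np.linalg.solve(np.eye(S * A) - lam * P, Phi @ theta)

assert np.allclose(Psi @ theta, q_direct)   # equation (4)
```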

Bellman equations

Successor features satisfy some Bellman-like equations. Observe that successor features map $\mathcal{S}\times\mathcal{A}$ to $\mathbb{R}^d$, hence they belong to $\mathbb{R}^{d\times(\mathrm{S}\times\mathrm{A})}$. Introduce the Bellman operator $\mathcal{T}_\lambda^{\pi, \phi} : \mathbb{R}^{d\times(\mathrm{S}\times\mathrm{A})} \mapsto \mathbb{R}^{d\times(\mathrm{S}\times\mathrm{A})}$ where for any function $\psi\in\mathbb{R}^{d\times(\mathrm{S}\times\mathrm{A})}$ and state-action pair $(s, a)$: $$ \mathcal{T}_\lambda^{\pi, \phi}(\psi)(s, a) = \phi(s, a) + \lambda \sum_{s^\prime,a^\prime} p(s^\prime\vert s, a)\pi(a^\prime\vert s^\prime)\psi(s^\prime, a^\prime)\;. $$ This operator enjoys the usual properties of Bellman operators: it is a contraction, and $\psi_\lambda^\pi$ is a fixed point.

$\qquad\qquad\qquad\qquad\quad \text{For any stationary policy }\pi,\text{ its successor features } \psi_\lambda^\pi \text{ are the only fixed point of }\mathcal{T}_\lambda^{\pi, \phi}.$ $$ \tag{5} \mathcal{T}_\lambda^{\pi, \phi}(\psi_\lambda^\pi)=\psi_\lambda^\pi\;. $$
Bellman equations
Proof

In short, successor features can be learned via classical dynamic programming mechanisms. When coupled with stochastic approximation (sampling) and function approximation (say, neural networks), all the usual tricks (Bellman error minimisation, replay buffers, target networks, etc.) remain valid. In deep reinforcement learning terms, it is safe to think of learning successor features as roughly similar to implementing one of the many variants of DQN (except here for policy evaluation, not control).
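
As a small illustration, one sampled stochastic-approximation step towards this fixed point could look like the following sketch (assumptions of the sketch: a tabular `psi` of shape $(\mathrm{S}, \mathrm{A}, d)$, a feature table `phi`, and an on-policy transition $(s_t, a_t, s_{t+1}, a_{t+1})$ collected while following $\pi$).

```python
import numpy as np

def td_update_sf(psi, phi, s, a, s_next, a_next, lam=0.9, alpha=0.1):
    """One TD(0) step on the successor features of the behaviour policy pi.

    psi: array of shape (S, A, d); phi: array of shape (S, A, d);
    (s, a, s_next, a_next): an on-policy transition.
    """
    target = phi[s, a] + lam * psi[s_next, a_next]   # sampled application of the operator in (5)
    psi[s, a] += alpha * (target - psi[s, a])        # move the estimate towards the target
    return psi
```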

Transfer

Below, we follow [1] and apply successor features to the transfer problem. Concretely, we assume access to some pre-trained policies $\{\pi_1, \ldots, \pi_n\}$. For instance, each can be a (near-)optimal policy for the reward $r_i(s, a) = \theta_i^\top\phi(s, a)$. As we emphasised earlier, evaluating each $\pi_i$ on $\mathcal{M}_\theta$ is immediate: $$ q_\lambda^{\pi_i}(s, a) = \theta^\top \psi_\lambda^{\pi_i}(s, a). $$ A naive (but effective) way to use this to our advantage is to perform this evaluation at the starting state and greedily follow the highest-valued policy. The authors of [1] show that we can do better by extending policy improvement to several base policies; concretely, by forming $\pi$ such that for all $s\in\mathcal{S}$: $$ \tag{6} \pi(s) \in \argmax_{a\in\mathcal{A}}\max_{i=1, \ldots, n} \theta^\top \psi_\lambda^{\pi_i}(s, a)\;. $$ This guarantees that $\pi$ dominates any of the $\pi_i$ on $\mathcal{M}_\theta$. If the base policies $\{\pi_i\}_i$ can indeed be “stitched” together to yield a performant policy on $\mathcal{M}_\theta$, transfer will be successful.

$\qquad\qquad\; \text{Let } \pi\text{ be given by (6)}. \text{ Then, for any }i\in\{1, \ldots, n\} \text{ and }(s, a)\in\mathcal{S}\times\mathcal{A}:$ $$ q_\lambda^{\pi}(s, a) \geq q_\lambda^{\pi_i}(s, a)\;. $$
Transfer
Proof
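
In code, acting according to (6) only requires the stored successor features of the base policies and the new task vector $\theta$. A minimal numpy sketch, assuming the $\psi_\lambda^{\pi_i}$ are stacked in a (hypothetical) array `psis` of shape $(n, \mathrm{S}, \mathrm{A}, d)$:

```python
import numpy as np

def gpi_action(psis, theta, s):
    """Act greedily w.r.t. max_i theta^T psi_i(s, a), as in (6)."""
    q = psis[:, s] @ theta                 # shape (n, A): value of each base policy at s
    return int(q.max(axis=0).argmax())     # best action under the best base policy
```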

Universal SFs

The previous section leveraged successor features’ fast policy evaluation, which indirectly unlocks fast (multi-)policy improvement and, as a result, transfer. In this section, we follow [2] and focus instead on how we can directly train some zero-shot capabilities. Below, we denote by $\pi_\theta^\star$ the optimal policy of the MDP $\mathcal{M}_\theta$. We are now interested in directly learning to predict: $$ \psi_\lambda^{\pi_\theta^\star}(s, a) = \mathbb{E}_{s, a}^{\pi_\theta^\star}\left[ \sum_{t\geq 1}\lambda^{t-1}\phi(s_t, a_t)\right]\;, $$ the successor features of the optimal policy $\pi_\theta^\star$, without any interaction with $\mathcal{M}_\theta$, of course. Doing so will require learning on a variety of tasks (several values of $\theta$), so that we can hope to generalise zero-shot to new ones. Let the universal successor feature be: $$ \begin{aligned} \psi\, : \,\mathcal{S}\times\mathcal{A}\times\mathbb{R}^d &\mapsto \mathbb{R}^d \;,\\ (s, a, \theta) &\mapsto \psi_\lambda^{\pi_\theta^\star}(s, a)\;. \end{aligned} $$

It also satisfies a Bellman equation. Indeed, for all $(s, a)\in\mathcal{S}\times\mathcal{A}$ and all $\theta\in\mathbb{R}^d$:

$$ \psi(s, a, \theta) = \phi(s, a) + \lambda\sum_{s^\prime}p(s^\prime\vert s, a)\psi\big(s^\prime, \argmax_{a^\prime}\theta^\top\psi(s^\prime, a^\prime, \theta), \theta\big)\;. $$ Finally, observe that the optimal policy $\pi_\theta^\star$ for $\mathcal{M}_\theta$ can readily be retrieved from the universal successor feature; because of its ties to the state-action value function, we have that $ \pi_\theta^\star(s) \in \argmax_{a\in\mathcal{A}} \theta^\top \psi(s, a, \theta)\; $.
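
For illustration, assuming `usf(s, a, theta)` is some learned approximator of the universal successor features (the name and signature are assumptions of this sketch), greedy action selection reads:

```python
import numpy as np

def greedy_action(usf, theta, s, actions):
    """pi_theta(s) in argmax_a theta^T psi(s, a, theta), via an approximator of psi."""
    q_values = [theta @ usf(s, a, theta) for a in actions]   # approximate optimal q-values
    return actions[int(np.argmax(q_values))]
```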

About universal value functions

We provide below some pseudocode to illustrate the learning of universal successor features. Using standard “tricks” (replay buffer, target networks, etc.) is recommended, but skipped here for clarity. The point is to illustrate the main features that arise when training $\psi$. This naive algorithm iterates through tasks by sampling $\theta\in\mathbb{R}^d$. Because we decoupled dynamics from rewards, one can train $\psi$ for several values of $\theta$ even when executing on a single task. We actually use the sampled task only to guide our exploration policy, as we will pick actions in an (almost) greedy fashion w.r.t. $\theta^\top\psi(s, a, \theta)$, a.k.a. the (approximated) q-values of the optimal policy for the MDP parametrised by $\theta$.

$\textbf{init } \text{learning rate } \alpha, \text{ number of policies to train per task } K, \text{ task spread } \sigma, \text{ exploration rate } \varepsilon, \text{ episode length } T\\$ $\textbf{while } \text{has not converged}\\$ $\quad \text{ sample task }\theta\in\mathbb{R}^d\\$ $\quad \text{ sample which policies to train for } \theta_1, \ldots, \theta_K\overset{\text{i.i.d}}{\sim}\mathcal{N}(\theta, \sigma \mathbf{I}_d)\\$ $\quad\text{\color{CadetBlue}\# collect episode}\\$ $\quad\textbf{for } t=1,\ldots, T\\$ $\quad\quad\text{observe } s_t\\$ $\quad\quad\text{\color{CadetBlue}\# $\varepsilon$-greedy exploration policy with (6)}\\$ $\quad\quad\text{pick } a_t \leftarrow \argmax_a\max_{k=1, \ldots, K} \theta^\top \psi(s_t, a, \theta_k) \text{ with probability } 1-\varepsilon, \text{ else } a_t\leftarrow \mathcal{U}(\mathcal{A})\\$ $\quad\quad\text{observe } s_{t+1}\\$ $\quad\quad\textbf{for } k=1,\ldots, K\\$ $\quad\quad\quad\text{\color{CadetBlue}\# form the regression targets}\\$ $$ y_k \leftarrow \phi(s_t, a_t) + \lambda\psi(s_{t+1}, \argmax_a \theta_k^\top \psi(s_{t+1}, a, \theta_k), \theta_k)\;. $$ $\quad\quad\textbf{end for}\\$ $\quad\quad\text{\color{CadetBlue}\# gradient step}\\$ $$ \psi \leftarrow \psi - \alpha \nabla_\psi \sum_{k=1}^K \left\|\psi(s_t, a_t, \theta_k) - y_k\right\|_2^2\;. $$ $\quad\textbf{end for}\\$ $\textbf{end while}\\$ $\textbf{return } \psi$
$\texttt{Learning universal SFs}$
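
For concreteness, below is a self-contained PyTorch sketch of this training loop on a hypothetical toy tabular MDP (random dynamics, random features, next states sampled directly from the known dynamics in place of an environment object). The architecture, hyper-parameters, and all names are illustrative assumptions, not the reference implementation of [2]; replay buffers and target networks are again omitted.

```python
import numpy as np
import torch
import torch.nn as nn

# Hypothetical toy tabular MDP: random dynamics and random features.
S, A, d, lam = 20, 4, 8, 0.95
rng = np.random.default_rng(0)
P = rng.random((S, A, S)); P /= P.sum(-1, keepdims=True)   # dynamics p(s'|s,a)
Phi = torch.randn(S, A, d)                                 # feature map phi(s,a)

def one_hot(s):
    x = torch.zeros(S); x[s] = 1.0
    return x

class USF(nn.Module):
    """psi(s, ., theta): returns the d-dim successor features of every action."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(S + d, 128), nn.ReLU(), nn.Linear(128, A * d))
    def forward(self, s, theta):
        return self.net(torch.cat([one_hot(s), theta])).view(A, d)

usf = USF()
opt = torch.optim.Adam(usf.parameters(), lr=1e-3)
K, T, eps, sigma = 5, 50, 0.1, 0.1   # policies per task, episode length, exploration, spread

for episode in range(200):
    theta = torch.randn(d)                                        # sample a task
    thetas = [theta + sigma * torch.randn(d) for _ in range(K)]   # which policies to train
    s = int(rng.integers(S))
    for t in range(T):
        # eps-greedy exploration with the GPI rule (6) over the K sampled policies
        with torch.no_grad():
            q = torch.stack([usf(s, tk) @ theta for tk in thetas])        # (K, A)
        a = int(q.max(0).values.argmax()) if rng.random() > eps else int(rng.integers(A))
        s_next = int(rng.choice(S, p=P[s, a]))                    # environment step
        # regression targets and gradient step
        loss = torch.zeros(())
        for tk in thetas:
            with torch.no_grad():
                psi_next = usf(s_next, tk)                        # (A, d)
                a_star = int((psi_next @ tk).argmax())
                y = Phi[s, a] + lam * psi_next[a_star]
            loss = loss + ((usf(s, tk)[a] - y) ** 2).sum()
        opt.zero_grad(); loss.backward(); opt.step()
        s = s_next
```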

Feature selection

So far, we have assumed that the feature map $\phi$ and the reward vector $\theta$ are known. The former assumption holds when state featurisation is “easy”; e.g. by building event detectors for picking up a relevant object, moving in a specific direction, etc. Once we have a feature map, estimating $\theta$ for a given reward signal can naively be achieved by regression on some buffered data.
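
For instance, with a (hypothetical) buffer of observed features `Phi_batch` of shape $(N, d)$ and rewards `r_batch` of shape $(N,)$, this naive regression is just a least-squares problem:

```python
import numpy as np

def estimate_theta(Phi_batch, r_batch):
    """Recover theta from r(s, a) = theta^T phi(s, a) by least squares."""
    theta, *_ = np.linalg.lstsq(Phi_batch, r_batch, rcond=None)
    return theta
```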

However, in general, designing $\phi$ from scratch is arguably hard. It is desirable to extract it directly from data, e.g. from trajectories collected arbitrarily, or via some pre-trained policies. Providing a comprehensive list of self-supervised criteria to learn $\phi$ is beyond the scope of this post; such a list can be found directly in [4]. There, successor features are compared to “forward-backward” representations, which we studied in the first post of this series. The link between the two approaches is the next post’s topic!

References

[1] Successor Features for Transfer in Reinforcement Learning, Barreto et al. 2016.
[2] Universal Successor Features Approximators, Borsa et al. 2019.
[3] Universal Value Function Approximators, Schaul et al. 2015.
[4] Does Zero-Shot Reinforcement Learning Exist? Touati et al. 2023.

and references from the previous post of this series.