From VI to DQN
The goal of this blog post is to walk the path that saw the Value Iteration algorithm mature into a variety of modern value-based reinforcement learning methods empowered by deep learning. In particular, we will first see how VI gave birth to the Q-learning algorithm. We will then see how this tabular algorithm was gradually modified to handle function approximation and give rise to the watershed DQN algorithm.
Let $\mathcal{M} = (\mathcal{S}, \mathcal{A}, p, r)$ be the MDP of interest. We will assume for simplicity that $\mathcal{M}$ is finite: $\vert \mathcal{S} \vert \times \vert \mathcal{A}\vert < \infty$. We will denote $\mathbb{R}^{\mathcal{S}}$ (resp. $\mathbb{R}^{\mathcal{S}\times\mathcal{A}}$) the space of functions mapping $\mathcal{S}$ (resp. $\mathcal{S}\times\mathcal{A}$) to $\mathbb{R}$. Finally, to match our initial exposition of the VI algorithm, we will consider the discounted criterion to evaluate policies.
The qVI algorithm
See here for a refresher on the Value Iteration (VI) algorithm and its convergence properties. For reasons that will soon become clear, we will here rely on an alternative version of VI, denoted qVI, which tracks the optimal state-action value function $q_\lambda^\star$ instead of its state-only counterpart $v_\lambda^\star$.
Formally, let $\mathcal{T}_\lambda^\star : \mathbb{R}^{\mathcal{S}\times\mathcal{A}} \mapsto \mathbb{R}^{\mathcal{S}\times\mathcal{A}}$ such that for any $q\in \mathbb{R}^{\mathcal{S}\times\mathcal{A}}$: $$ \tag{1} \mathcal{T}_\lambda^\star(q)(s, a) = r(s, a) + \lambda \sum_{s^\prime\in\mathcal{S}} p(s^\prime\vert s, a) \max_{a^\prime\in\mathcal{A}} q(s^\prime, a^\prime) \qquad \forall s, a \in \mathcal{S}\times\mathcal{A}\; . $$ The update rule for qVI is defined by $ q_{t+1} = \mathcal{T}_\lambda^\star(q_t)\; . $ We detail below the resulting pseudocode.
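Here is a minimal sketch of what this pseudocode could look like, written in Python with NumPy (all names below are ours): `P` holds the transition kernel, `R` the reward table, and we simply apply $\mathcal{T}_\lambda^\star$ for a fixed number of iterations.

```python
import numpy as np

def qvi(P, R, lam, num_iters=1_000):
    """Q-value iteration: repeatedly apply the optimal Bellman operator.

    P: transition kernel, shape [S, A, S], with P[s, a, s'] = p(s' | s, a)
    R: reward table, shape [S, A]
    lam: discount factor in [0, 1)
    """
    S, A = R.shape
    q = np.zeros((S, A))
    for _ in range(num_iters):
        # T*(q)(s, a) = r(s, a) + lam * sum_s' p(s' | s, a) * max_a' q(s', a')
        q = R + lam * P @ q.max(axis=1)
    return q
```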
It is easy to show, by the same fixed-point argument that we used for the state value-function, that
the sequence of iterates produced by qVI converges to the optimal $q_\lambda^\star$:
$$
q_{t}(s, a) \underset{t\to\infty}{\longrightarrow} q_\lambda^\star(s, a)\; .
$$
qVI is a planning algorithm: it relies on explicit knowledge of both the reward signal $r$ and the transition kernel $p$. Naturally, this is a fairly restrictive assumption that fails to hold in many settings.
Stochastic approximation and Q-learning
We will now lift the assumption that $r$ and $p$ are known. Instead, we will suppose that we can only access them via sampling. Formally, we consider the following sequential interaction setting. At every round, we submit an action $a_t$ and observe both the reward $r(s_t, a_t)$ and the next state $s_{t+1} \sim p(\cdot\vert s_t, a_t)$. We then submit $a_{t+1}$, observe $r(s_{t+1}, a_{t+1})$ and $s_{t+2}$, and so on.
Let us define the operator $\hat{\mathcal{T}}_\lambda: \mathbb{R}^{\mathcal{S}\times\mathcal{A}} \mapsto \mathbb{R}^{\mathcal{S}\times\mathcal{A}}$ such that for any $q\in\mathbb{R}^{\mathcal{S}\times\mathcal{A}}$: $$ \hat{\mathcal{T}}_\lambda(q)(s_t, a_t) = r(s_t, a_t) + \lambda \max_{a^\prime\in\mathcal{A}} q(s_{t+1}, a^\prime)\; . $$ Observe that: $$ \begin{aligned} \mathbb{E}\left[\hat{\mathcal{T}}_\lambda(q)(s_t, a_t)\,\middle\vert\, s_t, a_t\right] &= \mathbb{E}\left[r(s_t, a_t) + \lambda \max_{a^\prime\in\mathcal{A}} q(s_{t+1}, a^\prime)\,\middle\vert\, s_t, a_t\right]\;, \\ &= r(s_t, a_t) + \lambda \sum_{s^\prime\in\mathcal{S}} p(s^\prime\vert s_t, a_t) \max_{a^\prime\in\mathcal{A}} q(s^\prime, a^\prime) \;, \\ &= \mathcal{T}_\lambda^\star(q)(s_t, a_t)\; . \end{aligned} $$ That is, $\mathbb{E}[\hat{\mathcal{T}}_\lambda(q)\vert s_t, a_t] = \mathcal{T}_\lambda^\star(q)(s_t, a_t)$: we now have access to a noisy but unbiased version of the Bellman operator $\mathcal{T}_\lambda^\star$. To understand how qVI should be modified to fit this new setting, we must talk about stochastic approximation algorithms.
Stochastic Approximation
The stochastic approximation problem refers to the computation of a root $x^\star$ of some operator $F$, when $F$ can only be queried up to some noise: $ \hat{F}(x) := F(x) + \varepsilon$ with $\mathbb{E}\left[\varepsilon\right] = 0 \; . $ The Robbins-Monro algorithm is an iterative algorithm which builds a sequence $\{x_t\}_t$ as follows: $$ x_{t+1} = x_t - \alpha_t \hat{F}(x_t)\;, $$ where $\{\alpha_t\}$ follows the so-called tapering conditions: $\sum_{t\geq 0} \alpha_t = +\infty$ and $\sum_{t\geq 0} \alpha_t^2 < +\infty$. Then, under appropriate conditions on $F$ and the noise’s distribution, we have the almost-sure convergence $x_t \overset{\text{a.s}}{\to} x^\star$. In our case, we are looking for the unique fixed-point of $\mathcal{T}_\lambda^\star$, or equivalently, for the unique root of the operator $\mathcal{B}_\lambda^\star := \mathcal{T}_\lambda^\star - \text{Id}$, whose noisy counterpart we denote $\hat{\mathcal{B}}_\lambda := \hat{\mathcal{T}}_\lambda - \text{Id}$. The Robbins-Monro algorithm applied to $\text{Id} - \hat{\mathcal{T}}_\lambda$ (which shares the same root) suggests maintaining the sequence: $$ \tag{2} q_{t+1} = q_t + \alpha_t(\hat{\mathcal{T}}_\lambda - \text{Id}) q_t \; . $$ For a given state-action couple $(s, a)$ this yields: $$ \begin{aligned} q_{t+1}(s, a) &= q_t(s, a) + \alpha_t \big[r(s, a) + \lambda \max_{a^\prime\in\mathcal{A}} q_t(s^\prime, a^\prime) - q_t(s, a) \big]\; , &(s^\prime\sim p(\cdot\vert s, a))\\ &= q_t(s, a) + \alpha_t \delta_t(s, a) \end{aligned} $$ where $\delta_t(s, a):=r(s, a) + \lambda \max_{a^\prime\in\mathcal{A}} q_t(s^\prime, a^\prime) - q_t(s, a)$ is often referred to as the temporal-difference error. Equation $\text{(2)}$ is, in essence, the Q-learning algorithm, which is therefore not much more than “just” the stochastic approximation of the qVI algorithm (up to some details).
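As a toy illustration (not part of the original exposition), here is the Robbins-Monro scheme finding the root of $F(x) = x - c$ from noisy queries only, with tapering step sizes $\alpha_t = 1/(t+1)$:

```python
import numpy as np

rng = np.random.default_rng(0)
c = 3.0                       # the (unknown) root of F(x) = x - c

def noisy_F(x):
    # Unbiased noisy query: F(x) + eps, with E[eps] = 0
    return (x - c) + rng.normal()

x = 0.0
for t in range(10_000):
    alpha = 1.0 / (t + 1)     # tapering: sum alpha_t = inf, sum alpha_t^2 < inf
    x = x - alpha * noisy_F(x)

print(x)                      # close to the root c = 3.0
```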
Q-Learning
We provide below the pseudocode for Q-learning, before moving to its convergence properties.
How we visit the MDP is dictated by a so-called behavioral policy $\pi_\beta= (d_\beta, d_\beta, \ldots)$,
which is
independent of the optimal $\pi^\star$. For this reason, Q-learning is an off-policy algorithm.
There are only a few constraints imposed on this data-collection policy (see below).
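A sketch of this pseudocode in Python, under a few assumptions of ours: the environment exposes a Gymnasium-style `reset()`/`step()` interface with discrete states and actions, the behaviour policy is $\varepsilon$-greedy with respect to the current estimate (one valid choice among many), and the step sizes are tapered per state-action pair.

```python
import numpy as np

def q_learning(env, num_states, num_actions, lam=0.99, eps=0.1, num_steps=100_000):
    rng = np.random.default_rng(0)
    q = np.zeros((num_states, num_actions))
    visits = np.zeros((num_states, num_actions))      # per-pair visit counts
    s, _ = env.reset()
    for _ in range(num_steps):
        # Behaviour policy: epsilon-greedy w.r.t. the current q estimate
        if rng.random() < eps:
            a = int(rng.integers(num_actions))
        else:
            a = int(np.argmax(q[s]))
        s_next, r, terminated, truncated, _ = env.step(a)
        visits[s, a] += 1
        alpha = 1.0 / visits[s, a]                    # tapering step size
        # Temporal-difference target, then Q-learning update
        target = r if terminated else r + lam * q[s_next].max()
        q[s, a] += alpha * (target - q[s, a])
        s = s_next if not (terminated or truncated) else env.reset()[0]
    return q
```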
Notice how, under the Q-learning algorithm, the update at round $t$ writes: $$ \begin{aligned} q_{t+1}(s, a) &= q_{t}(s, a) + \alpha_t\hat{\mathcal{B}}_\lambda(q_{t})(s, a) \mathbf{1}\left[s_t=s, a_t=a\right]\; ,\\ &= q_{t}(s, a) +\alpha_t \underbrace{(\hat{\mathcal{B}}_\lambda-\mathcal{B}_\lambda^\star)(q_{t})(s, a)}_{\mathbb{E}[\,\cdot\,\vert s_t, a_t] = 0} \mathbf{1}\left[s_t=s, a_t=a\right] + \alpha_t\mathcal{B}_\lambda^\star (q_{t})(s, a) \mathbf{1}\left[s_t=s, a_t=a\right] \; , \end{aligned} $$ where $\mathbf{1}[\cdot]$ is the indicator function. Therefore: $$ \begin{aligned} \mathbb{E}[q_{t+1}(s, a)\vert q_t, s_t, a_t] &= q_{t}(s, a) + \alpha_t \mathcal{B}_\lambda^\star (q_t)(s, a) \mathbf{1}\left[s_t=s, a_t=a\right]\; ,\\ &= q_{t}(s, a) + \alpha_t \mathcal{B}_\lambda^\star (q_t)(s, a) \text{ only when } s=s_t \text{ and } a = a_t\; . \end{aligned} $$ Unfortunately, this does not perfectly match the description of the Robbins-Monro algorithm, which required our surrogate to be an unbiased estimate of the noiseless operator whose root we are looking for. Indeed, $\mathbb{E}[q_{t+1}(s, a)\vert q_t, s_t, a_t] \neq q_{t}(s, a) + \alpha_t \mathcal{B}_\lambda^\star (q_t)(s, a)$ for any couple $(s, a) \neq (s_t, a_t)$. Fortunately, this unbiasedness assumption can be relaxed, asking only for the bias to be “small” enough for it to vanish over time. We state below the proper theorem for the sake of completeness:
Function approximation
Stochastic approximation and the Q-learning algorithm free us from requiring precise knowledge of the reward $r$ and the transition kernel $p$ – it now suffices to be able to query them for new samples. Practical implementations of Q-learning are, however, still limited by their 1) memory footprint and 2) generalisation abilities.
Regarding 1), the main culprits are the q-values themselves: there are $\vert \mathcal{S}\vert \times \vert \mathcal{A} \vert$ of them, and many relevant environments have the bad idea of coming with vast state spaces (i.e. $\vert \mathcal{S} \vert \gg 1$) – if not infinite ($\mathcal{S} = \mathbb{R}^d$). When it comes to 2), observe that the tabular representation of the q-function does not allow for permeability between states, since we have no way of saying that two states are “close”. Equipping the state space with some structure would allow transferring knowledge acquired in one state to similar ones, effectively reducing Q-learning’s sample complexity.
The typical way forward is to let go of the tabular representation of the q-values and adopt a parametric one. For concreteness, assume the existence of a feature map $ \phi : \, \mathcal{S}\times\mathcal{A} \mapsto \mathbb{R}^d \ $ where $d \ll \vert \mathcal{S} \vert$. We will search for $q_\lambda^\star$ (or at least a good approximation of it) within the class of linear functions: $$ \mathcal{F} := \{ (s, a) \mapsto \theta^\top\phi(s, a), \, \theta \in \Theta\} $$ where $\Theta$ is a compact subset of $\mathbb{R}^d$. Below, we’ll denote $q_\theta$ the function that assigns $\theta^\top\phi(s, a)$ to the couple $(s, a)\in\mathcal{S}\times\mathcal{A}$. Now comes a design choice: what is a “good” $\theta^\star \in \Theta$ to represent $q_\lambda^\star$? Explicitly enforcing a small representation error $\lVert q_{\theta^\star} - q_\lambda^\star \rVert_\infty$ is of course infeasible, as this would require knowledge of $q_\lambda^\star$ itself. An implicit way to achieve this kind of control is by minimizing the so-called Bellman residuals: $$ \tag{3} \theta^\star \in \argmin_\theta \lVert q_\theta - \mathcal{T}_\lambda^\star (q_\theta)\rVert_\infty \; . $$ Observe that if $q_\lambda^\star \in \mathcal{F}$, then $q_{\theta^\star} = q_\lambda^\star$ – the Bellman residual of $q_\lambda^\star$ is 0. Further, we can bound our approximation error in terms of the best-in-class approximation error of $\mathcal{F}$: $$ \lVert q_{\theta^\star} - q_\lambda^\star \rVert_\infty \leq \frac{1+\lambda}{1-\lambda} \min_{q_\theta \in \mathcal{F}} \lVert q_\theta - q_\lambda^\star \rVert_\infty \; . $$
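For concreteness, a tiny sketch of such a linear parametrisation (the feature map below is made up purely for illustration):

```python
import numpy as np

def phi(s, a):
    """Hypothetical feature map phi : S x A -> R^d, here with d = 6."""
    s, a = float(s), float(a)
    return np.array([1.0, s, a, s * a, s**2, a**2])

def q_theta(theta, s, a):
    # Linear q-function: q_theta(s, a) = theta^T phi(s, a)
    return theta @ phi(s, a)

theta = np.zeros(6)
print(q_theta(theta, s=2, a=1))   # 0.0 for the zero parameter vector
```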
The objective $\text{(3)}$ is therefore a promising candidate for finding meaningful approximations of $q_\lambda^\star$ supported by a low-dimensional parameter. However, it is not continuously differentiable and does not lend itself easily to gradient-based optimisation. Instead, we will pursue an $\ell_2$ objective, more amenable to numerical optimisation techniques: $$ \theta^\star \in \argmin_{\theta} \, \lVert q_\theta - \mathcal{T}_\lambda^\star(q_\theta) \rVert_2 \; . $$
Of course, at this point, if we judged that $\vert \mathcal{S}\vert$ was large enough to motivate function approximation, surely we cannot plan on using the Bellman operator $\mathcal{T}_\lambda^\star$ – it costs roughly $\mathcal{O}(\vert\mathcal{S}\vert^2\cdot\vert\mathcal{A}\vert)$ operations to compute. It’s time for stochastic approximation to kick in again. Materialising the full $\ell_2$-norm is also questionable: instead, we have to settle for minimising the Bellman residuals over a much smaller set of states and actions. Concretely, for a given dataset $\mathcal{D} = \{(s_i, a_i, s^\prime_i)\}_{i=1}^n$ with $s^\prime_i \sim p(\cdot\vert s_i, a_i)$, we compute $\theta^\star \in \argmin_\theta J(\theta)$ where: $$ \tag{4} \begin{aligned} J(\theta) &:=\sum_{(s, a, s^\prime)\in\mathcal{D}} (q_\theta(s, a) - \hat{\mathcal{T}}_\lambda (q_\theta)(s, a))^2 \; , \\ &= \sum_{(s, a, s^\prime)\in\mathcal{D}} (q_\theta(s, a) - r(s, a) - \lambda \max_{a^\prime\in\mathcal{A}} q_\theta(s^\prime, a^\prime) )^2\;. \end{aligned} $$
Fitted Q-iteration
This optimisation program $\text{(4)}$ is still somewhat funny looking – namely because of the term $\max_{a^\prime\in\mathcal{A}} q_\theta(s^\prime, a^\prime)$. One way to resolve this is to use a reference parameter $\theta_\text{ref}$ to produce said targets. Concretely, we would be interested in computing $\theta\in\argmin_\theta J(\theta, \theta_\text{ref})$ where: $$ J(\theta, \theta_\text{ref}) =\sum_{(s, a, s^\prime)\in\mathcal{D}} (q_\theta(s, a) - r(s, a) - \lambda \max_{a^\prime\in\mathcal{A}} q_{\theta_\text{ref}}(s^\prime, a^\prime) )^2\; . $$ Of course, to provide a reasonable candidate for our fixed-point problem, one must have $\theta_\text{ref} \approx \theta$. The fitted Q-iteration algorithm (introduced by [Ernst, 2005] with tree-based function approximation) leverages this rationale by iteratively applying: $$ \tag{5} \begin{aligned} \theta_{t+1} &= \mathcal{T}_\lambda^\text{FQ}(\theta_t) \;, \\ &= \argmin_\theta J(\theta, \theta_t) \; . \end{aligned} $$ For our linearly parametrised example $q_\theta(\cdot) = \theta^\top\phi(\cdot)$, this boils down to solving a $d$-dimensional linear system. For completeness, we give below some pseudocode for the resulting linearly fitted Q-iteration. The action selection process is typically $\varepsilon$-greedy: $$ a \sim \pi_\theta^\varepsilon(\cdot\vert s) \text{ where } \pi_\theta^\varepsilon(a\vert s) = \left\{\begin{aligned} 1 - \varepsilon &\text{ if } a\in\argmax_{a^\prime} q_\theta(s, a^\prime)\;,\\ \frac{\varepsilon}{\vert \mathcal{A}\vert - 1} &\text{ otherwise.}\end{aligned}\right. $$
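A sketch of what this pseudocode could look like, assuming (our notation) that `D` is a list of `(s, a, r, s_next)` transitions collected with $\pi_\theta^\varepsilon$, that `phi(s, a)` returns the $d$-dimensional feature vector, and that `actions` enumerates $\mathcal{A}$; each iteration solves the $d$-dimensional least-squares problem in closed form.

```python
import numpy as np

def fitted_q_iteration(D, phi, actions, d, lam=0.99, num_iters=50, ridge=1e-6):
    """Linearly fitted Q-iteration on a fixed dataset D of (s, a, r, s_next) tuples."""
    theta = np.zeros(d)
    Phi = np.stack([phi(s, a) for (s, a, _, _) in D])       # [n, d] regression design
    for _ in range(num_iters):
        # Regression targets produced by the reference parameter theta_ref = theta_t
        y = np.array([
            r + lam * max(phi(s_next, a2) @ theta for a2 in actions)
            for (_, _, r, s_next) in D
        ])
        # theta_{t+1} = argmin_theta ||Phi theta - y||_2^2 (small ridge term for stability)
        theta = np.linalg.solve(Phi.T @ Phi + ridge * np.eye(d), Phi.T @ y)
    return theta
```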
The DQN breakthrough
Fitted Q-iteration could, in theory, support deep neural-network function approximation. (Actually, people have made it work in this setting, but mostly for toy environments.) The main appeal is, of course, to search within much richer function classes without having to undertake painful feature engineering when, say, the state space $\mathcal{S}$ is the visual rendering of some Atari game. One can list a few obstacles standing in the way of neural-network-powered Q-learning, partially resolved by the original DQN paper.
The first one is 1) computational: solving Eq. $\text{(5)}$ at each round is challenging, to say the least – because there is no longer a closed form, we are talking about solving a non-convex program at each round. Instead, one can settle for a one-step stochastic gradient descent update. When considering a single sampled transition $(s, a, r, s^\prime)$, this would look like $\theta_{t+1} = \theta_t - \alpha \Delta\theta_t$ where: $$ \tag{6} \begin{aligned} \Delta\theta_{t} &\propto \nabla_{\theta}\left(q_\theta(s, a) - r - \lambda\max_{a^\prime}q_{\theta_t}(s^\prime, a^\prime)\right)^2\Big\vert_{\theta=\theta_t}\;, \\ &\propto \left(q_{\theta_t}(s, a) - r - \lambda\max_{a^\prime}q_{\theta_t}(s^\prime, a^\prime)\right) \nabla_{\theta} q_\theta(s, a)\Big\vert_{\theta=\theta_t} \; . \end{aligned} $$ Now, this update should be computed on a minibatch of experience instead of a single transition. Drawing this minibatch from the freshest chunk of experience is delicate, as those samples are highly correlated by nature (temporal correlation). This could lead to dangerous, spurious updates. The DQN paper solves this by sampling experience uniformly at random within $\mathcal{D}$, which will now be re-baptised the replay buffer.
We just touched on the second main challenge, which is tied to 2) stability. Because of their representational capacity, training neural networks to find fixed-points is tough: as we just saw, this involves having the neural network provide its own regression targets. Since said neural network keeps getting updated, this amounts to tracking non-stationary targets. Therefore, bluntly applying $\text{(6)}$ leads to oscillations, divergences, and overall disappointing performance. The DQN authors fixed this by introducing the concept of target networks. Briefly, they froze the underlying network for a few iterations, enforcing stationarity of the targets and thereby reducing oscillations. Concretely, this means maintaining another set of parameters $\theta_t^{-}$ that only gets periodically updated throughout training: $$ \begin{aligned} \theta_{t+1}^{-} &= \theta_{t+1} \text{ if } t\equiv 0 \pmod{\tau} \;, \\ &= \theta_{t}^{-} \text{ otherwise}, \end{aligned} $$ which will provide the q-targets (notice the $\color{orange}\text{difference}$ with Eq. (6)): $$ \tag{7} \Delta\theta_{t} = \left(q_{\theta_t}(s, a) - r - \lambda\max_{a^\prime}q_{\color{orange}\boldsymbol{\theta^{-}_t}}(s^\prime, a^\prime)\right) \nabla_{\theta} q_\theta(s, a)\Big\vert_{\theta=\theta_t} \; . $$
That’s it – starting from first principles (qVI), we can now write down some pseudo-code for DQN. Below, the dimension $d$ refers to the degrees of freedom (number of weights and biases) of some neural-network architecture. Similarly to above, we collect data using the $\varepsilon$-greedy strategy $\pi_\theta^\varepsilon$.
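A compact sketch of this pseudo-code, written here with PyTorch and a Gymnasium environment as stand-ins (the original DQN differs in many details: convolutional architecture, frame preprocessing, RMSProp, annealed $\varepsilon$, and so on); all names below are ours.

```python
import random
from collections import deque

import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")                     # stand-in environment (not Atari)
obs_dim = env.observation_space.shape[0]
n_actions = env.action_space.n
lam, eps, batch_size, tau = 0.99, 0.1, 64, 500    # discount, exploration, minibatch size, target period

def make_net():
    # d = number of weights and biases of this (small) architecture
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

q_net, target_net = make_net(), make_net()
target_net.load_state_dict(q_net.state_dict())    # theta^- <- theta
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=50_000)                     # replay buffer D

s, _ = env.reset()
for t in range(100_000):
    # Collect data with the epsilon-greedy policy pi_theta^eps
    if random.random() < eps:
        a = env.action_space.sample()
    else:
        with torch.no_grad():
            a = int(q_net(torch.as_tensor(s, dtype=torch.float32)).argmax())
    s_next, r, terminated, truncated, _ = env.step(a)
    replay.append((s, a, r, s_next, float(terminated)))
    s = s_next if not (terminated or truncated) else env.reset()[0]

    if len(replay) >= batch_size:
        # Sample a minibatch uniformly at random from the replay buffer
        batch = random.sample(replay, batch_size)
        ss, aa, rr, ss_next, done = (
            torch.as_tensor(np.array(x), dtype=torch.float32) for x in zip(*batch)
        )
        q_sa = q_net(ss).gather(1, aa.long().unsqueeze(1)).squeeze(1)
        with torch.no_grad():                     # q-targets from the frozen target network
            target = rr + lam * (1 - done) * target_net(ss_next).max(dim=1).values
        loss = ((q_sa - target) ** 2).mean()      # squared Bellman residuals
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    if t % tau == 0:                              # periodic target-network refresh
        target_net.load_state_dict(q_net.state_dict())
```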
Refinements
That is not the end of the story. Since the original DQN paper, a myriad of improvements has been thrown into the mix. The Rainbow, Agent57 and BBF papers all, in their time, combined the most relevant improvements to yield ever better versions of DQN (at least, from the Atari benchmark perspective). The goal here is not to cover all of these tricks; however, we will quickly brush over a few that will (maybe) one day be topics of blog posts of their own 🤞.
Double Q-learning
This is about fighting the over-estimation bias inherent to Q-learning. Because of random realisations, or unlucky function approximation, it is possible that our learning algorithm over-estimates, at some point, some Q-value. Say, for some $s^+, a^+\in\mathcal{S} \times \mathcal{A}$, that $ q_t(s^+, a^+) > q_\lambda^\star(s^+, a^+)\; . $ If $a^+ \in \argmax_a q_t(s^+, a)$, this over-estimation will directly propagate to our whole set of q-values, since the Q-learning update at some state-action pair $(s, a)$ preceding $s^+$ writes: $$ q_{t+1}(s, a) = (1-\alpha)q_t(s, a) + \alpha\left(r(s, a)+\lambda q_t(s^+, a^+)\right)\; . $$ To avoid propagating over-estimated values, the original double Q-learning paper suggests introducing two separate estimators $q_t^1$ and $q_t^2$, trained on distinct sets of samples. When updating $q_t^1$, the target value is obtained by using $q_t^2$ to evaluate the action selected by $q_t^1$ (and vice-versa). Concretely: $$ q^1_{t+1}(s, a) = (1-\alpha)q^1_t(s, a) + \alpha\left(r(s, a)+\lambda q^2_t(s^+, \argmax_a q^1_t(s^+, a))\right)\; . $$ Via this decoupling between selection and evaluation, and since $q^1$ and $q^2$ are trained on different samples, we can hope to mitigate over-estimation.
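A sketch of one tabular double Q-learning step, where the estimator to update is drawn uniformly at random (which is how the two estimators end up being trained on distinct sets of samples):

```python
import numpy as np

def double_q_update(q1, q2, s, a, r, s_next, lam, alpha, rng):
    """One tabular double Q-learning update on the transition (s, a, r, s_next)."""
    if rng.random() < 0.5:
        # Update q1: action selected by q1, evaluated by q2
        a_star = int(np.argmax(q1[s_next]))
        q1[s, a] = (1 - alpha) * q1[s, a] + alpha * (r + lam * q2[s_next, a_star])
    else:
        # Update q2: action selected by q2, evaluated by q1
        a_star = int(np.argmax(q2[s_next]))
        q2[s, a] = (1 - alpha) * q2[s, a] + alpha * (r + lam * q1[s_next, a_star])
```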
Maintaining two sets of neural networks can be (was?) judged too cumbersome – in DQN-like approaches, the target network plays the role of the second estimator: the online network selects the maximising action, and the target network evaluates it. The update rule (7) becomes: $$ \tag{8} \Delta\theta_{t} = \left(q_{\theta_t}(s, a) - r - \lambda q_{\theta^{-}_t}(s^\prime, \argmax_{a^\prime} q_{\theta_t}(s^\prime, a^\prime))\right) \nabla_{\theta} q_\theta(s, a)\Big\vert_{\theta=\theta_t} \; . $$
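In code, and reusing the (hypothetical) names from the DQN sketch above, the change is confined to the target computation; the sampling and the gradient step stay the same:

```python
with torch.no_grad():
    # Double DQN target: the online network selects the action...
    a_star = q_net(ss_next).argmax(dim=1, keepdim=True)
    # ...and the target network evaluates it.
    target = rr + lam * (1 - done) * target_net(ss_next).gather(1, a_star).squeeze(1)
```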
Prioritized replay buffers
Prioritized experience replay has now become a central component of modern value-based methods. The idea takes its roots in the algorithmic concept of Asynchronous Value Iteration – more precisely, the Prioritized Value Iteration algorithm. Without going into too many details (that would be for another post), the main idea is to replace the uniform sampling in $\mathcal{D}$ by some sampling rule that enforces a uniform error across $\mathcal{D}$. Concretely, the prioritised experience replay paper encourages over-sampling transitions $(s, a, r, s^\prime)$ which come with a large online Bellman residual error $\Delta$, where: $$ \Delta := \left\vert q_{\theta_t}(s, a) - r(s, a) - \lambda \max_{a^\prime} q_{\theta_t}(s^\prime, a^\prime)\right\vert \; . $$
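A minimal proportional-prioritisation sketch, with `q_theta` a hypothetical callable returning the scalar q-value of a state-action pair; the actual paper adds a priority exponent, importance-sampling corrections and a sum-tree for efficient sampling, all omitted here.

```python
import numpy as np

def sample_prioritized(replay, q_theta, num_actions, lam, batch_size, rng, eps=1e-3):
    """Sample a minibatch with probability proportional to the absolute TD error."""
    deltas = np.array([
        abs(q_theta(s, a) - r - lam * max(q_theta(s2, a2) for a2 in range(num_actions)))
        for (s, a, r, s2) in replay
    ])
    probs = (deltas + eps) / (deltas + eps).sum()   # small eps so every transition can be drawn
    idx = rng.choice(len(replay), size=batch_size, p=probs, replace=False)
    return [replay[i] for i in idx]
```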
Multi-step learning
This, of course, touches on the never-ending bias vs. variance trade-off. Recall the fixed-point equation solved by $q_\lambda^\star$; for any $s, a\in\mathcal{S}\times\mathcal{A}$: $$ \begin{aligned} q_\lambda^\star(s, a) &= \mathcal{T}_\lambda^\star(q_\lambda^\star)(s, a) \;, \\ &= r(s, a) + \lambda\mathbb{E}_{s^\prime}\left[\max_{a^\prime} q_\lambda^\star(s^\prime, a^\prime)\right] \; . \end{aligned} $$ Iterating twice, we obtain: $$ \tag{9} \begin{aligned} q_\lambda^\star(s, a) &= (\mathcal{T}_\lambda^\star)^2(q_\lambda^\star)(s, a) \;, \\ &= r(s, a) + \lambda\mathbb{E}_{s^\prime}\left[\max_{a^\prime} \left\{r(s^\prime, a^\prime) + \lambda\mathbb{E}_{s^{\prime\prime}}\left[\max_{a^{\prime\prime}}q_\lambda^\star(s^{\prime\prime}, a^{\prime\prime})\right]\right\}\right] \; , \end{aligned} $$ and so on. Now, unlike many TD($\lambda$)-like methods, Q-learning is off-policy – we don’t (can’t) collect data according to $\pi^\star$ (that would make things a bit too easy). Computing an n-step target based on (9) from data collected by an $\varepsilon$-greedy policy is tricky. Instead of delving into Watkins’s and Peng’s Q($\lambda$) methods, the descendants of DQN adopt a pragmatic (but somewhat unprincipled) approach. For $n\in\mathbb{N}$ and a chunk of trajectory $(s_1, a_1, \ldots, a_{n}, s_{n+1})$, they compute the following target: $$ \sum_{i\leq n} \lambda^{i-1} r(s_i, a_i) + \lambda^n \max_{a} q_{\theta_t^{-}}(s_{n+1}, a) \; , $$ giving us the update rule: $$ \tag{10} \Delta\theta_{t} = \left(q_{\theta_t}(s_1, a_1) - \sum_{i\leq n} \lambda^{i-1} r(s_i, a_i) - \lambda^n \max_{a} q_{\theta_t^{-}}(s_{n+1}, a)\right) \nabla_{\theta} q_\theta(s_1, a_1)\Big\vert_{\theta=\theta_t} \; . $$ Small values of $n$ (e.g. $n=5$) have been shown to empirically outperform the 1-step targets. Most recent approaches have $n$ vary throughout learning: large values (high variance but low bias) at first, then smaller ones once the q-value estimates are more reliable (low variance and reasonable bias).
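A sketch of the n-step target on a trajectory chunk, again reusing the (hypothetical) `target_net` from the DQN sketch above:

```python
import torch

def n_step_target(rewards, s_np1, target_net, lam):
    """n-step target: sum_{i<=n} lam^(i-1) r_i + lam^n max_a q_{theta^-}(s_{n+1}, a)."""
    n = len(rewards)
    ret = sum(lam**i * r for i, r in enumerate(rewards))   # lam^0 r_1 + ... + lam^(n-1) r_n
    with torch.no_grad():
        bootstrap = target_net(torch.as_tensor(s_np1, dtype=torch.float32)).max()
    return ret + lam**n * float(bootstrap)
```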