# Model-based RL (1/?)

This post is the first of a short series, concerned with model-based RL. We will start walking this road via the principled trail: describing and analysing $\texttt{UCB-VI}$, a theoretically grounded algorithm to solve unknown finite MDPs. More precisely, we will see how this model-based approach directly models the unknown MDP and uses optimism for strategic exploration to provably find “good” policies in finite-time. Concentration inequalities and regret bounds incoming, fun!

The $\texttt{UCB-VI}$ algorithm was introduced by [Azar & al, 2017] at ICML. It comes with improved guarantees compared to its predecessors, such as UCRL [Jaksch & al, 2010].

## Problem definition

Let $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{P}, r)$ some given MDP. We will assume than $\mathcal{S}$ and $\mathcal{A}$ are known (required), as well as the reward signal $r$ (not required, and after reading this post, you should see this is not a restriction, but rather a simplification for ease of exposition). Only the transition kernel $\mathcal{P}$ is unknown to us. Wlog, we will also assume that the rewards are absolutely bounded by 1, that is $\vert r(s, a) \vert \leq 1$ for any $s, a\in\mathcal{S}\times\mathcal{A}$.

Straying away from earlier posts where we were concerned with a discounted, infinite-horizon criterion, we will here consider solving $\mathcal{M}$ under a finite-time criterion. That is, for some $H\in\mathbb{N}$ and initial state $s_1\in\mathcal{S}$: $$ \argmax_{\pi\in\Pi^\text{MD}} v_1^\pi(s_1) := \mathbb{E}_{s_1}^\pi \left[\sum_{h=1}^H r(s_h, a_h)\right]\; . $$ Recall that $\Pi^\text{MD}$ is the set containing all the sequences $(d_1, \ldots, d_H)$ of deterministic, Markovian decision rules. We will denote $\pi^\star=(d_1^\star, \ldots, d_{H}^\star)$ the optimal policy and: $$ v_h^\star(s) := \mathbb{E}_{s}^{\pi^\star} \left[\sum_{h^\prime=1}^H r(s_h^\prime, d_{h^\prime}(s_h^\prime))\right] = \max_{a\in\mathcal{A}} \left\{r(s, a) + \mathbb{E}_{s^\prime}^a\left[v_{h+1}^\star(s^\prime)\right]\right\}\;. $$ its $h$-tail value function (the second equality being a direct statement of Bellman’s optimality principle).

We are interested in algorithms that, in finite-time, are able to find decent policies *without* the knowledge of $\mathcal{P}$.
One such algorithm can experience the environment in a sequential fashion: at round $k$, it will submit a policy
$\pi_k$, collect a roll-out, learn useful stuff (hopefully) and refine its policy for the subsequent round.
Our metric for success will be the regret:
$$
\begin{aligned}
\text{Regret}(K) &:= \sum_{k=1}^K v_1^\star(s_1) - v_1^{\pi_k}(s_1)\;,\\
&= K v_1^\star(s_1) - \sum_{k=1}^K v_1^{\pi_k}(s_1) \; .
\end{aligned}
$$
The regret lives in $[0, K]$.
If we fail to learn anything, we expect $\text{Regret}(K)$ to grow linearly with $K$ as we fail to
get closer the $\pi^\star$.
On the other hand, if we manage to prove that $\text{Regret}(K) = \mathcal{O}(K^\alpha)$ with $\alpha<1$,
we will know that we are actually learning something – the regret sublinear growth will be a testimony that the
sub-optimality gap $v_1^\star(s_1) - v_1^{\pi_k}(s_1)$ is getting smaller and smaller. Naturally, the smaller the $\alpha$, the better.

**Note**

Beyond the growth of the regret w.r.t $K$ we are also interested in the dependency w.r.t to the horizon $H$, the size of the state space $S=\vert \mathcal{S}\vert$ and the action space $A=\vert \mathcal{A}\vert$, since it captures how the size of the MDP impacts the performance. Other dependencies and universal numerical constants will be ignored, and hidden in the symbol $\blacksquare$. We will also denot $\delta\in(0, 1]$ the algorithm’s confidence level.

## $\texttt{UCB-VI}$

### Rationale

Let’s first discuss why a naive approach would fail. It can be tempting to adopt a greedy strategy: at round $k$, build some stimate $\hat{\mathcal{P}}_k$ of the transition kernel based on the observed transitions seen so far, run the VI algorithm on $\hat{\mathcal{M}}_k = (\mathcal{S}, \mathcal{A}, \hat{\mathcal{P}}_k, r)$ to compute $\pi_{k+1}$, and go on like that. This strategy, of course, fails. Indeed, an unlucky transition at the very beginning of the experiment might lead you to completely abandon a part of the state space, which is actually quite rewarding. To avoid this, we need some level of exploration. This exploration must be reasonable: if we explore all the time, we cannot expect sublinear regret. This is the classical exploration / exploitation trade-off.

A principle way to find the correct balance between exploration and exploitation is the celebrated Optimism in Face of Uncertainty (OFU) principle.
The idea is simple: based on the logged transitions, we will build some confidence interval in which the true transition kernel lies with high-probability.
With some pretty terrible notations, think of something like $\mathcal{P}\in[\mathcal{P}^-, \mathcal{P}^+]$, with high probability.
We will then be looking for an *optimistic* MDP: one with a plausible transition kernel (*i.e* one that lies inside the confidence interval) that guarantees the highest possible value:
$$
\mathcal{M}^\text{opt} \in \argmax_{[\mathcal{P}^-, \mathcal{P}^+]} v_1^\star(s_1) \; .
$$
We will then run VI on this optimistic MDP to create $\pi_{k+1}$. The rationale is simple: if we were right to be optimistic, and we get a lot of rewards.
If we were wrong, we are learning something useful for the future.

That’s in a nutshell, the driving principle behind $\texttt{UCB-VI}$.
You will see that the implementation looks a bit different; instead of searching for an optimistic MDP, we will use an exploration bonus when running the value-iteration algorithm.
This exploration bonus is based on the confidence interval’s width.
You can show that both approaches are *strictly* equivalent. However, the bonus one is much easier to implement in practice.

**Note**

### Algorithm

The rationale behind $\texttt{UCB-VI}$ is rather straight forward. After each roll-out, we will compute estimates $\hat{p}_k$ of the transition probabilities, along with an exploration bonus $b_k$ for each state-action pair. We will compute the value functions $\{v_1^k, q_1^k, \ldots, q_{H}^k, v_{H+1}^k\}$ by running a modified VI algorithm backward in time. This modified VI protocol will involve the estimates $\hat{p}_k$ to mimic the forward dynamics and the exploration bonus $b_k$ to promote exploration of promising state-action pairs. The policy that will collect the roll-out will be greedy w.r.t those value functions (and henceforth deterministic). Formally, $\pi_k = (d_1^k, \ldots, d_H^k)$ where: $$ d_{h}^{k+1}(s) \in \argmax_{a\in\mathcal{A}} q_h^k(s, a) \; . $$ with ties broken arbitrarily.

We are now ready to detail some pseudo-code for $\texttt{UCB-VI}$. Below, we denote $s_h^k$ (resp. $a_h^k$) the state (resp. action) encountered at round $h$ during the collection of episode $k$.

Observe that the q-values are capped at $H$, their maximal possible value given the bounded reward assumption. This also holds true for state-action pairs that have not been visited yet—this is a technique known as optimistic initialisation, which encourages the learning algorithm to quickly visit all action pairs.

We are now ready to provide a regret bound for $\texttt{UCB-VI}$. Because the regret is a random variable, this bound holds only with high-probability.

Let’s enjoy for a moment the nice $\sqrt{K}$ regret growth, ensuring us of $\texttt{UCB-VI}$’s statistical efficiency. We also notice in the regret leading term a sublinear dependency $\sqrt{SA}$ in the size of the MDP. Actually, the known regret lower-bound is of the order $\tilde{\Omega}(\sqrt{HSAK})$, which $\texttt{UCB-VI}$ matches but for the growth with the horizon length $H$. (A variant of $\texttt{UCB-VI}$ presented in [Azar & al, 2017] matches this lower-bound by using a refined alternative for the exploration bonus, but we won’t explore this version today.)

**Note**

## Analysis

We now move into the fun part where we get to prove the claimed upper-bound. The recipe for bounding the regret is somewhat classical once you are used to this type of algorithm. We will first remove the unknown quantity (the ground-truth $v_h^\star$) by invoking optimism, and then show that the online error gracefully degrades. First, we will need to provide a high-probability bound on the estimation error by resorting to a concentration of measure argument.

### Concentration inequalities

To give a somewhat self-contained proof, we first provide some remainder on concentration inequalities. Throughout, we will use the Chernoff-Hoeffding (CH) inequality for i.i.d and martingale difference sequence noise, as well as Bernstein inequality.

We state (and prove) a functional version of the Chernoff-Hoeffding bound. Let $\mathcal{X}$ a finite set, $\nu$ a p.m.f over $\mathcal{X}$ and $f: \mathcal{X}\mapsto\mathbb{R}$ a bounded real-valued function. Let $x^-, x^+\in\mathbb{R}$ such that $f(x) \in [x^-, x^+]$ for any $x\in\mathcal{X}$, and $F:= x^+-x^-$. Given $(x_1, \ldots, x_n)$ some i.i.d samples from $\nu$, the MLE estimator of the probability mass allocated to some $x\in\mathcal{X}$ writes $ \hat\nu_n(x):= \sum_{x_i} \mathbf{1}[x_i=x]/n \text{ for any } x\in\mathcal{X}\; . $ Let $\mu := \mathbb{E}_\nu[f]$ the expected value of $f$ under $\nu$ and $\hat\mu_n := \sum_{x\in\mathcal{X}} \hat\nu_n(x) f(x) $ its estimator. Then for any $\delta\in(0, 1]$:

$$ \tag{1} \mathbb{P}\left(\vert \mu - \hat\mu_n\vert \leq \frac{2F}{\sqrt{n}} \sqrt{\log(2/\delta)}\right) \geq 1-\delta; . $$

**Proof**

We will also need the Azuma-Hoeffding bound, stated below in its “normal” form.
Let $\{\eta_i\}_i$ some martingale difference sequence (*i.e* $\mathbb{E}[\eta_t \vert \eta_{<t}]=0$)
such that $\vert \eta_t\vert < c$ a.s. for any $t\in\mathbb{N}$. Then:
$$
\tag{2}
\mathbb{P}\left(\left\vert \sum_{i=1}^n \eta_i \right\vert \leq c\sqrt{2n\log(2/\delta)}\right)\geq 1-\delta\; .
$$

**Proof**

We will also need Bernstein inequality—an alternative to CH which takes variance effects into account. Letting $\{\eta_i\}_i$ be a collection of i.i.d centered random variable absolutely bounded by some constant $c$, and denoting $\sigma^2 = \text{Var}(\eta_i)$ we have for any $\delta\in(0,1]$: $$ \tag{3} \mathbb{P}\left(\left\vert \frac{1}{n}\sum_{i=1}^n \eta_i \right\vert \leq \sqrt{\frac{\sigma^2}{n}}\sqrt{\log(2/\delta)} + \frac{2c}{3n}\log(2/\delta)\right) \geq 1-\delta\;. $$

### High-probability event

The first step in bounding the regret starts by amputing it of the unknown quantity $v_1^\star$.
This is done via optimism: it actually turns out that, *often enough*, we are optimistic: $v_1^k(s_1) \geq v_1(s_1)$ for any $k\in{1, \ldots, K}$.
That’s useful, as this allows us to write:
$$
\text{Regret}(T) \leq \sum_{k=1}^K v_1^k(s_1) - v_1^{\pi_k}(s_1) \; ,
$$
which we will in turn see is small.
Before that, we need to quantify what is “often”. We lower-bound the probability of an event that is even larger that what we need for optimism.

The proof involves backward induction, applying the functional CH bound at every step, along with some union bounds here and there. In the following lines, fix $k\leq K$. Observe that $v_{H+1}^k \equiv v_{H+1}^\star\equiv 0$ by construction. Now, assume that $v_{h+1}^k(s) \geq v_{h+1}^\star(s)$ for some $h\leq H$ and any $s\in\mathcal{S}$. Also, for any $s, a\in\mathcal{S}\times\mathcal{A}$: $$ \begin{aligned} q_{h}^k(s, a) &= r(s, a) + \sum_{s^\prime}\hat{p}_k(s^\prime\vert s, a) v_{h+1}^k(s^\prime) + b_k(s, a)\;, &\text{(def)}\\ &\geq r(s, a) + \sum_{s^\prime}\hat{p}_k(s^\prime\vert s, a) v_{h+1}^\star(s^\prime) + b_k(s, a)\;, &\text{(induction hyp.)}\\ &= r(s, a) + \sum_{s^\prime}p(s^\prime\vert s, a) v_{h+1}^\star(s^\prime) + \sum_{s^\prime}\left(\hat{p}(s^\prime\vert s, a)-p(s^\prime\vert s, a)\right) v_{h+1}^\star(s^\prime) + b_k(s, a)\;, \\ &= q_h^\star(s, a) + \sum_{s^\prime}\left(\hat{p}(s^\prime\vert s, a)-p(s^\prime\vert s, a)\right) v_{h+1}^\star(s^\prime) + b_k(s, a)\;. &\text{(def of $q_h^\star$)}\\ \end{aligned} $$ An application of CH along with a union bound over $\mathcal{S}\times\mathcal{A}$ yields that with probability at least $1-\delta/(KH)$: $$ \sum_{s^\prime}\left(\hat{p}(s^\prime\vert s, a)-p(s^\prime\vert s, a)\right) v_{h+1}^\star(s^\prime) \geq -b_k(s, a)\quad\text{ for all } s, a\in\mathcal{S}\times\mathcal{A}\;, $$ hence concluding the induction, as under this event we have that $q_{h}^k(s, a) \geq q_h^\star(s, a)$. The same ordering follows for the state-value function since $$ v_h^{k}(s) = \max_{a}q_h^{k}(s, a) \geq \max_{a}q_h^{\star}(s, a) = v_h^\star(s) \; . $$ Only a union bounded over $h\leq H$ and $k\leq K$ is then needed to prove the desired claim.

**Proof**

### Bounding the regret

At this point, we know that with probability at least $1-\delta$ we have: $$ \begin{aligned} \text{Regret}(T) &\leq \sum_{k=1}^K v_1^k(s_1) - v_1^{\pi_k}(s_1) \; , \\ &= \sum_{k=1}^K\Delta_1^k(s_1)\;, \end{aligned} $$ where $\Delta_h^k(s):=v_h^{k}(s) - v_{h}^{\pi^k}(s)$. Let’s see how we can show that the r.h.s is small enough. Our strategy is to bound this term in a recursive fashion, relating $\Delta_h^k$ to $\Delta_{h+1}^k$. $$ \begin{aligned} \Delta_h^k(s_h^k) &= v_h^{k}(s_h^k) - v_{h}^{\pi^k}(s_{h}^k)\;,\\ &= q_h^k(s_h^k, a_h^k) - q_h^{\pi_k}(s_h^k, a_h^k)\;, &(a_h^k= \in \argmax_a q_h^k(s, a)) \;, \\ &= \sum_{s^\prime} \hat p_k(s^\prime\vert s_h^k, a_h^k) v_{h+1}^k(s^\prime) + b_k(s_h^k, a_h^k)- \sum_{s^\prime} p(s^\prime\vert s_h^k, a_h^k) v_{h+1}^{\pi_k}(s^\prime)\;, &(\text{def}) \\ &= \sum_{s^\prime} (\hat p_k-p)(s^\prime\vert s_h^k, a_h^k) v_{h+1}^k(s^\prime) + b_k(s_h^k, a_h^k)- \sum_{s^\prime} p(s^\prime\vert s_h^k, a_h^k)\Delta_{h+1}^k(s^\prime)\; \\ &= \underbrace{\sum_{s^\prime} (\hat p_k-p)(s^\prime\vert s_h^k, a_h^k) v_{h+1}^\star(s^\prime)}_{\textcircled{1}} + \underbrace{\sum_{s^\prime} (\hat p_k-p)(s^\prime\vert s_h^k, a_h^k) (v_{h+1}^k - v_{h+1}^\star)(s^\prime)}_{\textcircled{2}} + b_k(s_h^k, a_h^k)- \underbrace{\sum_{s^\prime} p(s^\prime\vert s_h^k, a_h^k)\Delta_{h+1}^k(s^\prime)}_{\textcircled{3}}\;, \\ \end{aligned} $$ where the last two lines are simply re-arranging.

We have actually already bounded $\textcircled{1}$ a few lines ago. Under $\mathcal{E}$: $$ \textcircled{1} \leq b_k(s_h^k, a_h^k) \; . $$

Treating $\textcircled{2}$ is slightly more involved, but the bottom-line is that this term relates to $\textcircled{3}$: $$ \textcircled{2} \leq \tilde\blacksquare \left(H^{-1} \textcircled{3} + \frac{SH^2}{n_k(s_h^k, a_h^k)}\right)\;. $$

**Proof**

Putting everything together, we obtain the high probability bound: $$ \begin{aligned} \Delta_h^k(s_h^k) \leq \tilde\blacksquare \left( b_k(s_h^k, a_h^k) + (1+1/H) \sum_{s^\prime} p(s^\prime\vert s_h^k, a_h^k)\Delta_{h+1}^k(s^\prime) + \frac{SH^2}{n_k(s_h^k, a_h^k)}\right)\;. \end{aligned} $$

We almost have the recursive relationship we were gunning for – the r.h.s is still missing an explicit mention to $\Delta_h^k(s_{h+1}^k)$. By adding and removing $(1+H)^{—1}\Delta_h^k(s_{h+1}^k)$ and introducing the terms $\xi_{h}^k := v_{h+1}^k(s_{h+1}^k) - \mathbb{E}_{s_h^k}^{a_h^k}[v_{h+1}^k(s^\prime)]$ and $\zeta_{h}^k := v_{h+1}^{\pi_k}(s_{h+1}^k) - \mathbb{E}_{s_h^k}^{a_h^k}[v_{h+1}^{\pi_k}(s^\prime)]$ we finally obtain: $$ \Delta_h^k(s_h^k) \leq \tilde\blacksquare \left( b_k(s_h^k, a_h^k) + (1+1/H)(\xi_{h+1}^k + \zeta_{h+1}^k + \Delta_{h+1}^k(s_{h+1}^k)) + \frac{SH^2}{n_k(s_h^k, a_h^k)}\right)\;. $$

Observe that $\{\xi_h^k\}_h$ and $\{\zeta_h^k\}_h$ are martingale difference sequences – for instance, $\mathbb{E}[\xi_h^k \vert s^k_{h^\prime < h}] = 0$. Unrolling, and using the fact that $(1+1/x)^x \leq 3$ we eventually obtain that: $$ \Delta_1^k(s_1^k) \leq \tilde\blacksquare \sum_{h=1}^H b_k(s_h^k a_h^k) + \xi_h^k + \zeta_h^k + \Delta_{h+1}^k(s_{h+1}^k) + \frac{SH^2}{n_k(s_h^k, a_h^k)} \; , $$ and therefore: $$ \tag{4} \text{Regret}(T) \leq \tilde\blacksquare \sum_{k=1}^K \sum_{h=1}^H b_k(s_h^k a_h^k) + \xi_h^k + \zeta_h^k + \frac{SH^2}{n_k(s_h^k, a_h^k)} \; , $$ The last term is a second-order term; it is easy to show that: $$ \sum_{k, h} \frac{1}{n_k(s_h^k, a_h^k)} \leq \tilde\blacksquare SAH \; . $$

**Proof**

By a direct application of the Azuma-Hoefffing inequality $(2)$, and using the fact that both $\vert \xi_k^h \vert \leq H$ and $\vert \zeta_k^h \vert \leq H$, we have that with high-probability: $$ \sum_{h=1}^H \sum_{k=1}^K \xi_h^k + \zeta_h^k \leq \blacksquare H\sqrt{KH} \; . $$ We are left with bounding the first term in (4). Using the definition of $b_k$ we have: $$ \begin{aligned} \sum_{k=1}^K\sum_{h=1}^H b_k(s_h^k, a_h^k) &= \tilde\blacksquare H \sqrt{HKSA}\; . \end{aligned} $$

**Proof**

Combining everything together, we finally obtain the claimed high-probability bound: $$ \text{Regret}(T) \leq \tilde\blacksquare \left(H\sqrt{HSAK} + H^3S^2 A\right)\;. $$

**Note**

### Outlook

At this point, we are now convinced that $\texttt{UCB-VI}$ is statistically efficient, with a leading regret term scaling as $\mathcal{O}(H\sqrt{HSAK})$. It gives us a principled tool to reduce epistemic uncertainty, and properly balance the exploration-exploitation via optimism.

Of course, $\texttt{UCB-VI}$’ practical usefulness doesn’t reach much further than small MDPs since only storing $\hat p_k$ is $\Omega(S^2A)$.
The main ideas can be extended to structured MDPs, and do support simple (*i.e* linear) function approximation.
It is natural to wonder whether optimism can be used as a plug-in in the multitude of RL algorithms powered by deep-learning, which painfully miss strategic exploration.
Unfortunately, the answer so far has been mostly negative, for two reasons: the complexity of building meaningful confidence sets for neural network parameters, and the considerable computational overhead required to search for an optimistic parametrisation of such models.
Instead, there has been some limited yet convincing success porting ideas from another paradigm: Thompson sampling (Bayesian stuff). That would be for another time, though.