MDP Fundamentals (1/3)

This blog post is the first of a short series on control in discounted discrete MDPs. We will cover here the very basics: the definition of an MDP, the different notions of policies (history-dependent, Markovian, stationary,…), as well as the discounted objective for MDPs.

Markov Decision Processes

Markov Decision Processes (MDPs) stand as a fundamental formalisation of sequential decision-making and control problems. Throughout this series, we will focus on discrete MDPs – with finite or countable state and action spaces. Most ideas carry over to continuous MDPs, but since their proper treatment requires slightly more subtle measure-theoretic arguments, we exclude them for the sake of simplicity.


The first ingredients of a fully-observable, stationary MDP are a state space $\mathcal{S}$ and an action space $\mathcal{A}$. As announced, we will focus here on the case where $\mathcal{S}$ and $\mathcal{A}$ are discrete – that is, finite or countable. The stationary dynamics between states are encoded by a transition kernel $p(\cdot\vert s, a)$, where: $$ p(s'\vert s, a) = \mathbb{P}(s_{t+1} = s'\vert s_t=s, a_t=a) \; \text{ for any } t\geq 1, \; s\in\mathcal{S}, \;a\in\mathcal{A}, \;s'\in\mathcal{S}\; . $$ A stationary reward function $r(s, a)$ measures the quality of action $a$ when executed in state $s$. That's basically it; an MDP $\mathcal{M}$ is the collection of a state space, an action space, a probabilistic rule for transitioning to new states from state-action pairs, and a measure of success through the reward function.

$$ { \mathcal{M} := (\mathcal{S}, \mathcal{A}, p, r) } $$
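To fix ideas, a tiny MDP can be written down directly as a pair of look-up tables. The following sketch uses made-up state and action names (not from the post) and checks that the transition kernel is well-formed:

```python
# A minimal two-state, two-action MDP stored as plain dictionaries.
S = ["s0", "s1"]
A = ["stay", "move"]

# p[(s, a)] is the distribution over next states, i.e. p(. | s, a).
p = {
    ("s0", "stay"): {"s0": 1.0, "s1": 0.0},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s0": 0.0, "s1": 1.0},
    ("s1", "move"): {"s0": 0.7, "s1": 0.3},
}

# r[(s, a)] is the reward of playing action a in state s.
r = {
    ("s0", "stay"): 0.0,
    ("s0", "move"): 1.0,
    ("s1", "stay"): 2.0,
    ("s1", "move"): 0.0,
}

def is_valid_kernel(p, S, A):
    """Each p(. | s, a) must be a probability distribution over S."""
    return all(
        abs(sum(p[(s, a)].values()) - 1.0) < 1e-9
        and all(s2 in S for s2 in p[(s, a)])
        for s in S for a in A
    )
```

The tuple `(S, A, p, r)` is exactly the collection $(\mathcal{S}, \mathcal{A}, p, r)$ above, just in dictionary form.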

Decision-rule and Policies

Solving an MDP informally means that we want to generate trajectories $(s_1, a_1, \ldots, s_t, a_t, \ldots)$ yielding large cumulative rewards (we will see in a bit how exactly we measure that).


At a given round $t$, an agent trying to solve the MDP picks a decision-rule $d_t$ that maps the history up to round $t$ to the space of distributions over $\mathcal{A}$ – denoted $\Delta_\mathcal{A}$. Formally: $$ \begin{aligned} d_t : (\mathcal{S}\times\mathcal{A})^{t-1}\times\mathcal{S}&\longrightarrow \Delta_\mathcal{A}\\ h_t = (s_1, a_1, \ldots, a_{t-1}, s_t) &\longmapsto d_t(h_t) \end{aligned} $$ The next action played is $a_t \sim d_t(h_t)$. This space of functions is called the space of history-dependent, randomised decision-rules and is denoted $\mathcal{D}^{\text{HR}}$. It contains other classes of interest: history-dependent deterministic $\mathcal{D}^{\text{HD}}$, Markovian randomized $\mathcal{D}^{\text{MR}}$ and Markovian deterministic $\mathcal{D}^{\text{MD}}$ decision-rules. A Markovian deterministic decision-rule $d$ takes as argument only the current state and returns a unique control; i.e. $d\in\mathcal{D}^\text{MD} \Longrightarrow d:\mathcal{S}\to\mathcal{A}$. The different classes of decision-rules obey the following relationships: $$ \boxed{ \begin{aligned} \mathcal{D}^{\text{MD}}\subset \mathcal{D}^{\text{HD}} \subset &\; \mathcal{D}^{\text{HR}}\\ & \; \cup \\ \mathcal{D}^{\text{MD}}\subset &\mathcal{D}^{\text{MR}} \end{aligned} } $$


A policy $\pi=(d_1, \ldots, d_T, \ldots)$ is a (potentially infinite) sequence of decision-rules – the different spaces of policies being denoted $\Pi^{\text{MD}}$, $\Pi^{\text{MR}}$, $\Pi^{\text{HD}}$ and $\Pi^{\text{HR}}$. History-dependent, randomised policies are the most general and can represent any decision-making agent. They are, however, not the easiest to handle and represent, as they might require infinite memory. Arguably the most basic policy sub-class is that of stationary policies, denoted $\mathcal{S}^{\text{MD}}$. An agent playing a stationary policy applies the same decision-rule at every time step: $$ \pi\in\mathcal{S}^{\text{MD}} \Longrightarrow \pi = (d, d, \ldots), \text{ where }d\in\mathcal{D}^\text{MD}\; . $$ We will see that this light-weight policy class is actually enough to solve an MDP (at least for the objective we care about). This will make finding an optimal policy (whatever that means for now) reasonable from a computational standpoint.
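The distinctions between these classes are easy to make concrete in code. A small sketch, with made-up state and action names for illustration:

```python
# Illustrative sketches of the decision-rule classes on a toy problem.

# Markovian deterministic: the current state alone picks the action.
def d_md(s):
    return "move" if s == "s0" else "stay"

# Markovian randomized: the current state picks a distribution over actions.
def d_mr(s):
    return {"stay": 0.5, "move": 0.5}

# History-dependent deterministic: the whole history h_t = (s_1, a_1, ..., s_t)
# may influence the action -- here, play "move" once and "stay" ever after.
def d_hd(h):
    return "stay" if "move" in h else "move"

# A stationary policy repeats one decision-rule forever: pi = (d, d, ...).
def stationary(d):
    def d_t(t, h):
        return d(h[-1])    # ignores t and everything but the current state
    return d_t
```

Here `stationary(d_md)` returns the same action for a given current state at every round $t$, which is exactly the $\pi = (d, d, \ldots)$ structure above.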

Important Notations

Given an initial state $s\in\mathcal{S}$, the combination of any policy $\pi\in\Pi^{\text{HR}}$ with the transition kernel $p$ induces a probability measure over the sequence of visited states and actions. We will use the shorthand notation $\mathbb{P}_s^{\pi}$ to denote this probability measure – and some derived measures, e.g. the one obtained by marginalising over the states. For instance, when $\pi=(d_1,\ldots)\in\Pi^{\text{HR}}$ we will write: $$ \mathbb{P}_s^{ \pi}\left(s_{t+1}=s'\right) = \mathbb{P}\left(s_{t+1}=s'\middle\vert s_1=s, \; a_i \sim d_i(h_i), \; s_{i+1}\sim p(\cdot\vert s_{i}, a_{i}) \text{ for } i\leq t\right)\; . $$ We will adopt similar notations for expectations; for instance: $$ \mathbb{E}_s^{\pi}\left[r(s_t,a_t)\right] = \mathbb{E}\left[r(s_t,a_t)\middle\vert s_1=s, \; a_i \sim d_i(h_i), \; s_{i+1}\sim p(\cdot\vert s_{i}, a_{i}) \text{ for } i\leq t\right] \;. $$
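The measure $\mathbb{P}_s^\pi$ is exactly what a simulator samples from. A minimal roll-out sketch, assuming the dict-based kernel format from before (the function name and signature are illustrative):

```python
import random

def rollout(p, r, d_seq, s1, T, rng=None):
    """Sample (s_1, a_1, ..., a_T, s_{T+1}) and the collected rewards,
    with a_t ~ d_t(h_t) and s_{t+1} ~ p(. | s_t, a_t).

    p[(s, a)] is a dict encoding p(. | s, a); d_seq[t - 1] maps the history
    h_t to a dict over actions (a Markovian rule only looks at h_t[-1]).
    """
    rng = rng or random.Random(0)
    history, s, rewards = [s1], s1, []
    for t in range(1, T + 1):
        dist = d_seq[t - 1](history)                             # d_t(h_t)
        a = rng.choices(list(dist), weights=list(dist.values()))[0]
        rewards.append(r[(s, a)])
        nxt = p[(s, a)]                                          # p(. | s_t, a_t)
        s = rng.choices(list(nxt), weights=list(nxt.values()))[0]
        history += [a, s]
    return history, rewards
```

Averaging functions of such trajectories over many roll-outs is the Monte-Carlo counterpart of taking expectations under $\mathbb{E}_s^\pi$.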

Discounted Objective

Let's now be a bit more precise about what constitutes a “good” policy. We will focus on the discounted approach. Discounting naturally arises in situations where the decision-maker prefers myopic strategies: it is more concerned with the immediate reward of its actions than with benefits it might enjoy far in the future.

Let $\lambda\in[0, 1)$ be the discount factor. It accounts for the decision-maker's degree of myopia; the closer $\lambda$ is to 0, the more preoccupied the decision-maker is with immediate reward. The $\lambda$-discounted reward of a policy $\pi\in\Pi^{\text{HR}}$ is:

$$ v_\lambda^\pi(s) := \lim_{T\to\infty} \mathbb{E}_s^\pi\left[ \sum_{t=1}^T \lambda^{t-1}r(s_t, a_t)\right], \; s\in\mathcal{S} $$
Discounted objective

Our first reflex should be to check that this definition actually makes sense – that is, that the limit exists and is finite. This is indeed the case when rewards are bounded and $\lambda<1$: if $\vert r(s, a)\vert \leq R_{\max}$ for all $s, a$, every partial sum is bounded by $R_{\max}\sum_{t\geq 1}\lambda^{t-1} = R_{\max}/(1-\lambda)$, and by dominated convergence we can swap limit and expectation: $$ v_\lambda^\pi(s) := \mathbb{E}_s^\pi\left[ \sum_{t=1}^\infty \lambda^{t-1}r(s_t, a_t)\right] \; . $$
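A quick numerical illustration of this geometric-series bound, in pure Python (the values $R_{\max}=1$ and $\lambda=0.9$ are arbitrary choices):

```python
# With |r| <= R_max and lam < 1, every truncated discounted sum is bounded
# by the geometric series R_max / (1 - lam), so the limit exists.
lam, R_max = 0.9, 1.0
bound = R_max / (1.0 - lam)          # geometric-series bound (10.0 here)

partial = 0.0
for t in range(1, 500):              # worst case: reward R_max every round
    partial += lam ** (t - 1) * R_max

assert partial <= bound              # each partial sum sits under the bound
assert bound - partial < 1e-10       # and the tail lam^T R_max/(1-lam) vanishes
```

The same bound, $\lambda^{T} R_{\max}/(1-\lambda)$ on the tail, is what controls the truncation error when evaluating $v_\lambda^\pi$ numerically.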


Notice how, by marginalising over states and actions, we can rewrite the discounted objective as: $$ v_\lambda^\pi(s) = \sum_{s', a'\in\mathcal{S}\times\mathcal{A}} r(s', a') \sum_{t=1}^\infty \lambda^{t-1} \mathbb{P}_s^\pi(s_t=s', a_t = a') \; . $$ It turns out that for any history-dependent policy $\pi\in\Pi^{\text{HR}}$, we can find some Markovian policy $\pi'\in\Pi^\text{MR}$ such that their state-action visit frequencies coincide: $$ \mathbb{P}_s^\pi(s_t=s', a_t = a') = \mathbb{P}_s^{\pi'}(s_t=s', a_t = a'), \; \text{ for any } s', a'\in\mathcal{S}\times\mathcal{A}\; . $$


In light of the previous rewriting of $v_\lambda^\pi$, this marks the first important result of this series, taking a first step towards trimming down the very large space of history-dependent policies. Indeed, we just showed that for any $\pi\in\Pi^{\text{HR}}$ and $s\in\mathcal{S}$, we can construct some $\pi'\in\Pi^\text{MR}$ such that $v_\lambda^\pi(s) = v_\lambda^{\pi'}(s).$ Therefore, it is enough to focus only on $\Pi^\text{MR}$, which is a much smaller policy class than $\Pi^\text{HR}$!


Vectorial Notations

We now introduce some notations that will allow for more compact writing. Let $\mathcal{V}$ be the space of bounded functions mapping $\mathcal{S}$ to $\mathbb{R}$. When $\mathcal{S}$ is finite – i.e. $\vert \mathcal{S} \vert = n\in\mathbb{N}$ – we can represent such functions as $n$-dimensional vectors. Indeed, writing $\mathcal{S} = \{ s^1, \ldots, s^n\}$, we associate to any $f\in\mathcal{V}$ a vector $\mathbf{f}\in\mathbb{R}^n$ where: $$ \forall i\in\{1, \ldots, n\}, \; [\mathbf{f}]_i = f(s^i)\in\mathbb{R}\; . $$

In the finite case, we will therefore identify $\mathcal{V}$ with $\mathbb{R}^n$ and use the so-called vectorial notation for finite MDPs to further reduce clutter. We list these notations below before giving a few identities.

In the following, $d\in\mathcal{D}^\text{MR}$ is a Markovian randomized decision-rule and $\pi\in\Pi^\text{MR}$. The three main entities we will work with are the reward vector $\mathbf{r}_{d}$, the instantaneous transition matrix $\mathbf{P}_{d}$ and the $t$-step transition matrix $\mathbf{P}_{\pi}^{t}$. In the definitions to come, we index the entries of vectors and matrices by $s,s'\in\mathcal{S}$.

$$ \begin{aligned} [\mathbf{r}_{d}]_s &:= \mathbb{E}_{a\sim d(s)} \left[r(s, a)\right]\; ,\\ [\mathbf{P}_{d}]_{s,s'} &:= \mathbb{P}_{a\sim d(s)}\left(s_t = s'\vert s_{t-1}=s\right) \\ [\mathbf{P}_{\pi}^{t}]_{s,s'} &:= \mathbb{P}^\pi_s\left(s_t = s'\right) \end{aligned} $$
Vector notations

For instance, the entry located at the $s^{\text{th}}$ row and $s'^{\text{th}}$ column of $\mathbf{P}_{d}$ is the probability of transitioning from $s$ to $s'$ when following the randomised decision-rule $d$. Note that $\mathbf{r}_{d}\in\mathcal{V}$ while $\mathbf{P}_{d}$ and $\mathbf{P}_{\pi}^{t}$ are linear operators from $\mathcal{V}$ to itself.
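In the finite tabular case these objects are straightforward to build. A minimal numpy sketch, where the sizes, the random instance and the uniform decision-rule are all assumptions made for illustration:

```python
import numpy as np

# Tabular MDP with n states and m actions: p[s, a, s2] = p(s2 | s, a), r[s, a].
n, m = 3, 2
rng = np.random.default_rng(0)
p = rng.random((n, m, n))
p /= p.sum(axis=2, keepdims=True)         # normalise each p(. | s, a)
r = rng.random((n, m))

# A Markovian randomized decision-rule as a row-stochastic matrix d[s, a].
d = np.full((n, m), 1.0 / m)

r_d = np.einsum("sa,sa->s", d, r)         # [r_d]_s    = E_{a ~ d(s)} r(s, a)
P_d = np.einsum("sa,sat->st", d, p)       # [P_d]_{s,t} = sum_a d(a|s) p(t|s, a)

assert np.allclose(P_d.sum(axis=1), 1.0)  # P_d is again a stochastic matrix
```

Averaging the kernel over the action distribution of $d$ is all there is to it: $\mathbf{P}_d$ rows remain probability distributions, as the final check confirms.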

The following identities will be useful in our proofs; the first ties the $t$-step transition matrix of a policy $\pi=(d_1, \ldots)\in\Pi^\text{MR}$ to the instantaneous transition matrices, with the convention $\mathbf{P}_{\pi}^{1} = \mathbf{I}_n$: $$ \mathbf{P}_{\pi}^{t} = \mathbf{P}_{\pi}^{t-1} \mathbf{P}_{d_{t-1}} = \prod_{i=1}^{t-1} \mathbf{P}_{d_i}\; . $$


The second allows us to rewrite the expected discounted value of $\pi=(d_1, \ldots)$ in a fairly compact manner: $$ v_\lambda^\pi = \sum_{t=1}^{+\infty} \lambda^{t-1} \mathbf{P}_{\pi}^{t} \mathbf{r}_{d_{t}} \; , $$ where $\mathbf{P}_{\pi}^{1}=\mathbf{I}_n$. Above, we write the expected discounted value $v_\lambda^\pi\in\mathcal{V}$ in its vectorial form.
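For a stationary policy $\pi = (d, d, \ldots)$, this series is easy to evaluate numerically. The sketch below (a randomly generated instance, for illustration only) truncates the sum and compares it to the solution of the linear system $(\mathbf{I}_n - \lambda \mathbf{P}_d)v = \mathbf{r}_d$, which the resulting Neumann series converges to – a fact we will establish properly later in the series:

```python
import numpy as np

n, lam = 3, 0.9
rng = np.random.default_rng(1)
P_d = rng.random((n, n))
P_d /= P_d.sum(axis=1, keepdims=True)     # stochastic matrix of the rule d
r_d = rng.random(n)                       # expected reward vector of d

# Truncate v = sum_{t>=1} lam^{t-1} P_pi^t r_d, with P_pi^1 = I and, for a
# stationary policy, P_pi^{t+1} = P_pi^t P_d.
v, P_t = np.zeros(n), np.eye(n)
for t in range(1, 500):
    v += lam ** (t - 1) * (P_t @ r_d)
    P_t = P_t @ P_d

# The Neumann series sum_t (lam P_d)^{t-1} converges to (I - lam P_d)^{-1},
# so the truncated v should (approximately) solve the linear system.
v_exact = np.linalg.solve(np.eye(n) - lam * P_d, r_d)
assert np.allclose(v, v_exact, atol=1e-8)
```

The truncation error is controlled by the geometric tail $\lambda^{T}/(1-\lambda)$ discussed earlier, which is why 500 terms are more than enough at $\lambda = 0.9$.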



Most of this blog post is a condensed version of [Puterman, 1994, Chapters 4 & 5].