# MDP Fundamentals (1/3)

This blog post is the first of a short series on control in discounted discrete MDPs. We will cover here the very basics: the definition of an MDP, the different notions of policies (history-dependent, Markovian, stationary,…), as well as the discounted objective for MDPs.

## Markov Decision Processes

Markov Decision Processes (MDPs) stand as a fundamental formalisation of sequential decision-making and control problems. Throughout this series, we will focus on discrete MDPs – with finite or countable state and action spaces. Most ideas carry over to continuous MDPs; since their proper treatment requires slightly more subtle measure-theoretic arguments, we exclude them for the sake of simplicity.

### Definition

The first ingredients to a fully-observable, stationary MDP are a state space $\mathcal{S}$ and an action space $\mathcal{A}$.
As announced, we will focus here on the case where $\mathcal{S}$, $\mathcal{A}$ are discrete – that is, finite or countable.
The *stationary* dynamics between different states is encoded by some transition kernel $p(\cdot\vert s, a)$, where:
$$
p(s'\vert s, a) = \mathbb{P}(s_{t+1} = s'\vert s_t=s, a_t=a) \; \text{ for any } t\geq 1, \; s\in\mathcal{S},
\;a\in\mathcal{A}, \;s'\in\mathcal{S}\; .
$$
A stationary reward function $r(s, a)$ measures the quality of action $a$ when executed in state $s$. That’s basically it;
an MDP $\mathcal{M}$ is the collection of a state space, an action space, a probabilistic rule for transitioning to new states from state-action pairs,
and a measure of success through the reward function.
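
To make these objects concrete, here is a minimal sketch of such a collection in Python/numpy. The two-state, two-action example and all the numbers in it are made up purely for illustration; any arrays with the right shapes and stochasticity constraints would do.

```python
import numpy as np

# A toy MDP with |S| = 2 states and |A| = 2 actions (purely illustrative).
n_states, n_actions = 2, 2

# Transition kernel p[s, a, s'] = P(s_{t+1} = s' | s_t = s, a_t = a).
p = np.array([
    [[0.9, 0.1],   # from state 0, action 0
     [0.2, 0.8]],  # from state 0, action 1
    [[0.5, 0.5],   # from state 1, action 0
     [0.0, 1.0]],  # from state 1, action 1
])

# Reward function r[s, a] = r(s, a).
r = np.array([
    [1.0, 0.0],
    [0.0, 2.0],
])

# Sanity check: each p(.|s, a) must be a probability distribution over S.
assert p.shape == (n_states, n_actions, n_states)
assert np.all(p >= 0) and np.allclose(p.sum(axis=-1), 1.0)
```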

### Decision-rule and Policies

Informally, solving an MDP means generating trajectories $(s_1, a_1, \ldots, s_t, a_t, \ldots)$ that yield a large cumulative reward (we will see in a bit how we measure that exactly).

#### Decision-rules

At a given round $t$, an agent trying to solve the MDP is able to pick a *decision-rule* $d_t$
that maps the history up to round $t$ to the space of distributions over $\mathcal{A}$, denoted $\Delta_\mathcal{A}$. Formally:
$$
\begin{aligned}
d_t : (\mathcal{S}\times\mathcal{A})^{t-1}\times\mathcal{S}&\longrightarrow \Delta_\mathcal{A}\\
h_t = (s_1, a_1, \ldots, a_{t-1}, s_t) &\longmapsto d_t(h_t)
\end{aligned}
$$
The next action that is played is $a_t \sim d_t(h_t)$.
This space of functions is called the space of history-dependent, randomised decision-rules and is denoted $\mathcal{D}^{\text{HR}}$.
It contains other classes of interest: history-dependent deterministic ($\mathcal{D}^{\text{HD}}$), Markovian randomised ($\mathcal{D}^{\text{MR}}$) and Markovian deterministic ($\mathcal{D}^{\text{MD}}$) decision-rules.
A Markovian deterministic decision-rule $d$ takes as argument only the current state and returns a single control;
*i.e.* $d\in\mathcal{D}^\text{MD} \Longrightarrow d:\mathcal{S}\rightarrow\mathcal{A}$.
The different classes of decision-rules obey the following relationships:
$$
\boxed{
\begin{aligned}
\mathcal{D}^{\text{MD}}\subset \mathcal{D}^{\text{HD}} \subset &\; \mathcal{D}^{\text{HR}}\\
& \; \cup \\
\mathcal{D}^{\text{MD}}\subset &\mathcal{D}^{\text{MR}}
\end{aligned}
}
$$
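
As a rough illustration (a sketch with made-up signatures, not any canonical API), the four classes differ only in what they read from the trajectory and whether they return an action or a distribution over actions:

```python
import numpy as np

n_actions = 2

# Markovian deterministic (D^MD): current state -> action.
def d_md(s):
    return 0 if s == 0 else 1

# Markovian randomised (D^MR): current state -> distribution over actions.
def d_mr(s):
    return np.array([0.7, 0.3]) if s == 0 else np.array([0.1, 0.9])

# History-dependent deterministic (D^HD): full history h_t = (s_1, a_1, ..., a_{t-1}, s_t) -> action.
def d_hd(history):
    return (len(history) // 2) % n_actions  # e.g. alternate actions over time

# History-dependent randomised (D^HR): full history -> distribution over actions.
def d_hr(history):
    q = min(0.9, 0.1 * len(history))        # e.g. drift towards action 1 over time
    return np.array([1.0 - q, q])

# Any D^MD rule can be read as a D^MR rule (a point mass on one action), as a D^HD
# rule (look only at the last state of the history), and as a D^HR rule - this is
# exactly the inclusion diagram boxed above.
```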

#### Policies

A *policy* $\pi=(d_1, \ldots, d_T, \ldots)$ is a (potentially infinite) sequence of decision-rules – the
different spaces of policies being denoted $\Pi^{\text{MD}}$, $\Pi^{\text{MR}}$, $\Pi^{\text{HD}}$ and $\Pi^{\text{HR}}$.
History-dependent, randomised policies are the most general and can represent any decision-making agent.
They are not the easiest to handle and represent, however, as they may require infinite memory.
Arguably the most basic policy sub-class is that of *stationary* policies, denoted $\mathcal{S}^{\text{MD}}$.
An agent playing with a stationary policy applies at each time step the same decision-rule:
$$
\pi\in\mathcal{S}^{\text{MD}} \Longrightarrow \pi = (d, d, \ldots), \text{ where }d\in\mathcal{D}^\text{MD}\; .
$$
We will see that this light-weight policy class is actually *enough* to solve an MDP (at least for the objective
that we care about).
This will make finding an *optimal* policy (whatever that means for now) reasonable from
a computational standpoint.
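
Here is a minimal rollout sketch under a stationary deterministic policy, reusing the toy MDP arrays `p` and `r` assumed above; the policy itself is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy MDP from before (purely illustrative).
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
r = np.array([[1.0, 0.0], [0.0, 2.0]])

# A stationary deterministic policy: the same decision-rule d at every step.
d = np.array([1, 0])  # d(s=0) = action 1, d(s=1) = action 0

def rollout(s0, horizon):
    """Sample a trajectory (s_1, a_1, ..., s_T, a_T) and return its rewards."""
    s, rewards = s0, []
    for _ in range(horizon):
        a = d[s]                              # a_t = d(s_t)
        rewards.append(r[s, a])               # r(s_t, a_t)
        s = rng.choice(2, p=p[s, a])          # s_{t+1} ~ p(.|s_t, a_t)
    return rewards

print(rollout(s0=0, horizon=10))
```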

#### Important Notations

Given an initial state $s\in\mathcal{S}$, the combination of any policy $\pi\in\Pi^{\text{HR}}$
with the transition probability kernels $p$ induces a probability measure over the sequence of visited states and actions.
We will use the shorthand notation $\mathbb{P}_s^{\pi}$ to denote this probability measure - and some derived measures,
*e.g.* the marginal distribution of the state visited at a given round. For instance,
when $\pi=(d_1,\ldots)\in\Pi^{\text{HR}}$ we will write:
$$
\mathbb{P}_s^{ \pi}\left(s_{t+1}=s'\right) = \mathbb{P}\left(s_{t+1}=s'\middle\vert s_1=s, \; a_i \sim d_i(h_i), \; s_{i+1}\sim p(\cdot\vert s_{i}, a_{i}) \text{ for } i\leq t\right)\; .
$$
We will adopt similar notations when reasoning about expectations; for instance:
$$
\mathbb{E}_s^{\pi}\left[r(s_t,a_t)\right] = \mathbb{E}\left[r(s_t,a_t)\middle\vert s_1=s, \; a_i \sim d_i(h_i), \; s_{i+1}\sim p(\cdot\vert s_{i}, a_{i}) \text{ for } i\leq t\right] \;.
$$
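
To give an operational feel for $\mathbb{P}_s^\pi$, here is a hypothetical Monte-Carlo estimate of $\mathbb{P}_s^\pi(s_t = s')$ for a Markovian randomised policy on the toy MDP assumed earlier; all numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy MDP (illustrative) and a Markovian randomised decision-rule d(a|s).
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
d = np.array([[0.7, 0.3],   # d(.|s=0)
              [0.1, 0.9]])  # d(.|s=1)

def estimate_state_prob(s0, t, target, n_samples=20_000):
    """Monte-Carlo estimate of P_s^pi(s_t = target) for pi = (d, d, ...)."""
    hits = 0
    for _ in range(n_samples):
        s = s0
        for _ in range(t - 1):              # t - 1 transitions take us from s_1 to s_t
            a = rng.choice(2, p=d[s])       # a_i ~ d(.|s_i)
            s = rng.choice(2, p=p[s, a])    # s_{i+1} ~ p(.|s_i, a_i)
        hits += (s == target)
    return hits / n_samples

print(estimate_state_prob(s0=0, t=3, target=1))
```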

### Discounted Objective

Let’s now be a bit more precise about what constitutes a “good” policy. We will focus on the *discounted* approach.
Discounting naturally arises in situations where the decision-maker prefers to adopt *myopic* strategies: it is
more concerned with the immediate reward of its actions than with benefits it might enjoy far in the future.

Let $\lambda\in[0, 1)$ be the *discount* factor.
It accounts for the decision-maker’s degree of myopia; the closer to 0, the more preoccupied it is with immediate reward.
The $\lambda$-discounted reward of a policy $\pi\in\Pi^{\text{HR}}$ is:
$$
v_\lambda^\pi(s) := \lim_{T\to\infty}\mathbb{E}_s^\pi\left[ \sum_{t=1}^T \lambda^{t-1}r(s_t, a_t)\right] \; .
$$

Our first reflex should be to check that this definition actually makes sense - that is, the limit exists and is bounded. It is indeed safe when rewards are bounded and $\lambda<1$; we can then also swap limit and expectation to write:
$$
v_\lambda^\pi(s) = \mathbb{E}_s^\pi\left[ \sum_{t=1}^\infty \lambda^{t-1}r(s_t, a_t)\right] \; .
$$
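
As an illustrative sanity check (a sketch on the toy MDP assumed earlier, not part of the formal development), one can estimate $v_\lambda^\pi(s)$ by Monte-Carlo with a truncated horizon; the truncation error is at most $\lambda^{T}\max\vert r\vert/(1-\lambda)$, so a moderate $T$ suffices.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy MDP and stationary randomised policy (both purely illustrative).
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
r = np.array([[1.0, 0.0], [0.0, 2.0]])
d = np.array([[0.7, 0.3], [0.1, 0.9]])
lam = 0.9

def mc_discounted_value(s0, horizon=100, n_samples=2_000):
    """Monte-Carlo estimate of v_lambda^pi(s0), truncated at `horizon`."""
    total = 0.0
    for _ in range(n_samples):
        s, ret = s0, 0.0
        for t in range(horizon):
            a = rng.choice(2, p=d[s])
            ret += lam**t * r[s, a]          # lambda^{t-1} r(s_t, a_t), with t starting at 1
            s = rng.choice(2, p=p[s, a])
        total += ret
    return total / n_samples

print(mc_discounted_value(s0=0))
# The estimate stays below max|r| / (1 - lam) = 2 / 0.1 = 20, as the geometric series argument predicts.
```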

**Proof**

Notice how, by marginalising over states and actions, we can rewrite the discounted objective as:
$$
v_\lambda^\pi(s) = \sum_{(s', a')\in\mathcal{S}\times\mathcal{A}} r(s', a') \sum_{t=1}^\infty \lambda^{t-1} \mathbb{P}_s^\pi(s_t=s', a_t = a') \; .
$$
It turns out that for any history-dependent policy $\pi\in\Pi^{\text{HR}}$, we can find some Markovian policy $\pi'\in\Pi^\text{MR}$ such that their state-action visit frequencies are the same:
$$
\mathbb{P}_s^\pi(s_t=s', a_t = a') = \mathbb{P}_s^{\pi'}(s_t=s', a_t = a'), \; \text{ for any } t\geq 1 \text{ and } (s', a')\in\mathcal{S}\times\mathcal{A}\; .
$$
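
For intuition, here is a sketch of the construction behind this claim, following the standard argument (the details are deferred to the proof). Fixing the initial state $s$ and the policy $\pi\in\Pi^{\text{HR}}$, define for every round $t$ the Markovian randomised decision-rule
$$
d'_t(a\vert s') := \mathbb{P}_s^\pi\left(a_t = a \middle\vert s_t = s'\right), \; \text{ for all } s'\in\mathcal{S}, \; a\in\mathcal{A}\; .
$$
An induction over $t$ then shows that $\pi' = (d'_1, d'_2, \ldots)\in\Pi^\text{MR}$ assigns the same probability as $\pi$ to every event $\{s_t = s', a_t = a'\}$. Note that $\pi'$ may depend on the initial state $s$, which is fine since the claim is stated for a fixed $s$.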

**Proof**

In light of the previous re-writing of $v_\lambda^\pi$, this marks our first important result of this series, taking a first step towards trimming down the very large space of history-dependent policies. Indeed, we just showed that for any $\pi\in\Pi^{\text{HR}}$ and $s\in\mathcal{S}$, we can construct some $\pi'\in\Pi^\text{MR}$ such that $v_\lambda^\pi(s) = v_\lambda^{\pi'}(s)$. Therefore, it is enough to focus only on $\Pi^\text{MR}$, which is a much smaller policy class than $\Pi^\text{HR}$!

**Note**

### Vectorial Notations

We now introduce some notation that will allow for more compact writing.
Let $\mathcal{V}$ be the space of bounded functions mapping $\mathcal{S}$ to $\mathbb{R}$.
When $\mathcal{S}$ is *finite* - *i.e.* $\vert \mathcal{S} \vert = n\in\mathbb{N}$, we can represent such functions
as $n$-dimensional vectors. Indeed, writing $\mathcal{S} = \{ s^1, \ldots, s^n\}$ we will associate to any
$f\in\mathcal{V}$ a vector $\mathbf{f}\in\mathbb{R}^n$ where:
$$
\forall i\in\{1, \ldots, n\}, \; [\mathbf{f}]_i = f(s^i)\in\mathbb{R}\; .
$$

In the finite case, we will therefore identify $\mathcal{V}$ with $\mathbb{R}^n$ and use the so-called vectorial notation for finite MDPs to further reduce clutter. We list such notations below before stating a few useful identities.

In the following, $d\in\mathcal{D}^\text{MR}$ is a Markovian randomised decision-rule and $\pi\in\Pi^\text{MR}$ a Markovian randomised policy. The three main entities we will work with are the reward vector $\mathbf{r}_{d}$, the instantaneous transition matrix $\mathbf{P}_{d}$ and the transition matrix $\mathbf{P}_{\pi}^{t}$. In the definitions below, we index the entries of vectors and matrices with $s,s'\in\mathcal{S}$:
$$
[\mathbf{r}_{d}]_s = \sum_{a\in\mathcal{A}} d(a\vert s)\, r(s, a), \quad [\mathbf{P}_{d}]_{s, s'} = \sum_{a\in\mathcal{A}} d(a\vert s)\, p(s'\vert s, a), \quad [\mathbf{P}_{\pi}^{t}]_{s, s'} = \mathbb{P}_s^{\pi}(s_t = s')\; .
$$

For instance, the entry located at the $s^{\text{th}}$ row and $s'^{\text{th}}$ column of $\mathbf{P}_{d}$ evaluates the probability of transitioning from $s$ to $s'$ when following the randomised decision-rule $d$. Note that $\mathbf{r}_{d}\in\mathcal{V}$ while $\mathbf{P}_{d}$ and $\mathbf{P}_{\pi}^{t}$ are linear operators mapping $\mathcal{V}$ to itself.
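
In code, these vectorial objects are one einsum away from the primitive arrays; a sketch on the assumed toy MDP:

```python
import numpy as np

# Toy MDP and Markovian randomised decision-rule (illustrative).
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])   # p[s, a, s']
r = np.array([[1.0, 0.0], [0.0, 2.0]])     # r[s, a]
d = np.array([[0.7, 0.3], [0.1, 0.9]])     # d[s, a] = d(a|s)

# [r_d]_s = sum_a d(a|s) r(s, a)
r_d = np.einsum("sa,sa->s", d, r)

# [P_d]_{s,s'} = sum_a d(a|s) p(s'|s, a)
P_d = np.einsum("sa,sat->st", d, p)

# P_d is a row-stochastic matrix: each row is a distribution over next states.
assert np.allclose(P_d.sum(axis=1), 1.0)
print(r_d, P_d, sep="\n")
```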

The following identities will be useful in our proofs; the first ties the transition matrix of a policy $\pi=(d_1, \ldots)\in\Pi^\text{MR}$ to the instantaneous transition matrices: for any $t\geq 2$,
$$
\mathbf{P}_{\pi}^{t} = \mathbf{P}_{\pi}^{t-1} \mathbf{P}_{d_{t-1}} = \prod_{i=1}^{t-1} \mathbf{P}_{d_i}\; .
$$

**Proof**

The second allows us to rewrite the expected discounted value of $\pi=(d_1, \ldots)$ in a fairly compact manner:
$$
v_\lambda^\pi = \sum_{t=1}^{+\infty} \lambda^{t-1} \mathbf{P}_{\pi}^{t} \mathbf{r}_{d_{t}} \; ,
$$
where $\mathbf{P}_{\pi}^{1}=\mathbf{I}_n$. Above, we write the expected discounted value $v_\lambda^\pi\in\mathcal{V}$ in its vectorial form.
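
To close the loop, here is a sketch that evaluates a stationary policy on the toy MDP with the vectorial formula, truncated at a large horizon and using the product identity $\mathbf{P}_{\pi}^{t+1} = \mathbf{P}_{\pi}^{t}\mathbf{P}_{d}$ from above. It is cross-checked against $(\mathbf{I} - \lambda\mathbf{P}_{d})^{-1}\mathbf{r}_{d}$, a standard closed form valid for stationary policies.

```python
import numpy as np

# Toy MDP, stationary randomised policy and discount factor (all illustrative).
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
r = np.array([[1.0, 0.0], [0.0, 2.0]])
d = np.array([[0.7, 0.3], [0.1, 0.9]])
lam = 0.9

r_d = np.einsum("sa,sa->s", d, r)          # reward vector
P_d = np.einsum("sa,sat->st", d, p)        # instantaneous transition matrix

# Truncated vectorial sum: v ~= sum_{t=1}^T lambda^{t-1} P_pi^t r_d, with P_pi^1 = I.
v, P_pi_t = np.zeros(2), np.eye(2)
for t in range(1, 500):
    v += lam**(t - 1) * P_pi_t @ r_d
    P_pi_t = P_pi_t @ P_d                  # P_pi^{t+1} = P_pi^t P_d for a stationary policy

# Closed form for stationary policies: v = (I - lambda P_d)^{-1} r_d.
v_closed = np.linalg.solve(np.eye(2) - lam * P_d, r_d)
assert np.allclose(v, v_closed, atol=1e-6)
print(v)
```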

**Proof**

## Resources

Most of this blog post is a condensed version of [Puterman, 1994, Chapters 4 & 5]: Martin L. Puterman, *Markov Decision Processes: Discrete Stochastic Dynamic Programming*, Wiley, 1994.