Control in Entropy Regularized MDPs

This post introduces the Maximum Entropy (MaxEnt) framework for Reinforcement Learning. The ambition is to see how “standard” control algorithms like value or policy iteration translate to the MaxEnt setting, and to describe their convergence properties. We will also discuss how said control approach influenced the design of some popular modern algorithms such as Soft Actor-Critic (SAC).
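
As a teaser, here is one common way to write the entropy-regularised (soft) Bellman backup that underpins soft value iteration; the temperature $\tau$ and the notation below are my own shorthand:

$$
Q_{\text{soft}}(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s' \sim p(\cdot \mid s, a)}\big[V_{\text{soft}}(s')\big],
\qquad
V_{\text{soft}}(s) = \tau \log \sum_{a} \exp\big(Q_{\text{soft}}(s, a) / \tau\big).
$$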

Oldies but goodies: MLE

Maximum-Likelihood Estimation (MLE) is ubiquitous in modern (supervised) machine learning, to the point that we sometimes forget it is the very principle behind the criterion we use to train fancy models. Justifying the MLE’s current popularity on statistical grounds alone would be a stretch; nonetheless, it is useful to remember that it enjoys some comfortable (asymptotic) statistical properties. This post provides proofs of two such properties: consistency and asymptotic normality.
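
To give a flavour of both statements (notation mine): under suitable regularity conditions, the MLE $\hat{\theta}_n$ computed from $n$ i.i.d. samples satisfies

$$
\hat{\theta}_n \xrightarrow{\;\mathbb{P}\;} \theta^\star
\qquad \text{and} \qquad
\sqrt{n}\,\big(\hat{\theta}_n - \theta^\star\big) \xrightarrow{\;d\;} \mathcal{N}\big(0,\, I(\theta^\star)^{-1}\big),
$$

where $\theta^\star$ is the true parameter and $I(\theta^\star)$ the Fisher information.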

Model-based RL (1/?)

This post is the first of a short series concerned with model-based RL. We will start walking this road via the principled trail: describing and analysing $\texttt{UCB-VI}$, a theoretically grounded algorithm to solve unknown finite MDPs. More precisely, we will see how this model-based approach directly models the unknown MDP and uses optimism for strategic exploration to provably find “good” policies in finite time. Concentration inequalities and regret bounds incoming, fun!
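
Roughly, and with notation that may well differ from the post's, the optimistic backup augments the empirical Bellman update with an exploration bonus $b(s, a)$ that shrinks with the visit count $n(s, a)$:

$$
\bar{Q}_h(s, a) = \hat{r}(s, a) + b(s, a) + \sum_{s'} \hat{p}(s' \mid s, a)\, \bar{V}_{h+1}(s'),
\qquad
b(s, a) \propto \sqrt{1 / n(s, a)} \;\;\text{(up to logarithmic factors)}.
$$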

Oldies but goodies: Ergodic Theorem

The goal of this blog post is to provide a self-contained proof of the so-called fundamental theorem of Markov chains. It states that, provided some basic properties hold (spoiler: ergodicity), any Markov chain random walk converges to the same limiting distribution, irrespective of the starting state. A hidden goal is to introduce some important tools for the analysis of the average-reward criterion in MDPs.
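
In symbols (anticipating on the post's notations), an ergodic chain with transition matrix $P$ admits a unique stationary distribution $\mu$ such that

$$
\mu^\top P = \mu^\top,
\qquad
\lim_{t \to \infty} \mathbb{P}(s_t = s \mid s_0) = \mu(s) \quad \text{for any starting state } s_0.
$$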

From VI to DQN

The goal of this blog post is to walk the path that saw the Value Iteration algorithm mature into a variety of modern value-based reinforcement learning methods empowered by deep learning. In particular, we will first see how VI gave birth to the Q-learning algorithm. We will then see how this tabular algorithm was gradually modified to handle function approximation and give rise to the watershed DQN algorithm.
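
For reference, the tabular Q-learning update that sits halfway along this path reads, with step-size $\alpha$ (notation mine):

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \Big( r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \Big).
$$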

RL Landscape (1/?)

This blog-post is the first of a short series on the modern algorithmic Reinforcement Learning landscape. The series’ objective is 1) to lay out the borders of some “boxes” where one can fit most modern RL algorithms, 2) to define said boxes’ content in an almost self-contained fashion and 3) to discuss each algorithmic family’s strengths and weaknesses. This first post provides a very brief description of what those boxes are.

Multi-Agent Dynamical Systems (3/3)

We switch gears slightly in this third blog-post of the multi-agent series. Letting go of game-theoretic concepts, we instead discuss some training paradigms for multi-agent reinforcement learning. We will cover the main limitations of centralised and independent learning, to land on centralised training with decentralised execution (CTDE), arguably the most established framework to train autonomous agents in decentralised multiplayer games.

Multi-Agent Dynamical Systems (2/3)

This second blog-post of the multi-agent series is dedicated to solution concepts: how we can characterise and compare different joint policies. We will cover Pareto optimality, the definition and existence of Nash equilibria, as well as the minimax theorem in two-agent zero-sum games.
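
As a preview, the minimax theorem states that in a finite two-agent zero-sum game with payoff matrix $A$ and mixed strategies $x$ and $y$ (ranging over the respective simplices $\Delta_1$ and $\Delta_2$):

$$
\max_{x \in \Delta_1} \min_{y \in \Delta_2} x^\top A y \;=\; \min_{y \in \Delta_2} \max_{x \in \Delta_1} x^\top A y.
$$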

Multi-Agent Dynamical Systems (1/3)

This blog-post is the first of a short series on Multi-Agent control and RL. The objective here is to cover the different models of multi-agent interactions, from repeated matrix games to partially observable Markov games. Going up the game hierarchy (from less to more general), we will detail what constitutes valid policies and how they differ from their fully and partially observed MDP cousins.

Expectation Maximization

This post covers the Expectation Maximization (EM) algorithm, a popular heuristic to (approximately) compute maximum likelihood estimates when dealing with unobserved / latent data. We will motivate and derive the EM’s inner mechanisms before making them explicit on two classical examples (Gaussian Mixture parameter estimation and Hidden Markov Model identification).
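
In a nutshell (with notation that may differ slightly from the post's), iteration $t$ of EM alternates between

$$
\text{E-step:}\quad Q\big(\theta \mid \theta^{(t)}\big) = \mathbb{E}_{z \sim p(\cdot \mid x,\, \theta^{(t)})}\big[\log p(x, z \mid \theta)\big],
\qquad
\text{M-step:}\quad \theta^{(t+1)} = \arg\max_{\theta}\, Q\big(\theta \mid \theta^{(t)}\big).
$$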

The Linear Quadratic Regulator

This post is concerned with the Linear Quadratic Regulator (LQR) in discrete time. The LQR stands as somewhat of a singularity in optimal control theory: the (only?) non-trivial control problem in continuous state and action spaces for which a closed-form solution is known.
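
Concretely (notation mine), for dynamics $x_{t+1} = A x_t + B u_t$ and cost $\sum_t x_t^\top Q x_t + u_t^\top R u_t$, the optimal control is a linear state feedback $u_t = -K_t x_t$, with the gain given by the backward Riccati recursion:

$$
K_t = \big(R + B^\top P_{t+1} B\big)^{-1} B^\top P_{t+1} A,
\qquad
P_t = Q + A^\top P_{t+1} A - A^\top P_{t+1} B \big(R + B^\top P_{t+1} B\big)^{-1} B^\top P_{t+1} A.
$$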

Partial Observability and Belief in MDPs

Partially Observable MDPs allow us to model sensor noise, data occlusion and the like within the Markov Decision Process framework. Partial observability is unquestionably practically relevant, but at first glance this richer interaction setting seems to invalidate the fundamental theorem of MDPs, a.k.a. the sufficiency of stationary policies. The goal of this post is to nuance this last statement: we will detail how POMDPs really are just “large” MDPs and therefore enjoy many of the same features.
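
The central object is the belief $b_t(s) = \mathbb{P}(s_t = s \mid \text{history})$; after playing action $a$ and receiving observation $o$, it updates (up to normalisation, notation mine) as

$$
b'(s') \;\propto\; O(o \mid s', a) \sum_{s} p(s' \mid s, a)\, b(s),
$$

which is precisely what turns a POMDP into a (large) MDP over belief states.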

A Tale of Policy Gradients

This post covers the derivation of several policy gradient expressions. It moves from the vanilla policy gradient through the policy gradient theorem to finish with deterministic policy gradients. No attention is given to optimisation aspects (whether and how policy gradients can find locally optimal policies): the focus is on the gradients’ (semi-)formal derivation.
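
For instance, the policy gradient theorem yields an expression of the form (with $d^{\pi_\theta}$ a discounted state visitation distribution, notation mine):

$$
\nabla_\theta J(\theta) \;=\; \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta(\cdot \mid s)}\Big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\Big].
$$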

MDP Fundamentals (3/3)

This blog-post is concerned with computational methods for finding optimal policies. We will cover Value Iteration (VI) and Policy Iteration (PI), which serve as fundamental building blocks for many modern RL methods. We will also quickly cover the Generalized Policy Iteration approach, on top of which stand most fancy modern actor-critic methods.
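
As a reminder, a single Value Iteration sweep applies the Bellman optimality operator to the current estimate, while Policy Iteration alternates policy evaluation with greedy improvement; in symbols (notation mine),

$$
V_{k+1}(s) = \max_{a} \Big\{ r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, V_k(s') \Big\}.
$$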

MDP Fundamentals (2/3)

In this second blog post of the series, we cover the problem of value prediction and control in finite discounted MDPs. This will lead us in particular to review the Bellman prediction and control equations, and to a fundamental theorem of MDPs: stationary policies are enough for optimal control!
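
In symbols (up to minor notational differences with the series), the prediction and control equations read

$$
v^\pi(s) = \sum_{a} \pi(a \mid s) \Big\{ r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, v^\pi(s') \Big\},
\qquad
v^\star(s) = \max_{a} \Big\{ r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, v^\star(s') \Big\}.
$$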

MDP Fundamentals (1/3)

This blog post is the first of a short series on control in discounted discrete MDPs. We will cover here the very basics: the definition of an MDP, the different notions of policies (history-dependent, Markovian, stationary,…), as well as the discounted objective for MDPs.
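
In symbols, the discounted objective attached to a policy $\pi$ and an initial state $s$ is

$$
v^\pi(s) \;=\; \mathbb{E}^{\pi}\Big[\,\sum_{t \ge 0} \gamma^t\, r(s_t, a_t) \,\Big|\, s_0 = s\Big],
\qquad \gamma \in [0, 1).
$$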