The Linear Quadratic Regulator
This post is concerned with the Linear Quadratic Regulator (LQR) in discrete-time. The LQR stands as somewhat of a singularity in optimal control theory: the (only?) non-trivial control problem with continuous state and action spaces for which a closed-form solution is known.
Deterministic LQR
Let $d_x, d_u \in\mathbb{N}$. We are concerned with a continuous control problem: let $\mathcal{X} = \mathbb{R}^{d_x}$ be its state space and $\mathcal{U} = \mathbb{R}^{d_u}$ be its input (or control) space. The LQR presumes that the system’s dynamics are linear: $$ x_{n+1} = A x_n + B u_n \; , $$ where $A \in\mathbb{R}^{d_x\times d_x}$ and $B \in\mathbb{R}^{d_x \times d_u}$.
Finite-horizon problem
Let $N\in\mathbb{N}$. Under the aforementioned dynamics, we evaluate a sequence of controls $(u_0, \ldots, u_{N-1}) \in \mathcal{U}^N$ via a finite-horizon, quadratic criterion: $$ \mathcal{V}(x_0, u_0, \ldots, u_{N-1}) = \sum_{n=0}^{N-1} \left\{ x_n^{\top} Q_n x_n + u_n^\top R_n u_n \right\} + x_N^\top Q_N x_N \; , $$ where the $Q_n \in\mathbb{R}^{d_x\times d_x}$ and $R_n \in\mathbb{R}^{d_u\times d_u}$ are symmetric matrices. We will be looking for controls that make this criterion small. For our problem to be well-defined, we will further ask that the $Q_n$’s are positive semi-definite and the $R_n$’s are positive definite. Briefly, this ensures that our criterion is bounded below and that minimizing $\mathcal{V}(\cdot)$ makes some sense. That’s good news, since this is exactly what we now set out to do: $$ \text{ find }(u_0^\star, \ldots, u_{N-1}^\star) \in \argmin_{u_{_0}, \ldots, u_{_{N-1}}} \mathcal{V}(x_0, u_0, \ldots, u_{N-1}) \; . \tag{1} $$ In other words, we want to find an optimal control for a deterministic, finite-horizon LQR problem.
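Before anything clever, note that evaluating this criterion for a candidate control sequence only requires rolling the dynamics forward. A minimal NumPy sketch (the names `evaluate_cost`, `Q_list` and `R_list` are mine, not from any library):

```python
import numpy as np

def evaluate_cost(x0, controls, A, B, Q_list, R_list):
    """Roll out x_{n+1} = A x_n + B u_n and accumulate the quadratic criterion.

    Q_list holds (Q_0, ..., Q_N); R_list holds (R_0, ..., R_{N-1}).
    """
    x, total = np.asarray(x0, dtype=float), 0.0
    for u, Q, R in zip(controls, Q_list[:-1], R_list):
        total += x @ Q @ x + u @ R @ u   # stage cost x_n^T Q_n x_n + u_n^T R_n u_n
        x = A @ x + B @ u                # linear dynamics
    return total + x @ Q_list[-1] @ x    # terminal cost x_N^T Q_N x_N
```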
Just a large optimization problem
It’s easy to realize that (1) is just one large optimization problem - and a quadratic one at that! Indeed, observe that under the linear dynamics: $$ \begin{aligned} x_1 &= Ax_0 + Bu_0 \;, \\ x_2 &= Ax_1 + Bu_1 \;, \\ &= A^2 x_0 + ABu_0 + Bu_1 \;, \\ \vdots \\ x_N &= Ax_{N-1} + Bu_{N-1} \;, \\ &= A^N x_0 + A^{N-1}Bu_0 + \ldots + Bu_{N-1} \; . \end{aligned} $$ With a bit of re-writing, we basically just said that: $$ \begin{pmatrix} x_0 \\ x_1 \\ \vdots \\ x_{N} \end{pmatrix} = \underbrace{\begin{pmatrix} I_{d_x} \\ A \\ \vdots \\ A^N\end{pmatrix}}_{\mathbf{H}} x_0 + \underbrace{\begin{pmatrix} 0 & \ldots & \ldots& \ldots \\ B & 0 & \ldots& \ldots \\ AB & B & 0& \ldots \\ \vdots \\ A^{N-1}B & A^{N-2}B & \ldots & B \end{pmatrix}}_{\mathbf{G}} \begin{pmatrix} u_0 \\ \vdots \\ u_{N-1} \end{pmatrix} $$
By denoting $X = \begin{pmatrix} x_0^\top & \ldots & x_{N}^\top \end{pmatrix}^\top$ and $U = \begin{pmatrix} u_0^\top & \ldots & u_{N-1}^\top \end{pmatrix}^\top$ we therefore have that: $$ X= \mathbf{G} U + \mathbf{H} x_0 \; , $$
Let’s now rewrite our objective. It is relatively easy to realize that: $$ \begin{aligned} \mathcal{V}(x_0, U) &= \sum_{n=0}^{N-1} \left\{ x_n^{\top} Q_n x_n + u_n^\top R_n u_n \right\} + x_N^\top Q_N x_N \; ,\\ &= X^\top \overbrace{\begin{pmatrix} Q_0 & & \large{0} \\ & \ddots & \\ \large{0} & & Q_N \end{pmatrix}}^{\mathbf{\tilde{Q}}} X + U^\top \overbrace{\begin{pmatrix} R_0 & & \large{0} \\ & \ddots & \\ \large{0} & & R_{N-1} \end{pmatrix}}^{\mathbf{\tilde{R}}} U \\ &= X^\top {\mathbf{\tilde{Q}}}X + U^\top {\mathbf{\tilde{R}}} U \\ &= x_0^\top \mathbf{H}^\top {\mathbf{\tilde{Q}}} \mathbf{H}x_0+ 2 U^\top \mathbf{G}^\top {\mathbf{\tilde{Q}}}\mathbf{H}x_0 + U^\top ({ \mathbf{G}^\top {\mathbf{\tilde{Q}}} \mathbf{G} + \mathbf{\tilde{R}}}) U \; , \end{aligned} $$ where we last used the identity $X= \mathbf{G} U + \mathbf{H} x_0$.
Notice how $\mathcal{V}$ is quadratic in $U$, and how $\lim_{\| U \| \to \infty} \mathcal{V}(x_0, U) = +\infty$ (coercivity) since $\mathbf{G}^\top {\mathbf{\tilde{Q}}} \mathbf{G} + \mathbf{\tilde{R}} \succ 0$ (recall that $\mathbf{\tilde{Q}} \succeq 0$ and $\mathbf{\tilde{R}} \succ 0$). As a result, solving $\min_U \mathcal{V}(x_0, U)$ boils down to solving a (somewhat large) linear system: $$ \left(\mathbf{G}^\top {\mathbf{\tilde{Q}}} \mathbf{G} + \mathbf{\tilde{R}}\right) U^\star = - \mathbf{G}^\top \mathbf{\tilde{Q}} \mathbf{H} x_0 \; . $$ This can be easier said than done; indeed, the size of this linear system grows linearly with $N$ – hence solving it will cost us roughly $\mathcal{O}(N^3)$ operations, which quickly becomes prohibitive.
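For completeness, here is a rough NumPy sketch of this batch (open-loop) approach: build $\mathbf{H}$, $\mathbf{G}$ and the block-diagonal cost matrices, then solve the linear system above. All names are mine; this is a sketch rather than a reference implementation:

```python
import numpy as np
from scipy.linalg import block_diag

def solve_batch_lqr(x0, A, B, Q_list, R_list):
    """Open-loop LQR: write X = G U + H x0, then solve the optimality system for U."""
    d_x, d_u = B.shape
    N = len(R_list)
    powers = [np.linalg.matrix_power(A, k) for k in range(N + 1)]  # I, A, ..., A^N
    H = np.vstack(powers)
    G = np.zeros(((N + 1) * d_x, N * d_u))
    for i in range(1, N + 1):            # state block x_i ...
        for j in range(i):               # ... receives A^{i-1-j} B u_j
            G[i * d_x:(i + 1) * d_x, j * d_u:(j + 1) * d_u] = powers[i - 1 - j] @ B
    Q_tilde = block_diag(*Q_list)        # diag(Q_0, ..., Q_N)
    R_tilde = block_diag(*R_list)        # diag(R_0, ..., R_{N-1})
    lhs = G.T @ Q_tilde @ G + R_tilde    # positive definite
    rhs = -G.T @ Q_tilde @ H @ x0
    U = np.linalg.solve(lhs, rhs)        # costs roughly O((N d_u)^3)
    return U.reshape(N, d_u)             # row t is u_t
```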
Dynamic Programming
When we think about it, what we have described so far is just a finite-horizon MDP. It is therefore only natural to give Dynamic Programming (DP) a shot, hoping for a more efficient solution.
In the following, for any $x\in\mathcal{X}$ and $t \in \{0, \ldots, N\}$ let: $$ \mathcal{V}^\star_t(x) := \min_{u_{_t}, \ldots, u_{_{N-1}}} \sum_{k=t}^{N-1} \left\{x_k^\top Q_k x_k + u_k^\top R_k u_k \right\} + x_N^\top Q_N x_N\; \quad \text{ with } x_t = x\;, $$ be the value function associated to the $t$-tail sub-problem. One of the LQR’s fundamental results is that each $\mathcal{V}^\star_t(\cdot)$ is a quadratic function whose coefficients can be computed in a backward fashion. Formally, for any $t\in\{0, \ldots, N\}$ we have $ \mathcal{V}^\star_t(x) = x^\top P_t x $, where: $$ \begin{aligned} P_N &= Q_N \succeq 0 \\ \text{ for }t<N, \; P_t &= Q_t + A^\top P_{t+1}A - A^\top P_{t+1}B\left(R_t + B^\top P_{t+1} B\right)^{-1} B^\top P_{t+1}A \succeq 0\;. \end{aligned} $$
But what good is this property when it comes to finding an optimal control sequence? It turns out that the sequence $(u_0^\star, \ldots, u_{N-1}^\star)$ where each $u_t^\star = -(R_t + B^\top P_{t+1} B)^{-1}B^\top P_{t+1}Ax_t$ achieves the minimum in the backward recursion above, and is therefore optimal. In other words, playing: $$u_t = K_t x_t \quad \text{ where }\quad K_t = -(R_t + B^\top P_{t+1} B)^{-1}B^\top P_{t+1}A $$ at every round $t\in\{0, \ldots, N-1\}$ yields an optimal behavior.
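In code, the backward recursion is only a handful of lines. A sketch in NumPy, with `Q_list` holding $Q_0, \ldots, Q_N$ and `R_list` holding $R_0, \ldots, R_{N-1}$ (names are mine):

```python
import numpy as np

def riccati_recursion(A, B, Q_list, R_list):
    """Backward pass: P_N = Q_N, then compute P_t and the feedback gains K_t."""
    P = Q_list[-1]                                           # P_N = Q_N
    gains = []
    for Q, R in zip(reversed(Q_list[:-1]), reversed(R_list)):
        K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # K_t
        P = Q + A.T @ P @ A + A.T @ P @ B @ K                # Riccati update, written via the gain
        gains.append(K)
    gains.reverse()                                          # gains[t] = K_t for t = 0, ..., N-1
    return gains
```

Playing `u = gains[t] @ x` at round $t$ then implements the closed-loop control $u_t = K_t x_t$.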
In the previous approach, we computed the optimal controls $(u_0^\star, \ldots, u_{N-1}^\star)$ beforehand. This is called an open-loop approach. We just found out that each $u_t^\star$ can be expressed directly as a function of $x_t$. We can therefore wait to observe the state $x_t$ before deciding on our command (closed loop). This is an action vs. strategy distinction. In deterministic systems the two are equivalent; however, as we’ll see shortly, in stochastic settings the open-loop approach can be arbitrarily sub-optimal.
Finally, observe how the computational cost is now trimmed down to $\mathcal{O}(N)$ – one of DP’s main advantages, along with its portability to stochastic settings.
Steady-state problem
Let’s now part with the finite-horizon problem and consider the steady-state setting. The idea is to build a stabilizing controller ($x_t \to 0$), evaluated under an infinite horizon. We now evaluate a control sequence $U = (u_0, \ldots, u_t, \ldots)$ under the following criterion: $$ \mathcal{V}_\infty(x_0, U) = \sum_{t=0}^\infty x_t^\top Q x_t + u_t^\top R u_t \; . $$ Note that we now consider fixed state and control cost matrices. Observe also that $\mathcal{V}_\infty(x_0, U)$ can be $+\infty$ for some controls. For our minimization problem to be well-defined, we’ll need to ensure that $\mathcal{V}_\infty(x_0, U)$ is finite for at least one control sequence. To this end, the main control-theoretic notion we will need is called controllability. The pair $(A, B)$ is said to be controllable if: $$ \text{rank}\begin{bmatrix} B & AB & \ldots & A^{d_x-1}B \end{bmatrix}= d_x \; . $$ Controllability is sufficient to guarantee that for any $x_0$, there exist $T\in\mathbb{N}$ and $U = (u_0, \ldots, u_{T-1})$ such that $x_T = 0$; applying $u_t = 0$ from then on keeps the state at $0$, ensuring that we can indeed find a control sequence such that $\mathcal{V}_\infty(x_0, U)<+\infty$.
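Numerically, the rank condition is easy to check. A small sketch (the tolerance passed to `matrix_rank` is an arbitrary choice of mine):

```python
import numpy as np

def is_controllable(A, B, tol=1e-9):
    """Check that rank [B, AB, ..., A^{d_x - 1} B] equals d_x."""
    d_x = A.shape[0]
    blocks = [B]
    for _ in range(d_x - 1):
        blocks.append(A @ blocks[-1])          # next block: A^{k+1} B
    ctrb = np.hstack(blocks)                   # controllability matrix
    return np.linalg.matrix_rank(ctrb, tol=tol) == d_x
```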
Now, it makes sense to look for an optimal control. Inspired by the finite-horizon setting (and the theory of average-cost MDPs) we will seek a solution to the following Bellman optimality equation: $$ \mathcal{V}_\infty^\star(x) = x^\top Q x + \min_u\Big\{ u^\top R u + \mathcal{V}_\infty^\star(Ax + Bu)\Big\} \quad \text{ for any } x\in\mathbb{R}^{d_x}\; . $$
It is easy to check that $\mathcal{V}_\infty^\star(x) = x^\top P x$ is such a solution (as a matter of fact, it is the only one), where the matrix $P\succeq 0$ verifies the so-called Discrete Algebraic Riccati Equation (DARE): $$ P = Q + A^\top P A - A^\top P B (R + B^\top P B)^{-1}B^\top P A \; . $$ It can be shown that if $(A, B)$ is controllable and $(A, Q)$ is observable (e.g. when $Q \succ 0$), the DARE has a unique p.s.d. solution. The optimal control strategy is now a stationary one, and it writes: $$ u_t = K x_t \quad \text{ where } \quad K = -(R + B^\top P B)^{-1}B^\top P A\; . $$
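In practice, one rarely solves the DARE by hand: SciPy ships a solver for it. A minimal sketch, using a toy double-integrator system of my own choosing for illustration:

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def steady_state_gain(A, B, Q, R):
    """Solve the DARE for P and form the stationary gain K."""
    P = solve_discrete_are(A, B, Q, R)                   # P = Q + A^T P A - ...
    K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # K = -(R + B^T P B)^{-1} B^T P A
    return P, K

# Toy example: a double integrator with a cheap input penalty.
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
P, K = steady_state_gain(A, B, Q=np.eye(2), R=np.array([[0.1]]))
```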
Stochastic LQR
We can now have a quick glance at the stochastic case. The dynamics are now perturbed by a noisy disturbance, assumed to follow a fixed Gaussian distribution: $$ x_{t+1} = A x_t + B u_t + \omega_t, \quad \text{ where } \omega_t \overset{\text{i.i.d}}{\sim} \mathcal{N}(0, \Sigma) \; . $$
For simplicity, let’s go back to the finite-horizon objective. Instead of looking for an open-loop control $U = (u_0, \ldots, u_{N-1})$ we now explicitly look for good strategies – or policies $\Pi$, made up of a succession of mappings $\pi_t:\mathcal{X} \to \mathcal{U}$. An optimal policy $\Pi^\star = (\pi^\star_0, \ldots, \pi^\star_{N-1})$ satisfies: $$ \mathcal{V}_{\Pi^\star}(x) = \min_\Pi \mathbb{E}\left[\sum_{t=0}^{N-1} \{ x_t^\top Q x_t + u_t^\top R u_t \} + x_N^\top Q x_N \,\middle\vert\, x_0 = x;\, u_t = \pi_t(x_t) \right] = \mathcal{V}^\star(x) \;, $$ where $x_{t+1} = A x_t + B u_t + \omega_t$ and the expectation is taken over all realizations of the disturbances $\{\omega_t\}_t$.
A rather surprising fact (at least it was to me) is that for the LQR, the optimal policy in the stochastic setting is the same as in the deterministic one! This property is known as certainty equivalence, and as far as I know, is a singularity of the LQR. To see why it holds, we can repeat the derivation we carried out in the deterministic case. Let’s recursively define value functions as follows: $\mathcal{V}_N^\star(x) = x^\top P_N x + q_N$ where $q_N = 0$, and: $$ \mathcal{V}^\star_t(x) := x^\top Q x + \min_{u}\left\{ u^\top R u + \mathbb{E}\left[ \mathcal{V}^\star_{t+1}(Ax + Bu + \omega_t) \right]\right\} \; . $$ We won’t go through the whole induction, but assuming that $\mathcal{V}^\star_{t+1}(x) = x^\top P_{t+1} x + q_{t+1}$, and using the fact that $\mathbb{E}[\omega_t] = 0$, we obtain:
$$ \begin{aligned} \mathcal{V}^\star_t(x) &\overset{(i)}{=} x^\top Q x + x^\top A^\top P_{t+1} A x + \min_{u}\left\{ u^\top (R + B^\top P_{t+1}B)u + 2u^\top B^\top P_{t+1}A x\right\} + \mathbb{E}\left[\omega^\top P_{t+1}\omega\right] + q_{t+1} \; ,\\ &= x^\top\left( Q + A^\top P_{t+1} A - A^\top P_{t+1}B(R + B^\top P_{t+1}B)^{-1} B^\top P_{t+1}A\right) x + \mathbb{E}\left[\omega^\top P_{t+1}\omega\right] + q_{t+1}\; . \end{aligned} $$ where in $(i)$ the minimizer is none other than $u_t^\star = -(R + B^\top P_{t+1} B)^{-1}B^\top P_{t+1} A x$! By some simple linear algebra: $$ \mathbb{E}\left[\omega^\top P_{t+1}\omega\right] = \text{Tr}\left(P_{t+1}\mathbb{E}\left[\omega\omega^\top\right]\right) = \text{Tr}\left(P_{t+1}\Sigma\right)\; . $$ Hence $\mathcal{V}^\star_t(x) = x^\top P_t x + q_t$ where $P_t$ is defined recursively as in the deterministic case, and $q_t = q_{t+1} + \text{Tr}\left(P_{t+1}\Sigma\right)$.
In a few words, stochasticity changes the actual values of the value functions, but not the optimal strategy, which still writes: $$ \pi_t^\star(x) = K_t x \; . $$
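To close, here is a small simulation sketch: we roll out the noisy dynamics while re-using the gains $K_t$ computed by the deterministic backward recursion (e.g. the `riccati_recursion` sketch above), i.e. the certainty-equivalent policy. The covariance `Sigma` and the seed are placeholders:

```python
import numpy as np

def rollout_stochastic(x0, A, B, gains, Sigma, seed=0):
    """Simulate x_{t+1} = A x_t + B u_t + w_t under the certainty-equivalent policy."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    traj = [x]
    for K in gains:                                   # K_t from the deterministic recursion
        u = K @ x                                     # pi_t^*(x_t) = K_t x_t
        w = rng.multivariate_normal(np.zeros(x.shape[0]), Sigma)
        x = A @ x + B @ u + w                         # noisy linear dynamics
        traj.append(x)
    return np.array(traj)
```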