Oldies but goodies: MLE

Maximum-Likelihood Estimation (MLE) is ubiquitous in modern (supervised) machine learning, to the point that we sometimes forget it is the very principle behind the criterion we use to train fancy models. Justifying the MLE’s current popularity from a statistical perspective only would be a stretch; nonetheless, it is useful to remember that it enjoys some comfortable (asymptotic) statistical properties. This post provides proofs of two such properties: consistency and asymptotic normality.

Maximum-Likelihood Estimation (MLE) is a generic paradigm for estimating unknown parametric distributions from a collection of samples. Under mild regularity conditions, it yields consistent estimators: as the number of samples goes to infinity, we are guaranteed to properly identify the unknown distribution. This property mirrors the Law of Large Numbers (LLN); another important attribute of MLE echoes the Central Limit Theorem (CLT) and guarantees that MLE estimators are asymptotically normal. This result is comforting but barely useful in itself; yet, we will see that the variance of said normal distribution allows us to prove that the MLE is asymptotically efficient, as it matches the Cramér-Rao bound.

Setting

$\quad$ To keep things simple, we will restrict our short study to real-valued random variables. The claimed results easily extend to discrete random variables (basically, replacing integrals by sums).

Let’s quickly go through the measure-theoretic set-up: let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and $X : \Omega \to \mathbb{R}$ an $\mathcal{F}/\mathcal{B}(\mathbb{R})$-measurable function (in plain words, our real-valued random variable). We denote by $\mathbb{P}_{\theta_\star}$ the probability law of $X$, assumed absolutely continuous w.r.t. the Lebesgue measure $\nu$. It is indexed by a fixed but unknown parameter ${\theta_\star}$; we only assume that $\theta_\star\in\Theta$ where $\Theta$ is a known, compact subset of $\mathbb{R}$. For any $\theta\in\Theta$ we will denote by $p_\theta := d\mathbb{P}_{\theta}/d\nu$ the density of the law indexed by $\theta$.

We will assume that we have access to a set of independent realisations $\mathbf{X}_n := \{X_1, \ldots, X_n \} \sim \mathbb{P}_{\theta_\star}^{\otimes n}$. The (normalised) log-likelihood of the dataset $\mathbf{X}_n$ under a hypothesis $\mathbb{P}_\theta$ will be denoted $\mathcal{L}_n(\theta)$: $$ \begin{aligned} \mathcal{L}_n(\theta) :&= \frac{1}{n}\log \frac{d\mathbb{P}_\theta^{\otimes n}}{d\nu^{\otimes n}}(\mathbf{X}_n) \;, &\\ &= \frac{1}{n}\log \prod_{i=1}^n p_\theta(X_i) \;,& (\text{independence}) \\ &= \frac{1}{n}\sum_{i=1}^n \log p_\theta(X_i) \;. \end{aligned} $$ The Maximum-Likelihood Estimator is defined as the maximiser of the log-likelihood. (We will soon impose conditions for this definition to make sense, i.e. that an argmax exists and is unique.)

$$ \theta_n := \argmax_{\theta\in\Theta} \mathcal{L}_n(\theta) \; . $$

$\quad$ As a maximiser of some statistic of $\mathbf{X}_n$, the MLE is a special case of an extremum estimator.
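To make the definition concrete, here is a small numerical sketch. Everything in it is an illustrative assumption rather than part of the post’s argument: we pick the exponential model $p_\theta(x) = \theta e^{-\theta x}$, a compact $\Theta = [0.1, 10]$, and a simple grid search standing in for a proper optimiser.

```python
import numpy as np

# Hypothetical example: X ~ Exponential(rate=theta_star), whose density is
# p_theta(x) = theta * exp(-theta * x). Model and parameters are assumptions
# made for illustration only.
rng = np.random.default_rng(0)
theta_star = 2.0
X = rng.exponential(scale=1.0 / theta_star, size=10_000)  # X_i ~ P_{theta_star}

def log_likelihood(theta: float, X: np.ndarray) -> float:
    """Normalised log-likelihood L_n(theta) = (1/n) sum_i log p_theta(X_i),
    which for this model simplifies to log(theta) - theta * mean(X)."""
    return np.log(theta) - theta * X.mean()

# Theta is taken compact, here [0.1, 10], and discretised for a grid search.
thetas = np.linspace(0.1, 10.0, 2_000)
theta_n = thetas[np.argmax([log_likelihood(t, X) for t in thetas])]
```

For this particular model the argmax is also available in closed form, $\theta_n = 1/\bar{X}_n$, which makes the sketch easy to check against.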

Consistency

The MLE is consistent: as our dataset $\mathbf{X}_n$ grows ($n\to\infty$) the estimator $\theta_n$ converges (in some sense) to the unknown $\theta_\star$. Below, we will show that this convergence happens in probability: $\theta_n \overset{\text{p}}{\to} \theta_\star$:

$\qquad \qquad \quad\;\text{Under some regularity assumptions (see 1-4 below):}\\$ $$ \forall\varepsilon>0,\; \lim_{n\to\infty}\mathbb{P}(\vert \theta_n - \theta_\star\vert \geq \varepsilon) = 0\; . $$
Consistency

This is quite comforting: as we collect more data, we can expect to recover $\theta_\star$ in the limit. The intuition behind the proof is quite straightforward. It essentially relies on the fact that the expected log-likelihood $ \mathrm{L}(\theta) := \mathbb{E}_{p_{\theta_\star}}\left[\log p_\theta(X)\right] $ is maximised by $\theta_\star$, and that by the (weak) law of large numbers $\mathcal{L}_n(\theta) \overset{\text{p}}{\to} \mathrm{L}(\theta)$ for any $\theta$. Since $\theta_n$ maximises $\mathcal{L}_n(\theta)$, it is expected to converge to $\theta_\star$.

Note

We can now work on making this intuition formal. That $ \mathrm{L}(\theta_\star) = \max_{\theta} \mathrm{L}(\theta) $ can easily be proved using Jensen’s inequality, without any specific assumptions.

Proof

That, for any $\theta$, we have $\mathcal{L}_n(\theta) \overset{\text{p}}{\to} \mathrm{L}(\theta)$ is a straightforward consequence of the weak LLN. However, to guarantee the convergence of the argmax, we will need a little extra help, which will come from the uniform law of large numbers. We will adopt a sufficient (but not necessary) set of regularity assumptions to ensure it holds.

  1. Identifiability: $\theta_1 \neq \theta_2 \Longleftrightarrow p_{\theta_1} \neq p_{\theta_2}$,
  2. Regularity: $\forall x\in\mathbb{R}, \; \theta \mapsto p_\theta(x)$ is continuous,
  3. Compactness: $\Theta$ is a compact subset of $\mathbb{R}$,
  4. Dominance: there exists an integrable $D : \mathbb{R}\to \mathbb{R}$ such that $\vert\log p_\theta(x)\vert \leq D(x)$ for any $\theta$ in $\Theta$.

Identifiability is necessary to ensure that convergence of distributions implies convergence in parameters. The regularity and compactness assumptions ensure that $\theta_n$ is safely defined as the maximiser of $\mathcal{L}_n$ (i.e. a maximiser exists). The dominance assumption is key to obtaining a uniform law of large numbers. Actually, let’s state this result straight away.

$\qquad \qquad \qquad \qquad\qquad \text{Under assumptions 1-4:}$ $$ \sup_{\theta\in\Theta} \vert \mathcal{L}_n(\theta) - \mathrm{L}(\theta)\vert \overset{\text{p}}{\to} 0\; . $$
Uniform (weak) LLN

This essentially extends the weak LLN so that the convergence happens uniformly in $\theta$. The result is immediate if $\Theta$ is finite, thanks to the LLN and a simple union bound. When $\Theta$ has infinite cardinality, we fall back to the finite setting thanks to a covering argument.

Proof
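As a quick empirical illustration (a hypothetical example, reusing the exponential model $p_\theta(x) = \theta e^{-\theta x}$ for which both $\mathcal{L}_n(\theta) = \log\theta - \theta\bar{X}_n$ and $\mathrm{L}(\theta) = \log\theta - \theta/\theta_\star$ have closed forms), one can check that the sup-deviation over a grid of $\Theta$ shrinks as $n$ grows:

```python
import numpy as np

# Hedged sketch: the model and the grid over Theta are assumptions made for
# this example only, chosen so that L_n and L are available in closed form.
rng = np.random.default_rng(4)
theta_star = 2.0
thetas = np.linspace(0.1, 10.0, 500)  # discretised compact Theta

sups = []
for n in [100, 100_000]:
    X = rng.exponential(scale=1.0 / theta_star, size=n)
    Ln = np.log(thetas) - thetas * X.mean()        # empirical L_n(theta)
    L = np.log(thetas) - thetas / theta_star       # population L(theta)
    sups.append(np.max(np.abs(Ln - L)))            # sup-deviation over Theta
```

For this model the sup-deviation reduces to $\sup_\theta \theta\,\vert\bar{X}_n - 1/\theta_\star\vert$, so the uniform bound inherits the LLN rate of the empirical mean.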

We now have the tools to show convergence of the argmax – that is, of $\theta_n$ for $\mathcal{L}_n$ towards $\theta_\star$ for $\mathrm{L}$. Indeed, for any $\varepsilon>0$, using uniform convergence easily yields that: $$ \tag{1} \lim_{n\to\infty} \mathbb{P}\left(\vert \mathcal{L}_n(\theta_n) - \mathrm{L}(\theta_\star)\vert \geq \varepsilon\right) = 0 \; . $$ In other words, we showed that $\mathcal{L}_n(\theta_n) \overset{\text{p}}{\to} \mathrm{L}(\theta_\star)$.

Proof

Finally, observe that by the triangle inequality we have:

$$ \vert \mathrm{L}(\theta_n) - \mathrm{L}(\theta_\star) \vert \leq \vert \mathrm{L}(\theta_n) - \mathcal{L}_n(\theta_n) \vert + \vert \mathcal{L}_n(\theta_n) - \mathrm{L}(\theta_\star) \vert\; , $$ where $\vert \mathcal{L}_n(\theta_n) - \mathrm{L}(\theta_\star) \vert \overset{\text{p}}{\to} 0$ thanks to (1) and $\vert \mathrm{L}(\theta_n) - \mathcal{L}_n(\theta_n) \vert \overset{\text{p}}{\to}0$ by uniform convergence. Hence $ \mathrm{L}(\theta_n) \overset{\text{p}}{\to} \mathrm{L}(\theta_\star) $. Since, by continuity, compactness and the identifiability assumption 1., $\theta_\star$ is the unique, well-separated maximiser of $\mathrm{L}$, this proves that $\theta_n$ converges in probability to $\theta_\star$.
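This convergence is easy to observe empirically. Below is a hedged sketch: the exponential model and its closed-form MLE $\theta_n = 1/\bar{X}_n$ are illustrative assumptions, not part of the argument above.

```python
import numpy as np

# Hypothetical consistency check for the exponential model, where the MLE
# has the closed form theta_n = 1 / mean(X).
rng = np.random.default_rng(1)
theta_star = 2.0

errors = []
for n in [100, 10_000, 1_000_000]:
    X = rng.exponential(scale=1.0 / theta_star, size=n)
    theta_n = 1.0 / X.mean()
    errors.append(abs(theta_n - theta_star))  # |theta_n - theta_star|
```

The recorded absolute error shrinks towards 0 as $n$ grows, as consistency predicts (for one realisation the decrease need not be monotone, but the trend is clear).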

Asymptotic Normality

So far, under some regularity assumptions, we showed that $\theta_n \overset{\text{p}}{\to} \theta_\star$ – a property mirroring the (weak) LLN. It is natural to wonder “how fast” this convergence happens, at least asymptotically. In other words, we are asking whether a Central Limit Theorem (CLT) holds for the MLE. Good news, it does! Below, we denote: $$ \mathcal{I}(\theta_\star) := \mathbb{E}_{\theta_\star}\left[\left(\left.\frac{d}{d\theta}\log p_\theta(X)\right\vert_{\theta_\star}\right)^2\right]\;, $$ a quantity known as the Fisher information.

$\qquad \qquad \qquad\qquad \qquad \;\, \text{Under some regularity assumptions (see below):}\\$ $$ \sqrt{n}(\theta_n-\theta_\star) \overset{\text{d}}{\longrightarrow} \mathcal{N}(0, \mathcal{I}(\theta_\star)^{-1})\; . $$
Asymptotic Normality

In a CLT-like fashion, $\sqrt{n}(\theta_n-\theta_\star)$ is asymptotically normally distributed. The variance of said normal distribution is of particular interest. Indeed, the Cramér-Rao bound states that any unbiased estimator $\theta^\prime$ of $\theta_\star$ built from $\mathbf{X}_n$ satisfies $ \mathbb{V}\text{ar}(\theta^\prime) \geq (n\,\mathcal{I}(\theta_\star))^{-1} $. We just claimed that, asymptotically, $\theta_n$ matches this lower bound: it is therefore, asymptotically, the unbiased estimator with the smallest variance. Such an estimator is called asymptotically efficient.

To prove the asymptotical normality of the MLE, we will assume that:

  1. The assumptions 1-4 hold,
  2. For any $x$, the function $\theta\mapsto p_\theta(x)$ is twice continuously differentiable.

Before jumping into the proofs, it is useful to recall some important properties. First, the expectation of the score function is always 0: $$ \tag{2} \mathbb{E}_{\theta_\star}\left[\left.\frac{d}{d\theta}\log p_{\theta}(X)\right\vert_{\theta_\star}\right] = 0 \;. $$ Further, under 2., the following identity holds: $$ \tag{3} \mathcal{I}(\theta_\star) = -\mathbb{E}_{\theta_\star}\left[\left.\frac{d^2}{d\theta^2}\log p_{\theta}(X)\right\vert_{\theta_\star}\right]\; . $$

Proof
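Both identities can be sanity-checked by Monte Carlo. The sketch below again assumes the exponential model (an illustrative choice), where the score is $\frac{d}{d\theta}\log p_\theta(x) = 1/\theta - x$ and $\mathcal{I}(\theta) = 1/\theta^2$ in closed form:

```python
import numpy as np

# Hedged Monte Carlo check of (2) and of the Fisher information, assuming the
# exponential model p_theta(x) = theta * exp(-theta * x).
rng = np.random.default_rng(2)
theta_star = 2.0
X = rng.exponential(scale=1.0 / theta_star, size=1_000_000)

score = 1.0 / theta_star - X        # score function evaluated at theta_star
score_mean = np.mean(score)         # should be ~0 by Eq. (2)
fisher_mc = np.mean(score ** 2)     # E[score^2], should be ~1/theta_star^2

# For this model d^2/dtheta^2 log p_theta(x) = -1/theta^2 deterministically,
# so minus its expectation recovers the same value, consistent with Eq. (3).
```

Here the second derivative of the log-density is constant in $x$, which makes the identity (3) hold exactly rather than just in expectation; for richer models both sides would need Monte Carlo estimation.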

$\theta_n$ being the argmax of $\mathcal{L}_n(\theta)$, we have $\left.\frac{d}{d\theta} \mathcal{L}_n(\theta)\right\vert_{\theta_n}=0$ (assuming it lies within the interior of $\Theta$). By an exact Taylor expansion, and adopting more convenient notations: $$ \begin{aligned} 0 &= \mathcal{L}_n^\prime(\theta_\star) + \mathcal{L}_n^{\prime\prime}(\tilde\theta_n)(\theta_n - \theta_\star)\; , \\ &\Longrightarrow \theta_n - \theta_\star = -\mathcal{L}_n^\prime(\theta_\star) / \mathcal{L}_n^{\prime\prime}(\tilde\theta_n) \; , \end{aligned} $$ where $\tilde\theta_n\in[\theta_n, \theta_\star]$. Observe that: $$ \begin{aligned} \mathcal{L}_n^\prime(\theta_\star) &= \mathcal{L}_n^\prime(\theta_\star) - \mathrm{L}^\prime(\theta_\star)\;, &(\mathrm{L}^\prime(\theta_\star) = 0 \text{ by } (2)) \\ &= \frac{1}{n}\sum_{i=1}^n \left.\frac{d}{d\theta}\log p_\theta(X_i)\right\vert_{\theta_\star} - \mathbb{E}_{\theta_\star}\left[\left.\frac{d}{d\theta}\log p_\theta(X)\right\vert_{\theta_\star}\right] \;. \\ \end{aligned} $$ The CLT applies, guaranteeing that $\sqrt{n}\,\mathcal{L}_n^\prime(\theta_\star) \overset{\text{d}}{\to} \mathcal{N}(0, \mathcal{I}(\theta_\star))$.

Proof

Finally, by consistency of $\theta_n$, we have $\tilde\theta_n \overset{\text{p}}{\to} \theta_\star$. Using another uniform convergence argument (similar to the one introduced to demonstrate consistency) one can easily show that: $$ \mathcal{L}_n^{\prime\prime}(\tilde\theta_n) \overset{\text{p}}{\to} \mathbb{E}_{\theta_\star}\left[\left.\frac{d^2}{d\theta^2}\log p_{\theta}(X)\right\vert_{\theta_\star}\right] = -\mathcal{I}(\theta_\star)\;, $$ thanks to (3). Putting our findings together: we showed that $\theta_n - \theta_\star = -\mathcal{L}_n^\prime(\theta_\star) / \mathcal{L}_n^{\prime\prime}(\tilde\theta_n)$, where $\sqrt{n}\,\mathcal{L}_n^\prime(\theta_\star) \overset{\text{d}}{\to} \mathcal{N}(0, \mathcal{I}(\theta_\star))$ and $\mathcal{L}_n^{\prime\prime}(\tilde\theta_n) \overset{\text{p}}{\to} -\mathcal{I}(\theta_\star)$; a simple application of Slutsky’s lemma yields the announced claim: $$ \sqrt{n}(\theta_n - \theta_\star) \overset{\text{d}}{\to} \mathcal{N}(0, \mathcal{I}(\theta_\star)^{-1})\; . $$
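As a final hedged sanity check (again assuming the exponential model with closed-form MLE $\theta_n = 1/\bar{X}_n$, for which $\mathcal{I}(\theta_\star)^{-1} = \theta_\star^2$), one can verify by simulation that $\sqrt{n}(\theta_n - \theta_\star)$ is approximately centred with the announced variance:

```python
import numpy as np

# Hypothetical simulation: repeat the estimation many times and inspect the
# empirical distribution of sqrt(n) * (theta_n - theta_star).
rng = np.random.default_rng(3)
theta_star, n, n_reps = 2.0, 2_000, 1_000

X = rng.exponential(scale=1.0 / theta_star, size=(n_reps, n))
theta_n = 1.0 / X.mean(axis=1)            # closed-form MLE, one per repetition
z = np.sqrt(n) * (theta_n - theta_star)   # should be ~ N(0, theta_star^2)
```

The empirical mean of `z` is close to 0 and its empirical variance close to $\mathcal{I}(\theta_\star)^{-1} = \theta_\star^2 = 4$, matching the asymptotic normality statement.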