We all have heard about diffusion models and played with their generative magic.
To me, it felt awkward not properly understanding how they work, from first principles.
After all, they are just another latent variable model, right?
This post dives into the probabilistic foundations that make them tick,
breaking down the tools needed to re-derive variational inference for diffusion models.
The term “diffusion model” refers to a family of latent variable models that includes, e.g.,
diffusion probabilistic models [1],
noise-conditioned score networks [2],
and denoising diffusion probabilistic models [3].
In this post we will focus on the latter – it builds on top of the first and has close ties with the second.
Warm-up
There is nothing fundamental tying diffusion models to the normal distribution—however, in practice, people
often use the normal distribution as an elementary building block to a diffusion model (we will soon see how).
We will follow that path here, and first go through some well-known identities related to Gaussian density multiplication and convolution.
You surely have seen those results countless times before;
we provide the proofs (which are undergraduate course all-time classics) for the sake of completeness.
For the rest of this section, let $p(x) = \mathcal{N}(x\vert \mu_1, \sigma_1^2)$ and $q(x) = \mathcal{N}(x\vert \mu_2, \sigma_2^2)$
be two univariate normal densities.
$\qquad\qquad\qquad\qquad\quad\text{Gaussian distributions are stable by multiplication and convolution.}\\$
$\text{}\\$
$\text{1. The product of two Gaussian densities is proportional to another Gaussian density:}\\$
$$
\tag{1}
p(x) q(x) \propto \mathcal{N}(x\vert \nu, \tau^2) \text{ where }\nu = (\mu_1\sigma_2^2 + \mu_2\sigma_1^2)/(\sigma_1^2 + \sigma_2^2) \text{ and }
\tau^2 = \sigma_1^2\sigma_2^2/(\sigma_1^2 + \sigma_2^2).
$$
$\text{}\\$
$\text{2. The convolution between the two Gaussian densities is still Gaussian:}\\$
$$
\tag{2}
\int_{t} p(t)q(x-t)dt = \mathcal{N}(x\vert \bar\mu, \bar\sigma^2) \text{ where }\bar\mu = \mu_1 + \mu_2 \text{ and } \bar\sigma^2 = \sigma_1^2 + \sigma_2^2.
$$
Gaussian identities
The proof of 1. is a classical exercise of the so-called “complete the square” routine.
$$
\begin{aligned}
p(x)q(x) &\propto \exp\left(-\frac{1}{2\sigma_1^2}(x-\mu_1)^2 - \frac{1}{2\sigma_2^2}(x-\mu_2)^2\right) \;,\\
&= \exp\left(-\frac{1}{2\sigma_1^2}(x^2-2x\mu_1) - \frac{1}{2\sigma_2^2}(x^2-2x\mu_2)\right) \exp\left(-\frac{\mu_1^2}{2\sigma_1^2}-\frac{\mu_2^2}{2\sigma_2^2}\right) \;,\\
&= \exp\left(-\frac{\sigma_1^2 + \sigma_2^2}{2\sigma_1^2\sigma_2^2}\left[x^2 - 2x\frac{\mu_1\sigma_2^2 + \mu_2\sigma_1^2}{\sigma_1^2+\sigma_2^2}\right]\right) \exp\left(-\frac{\mu_1^2}{2\sigma_1^2}-\frac{\mu_2^2}{2\sigma_2^2}\right) \;,\\
&= \exp\left(-\frac{\sigma_1^2 + \sigma_2^2}{2\sigma_1^2\sigma_2^2}\left[x - \frac{\mu_1\sigma_2^2 + \mu_2\sigma_1^2}{\sigma_1^2+\sigma_2^2}\right]^2\right) \exp\left(\frac{\sigma_1^2 + \sigma_2^2}{2\sigma_1^2\sigma_2^2}\frac{(\mu_1\sigma_2^2 + \mu_2\sigma_1^2)^2}{(\sigma_1^2+\sigma_2^2)^2}\right)\exp\left(-\frac{\mu_1^2}{2\sigma_1^2}-\frac{\mu_2^2}{2\sigma_2^2}\right) \;, \\
&= \underbrace{\exp\left(-\frac{\sigma_1^2 + \sigma_2^2}{2\sigma_1^2\sigma_2^2}\left[x - \frac{\mu_1\sigma_2^2 + \mu_2\sigma_1^2}{\sigma_1^2+\sigma_2^2}\right]^2\right)}_{\propto\,\mathcal{N}(x\vert \nu, \tau^2)} \exp\left(\frac{(\mu_1\sigma_2^2 + \mu_2\sigma_1^2)^2}{2\sigma_1^2\sigma_2^2(\sigma_1^2+\sigma_2^2)}\right)\exp\left(-\frac{\mu_1^2}{2\sigma_1^2}-\frac{\mu_2^2}{2\sigma_2^2}\right)\; .
\end{aligned}
$$
The proof of 2. is immediate when realising that the convolution corresponds to summing two independent Gaussian random variables.
We will nonetheless carry on with the brute force approach adopted for the previous proof.
First, observe that:
$$
\begin{aligned}
\int_{t} p(t)q(t)dt &= \frac{1}{2\pi\sigma_1\sigma_2}\exp\left(\frac{(\mu_1\sigma_2^2 + \mu_2\sigma_1^2)^2}{2\sigma_1^2\sigma_2^2(\sigma_1^2+\sigma_2^2)}\right)\exp\left(-\frac{\mu_1^2}{2\sigma_1^2}-\frac{\mu_2^2}{2\sigma_2^2}\right) \int_{t} \exp\left(-\frac{(t-\nu)^2}{2\tau^2}\right)dt \;, \\
&\overset{(i)}{=} \frac{1}{\sqrt{2\pi}\sqrt{\sigma_1^2+ \sigma_2^2}}\exp\left(\frac{(\mu_1\sigma_2^2 + \mu_2\sigma_1^2)^2}{2\sigma_1^2\sigma_2^2(\sigma_1^2+\sigma_2^2)}\right)\exp\left(-\frac{\mu_1^2}{2\sigma_1^2}-\frac{\mu_2^2}{2\sigma_2^2}\right) \;, \\
&= \frac{1}{\sqrt{2\pi}\sqrt{\sigma_1^2+ \sigma_2^2}}\exp\left(\frac{1}{2(\sigma_1^2 + \sigma_2^2)}\left[\frac{(\mu_1\sigma_2^2 + \mu_2\sigma_1^2)^2}{\sigma_1^2\sigma_2^2}-\frac{\mu_1^2}{\sigma_1^2}(\sigma_1^2+\sigma_2^2) - \frac{\mu_2^2}{\sigma_2^2}(\sigma_1^2+\sigma_2^2)\right]\right) \;, \\
&= \frac{1}{\sqrt{2\pi}\sqrt{\sigma_1^2+ \sigma_2^2}}\exp\left(-\frac{1}{2(\sigma_1^2 + \sigma_2^2)}\left(\mu_1-\mu_2\right)^2\right) \;.
\end{aligned}
$$
In $(i)$ we used the fact that $\int_{t} \exp\left(-\frac{(t-\nu)^2}{2\tau^2}\right)dt = \sqrt{2\pi}\tau$.
To obtain the result of the convolution, observe that $q(x-t)=\mathcal{N}(t\vert x-\mu_2, \sigma_2^2)$.
Hence, for the convolution we obtain:
$$
\begin{aligned}
\int_{t} p(t)q(x-t)dt &= \frac{1}{\sqrt{2\pi}\sqrt{\sigma_1^2+ \sigma_2^2}}\exp\left(-\frac{1}{2(\sigma_1^2 + \sigma_2^2)}\left(\mu_1-(x-\mu_2)\right)^2\right) \;, \\
&= \frac{1}{\sqrt{2\pi}\sqrt{\sigma_1^2+ \sigma_2^2}}\exp\left(-\frac{1}{2(\sigma_1^2 + \sigma_2^2)}\left(x-(\mu_1+\mu_2)\right)^2\right) \;, \\
&= \mathcal{N}(x\vert \bar\mu, \bar\sigma^2)\; .
\end{aligned}
$$
$\blacksquare$
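Before moving on, here is a quick numerical sanity check of identities (1) and (2) – a minimal numpy sketch, where the grid, the parameter values and the helper `npdf` are illustrative choices of ours:

```python
import numpy as np

def npdf(x, mu, sigma):
    """Univariate normal density N(x | mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

mu1, s1 = 0.5, 0.8    # parameters of p
mu2, s2 = -1.0, 1.3   # parameters of q
x = np.linspace(-10, 10, 4001)
dx = x[1] - x[0]

# (1): the renormalised product matches N(nu, tau^2).
prod = npdf(x, mu1, s1) * npdf(x, mu2, s2)
prod /= prod.sum() * dx
nu = (mu1 * s2**2 + mu2 * s1**2) / (s1**2 + s2**2)
tau = np.sqrt(s1**2 * s2**2 / (s1**2 + s2**2))
assert np.allclose(prod, npdf(x, nu, tau), atol=1e-5)

# (2): the convolution matches N(mu1 + mu2, s1^2 + s2^2).
conv = np.convolve(npdf(x, mu1, s1), npdf(x, mu2, s2), mode="same") * dx
assert np.allclose(conv, npdf(x, mu1 + mu2, np.sqrt(s1**2 + s2**2)), atol=1e-4)
print("identities (1) and (2) check out numerically")
```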
The Model
Let $p$ be some target distribution that we wish to model through a parametric form $p_\theta$.
For simplicity, we will assume here that $p$ is supported on $\mathbb{R}^d$ for some $d\in\mathbb{N}$.
Let $x_0 \in\mathbb{R}^d$ be a sample from $p$.
At heart, a diffusion probabilistic model is a latent variable model of the form:
$$
p_{\theta}(x_0) = \int_{x_{1:T}} p_\theta(x_{0:T})dx_{1:T}\;,
$$
for some $T\in\mathbb{N}$, and where we used the short-hand $x_{1:T} = (x_1, \ldots, x_T)$.
A salient feature of the diffusion probabilistic model is that the latent $x_{1:T}$ live
in the same space $\mathbb{R}^d$ as $x_0$. Further, a diffusion model assumes some temporal dependencies
in the latent space via so-called forward and backward generative processes.
Forward process
The forward process $x_0 \rightarrow x_1 \rightarrow \ldots \rightarrow x_T$ is defined as a fixed, non-stationary
Markov Chain which simply adds noise gradually. Formally, for $\{\beta_t\}_{t\geq 1}$ a sequence of scalars in $(0, 1]$:
$$
q(x_{1:T}\vert x_0) = \prod_{t=1}^T q_t(x_t\vert x_{t-1})\quad \text{ with } \quad
q_t(x_t\vert x_{t-1}) = \mathcal{N}\left(x_t \vert \sqrt{1-\beta_t}x_{t-1}, \beta_t \mathbf{I}_d\right)\; .
$$
Readers used to latent variable models can already anticipate that $q$ will
be the variational distribution used for learning $p_\theta$.
This approximate posterior $q$ is therefore fixed and has no learnable parameters.
$\quad$ In full generality, the parameters $\{\beta_t\}_t$ are learnable, and could be optimised via re-parametrisation.
We ignore this below and do not treat them as degrees of freedom.
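To make the forward process concrete, here is a minimal numpy sketch that runs the chain $x_0 \rightarrow x_1 \rightarrow \ldots \rightarrow x_T$ on a toy data point; the linear schedule and the helper name `forward_chain` are illustrative choices of ours, not prescribed by the model:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 2e-2, T)            # a common linear schedule, for illustration

def forward_chain(x0, betas, rng):
    """Run the forward process x_0 -> x_1 -> ... -> x_T and return every state."""
    xs = [x0]
    for beta_t in betas:
        x_prev = xs[-1]
        # q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) x_{t-1}, beta_t I)
        xs.append(np.sqrt(1.0 - beta_t) * x_prev
                  + np.sqrt(beta_t) * rng.standard_normal(x_prev.shape))
    return np.stack(xs)                        # shape (T + 1, d)

x0 = rng.uniform(-1.0, 1.0, size=(8,))         # a toy "data point" in R^8
xs = forward_chain(x0, betas, rng)
print(xs[0][:3], xs[-1][:3])                   # x_T is (approximately) standard Gaussian noise
```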
Backward process
The backward process $x_T \rightarrow x_{T-1} \rightarrow \ldots \rightarrow x_0$ is
defined as a Markov Chain with learnable Gaussian dynamics,
with mean $\mu_\theta(x_t, t)$ and covariance matrix $\Sigma_\theta(x_t, t)$. Concretely, with $p_\theta(x_T) = \mathcal{N}(x_T\vert 0, \mathbf{I}_d)$:
$$
p_\theta(x_{0:T}) = p_\theta(x_T)\prod_{t=1}^T p_\theta(x_{t-1}\vert x_t)\quad \text{with} \quad
p_{\theta}(x_{t-1}\vert x_t) = \mathcal{N}(x_{t-1}\vert \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))\; .
$$
One can generate a new sample from $p_\theta$ (hopefully approximating $p$) by sampling $x_T$ from
centered Gaussian noise and then walking down the backward process.
$\qquad \text{Both the forward and backward processes are non-stationary, and we should write } p_{\theta}(x_{t-1}\vert x_t, t)
\text{ instead of}\\$
$ p_{\theta}(x_{t-1}\vert x_t)\text{ -- we do not, in an effort to reduce clutter. Observe, however, that we keep the time dependency explicit in the} \\$
$\text{backward mean and covariance, e.g. } \mu_\theta(x_t, t).$
Useful results
We collect here a few results that will prove useful down the road.
Let $\alpha_t = \prod_{s=1}^t (1-\beta_s)$, with the convention $\alpha_0 = 1$. The forward process can be summarised as follows, for $t\in\{1, \ldots, T\}$:
$$
\tag{3}
q(x_t \vert x_0) = \mathcal{N}(x_t \vert \sqrt{\alpha_t} x_0, (1-\alpha_t)\mathbf{I}_d)\; .
$$
This can be proven by induction. The result is true by definition for $t=1$ (since $\alpha_1 = 1-\beta_1$).
Assume it holds at $t-1$.
$$
\begin{aligned}
q(x_t \vert x_0) &= \int_{x_{t-1}} q(x_t \vert x_{t-1})q(x_{t-1} \vert x_0) dx_{t-1}\;, \\
&= \int_{x_{t-1}} \mathcal{N}(x_t \vert \sqrt{1-\beta_{t}} x_{t-1}, \beta_{t}\mathbf{I}_d)q(x_{t-1} \vert x_0) dx_{t-1}\;, \\
&= \int_{x_{t-1}} \mathcal{N}\left(x_t \vert \sqrt{1-\beta_{t}} x_{t-1}, \beta_{t}\mathbf{I}_d\right)\mathcal{N}\left(x_{t-1} \vert \sqrt{\alpha_{t-1}} x_0, (1-\alpha_{t-1})\mathbf{I}_d\right) dx_{t-1}\;, &(\text{by hyp.})\\
&= \int_{x_{t-1}} \mathcal{N}\left(x_t \vert x_{t-1}, \beta_{t}\mathbf{I}_d\right)\mathcal{N}\left(x_{t-1} \vert \sqrt{1-\beta_{t}}\sqrt{\alpha_{t-1}} x_0, (1-\beta_{t})(1-\alpha_{t-1})\mathbf{I}_d\right) dx_{t-1}\;. &(\text{change of var.})\\
\end{aligned}
$$
By applying (2) we get that:
$$
q(x_t\vert x_0) = \mathcal{N}(x_t \vert \sqrt{1-\beta_{t}}\sqrt{\alpha_{t-1}}\, x_0, \left[\beta_{t} + (1-\beta_{t})(1-\alpha_{t-1})\right]\mathbf{I}_d )\; .
$$
Since $\sqrt{1-\beta_{t}}\sqrt{\alpha_{t-1}} = \sqrt{\alpha_t}$, and given that:
$$
\begin{aligned}
\beta_{t} + (1-\beta_{t})(1-\alpha_{t-1}) &= 1 - (1-\beta_{t})\alpha_{t-1}\;, \\
&= 1-\alpha_t \; ,
\end{aligned}
$$
this concludes the proof of the claimed result.
$\blacksquare$
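As a sanity check of (3), the following sketch simulates the forward chain many times and compares the empirical moments of $x_T$ with the closed form; the schedule, horizon and sample size are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 200
betas = np.linspace(1e-4, 2e-2, T)
alphas = np.cumprod(1.0 - betas)              # alpha_t = prod_{s <= t} (1 - beta_s)

x0 = np.array([1.5])                          # a 1-d example, to keep the statistics simple
n = 200_000

# Simulate the chain up to t = T for n independent runs.
x = np.repeat(x0, n)
for beta_t in betas:
    x = np.sqrt(1.0 - beta_t) * x + np.sqrt(beta_t) * rng.standard_normal(n)

# Compare empirical moments with the closed form (3): N(sqrt(alpha_T) x0, 1 - alpha_T).
print(x.mean(), np.sqrt(alphas[-1]) * x0[0])  # the means should agree
print(x.var(), 1.0 - alphas[-1])              # the variances should agree
```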
The conditional $q(x_{t-1}\vert x_t)$ is intractable, but $q(x_{t-1}\vert x_t, x_0)$ has a closed form! Indeed,
$q(x_{t-1}\vert x_t, x_0) = \mathcal{N}\left(x_{t-1}\vert \mu_{t}, \sigma_t^2\mathbf{I}_d\right)$
where:
$$
\tag{4}
\begin{aligned}
\mu_t &= \frac{\sqrt{1-\beta_t}(1-\alpha_{t-1})}{1-\alpha_t}x_t + \frac{\sqrt{\alpha_{t-1}}\beta_t}{1-\alpha_t}x_0\;, \\
\sigma_t^2 &= \frac{1-\alpha_{t-1}}{1-\alpha_{t}}\beta_t\; .
\end{aligned}
$$
Observe that both $q(x_t\vert x_{t-1}, x_0) = q(x_t\vert x_{t-1})$ and $q(x_{t-1}\vert x_0)$ are Gaussian – hence, by Bayes' rule, so is $q(x_{t-1}\vert x_t, x_0)$.
We can therefore drop normalisation constants below.
$$
\begin{aligned}
q(x_{t-1}\vert x_t, x_0) &\propto q(x_t \vert x_{t-1}, x_0)q(x_{t-1}\vert x_0)\; , &(\text{Bayes rule})\\
&= \mathcal{N}(x_t\vert \sqrt{1-\beta_t}x_{t-1}, \beta_t\mathbf{I}_d) \cdot \mathcal{N}(x_{t-1}\vert \sqrt{\alpha_{t-1}}x_{0}, (1-\alpha_{t-1})\mathbf{I}_d) \;, &(\text{def.})\\
&\propto \exp(-\frac{1}{2\beta_t}(x_t-\sqrt{1-\beta_t}x_{t-1})^2)\exp(-\frac{1}{2(1-\alpha_{t-1})}(x_{t-1}-\sqrt{\alpha_{t-1}}x_0)^2)\;, \\
&\propto \exp(-\frac{1-\beta_t}{2\beta_t}(x_{t-1}-x_t/\sqrt{1-\beta_t})^2)\exp(-\frac{1}{2(1-\alpha_{t-1})}(x_{t-1}-\sqrt{\alpha_{t-1}}x_0)^2)\;, \\
&\propto \mathcal{N}(x_{t-1}\vert \mu_t, \sigma_t^2\mathbf{I}_d) \;,
\end{aligned}
$$
where, by (1):
$$
\begin{aligned}
\mu_t &= \frac{(1-\alpha_{t-1})/\sqrt{1-\beta_t}}{{\beta_t/(1-\beta_t) + 1 - \alpha_{t-1}}}x_t + \frac{\sqrt{\alpha_{t-1}}\beta_{t}/(1-\beta_t)}{\beta_t/(1-\beta_t) + 1 - \alpha_{t-1}}x_0\;,\\
&= \frac{(1-\alpha_{t-1})\sqrt{1-\beta_t}}{\beta_t + (1-\beta_t)(1 - \alpha_{t-1})}x_t + \frac{\sqrt{\alpha_{t-1}}\beta_{t}}{\beta_t + (1-\beta_t)(1 - \alpha_{t-1})}x_0\;,\\
&= \frac{(1-\alpha_{t-1})\sqrt{1-\beta_t}}{1-\alpha_t}x_t + \frac{\sqrt{\alpha_{t-1}}\beta_{t}}{1-\alpha_t}x_0\;,
\end{aligned}
$$
and
$$
\begin{aligned}
\sigma_t^2 &= \frac{\beta_t(1-\alpha_{t-1})/(1-\beta_t)}{\beta_t/(1-\beta_t) + 1 - \alpha_{t-1}}\;,\\
&= \beta_t\frac{1-\alpha_{t-1}}{\beta_t + (1-\beta_t)(1 - \alpha_{t-1})}\;,\\
&=\beta_t(1-\alpha_{t-1})/(1-\alpha_t) \; .
\end{aligned}
$$
$\blacksquare$
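For illustration, a small helper computing the parameters of $q(x_{t-1}\vert x_t, x_0)$ from (4) could look as follows (a sketch; the function name and the handling of $t=1$ via the convention $\alpha_0=1$ are ours):

```python
import numpy as np

def forward_posterior(x_t, x0, t, betas, alphas):
    """Mean and variance of q(x_{t-1} | x_t, x_0), as given by (4). t is 1-indexed."""
    beta_t = betas[t - 1]
    alpha_t = alphas[t - 1]
    alpha_prev = alphas[t - 2] if t > 1 else 1.0   # convention alpha_0 = 1
    mean = (np.sqrt(1.0 - beta_t) * (1.0 - alpha_prev) / (1.0 - alpha_t) * x_t
            + np.sqrt(alpha_prev) * beta_t / (1.0 - alpha_t) * x0)
    var = (1.0 - alpha_prev) / (1.0 - alpha_t) * beta_t
    # Note: at t = 1 the posterior is degenerate (variance 0), since x_0 is given.
    return mean, var

T = 1000
betas = np.linspace(1e-4, 2e-2, T)
alphas = np.cumprod(1.0 - betas)

x0 = np.zeros(4)
x_t = np.ones(4)
print(forward_posterior(x_t, x0, t=500, betas=betas, alphas=alphas))
```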
Learning
Ideally, the model would be learned via maximum-likelihood; given samples $\{x_0^1, \ldots, x_0^n\}$ we want to compute
$$
\theta_n \in \argmax_\theta \sum_{i=1}^n \log p_\theta(x_0^i)\;.
$$
Sadly, the log-likelihood $\log p_\theta$ is intractable. Instead, we will maximise a variational
lower-bound.
Variational bound
The variational lower-bound is classically obtained by introducing a variational distribution and using Jensen’s inequality.
Let’s detail each step along the way:
$$
\begin{aligned}
\log p_\theta(x_0) &= \log \int_{x_{1:T}} p_\theta(x_{0:T})d x_{1:T} \;, \\
&= \log \int_{x_{1:T}} \frac{p_\theta(x_{0:T})}{q(x_{1:T}\vert x_0)}q(x_{1:T}\vert x_0)d x_{1:T}\;, \\
&\geq \int_{x_{1:T}} q(x_{1:T}\vert x_0)\log\frac{p_\theta(x_{0:T})}{q(x_{1:T}\vert x_0)}d x_{1:T}\;, &(\text{concavity of }\log)\\
&= \int_{x_{1:T}} q(x_{1:T}\vert x_0) \log \frac{p_\theta(x_T)\prod_{t=1}^T p_\theta(x_{t-1}\vert x_t)}{\prod_{t=1}^Tq(x_t\vert x_{t-1})}dx_{1:T}\;,\\
&= \mathbb{E}_{q(\cdot\vert x_0)}\left[\log p_\theta(x_T) + \sum_{t=1}^T \log\frac{p_\theta(x_{t-1}\vert x_t)}{q(x_t\vert x_{t-1})}\right]\; .
\end{aligned}
$$
At this point, we are ready to use (4) to introduce the reverse conditional $q(x_{t-1}\vert x_t, x_0)$.
It would of course be simpler to introduce $q(x_{t-1}\vert x_t)$ instead; however, we do not have a closed form for it.
Observe that by the Markov property and Bayes' rule we have $q(x_t\vert x_{t-1}) = q(x_t\vert x_{t-1}, x_0) = q(x_{t-1}\vert x_t, x_0)q(x_t\vert x_0) / q(x_{t-1}\vert x_0)$.
Hence:
$$
\begin{aligned}
\log p_\theta(x_0) &\geq \mathbb{E}_{q(\cdot\vert x_0)}\left[\log p_\theta(x_T) + \sum_{t=1}^T \log\frac{p_\theta(x_{t-1}\vert x_t)}{q(x_{t-1}\vert x_{t}, x_0)} + \sum_{t=1}^T \log\frac{q(x_{t-1}\vert x_0)}{q(x_{t}\vert x_0)}\right]\; , \\
&= \mathbb{E}_{q(\cdot\vert x_0)}\left[\sum_{t=1}^T \log\frac{p_\theta(x_{t-1}\vert x_t)}{q(x_{t-1}\vert x_{t}, x_0)} + \log\frac{p_\theta(x_T)}{q(x_{T}\vert x_0)}\right]\; , &(\text{telescoping sum})\\
&= \mathbb{E}_{q(\cdot\vert x_0)}\left[\sum_{t=1}^T \mathbb{E}_{x_{t-1}\sim q(x_{t-1}\vert x_{t}, x_0)}\left[\log\frac{p_\theta(x_{t-1}\vert x_t)}{q(x_{t-1}\vert x_{t}, x_0)}\right]\right] + \mathbb{E}_{x_T\sim q(\cdot\vert x_0)}\left[\log\frac{p_\theta(x_T)}{q(x_T\vert x_0)}\right]\;. &(\text{tower rule})
\end{aligned}
$$
proving the following variational lower-bound:
$\qquad\qquad\qquad\qquad\qquad\qquad \text{For any } x_0 \text{ let:}\\$
$$
\tag{5}
\mathcal{VB}_\theta(x_0) := -\,\mathbb{E}_{q(\cdot\vert x_0)}\left[\sum_{t=1}^T \text{KL}\left(q(x_{t-1}\vert x_{t}, x_0)\,\vert\vert\, p_\theta(x_{t-1}\vert x_t)\right) + \text{KL}\left(q(x_T\vert x_0)\,\vert\vert\, p_\theta(x_T)\right)\right]\;.
$$
$\text{Then } \log p_\theta(x_0) \geq \mathcal{VB}_\theta(x_0).$
Variational lower-bound
Observe how all terms making up $\mathcal{VB}_\theta(x_0)$ involve KL-divergences between Gaussian distributions – hence they can be computed in closed form –
while the expectation can be approximated via Monte-Carlo sampling.
Since it lower-bounds our true objective, we now switch our attention to maximising $\mathcal{VB}_\theta(x_0)$
as a proxy.
Observe that while maximising $\mathcal{VB}_\theta(x_0)$, we can drop the term
$\text{KL}\left(q(x_T\vert x_0)\,\vert\vert\, p_\theta(x_T)\right)$, as it does not
involve any learnable parameter. We are now ready to state the actual objective we optimise
to train a diffusion model. Before that, we follow [3]
and fix the backward
covariance to $\Sigma_\theta(x_t, t) = \tilde\sigma_t^2\mathbf{I}_d$, where $\tilde\sigma_t^2$ is a fixed (non-learnable) parameter.
Our new learning objective now reads:
$$
\tag{6}
\hat\theta_n \in \arg\min_\theta \sum_{i=1}^n \mathbb{E}_{q(\cdot\vert x_0^i)}\left[\sum_{t=1}^T \frac{1}{2\tilde\sigma_t^2}\|\mu_t^i - \mu_\theta(x_t^i, t)\|_2^2\right]\; ,
$$
where $\mu_t$ is defined in (4) – $\mu_t^i$ being the notation we use for the $\text{i}^\text{th}$ sample, built from $x_0^i$.
This result is a direct consequence of the facts that 1) we dropped the parameter-free term
$\text{KL}\left(q(x_T\vert x_0)\,\vert\vert\, p_\theta(x_T)\right)$ – recall that we fixed
$p_\theta(x_T)=\mathcal{N}(x_T\vert 0, \mathbf{I}_d)$ – and 2) the closed-form expression of the Kullback-Leibler
divergence between two Gaussians with diagonal covariance matrices;
$$
\begin{aligned}
\text{KL}\left(q(x_{t-1}\vert x_{t}, x_0)\,\vert\vert\, p_\theta(x_{t-1}\vert x_t)\right) &=
\text{KL}\left(\mathcal{N}(\mu_t, \sigma_t^2\mathbf{I}_d) \,\vert\vert\, \mathcal{N}(\mu_\theta(x_t, t), \tilde\sigma_t^2\mathbf{I}_d) \right)\;, \\
&= \square \, + \frac{1}{2\tilde\sigma_t^2}\| \mu_t - \mu_\theta(x_t, t) \|_2^2\; ,
\end{aligned}
$$
where $\square$ is a constant that does not depend on $\theta$.
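As an illustration of this closed form, here is a numpy sketch of the KL between two isotropic Gaussians, isolating the only term that depends on $\theta$; the function name and toy values are ours:

```python
import numpy as np

def kl_diag_gaussians(mu1, var1, mu2, var2):
    """KL( N(mu1, var1*I) || N(mu2, var2*I) ) for isotropic covariances in d dimensions."""
    d = mu1.shape[-1]
    trace_term = d * var1 / var2
    log_det_term = d * np.log(var2 / var1)
    quad_term = np.sum((mu1 - mu2) ** 2, axis=-1) / var2
    return 0.5 * (trace_term + log_det_term - d + quad_term)

# Only the quadratic term depends on mu2 (i.e. on theta when mu2 = mu_theta(x_t, t)):
# 0.5 * ||mu1 - mu2||^2 / var2, matching the expression above with var2 = sigma_tilde_t^2.
mu_q, var_q = np.array([0.3, -0.1]), 0.05      # stands for q(x_{t-1} | x_t, x_0)
mu_p, var_p = np.array([0.25, -0.05]), 0.05    # stands for p_theta(x_{t-1} | x_t)
print(kl_diag_gaussians(mu_q, var_q, mu_p, var_p))
```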
Re-parametrisation
Recall the expression for $\mu_t$ established in (4):
$$
\mu_t = \frac{\sqrt{1-\beta_t}(1-\alpha_{t-1})}{1-\alpha_t}x_t + \frac{\sqrt{\alpha_{t-1}}\beta_t}{1-\alpha_t}x_0\; .
$$
This is a valid target to regress with our model $\mu_\theta(x_t, t)$ of the backward mean.
However, together with (5), it hints that we could slightly change what we model in order to ease learning.
Indeed, observe that from (3) one can re-parametrise $x_t$ as:
$$
\begin{aligned}
x_t &= \sqrt{\alpha_t}x_0 + \sqrt{1-\alpha_t}\varepsilon\;,\\
\tag{7}
\Longleftrightarrow x_0 &= \frac{x_t}{\sqrt{\alpha_t}} - \frac{\sqrt{1-\alpha_t}}{\sqrt{\alpha_t}}\varepsilon \;,
\end{aligned}
$$
for any $t\geq 1$, with $\varepsilon \sim \mathcal{N}(0, \mathbf{I}_d)$. Plugging this back into (4), we obtain:
$$
\mu_t = \frac{1}{\sqrt{1-\beta_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\alpha_t}}\varepsilon\right)\;.
$$
This is obtained by direct computation;
$$
\begin{aligned}
\mu_t &= \frac{\sqrt{1-\beta_t}(1-\alpha_{t-1})}{1-\alpha_t}x_t + \frac{\sqrt{\alpha_{t-1}}\beta_t}{1-\alpha_t}x_0\; ,\\
&= \frac{\sqrt{1-\beta_t}(1-\alpha_{t-1})}{1-\alpha_t}x_t + \frac{\sqrt{\alpha_{t-1}}\beta_t}{1-\alpha_t}\left(\frac{x_t}{\sqrt{\alpha_t}} - \frac{\sqrt{1-\alpha_t}}{\sqrt{\alpha_t}}\varepsilon\right)\;, &(\text{by (7)})\\
&= \frac{\sqrt{1-\beta_t}(1-\alpha_{t-1})}{1-\alpha_t}x_t + \frac{\beta_t}{(1-\alpha_t)\sqrt{1-\beta_t}}x_t - \frac{\beta_t}{\sqrt{1-\beta_t}\sqrt{1-\alpha_t}}\varepsilon\;, \\
&= \frac{(1-\beta_t)(1-\alpha_{t-1}) + \beta_t}{(1-\alpha_t)\sqrt{1-\beta_t}}x_t - \frac{\beta_t}{\sqrt{1-\beta_t}\sqrt{1-\alpha_t}}\varepsilon\;, \\
&= \frac{1}{\sqrt{1-\beta_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\alpha_t}}\varepsilon\right)\;.
\end{aligned}
$$
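A quick numerical check that the two expressions of $\mu_t$ coincide (a sketch; the schedule, dimension and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
T = 1000
betas = np.linspace(1e-4, 2e-2, T)
alphas = np.cumprod(1.0 - betas)

t = 400                                       # any t in {2, ..., T}
beta_t, alpha_t, alpha_prev = betas[t - 1], alphas[t - 1], alphas[t - 2]

x0 = rng.standard_normal(16)
eps = rng.standard_normal(16)
x_t = np.sqrt(alpha_t) * x0 + np.sqrt(1.0 - alpha_t) * eps           # re-parametrisation (7)

mu_from_4 = (np.sqrt(1.0 - beta_t) * (1.0 - alpha_prev) / (1.0 - alpha_t) * x_t
             + np.sqrt(alpha_prev) * beta_t / (1.0 - alpha_t) * x0)  # expression (4)
mu_reparam = (x_t - beta_t / np.sqrt(1.0 - alpha_t) * eps) / np.sqrt(1.0 - beta_t)

assert np.allclose(mu_from_4, mu_reparam)
print("the two expressions of mu_t coincide")
```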
Observe how the dependency on $x_0$ has been replaced by a dependency on the sampled noise $\varepsilon$.
This prompts adding some structure to $\mu_\theta(x_t, t)$. Instead of regressing
$\mu_t$ directly, it now makes sense to write:
$$
\mu_\theta(x_t, t) = \frac{1}{\sqrt{1-\beta_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\alpha_t}}\varepsilon_\theta(x_t, t)\right)\;,
$$
where $\varepsilon_\theta:\mathbb{R}^d\times\{1, \ldots, T\}\mapsto\mathbb{R}^d$, thanks to (6), can be learned through
$$
\begin{aligned}
\hat\theta_n \in &\arg\min_\theta \sum_{i=1}^n \mathbb{E}_{q(\cdot\vert x_0^i)}\left[\sum_{t=1}^T \frac{\beta_t^2}{2\tilde\sigma_t^2(1-\beta_t)(1-\alpha_t)}\|\varepsilon^i - \varepsilon_\theta(x_t^i, t)\|_2^2\right]\; ,\\
\tag{8}
&= \arg\min_\theta \sum_{i=1}^n \mathbb{E}_{\varepsilon^i\sim\mathcal{N}(0, \mathbf{I}_d)}\left[\sum_{t=1}^T \frac{\beta_t^2}{2\tilde\sigma_t^2(1-\beta_t)(1-\alpha_t)}\|\varepsilon^i - \varepsilon_\theta(\sqrt{\alpha_t}x_0^i + \sqrt{1-\alpha_t}\varepsilon^i, t)\|_2^2\right]\; .\\
\end{aligned}
$$
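As an illustration, a per-sample version of objective (8) could be sketched as follows; `eps_model` is a placeholder for the learnable network $\varepsilon_\theta$, and $\tilde\sigma_t^2=\beta_t$ is just one common choice:

```python
import numpy as np

def ddpm_loss_single(eps_model, x0, t, betas, alphas, sigma_tilde2, rng):
    """Weighted epsilon-prediction loss (8) for one data point and one timestep t (1-indexed)."""
    beta_t, alpha_t = betas[t - 1], alphas[t - 1]
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_t) * x0 + np.sqrt(1.0 - alpha_t) * eps        # sample from q(x_t | x_0)
    weight = beta_t**2 / (2.0 * sigma_tilde2[t - 1] * (1.0 - beta_t) * (1.0 - alpha_t))
    return weight * np.sum((eps - eps_model(x_t, t)) ** 2)

# Toy usage with a dummy "network" that always predicts zero noise.
rng = np.random.default_rng(3)
T = 1000
betas = np.linspace(1e-4, 2e-2, T)
alphas = np.cumprod(1.0 - betas)
sigma_tilde2 = betas                                                   # e.g. sigma_tilde_t^2 = beta_t
print(ddpm_loss_single(lambda x, t: np.zeros_like(x), np.ones(8), t=100,
                       betas=betas, alphas=alphas, sigma_tilde2=sigma_tilde2, rng=rng))
```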
The intuition behind the method now emerges.
In short, we are training a model to reconstruct the noise $\varepsilon^i$ that was
used by the forward process to create $x_t^i$ from $x_0^i$.
For typical increasing schedules $\{\beta_t\}_t$ and common choices of $\tilde\sigma_t^2$ (e.g. $\tilde\sigma_t^2=\beta_t$ or $\tilde\sigma_t^2=\sigma_t^2$), the weights $\frac{\beta_t^2}{2\tilde\sigma_t^2(1-\beta_t)(1-\alpha_t)}$ prioritise
reconstruction at early stages (small $t$) over later stages ($t\approx T$).
$\quad$
[3]
reports that, in an image generation context, the weighting has little effect on the final performance.
Wrapping up
We are now ready to wrap-up this deep dive into the probabilistic foundations of diffusion models.
A training objective can be derived by building a variational lower-bound on the maximum likelihood.
By re-parametrisation, learning a model to inverse some Gaussian noise arises as a valid, efficient design.
A stochastic gradient implementation of the learning protocol is detailed below, for illustration.
$\textbf{init } \text{data } \{x_0^i\}_{i=1}^n, \text{ horizon } T, \text{ sequence } \{\beta_{t}\}_{t=1}^T, \text{ variances } \{\tilde\sigma_{t}^2\}_{t=1}^T, \text{ parameter } \theta, \text{ learning rate } \gamma\;.\\$
$\textbf{while } \text{not converged}\\$
$\quad\text{sample } i\in\{1\ldots n\},\\$
$\quad\text{sample } t\in\{1, \ldots, T\},\\$
$\quad\text{sample } \varepsilon\sim\mathcal{N}(0, \mathbf{I}_d),\\$
$\quad \theta\leftarrow \theta - \gamma \frac{\beta_t^2}{2\tilde\sigma_t^2(1-\beta_t)(1-\alpha_t)}\nabla_\theta\|\varepsilon - \varepsilon_\theta(\sqrt{\alpha_t}x_0^i + \sqrt{1-\alpha_t}\varepsilon, t)\|_2^2\\$
$\textbf{end while}\\$
$\texttt{Training}$
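For concreteness, here is a runnable numpy sketch mirroring the training loop above on toy data; to keep it self-contained we use a deliberately crude, time-independent linear model for $\varepsilon_\theta$ (a stand-in for the neural network one would use in practice), and $\tilde\sigma_t^2=\beta_t$:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy data, schedule, and a simple linear model eps_theta(x) = W @ x (ignoring t).
d, n = 8, 256
data = rng.standard_normal((n, d))
T = 200
betas = np.linspace(1e-4, 2e-2, T)
alphas = np.cumprod(1.0 - betas)
sigma_tilde2 = betas                      # fixed backward variances
W = np.zeros((d, d))                      # theta
gamma = 1e-3                              # learning rate

for step in range(5_000):
    i = rng.integers(n)                   # sample a data point
    t = rng.integers(1, T + 1)            # sample a timestep
    eps = rng.standard_normal(d)          # sample the noise
    beta_t, alpha_t = betas[t - 1], alphas[t - 1]
    x_t = np.sqrt(alpha_t) * data[i] + np.sqrt(1.0 - alpha_t) * eps
    weight = beta_t**2 / (2.0 * sigma_tilde2[t - 1] * (1.0 - beta_t) * (1.0 - alpha_t))
    residual = eps - W @ x_t              # eps - eps_theta(x_t) (our toy model ignores t)
    grad_W = -2.0 * np.outer(residual, x_t)   # gradient of ||residual||^2 w.r.t. W
    W -= gamma * weight * grad_W          # stochastic gradient step

print("residual norm on the last minibatch:", np.linalg.norm(residual))
```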
Below we also detail the protocol for sampling from $p_\theta$, by walking down the backward process.
It follows directly from the chosen parametrisation of $\mu_\theta$, adding the backward noise $\tilde\sigma_t z_t$ at each step.
$\textbf{init } \text{ horizon } T, \text{ sequence } \{\beta_{t}\}_{t=1}^T, \text{ variances } \{\tilde\sigma_{t}^2\}_{t=1}^T, \text{ trained parameter } \theta\;.\\$
$\text{sample } x_T\sim\mathcal{N}(0, \mathbf{I}_d)\\$
$\textbf{for } t=T, \ldots, 1\\$
$\quad\text{sample } z_t\sim\mathcal{N}(0, \mathbf{I}_d) \text{ if } t>1, \text{ else set } z_t=0,\\$
$$
x_{t-1} = \frac{1}{\sqrt{1-\beta_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\alpha_t}}\varepsilon_\theta(x_t, t)\right) + \tilde\sigma_t z_t\;.
$$
$\textbf{end for}\\$
$\textbf{return } x_0$
$\texttt{Sampling}$
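And a matching numpy sketch of the sampling loop; `eps_model` again stands in for a trained $\varepsilon_\theta$ (the dummy below just predicts zero noise):

```python
import numpy as np

def sample(eps_model, betas, alphas, sigma_tilde2, d, rng):
    """Walk down the backward process, following the parametrisation of mu_theta above."""
    x = rng.standard_normal(d)                              # x_T ~ N(0, I)
    for t in range(len(betas), 0, -1):                      # t = T, ..., 1
        beta_t, alpha_t = betas[t - 1], alphas[t - 1]
        mean = (x - beta_t / np.sqrt(1.0 - alpha_t) * eps_model(x, t)) / np.sqrt(1.0 - beta_t)
        noise = rng.standard_normal(d) if t > 1 else 0.0    # no noise on the last step
        x = mean + np.sqrt(sigma_tilde2[t - 1]) * noise
    return x

rng = np.random.default_rng(5)
T = 200
betas = np.linspace(1e-4, 2e-2, T)
alphas = np.cumprod(1.0 - betas)
print(sample(lambda x, t: np.zeros_like(x), betas, alphas, sigma_tilde2=betas, d=4, rng=rng))
```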
References
[1] Sohl-Dickstein et al., 2015. Deep Unsupervised Learning using Nonequilibrium Thermodynamics.
[2] Song & Ermon, 2019. Generative Modeling by Estimating Gradients of the Data Distribution.
[3] Ho et al., 2020. Denoising Diffusion Probabilistic Models.