(Fixed) Forward diffusion process that gradually adds noise to input
(Learnt / Generative) Reverse denoising process that learns to generate data by denoising
The goal of a diffusion model is to learn the reverse denoising process to iteratively undo the forward process
In this way, the reverse process appears as if it is generating new data from random noise!
Assumptions:
Ancestral sampling: the data distribution can be represented as a series of increasingly noisy versions, with the original data being the "clean" version. This allows the model to learn the process of progressively denoising the data.
Gaussian noise: The noise added during the forward diffusion is generally assumed to be Gaussian, with a mean of 0 and a gradually increasing variance
Markov property: forward diffusion process is assumed to follow a Markov property, meaning the transition probability at any step depends only on the previous step and not on the entire history.
Forward Process / diffusion process
Goal
Gradually add Gaussian noise to the data via a Markov chain
Markov chain or Markov process is a stochastic process describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. The distribution at a particular time step only depends on the sample from the immediate previous step
Why Markov Chain can be used?
The forward process is designed such that each image $x_t$ is generated by adding a small amount of noise to the previous image $x_{t-1}$. In other words, $x_t \sim q(x_t \mid x_{t-1})$, meaning that this process only depends on the immediately preceding state and not on any earlier states. This fully aligns with the definition of a Markov chain: "the distribution of the current state depends only on the previous state." Therefore, it is appropriate to model this process as a Markov chain, since the forward diffusion is inherently a locally dependent stochastic process; by design, it is Markovian.
Why using Markov Chain?
Simplified modeling and mathematical derivation: Instead of modeling complex global noise all at once, the problem is broken down into a sequence of small incremental changes. Each step only needs to learn how to “add a small amount of noise,” making the modeling and training process more tractable. Mathematically, this also facilitates the derivation of a complete probabilistic model, enabling the use of tools such as KL divergence for optimization.
Facilitates the derivation and sampling of the reverse process: The denoising process can be constructed step by step in reverse, with each step theoretically grounded. At each stage, the denoising prediction only needs to rely on the current noisy state, without requiring knowledge of the full noise trajectory.
Why using Gaussian noise?
The Gaussian distribution is the most fundamental, simple, and controllable distribution among continuous probability distributions. If Gaussian noise is continuously added to an image, the distribution of the image will eventually converge to the standard normal distribution N(0,I). This gives the forward process a clear and simple “endpoint.” In other words, using Gaussian noise allows us to know exactly what the process will converge to—pure noise—and this final distribution is easy to sample from.
The sum of Gaussian noises is still Gaussian, which makes mathematical derivations more tractable. Since the process remains a composition of Gaussian distributions over multiple steps, we can directly derive the probability distribution between any two time steps—for example, the distribution of noise at step 100 given the original image.
Gaussian noise is highly compatible with the denoising task: In image processing, “Gaussian denoising” is a well-established problem, and many models have been developed to handle it. Therefore, training a model to predict how to recover an image from Gaussian noise is both practically feasible and relatively easy to converge.
The Gaussian distribution is differentiable and has closed-form expressions: For tasks such as taking derivatives, performing maximum likelihood estimation, and optimizing loss functions, the Gaussian distribution is highly convenient. Its probability density function (PDF) and KL divergence have analytical closed-form solutions, making computation efficient and stable.
Get the approximate posterior $q(x_{1:T} \mid x_0)$
$$q(x_{1:T} \mid x_0) := \prod_{t=1}^{T} q(x_t \mid x_{t-1})$$
Given $x_0$, this is the probability of generating the sequence $x_{1:T} = \{x_1, x_2, \ldots, x_T\}$
According to Markov Chain and the chain rule of probability
Each transition is parameterised as a diagonal Gaussian
diagonal Gaussian
Each dimension is independent (since the covariances are zero);
Each dimension has its own variance $\sigma_i^2$;
There is no linear correlation between different dimensions.
Why diagonal Gaussian
Simplified computation: The covariance matrix is diagonal, making sampling and likelihood calculation easier.
High computational efficiency: No need to store or manipulate complex matrices.
Sufficient flexibility: For many tasks, the independence assumption is enough to generate high-quality samples.
More stable training.
By adding Gaussian noise, the transition probability at each step, $q(x_t \mid x_{t-1})$, can be defined as a conditional Gaussian distribution:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\big)$$
Variance (schedule): $\beta_1, \ldots, \beta_T$
$\beta_t$ are hyperparameters that follow a fixed schedule
noise schedule: how much noise is added at each step
variance matrix: $\beta_t I$ (assume the noise is isotropic, i.e., it has the same variance in every dimension)
$\beta$ increases with time: $\beta_1 < \beta_2 < \ldots < \beta_T$
$\{\beta_t \in (0,1)\}_{t=1}^{T}$ → $1-\beta_t \in (0,1)$ → the mean of each Gaussian is pulled closer to 0 as $t$ grows
Mean: $\sqrt{1-\beta_t}\, x_{t-1}$
Use the data from the previous step, scaled by $\sqrt{1-\beta_t}$, as the mean.
T : timestep
$$\lim_{T \to \infty} q(x_T \mid x_0) = \mathcal{N}(0, I)$$
In practice, T is in the thousands; since T is set to be large, each $\beta_t$ is set to be very small
Why small $\beta_t$? So that each reverse step is not too difficult to learn
As the time step increases, more of the original input's features are destroyed
As $T \to \infty$: pure random noise, an isotropic Gaussian, $q(x_T \mid x_0) = \mathcal{N}(0, I)$
Tricks for efficient training
The trick shows that you can express $x_t$ directly in terms of $x_0$, i.e. you can jump from the initial image straight to its noisy version at any given timestep → this greatly simplifies the math
the form of $q(x_t \mid x_0)$ can be derived recursively through repeated application of the reparameterization trick.
Suppose that we have access to 2T random noise variables $\{\epsilon_t^*, \epsilon_t\}_{t=0}^{T} \overset{iid}{\sim} \mathcal{N}(\epsilon; 0, I)$. Then, for an arbitrary sample $x_t \sim q(x_t \mid x_0)$, we can rewrite it as:
$$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon_0 \ \sim\ \mathcal{N}\big(x_t;\ \sqrt{\bar\alpha_t}\, x_0,\ (1-\bar\alpha_t) I\big), \qquad \alpha_t := 1-\beta_t,\quad \bar\alpha_t := \prod_{i=1}^{t} \alpha_i$$
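As a concrete illustration, here is a minimal sketch of this one-shot noising in PyTorch (the linear schedule endpoints 1e-4 and 0.02 are a common conventional choice, assumed here; `q_sample` is a hypothetical helper name):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)       # linear noise schedule (assumed)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)   # \bar{alpha}_t = prod_{i<=t} alpha_i

def q_sample(x0, t, noise=None):
    """Sample x_t ~ q(x_t | x_0) directly, without iterating over t steps."""
    if noise is None:
        noise = torch.randn_like(x0)        # epsilon ~ N(0, I)
    ab = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast over dims
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise
```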
Reverse: start from an image of pure noise and denoise it step by step until we arrive back at a clean image
learn the reverse denoising process to iteratively undo the forward process
The reverse process appears as if it is generating new data from random noise
Diffused Data Distribution $q(x_t)$
In generation, we follow:
Sample $x_T \sim \mathcal{N}(x_T; 0, I)$
Iteratively sample $x_{t-1} \sim q(x_{t-1} \mid x_t)$
$q(x_{t-1} \mid x_t)$ is unknown
the distribution of $x_{t-1}$ conditioned on $x_t$ is hard to find because some terms in the math are intractable to compute: you would need to know the exact form of the data distribution
Approximate the reverse process / learn to calculate the q(xt−1∣xt)
For small enough forward steps, the reverse step $q(x_{t-1} \mid x_t)$ can also be approximated as a Gaussian distribution (a result derived from stochastic differential equations), therefore we have:
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big)$$
This Gaussian (defined by its mean and variance) is learned by a neural network
Optimisation
ELBO: Evidence Lower Bound
Objective: recover the likelihood $p(x)$ of our observed data
Method 1
obtain the distribution of observed data $x$ by explicitly marginalizing out the latent variable $z$
$$p(x) = \int p(x, z)\, dz$$
Directly computing and maximizing the likelihood $p(x)$ is difficult because it involves integrating out all latent variables $z$ in the equation above, which is intractable for complex models.
$$\begin{aligned}
\log p(x) &= \log \int p(x, z)\, dz \\
&= \log \int \frac{p(x, z)\, q_\phi(z \mid x)}{q_\phi(z \mid x)}\, dz && \text{(multiply by } 1 = \tfrac{q_\phi(z \mid x)}{q_\phi(z \mid x)}\text{)} \\
&= \log \mathbb{E}_{q_\phi(z \mid x)}\!\left[\frac{p(x, z)}{q_\phi(z \mid x)}\right] && \text{(definition of expectation)} \\
&\ge \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p(x, z)}{q_\phi(z \mid x)}\right] && \text{(Jensen's inequality)}
\end{aligned}$$
However, this does not supply us much useful information about what is actually going on under the hood; crucially,
this proof gives no intuition on exactly why the ELBO is actually a lower bound of the evidence, as Jensen's inequality handwaves it away.
Furthermore, simply knowing that the ELBO is truly a lower bound of the evidence does not tell us why we want to maximize it as an objective, i.e. why optimizing the ELBO is an appropriate objective at all.
Method 2
obtain the distribution of observed data $x$ by the chain rule of probability
$$p(x) = \frac{p(x, z)}{p(z \mid x)}$$
Directly computing and maximizing the likelihood $p(x)$ is difficult because it requires access to the ground truth latent encoder $p(z \mid x)$ in the equation above.
better understand the relationship between the evidence and the ELBO
$$\begin{aligned}
\log p(x) &= \log p(x) \int q_\phi(z \mid x)\, dz && \text{(multiply by } 1 = \textstyle\int q_\phi(z \mid x)\, dz\text{)} \\
&= \int q_\phi(z \mid x)\, \log p(x)\, dz && \text{(bring evidence into integral)} \\
&= \mathbb{E}_{q_\phi(z \mid x)}[\log p(x)] && \text{(definition of expectation)} \\
&= \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p(x, z)}{p(z \mid x)}\right] && \text{(chain rule of probability)} \\
&= \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p(x, z)\, q_\phi(z \mid x)}{p(z \mid x)\, q_\phi(z \mid x)}\right] && \text{(multiply by } 1 = \tfrac{q_\phi(z \mid x)}{q_\phi(z \mid x)}\text{)} \\
&= \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p(x, z)}{q_\phi(z \mid x)}\right] + \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{q_\phi(z \mid x)}{p(z \mid x)}\right] && \text{(split the expectation)} \\
&= \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p(x, z)}{q_\phi(z \mid x)}\right] + D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z \mid x)\big) && \text{(definition of KL divergence)} \\
&\ge \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p(x, z)}{q_\phi(z \mid x)}\right] && \text{(KL divergence is always} \ge 0\text{)}
\end{aligned}$$
evidence is equal to the ELBO plus the KL Divergence (which is non-negative) between the approximate posterior qϕ(z∣x) and the true posterior p(z∣x). → the value of the ELBO can never exceed the evidence.
why we seek to maximize the ELBO
we want to optimize the parameters of our variational posterior $q_\phi(z \mid x)$ to exactly match the true posterior distribution $p(z \mid x)$, which is achieved by minimizing their KL divergence (ideally to zero). Unfortunately, it is intractable to minimize this KL divergence directly, as we do not have access to the ground truth $p(z \mid x)$ distribution.
notice that on the left hand side of the equation above, the likelihood of our data (and therefore our evidence term $p(x)$) is always a constant with respect to $\phi$, as it is computed by marginalizing out all latents $z$ from the joint distribution $p(x, z)$ and does not depend on $\phi$ whatsoever.
Since the ELBO and KL Divergence terms sum up to a constant, any maximization of the ELBO term with respect to ϕ necessarily invokes an equal minimization of the KL Divergence term. Thus, the ELBO can be maximized as a proxy for learning how to perfectly model the true latent posterior distribution; the more we optimize the ELBO, the closer our approximate posterior gets to the true posterior.
Additionally, once trained, the ELBO can be used to estimate the likelihood of observed or generated data as well, since it is learned to approximate the model evidence log p(x).
$q_\phi(z \mid x)$
is a flexible approximate variational distribution with parameters ϕ that we seek to optimize.
can be thought of as a parameterizable model that is learned to estimate the true distribution over latent variables for given observations x
it seeks to approximate true posterior p(z∣x)
$$\mathrm{ELBO} = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p(x, z)}{q_\phi(z \mid x)}\right]$$
by tuning the parameters $\phi$ to maximize the ELBO, we increase the lower bound and gain access to components that can be used to model the true data distribution and sample from it, thus learning a generative model.
ELBO becomes a proxy objective with which to optimize a latent variable model;
Variational Autoencoders
$$\begin{aligned}
\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p(x, z)}{q_\phi(z \mid x)}\right] &= \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}\right] && \text{(chain rule of probability)} \\
&= \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] + \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p(z)}{q_\phi(z \mid x)}\right] && \text{(split the expectation)} \\
&= \underbrace{\mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)]}_{\text{reconstruction term}} - \underbrace{D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)}_{\text{prior matching term}} && \text{(definition of KL divergence)}
\end{aligned}$$
Encoder qϕ(z∣x): transforms inputs into a distribution over possible latents
Decoder $p_\theta(x \mid z)$: a deterministic function that converts a given latent vector $z$ into an observation $x$
Reconstruction Term
this ensures that the learned distribution is modeling effective latents that the original data can be regenerated from
Prior matching term
measures how similar the learned variational distribution is to a prior belief held over latent variables.
Minimizing this term encourages the encoder to actually learn a distribution rather than collapse into a Dirac delta function.
Maximizing the ELBO is thus equivalent to maximizing its first term and minimizing its second term.
ELBO optimized jointly over parameters ϕ and θ in VAE
The VAE utilizes the reparameterization trick and Monte Carlo estimates to optimize the ELBO jointly over ϕ and θ.
Monte Carlo estimates
The encoder of the VAE is commonly chosen to model a multivariate Gaussian with diagonal covariance, and the prior is often selected to be a standard multivariate Gaussian
the KL divergence term of the ELBO can be computed analytically
the reconstruction term can be approximated using a Monte Carlo estimate.
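A minimal sketch of both terms under these choices (a diagonal-Gaussian encoder outputting `mu` and `log_var` is assumed, along with a Bernoulli pixel likelihood for the reconstruction term):

```python
import torch
import torch.nn.functional as F

def elbo_loss(x, x_recon_logits, mu, log_var):
    """Negative ELBO for a VAE with a diagonal Gaussian encoder
    and a standard Gaussian prior N(0, I)."""
    # Reconstruction term: Monte Carlo estimate with one sample of z,
    # using an assumed Bernoulli likelihood over pixels.
    recon = F.binary_cross_entropy_with_logits(
        x_recon_logits, x, reduction="sum")
    # Prior matching term: KL(q_phi(z|x) || N(0, I)) in closed form.
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl   # minimizing this maximizes the ELBO
```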
reparameterization trick
$$x \sim \mathcal{N}(\mu, \sigma^2) \quad\Longleftrightarrow\quad x = \mu + \sigma\epsilon, \qquad \epsilon \sim \mathcal{N}(\epsilon; 0, I)$$
Why use the reparameterization trick
each sample $z^{(l)}$ that our loss is computed on is generated by a stochastic sampling procedure, which is generally non-differentiable.
what is reparameterization trick
rewrites a random variable as a deterministic function of a noise variable;
allows for the optimization of the non-stochastic terms through gradient descent.
arbitrary Gaussian distributions can be interpreted as standard Gaussians (of which $\epsilon$ is a sample) that have their mean shifted from zero to the target mean $\mu$ by addition, and their variance stretched by the target variance $\sigma^2$
by the reparameterization trick, sampling from an arbitrary Gaussian distribution can be performed by sampling from a standard Gaussian, scaling the result by the target standard deviation, and shifting it by the target mean.
use reparameterization trick in ELBO of VAE
$$z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(\epsilon; 0, I)$$
where $\odot$ represents an element-wise product
Under this reparameterized version of $z$, gradients can then be computed with respect to $\phi$ as desired, to optimize $\mu_\phi$ and $\sigma_\phi$.
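A minimal sketch of the trick in code (parameterizing via `log_var` rather than `sigma` is a common numerical convenience; the function name is illustrative):

```python
import torch

def reparameterize(mu, log_var):
    """Draw z ~ N(mu, sigma^2) as a differentiable function of (mu, sigma)."""
    sigma = torch.exp(0.5 * log_var)   # numerically stable parameterization
    eps = torch.randn_like(sigma)      # eps ~ N(0, I); randomness isolated here
    return mu + sigma * eps            # gradients flow through mu and sigma
```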
Variational Autoencoders are particularly interesting when the dimensionality of z is less than that of input x, as we might then be learning compact, useful representations.
Furthermore, when a semantically meaningful latent space is learned, latent vectors can be edited before being passed to the decoder to more precisely control the data generated.
The easiest way to think of a Variational Diffusion Model (VDM) is simply as a Markovian Hierarchical Variational Autoencoder (MHVAE) with three key restrictions:
The latent dimension is exactly equal to the data dimension → we represent both true data samples and latent variables as $x_t$
The structure of the latent encoder at each timestep is not learned; it is pre-defined as a linear Gaussian model. In other words, it is a Gaussian distribution centered around the output of the previous timestep.
The Gaussian parameters of the latent encoders vary over time in such a way that the distribution of the latent at final timestep T is a standard Gaussian.
ELBO in Variational Diffusion Model
$$\begin{aligned}
\log p(x) &= \log \int p(x_{0:T})\, dx_{1:T} \\
&= \log \int \frac{p(x_{0:T})\, q(x_{1:T} \mid x_0)}{q(x_{1:T} \mid x_0)}\, dx_{1:T} \\
&= \log \mathbb{E}_{q(x_{1:T} \mid x_0)}\!\left[\frac{p(x_{0:T})}{q(x_{1:T} \mid x_0)}\right] \\
&\ge \mathbb{E}_{q(x_{1:T} \mid x_0)}\!\left[\log \frac{p(x_{0:T})}{q(x_{1:T} \mid x_0)}\right] && \text{(Jensen's inequality)} \\
&= \mathbb{E}_{q(x_{1:T} \mid x_0)}\!\left[\log \frac{p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)}{\prod_{t=1}^{T} q(x_t \mid x_{t-1})}\right] \\
&= \mathbb{E}_{q(x_{1:T} \mid x_0)}\!\left[\log \frac{p(x_T)\, p_\theta(x_0 \mid x_1) \prod_{t=1}^{T-1} p_\theta(x_t \mid x_{t+1})}{q(x_T \mid x_{T-1}) \prod_{t=1}^{T-1} q(x_t \mid x_{t-1})}\right] && \text{(peel off edge terms, reindex)} \\
&= \mathbb{E}_{q(x_1 \mid x_0)}\big[\log p_\theta(x_0 \mid x_1)\big] + \mathbb{E}_{q(x_{T-1}, x_T \mid x_0)}\!\left[\log \frac{p(x_T)}{q(x_T \mid x_{T-1})}\right] + \sum_{t=1}^{T-1} \mathbb{E}_{q(x_{t-1}, x_t, x_{t+1} \mid x_0)}\!\left[\log \frac{p_\theta(x_t \mid x_{t+1})}{q(x_t \mid x_{t-1})}\right] \\
&= \underbrace{\mathbb{E}_{q(x_1 \mid x_0)}\big[\log p_\theta(x_0 \mid x_1)\big]}_{\text{reconstruction term}} - \underbrace{\mathbb{E}_{q(x_{T-1} \mid x_0)}\big[D_{\mathrm{KL}}\big(q(x_T \mid x_{T-1}) \,\|\, p(x_T)\big)\big]}_{\text{prior matching term}} - \sum_{t=1}^{T-1} \underbrace{\mathbb{E}_{q(x_{t-1}, x_{t+1} \mid x_0)}\big[D_{\mathrm{KL}}\big(q(x_t \mid x_{t-1}) \,\|\, p_\theta(x_t \mid x_{t+1})\big)\big]}_{\text{consistency term}}
\end{aligned}$$
reconstruction term
prior matching term:
minimized when the final latent distribution matches the Gaussian prior.
This term requires no optimization:
since we assume T is large enough that the final distribution is Gaussian, this term immediately becomes zero
consistency term
minimized when we train $p_\theta(x_t \mid x_{t+1})$ to match the Gaussian distribution $q(x_t \mid x_{t-1})$
Approximate ELBO with lower variance
From previous equations, all terms of ELBO are calculated as expectations, therefore can be approximated by Monte Carlo estimates.
However, actually optimizing the ELBO using the terms we just derived might be suboptimal: because the consistency term is computed as an expectation over two random variables $\{x_{t-1}, x_{t+1}\}$ for every timestep, the variance of its Monte Carlo estimate could potentially be higher than that of a term estimated using only one random variable per timestep.
Thus, we try to derive an expectation over only one random variable per timestep; using Bayes' rule, each forward transition can be rewritten with $x_0$ as a conditioning signal, $q(x_t \mid x_{t-1}, x_0) = \frac{q(x_{t-1} \mid x_t, x_0)\, q(x_t \mid x_0)}{q(x_{t-1} \mid x_0)}$, which leads to a denoising-matching form of the ELBO.
Thus, $x_{t-1} \sim q(x_{t-1} \mid x_t, x_0)$ is normally distributed, with mean $\mu_q(x_t, x_0)$ that is a function of $x_t$ and $x_0$, and variance $\Sigma_q(t)$ that is a function of the $\alpha$ coefficients
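For reference, applying Bayes' rule to the Gaussian forward transitions gives the closed form (with $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t = \prod_{i=1}^{t} \alpha_i$ as before):
$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\Bigg(x_{t-1};\ \underbrace{\frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})\, x_t + \sqrt{\bar\alpha_{t-1}}(1-\alpha_t)\, x_0}{1-\bar\alpha_t}}_{\mu_q(x_t,\, x_0)},\ \underbrace{\frac{(1-\alpha_t)(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\, I}_{\Sigma_q(t)}\Bigg)$$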
$p_\theta(x_{t-1} \mid x_t)$
From the above, the ground truth $q(x_{t-1} \mid x_t, x_0)$ follows a normal distribution. Thus, we also model $p_\theta(x_{t-1} \mid x_t)$ as a Gaussian
the variance requires no learning, since all variance terms are known (frozen) at each timestep, so we can set the variances of the two Gaussians to match exactly
we must parameterise its mean as a function of $x_t$
Denoising matching term
recalling the definition of the KL divergence between two Gaussians, we want to optimize a $\mu_\theta(x_t, t)$ to match $\mu_q(x_t, x_0)$
Therefore, optimizing a diffusion model boils down to learning a neural network to predict the original ground truth image from arbitrarily noisified version of it.
Furthermore, minimizing the summation term of our derived ELBO objective across all noise levels can be approximated by minimizing the expectation over all timesteps, which can then be optimized using stochastic samples over timesteps.
Here, $\hat\epsilon_\theta(x_t, t)$ is a neural network that learns to predict the source noise $\epsilon_0 \sim \mathcal{N}(\epsilon; 0, I)$ that determines $x_t$ from $x_0$.
We have therefore shown that learning a VDM by predicting the original image x0 is equivalent to learning to predict the noise; empirically, however, some works have found that predicting the noise resulted in better performance.
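Concretely, this gives the simplified noise-prediction objective used by DDPM (substituting $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon_0$ from earlier):
$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{t \sim U\{1, T\},\, x_0,\, \epsilon_0}\Big[\big\|\epsilon_0 - \hat\epsilon_\theta\big(\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon_0,\ t\big)\big\|_2^2\Big]$$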
So the network is trained to predict the noise that was added to $x_0$ (at time 0) to get to time $t$: if you know how the noise was added, you know how to subtract it
Interpretation 2: Score function
Training
Load in some images from the training data
Add noise, in different amounts.
Remember, we want the model to do a good job estimating how to ‘fix’ (denoise) both extremely noisy images and images that are close to perfect.
Feed the noisy versions of the inputs into the model
Evaluate how well the model does at denoising these inputs
Use this information to update the model weights
pass that noised image into your model, and predict the noise that was added
minimise the difference between the predicted noise and the actual noise (see the sketch after this list)
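A minimal sketch of this training loop in PyTorch, reusing `q_sample` and `T` from the earlier sketch; `model` (a network taking the noisy image and timestep) and `dataloader` are placeholders:

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for x0 in dataloader:                           # batches of clean images
    t = torch.randint(0, T, (x0.shape[0],))     # a random timestep per image
    noise = torch.randn_like(x0)                # the "actual" noise
    xt = q_sample(x0, t, noise)                 # noise the images in one shot
    noise_pred = model(xt, t)                   # model predicts the added noise
    loss = F.mse_loss(noise_pred, noise)        # compare predicted vs. actual
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```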
Training Architecture
Diffusion models often use U-Net architectures with ResNet blocks and self-attention layers to represent $\hat\epsilon_\theta(x_t, t)$
Time representation: sinusoidal positional embeddings or random Fourier features.
Time features are fed to the residual blocks using either simple spatial addition or using adaptive group normalization layers.
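A minimal sketch of the sinusoidal timestep embedding (Transformer-style positional encoding; `dim` is the embedding width, assumed even):

```python
import math
import torch

def timestep_embedding(t, dim):
    """Map integer timesteps t (shape [B]) to sinusoidal features [B, dim]."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = t[:, None].float() * freqs[None, :]      # [B, half]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)
```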
Parameters
$\beta_t$ and $\sigma_t^2$ control the variance of the forward diffusion and reverse denoising processes respectively.
Often a linear schedule is used for βt, and σt2 is set equal to βt .
Kingma et al. (Variational Diffusion Models, NeurIPS 2021) introduce a new parameterization of diffusion models using the signal-to-noise ratio (SNR), and show how to learn the noise schedule by minimizing the variance of the training objective.
We can also train σt2 while training the diffusion model by minimizing the variational bound (Improved DPM by Nichol and Dhariwal ICML 2021) or after training the diffusion model (Analytic-DPM by Bao et al. ICLR 2022).
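For concreteness, a sketch of the linear schedule with the $\sigma_t^2 = \beta_t$ choice (the endpoints 1e-4 and 0.02 are an assumed, conventional choice):

```python
import torch

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linear schedule for beta_t; the reverse variance sigma_t^2 is
    then often simply set equal to beta_t."""
    betas = torch.linspace(beta_start, beta_end, T)
    sigma_sq = betas.clone()   # sigma_t^2 = beta_t (common simple choice)
    return betas, sigma_sq
```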
Generation
start from an image of pure random noise
pass this completely random noise into the model
the model predicts some noise from that
subtract the predicted noise from the noisy image, and repeat (see the sketch below)
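A minimal DDPM-style ancestral sampling sketch, reusing `T`, `betas`, `alphas`, `alpha_bars`, and `model` from the earlier sketches and setting $\sigma_t^2 = \beta_t$:

```python
import torch

@torch.no_grad()
def sample(model, shape):
    """Generate images by iteratively denoising pure Gaussian noise."""
    x = torch.randn(shape)                           # x_T ~ N(0, I)
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = model(x, t_batch)                      # predicted noise
        # Posterior mean: remove the predicted noise contribution.
        x = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:                                    # no noise at the last step
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x
```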
UNet
A key feature of this model is that it predicts images of the same size as the input, which is exactly what we need here.
When dealing with higher-resolution inputs you may want to use more down and up-blocks, and keep the attention layers only at the lowest resolution (bottom) layers to reduce memory usage.
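With the Hugging Face `diffusers` library this might be instantiated as follows (a sketch; the resolution, channel counts, and block choices are illustrative, with attention kept only at the lowest resolution as suggested above):

```python
from diffusers import UNet2DModel

model = UNet2DModel(
    sample_size=32,            # input/output resolution
    in_channels=3,
    out_channels=3,            # predicts noise of the same shape as the input
    layers_per_block=2,
    block_out_channels=(64, 128, 256),
    down_block_types=("DownBlock2D", "DownBlock2D", "AttnDownBlock2D"),
    up_block_types=("AttnUpBlock2D", "UpBlock2D", "UpBlock2D"),
)
# usage: noise_pred = model(x_t, t).sample
```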
Encoder Downsampling
The encoder is responsible for feature extraction and representation learning. It uses specific computational methods to represent the entire image with a more compact form of data. This smaller representation captures high-level characteristics of the image, rather than fine-grained pixel-level details. The more compact the data, the more abstract and semantic the representation becomes, which helps in understanding the overall content of the image.
For example, consider an image of a cat playing in front of a house. The original image retains all fine details, but to identify the presence of the cat and the house, a model would need to perform extensive operations across all pixels. In contrast, when downsampled to a coarser representation, these high-level objects like “cat” and “house” become easier to recognize.
The encoder module can be implemented using top-tier feature extractors such as ResNet, VGG, or EfficientNet, which gives it strong potential for both engineering applications and research advancements.
At the same time, the encoder can improve robustness to noisy perturbations, reduce the risk of overfitting, lower computational cost, and enlarge the receptive field, among other benefits.
Decoder Upsampling
The decoder is responsible for restoring the macro-level, low-resolution representation of an image back to its original resolution.
For example, in image segmentation (where the goal is to separate objects from the background), we may identify object boundaries and key features in the downsampled high-level representation, but this information must ultimately be mapped back to the original image size to be useful.
But naturally, one might ask: after so much downsampling and loss of detail, how can we possibly recover fine-grained information during upsampling? This is where skip connections come into play.
The decoder reconstructs the feature map to its original resolution, and through skip connections, it fuses shallow-layer spatial information with deep-layer semantic understanding. Like the encoder, the decoder can also be implemented using powerful models such as ResNet, VGG, or EfficientNet, making U-Net variants highly diverse and adaptable—offering strong potential for architectural innovation.
Additionally, the decoder has inherent denoising capabilities, helping to refine outputs by integrating coarse-to-fine representations and filtering out irrelevant noise from the final result.
Skip connection
At each upsampling layer, the decoder concatenates the corresponding feature maps from the encoder (i.e., from the same resolution level on the left) and uses this combined information as input for the next upsampling stage.
The reason for doing this is intuitive:
Each layer in the encoder preserves image details to varying degrees, while the decoder layers—obtained through upsampling—primarily carry high-level, abstract information but lack fine-grained details.
By concatenating the encoder’s detailed feature maps with the decoder’s abstract representations, the network gains access to both macro-level semantics and micro-level details.
As a result, every layer in the decoder has a multi-scale understanding of the image, making it capable of performing various tasks such as image segmentation and semantic image generation.
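A minimal sketch of one decoder stage with a skip connection (PyTorch; layer shapes and names are illustrative):

```python
import torch
import torch.nn as nn

class UpBlock(nn.Module):
    """One decoder stage: upsample, concatenate the encoder skip, convolve."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.conv = nn.Conv2d(out_ch + skip_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x, skip):
        x = self.up(x)                       # restore spatial resolution
        x = torch.cat([x, skip], dim=1)      # fuse encoder detail with semantics
        return torch.relu(self.conv(x))
```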
Then we need to get the score function $\nabla_{x_t} \log q_t(x_t)$
Naïve idea: learn a model for the score function by direct regression? But $\nabla_{x_t} \log q_t(x_t)$ (the score of the marginal diffused density $q_t(x_t)$) is not tractable!
$$\min_\theta\ \mathbb{E}_{t \sim U(0,T)}\ \mathbb{E}_{x_t \sim q_t(x_t)} \Big\| \underbrace{s_\theta(x_t, t)}_{\text{neural network}} - \underbrace{\nabla_{x_t} \log q_t(x_t)}_{\text{score of diffused data (marginal)}} \Big\|_2^2$$
Instead, diffuse individual data points $x_0$. The diffused conditional $q_t(x_t \mid x_0)$ is tractable!
$$\min_\theta\ \mathbb{E}_{t \sim U(0,T)}\ \mathbb{E}_{x_0 \sim q_0(x_0)}\ \mathbb{E}_{x_t \sim q_t(x_t \mid x_0)} \Big\| \underbrace{s_\theta(x_t, t)}_{\text{neural network}} - \underbrace{\nabla_{x_t} \log q_t(x_t \mid x_0)}_{\text{score of diffused data sample}} \Big\|_2^2$$
Specifically, we can do the Denoising Score Matching
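Since each $q_t(x_t \mid x_0)$ is Gaussian, its score has a closed form, which connects denoising score matching back to noise prediction (using $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon_0$ from earlier):
$$\nabla_{x_t} \log q_t(x_t \mid x_0) = -\frac{x_t - \sqrt{\bar\alpha_t}\, x_0}{1 - \bar\alpha_t} = -\frac{\epsilon_0}{\sqrt{1 - \bar\alpha_t}}$$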
The vector field between noise and data distribution can be learnt directly via Flow Matching.
What makes a good generative model?
Model Improvements
Algorithmic Improvements
Learned variance
Deterministic Sampler
DDPM
DDIM
Score-based Models
Faster Sampling
DDIM
EDM
SGM
PFGM
Architecture Improvements
Classifier Guidance
Key idea: add an additional term in the sampling process to direct towards the desired class
Train a classifier on noisy images $x_t$
use the classifier's gradient during sampling to guide generation towards the desired class label (see the equation after this list)
Results: higher sample fidelity, traded off against diversity
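The modified score follows from Bayes' rule; a guidance scale $s > 1$ amplifies the classifier signal (this is the standard formulation, stated here since the notes above omit it):
$$\nabla_{x_t} \log p(x_t \mid y) = \underbrace{\nabla_{x_t} \log p(x_t)}_{\text{unconditional score}} + s\, \underbrace{\nabla_{x_t} \log p_\phi(y \mid x_t)}_{\text{classifier gradient}}$$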
Classifier-Free Diffusion Guidance
Motivation: Classifier Guidance enabled "low temperature" / "sharper" samples, but we want to achieve this without an additional classifier to train, purely within the generative model itself
Key idea: take the extra classifier term in the guided sampling rule and apply Bayes' rule, so that it can be expressed using only a conditional and an unconditional diffusion model
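The resulting sampling rule is Ho & Salimans' classifier-free guidance, where $w$ is the guidance weight, $c$ is the conditioning signal, and the unconditional model $\epsilon_\theta(x_t)$ is obtained by randomly dropping $c$ during training:
$$\tilde\epsilon_\theta(x_t, c) = (1 + w)\,\epsilon_\theta(x_t, c) - w\,\epsilon_\theta(x_t)$$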