Diffusion Model

Overview

Forward Process / diffusion process

Goal

Get the approximate posterior $q\left(\mathbf{x}_{1:T} \mid \mathbf{x}_0\right)$

$$q\left(\mathbf{x}_{1:T} \mid \mathbf{x}_0\right) := \prod_{t=1}^T q\left(\mathbf{x}_t \mid \mathbf{x}_{t-1}\right)$$

Assume the noise follows a Gaussian distribution

$$q\left(\mathbf{x}_t \mid \mathbf{x}_{t-1}\right) := \mathcal{N}\left(\mathbf{x}_t ; \sqrt{1-\beta_t}\,\mathbf{x}_{t-1}, \beta_t \mathbf{I}\right)$$

Tricks for efficient training
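A key trick from DDPM: defining $\alpha_t := 1 - \beta_t$ and $\bar{\alpha}_t := \prod_{s=1}^t \alpha_s$, the forward process admits a closed form at any timestep, so $\mathbf{x}_t$ can be sampled directly from $\mathbf{x}_0$ in one shot instead of iterating through $t$ noising steps:

$$q\left(\mathbf{x}_t \mid \mathbf{x}_0\right) = \mathcal{N}\left(\mathbf{x}_t ; \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0, \left(1-\bar{\alpha}_t\right)\mathbf{I}\right), \quad \text{i.e.} \quad \mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$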

Diffused Data Distribution $q\left(\mathbf{x}_t\right)$
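This marginal is the data distribution pushed through $t$ steps of noising:

$$q(\mathbf{x}_t) = \int q\left(\mathbf{x}_t \mid \mathbf{x}_0\right)\, q(\mathbf{x}_0)\, d\mathbf{x}_0$$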

Reverse Process / denoising process

Goal

Diffused Data Distribution $q\left(\mathbf{x}_t\right)$

$q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)$ is unknown

Approximate the reverse process: learn a model that estimates $q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)$

$$
\begin{aligned}
p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) &= \mathcal{N}\left(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)\right) \\
\text{such that} \quad p_\theta(\mathbf{x}_{0:T}) &= p(\mathbf{x}_T) \prod_{t=1}^{T} p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) \\
\text{with} \quad p(\mathbf{x}_T) &= \mathcal{N}(\mathbf{x}_T; \mathbf{0}, \mathbf{I})
\end{aligned}
$$
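In DDPM, for example, $\boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)$ is fixed to untrained constants $\sigma_t^2 \mathbf{I}$, and the mean is reparameterized through a noise-prediction network $\boldsymbol{\epsilon}_\theta$ (using the $\alpha_t$, $\bar{\alpha}_t$ notation above):

$$\boldsymbol{\mu}_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \right)$$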

Optimisation

ELBO: Evidence Lower Bound
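The ELBO lower-bounds the log evidence $\log p(\boldsymbol{x})$; maximising it over $\phi$ tightens the bound while fitting $\theta$:

$$\log p(\boldsymbol{x}) \geq \mathbb{E}_{q_\phi(\boldsymbol{z}|\boldsymbol{x})} \left[ \log \frac{p(\boldsymbol{x}, \boldsymbol{z})}{q_\phi(\boldsymbol{z}|\boldsymbol{x})} \right]$$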

Variational Autoencoders

$$
\begin{aligned}
\mathbb{E}_{q_\phi(\boldsymbol{z}|\boldsymbol{x})} \left[ \log \frac{p(\boldsymbol{x}, \boldsymbol{z})}{q_\phi(\boldsymbol{z}|\boldsymbol{x})} \right] &= \mathbb{E}_{q_\phi(\boldsymbol{z}|\boldsymbol{x})} \left[ \log \frac{p_\theta(\boldsymbol{x}|\boldsymbol{z})\, p(\boldsymbol{z})}{q_\phi(\boldsymbol{z}|\boldsymbol{x})} \right] && \text{(Chain Rule of Probability)} \\
&= \mathbb{E}_{q_\phi(\boldsymbol{z}|\boldsymbol{x})} [\log p_\theta(\boldsymbol{x}|\boldsymbol{z})] + \mathbb{E}_{q_\phi(\boldsymbol{z}|\boldsymbol{x})} \left[ \log \frac{p(\boldsymbol{z})}{q_\phi(\boldsymbol{z}|\boldsymbol{x})} \right] && \text{(Split the Expectation)} \\
&= \mathbb{E}_{q_\phi(\boldsymbol{z}|\boldsymbol{x})} [\log p_\theta(\boldsymbol{x}|\boldsymbol{z})] - D_{\mathrm{KL}}(q_\phi(\boldsymbol{z}|\boldsymbol{x}) \parallel p(\boldsymbol{z})) && \text{(Definition of KL Divergence)} \\
&= \underbrace{\mathbb{E}_{q_\phi(\boldsymbol{z}|\boldsymbol{x})} [\log p_\theta(\boldsymbol{x}|\boldsymbol{z})]}_{\text{reconstruction term}} - \underbrace{D_{\mathrm{KL}}(q_\phi(\boldsymbol{z}|\boldsymbol{x}) \parallel p(\boldsymbol{z}))}_{\text{prior matching term}}
\end{aligned}
$$

ELBO optimized jointly over parameters $\phi$ and $\theta$ in the VAE

The VAE uses the reparameterization trick and Monte Carlo estimates to optimize the ELBO jointly over $\phi$ and $\theta$, as in the sketch below.
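A minimal PyTorch sketch of the reparameterization trick (the encoder outputs `mu` and `log_var` are assumed; the names are illustrative, not from the source):

```python
import torch

def reparameterize(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    """Draw z ~ N(mu, sigma^2 I) as z = mu + sigma * eps, so gradients
    flow through mu and sigma while the randomness sits in eps."""
    std = torch.exp(0.5 * log_var)  # sigma recovered from log-variance
    eps = torch.randn_like(std)     # eps ~ N(0, I), independent of parameters
    return mu + std * eps
```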

Hierarchical Variational Autoencoders

Variational Diffusion Models

Training

  1. Load in some images from the training data
  2. Add noise, in different amounts.
    1. Remember, we want the model to do a good job estimating how to ‘fix’ (denoise) both extremely noisy images and images that are close to perfect.
  3. Feed the noisy versions of the inputs into the model
  4. Evaluate how well the model does at denoising these inputs
  5. Use this information to update the model weights

In practice, each update reduces to two steps (see the sketch below):

  1. Pass the noised image into the model and predict the noise that was added
  2. Minimise the difference between the predicted noise and the actual noise
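A minimal DDPM-style training step in PyTorch. This is a hedged sketch: `model` stands in for any noise-prediction network taking `(x_t, t)`, and the linear schedule values follow the DDPM paper; none of these names come from the source notes.

```python
import torch
import torch.nn.functional as F

# Linear DDPM noise schedule (values from Ho et al. 2020)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # bar{alpha}_t

def training_step(model, x0, optimizer):
    """One DDPM training step on a batch x0 of shape (B, C, H, W)."""
    t = torch.randint(0, T, (x0.shape[0],))             # a random timestep per image
    eps = torch.randn_like(x0)                          # the actual noise we add
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps  # closed-form forward sample
    eps_pred = model(x_t, t)                            # model predicts the added noise
    loss = F.mse_loss(eps_pred, eps)                    # predicted vs actual noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```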

Training Architecture

Generation

  1. Start from an image of random noise
  2. Pass this completely random noise into the model
  3. The model predicts some noise from it
  4. Subtract the predicted noise from the noisy image
  5. Repeat steps 2-4, stepping backwards through the timesteps until a clean image remains (see the sketch below)
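Continuing the training sketch above (reusing the hypothetical `model`, `T`, `betas`, and `alphas_cumprod`), DDPM-style ancestral sampling repeats the noise-subtraction step from $t = T-1$ down to $t = 0$:

```python
@torch.no_grad()
def sample(model, shape=(1, 3, 32, 32)):
    """Start from pure Gaussian noise and denoise step by step."""
    x = torch.randn(shape)  # x_T ~ N(0, I)
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps_pred = model(x, t_batch)  # predict the noise present at step t
        alpha_t = 1.0 - betas[t]
        a_bar_t = alphas_cumprod[t]
        # subtract the predicted noise (DDPM posterior mean)
        x = (x - betas[t] / torch.sqrt(1.0 - a_bar_t) * eps_pred) / torch.sqrt(alpha_t)
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)  # fresh noise, sigma_t^2 = beta_t
    return x
```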

UNet

Diffusion in Differential Equations

Flow Matching

What makes a good generative model?

Model Improvements

Algorithmic Improvements

DDPM

DDIM

Score-based Models

Faster Sampling

Architecture Improvements

Classifier Guidance

Classifier-Free Diffusion Guidance

Conditional Diffusion Models

GLIDE: Text-Guided Diffusion Models

DALL·E

Imagen
