Cheatsheet: Diffusion models I, the forward and reverse processes

q(x_t | x_{t-1}) = N( x_t ; sqrt(1 − β_t) · x_{t-1}, β_t · I )

β_t is the noise schedule, small positive values (~10⁻⁴ to 0.02), T = 1000 typical. No learnable parameters here.

After T steps: q(x_T) ≈ N(0, I), pure Gaussian noise.
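As a quick sanity check on the claim above, the chain can be simulated step by step on scalar data. This is a minimal sketch, not a reference implementation: the linear spacing of β_t between 10⁻⁴ and 0.02, the helper names `linear_betas` and `forward_chain`, and the starting value 3.0 are all assumptions made for illustration.

```python
import math
import random

def linear_betas(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced beta_1..beta_T (assumed schedule); index 0 holds beta_1."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def forward_chain(x0, betas, rng):
    """Apply q(x_t | x_{t-1}) once per beta and return the final x_T (scalar data)."""
    x = x0
    for beta in betas:
        x = math.sqrt(1.0 - beta) * x + math.sqrt(beta) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
betas = linear_betas()
# Start every chain far from zero so that any leftover signal would be visible.
samples = [forward_chain(3.0, betas, rng) for _ in range(2000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"mean={mean:.3f}  var={var:.3f}")  # expected to land near 0 and 1
```

Running T sequential steps per sample like this is only for checking the endpoint; it is the cost that the closed-form shortcut avoids.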

α_t = 1 − β_t, ᾱ_t = α_1 · α_2 · ... · α_t
q(x_t | x_0) = N( x_t ; sqrt(ᾱ_t) · x_0, (1 − ᾱ_t) · I )
x_t = sqrt(ᾱ_t) · x_0 + sqrt(1 − ᾱ_t) · ε, ε ~ N(0, I)

Sample any timestep in one operation instead of t sequential steps. Computational hinge of training.

Worked anchor: t = 2 with constant β = 0.1: α = 0.9, ᾱ_2 = 0.9² = 0.81. So x_2 = sqrt(0.81)·x_0 + sqrt(0.19)·ε ≈ 0.9·x_0 + 0.436·ε. With T = 1000 and growing β, ᾱ_T → 0, so x_T → ε (pure noise).
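The worked anchor can be reproduced in a few lines. The sketch below assumes data stored as a plain Python list; the helper names `alpha_bar` and `q_sample` are illustrative, not taken from any library.

```python
import math
import random

def alpha_bar(betas, t):
    """alpha_bar_t = prod_{s=1..t} (1 - beta_s); betas[0] holds beta_1."""
    prod = 1.0
    for b in betas[:t]:
        prod *= 1.0 - b
    return prod

def q_sample(x0, t, betas, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot and return (x_t, eps)."""
    abar = alpha_bar(betas, t)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(abar) * x + math.sqrt(1.0 - abar) * e for x, e in zip(x0, eps)]
    return x_t, eps

betas = [0.1, 0.1]                       # the two-step anchor with constant beta
abar2 = alpha_bar(betas, 2)
signal_coef = math.sqrt(abar2)           # multiplies x_0
noise_coef = math.sqrt(1.0 - abar2)      # multiplies eps
print(f"{abar2:.2f} {signal_coef:.3f} {noise_coef:.3f}")

x_t, eps = q_sample([1.0, -2.0], 2, betas, random.Random(0))
```

Returning ε alongside x_t is a convenience for training, where the same ε is the regression target.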

p_θ(x_{t-1} | x_t) = N( μ_θ(x_t, t), Σ_θ(x_t, t) )

Standard DDPM fixes Σ_θ to a function of the noise schedule; only μ_θ is learned.
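The text does not say which function of the schedule is used. Two choices commonly seen in the DDPM literature are σ_t² = β_t and the posterior variance β̃_t = (1 − ᾱ_{t-1}) / (1 − ᾱ_t) · β_t, with ᾱ_0 = 1. The sketch below computes both under an assumed linear schedule; `fixed_variances` is a hypothetical helper name.

```python
def fixed_variances(betas):
    """Return (beta_t, beta_tilde_t) for t = 1..T; neither has learned parameters."""
    out = []
    abar_prev = 1.0  # alpha_bar_0 = 1 (empty product)
    for beta in betas:
        abar = abar_prev * (1.0 - beta)
        beta_tilde = (1.0 - abar_prev) / (1.0 - abar) * beta
        out.append((beta, beta_tilde))
        abar_prev = abar
    return out

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]  # assumed linear schedule
variances = fixed_variances(betas)
print(variances[0], variances[-1])
```

The two options differ most at small t (β̃_1 is exactly 0) and nearly coincide at large t.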

Noise-predictor reparameterization:

μ_θ(x_t, t) = (1/sqrt(α_t)) · ( x_t - (β_t/sqrt(1−ᾱ_t)) · ε_θ(x_t, t) )

ε_θ(x_t, t) is the trained noise-prediction network.
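A small check of the reparameterized mean, assuming list-valued data and a hypothetical helper `reverse_mean`. No network is involved: the true noise is plugged in for ε_θ. At t = 1 the formula then returns x_0 exactly, because β_1 / sqrt(1 − ᾱ_1) = sqrt(β_1) cancels the noise term.

```python
import math

def reverse_mean(x_t, eps_hat, t, betas):
    """mu_theta(x_t, t) from a noise estimate eps_hat; betas[0] holds beta_1."""
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    abar_t = math.prod(1.0 - b for b in betas[:t])
    coef = beta_t / math.sqrt(1.0 - abar_t)
    return [(x - coef * e) / math.sqrt(alpha_t) for x, e in zip(x_t, eps_hat)]

betas = [0.1, 0.1]
x0 = [1.5, -0.5]
eps = [0.3, -1.2]                        # pretend this is the noise that was added
# Forward shortcut at t = 1 (alpha_bar_1 = alpha_1 = 0.9)
x1 = [math.sqrt(0.9) * x + math.sqrt(1.0 - 0.9) * e for x, e in zip(x0, eps)]
mu = reverse_mean(x1, eps, 1, betas)     # oracle noise in place of eps_theta
print(mu)
```

For t > 1 the mean does not recover x_0 in one step; it only moves x_t toward the less noisy x_{t-1}.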

The simplified DDPM loss (= denoising score matching)

L_simple(θ) = E_{t ~ Uniform{1..T}, x_0 ~ p_data, ε ~ N(0,I)}[
|| ε - ε_θ(x_t, t) ||²
]
where x_t = sqrt(ᾱ_t) · x_0 + sqrt(1 − ᾱ_t) · ε

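This is DSM (L11) at noise level sqrt(1 − ᾱ_t), up to a per-timestep weighting: the conditional score is −ε / sqrt(1 − ᾱ_t), so predicting ε and predicting the score differ only by that scale factor. The L11 score-based derivation and the L12 Markov-chain derivation arrive at the same loss from two paths. L14 makes the equivalence formal via the SDE view.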
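The correspondence can be written out in a few lines. The symbol s_θ for a score network is introduced here only for illustration; everything else follows from the closed-form q(x_t | x_0).

```latex
\begin{aligned}
q(x_t \mid x_0) &= \mathcal{N}\!\big(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\,I\big) \\
\nabla_{x_t} \log q(x_t \mid x_0)
  &= -\frac{x_t - \sqrt{\bar\alpha_t}\,x_0}{1-\bar\alpha_t}
   = -\frac{\varepsilon}{\sqrt{1-\bar\alpha_t}} \\
s_\theta(x_t, t) &:= -\frac{\varepsilon_\theta(x_t, t)}{\sqrt{1-\bar\alpha_t}} \\
\big\| s_\theta(x_t, t) - \nabla_{x_t} \log q(x_t \mid x_0) \big\|^2
  &= \frac{1}{1-\bar\alpha_t}\,\big\| \varepsilon - \varepsilon_\theta(x_t, t) \big\|^2
\end{aligned}
```

Weighting the score-matching term at each t by (1 − ᾱ_t) therefore gives the summand of L_simple.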

repeat:
    x_0 ~ p_data
    t ~ Uniform({1, ..., T})
    ε ~ N(0, I)
    x_t = sqrt(ᾱ_t)·x_0 + sqrt(1−ᾱ_t)·ε
    loss = || ε - ε_θ(x_t, t) ||²
    θ ← θ - η · ∇_θ loss

One MSE per step, sampled uniformly across timesteps. Architecture: typically U-Net (sometimes transformer).
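The loop above can be run end to end on a toy problem. This is a sketch under loud assumptions, not a realistic setup: the data are scalars from N(1, 0.5²), T = 10 with a made-up schedule, and the "network" is a hypothetical per-timestep linear model ε_θ(x_t, t) = w[t]·x_t + bias[t] with hand-written gradients instead of a U-Net and autodiff.

```python
import math
import random

rng = random.Random(0)
T = 10
betas = [0.05 * (i + 1) for i in range(T)]        # toy schedule, not the T=1000 one
abar, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    abar.append(running)                           # abar[t-1] = alpha_bar_t

# Hypothetical stand-in for the network: one (w, bias) pair per timestep.
w = [0.0] * T
bias = [0.0] * T

def sample_training_tuple():
    """One draw of (t, x_t, eps); p_data = N(1, 0.5^2) is a toy choice."""
    x0 = rng.gauss(1.0, 0.5)
    t = rng.randint(1, T)                          # uniform over {1, ..., T}
    eps = rng.gauss(0.0, 1.0)
    x_t = math.sqrt(abar[t - 1]) * x0 + math.sqrt(1.0 - abar[t - 1]) * eps
    return t, x_t, eps

def mean_loss(n=5000):
    """Monte Carlo estimate of the simplified loss for the current parameters."""
    total = 0.0
    for _ in range(n):
        t, x_t, eps = sample_training_tuple()
        total += (eps - (w[t - 1] * x_t + bias[t - 1])) ** 2
    return total / n

loss_before = mean_loss()
lr = 0.02
for _ in range(20000):
    t, x_t, eps = sample_training_tuple()
    err = eps - (w[t - 1] * x_t + bias[t - 1])     # residual of the noise prediction
    # d(err^2)/dw = -2*err*x_t and d(err^2)/dbias = -2*err, so SGD adds these terms
    w[t - 1] += lr * 2.0 * err * x_t
    bias[t - 1] += lr * 2.0 * err
loss_after = mean_loss()
print(f"loss before: {loss_before:.3f}  after: {loss_after:.3f}")
```

Before training the model predicts zero, so the loss sits near E[ε²] = 1. Afterwards it drops well below that, with most of the remaining error at small t, where x_t carries little noise to detect.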

Sampling loop (reverse chain, T forward passes)

x_T ~ N(0, I)
for t = T, T-1, ..., 1:
    z ~ N(0, I) if t > 1, else z = 0
    ε̂ = ε_θ(x_t, t)
    x_{t-1} = (1/sqrt(α_t)) · (x_t - (β_t/sqrt(1−ᾱ_t))·ε̂) + σ_t·z    (σ_t fixed by the schedule, e.g. σ_t² = β_t)
return x_0

Slow at inference: the original DDPM needs T = 1000 forward passes per sample. DDIM cuts this to about 50 (covered in L13).
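The reverse loop can be exercised without a trained network when the data distribution is Gaussian, because the optimal noise predictor E[ε | x_t] then has a closed form. The sketch below relies on that: `eps_oracle` stands in for ε_θ, the data distribution N(1, 0.5²) and the linear schedule are illustrative choices, and σ_t² = β_t is one common option rather than the only one.

```python
import math
import random

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]  # assumed linear schedule
abar, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    abar.append(running)                                         # abar[t-1] = alpha_bar_t

M, S = 1.0, 0.5   # toy data distribution p_data = N(M, S^2)

def eps_oracle(x_t, t):
    """E[eps | x_t] for Gaussian data; a stand-in for a trained eps_theta."""
    a = math.sqrt(abar[t - 1])
    c = math.sqrt(1.0 - abar[t - 1])
    return c * (x_t - a * M) / (a * a * S * S + c * c)

def sample(rng):
    """Ancestral sampling for one scalar: T calls to the noise predictor."""
    x = rng.gauss(0.0, 1.0)                                      # x_T ~ N(0, 1)
    for t in range(T, 0, -1):
        beta = betas[t - 1]
        z = rng.gauss(0.0, 1.0) if t > 1 else 0.0                # no noise on the last step
        mean = (x - beta / math.sqrt(1.0 - abar[t - 1]) * eps_oracle(x, t)) / math.sqrt(1.0 - beta)
        x = mean + math.sqrt(beta) * z                           # sigma_t^2 = beta_t
    return x

rng = random.Random(0)
draws = [sample(rng) for _ in range(1000)]
mu = sum(draws) / len(draws)
sd = math.sqrt(sum((d - mu) ** 2 for d in draws) / len(draws))
print(f"sample mean={mu:.3f} (target {M}), std={sd:.3f} (target {S})")
```

Each draw costs T predictor calls, which is the inference cost noted above. With a real network every call is a full forward pass.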

| Perspective | Path |
| --- | --- |
| Score matching (L11) | noise data with σ → score = -ε/σ → DSM loss → multi-noise-level extension |
| DDPM (L12) | fixed Markov noising chain → ELBO over latents → reparameterize as noise predictor → simplified loss |

Same loss. Noise level σ (L11) corresponds to sqrt(1 − ᾱ_t) (L12). Same model, two derivations.

A note on what this lesson does NOT cover (§6 in-body checkpoint, 5 layers)

Diffusion powers most modern synthetic-media generation. Six distinct policy/governance forums sit outside this mechanical scope:

| Category | What it covers |
| --- | --- |
| Use-case appropriateness | Synthetic faces/voices/video of identifiable people (consent) |
| Provenance + watermarking | Attribution; latent-diffusion-specific concerns vs pixel diffusion |
| Sector-specific | Journalism, politics, legal evidence, MEDICAL IMAGING (new for diffusion vs GAN era) |
| Training-data IP | Scraped-corpora licensing (LAION-style claims) |
| Likeness + consent | More pronounced for diffusion than VAE/GAN due to output quality |
| Prompt-injection content risks | NEW for text-conditioned diffusion (refusal-systems prompted around) |

Evaluation methods for this lesson’s scope (operational test): FID across step counts, IS, CLIP scores, sample-quality-vs-step-count Pareto, perceptual studies, memorization probes. If you’re using these tools, you’re in scope. If you’re using policy-debate methods (stakeholder analysis, legal frameworks, regulatory comment, sectoral standards), you’re in a different conversation.

Inform-not-settle (Layer 5): empirical questions (“Does this model reproduce training-image-like content?”) settled by FID + memorization probes. Value questions (“Should it train on scraped corpora?”) inform-but-don’t-settle. Defense-in-depth: training-data curation = engineering (this lesson’s scope); IP licensing = policy (different forum). Both needed; neither alone sufficient.

  • Treating the forward process as learned. No, it’s fixed by the β-schedule. Only the reverse process has parameters.
  • Forgetting the closed-form shortcut. Training without it = up to T sequential ops per example, infeasible.
  • Confusing the DDPM loss with a separate paradigm. The DDPM loss IS DSM at the corresponding noise level. Same equation, up to a per-timestep scale.
  • Skipping the §6 boundary. Mechanical content (this lesson) and policy questions (different forums) deserve to be kept separate; mixing muddles both.

Diffusion fixes a Markov chain that noises data into Gaussian noise, learns a reverse chain parameterized as a noise predictor, and trains it with a simple MSE on the predicted noise, which is mathematically the L11 denoising-score-matching loss. The closed-form forward shortcut makes training feasible; the reverse chain runs for T steps at inference; L13 and L14 cover speed-ups and the score-based unification.