References: Diffusion models I, the forward and reverse processes

Source material

Source curricula (multi-source structural mirror; cited as further study):

PRIMARY (this lesson follows its framing most directly)
• Stanford CS236, "Deep Generative Models", Lecture 16: Score Based Diffusion Models
  Instructor: Stefano Ermon
  Course URL: https://deepgenerativemodels.github.io/
  Syllabus: https://deepgenerativemodels.github.io/syllabus.html
  License: standard course-page link-out; cited as further study

SECONDARY (CS294-158 has a dedicated diffusion lecture)
• Berkeley CS294-158, "Deep Unsupervised Learning" (Spring 2024), Lecture 6: Diffusion Models
  Instructors: Pieter Abbeel, Wilson Yan, Kevin Frans, Philipp Wu
  Course URL: https://sites.google.com/view/berkeley-cs294-158-sp24/
  License: standard course-page link-out; cited as further study

Clawdemy's lessons are original prose that follows the pedagogical arc of these
two courses, anchored on CS236's lecture order with CS294-158 framing pulled in
where its slide deck and recording are stronger. We do not reproduce or
transcribe the lectures; we cite them as the recommended companions. All rights
to the original course materials remain with the respective instructors and
institutions.

Watch this next

Stanford CS236 (Stefano Ermon), course homepage. Lecture 16 (Score Based Diffusion Models) is the primary anchor; it covers the forward and reverse chains, the ELBO derivation, and the simplified DDPM loss. Notes at deepgenerativemodels.github.io/notes include the simplification step by step.
Berkeley CS294-158 Sp24 (Pieter Abbeel et al.), course homepage. Lecture 6 (Diffusion Models) is the secondary anchor; it covers the same DDPM derivation with additional material on practical training tricks (timestep embedding, EMA on weights, schedule choices).

Going deeper

A short, durable list. Each link is a specific next step, not a generic pile.

“Denoising Diffusion Probabilistic Models” (Ho, Jain, Abbeel, 2020). The DDPM paper that broke diffusion models open as a practical paradigm. Section 2 derives the forward chain and the closed-form shortcut; Section 3 derives the reverse-process objective, with the noise-prediction parameterization in Section 3.2 and the simplified training objective in Section 3.4. The empirical results on CIFAR-10 and CelebA-HQ are what convinced the field. If you read only one paper in Phase 3, read this one.
“Deep Unsupervised Learning using Nonequilibrium Thermodynamics” (Sohl-Dickstein, Weiss, Maheswaranathan, Ganguli, 2015). The original diffusion-model paper, which introduced the Markov-chain noising-and-denoising framework that DDPM later refined. Useful for context on where the framework came from; the paper got lost in the noise of the time and was rediscovered five years later as the basis for modern image generation.
“What are Diffusion Models?” by Lilian Weng (2021). One of the most widely cited blog-post introductions to diffusion models, written while the author was a researcher at OpenAI. Walks the DDPM derivation step by step with cleaner notation than the original paper. Useful when you want to see the algebra written carefully.
“Tutorial on Diffusion Models for Imaging and Vision” (Chan, 2024). A long, careful tutorial that covers everything from DDPM through latent diffusion and consistency models. Useful as a one-stop reference once you have read DDPM and want a unified treatment of the field.
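
The closed-form shortcut named in the DDPM entry above can be checked numerically. What follows is a minimal NumPy sketch, not the paper's code: the function names (linear_beta_schedule, alpha_bar, q_sample, q_chain) are hypothetical, and the schedule defaults (T = 1000, linear β from 1e-4 to 0.02) are the values reported in the DDPM paper. It draws x_t two ways, once by stepping the Markov chain and once in a single jump via x_t = sqrt(ᾱ_t) x_0 + sqrt(1 − ᾱ_t) ε.

```python
import numpy as np

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances beta_1..beta_T (linear schedule, assumed)."""
    return np.linspace(beta_start, beta_end, T)

def alpha_bar(betas):
    """Cumulative products abar_t = prod_{s<=t} (1 - beta_s)."""
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bars, rng):
    """Closed-form shortcut: sample x_t ~ q(x_t | x_0) in one step.

    t is 1-indexed, so alpha_bars[t - 1] is abar_t.
    Returns the noisy sample and the noise that produced it.
    """
    eps = rng.standard_normal(x0.shape)
    a = alpha_bars[t - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps, eps

def q_chain(x0, t, betas, rng):
    """Reference: apply the forward transition q(x_s | x_{s-1}) t times."""
    x = x0
    for s in range(t):
        noise = rng.standard_normal(x.shape)
        x = np.sqrt(1.0 - betas[s]) * x + np.sqrt(betas[s]) * noise
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    betas = linear_beta_schedule()
    abars = alpha_bar(betas)
    x0 = np.ones(50_000)          # many copies of the same data point
    t = 200
    x_short, _ = q_sample(x0, t, abars, rng)
    x_long = q_chain(x0, t, betas, rng)
    print("shortcut mean/std:", x_short.mean(), x_short.std())
    print("chain    mean/std:", x_long.mean(), x_long.std())
    print("predicted        :", np.sqrt(abars[t - 1]), np.sqrt(1 - abars[t - 1]))
```

Both samplers should agree in empirical mean (about sqrt(ᾱ_t) x_0) and standard deviation (about sqrt(1 − ᾱ_t)) up to Monte Carlo error, which is why training can draw any timestep directly instead of simulating the chain.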

Adjacent topics

Where this sits in the track.

Score matching and score-based generation (previous lesson). L11 derived the denoising-score-matching objective; this lesson derives the same loss from the Markov-chain perspective. The L11 perspective treats the noise level as a continuous parameter σ; this lesson indexes it by a discrete timestep t, with cumulative noise standard deviation sqrt(1 − ᾱ_t). The conceptual identification (noise level ↔ timestep) is what makes the two derivations equivalent.
Diffusion models II, training and sampling (next lesson, L13). L13 covers the practical sampling-speed optimizations that turned diffusion from theoretically clean to production-grade: DDIM (cutting the number of sampling steps from T = 1000 to ~50 deterministic steps, without retraining), classifier-free guidance (text conditioning with adjustable strength), and the diffusion-specific aspects of inference cost. L13 will also cover the §6 watch with the same five-layer pattern applied at the conditioning level.
Score-based diffusion via SDEs, the unifying view (L14). L14 returns to the score-matching view from L11 and makes the equivalence with the DDPM Markov-chain view explicit via the stochastic differential equation perspective. Both DDPM and continuous-time score-based models are discretizations of the same underlying SDE.
Latent variables and the ELBO (lesson 5). Diffusion’s training objective is derived as an ELBO over the chain of latents x_1, ..., x_T, generalizing the single-latent VAE ELBO of lesson 5. The L5 machinery is the workhorse here; the diffusion-specific simplification is the noise-prediction reparameterization that collapses the chain of KL terms into a single MSE loss.
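
Two identifications in the list above can be made concrete: the noise level at timestep t is sqrt(1 − ᾱ_t), and a noise prediction converts to a score of q(x_t | x_0) by dividing by that level and flipping the sign. The following is a minimal NumPy sketch under stated assumptions: all function names are hypothetical, eps_model stands in for a trained network (any callable taking (x_t, t) works here), the schedule is the linear DDPM one, and the loss is the unweighted noise-prediction MSE rather than the full ELBO with its per-timestep weights.

```python
import numpy as np

def make_alpha_bars(T=1000, beta_start=1e-4, beta_end=0.02):
    """abar_t for a linear beta schedule (assumed defaults)."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def noise_std(t, alpha_bars):
    """Cumulative noise level sqrt(1 - abar_t); t is 1-indexed."""
    return np.sqrt(1.0 - alpha_bars[t - 1])

def eps_to_score(eps, t, alpha_bars):
    """Score of q(x_t | x_0) at x_t, expressed via the noise eps:
    grad_x log q(x_t | x_0) = -eps / sqrt(1 - abar_t).
    """
    return -eps / noise_std(t, alpha_bars)

def simple_loss(eps_model, x0, alpha_bars, rng):
    """One Monte Carlo estimate of the noise-prediction MSE.

    x0 has shape (batch, dim). Each example gets its own uniformly
    sampled timestep, is noised with the closed-form shortcut, and the
    model's noise prediction is compared to the true noise.
    """
    T = len(alpha_bars)
    t = rng.integers(1, T + 1, size=x0.shape[0])    # 1..T inclusive
    eps = rng.standard_normal(x0.shape)
    a = alpha_bars[t - 1][:, None]                  # broadcast over dim
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return np.mean((eps - eps_model(x_t, t)) ** 2)

def zero_model(x_t, t):
    """Placeholder 'network' that always predicts zero noise."""
    return np.zeros_like(x_t)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    abars = make_alpha_bars()
    x0 = rng.standard_normal((4096, 8))
    # A model that predicts zero noise scores about E[eps^2] = 1.
    print("loss of zero model:", simple_loss(zero_model, x0, abars, rng))
    print("noise level at t=1 and t=1000:",
          noise_std(1, abars), noise_std(1000, abars))
```

A predictor that always outputs zero sits near a loss of 1, the variance of the injected noise, so any useful network has to come in below that baseline.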