References: Diffusion models
Source material
This lesson follows Stanford CS231n’s treatment of diffusion models (Lecture 14: Generative Models 2).
- Course: Stanford CS231n, “Deep Learning for Computer Vision”
- Instructors: Fei-Fei Li, Ehsan Adeli, and Justin Johnson (Stanford University)
- Course site: cs231n.stanford.edu
- This lesson maps to: Lecture 14 (Generative Models 2: Diffusion).
Attribution (Clawdemy-authored): Stanford CS231n: Deep Learning for Computer Vision, Fei-Fei Li, Ehsan Adeli, and Justin Johnson, Stanford University (cs231n.stanford.edu). CS231n does not publish a required citation string; this is the attribution Clawdemy uses.
A note on access and license
The current term’s lecture recordings are posted on Canvas for enrolled Stanford students. Recordings from previous years are publicly available on YouTube under YouTube’s standard license; Clawdemy links out rather than embedding or rehosting. The course notes (cs231n.github.io) and site are Stanford’s. No Creative Commons license is published for the lectures, so we treat them as link-only references.
Primary papers (cited by name and venue)
Diffusion foundations
- DDPM. Ho, Jain, Abbeel, “Denoising Diffusion Probabilistic Models” (NeurIPS 2020). The paper that made modern diffusion models work; the formulation used throughout this lesson.
- DDIM (faster sampling). Song, Meng, Ermon, “Denoising Diffusion Implicit Models” (ICLR 2021). Deterministic sampler producing good samples in dramatically fewer steps.
- Score-based generative modeling. Song, Sohl-Dickstein, Kingma, Kumar, Ermon, Poole, “Score-Based Generative Modeling through Stochastic Differential Equations” (ICLR 2021). The continuous-time framework that diffusion can also be derived from.
- Earlier work. Sohl-Dickstein, Weiss, Maheswaranathan, Ganguli, “Deep Unsupervised Learning using Nonequilibrium Thermodynamics” (ICML 2015). The original diffusion-style paper; less well-known but foundational.
Latent diffusion
- Latent Diffusion Models (Stable Diffusion). Rombach, Blattmann, Lorenz, Esser, Ommer, “High-Resolution Image Synthesis with Latent Diffusion Models” (CVPR 2022). The architecture that made text-to-image affordable to run; the open-source release became Stable Diffusion.
Text-to-image diffusion systems
- GLIDE. Nichol et al., “GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models” (ICML 2022). Earlier OpenAI text-to-image diffusion.
- DALL-E 2. Ramesh, Dhariwal, Nichol, Chu, Chen, “Hierarchical Text-Conditional Image Generation with CLIP Latents” (arXiv 2022). The DALL-E 2 paper (unCLIP).
- Imagen. Saharia et al., “Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding” (NeurIPS 2022). Google’s high-resolution text-to-image system.
- DALL-E 3. Betker et al., “Improving Image Generation with Better Captions” (OpenAI technical report, 2023). The technical report behind DALL-E 3.
- Classifier-free guidance. Ho, Salimans, “Classifier-Free Diffusion Guidance” (NeurIPS 2021 workshop / arXiv 2022). The technique behind the “guidance scale” knob in most text-to-image UIs.
Control + conditioning
- ControlNet. Zhang, Rao, Agrawala, “Adding Conditional Control to Text-to-Image Diffusion Models” (ICCV 2023). The dominant approach for structural conditioning (depth, edges, pose, etc.) added to a pre-trained text-to-image diffusion model.
Further study (deeper mechanics in sister tracks)
- T19 (planned, generative modeling). Will cover the variational interpretation of diffusion, the equivalence with score-based generative models, the closed-form forward process derivation, and the rigorous training-objective derivation. The right destination if you want to fully understand diffusion’s math.
- T24 (planned, image generation). Will cover production text-to-image pipelines end to end, including data preparation, training tricks, classifier-free guidance tuning, latent-diffusion specifics, and deployment considerations. The right destination if you want to actually train or fine-tune a diffusion-based image generator.
- CS231n’s full Lecture 14 slides are posted on the course site; the lecture recordings are on Canvas for the current term and on YouTube for prior years.
Further study (production tooling)
- Hugging Face diffusers library: a widely used open-source implementation of diffusion models, with consistent APIs across Stable Diffusion variants, DALL-E-style approaches, and ControlNet.
- The Stable Diffusion model weights (CompVis, Stability AI, follow-on releases) are openly available and pair with the diffusers library for reproduction and experimentation.
How we use this source
Clawdemy follows CS231n’s Lec 14 ordering (the two-direction setup, training, inference, modern variants, conditioning) and stays at vision-applied-intuition level per the Track 16 Phase 0 arc (deep derivations deferred to T19 and T24 as named above). The forward-step formula x_t = sqrt(1 - β_t) · x_{t-1} + sqrt(β_t) · ε and the MSE training-loss form are canonical. The worked forward-step examples (body: x_{t-1} = 0.8, β_t = 0.1, ε = -0.3 → x_t ≈ 0.664; practice: x_{t-1} = 0.5, β_t = 0.04, ε = 1.5 → x_t ≈ 0.790) are Clawdemy-authored against the standard formula. The classifier-free-guidance trajectory-reasoning exercise in practice is Clawdemy-authored to make the prompt-adherence-vs-naturalness trade-off concrete. We do not reproduce CS231n’s slides, figures, problem sets, or lecture text. Full attribution policy: see Doc/attribution-policy.md.
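The two worked forward-step numbers quoted above can be reproduced with a few lines of Python. This is a minimal sketch, assuming a single scalar "pixel" and a fixed ε rather than a sampled one; the function name `forward_step` is chosen here for illustration and does not come from CS231n or any library.

```python
import math


def forward_step(x_prev: float, beta_t: float, eps: float) -> float:
    """One scalar forward-diffusion step (hypothetical helper, illustration only).

    Implements x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps,
    with eps passed in explicitly instead of being drawn from N(0, 1).
    """
    return math.sqrt(1.0 - beta_t) * x_prev + math.sqrt(beta_t) * eps


# Body example: x_{t-1} = 0.8, beta_t = 0.1, eps = -0.3
body = forward_step(0.8, 0.1, -0.3)

# Practice example: x_{t-1} = 0.5, beta_t = 0.04, eps = 1.5
practice = forward_step(0.5, 0.04, 1.5)

print(f"{body:.3f}")      # 0.664
print(f"{practice:.3f}")  # 0.790
```

In an actual diffusion model the same step is applied elementwise to a whole image tensor, with ε drawn fresh from a standard normal at each step; the scalar version only checks the arithmetic.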