Score-based diffusion via SDEs: brief

What you’ll learn

This is the theoretical capstone of Phase 3. Lessons 11, 12, and 13 arrived at the same trained network (the noise predictor) and the same training loss (the noise-prediction mean-squared error) from three different starting points: denoising score matching, the DDPM Markov-chain ELBO, and the DDIM non-Markovian sampler. They look like the same equation in different costumes; this lesson shows that they formally are the same equation. By the end you will be able to write the forward stochastic differential equation that the DDPM Markov chain discretizes; connect the noise predictor to the score function via a fixed-scalar relationship; analyze the reverse SDE, whose drift involves the score function; derive the probability flow ODE, which gives both a deterministic sampler (the L13 DDIM mechanism, made formal) and a tractable likelihood (the answer to L9’s “indirect” entry for diffusion); and place L11 through L14 on one unified map. The primary source is Stanford CS236 Lectures 13 and 16 (Stefano Ermon), where the score-based and SDE framings are introduced; Berkeley CS294-158 Lecture 6 (Pieter Abbeel et al.) covers the same material from the diffusion-paradigm side.

Where this fits

This is lesson 14 of 15 and the closing math lesson of the track. It unifies what the three previous Phase 3 lessons built. The capstone at lesson 15 returns to L1’s four-paradigm map with all the math filled in. After this lesson, every model you have read about across the track has a precise mathematical home in one of four buckets, with this lesson’s continuous-time framework serving as the formal description of the score-based / diffusion bucket.

Before you start

Prerequisites: lessons 11, 12, and 13. The SDE framing in this lesson assumes you have the L11 denoising-score-matching framework, the L12 DDPM Markov chain plus its closed-form forward shortcut, and the L13 DDIM deterministic sampler. The math here is denser than in L13 (continuous-time SDEs, the probability flow ODE, the Anderson reverse-SDE result), but every piece references something you have already seen. Comfort with calculus (the limit interpretation of the discrete chain) and with random-process intuition (the Wiener process as the continuous-time analogue of per-step Gaussian noise) makes the reading easier; if you are unsure about either, the worked numerical anchor in this lesson is the way to ground the abstractions.
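To make the "limit interpretation of the discrete chain" concrete, here is a minimal sketch. It assumes the standard DDPM step, which scales the previous sample by sqrt(1 - beta_k) and adds noise with standard deviation sqrt(beta_k), and it assumes beta_k = BETA_RATE * dt for a constant rate. The name BETA_RATE and its value are hypothetical, chosen only for illustration. Under those assumptions the noise coefficient of the DDPM step and of an Euler-Maruyama step of the variance-preserving SDE are identical, and the mean coefficients differ by a gap that shrinks roughly quadratically in dt.

```python
import math

BETA_RATE = 2.0  # hypothetical constant noise rate beta(t); not a value from the lesson


def ddpm_mean_coeff(dt):
    """Mean coefficient of one DDPM step with beta_k = BETA_RATE * dt."""
    return math.sqrt(1.0 - BETA_RATE * dt)


def euler_maruyama_mean_coeff(dt):
    """Mean coefficient of one Euler-Maruyama step of dx = -0.5*beta*x dt + sqrt(beta) dw."""
    return 1.0 - 0.5 * BETA_RATE * dt


# Both schemes add noise with standard deviation sqrt(BETA_RATE * dt),
# so only the mean coefficients need comparing.
gaps = {}
for dt in (0.1, 0.01, 0.001):
    gaps[dt] = abs(ddpm_mean_coeff(dt) - euler_maruyama_mean_coeff(dt))
    print(f"dt={dt}: gap between mean coefficients = {gaps[dt]:.3e}")
```

The shrinking gap is the sense in which the discrete chain converges to the SDE as the step size goes to zero.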

About the math

This is the most theoretically dense lesson in Phase 3, but the smallest practical leap. The forward and reverse SDEs are one notational step away from the discretized chain you already know; the noise-predictor / score-function identity is one line of algebra from the closed-form forward shortcut; the probability flow ODE is the deterministic counterpart of the reverse SDE, obtained by dropping the noise term and halving the coefficient on the score. Each result is a clean line of reasoning, but the framework as a whole is what makes diffusion a coherent paradigm rather than three independent recipes. The worked numerical anchor takes a simple linear-drift constant-diffusion variance-preserving SDE and integrates the probability flow ODE through one step end-to-end, with numbers preserved exactly.
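For orientation, the results named above can be written compactly as follows. This is a sketch in commonly used notation, not necessarily the notation the lesson body adopts: the symbols f, g, \beta(t), \bar\alpha_t, \epsilon_\theta, and \bar{w} (a reverse-time Wiener process) are assumptions introduced here.

```latex
\begin{align*}
% Forward SDE with drift f and diffusion coefficient g
\mathrm{d}x &= f(x,t)\,\mathrm{d}t + g(t)\,\mathrm{d}w \\
% Variance-preserving choice (continuous-time limit of the DDPM chain)
f(x,t) &= -\tfrac{1}{2}\,\beta(t)\,x, \qquad g(t) = \sqrt{\beta(t)} \\
% Noise predictor to score: negative noise prediction over cumulative noise std
\nabla_x \log p_t(x) &\approx -\frac{\epsilon_\theta(x,t)}{\sqrt{1-\bar\alpha_t}} \\
% Reverse SDE (Anderson), integrated backward in time
\mathrm{d}x &= \left[f(x,t) - g(t)^2\,\nabla_x \log p_t(x)\right]\mathrm{d}t + g(t)\,\mathrm{d}\bar{w} \\
% Probability flow ODE: no noise term, score coefficient halved
\frac{\mathrm{d}x}{\mathrm{d}t} &= f(x,t) - \tfrac{1}{2}\,g(t)^2\,\nabla_x \log p_t(x)
\end{align*}
```

The last two lines differ only in the missing noise term and the factor of one half on the score.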

By the end, you’ll be able to

Derive the forward stochastic differential equation that the DDPM Markov chain is a discretization of, and identify the variance-preserving and variance-exploding choices
Connect the noise predictor from L12 to the score function from L11 via the fixed-scalar relationship (negative noise prediction divided by cumulative noise standard deviation)
Analyze why the reverse process is itself an SDE whose drift involves the score function, and explain why integrating it backward produces samples from the data distribution
Derive the probability flow ODE associated with the reverse SDE, and explain why it gives both a deterministic sampler (the L13 DDIM mechanism) and a tractable likelihood evaluation
Place L11, L12, L13, and L14 on one map and explain why three different recipes are the same continuous-time framework with different discretizations and different samplers
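The second objective above can be sketched in a few lines. The example assumes the closed-form forward shortcut x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps, and it checks the conversion against the exact score of the Gaussian q(x_t | x0), for which the identity holds exactly when the true noise is plugged in. The function names, the three alpha_bar values, and the values of x0 and eps are hypothetical and are not the lesson's own numbers.

```python
import math


def score_from_noise(eps_pred, alpha_bar):
    """Score estimate: negative noise prediction over cumulative noise std."""
    sigma = math.sqrt(1.0 - alpha_bar)
    return -eps_pred / sigma


def conditional_score(x_t, x0, alpha_bar):
    """Exact score of q(x_t | x0) = N(sqrt(alpha_bar) * x0, 1 - alpha_bar)."""
    return -(x_t - math.sqrt(alpha_bar) * x0) / (1.0 - alpha_bar)


x0, eps = 2.0, 0.5  # hypothetical clean sample and noise draw
results = []
for alpha_bar in (0.99, 0.5, 0.01):  # low, medium, and high cumulative noise
    x_t = math.sqrt(alpha_bar) * x0 + math.sqrt(1.0 - alpha_bar) * eps
    from_eps = score_from_noise(eps, alpha_bar)
    exact = conditional_score(x_t, x0, alpha_bar)
    results.append((alpha_bar, from_eps, exact))
    print(f"alpha_bar={alpha_bar}: score from eps={from_eps:.6f}, exact={exact:.6f}")
```

The same noise value maps to a much larger score at low noise (alpha_bar near 1) than at high noise, because the dividing standard deviation is small there.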

Time and difficulty

Read time: about 17 minutes
Practice time: about 16 minutes (a six-question self-check, a one-step probability-flow-ODE integration on a linear-drift constant-diffusion SDE with numbers preserved exactly, a noise-predictor-to-score conversion at three timesteps, and flashcards)
Difficulty: standard, theoretical-capstone-dense. This is the most math-heavy lesson in Phase 3. The framework is what to remember; the specific equations are the calibration for reading score-based papers fluently.
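As a warm-up for the one-step integration exercise, here is a minimal sketch of a single Euler step of the probability flow ODE. It assumes a variance-preserving SDE with constant rate BETA and one-dimensional Gaussian data with variance S0_SQ, so that the score is available in closed form and no network is needed. These constants, the step size, and the starting point are hypothetical and are not the lesson's worked numbers.

```python
import math

BETA = 1.0    # hypothetical constant noise rate
S0_SQ = 0.25  # hypothetical data variance; data assumed to be N(0, S0_SQ)


def marginal_var(t):
    """Variance of p_t under the VP SDE when the data is N(0, S0_SQ)."""
    return 1.0 - (1.0 - S0_SQ) * math.exp(-BETA * t)


def score(x, t):
    """Closed-form score of the Gaussian marginal p_t."""
    return -x / marginal_var(t)


def pf_ode_drift(x, t):
    """Probability flow ODE drift: f - 0.5 * g^2 * score."""
    f = -0.5 * BETA * x
    g_sq = BETA
    return f - 0.5 * g_sq * score(x, t)


def euler_step(x, t, dt):
    """One explicit Euler step; a negative dt integrates toward the data."""
    return x + dt * pf_ode_drift(x, t)


def exact_flow(x, t_from, t_to):
    """Exact ODE solution for Gaussian marginals: rescale by the std ratio."""
    return x * math.sqrt(marginal_var(t_to) / marginal_var(t_from))


x1, t1, dt = 1.0, 1.0, -0.1
x_euler = euler_step(x1, t1, dt)
x_exact = exact_flow(x1, t1, t1 + dt)
print(f"Euler: {x_euler:.6f}  exact: {x_exact:.6f}  error: {abs(x_euler - x_exact):.2e}")
```

Because the marginals here are Gaussian, the exact flow is a simple rescaling, which makes the Euler error easy to inspect; with a trained noise predictor the score call would instead be the converted network output.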