References: Score matching and score-based generation

Source curricula (multi-source structural mirror; cited as further study):
PRIMARY (this lesson follows its framing most directly)
• Stanford CS236, "Deep Generative Models", Lecture 13: Score Based Models
Instructor: Stefano Ermon
Course URL: https://deepgenerativemodels.github.io/
Syllabus: https://deepgenerativemodels.github.io/syllabus.html
License: standard course-page link-out; cited as further study
SECONDARY (CS294-158 covers score matching within its diffusion lecture)
• Berkeley CS294-158, "Deep Unsupervised Learning" (Spring 2024), Lecture 6: Diffusion Models
Instructors: Pieter Abbeel, Wilson Yan, Kevin Frans, Philipp Wu
Course URL: https://sites.google.com/view/berkeley-cs294-158-sp24/
License: standard course-page link-out; cited as further study
Clawdemy's lessons are original prose that follows the pedagogical arc of these
two courses, anchored on CS236's lecture order with CS294-158 framing pulled in
where its slide deck and recording are stronger. We do not reproduce or
transcribe the lectures; we cite them as the recommended companions. All rights
to the original course materials remain with the respective instructors and
institutions.

A short, durable list. Each link is a specific next step, not a generic pile.

  • “Generative Modeling by Estimating Gradients of the Data Distribution” (Song and Ermon, 2019; NCSN). Introduced multi-noise-level denoising score matching with annealed Langevin sampling, the direct bridge to modern diffusion. Section 3 explains why score matching at a single noise level struggles (the manifold hypothesis and low-density regions); Section 4 presents the multi-scale training objective and the annealed sampling procedure. This is the conceptual predecessor of the diffusion models covered in L12-L14.

  • “Improved Techniques for Training Score-Based Generative Models” (Song and Ermon, 2020). NCSN’s practical follow-up. Addresses many of the engineering pathologies (noise schedule choice, exponential moving average of weights, score-norm balancing across scales) that the original NCSN paper left as open problems. Read after the NCSN paper to see what it took to make the method train stably and scale to higher-resolution images.

  • “Estimation of Non-Normalized Statistical Models by Score Matching” by Aapo Hyvärinen (2005). The original score-matching paper. Theorem 1 gives the integration-by-parts derivation that turns the objective, which depends on the unknown data score, into a tractable form. The paper is short and approachable; reading the proof is a useful exercise. (Published in the Journal of Machine Learning Research; available through the JMLR archives at jmlr.org.)

  • “A Connection Between Score Matching and Denoising Autoencoders” by Pascal Vincent (2011). The paper that derives denoising score matching and shows it is equivalent to denoising-autoencoder training. The conceptual move that makes practical training feasible is the closed-form score of the Gaussian noise kernel: for a perturbed sample x̃ = x + σε with ε ~ N(0, I), the regression target is -ε/σ. (Published in Neural Computation.)
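
The -ε/σ target in the Vincent entry above can be checked numerically on a toy problem. The sketch below is an illustration under stated assumptions, not code from any of the papers: the data distribution is a 1-D Gaussian N(0, data_std²), the score model is the one-parameter linear function s(x̃) = a·x̃, and the function name `fit_dsm_slope` is hypothetical. Regressing s(x̃) onto -ε/σ by least squares should recover the slope of the true score of the noise-perturbed distribution N(0, data_std² + σ²), which is -1/(data_std² + σ²).

```python
import random


def fit_dsm_slope(data_std, sigma, n_samples=200_000, seed=0):
    """Least-squares fit of s(x_tilde) = a * x_tilde to the DSM target -eps / sigma.

    Toy setup (an assumption for illustration): clean data x ~ N(0, data_std^2).
    """
    rng = random.Random(seed)
    num = 0.0  # running sum of x_tilde * target
    den = 0.0  # running sum of x_tilde ** 2
    for _ in range(n_samples):
        x = rng.gauss(0.0, data_std)   # clean sample from the toy data distribution
        eps = rng.gauss(0.0, 1.0)      # standard normal noise
        x_tilde = x + sigma * eps      # perturbed sample
        target = -eps / sigma          # score of N(x_tilde; x, sigma^2) in x_tilde
        num += x_tilde * target
        den += x_tilde * x_tilde
    # Closed-form minimizer of sum (a * x_tilde - target)^2 over a.
    return num / den


data_std, sigma = 2.0, 0.5
a_hat = fit_dsm_slope(data_std, sigma)
# Slope of the true score of the perturbed marginal N(0, data_std^2 + sigma^2).
a_true = -1.0 / (data_std**2 + sigma**2)
print(f"fitted slope: {a_hat:.4f}   true slope: {a_true:.4f}")
```

The two printed slopes should agree up to Monte Carlo error. The point of the exercise is that the regression never touches the data score itself, only the noise that was added.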

Where this sits in the track.

  • Energy-based models, the partition-function problem (previous lesson). L10 ended with the score-function escape: ∇_x log p_θ(x) = -∇_x E_θ(x), with Z absent. This lesson is what you do with that observation: train the score function directly, without ever computing Z. L10 and L11 form a single arc: L10 names the obstacle, L11 develops the escape.

  • Diffusion models I (next lesson, L12). Diffusion models are derived from a different perspective (Markov chain of noising steps) but train on an objective that is mathematically equivalent to multi-noise-level denoising score matching. L12 builds the diffusion model directly; L14 makes the score-matching ↔ diffusion equivalence explicit.

  • Maximum likelihood and the KL view (lesson 3). L3 established that minimizing forward KL is equivalent to minimizing empirical NLL, which makes it the natural training objective when likelihood is available. Score matching is the natural training objective when likelihood is not directly available (because Z is intractable). The L3 framework’s “cross-paradigm sharing” table from L9 listed diffusion as “indirect” for likelihood; this lesson gives the precise reason, since score-based training does not need a likelihood but also does not give one directly.

  • The four-paradigm landscape (lesson 15). The capstone returns to the L1 map with score-based and diffusion as the fully-built fourth paradigm. The score-function-as-noise-predictor identity from this lesson and the multi-noise-level extension are the pieces L15 uses to place modern diffusion systems (Stable Diffusion, DALL-E 3, Sora) on the four-paradigm map.
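
The two identities this track leans on, the score with Z absent (from the L10 recap above) and the score as a scaled noise predictor, can be written out in a few lines. This is a sketch under assumptions: the energy-based form p_θ(x) = exp(-E_θ(x)) / Z(θ) is the standard one and is taken to match L10, and the symbols x̃ and q_σ for the perturbed sample and the Gaussian noise kernel are notation introduced here for illustration.

```latex
\begin{align*}
% Energy-based model with an intractable normalizer Z(\theta)
p_\theta(x) &= \frac{\exp\bigl(-E_\theta(x)\bigr)}{Z(\theta)},
  \qquad Z(\theta) = \int \exp\bigl(-E_\theta(x)\bigr)\,dx \\
% Take logs; Z(\theta) is constant in x, so its gradient in x vanishes
\log p_\theta(x) &= -E_\theta(x) - \log Z(\theta) \\
\nabla_x \log p_\theta(x) &= -\nabla_x E_\theta(x) \\[6pt]
% Gaussian perturbation at noise level \sigma
\tilde{x} &= x + \sigma\varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I) \\
q_\sigma(\tilde{x} \mid x) &= \mathcal{N}\bigl(\tilde{x};\, x,\, \sigma^2 I\bigr) \\
% Score of the noise kernel: the regression target in denoising score matching
\nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x)
  &= -\frac{\tilde{x} - x}{\sigma^2} = -\frac{\varepsilon}{\sigma}
\end{align*}
```

The last line is why a network trained to predict the added noise ε can be read as a score estimate up to the factor -1/σ.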