References: Score matching and score-based generation

Source material

Source curricula (the lesson mirrors the structure of both courses; each is cited as further study):

PRIMARY (this lesson follows its framing most directly)
• Stanford CS236, "Deep Generative Models", Lecture 13: Score Based Models
  Instructor: Stefano Ermon
  Course URL: https://deepgenerativemodels.github.io/
  Syllabus: https://deepgenerativemodels.github.io/syllabus.html
  License: standard course-page link-out; cited as further study

SECONDARY (CS294-158 covers score matching within its diffusion lecture)
• Berkeley CS294-158, "Deep Unsupervised Learning" (Spring 2024), Lecture 6: Diffusion Models
  Instructors: Pieter Abbeel, Wilson Yan, Kevin Frans, Philipp Wu
  Course URL: https://sites.google.com/view/berkeley-cs294-158-sp24/
  License: standard course-page link-out; cited as further study

Clawdemy's lessons are original prose that follows the pedagogical arc of these
two courses, anchored on CS236's lecture order with CS294-158 framing pulled in
where its slide deck and recording are stronger. We do not reproduce or
transcribe the lectures; we cite them as the recommended companions. All rights
to the original course materials remain with the respective instructors and
institutions.

Watch this next

Stanford CS236 (Stefano Ermon), course homepage. Lecture 13 (Score Based Models) is the primary anchor; it covers Hyvärinen’s score-matching objective, denoising score matching, and the NCSN extension. Notes at deepgenerativemodels.github.io/notes include the integration-by-parts derivation step by step.
Berkeley CS294-158 Sp24 (Pieter Abbeel et al.), course homepage. Lecture 6 covers score matching as part of the diffusion-model derivation; the score-matching framing leads directly into the multi-noise-level diffusion training.

Going deeper

A short, durable list. Each link is a specific next step, not a generic pile.

“Generative Modeling by Estimating Gradients of the Data Distribution” (Song and Ermon, 2019; NCSN). The paper that introduced multi-noise-level denoising score matching with annealed Langevin sampling, the direct bridge to modern diffusion. Section 4.2 gives the multi-noise-level training objective; Section 4.3 gives the annealed Langevin sampling procedure. This is the conceptual predecessor of the diffusion models that dominate lessons L12-L14.
“Improved Techniques for Training Score-Based Generative Models” (Song and Ermon, 2020). NCSN’s practical follow-up. Addresses many of the engineering choices (noise schedule choice, exponential moving average of weights, score-norm balancing across scales) that the original NCSN paper left unsettled. Read after the NCSN paper to see what it took to make the method train and sample reliably at higher resolutions.
“Estimation of Non-Normalized Statistical Models by Score Matching” by Aapo Hyvärinen (2005). The original score-matching paper. Theorem 1 contains the integration-by-parts derivation that turns the objective, which as first written depends on the unknown data score, into a tractable form. The paper is short and approachable; reading the proof is a useful exercise. (Published in Journal of Machine Learning Research; available through the JMLR archives at jmlr.org.)
“A Connection Between Score Matching and Denoising Autoencoders” by Pascal Vincent (2011). The paper that derives denoising score matching and shows it is equivalent to denoising-autoencoder training. For Gaussian noise x̃ = x + σε with ε ~ N(0, I), the score of the noising kernel has the closed form -(x̃ - x)/σ² = -ε/σ; using this known quantity as the regression target is the conceptual move that makes practical training feasible. (Published in Neural Computation.)
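
To make the -ε/σ target concrete, here is a minimal sketch of denoising score matching on a toy problem. It is not taken from Vincent's paper or either course: the 1-D standard-normal data, the linear score model s(x̃) = a·x̃, and the helper names dsm_target and fit_linear_score are all assumptions chosen so that the answer can be checked by hand. With data x ~ N(0, 1), the noisy marginal is N(0, 1 + σ²), whose true score is -x̃/(1 + σ²), so regressing onto -ε/σ should recover a ≈ -1/(1 + σ²).

```python
import numpy as np

def dsm_target(x, sigma, rng):
    """Perturb x with Gaussian noise and return (noisy sample, DSM regression target)."""
    eps = rng.standard_normal(x.shape)   # eps ~ N(0, 1), same shape as x
    x_noisy = x + sigma * eps            # sample from the noising kernel q(x_noisy | x)
    target = -eps / sigma                # closed-form score of q: -(x_noisy - x) / sigma**2
    return x_noisy, target

def fit_linear_score(x_noisy, target):
    """Least-squares fit of the toy model s(x) = a * x to the DSM target; returns a."""
    return float(np.sum(x_noisy * target) / np.sum(x_noisy ** 2))

rng = np.random.default_rng(0)
sigma = 0.5                              # assumed noise level for this toy run
x = rng.standard_normal(200_000)         # toy data: x ~ N(0, 1)

x_noisy, target = dsm_target(x, sigma, rng)
a_hat = fit_linear_score(x_noisy, target)
a_true = -1.0 / (1.0 + sigma ** 2)       # slope of the true noisy-marginal score

print(f"fitted slope {a_hat:.4f}, analytic slope {a_true:.4f}")
```

The point of the sketch is that the regression never touches the data density or its score: the only target is the noise that was added, rescaled by 1/σ. A real model would replace the one-parameter linear fit with a neural network trained by stochastic gradient descent on the same squared error.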

Adjacent topics

Where this sits in the track.

Energy-based models, the partition-function problem (previous lesson). L10 ended with the score-function escape: ∇_x log p_θ(x) = -∇_x E_θ(x), with the partition function Z absent. This lesson is what you do with that observation: train the score function directly, without ever computing Z. L10 and L11 form a single arc: L10 names the obstacle, L11 names the escape.
Diffusion models I (next lesson, L12). Diffusion models are derived from a different perspective (a Markov chain of noising steps) but train on an objective that is equivalent, up to a per-noise-level weighting, to multi-noise-level denoising score matching. L12 builds the diffusion model directly; L14 makes the score-matching ↔ diffusion equivalence explicit.
Maximum likelihood and the KL view (lesson 3). L3 established that minimizing forward KL is equivalent to minimizing empirical NLL, which makes it the natural training objective when likelihood is available. Score matching is the natural training objective when likelihood is NOT directly available (because Z is intractable). The L3 framework’s “cross-paradigm sharing” table from L9 listed diffusion as “indirect” for likelihood; this lesson is the precise reason, since score-based training does not need likelihood but also does not give one directly.
The four-paradigm landscape (lesson 15). The capstone returns to the L1 map with score-based and diffusion as the fully-built fourth paradigm. The score-function-as-noise-predictor identity from this lesson and the multi-noise-level extension are the pieces L15 uses to place modern diffusion systems (Stable Diffusion, DALL-E 3, Sora) on the four-paradigm map.
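
For reference, the score identity quoted in the energy-based-models item above follows in one line. This is a sketch that assumes the standard energy-based parameterization p_θ(x) = exp(-E_θ(x)) / Z(θ), which is the usual convention but is not restated on this page:

```latex
\log p_\theta(x) = -E_\theta(x) - \log Z(\theta),
\qquad
\nabla_x \log p_\theta(x)
  = -\nabla_x E_\theta(x) - \underbrace{\nabla_x \log Z(\theta)}_{=\,0}
  = -\nabla_x E_\theta(x).
```

The log-partition term vanishes because Z(θ) depends on the parameters but not on x, which is why the score can be trained and evaluated without ever computing Z.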