Score matching, in brief

What you’ll learn

This is lesson 11 of Track 19 (Generative Models and Diffusion), and it cashes in the score-function-escape observation from L10. By the end you will be able to do three things. First, write the explicit score-matching objective and identify the two reasons it does not scale: we have no access to ∇_x log p_data, and Hyvärinen’s integration-by-parts trick, although it gives an equivalent form without that term, requires the trace of the score network’s Jacobian, which is prohibitively expensive in high dimensions. Second, derive denoising score matching from the Gaussian conditional score and recognize the resulting noise-prediction MSE as the practical training objective. Third, see that the multi-noise-level extension (NCSN) is mathematically the diffusion training objective that the next three lessons build on directly. The source curriculum is Stanford CS236 Lecture 13.

Where this fits

This is lesson 11 of 15, the second lesson of Phase 3 (energy-score-diffusion). It is the direct payoff of L10’s score-function-escape observation: L10 named the obstacle (the partition function blocks maximum likelihood), L11 names the escape (score matching bypasses Z entirely), and the L10-L11 arc forms a complete chapter on “why and how score-based modeling exists.” Lesson 12 opens the diffusion model in its Markov-chain (DDPM) formulation; lesson 13 covers training and sampling in practice with classifier-free guidance; lesson 14 returns to the score-matching view and makes the equivalence with diffusion explicit.

Before you start

Prerequisites: the previous lesson (Energy-based models, the partition-function problem), which supplies the score-function-escape observation this lesson cashes in. The L3 KL/cross-entropy machinery is reused implicitly. Math background: comfort with expectations, gradient vector calculus, and one calculus identity (the gradient of log N(x̃; μ, σ²I) with respect to x̃). No new probability concepts beyond what L10 introduced.
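
For reference, the one calculus identity named above can be written out as follows. This is a sketch under an assumed noising convention, x̃ = x + σε with ε ~ N(0, I); the lesson itself may use different symbols.

```latex
\nabla_{\tilde{x}} \log \mathcal{N}(\tilde{x};\, \mu,\, \sigma^2 I)
  = -\frac{\tilde{x} - \mu}{\sigma^2},
\qquad\text{so with } \mu = x,\ \tilde{x} = x + \sigma\varepsilon:\quad
\nabla_{\tilde{x}} \log \mathcal{N}(\tilde{x};\, x,\, \sigma^2 I) = -\frac{\varepsilon}{\sigma}.
```

The second form is what later makes the score target a rescaled copy of the added noise.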

About the math

This lesson has more derivations than L10. The key results are: the explicit score-matching loss and its Hyvärinen 2005 equivalent (with the trace-of-Jacobian obstacle); the denoising score-matching loss derived from the Gaussian conditional score; and the worked 1D Gaussian example where explicit SM has a clean closed form. The math density is comparable to L8 (which derived the Wasserstein distance) but with the focus on score functions rather than transport distances.
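
As a preview, the three losses listed above can be sketched in standard notation. The symbols here are assumptions rather than the lesson’s own: s_θ denotes the score network, σ a single fixed noise level, and C a constant that does not depend on θ. The equivalence in the second line holds under the usual regularity and boundary conditions needed for integration by parts.

```latex
\begin{aligned}
J_{\mathrm{ESM}}(\theta) &= \tfrac{1}{2}\,\mathbb{E}_{x \sim p_{\mathrm{data}}}
  \left[\lVert s_\theta(x) - \nabla_x \log p_{\mathrm{data}}(x)\rVert^2\right] \\
J_{\mathrm{ISM}}(\theta) &= \mathbb{E}_{x \sim p_{\mathrm{data}}}
  \left[\operatorname{tr}\big(\nabla_x s_\theta(x)\big) + \tfrac{1}{2}\lVert s_\theta(x)\rVert^2\right],
  \qquad J_{\mathrm{ESM}}(\theta) = J_{\mathrm{ISM}}(\theta) + C \\
J_{\mathrm{DSM}}(\theta) &= \tfrac{1}{2}\,\mathbb{E}_{x \sim p_{\mathrm{data}},\ \varepsilon \sim \mathcal{N}(0, I)}
  \left[\Big\lVert s_\theta(x + \sigma\varepsilon) + \frac{\varepsilon}{\sigma}\Big\rVert^2\right]
\end{aligned}
```

The first line needs the unknown data score, the second needs a Jacobian trace, and the third needs only a noised sample and the noise that produced it.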

By the end, you’ll be able to

Write the explicit score-matching objective and explain why we cannot evaluate it directly (we don’t have ∇_x log p_data) and why Hyvärinen’s integration-by-parts trick doesn’t scale (trace-of-Jacobian cost in high dimensions)
Derive the denoising score matching objective from the Gaussian conditional score and explain why it scales (one forward + one backward pass per noised sample)
Recognize the conceptual identity that the score network IS a noise predictor (its target output is the negative of the added noise, rescaled by the noise level)
Compute the explicit score-matching loss for a 1D Gaussian by hand and find the optimal model parameter
Describe the multi-noise-level extension (NCSN) and recognize it as the mathematical equivalent of the diffusion training objective derived from the Markov-chain perspective
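
To make two of these outcomes concrete, here is a minimal NumPy sketch on a toy 1D problem. Everything in it is an illustrative assumption rather than the lesson’s own worked example: the data are zero-mean Gaussian with variance 4, the model score is s(x) = -x/θ, the noise level is σ = 0.5, and the helper names implicit_sm_loss and fit_dsm_linear are hypothetical. The first part minimizes the Hyvärinen-form loss over θ on a grid; the second fits a linear score by denoising score matching, which reduces to a least-squares regression onto -ε/σ.

```python
import numpy as np

def implicit_sm_loss(theta, x):
    """Hyvarinen-form loss for the model score s(x) = -x / theta.

    In 1D the trace of the Jacobian is just ds/dx = -1 / theta.
    """
    score = -x / theta
    dscore_dx = -1.0 / theta
    return np.mean(dscore_dx + 0.5 * score**2)

def fit_dsm_linear(x, sigma, rng):
    """Denoising score matching with a linear score s(x_tilde) = -a * x_tilde.

    The regression target is the Gaussian conditional score
    -(x_tilde - x) / sigma^2 = -eps / sigma.
    """
    eps = rng.standard_normal(x.shape)
    x_tilde = x + sigma * eps
    target = -eps / sigma
    # Closed-form minimizer of mean((-a * x_tilde - target)^2) over a.
    return -np.sum(x_tilde * target) / np.sum(x_tilde**2)

rng = np.random.default_rng(0)
data_var = 4.0  # assumed toy data variance
sigma = 0.5     # assumed noise level
x = rng.normal(0.0, np.sqrt(data_var), size=200_000)

# Part 1: grid search over the model variance theta.
thetas = np.linspace(1.0, 8.0, 701)
losses = np.array([implicit_sm_loss(t, x) for t in thetas])
theta_star = thetas[np.argmin(losses)]

# Part 2: DSM fit of the slope of the noised-marginal score.
a_hat = fit_dsm_linear(x, sigma, rng)

print("implicit SM optimum theta:", theta_star)
print("sample second moment:     ", np.mean(x**2))
print("DSM slope a:              ", a_hat)
print("1 / (data_var + sigma^2): ", 1.0 / (data_var + sigma**2))
```

In this setup the grid minimizer should land next to the sample second moment, and the DSM slope should approach 1/(data_var + σ²), the score slope of the noised marginal rather than of the clean data. A multi-noise-level version would repeat the second part for several values of σ with a model that also takes σ as input.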

Time and difficulty

Read time: about 14 minutes
Practice time: about 16 minutes (a six-question self-check, a 1D explicit score-matching computation finding the optimal parameter, a denoising-score-matching exercise on four worked individual examples, and flashcards)
Difficulty: standard (a Phase 3 lesson; two clean derivations, one new score-matching framework, no §6 watch since this is pure technique)