Energy-based models, the partition-function problem
What you’ll learn
This is lesson 10 of Track 19 (Generative Models and Diffusion), and it opens Phase 3 (energy-score-diffusion). By the end you will be able to write the energy-based model density, explain why the partition function is intractable for any nontrivial neural-network energy, derive the maximum-likelihood gradient and identify the “negative-phase” term that requires sampling from the model, describe Langevin dynamics and contrastive divergence as the standard MCMC-based workarounds (with their biases), and derive the score function and explain why the partition function vanishes under the x-gradient. The score-function observation is the conceptual move that opens the score-matching framework (next lesson) and the modern diffusion paradigm (lessons 12-14). The source curriculum is Stanford CS236 Lecture 11.
Where this fits
This is lesson 10 of 15, the first lesson of Phase 3 (energy, score, diffusion). It is the conceptual bridge from the four paradigms in Phase 1 and Phase 2 to the modern score-based view that dominates Phase 3. The next lesson, Score matching and score-based generation, derives the training objective that builds directly on the score-function observation here; lessons 12-14 then construct full diffusion models as a multi-step score-matching procedure across noise levels.
Before you start
Prerequisites: all of Phase 1 (especially L3’s KL/forward-KL framework, since EBM maximum likelihood is the L3 objective applied to a paradigm where it breaks down) and Phase 2 (especially the likelihood column of the cross-paradigm map). Math background: comfort with expectations, gradient calculus, and one move from vector calculus (differentiating an integral with respect to a parameter). No new probability concepts beyond what Phase 1 introduced.
About the math
This lesson is denser than L9 (which was conceptual) but lighter than L8 (which had transport-theory machinery). The two key derivations are the maximum-likelihood gradient (which introduces the positive/negative-phase split by differentiating log Z(θ)) and the score-function derivation (one calculus step showing that Z vanishes under the x-gradient). A worked Gaussian-energy example shows a case where Z is tractable; the practice section adds a derivation showing that Z vanishes under ∇_x. No matrix algebra; the math stays at the level of scalar functions and gradient vectors.
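The two derivations named above can be previewed in a few lines. This is a sketch in standard notation, not necessarily the notation used in the lesson body. It assumes that Z(θ) is finite and that differentiation under the integral sign is permitted; the dummy variable x' is introduced here only for illustration.

```latex
\begin{align*}
p_\theta(x) &= \frac{\exp(-E_\theta(x))}{Z(\theta)},
  \qquad Z(\theta) = \int \exp(-E_\theta(x'))\,dx' \\[4pt]
% theta-gradient: log Z depends on theta, so it survives
\nabla_\theta \log Z(\theta)
  &= \frac{1}{Z(\theta)} \int \nabla_\theta \exp(-E_\theta(x'))\,dx'
   = -\,\mathbb{E}_{x' \sim p_\theta}\!\left[\nabla_\theta E_\theta(x')\right] \\[4pt]
\nabla_\theta \log p_\theta(x)
  &= \underbrace{-\nabla_\theta E_\theta(x)}_{\text{positive phase}}
   \;+\; \underbrace{\mathbb{E}_{x' \sim p_\theta}\!\left[\nabla_\theta E_\theta(x')\right]}_{\text{negative phase}} \\[4pt]
% x-gradient: log Z does not depend on x, so it drops out
\nabla_x \log p_\theta(x)
  &= -\nabla_x E_\theta(x) - \nabla_x \log Z(\theta)
   = -\nabla_x E_\theta(x)
\end{align*}
```

The negative phase is an expectation under the model itself, which is why it calls for sampling; the last line needs no sampling because Z(θ) is constant in x.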
By the end, you’ll be able to
- Write the EBM density p_θ(x) = exp(-E_θ(x))/Z(θ) and explain why the partition function Z(θ) is intractable for any nontrivial neural-network energy
- Derive the maximum-likelihood gradient with its positive and negative phases, and identify the negative phase as the term that requires MCMC sampling from the model
- Describe Langevin dynamics for sampling from an EBM and contrastive divergence as the standard pragmatic compromise (and its biases)
- Derive the score function ∇_x log p_θ(x) = -∇_x E_θ(x) and explain why the partition function vanishes under the x-gradient but not the θ-gradient
- Connect the score-function observation to the next lesson (score matching) and to the modern diffusion paradigm (lessons 12-14)
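As a concrete preview of these objectives, the following is a minimal sketch on a one-dimensional Gaussian energy, one of the few cases where Z has a closed form. It is an illustration under stated assumptions rather than the lesson's own worked example: the parameters `MU` and `SIGMA`, the function names, the step size, and the chain counts are all hypothetical choices. It compares a numerical estimate of Z with the closed form, then runs unadjusted Langevin dynamics using only the score, which never touches Z.

```python
import numpy as np

# Hypothetical parameters for a 1-D Gaussian energy (illustration only).
MU, SIGMA = 2.0, 1.5


def energy(x):
    """E(x) = (x - mu)^2 / (2 sigma^2); p(x) is proportional to exp(-E(x))."""
    return (x - MU) ** 2 / (2.0 * SIGMA ** 2)


def score(x):
    """Score = d/dx log p(x) = -dE/dx. Note that Z does not appear."""
    return -(x - MU) / SIGMA ** 2


def partition_numeric(lo=-20.0, hi=24.0, n=200_001):
    """Riemann-sum estimate of Z = integral of exp(-E(x)) dx.

    Feasible here only because x is one-dimensional.
    """
    xs = np.linspace(lo, hi, n)
    dx = xs[1] - xs[0]
    return float(np.sum(np.exp(-energy(xs))) * dx)


def langevin_sample(n_chains=5000, n_steps=2000, step=0.01, seed=0):
    """Unadjusted Langevin dynamics:

        x <- x + step * score(x) + sqrt(2 * step) * noise

    Uses only the score, so Z is never evaluated. The finite step size
    introduces a small bias in the stationary distribution.
    """
    rng = np.random.default_rng(seed)
    x = np.zeros(n_chains)  # all chains start at 0, away from the mode
    for _ in range(n_steps):
        noise = rng.standard_normal(n_chains)
        x = x + step * score(x) + np.sqrt(2.0 * step) * noise
    return x


Z_closed = np.sqrt(2.0 * np.pi) * SIGMA  # closed form for a Gaussian energy
Z_numeric = partition_numeric()
samples = langevin_sample()

print(f"Z closed form: {Z_closed:.6f}, Z numeric: {Z_numeric:.6f}")
print(f"Langevin sample mean: {samples.mean():.3f}, std: {samples.std():.3f}")
```

The grid-based estimate of Z works only because the input is a single scalar; for a neural-network energy over high-dimensional x, no such grid is available, while the Langevin update still needs nothing beyond the gradient of the energy.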
Time and difficulty
- Read time: about 14 minutes
- Practice time: about 16 minutes (a six-question self-check, a worked Gaussian-energy tractability example, a derivation showing the partition function vanishes under the x-gradient, and flashcards)
- Difficulty: standard (a Phase 3 lesson; two clean derivations, one new sampling concept, no §6 watch since this is pure technique)