
Energy-based models, the partition-function problem

This is lesson 10 of Track 19 (Generative Models and Diffusion), and it opens Phase 3 (energy, score, diffusion). By the end you will be able to write the energy-based model (EBM) density and explain why the partition function is intractable for any nontrivial neural-network energy. You will derive the maximum-likelihood gradient and identify the “negative-phase” term that requires sampling from the model, and you will describe Langevin dynamics and contrastive divergence as the standard MCMC-based workarounds, along with their biases. Finally, you will derive the score function and explain why the partition function vanishes under the x-gradient. That score-function observation is the conceptual move that opens the score-matching framework (next lesson) and the modern diffusion paradigm (lessons 12-14). The source curriculum is Stanford CS236 Lecture 11.

As the first lesson of Phase 3 in this 15-lesson track, it is the conceptual bridge from the four paradigms in Phase 1 and Phase 2 to the modern score-based view that dominates Phase 3. The next lesson, Score matching and score-based generation, derives the training objective that builds directly on the score-function observation here; lessons 12-14 then construct full diffusion models as a multi-step score-matching procedure across noise levels.

Prerequisites: all of Phase 1 (especially L3’s KL/forward-KL framework, since EBM maximum likelihood is the L3 objective applied to a paradigm where its gradient cannot be computed exactly) and Phase 2 (especially the likelihood column of the cross-paradigm map). Math background: comfort with expectations, gradient calculus, and one move in vector calculus (differentiating an integral with respect to a parameter). No new probability concepts beyond what Phase 1 introduced.

This lesson is denser than L9 (which was conceptual) but lighter than L8 (which had transport-theory machinery). There are two key derivations: the maximum-likelihood gradient, which introduces the positive/negative-phase split by differentiating log Z(θ), and the score function, a one-step calculation showing that Z vanishes under the x-gradient. A worked Gaussian-energy example shows a case where Z is tractable, and the practice section adds a derivation showing that Z vanishes under ∇_x. No matrix algebra; the math stays at the level of scalar functions and gradient vectors.
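As a preview of the two derivations named above, here is a compact sketch in LaTeX. It assumes the usual regularity conditions that let the θ-gradient pass inside the integral, and it uses p_data for the data distribution, a symbol introduced here for illustration rather than taken from the lesson text.

```latex
\begin{align*}
p_\theta(x) &= \frac{\exp(-E_\theta(x))}{Z(\theta)},
  \qquad Z(\theta) = \int \exp(-E_\theta(x))\,dx \\[4pt]
% Differentiate log Z with respect to theta: the result is an expectation under the model.
\nabla_\theta \log Z(\theta)
  &= \frac{1}{Z(\theta)} \int \nabla_\theta \exp(-E_\theta(x))\,dx
   = -\,\mathbb{E}_{x \sim p_\theta}\!\big[\nabla_\theta E_\theta(x)\big] \\[4pt]
% Maximum-likelihood gradient: positive phase (data) plus negative phase (model samples).
\nabla_\theta\, \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\big[\log p_\theta(x)\big]
  &= \underbrace{-\,\mathbb{E}_{x \sim p_{\mathrm{data}}}\!\big[\nabla_\theta E_\theta(x)\big]}_{\text{positive phase}}
   \;+\; \underbrace{\mathbb{E}_{x \sim p_\theta}\!\big[\nabla_\theta E_\theta(x)\big]}_{\text{negative phase}} \\[4pt]
% Score function: Z depends on theta but not on x, so its x-gradient is zero.
\nabla_x \log p_\theta(x)
  &= -\nabla_x E_\theta(x) - \nabla_x \log Z(\theta)
   = -\nabla_x E_\theta(x)
\end{align*}
```

The contrast between the second and fourth lines is the point of the lesson: log Z(θ) survives the θ-gradient as an expectation over model samples, but it is a constant with respect to x.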

  • Write the EBM density p_θ(x) = exp(-E_θ(x))/Z(θ) and explain why the partition function Z(θ) is intractable for any nontrivial neural-network energy
  • Derive the maximum-likelihood gradient with its positive and negative phases, and identify the negative phase as the term that requires MCMC sampling from the model
  • Describe Langevin dynamics for sampling from an EBM and contrastive divergence as the standard pragmatic compromise (and its biases)
  • Derive the score function ∇_x log p_θ(x) = -∇_x E_θ(x) and explain why the partition function vanishes under the x-gradient but not the θ-gradient
  • Connect the score-function observation to the next lesson (score matching) and to the modern diffusion paradigm (lessons 12-14)
  • Read time: about 14 minutes
  • Practice time: about 16 minutes (a six-question self-check, a worked Gaussian-energy tractability example, a derivation showing the partition function vanishes under the x-gradient, and flashcards)
  • Difficulty: standard (a Phase 3 lesson; two clean derivations, one new sampling concept, no §6 watch since this is pure technique)
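Several of the objectives above can be sketched numerically on a one-dimensional Gaussian energy, the case where Z is tractable. The following is a minimal illustration, not the lesson's own code: the names MU, SIGMA, partition_function_grid, and langevin_sample are hypothetical, and the step size and chain counts are arbitrary choices. It checks Z against its closed form, confirms by finite differences that the score equals -∇_x E regardless of Z, and runs unadjusted Langevin dynamics, which carries a small discretization bias.

```python
import numpy as np

# Hypothetical toy parameters: E(x) = (x - MU)^2 / (2 * SIGMA^2).
MU, SIGMA = 2.0, 0.5


def energy(x):
    """Gaussian energy; exp(-energy) is an unnormalized Gaussian density."""
    return (x - MU) ** 2 / (2 * SIGMA ** 2)


def score(x):
    """Score function: -dE/dx. The partition function never appears."""
    return -(x - MU) / SIGMA ** 2


def partition_function_grid(lo=-10.0, hi=14.0, n=200_001):
    """Approximate Z by a Riemann sum on a 1-D grid.

    This works in one dimension; a grid with n points per axis needs
    n**d points in d dimensions, which is why the approach does not scale.
    """
    xs = np.linspace(lo, hi, n)
    dx = xs[1] - xs[0]
    return float(np.sum(np.exp(-energy(xs))) * dx)


def langevin_sample(n_chains=5000, n_steps=1000, step=0.01, seed=0):
    """Unadjusted Langevin dynamics using only the score.

    Update: x <- x + step * score(x) + sqrt(2 * step) * noise.
    Without a Metropolis correction the stationary distribution is
    only approximately the target (the bias shrinks with the step size).
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n_chains)  # arbitrary initialization
    for _ in range(n_steps):
        noise = rng.normal(size=n_chains)
        x = x + step * score(x) + np.sqrt(2 * step) * noise
    return x


# 1) Tractable Z: the closed form for a Gaussian energy is sqrt(2*pi) * SIGMA.
z_grid = partition_function_grid()
z_exact = float(np.sqrt(2 * np.pi) * SIGMA)

# 2) Score check: a central difference of log p(x) = -E(x) - log Z.
#    log Z is constant in x, so it cancels in the difference.
x0, h = 1.3, 1e-5
log_p = lambda x: -energy(x) - np.log(z_exact)
fd_score = (log_p(x0 + h) - log_p(x0 - h)) / (2 * h)

# 3) Sampling with the score alone.
samples = langevin_sample()

print(f"Z (grid) = {z_grid:.6f}, Z (closed form) = {z_exact:.6f}")
print(f"finite-difference score = {fd_score:.6f}, -dE/dx = {score(x0):.6f}")
print(f"sample mean = {samples.mean():.3f}, sample std = {samples.std():.3f}")
```

The sampler never evaluates Z, which is the practical consequence of the score-function observation. For a neural-network energy the grid step is unavailable, but the Langevin update still runs as long as ∇_x E_θ(x) can be computed.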