Energy-based models, in brief

What you’ll learn

This is lesson 10 of Track 19 (Generative Models and Diffusion), and it opens Phase 3 (energy-score-diffusion). By the end you will be able to: write the energy-based model (EBM) density; explain why the partition function is intractable for any nontrivial neural-network energy; derive the maximum-likelihood gradient and identify the “negative-phase” term that requires sampling from the model; describe Langevin dynamics and contrastive divergence as the standard MCMC-based workarounds (with their biases); and derive the score function and explain why the partition function drops out under the x-gradient. The score-function observation is the conceptual move that opens the score-matching framework (next lesson) and the modern diffusion paradigm (lessons 12-14). The source curriculum is Stanford CS236 Lecture 11.

Where this fits

This is lesson 10 of 15, the first lesson of Phase 3 (energy, score, diffusion). It is the conceptual bridge from the four paradigms in Phase 1 and Phase 2 to the modern score-based view that dominates Phase 3. The next lesson, Score matching and score-based generation, derives the training objective that builds directly on the score-function observation here; lessons 12-14 then construct full diffusion models as a multi-step score-matching procedure across noise levels.

Before you start

Prerequisites: all of Phase 1 (especially L3’s KL/forward-KL framework, since EBM maximum likelihood is the L3 objective applied to a paradigm where the likelihood cannot be evaluated directly) and Phase 2 (especially the cross-paradigm map’s likelihood column). Math background: comfort with expectations, gradient calculus, and one move in vector calculus (differentiating an integral with respect to a parameter). No new probability concepts beyond what Phase 1 introduced.

About the math

This lesson is denser than L9 (which was conceptual) but lighter than L8 (which had transport-theory machinery). There are two key derivations: the maximum-likelihood gradient, which introduces the positive/negative-phase split by differentiating log Z(θ) with respect to θ, and the score function, a single calculus step showing that the log Z(θ) term vanishes under the x-gradient because Z(θ) does not depend on x. A worked Gaussian-energy example shows a case where Z is tractable; the practice section adds a derivation showing that Z drops out under ∇_x. No matrix algebra; the math stays at the scalar-function and gradient-vector level.
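As a preview, the two derivations named above can be written out in a few lines. This is a compact sketch in the notation of this lesson, not the full argument. The symbol p_data for the data distribution is an assumption introduced here for illustration, and the second line assumes the gradient can be exchanged with the integral.

```latex
\begin{align*}
p_\theta(x) &= \frac{\exp(-E_\theta(x))}{Z(\theta)},
\qquad Z(\theta) = \int \exp(-E_\theta(x))\,dx \\[4pt]
\nabla_\theta \log Z(\theta)
  &= \frac{1}{Z(\theta)} \int \nabla_\theta \exp(-E_\theta(x))\,dx
   = -\,\mathbb{E}_{x \sim p_\theta}\!\left[\nabla_\theta E_\theta(x)\right] \\[4pt]
\nabla_\theta\, \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\log p_\theta(x)\right]
  &= \underbrace{-\,\mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\nabla_\theta E_\theta(x)\right]}_{\text{positive phase}}
   \;+\; \underbrace{\mathbb{E}_{x \sim p_\theta}\!\left[\nabla_\theta E_\theta(x)\right]}_{\text{negative phase}} \\[4pt]
\nabla_x \log p_\theta(x)
  &= -\nabla_x E_\theta(x) - \nabla_x \log Z(\theta)
   = -\nabla_x E_\theta(x)
\end{align*}
```

The negative phase is an expectation under the model itself, which is why it calls for sampling. The last line needs no sampling at all, because Z(θ) is constant in x.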

By the end, you’ll be able to

Write the EBM density p_θ(x) = exp(-E_θ(x))/Z(θ) and explain why the partition function Z(θ) is intractable for any nontrivial neural-network energy
Derive the maximum-likelihood gradient with its positive and negative phases, and identify the negative phase as the term that requires MCMC sampling from the model
Describe Langevin dynamics for sampling from an EBM and contrastive divergence as the standard pragmatic compromise (and its biases)
Derive the score function ∇_x log p_θ(x) = -∇_x E_θ(x) and explain why the partition function vanishes under the x-gradient but not the θ-gradient
Connect the score-function observation to the next lesson (score matching) and to the modern diffusion paradigm (lessons 12-14)
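To make these objectives concrete, here is a minimal sketch in standard-library Python. It uses a one-dimensional Gaussian energy, where Z is known in closed form, so it illustrates the mechanics rather than a neural-network EBM. The function names, the parameters mu and sigma, the step size, and the chain length are all hypothetical choices made for illustration. The sampler is unadjusted Langevin dynamics, so the finite step size introduces a small bias in the samples.

```python
import math
import random


def energy(x, mu=2.0, sigma=1.0):
    """Gaussian energy E(x) = (x - mu)^2 / (2 sigma^2) (illustrative choice)."""
    return (x - mu) ** 2 / (2.0 * sigma ** 2)


def log_density(x, mu=2.0, sigma=1.0):
    """log p(x) = -E(x) - log Z; here Z = sqrt(2 pi) * sigma is tractable."""
    log_z = math.log(math.sqrt(2.0 * math.pi) * sigma)
    return -energy(x, mu, sigma) - log_z


def score(x, mu=2.0, sigma=1.0):
    """Score = -dE/dx. Z does not appear because it does not depend on x."""
    return -(x - mu) / sigma ** 2


def numeric_grad_log_density(x, h=1e-5):
    """Central finite difference of log p(x), which does include log Z."""
    return (log_density(x + h) - log_density(x - h)) / (2.0 * h)


def langevin_sample(n_steps=200, step=0.1, x0=0.0, rng=None):
    """Unadjusted Langevin: x <- x + (step / 2) * score(x) + sqrt(step) * noise."""
    if rng is None:
        rng = random.Random()
    x = x0
    for _ in range(n_steps):
        x = x + 0.5 * step * score(x) + math.sqrt(step) * rng.gauss(0.0, 1.0)
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    # Run independent chains, each started far from the mode at x0 = 0.
    samples = [langevin_sample(rng=rng) for _ in range(2000)]
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    print(f"sample mean {mean:.3f}, sample variance {var:.3f}")
    print(f"score at 0.5: {score(0.5):.4f}")
    print(f"finite difference of log p at 0.5: {numeric_grad_log_density(0.5):.4f}")
```

The sampler only ever calls score, never log_density. That is the point of the score-function observation: the chain can run without knowing Z, even in a case like this one where Z happens to be available.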

Time and difficulty

Read time: about 14 minutes
Practice time: about 16 minutes (a six-question self-check, a worked Gaussian-energy tractability example, a derivation showing the partition function vanishes under the x-gradient, and flashcards)
Difficulty: standard (a Phase 3 lesson; two clean derivations, one new sampling concept, no §6 watch since this is pure technique)