Summary: Energy-based models, the partition-function problem

Phase 3 opens here. The previous four paradigms each had a specific architectural constraint (causality, invertibility, encoder-decoder structure, two-network game). An energy-based model has none, which makes it the most architecturally permissive paradigm in the track, in exchange for one fatal computational problem. The whole lesson reduces to one idea: an EBM defines an unconstrained neural-network energy E_θ(x) but must divide by an intractable partition function Z(θ), which blocks maximum likelihood. Because Z(θ) does not depend on x, it vanishes under the x-gradient, so the score function ∇_x log p_θ(x) = -∇_x E_θ(x) is computable directly. That is the conceptual escape that the next lesson (score matching) and the modern diffusion paradigm build on. This is the scan-it-in-five-minutes version.

  • An energy-based model defines p_θ(x) = exp(-E_θ(x)) / Z(θ) where E_θ(x) is any neural network mapping x to a real number, and Z(θ) = integral over x' of exp(-E_θ(x')) dx' is the partition function. Sign convention: low energy = high probability (Boltzmann). The architectural freedom on E_θ (no causality, invertibility, encoder-decoder, or two-network requirement) is the paradigm’s main appeal.
  • Z(θ) is intractable for any nontrivial neural-network energy. The integral runs over the full d-dimensional data space: grid integration is infeasible past ~5D because the number of grid points grows exponentially in d; naive Monte Carlo with uniform samples lands almost entirely in regions where exp(-E_θ(x)) is negligible, so the estimate has enormous variance; importance sampling needs a proposal already close to p_θ. Maximum likelihood (which needs log p_θ(x) = -E_θ(x) - log Z(θ)) is blocked.
  • One tractable special case: quadratic 1D energy E(x) = 0.5(x−μ)² gives Z = sqrt(2π) and p(x) = N(μ, 1), a Gaussian with mean μ and unit variance (the standard Gaussian when μ = 0). Replace the quadratic with anything richer (quartic, neural network) and Z loses its closed form.
  • The maximum-likelihood gradient is ∇_θ log p_θ(x) = -∇_θ E_θ(x) + E_{x' ~ p_θ}[∇_θ E_θ(x')], where the second term is -∇_θ log Z(θ). Positive phase lowers energy at the data point; negative phase raises energy at samples from the model. The negative phase needs MCMC samples (Langevin dynamics: gradient pull + Gaussian noise), and convergence is slow. Contrastive divergence (CD-k) runs MCMC for only k steps, initialized at data points; the resulting gradient estimate is biased, but its cost is bounded. Training instabilities persist even so.
  • The score-function escape: ∇_x log p_θ(x) = -∇_x E_θ(x), because Z(θ) does not depend on x and vanishes under the x-gradient. The score function is computable from one backward pass through the energy network with NO knowledge of Z. This is the conceptual move that opens the door to score matching (next lesson) and the modern diffusion paradigm (lessons 12-14).
  • Why EBMs are worth knowing: they unify classical model families (RBMs, Ising models, Hopfield networks, energy-view CRFs); they are the conceptual bridge to diffusion (without this lesson, “why does score matching exist?” has no answer); the “energy” framing recurs across deep learning (contrastive losses, some self-supervised learning methods, and DPO-family preference learning can all be read as EBMs in disguise).
  • Why EBMs are not the practical paradigm: MCMC + CD training is slow and unstable compared to other paradigms, and generation requires iterative Langevin sampling, not a single forward pass. Modern systems use the score-based view, which recovers EBM flexibility without the MCMC cost of training.
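
The tractable special case and the score-function escape above can be checked numerically in 1D. The sketch below is illustrative only and is not taken from the lesson: the function names, the integration range, and the double-well quartic energy are all assumptions made for the example. It estimates Z by the trapezoid rule (feasible only because d = 1), then shows that the score computed from the unnormalized energy matches the score of the normalized log-density, so Z plays no role.

```python
import math

def energy_quadratic(x, mu=0.0):
    # E(x) = 0.5 * (x - mu)^2, the tractable case from the summary.
    return 0.5 * (x - mu) ** 2

def energy_quartic(x):
    # Hypothetical double-well energy, chosen only for illustration.
    return x ** 4 - 2.0 * x ** 2

def partition_function_1d(energy, lo=-10.0, hi=10.0, n=200_001):
    """Trapezoid-rule estimate of Z = integral of exp(-E(x)) over [lo, hi]."""
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        x = lo + i * h
        weight = 0.5 if i in (0, n - 1) else 1.0  # half weight at the endpoints
        total += weight * math.exp(-energy(x))
    return total * h

def score_fd(energy, x, eps=1e-5):
    """Score d/dx log p(x) = -dE/dx via central differences; Z never appears."""
    return -(energy(x + eps) - energy(x - eps)) / (2.0 * eps)

def score_of_normalized(energy, z, x, eps=1e-5):
    """Same derivative, but taken on the normalized log p(x) = -E(x) - log Z."""
    log_p = lambda t: -energy(t) - math.log(z)
    return (log_p(x + eps) - log_p(x - eps)) / (2.0 * eps)

z_quad = partition_function_1d(energy_quadratic)
z_quartic = partition_function_1d(energy_quartic)  # numerical value only

print("Z (quadratic):", z_quad, "vs sqrt(2*pi):", math.sqrt(2.0 * math.pi))
print("Z (quartic, numerical only):", z_quartic)
print("score without Z:", score_fd(energy_quartic, 0.5))
print("score with Z:   ", score_of_normalized(energy_quartic, z_quartic, 0.5))
```

The constant log Z cancels in the difference, which is the 1D version of the argument that ∇_x log p_θ(x) = -∇_x E_θ(x). In a real EBM the finite difference would be replaced by one backward pass through the energy network.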

Before this lesson, the modern diffusion paradigm probably felt like an alien arrival without a clear motivation: why do we work with ∇_x log p(x) rather than p(x) directly? Now you have the answer: because p(x) = exp(-E)/Z has an intractable Z that ruins maximum likelihood, but ∇_x log p(x) = -∇_x E(x) cleanly drops Z and is computable in one backward pass. When you next read “the model learns the score function” or “we estimate ∇_x log p_t(x) at each noise level,” you will know exactly which obstacle the score function is dodging. The next lesson formalizes the score-matching training objective; the three diffusion lessons after that build the full modern paradigm on top.
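
To make the "score is enough" point concrete, here is a minimal sketch of unadjusted Langevin dynamics (gradient pull + Gaussian noise) that draws approximate samples using only the score, never Z. It assumes the quadratic energy E(x) = 0.5(x−μ)² with μ = 2, whose score is -(x−μ). The step size, step count, sample count, and function names are illustrative choices, not values from the lesson.

```python
import math
import random

def score(x, mu=2.0):
    # Score of E(x) = 0.5 * (x - mu)^2, i.e. -dE/dx = -(x - mu).
    return -(x - mu)

def langevin_sample(score_fn, x0, rng, step=0.01, n_steps=1000):
    """Unadjusted Langevin: x <- x + step * score(x) + sqrt(2 * step) * noise."""
    x = x0
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x + step * score_fn(x) + math.sqrt(2.0 * step) * noise
    return x

rng = random.Random(0)  # fixed seed so runs are repeatable
samples = [langevin_sample(score, x0=0.0, rng=rng) for _ in range(500)]

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print("sample mean (target 2.0):", mean)
print("sample variance (target about 1.0):", var)
```

Each sample needs many sequential steps, which illustrates why generation from an EBM is iterative and not a single forward pass. With a finite step size the chain's stationary distribution is only approximately the target, so the sample variance comes out slightly above 1.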