References: Energy-based models, the partition-function problem

Source curricula (multi-source structural mirror; cited as further study):
PRIMARY (this lesson follows its framing most directly)
• Stanford CS236, "Deep Generative Models", Lecture 11: Energy Based Models
Instructor: Stefano Ermon
Course URL: https://deepgenerativemodels.github.io/
Syllabus: https://deepgenerativemodels.github.io/syllabus.html
License: standard course-page link-out; cited as further study
SECONDARY (CS294-158's energy-based-models material is distributed across the
lecture set rather than concentrated in one dedicated lecture)
• Berkeley CS294-158, "Deep Unsupervised Learning" (Spring 2024)
Instructors: Pieter Abbeel, Wilson Yan, Kevin Frans, Philipp Wu
Course URL: https://sites.google.com/view/berkeley-cs294-158-sp24/
License: standard course-page link-out; cited as further study
Clawdemy's lessons are original prose that follows the pedagogical arc of these
two courses, anchored on CS236's lecture order with CS294-158 framing pulled in
where its slide deck and recording are stronger. We do not reproduce or
transcribe the lectures; we cite them as the recommended companions. All rights
to the original course materials remain with the respective instructors and
institutions.
  • Stanford CS236 (Stefano Ermon), course homepage. Lecture 11 (Energy Based Models) is the primary anchor; it covers the EBM definition, the partition-function obstacle, the maximum-likelihood gradient with positive and negative phases, and contrastive divergence. Lecture 12 continues with the practical training methods.

  • Berkeley CS294-158 Sp24 (Pieter Abbeel et al.), course homepage. CS294-158 covers EBM material across lectures rather than in one dedicated slot; the implicit-models lecture (CS294-158 Lecture 5) and the diffusion lecture (CS294-158 Lecture 6) both touch on the EBM framework when motivating their respective alternatives.

A short, durable list. Each link is a specific next step, not a generic pile.

  • “A Tutorial on Energy-Based Learning” by Yann LeCun, Sumit Chopra, Raia Hadsell, Marc’Aurelio Ranzato, and Fu Jie Huang (2006). The classic LeCun-led tutorial that introduced the modern “energy-based learning” framing for a broad ML audience. It predates the modern deep-learning revival, but the conceptual framework (energy functions, loss functionals, the role of contrastive learning) is largely unchanged. Available through LeCun’s NYU publications page; worth reading after this lesson to see EBM’s organizing principles laid out in book-chapter form.

  • “Implicit Generation and Modeling with Energy-Based Models” (Du and Mordatch, 2019). The paper that revived EBMs for the modern deep-learning era using Langevin-dynamics sampling. Demonstrates EBM training on CIFAR-10 and ImageNet with practical engineering tricks (replay buffer for MCMC chains, gradient clipping). Read after this lesson to see what EBM training looks like once it is scaled to real image datasets.

  • “A Connection Between Score Matching and Denoising Autoencoders” by Pascal Vincent (2011). The paper that connects score matching (next lesson) to denoising autoencoders, which in turn connects to the modern diffusion paradigm. Published in Neural Computation; the underlying observation (denoising = estimating a score) is what makes diffusion models work. Preview reading for L11.

Where this sits in the track.

  • The four-paradigm landscape (lesson 1). Lesson 1 named energy-based models as a related-but-not-listed paradigm in the closer’s footnote (EBMs combine “explicit density with implicit normalization” in a way that does not fit cleanly into the four-paradigm map’s categories). This lesson opens up the EBM framework explicitly; lesson 15 returns to the map and places score-based and diffusion models (the modern EBM descendants) on it.

  • Maximum likelihood and the KL view (lesson 3). L3’s NLL training objective is exactly the objective that the partition function Z makes intractable for EBMs. The L3 cross-paradigm table listed energy-based models as a separate case from forward-KL minimization; this lesson gives the explicit derivation of why (the negative-phase term of the likelihood gradient requires MCMC samples from the model, which contrastive divergence with k steps, CD-k, only approximates).

  • Score matching and score-based generation (next lesson, L11). L11 is the direct payoff of this lesson’s “Z vanishes under x-gradient” observation. Score matching trains a model to estimate the score function ∇_x log p_θ(x) = -∇_x E_θ(x) (a vector field on the data space), bypassing the partition function entirely. The diffusion lessons in L12-L14 build on the score-matching framework.

  • Diffusion models I-III (lessons 12-14). Diffusion models can be derived two ways: as a hierarchical latent-variable ELBO (the L5 framework, extended to many latents), or as a continuous-time score-matching procedure that estimates ∇_x log p_t(x) at each noise level (the score-based framework derived from this lesson and the next). The two derivations turn out to be equivalent, which lesson 14 makes explicit.
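
The “Z vanishes under x-gradient” observation from the score-matching item above can be checked numerically. Because log p_θ(x) = -E_θ(x) - log Z(θ) and Z(θ) does not depend on x, differentiating with respect to x leaves only -∇_x E_θ(x). The sketch below is a minimal illustration under stated assumptions, not code from either course: the double-well energy, the grid bounds, and the names energy, energy_grad, and log_p are all hypothetical, and Z is computed by brute force only because x is one-dimensional here.

```python
import numpy as np

def energy(x):
    # Hypothetical 1-D double-well energy, chosen only for illustration.
    return 0.25 * x**4 - 0.5 * x**2

def energy_grad(x):
    # Analytic derivative dE/dx of the energy above.
    return x**3 - x

# Brute-force estimate of Z = integral of exp(-E(x)) dx on a grid.
# This is only feasible in very low dimensions, which is the whole problem.
grid = np.linspace(-6.0, 6.0, 20001)
dx = grid[1] - grid[0]
Z = np.sum(np.exp(-energy(grid))) * dx

def log_p(x):
    # Normalized log-density: log p(x) = -E(x) - log Z.
    return -energy(x) - np.log(Z)

# Score computed two ways at an arbitrary test point.
x0 = 0.7
h = 1e-5
score_fd = (log_p(x0 + h) - log_p(x0 - h)) / (2 * h)  # finite difference of log p, uses Z
score_from_energy = -energy_grad(x0)                   # -dE/dx, never touches Z

print("Z (grid estimate):    ", Z)
print("score via log p:      ", score_fd)
print("score via -dE/dx:     ", score_from_energy)
```

The two score values agree up to finite-difference error, even though only the first one needed Z. In high dimensions the grid estimate of Z is unavailable, while -∇_x E_θ(x) is still a single backward pass through the energy network; that asymmetry is what score matching exploits.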