References: Energy-based models, the partition-function problem
Source material
Source curricula (multi-source structural mirror; cited as further study):
PRIMARY (this lesson follows its framing most directly)
• Stanford CS236, "Deep Generative Models", Lecture 11: Energy Based Models
  Instructor: Stefano Ermon
  Course URL: https://deepgenerativemodels.github.io/
  Syllabus: https://deepgenerativemodels.github.io/syllabus.html
  License: standard course-page link-out; cited as further study
SECONDARY (CS294-158's energy-based-models material is distributed across the lecture set rather than concentrated in one dedicated lecture)
• Berkeley CS294-158, "Deep Unsupervised Learning" (Spring 2024)
  Instructors: Pieter Abbeel, Wilson Yan, Kevin Frans, Philipp Wu
  Course URL: https://sites.google.com/view/berkeley-cs294-158-sp24/
  License: standard course-page link-out; cited as further study
Clawdemy's lessons are original prose that follows the pedagogical arc of these two courses, anchored on CS236's lecture order with CS294-158 framing pulled in where its slide deck and recording are stronger. We do not reproduce or transcribe the lectures; we cite them as the recommended companions. All rights to the original course materials remain with the respective instructors and institutions.
Watch this next
- Stanford CS236 (Stefano Ermon), course homepage. Lecture 11 (Energy Based Models) is the primary anchor; it covers the EBM definition, the partition-function obstacle, the maximum-likelihood gradient with positive and negative phases, and contrastive divergence. Lecture 12 continues with the practical training methods.
- Berkeley CS294-158 Sp24 (Pieter Abbeel et al.), course homepage. CS294-158 covers EBM material across lectures rather than in one dedicated slot; the implicit-models lecture (its Lecture 5) and the diffusion lecture (its Lecture 6) both touch the EBM framework when motivating their respective alternatives.
Going deeper
A short, durable list. Each link is a specific next step, not a generic pile.
- “A Tutorial on Energy-Based Learning” by Yann LeCun, Sumit Chopra, Raia Hadsell, Marc’Aurelio Ranzato, and Fu Jie Huang (2006). The classic LeCun-led tutorial that introduced the modern “energy-based learning” framing for a broad ML audience. It predates the modern deep-learning revival, but the conceptual framework (energy functions, loss functionals, the role of contrastive learning) is largely unchanged. Available through LeCun’s NYU publications page; worth reading after this lesson to see EBM’s organizing principles laid out in book-chapter form.
- “Implicit Generation and Modeling with Energy-Based Models” (Du and Mordatch, 2019). The paper that revived EBMs for the modern deep-learning era using Langevin-dynamics sampling. Demonstrates EBM training on CIFAR and ImageNet with practical engineering tricks (replay buffer for MCMC chains, gradient clipping). Read after this lesson to see what EBM training on image data looks like in practice.
- “A Connection Between Score Matching and Denoising Autoencoders” by Pascal Vincent (2011). The paper that connects score matching (next lesson) to denoising autoencoders, which in turn connects to the modern diffusion paradigm. Published in Neural Computation; the underlying observation (denoising amounts to estimating a score) is what makes diffusion models work. Preview reading for L11.
Adjacent topics
Where this sits in the track.
- The four-paradigm landscape (lesson 1). Lesson 1 named energy-based models as a related-but-not-listed paradigm in the closer’s footnote (EBMs combine “explicit density with implicit normalization” in a way that does not fit cleanly into the four-paradigm map’s categories). This lesson opens up the EBM framework explicitly; lesson 15 returns to the map and places score-based and diffusion (the modern EBM descendants) on it.
- Maximum likelihood and the KL view (lesson 3). L3’s NLL training objective is exactly what Z blocks. The L3 cross-paradigm table listed energy-based models as a separate case from forward-KL minimization; this lesson is the explicit derivation of why (the negative-phase term requires MCMC, which CD-k only approximates).
- Score matching and score-based generation (next lesson, L11). L11 is the direct payoff of this lesson’s “Z vanishes under x-gradient” observation. Score matching trains a model to estimate the score function ∇_x log p_θ(x) = -∇_x E_θ(x) (a vector field on the data space), bypassing the partition function entirely. The diffusion lessons in L12-L14 build on the score-matching framework.
- Diffusion models I-III (lessons 12-14). Diffusion models can be derived two ways: as a hierarchical latent-variable ELBO (the L5 framework, extended to many latents), or as a continuous-time score-matching procedure that estimates ∇_x log p_t(x) at each noise level (the score-based framework derived from this lesson and the next). The two derivations turn out to be equivalent, which lesson 14 makes explicit.
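
The “Z vanishes under x-gradient” observation that the score-matching entry relies on can be checked numerically. The sketch below is illustrative only: the one-dimensional double-well energy, the integration range, and all function names are assumptions made for this example, not anything taken from the lessons or the cited courses. It approximates log Z by brute-force quadrature (feasible only because the example is 1-D), then compares a finite-difference derivative of log p(x) = -E(x) - log Z against -dE/dx.

```python
import math

def energy(x):
    """Hypothetical 1-D double-well energy E(x); any smooth function would do."""
    return 0.25 * x**4 - 0.5 * x**2

def grad_energy(x):
    """Analytic derivative dE/dx of the toy energy above."""
    return x**3 - x

def log_partition(lo=-6.0, hi=6.0, n=20001):
    """Approximate log Z = log of the integral of exp(-E(x)) via the trapezoid rule."""
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        weight = 0.5 if i in (0, n - 1) else 1.0  # trapezoid end-point weights
        total += weight * math.exp(-energy(lo + i * h))
    return math.log(total * h)

# log Z is a single constant: it depends on the energy function, not on x.
LOG_Z = log_partition()

def log_prob(x):
    """Normalized log-density: log p(x) = -E(x) - log Z."""
    return -energy(x) - LOG_Z

def score_fd(x, eps=1e-5):
    """Central finite-difference estimate of d/dx log p(x)."""
    return (log_prob(x + eps) - log_prob(x - eps)) / (2 * eps)

if __name__ == "__main__":
    for x in (-1.5, 0.3, 2.0):
        # The two printed values should agree up to finite-difference error.
        print(x, score_fd(x), -grad_energy(x))
```

Because LOG_Z is subtracted identically at x + eps and x - eps, it cancels in the finite difference, so the score can be obtained from the energy alone. In high dimensions the quadrature step is exactly what becomes intractable, while the gradient of the energy remains cheap to compute.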