References: Latent variables and the ELBO

Source curricula (multi-source structural mirror; cited as further study):
PRIMARY (this lesson follows its framing most directly)
• Stanford CS236, "Deep Generative Models", Lecture 5: Variational Autoencoders
Instructor: Stefano Ermon
Course URL: https://deepgenerativemodels.github.io/
Syllabus: https://deepgenerativemodels.github.io/syllabus.html
License: standard course-page link-out; cited as further study
SECONDARY (also contributed to this lesson's framing)
• Berkeley CS294-158, "Deep Unsupervised Learning" (Spring 2024), Lecture 4: Latent Variable Models
Instructors: Pieter Abbeel, Wilson Yan, Kevin Frans, Philipp Wu
Course URL: https://sites.google.com/view/berkeley-cs294-158-sp24/
License: standard course-page link-out; cited as further study
Clawdemy's lessons are original prose that follows the pedagogical arc of these
two courses, anchored on CS236's lecture order with CS294-158 framing pulled in
where its slide deck and recording are stronger. We do not reproduce or
transcribe the lectures; we cite them as the recommended companions. All rights
to the original course materials remain with the respective instructors and
institutions.

A short, durable list. Each link is a specific next step, not a generic pile.

  • “Auto-Encoding Variational Bayes” (Kingma, Welling, 2013). The original VAE paper and the foundational reference for this paradigm. Section 2 derives the variational lower bound (the ELBO) in essentially the form derived in this lesson and, in Section 2.4, introduces the reparameterization trick that the next lesson will cover; Section 3 then applies both to the variational autoencoder itself. Worth reading the introduction and Section 2 even before lesson 6; the paper is famously crisp.

  • “An Introduction to Variational Autoencoders” (Kingma, Welling, 2019). A book-length expansion of the original VAE paper by the same authors. Chapter 2 walks through several alternative ELBO derivations (Jensen’s inequality, importance sampling, free energy in physics, the EM connection), which is useful if a single derivation feels narrow.

Where this sits in the track.

  • Maximum likelihood and the KL view (lesson 3). The ELBO is the latent-variable paradigm’s response to the same forward-KL minimization that lesson 3 derived. Lesson 3’s NLL objective becomes intractable when the marginal p_model(x) requires an integral over the latent; the ELBO is the tractable lower bound we maximize instead. The gap identity log p(x) - ELBO = KL(q || p_posterior) shows that the slack in the bound is itself a KL divergence, so the latent-variable paradigm is still fundamentally about KL minimization, just with a bound replacing the exact objective.

  • VAE training in practice, the reparameterization trick (next lesson). Lesson 6 takes the ELBO from this abstract derivation to a concrete neural-network architecture. The reparameterization trick rewrites the encoder’s sampling step so that gradients can flow through it to the encoder’s parameters, which is what allows the ELBO to be optimized by SGD. Same math, real architecture.

  • Diffusion models (lessons 12-14). Perhaps surprisingly, the standard diffusion training objective can be derived as an ELBO (up to a reweighting of its terms) for a multi-step latent-variable model whose latents are the noisy intermediate states. Lesson 14 makes this connection explicit. The ELBO machinery you build here is reusable in Phase 3.

  • Normalizing flows (previous lesson). Flows and VAEs both map between data and a simpler latent space: a flow transforms a base distribution into the data distribution through an invertible map, while a VAE encodes data into a latent and decodes it back. The difference is the architectural constraint. Flows require invertibility, which gives an exact likelihood but restricts the architecture; VAEs drop that requirement, which frees the architecture but replaces the exact likelihood with the ELBO bound. The two paradigms are points on a trade-off curve.
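
The gap identity quoted in the maximum-likelihood item above can be checked numerically on a toy model. The sketch below is illustrative only and is not taken from either course: it assumes a discrete latent with three values and one fixed observation x, so the marginal is a plain sum rather than an integral. The function name `elbo_gap_check` and all the probability values are made up for the example.

```python
import math

def elbo_gap_check(prior, likelihood, q):
    """Return (log p(x), ELBO, KL(q || posterior)) for a discrete latent z.

    prior[k]      = p(z = k)
    likelihood[k] = p(x | z = k) for one fixed observation x
    q[k]          = approximate posterior q(z = k | x), strictly positive here
    """
    joint = [p * l for p, l in zip(prior, likelihood)]   # p(x, z = k)
    px = sum(joint)                                      # exact marginal p(x)
    posterior = [j / px for j in joint]                  # true posterior p(z | x)

    # ELBO = E_q[ log p(x, z) - log q(z) ]
    elbo = sum(qz * (math.log(j) - math.log(qz)) for qz, j in zip(q, joint))
    # KL(q || posterior) = E_q[ log q(z) - log p(z | x) ]
    kl = sum(qz * (math.log(qz) - math.log(pz)) for qz, pz in zip(q, posterior))
    return math.log(px), elbo, kl

# Hypothetical toy numbers, chosen only for illustration.
prior = [0.5, 0.3, 0.2]
likelihood = [0.10, 0.40, 0.70]
q = [0.2, 0.3, 0.5]

log_px, elbo, kl = elbo_gap_check(prior, likelihood, q)
print("log p(x)        :", log_px)
print("ELBO            :", elbo)
print("log p(x) - ELBO :", log_px - elbo)
print("KL(q || post)   :", kl)
```

The last two printed values should agree up to floating-point error, and passing the true posterior as `q` should drive the gap to zero, which is the sense in which maximizing the ELBO over q is still a KL minimization.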