References: Latent variables and the ELBO

Source material

Source curricula (multi-source structural mirror; cited as further study):

PRIMARY (this lesson follows its framing most directly)
• Stanford CS236, "Deep Generative Models", Lecture 5: Variational Autoencoders
  Instructor: Stefano Ermon
  Course URL: https://deepgenerativemodels.github.io/
  Syllabus: https://deepgenerativemodels.github.io/syllabus.html
  License: standard course-page link-out; cited as further study

SECONDARY (also contributed to this lesson's framing)
• Berkeley CS294-158, "Deep Unsupervised Learning" (Spring 2024), Lecture 4: Latent Variable Models
  Instructors: Pieter Abbeel, Wilson Yan, Kevin Frans, Philipp Wu
  Course URL: https://sites.google.com/view/berkeley-cs294-158-sp24/
  License: standard course-page link-out; cited as further study

Clawdemy's lessons are original prose that follows the pedagogical arc of these
two courses, anchored on CS236's lecture order with CS294-158 framing pulled in
where its slide deck and recording are stronger. We do not reproduce or
transcribe the lectures; we cite them as the recommended companions. All rights
to the original course materials remain with the respective instructors and
institutions.

Watch this next

Stanford CS236 (Stefano Ermon), course homepage. Lecture 5 (Variational Autoencoders) is the primary anchor; the ELBO derivation in the lecture uses the same Jensen’s-inequality move this lesson walks through. The course notes at deepgenerativemodels.github.io/notes include a written treatment with worked algebra on the gap identity.
Berkeley CS294-158 Sp24 (Pieter Abbeel et al.), course homepage. Lecture 4 (Latent Variable Models) is the secondary anchor. Its slide deck is especially clear on the relationship between the ELBO’s two terms and what each term penalizes when training fails.

Going deeper

A short, durable list. Each link is a specific next step, not a generic pile.

“Auto-Encoding Variational Bayes” (Kingma, Welling, 2013). The original VAE paper and the foundational reference for this paradigm. Section 2.2 introduces the variational lower bound (the ELBO) in essentially the form derived in this lesson, and Section 2.4 introduces the reparameterization trick that the next lesson will cover. Worth reading the introduction and Section 2 before lesson 6; the paper is famously crisp.
“An Introduction to Variational Autoencoders” (Kingma, Welling, 2019). A book-length expansion of the original VAE paper by the same authors. Chapter 2 walks through several alternative ELBO derivations (Jensen’s inequality, importance sampling, free energy in physics, the EM connection), which is useful if a single derivation feels narrow.

Adjacent topics

Where this sits in the track.

Maximum likelihood and the KL view (lesson 3). The ELBO is the latent-variable paradigm’s response to the same forward-KL minimization that lesson 3 derived. Lesson 3’s NLL objective is intractable when the marginal p_model(x) involves an integral over the latent variable; the ELBO is the tractable lower bound we maximize instead. The gap identity log p(x) - ELBO = KL(q || p_posterior), where p_posterior is the true posterior over the latent given x, is itself a KL. Because a KL is never negative, the ELBO never exceeds log p(x), and tightening the bound means shrinking that KL. The latent-variable paradigm is therefore still fundamentally about KL minimization, just with a bound replacing the exact objective.
VAE training in practice, the reparameterization trick (next lesson). Lesson 6 takes the ELBO from this abstract derivation to a concrete neural-network architecture. The reparameterization trick makes the encoder’s stochastic sampling step differentiable, which is what allows the ELBO to be optimized by SGD with gradients reaching the encoder’s parameters. Same math, real architecture.
Diffusion models (lessons 12-14). Perhaps surprisingly, the diffusion training objective can be derived as an ELBO for a multi-step latent-variable model in which the latents are the noisy intermediate states; the commonly used simplified noise-prediction loss is a reweighted form of that bound. Lesson 14 makes this connection explicit. The ELBO machinery you build here is reusable in Phase 3.
Normalizing flows (previous lesson). Flows and VAEs both have an encoder-decoder shape: flows map a base distribution to the data distribution through an invertible transformation; VAEs encode data into a latent and decode it back. The difference is the architectural constraint: flows require invertibility (exact likelihood, restricted architecture); VAEs relax it (flexible architecture, ELBO bound instead of exact likelihood). The two paradigms are points on a trade-off curve.
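
A quick numerical check of the gap identity mentioned under the lesson 3 item above, log p(x) - ELBO = KL(q || p_posterior). This is a minimal sketch, not code from either course: it assumes a toy model with a three-valued discrete latent z and a single fixed observation x, so the marginal is a plain sum rather than an integral. The joint values, the choice of q, and all function names are hypothetical and chosen only for illustration.

```python
import math

def log_marginal(joint):
    """log p(x) = log sum_z p(x, z), for one fixed x and a discrete latent z."""
    return math.log(sum(joint))

def posterior(joint):
    """True posterior p(z | x) = p(x, z) / p(x)."""
    total = sum(joint)
    return [p / total for p in joint]

def elbo(q, joint):
    """ELBO = E_q[log p(x, z) - log q(z)]."""
    return sum(qz * (math.log(pxz) - math.log(qz))
               for qz, pxz in zip(q, joint) if qz > 0)

def kl(q, p):
    """KL(q || p) for two discrete distributions on the same support."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p) if qz > 0)

# Hypothetical toy numbers: p(x, z) for z = 0, 1, 2 at one fixed x.
joint = [0.10, 0.05, 0.25]
# An arbitrary approximate posterior q(z); it only needs to sum to 1.
q = [0.5, 0.2, 0.3]

gap = log_marginal(joint) - elbo(q, joint)
kl_value = kl(q, posterior(joint))

print("log p(x)          :", log_marginal(joint))
print("ELBO              :", elbo(q, joint))
print("log p(x) - ELBO   :", gap)
print("KL(q || posterior):", kl_value)
```

The last two printed values should agree up to floating-point error. Setting q equal to the true posterior closes the gap, so the ELBO equals log p(x) in that case.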