Latent variables and the ELBO: brief

What you’ll learn

This is lesson 5 of Track 19 (Generative Models and Diffusion), and it opens Phase 2 (latent-and-adversarial). The three Phase-1 paradigms all computed log p_model(x) exactly; this lesson introduces a fourth paradigm, the latent-variable model, where the marginal p_model(x) = ∫ p(x|z) p(z) dz is generally intractable and the standard mathematical response is the evidence lower bound (ELBO). By the end you will be able to derive the ELBO in two lines using Jensen’s inequality, split it into its two interpretable terms (reconstruction and KL), explain the gap identity log p(x) − ELBO = KL(q || true posterior), and verify all of this numerically on a small binary example. This lesson builds the mathematical machinery; the next lesson applies it to a real architecture (the variational autoencoder).
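
For orientation, the three results named above can be sketched in LaTeX as follows. This is a preview under assumed notation, not the lesson's own derivation: q(z | x) denotes the approximating distribution, p(z | x) the true posterior, p is shorthand for p_model, and q is assumed to be positive wherever p(x | z) p(z) is positive. The first two lines are the Jensen derivation, the third is the reconstruction-plus-KL split, and the fourth is the gap identity.

```latex
\begin{align*}
\log p(x)
  &= \log \int p(x \mid z)\, p(z)\, dz
   = \log \mathbb{E}_{q(z \mid x)}\!\left[ \frac{p(x \mid z)\, p(z)}{q(z \mid x)} \right] \\
  &\ge \mathbb{E}_{q(z \mid x)}\!\left[ \log \frac{p(x \mid z)\, p(z)}{q(z \mid x)} \right]
   =: \mathrm{ELBO}(q; x)
   && \text{(Jensen, since $\log$ is concave)} \\[6pt]
\mathrm{ELBO}(q; x)
  &= \underbrace{\mathbb{E}_{q(z \mid x)}\!\left[ \log p(x \mid z) \right]}_{\text{reconstruction}}
   \;-\; \underbrace{\mathrm{KL}\!\left( q(z \mid x) \,\|\, p(z) \right)}_{\text{KL to the prior}} \\[6pt]
\log p(x) - \mathrm{ELBO}(q; x)
  &= \mathrm{KL}\!\left( q(z \mid x) \,\|\, p(z \mid x) \right) \;\ge\; 0
\end{align*}
```

Because log p(x) does not depend on q, the last line implies that raising the ELBO by changing q lowers the KL gap by exactly the same amount.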

Where this fits

This is lesson 5 of 15, opening Phase 2 (latent-variable and adversarial paradigms). It builds on L3’s forward-KL = NLL framework: the latent-variable paradigm wants the same objective but cannot compute it exactly, so the ELBO is “the closest thing to forward-KL minimization the paradigm allows.” The next lesson, VAE training in practice, takes the ELBO to a concrete encoder-decoder neural network and introduces the reparameterization trick that makes the ELBO trainable by stochastic gradient descent. The ELBO machinery built here is also reused in Phase 3: the diffusion training objective (lessons 12-14) is mathematically equivalent to a particular ELBO over noisy intermediate states.

Before you start

Prerequisites: the previous lesson, “Normalizing flows, change of variables for distributions,” for the encoder-decoder shape (flows are bijective encoder-decoders; VAEs relax that), and L3 (Maximum likelihood and the KL view) for the forward-KL framework that the ELBO adapts to the latent-variable setting. Math background: comfort with expectations, KL divergence (from L3), and a single application of Jensen’s inequality. No new calculus is introduced; the worked example uses only discrete sums.

About the math

The lesson uses one new tool (Jensen’s inequality applied to log, which is concave), plus all the L3 machinery (KL divergence, expectations under different distributions). The derivation is two lines; the consequences are several. A worked binary example uses discrete sums (no integrals) so every step is verifiable by hand, and the practice extends it to a fresh setup so you can run the same identity through different numbers. The math density is comparable to L3.
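
A minimal Python sketch of the kind of hand-checkable verification described above, assuming a single binary latent z and one observed binary outcome x = 1. The numbers (prior, likelihood, and q) are hypothetical and are not the lesson's worked example; the helper names kl and elbo are likewise illustrative.

```python
import math


def kl(q, p):
    """KL(q || p) for two discrete distributions given as lists."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def elbo(q, prior, lik):
    """ELBO = E_q[log p(x|z)] - KL(q || prior), using discrete sums only."""
    recon = sum(qi * math.log(li) for qi, li in zip(q, lik) if qi > 0)
    return recon - kl(q, prior)


# Hypothetical model: z in {0, 1}, observed x = 1.
prior = [0.6, 0.4]    # p(z=0), p(z=1)
lik_x1 = [0.2, 0.9]   # p(x=1 | z=0), p(x=1 | z=1)
q = [0.3, 0.7]        # an arbitrary approximate posterior q(z | x=1)

# Exact quantities, available here only because z takes two values.
joint = [pz * px for pz, px in zip(prior, lik_x1)]   # p(x=1, z)
log_px = math.log(sum(joint))                        # log p(x=1)
posterior = [j / sum(joint) for j in joint]          # p(z | x=1)

elbo_q = elbo(q, prior, lik_x1)
gap = kl(q, posterior)                     # KL(q || true posterior)
elbo_tight = elbo(posterior, prior, lik_x1)  # q set to the true posterior

print("log p(x)        :", log_px)
print("ELBO(q)         :", elbo_q)
print("log p(x) - ELBO :", log_px - elbo_q)
print("KL(q || post)   :", gap)
print("ELBO(posterior) :", elbo_tight)
```

The third and fourth printed values should agree up to floating-point error (the gap identity), and the last value should match log p(x), since the bound is tight when q equals the true posterior.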

By the end, you’ll be able to

Define the latent-variable model and explain why log p_model(x) is intractable in this setting
Derive the ELBO in two lines using Jensen’s inequality, starting from the marginal log-likelihood
Decompose the ELBO into reconstruction and KL terms, explaining what each term penalizes and how they trade off
State and verify the gap identity log p(x) − ELBO = KL(q || true posterior), and explain why maximizing the ELBO with respect to q automatically shrinks the gap
Recognize why a VAE’s reported ELBO is not directly comparable to an autoregressive model’s exact likelihood

Time and difficulty

Read time: about 14 minutes
Practice time: about 16 minutes (a six-question self-check, a compute-the-ELBO-and-the-gap exercise on a fresh binary case, a verify-the-tight-case drill, and flashcards)
Difficulty: standard (a Phase 2 lesson; the math is one Jensen’s-inequality move and the L3 KL machinery applied to a new setting; arithmetic stays on small discrete examples)