Summary: Latent variables and the ELBO

Phase 2 opens with the latent-variable paradigm. Unlike the three Phase-1 paradigms (autoregressive, MLE/KL, flow), it cannot compute log p_model(x) exactly: the marginal p_model(x) = integral over z of p(x | z) p(z) dz is intractable. The mathematical response is the evidence lower bound (ELBO), and the whole lesson reduces to one line: the ELBO is the tractable lower bound on log p_model(x) that we maximize when the marginal integral is intractable; it splits into reconstruction minus KL-to-prior; and the gap to the true log-likelihood is itself a KL from the variational posterior to the true posterior, which maximizing the ELBO over the encoder shrinks. This is the scan-it-in-five-minutes version.

Core ideas

A latent-variable model says x is generated by drawing a hidden code z ~ p(z) (a simple prior) and then x ~ p(x | z) (a learned decoder). The model’s marginal p_model(x) = integral over z of p(x | z) p(z) dz is intractable for a neural-network decoder: the integral has no closed form, and naive Monte Carlo (draw z from the prior, average p(x | z)) is impractical because most prior samples give negligible p(x | z), so the average is dominated by a few rare latents and has very high variance.
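To make the Monte Carlo failure concrete, here is a minimal sketch on a toy model that is NOT from the lesson: a 20-dimensional linear-Gaussian model with prior z ~ N(0, I) and decoder x | z ~ N(z, 0.01 I), chosen only because its exact marginal x ~ N(0, 1.01 I) is known in closed form. The helper names, the dimension, the variance, and the sample count are all illustrative assumptions.

```python
import numpy as np

def log_gaussian(x, mean, var):
    """Log density of an isotropic Gaussian N(mean, var * I), reduced over the last axis."""
    d = x.shape[-1]
    sq_dist = np.sum((x - mean) ** 2, axis=-1)
    return -0.5 * (d * np.log(2.0 * np.pi * var) + sq_dist / var)

def naive_log_marginal(x, decoder_var, num_samples, rng):
    """log[(1/K) * sum_k p(x | z_k)] with z_k drawn from the prior N(0, I)."""
    z = rng.standard_normal((num_samples, x.shape[-1]))
    log_lik = log_gaussian(x, z, decoder_var)        # shape (K,)
    m = log_lik.max()                                # log-mean-exp for numerical stability
    return m + np.log(np.mean(np.exp(log_lik - m)))

# Toy setup (assumed for illustration only).
rng = np.random.default_rng(0)
d, decoder_var = 20, 0.01
x = np.ones(d)

# Exact answer: x = z + noise, so x ~ N(0, (1 + decoder_var) * I).
exact = log_gaussian(x, np.zeros(d), 1.0 + decoder_var)
estimate = naive_log_marginal(x, decoder_var, 1000, rng)

print(f"exact log p(x)      = {exact:.2f}")
print(f"naive MC, K = 1000  = {estimate:.2f}")
```

With a sharp decoder, almost no prior sample lands near x, so the naive estimate falls far below the exact value. Sampling z from a distribution that depends on x is what the encoder q(z | x) provides.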
To get a trainable objective, we introduce a variational distribution q(z | x) (the encoder, learnable), then apply Jensen’s inequality to bound log p(x) below by an expectation that we CAN estimate by sampling from q. Two lines: log p(x) = log E_{q(z|x)}[p(x,z) / q(z|x)] >= E_{q(z|x)}[ log(p(x,z) / q(z|x)) ] = ELBO(x; q). The bound is below the truth because log is concave, so the log of an average is at least the average of the logs.
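Written out step by step, the Jensen argument looks like this. It is a sketch under the usual assumption that q(z | x) > 0 wherever p(x, z) > 0, so that multiplying and dividing by q is legitimate.

```latex
\begin{aligned}
\log p(x)
  &= \log \int p(x, z)\, dz
   = \log \int q(z \mid x)\, \frac{p(x, z)}{q(z \mid x)}\, dz \\
  &= \log \mathbb{E}_{q(z \mid x)}\!\left[ \frac{p(x, z)}{q(z \mid x)} \right] \\
  &\ge \mathbb{E}_{q(z \mid x)}\!\left[ \log \frac{p(x, z)}{q(z \mid x)} \right]
   = \mathrm{ELBO}(x; q)
\end{aligned}
```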
Factor p(x, z) = p(x | z) p(z) and the ELBO splits into two interpretable terms: ELBO(x; q) = E_{q(z|x)}[ log p(x | z) ] - KL( q(z | x) || p(z) ). Reconstruction (the decoder fits x from encoder-sampled latents) minus KL regularizer (the encoder stays close to the prior). They pull in opposite directions; training balances them.
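The split is a single step of algebra; a short sketch using the same symbols as above:

```latex
\begin{aligned}
\mathrm{ELBO}(x; q)
  &= \mathbb{E}_{q(z \mid x)}\!\left[ \log p(x \mid z) + \log p(z) - \log q(z \mid x) \right] \\
  &= \mathbb{E}_{q(z \mid x)}\!\left[ \log p(x \mid z) \right]
     - \mathbb{E}_{q(z \mid x)}\!\left[ \log \frac{q(z \mid x)}{p(z)} \right] \\
  &= \mathbb{E}_{q(z \mid x)}\!\left[ \log p(x \mid z) \right]
     - \mathrm{KL}\!\left( q(z \mid x) \,\|\, p(z) \right)
\end{aligned}
```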
The gap identity: log p(x) - ELBO(x; q) = KL( q(z | x) || p(z | x) ). The gap between the bound and the truth is the KL from the variational posterior (encoder) to the true posterior. Since ELBO = log p(x) - gap, maximizing the ELBO does TWO good things at once: pushes log p(x) up AND pushes q(z | x) toward p(z | x). Encoder updates cannot change log p(x), so any ELBO gain from the encoder is a smaller gap; decoder updates can raise log p(x) itself. The bound is tight (gap = 0) exactly when q = p_posterior.
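The gap identity follows from Bayes' rule, p(x, z) = p(z | x) p(x), and the fact that log p(x) does not depend on z. A sketch of the derivation:

```latex
\begin{aligned}
\log p(x)
  &= \mathbb{E}_{q(z \mid x)}\!\left[ \log p(x) \right]
   = \mathbb{E}_{q(z \mid x)}\!\left[ \log \frac{p(x, z)}{p(z \mid x)} \right] \\
  &= \mathbb{E}_{q(z \mid x)}\!\left[ \log \frac{p(x, z)}{q(z \mid x)} \right]
   + \mathbb{E}_{q(z \mid x)}\!\left[ \log \frac{q(z \mid x)}{p(z \mid x)} \right] \\
  &= \mathrm{ELBO}(x; q) + \mathrm{KL}\!\left( q(z \mid x) \,\|\, p(z \mid x) \right)
\end{aligned}
```

Because a KL divergence is never negative, this also gives a second proof that the ELBO is a lower bound, without invoking Jensen's inequality.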
Worked binary anchor. Binary x, z, prior [0.5, 0.5], decoder p(x=1|z=0)=0.2, p(x=1|z=1)=0.8, so p(x=1) = 0.5 and log p(x=1) ≈ -0.6931. True posterior p(z=1|x=1) = 0.8. With imperfect encoder q(z=1|x=1) = 0.7: reconstruction ≈ -0.6390, KL-to-prior ≈ 0.0823, ELBO ≈ -0.7213, so the gap is ≈ 0.0282, which matches KL(q || true posterior) ≈ 0.0282. Setting q(z=1|x=1) = 0.8 (the true posterior) gives ELBO = log p(x=1) ≈ -0.6931 exactly: the bound is tight.
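The binary anchor is small enough to check by brute force. The sketch below recomputes each quantity for the observation x = 1 using only the standard library; the function and variable names are illustrative, not from the lesson.

```python
import math

def kl_bernoulli(q1, p1):
    """KL( Bernoulli(q1) || Bernoulli(p1) ) in nats, with 0 * log 0 treated as 0."""
    total = 0.0
    for q, p in ((1.0 - q1, 1.0 - p1), (q1, p1)):
        if q > 0.0:
            total += q * math.log(q / p)
    return total

def elbo_terms(q1, prior1=0.5, lik0=0.2, lik1=0.8):
    """Reconstruction, KL-to-prior, and ELBO for the observation x = 1.

    q1 = q(z=1 | x=1); lik0 = p(x=1 | z=0); lik1 = p(x=1 | z=1).
    """
    recon = (1.0 - q1) * math.log(lik0) + q1 * math.log(lik1)
    kl_prior = kl_bernoulli(q1, prior1)
    return recon, kl_prior, recon - kl_prior

# Exact quantities, available here because z takes only two values.
prior1, lik0, lik1 = 0.5, 0.2, 0.8
p_x1 = (1.0 - prior1) * lik0 + prior1 * lik1   # marginal p(x=1)
log_px = math.log(p_x1)
post1 = prior1 * lik1 / p_x1                   # true posterior p(z=1 | x=1)

for q1 in (0.7, post1):
    recon, kl_prior, elbo = elbo_terms(q1)
    gap = log_px - elbo
    kl_post = kl_bernoulli(q1, post1)
    print(f"q1={q1:.1f}  recon={recon:.4f}  KL_prior={kl_prior:.4f}  "
          f"ELBO={elbo:.4f}  gap={gap:.4f}  KL(q||post)={kl_post:.4f}")
```

The gap and KL(q || posterior) columns should agree for every q1, which is the gap identity in numerical form.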
Cross-paradigm consequence. A VAE’s reported “likelihood” is the ELBO, a lower bound on log p_model(x). Quoting it as a likelihood understates the model; cross-paradigm comparisons with autoregressive models (which give exact likelihood) require care. This is why the L3 cheatsheet listed VAEs as “lower bound (ELBO)” rather than exact.
The ELBO is reusable. The next lesson takes it to a real architecture (VAE + the reparameterization trick that makes it trainable by SGD); the diffusion lessons (12-14) will show the diffusion training objective is mathematically equivalent to a particular ELBO over noisy intermediate states. The machinery you build here is the backbone of Phase 2 and a hidden ingredient in Phase 3.

What changes for you

Before this lesson, the difference between an autoregressive model and a VAE was probably “VAE uses an encoder-decoder” without a precise statement of what the encoder is doing or why VAE training is harder. Now you have it: the encoder is a variational approximation to the true (intractable) posterior, and VAE training maximizes a tractable lower bound (the ELBO) that becomes tight as the encoder gets better. When you next see a VAE training-loss curve, you can read both terms; when you see “posterior collapse” or “beta-VAE” mentioned, you can place them precisely as a KL-term problem and a KL-term reweighting. The next lesson makes this concrete: real neural-network encoder and decoder, the reparameterization trick for trainability, and the VAE on real data.