Summary: Latent variables and the ELBO

Phase 2 opens with the latent-variable paradigm. Unlike the three Phase-1 paradigms (autoregressive, MLE/KL, flow), it cannot compute log p_model(x) exactly: the marginal p_model(x) = integral over z of p(x | z) p(z) dz is intractable. The mathematical response is the evidence lower bound (ELBO), and the whole lesson reduces to one line: the ELBO is the tractable lower bound on log p_model(x) that we maximize when the marginal integral is intractable; it splits into reconstruction minus KL-to-prior; and the gap to the true log-likelihood is itself a KL from the variational posterior to the true posterior, so raising the ELBO with respect to the encoder shrinks that gap (to zero only if the encoder can represent the true posterior). This is the scan-it-in-five-minutes version.

  • A latent-variable model says x is generated by drawing a hidden code z ~ p(z) (a simple prior) and then x ~ p(x | z) (a learned decoder). The model’s marginal p_model(x) = integral over z of p(x | z) p(z) dz is intractable for a neural-network decoder: the integral has no closed form, and naive Monte Carlo with z sampled from the prior is impractical because most such latents give tiny p(x | z), so the estimate is dominated by rare samples and has very high variance.
  • To get a trainable objective, we introduce a variational distribution q(z | x) (the encoder, learnable), then apply Jensen’s inequality to bound log p(x) below by an expectation that we CAN compute. Two lines: log p(x) = log E_{q(z|x)}[p(x,z) / q(z|x)] >= E_{q(z|x)}[ log(p(x,z) / q(z|x)) ] = ELBO(x; q). The bound is below the truth because log is concave.
  • Factor p(x, z) = p(x | z) p(z) and the ELBO splits into two interpretable terms: ELBO(x; q) = E_{q(z|x)}[ log p(x | z) ] - KL( q(z | x) || p(z) ). Reconstruction (the decoder fits x from encoder-sampled latents) minus KL regularizer (the encoder stays close to the prior). They pull in opposite directions; training balances them.
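  • The gap identity: log p(x) - ELBO(x; q) = KL( q(z | x) || p(z | x) ). The gap between the bound and the truth is the KL from the variational posterior (encoder) to the true posterior. Because a KL is never negative, this identity also re-proves ELBO <= log p(x). Rearranged, ELBO = log p(x) - KL( q(z | x) || p(z | x) ), so maximizing the ELBO does TWO good things at once: it pushes log p(x) up AND pushes q(z | x) toward p(z | x). For a fixed decoder, log p(x) does not depend on q, so improving the ELBO through the encoder alone is exactly reducing the gap. The bound is tight (gap = 0) exactly when q(z | x) = p(z | x).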
  • Worked binary anchor. Binary x and z, prior p(z) = [0.5, 0.5], decoder p(x=1|z=0)=0.2, p(x=1|z=1)=0.8, so p(x=1) = 0.5 and log p(x=1) ≈ -0.6931 (natural log throughout). True posterior p(z=1|x=1) = 0.8. With imperfect encoder q(z=1|x=1) = 0.7: reconstruction ≈ -0.6390, KL-to-prior ≈ 0.0823, ELBO ≈ -0.7213, and gap = log p(x=1) - ELBO ≈ 0.0282, which equals KL(q || true posterior) ≈ 0.0282. Setting q(z=1|x=1) = 0.8 (the true posterior) gives ELBO = log p(x=1) ≈ -0.6931: the bound is tight.
  • Cross-paradigm consequence. A VAE’s reported “likelihood” is the ELBO, a lower bound on log p_model(x). Quoting it as a likelihood understates the model’s true log-likelihood by the (unknown) gap, so cross-paradigm comparisons with autoregressive models (which give exact likelihood) require care. This is why the L3 cheatsheet listed VAEs as “lower bound (ELBO)” rather than exact.
  • The ELBO is reusable. The next lesson takes it to a real architecture (VAE + the reparameterization trick that makes it trainable by SGD); the diffusion lessons (12-14) will show the diffusion training objective is mathematically equivalent to a particular ELBO over noisy intermediate states. The machinery you build here is the backbone of Phase 2 and a hidden ingredient in Phase 3.
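
The worked binary anchor above is small enough to check by direct enumeration. The following is a minimal sketch in plain Python, not code from the lesson: the names bernoulli_kl, elbo_terms, prior_z1, and lik_x1 are hypothetical and chosen only for this illustration. It computes the two ELBO terms for x = 1 and confirms the gap identity numerically.

```python
import math

def bernoulli_kl(q1, p1):
    """KL( Bernoulli(q1) || Bernoulli(p1) ) in nats; both arguments must lie strictly in (0, 1)."""
    return q1 * math.log(q1 / p1) + (1 - q1) * math.log((1 - q1) / (1 - p1))

# Numbers taken from the worked binary anchor (hypothetical variable names).
prior_z1 = 0.5                 # p(z=1); p(z=0) = 1 - prior_z1
lik_x1 = {0: 0.2, 1: 0.8}      # decoder p(x=1 | z)

# With a binary latent the "intractable" integral is just a two-term sum,
# so the exact marginal and the true posterior are available for comparison.
p_x1 = (1 - prior_z1) * lik_x1[0] + prior_z1 * lik_x1[1]
log_px = math.log(p_x1)
post_z1 = prior_z1 * lik_x1[1] / p_x1      # true posterior p(z=1 | x=1)

def elbo_terms(q_z1):
    """Return (reconstruction, KL-to-prior, ELBO) for x=1 given encoder q(z=1 | x=1) = q_z1."""
    recon = (1 - q_z1) * math.log(lik_x1[0]) + q_z1 * math.log(lik_x1[1])
    kl_prior = bernoulli_kl(q_z1, prior_z1)
    return recon, kl_prior, recon - kl_prior

# Compare the imperfect encoder (0.7) with the true posterior (0.8).
for q_z1 in (0.7, post_z1):
    recon, kl_prior, elbo = elbo_terms(q_z1)
    gap = log_px - elbo                       # left side of the gap identity
    kl_post = bernoulli_kl(q_z1, post_z1)     # right side of the gap identity
    print(f"q={q_z1:.2f}  recon={recon:.4f}  KL-to-prior={kl_prior:.4f}  "
          f"ELBO={elbo:.4f}  gap={gap:.4f}  KL(q||posterior)={kl_post:.4f}")
```

For q = 0.7 the printed gap and KL(q || posterior) columns should agree, and for q = 0.8 both should be zero with the ELBO equal to log p(x=1). Enumeration works here only because z takes two values; with a continuous latent and a neural-network decoder the exact marginal and posterior used for the comparison are not available, which is the situation the ELBO is designed for.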

Before this lesson, the difference between an autoregressive model and a VAE was probably “VAE uses an encoder-decoder” without a precise statement of what the encoder is doing or why VAE training is harder. Now you have it: the encoder is a variational approximation to the true (intractable) posterior, and VAE training maximizes a tractable lower bound (the ELBO) that becomes tight as the encoder gets better. When you next see a VAE training-loss curve, you can read both terms; when you see “posterior collapse” or “beta-VAE” mentioned, you can place them precisely as a KL-term problem and a KL-term reweighting. The next lesson makes this concrete: real neural-network encoder and decoder, the reparameterization trick for trainability, and the VAE on real data.