Cheatsheet: Latent variables and the ELBO
The setup
$$
\begin{aligned}
z &\sim p(z) && \text{prior over latents (e.g. standard Gaussian)} \\
x &\sim p(x \mid z) && \text{learned decoder} \\
p_{\text{model}}(x) &= \int p(x \mid z)\, p(z)\, dz && \longleftarrow \text{INTRACTABLE}
\end{aligned}
$$

We cannot compute log p_model(x) directly, so we cannot train by NLL = forward KL minimization the way Phase 1 did. We need a tractable surrogate.
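To make the marginal integral concrete, here is a minimal sketch that estimates it by naive Monte Carlo, sampling z from the prior and averaging p(x | z). The toy decoder p(x | z) = N(z, decoder_std^2) and the names `normal_pdf` and `mc_marginal` are assumptions chosen for illustration; they are not part of the cheatsheet. This 1-D linear-Gaussian model is picked because its marginal is known in closed form, so the estimate can be checked. With a neural decoder and a high-dimensional z there is no closed form, and this prior-sampling estimator typically has too much variance to be useful.

```python
import math
import random

def normal_pdf(x, mean, std):
    """Density of a univariate Gaussian N(mean, std^2) at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def mc_marginal(x, decoder_std, num_samples, rng):
    """Naive Monte Carlo estimate of p(x) = E_{z ~ p(z)}[ p(x | z) ]."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)                 # z ~ p(z) = N(0, 1)
        total += normal_pdf(x, z, decoder_std)  # toy decoder: p(x | z) = N(z, decoder_std^2)
    return total / num_samples

rng = random.Random(0)
x, decoder_std = 1.0, 0.5
estimate = mc_marginal(x, decoder_std, 100_000, rng)

# In this toy model the marginal is known exactly: x ~ N(0, 1 + decoder_std^2).
exact = normal_pdf(x, 0.0, math.sqrt(1.0 + decoder_std ** 2))
print(f"MC estimate: {estimate:.4f}   exact: {exact:.4f}")
```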
The ELBO derivation (Jensen’s inequality, two lines)
Introduce a variational distribution q(z | x) (the encoder, learnable):

$$
\begin{aligned}
\log p(x) &= \log \int q(z \mid x)\, \frac{p(x, z)}{q(z \mid x)}\, dz \\
&= \log \mathbb{E}_{z \sim q(z \mid x)}\!\left[ \frac{p(x, z)}{q(z \mid x)} \right] \\
&\ge \mathbb{E}_{z \sim q(z \mid x)}\!\left[ \log \frac{p(x, z)}{q(z \mid x)} \right] \qquad \text{(Jensen, log is concave)} \\
&= \mathrm{ELBO}(x; q)
\end{aligned}
$$

So ELBO(x; q) <= log p(x), with equality iff q(z | x) = p(z | x).
The reconstruction + KL split
Factor p(x, z) = p(x | z) · p(z) inside the ELBO and rearrange:

$$
\mathrm{ELBO}(x; q) = \underbrace{\mathbb{E}_{z \sim q(z \mid x)}\big[ \log p(x \mid z) \big]}_{\text{reconstruction}} \;-\; \underbrace{\mathrm{KL}\big( q(z \mid x) \,\|\, p(z) \big)}_{\text{KL regularizer}}
$$

| Term | Pulls toward | Effect on the encoder |
|---|---|---|
| Reconstruction | sharp, informative q(z\|x) | Encoder gives decoder enough info to rebuild x |
| KL regularizer | vague, prior-like q(z\|x) | Encoder stays close to the prior |

Training balances the two. Posterior collapse (the classical VAE pathology) happens when the KL term dominates: q(z|x) falls back to the prior p(z) regardless of x, so the encoder ignores x and the latent carries no information about it.
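The KL regularizer can be computed exactly in one common case. The sketch below assumes a diagonal-Gaussian encoder q(z|x) = N(mu, diag(sigma^2)) together with the standard Gaussian prior mentioned in the setup. The encoder family and the function name `kl_diag_gaussian_to_std_normal` are assumptions for illustration, since the cheatsheet does not fix them. It shows the two ends of the trade-off: a prior-like encoder pays zero KL, while a sharp, informative one pays a large KL.

```python
import math

def kl_diag_gaussian_to_std_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions.

    Per dimension: 0.5 * (mu^2 + sigma^2 - 1) - ln(sigma).
    """
    return sum(0.5 * (m * m + s * s - 1.0) - math.log(s) for m, s in zip(mu, sigma))

# Encoder output equal to the prior: no KL cost, but z says nothing about x.
kl_prior_like = kl_diag_gaussian_to_std_normal(mu=[0.0, 0.0], sigma=[1.0, 1.0])

# Sharp encoder output (hypothetical values): informative about x, large KL cost.
kl_sharp = kl_diag_gaussian_to_std_normal(mu=[2.0, -1.0], sigma=[0.1, 0.1])

print(f"prior-like encoder KL: {kl_prior_like:.4f}")
print(f"sharp encoder KL:      {kl_sharp:.4f}")
```

The ELBO for a data point is then the reconstruction term minus this value, so a large KL is only worth paying when it buys a comparable gain in reconstruction.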
The gap identity
$$
\log p(x) - \mathrm{ELBO}(x; q) = \mathrm{KL}\big( q(z \mid x) \,\|\, p(z \mid x) \big)
$$

The gap is the KL from the variational posterior to the true posterior. Maximizing the ELBO does two things at once: push log p(x) up, and push q(z|x) toward p(z|x) (closing the gap).
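The identity follows in a few lines from the factorization p(x, z) = p(z | x) · p(x), a standard step that the cheatsheet states without proof. A sketch of the derivation:

$$
\begin{aligned}
\mathrm{ELBO}(x; q) &= \mathbb{E}_{z \sim q(z \mid x)}\!\left[ \log \frac{p(x, z)}{q(z \mid x)} \right] \\
&= \mathbb{E}_{z \sim q(z \mid x)}\!\left[ \log p(x) + \log \frac{p(z \mid x)}{q(z \mid x)} \right] \\
&= \log p(x) - \mathbb{E}_{z \sim q(z \mid x)}\!\left[ \log \frac{q(z \mid x)}{p(z \mid x)} \right] \\
&= \log p(x) - \mathrm{KL}\big( q(z \mid x) \,\|\, p(z \mid x) \big)
\end{aligned}
$$

The second line uses the fact that log p(x) does not depend on z, so its expectation under q is itself. Because a KL divergence is never negative, this also recovers ELBO(x; q) <= log p(x) without invoking Jensen.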
Worked binary example (verify the identity numerically)
Setup: x ∈ {0,1}, z ∈ {0,1}, p(z) = [0.5, 0.5], decoder p(x=1 | z=0) = 0.2, p(x=1 | z=1) = 0.8.

- Marginal: p(x=1) = 0.5 · 0.2 + 0.5 · 0.8 = 0.5, so log p(x=1) = ln(0.5) ≈ -0.6931
- True posterior: p(z=1 | x=1) = 0.8 · 0.5 / 0.5 = 0.8

Imperfect encoder q(z=1 | x=1) = 0.7:
| Quantity | Computation | Value |
|---|---|---|
| Reconstruction | 0.3·ln(0.2) + 0.7·ln(0.8) | ≈ -0.6390 |
| KL(q \|\| prior) | 0.3·ln(0.3/0.5) + 0.7·ln(0.7/0.5) | ≈ 0.0823 |
| ELBO | reconstruction - KL | ≈ -0.7213 |
| Gap | log p(x) - ELBO | ≈ 0.0282 |
| KL(q \|\| true posterior) | 0.3·ln(0.3/0.2) + 0.7·ln(0.7/0.8) | ≈ 0.0282 |

Gap = KL(q || true posterior), as the identity requires (match within rounding).
Tight encoder q(z=1 | x=1) = 0.8 (= true posterior):
| Quantity | Value |
|---|---|
| Reconstruction | 0.2·ln(0.2) + 0.8·ln(0.8) ≈ -0.5004 |
| KL(q \|\| prior) | 0.2·ln(0.2/0.5) + 0.8·ln(0.8/0.5) ≈ 0.1927 |
| ELBO | -0.6931 |
| Gap | 0 |

ELBO equals log p(x=1) exactly when q = true posterior. Jensen’s inequality is tight.
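The numbers in this worked example can be reproduced with a few lines of standard-library Python. The sketch below recomputes every quantity for both encoders; the helper names `kl` and `elbo_report` are made up for this illustration.

```python
import math

prior = [0.5, 0.5]     # p(z=0), p(z=1)
lik_x1 = [0.2, 0.8]    # p(x=1 | z=0), p(x=1 | z=1)

def kl(q, p):
    """KL(q || p) in nats for two discrete distributions given as lists."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def elbo_report(q_z1):
    """ELBO terms for the observation x=1 under an encoder with q(z=1 | x=1) = q_z1."""
    q = [1.0 - q_z1, q_z1]
    p_x1 = sum(pz * l for pz, l in zip(prior, lik_x1))           # marginal p(x=1)
    posterior = [pz * l / p_x1 for pz, l in zip(prior, lik_x1)]  # true p(z | x=1)
    recon = sum(qi * math.log(l) for qi, l in zip(q, lik_x1))    # E_q[log p(x=1 | z)]
    kl_prior = kl(q, prior)
    elbo = recon - kl_prior
    return {
        "recon": recon,
        "kl_prior": kl_prior,
        "elbo": elbo,
        "gap": math.log(p_x1) - elbo,
        "kl_posterior": kl(q, posterior),
    }

imperfect = elbo_report(0.7)
tight = elbo_report(0.8)
for name, report in [("imperfect", imperfect), ("tight", tight)]:
    print(name, {k: round(v, 4) for k, v in report.items()})
```

For any value of `q_z1` strictly between 0 and 1, the `gap` and `kl_posterior` entries should agree up to floating-point error, which is the gap identity checked numerically.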
Why it matters across the track
- Lesson 6 (VAE). Same ELBO, parameterized with neural networks; the reparameterization trick makes it trainable.
- Diffusion (lesson 14). The diffusion training objective is, up to a per-timestep weighting, an ELBO for a multi-step latent-variable model whose latents are the noisy intermediate states.
- Cross-paradigm comparison. VAE’s reported “likelihood” is the ELBO, a lower bound. Not directly comparable to an autoregressive model’s exact likelihood. Quoting an ELBO as a likelihood understates the model.
Pitfalls to dodge
- ELBO = likelihood. No, ELBO ≤ log p(x); the gap is KL(q || true posterior), generally positive.
- Jensen with arbitrary functions. The inequality goes the right way only because log is concave; for a convex function it flips, and for a general function there is no guarantee either way.
- q(z|x) is fixed. No, the encoder is learned along with the model; training optimizes over both.
- KL term as optional regularizer. It falls out of the derivation; it is not added by hand. Scaling it without understanding the trade-off can produce posterior collapse and other VAE pathologies.
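The direction of Jensen's inequality is easy to check numerically. The sketch below uses an arbitrary two-point distribution (the weights and values are made up for illustration) and compares E[f(W)] with f(E[W]) for a concave function (log) and a convex one (exp).

```python
import math

weights = [0.3, 0.7]   # a probability distribution over two outcomes
values = [0.5, 4.0]    # positive values taken by a random variable W

def expect(f):
    """E[f(W)] under the two-point distribution above."""
    return sum(w * f(v) for w, v in zip(weights, values))

mean = expect(lambda v: v)  # E[W]

# Concave f (log): E[f(W)] <= f(E[W]). This is the direction the ELBO relies on.
concave_holds = expect(math.log) <= math.log(mean)

# Convex f (exp): the inequality reverses, E[f(W)] >= f(E[W]).
convex_holds = expect(math.exp) >= math.exp(mean)

print("E[log W] <= log E[W]:", concave_holds)
print("E[exp W] >= exp E[W]:", convex_holds)
```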
The one-line version
The ELBO is the tractable lower bound on log p_model(x) that we maximize when the marginal integral is intractable; it splits into reconstruction minus KL-to-prior, and the gap to the true log-likelihood is itself a KL from the variational posterior to the true posterior, which maximizing the ELBO over the encoder shrinks.