Cheatsheet: Latent variables and the ELBO

z ~ p(z) prior over latents (e.g. standard Gaussian)
x ~ p(x | z) learned decoder
p_model(x) = integral over z of p(x|z) · p(z) dz <-- INTRACTABLE

We cannot compute log p_model(x) directly, so we cannot train by minimizing NLL (equivalently, forward KL) the way Phase 1 did. We need a tractable surrogate.
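To make the marginal concrete, here is a minimal sketch of the naive estimator p(x) ≈ (1/N) Σ p(x | z_i) with z_i ~ p(z). The toy model (1-D standard Gaussian prior, Gaussian decoder with mean z and standard deviation `sigma`) and the names `gaussian_pdf` and `mc_marginal` are assumptions for illustration, chosen because the exact marginal N(x; 0, 1 + sigma²) is known in this special case.

```python
import math
import random

def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def mc_marginal(x, sigma, n_samples, seed=0):
    """Naive Monte Carlo estimate of p(x) = E_{z ~ p(z)}[ p(x | z) ]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(0.0, 1.0)             # z ~ p(z) = N(0, 1)
        total += gaussian_pdf(x, z, sigma)  # p(x | z) = N(x; z, sigma^2), assumed toy decoder
    return total / n_samples

x, sigma = 0.5, 0.5
estimate = mc_marginal(x, sigma, n_samples=200_000)
exact = gaussian_pdf(x, 0.0, math.sqrt(1.0 + sigma ** 2))  # closed form, toy model only
print(f"MC estimate: {estimate:.4f}   exact: {exact:.4f}")
```

This works in one dimension. With a neural-network decoder and a high-dimensional z, prior samples rarely land where p(x | z) is non-negligible, so the same estimator typically becomes too noisy to train on.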

The ELBO derivation (Jensen’s inequality, two lines)

Introduce a variational distribution q(z | x) (the encoder, learnable):

log p(x) = log integral q(z|x) · [p(x,z) / q(z|x)] dz
         = log E_{z ~ q(z|x)}[ p(x,z) / q(z|x) ]
        >= E_{z ~ q(z|x)}[ log( p(x,z) / q(z|x) ) ]    (Jensen, log is concave)
         = ELBO(x; q)

So ELBO(x; q) <= log p(x), equality iff q(z | x) = p(z | x).
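The Jensen step can be checked numerically. The sketch below is an illustration under assumptions: `jensen_sides` is a hypothetical helper, z takes four discrete values, and the weights `w` are arbitrary positive numbers standing in for the ratio p(x,z) / q(z|x).

```python
import math
import random

def jensen_sides(q, w):
    """Return (log E_q[w], E_q[log w]) for a discrete q and positive weights w."""
    log_of_mean = math.log(sum(qi * wi for qi, wi in zip(q, w)))
    mean_of_log = sum(qi * math.log(wi) for qi, wi in zip(q, w))
    return log_of_mean, mean_of_log

rng = random.Random(0)
for _ in range(1000):
    raw = [rng.random() + 1e-3 for _ in range(4)]
    q = [r / sum(raw) for r in raw]                  # random distribution over 4 states
    w = [rng.uniform(0.01, 5.0) for _ in range(4)]   # stand-ins for p(x,z) / q(z|x)
    lhs, rhs = jensen_sides(q, w)
    assert lhs >= rhs - 1e-12                        # Jensen: log E[w] >= E[log w]

# Equality case: a constant ratio. When q(z|x) = p(z|x), the ratio
# p(x,z) / q(z|x) equals p(x) for every z, so both sides coincide.
lhs_const, rhs_const = jensen_sides([0.1, 0.2, 0.3, 0.4], [0.5, 0.5, 0.5, 0.5])
print(lhs_const, rhs_const)
```

The constant-ratio case is the "equality iff" condition: the bound is tight exactly when there is no variation inside the expectation for the concave log to penalize.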

Factor p(x, z) = p(x | z) · p(z) inside the ELBO and rearrange:

ELBO(x; q) = E_{z ~ q(z|x)}[ log p(x|z) ] - KL( q(z|x) || p(z) )
             \----- reconstruction -----/   \- KL regularizer -/

Term              Pulls toward                 Push direction
Reconstruction    sharp, informative q(z|x)    Encoder gives decoder enough info to rebuild x
KL regularizer    vague, prior-like q(z|x)     Encoder stays close to the prior

Training balances the two. Posterior collapse (the classical VAE pathology) happens when the KL term dominates: q(z|x) falls back to the prior p(z) for every x, so the encoder ignores x and the latent carries no information.

log p(x) - ELBO(x; q) = KL( q(z|x) || p(z|x) )

The gap is the KL from the variational posterior to the true posterior. Maximizing the ELBO does two things at once: push log p(x) up, and push q(z|x) toward p(z|x) (closing the gap).
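The gap identity follows in a few lines from the ELBO definition and the factorization p(x,z) = p(z|x) · p(x). The derivation below is a sketch that uses only symbols already defined in this cheatsheet:

```latex
\begin{aligned}
\log p(x) - \mathrm{ELBO}(x; q)
  &= \mathbb{E}_{z \sim q(z|x)}\big[\log p(x)\big]
     - \mathbb{E}_{z \sim q(z|x)}\!\left[\log \frac{p(x,z)}{q(z|x)}\right] \\
  &= \mathbb{E}_{z \sim q(z|x)}\!\left[\log \frac{p(x)\, q(z|x)}{p(x,z)}\right] \\
  &= \mathbb{E}_{z \sim q(z|x)}\!\left[\log \frac{q(z|x)}{p(z|x)}\right]
     \qquad \text{since } p(x,z) = p(z|x)\, p(x) \\
  &= \mathrm{KL}\big(q(z|x) \,\|\, p(z|x)\big) \;\ge\; 0 .
\end{aligned}
```

The first line uses the fact that log p(x) does not depend on z, so it can be moved inside the expectation over q.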

Worked binary example (verify the identity numerically)

Setup: x ∈ {0,1}, z ∈ {0,1}, p(z) = [0.5, 0.5], decoder p(x=1 | z=0) = 0.2, p(x=1 | z=1) = 0.8.

  • Marginal: p(x=1) = 0.5 · 0.2 + 0.5 · 0.8 = 0.5, so log p(x=1) = ln(0.5) ≈ -0.6931
  • True posterior: p(z=1 | x=1) = 0.8 · 0.5 / 0.5 = 0.8

Imperfect encoder q(z=1 | x=1) = 0.7:

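Quantity                   Computation                          Value
Reconstruction             0.3·ln(0.2) + 0.7·ln(0.8)            ≈ -0.6390
KL(q || prior)             0.3·ln(0.3/0.5) + 0.7·ln(0.7/0.5)    ≈ 0.0823
ELBO                       reconstruction - KL(q || prior)      ≈ -0.7213
Gap                        log p(x) - ELBO                      ≈ 0.0282
KL(q || true posterior)    0.3·ln(0.3/0.2) + 0.7·ln(0.7/0.8)    ≈ 0.0282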

Gap = KL(q || true posterior), as the identity requires (match within rounding).
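The arithmetic for the imperfect encoder can be reproduced in a few lines. This is a minimal sketch using the numbers from the setup; the variable names and the `kl` helper are illustrative choices, not part of the original.

```python
import math

prior = [0.5, 0.5]     # p(z=0), p(z=1)
lik_x1 = [0.2, 0.8]    # p(x=1 | z=0), p(x=1 | z=1)
q = [0.3, 0.7]         # q(z=0 | x=1), q(z=1 | x=1)

def kl(a, b):
    """KL(a || b) for discrete distributions given as lists."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

p_x1 = sum(pz * l for pz, l in zip(prior, lik_x1))            # marginal p(x=1)
posterior = [pz * l / p_x1 for pz, l in zip(prior, lik_x1)]   # true p(z | x=1)

reconstruction = sum(qz * math.log(l) for qz, l in zip(q, lik_x1))
kl_prior = kl(q, prior)
elbo = reconstruction - kl_prior
gap = math.log(p_x1) - elbo
kl_posterior = kl(q, posterior)

print(f"reconstruction          = {reconstruction:.4f}")
print(f"KL(q || prior)          = {kl_prior:.4f}")
print(f"ELBO                    = {elbo:.4f}")
print(f"gap                     = {gap:.4f}")
print(f"KL(q || true posterior) = {kl_posterior:.4f}")
```

The last two printed values agree to floating-point precision, which is the gap identity evaluated at x = 1.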

Tight encoder q(z=1 | x=1) = 0.8 (= true posterior):

Quantity          Value
Reconstruction    0.2·ln(0.2) + 0.8·ln(0.8) ≈ -0.5004
KL(q || prior)    0.2·ln(0.2/0.5) + 0.8·ln(0.8/0.5) ≈ 0.1927
ELBO              -0.6931
Gap               0

ELBO equals log p(x=1) exactly when q = true posterior. Jensen’s inequality is tight.
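To see that the true posterior is the unique maximizer rather than just one good choice, the ELBO can be swept over the encoder probability q(z=1 | x=1). The sketch below reuses the setup's numbers; the function name `elbo` and the grid resolution are assumptions for illustration.

```python
import math

prior = [0.5, 0.5]     # p(z=0), p(z=1)
lik_x1 = [0.2, 0.8]    # p(x=1 | z=0), p(x=1 | z=1)
log_p_x1 = math.log(sum(pz * l for pz, l in zip(prior, lik_x1)))

def elbo(q1):
    """ELBO(x=1; q) for an encoder with q(z=1 | x=1) = q1, where 0 < q1 < 1."""
    q = [1.0 - q1, q1]
    reconstruction = sum(qz * math.log(l) for qz, l in zip(q, lik_x1))
    kl_prior = sum(qz * math.log(qz / pz) for qz, pz in zip(q, prior))
    return reconstruction - kl_prior

grid = [i / 100 for i in range(1, 100)]   # q1 = 0.01, 0.02, ..., 0.99
best_q1 = max(grid, key=elbo)

for q1 in (0.5, 0.6, 0.7, 0.8, 0.9):
    print(f"q1 = {q1:.1f}   ELBO = {elbo(q1):.4f}   gap = {log_p_x1 - elbo(q1):.4f}")
print("argmax on grid:", best_q1)
```

Every grid point stays at or below log p(x=1), and the gap reaches zero only at q1 = 0.8, the true posterior.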

Connections

  • Lesson 6 (VAE). Same ELBO, parameterized with neural networks; the reparameterization trick makes it trainable.
  • Diffusion (lesson 14). The diffusion training objective is mathematically equivalent to an ELBO for a multi-step latent-variable model whose latents are the noisy intermediate states.
  • Cross-paradigm comparison. A VAE’s reported “likelihood” is the ELBO, a lower bound, so it is not directly comparable to an autoregressive model’s exact likelihood. Quoting an ELBO as a likelihood understates the model.

Common misconceptions

  • ELBO = likelihood. No, ELBO ≤ log p(x); the gap is KL(q || true posterior), generally positive.
  • Jensen with arbitrary functions. The inequality goes the right way only because log is concave; for a convex function it flips, and for a function that is neither there is no guarantee in either direction.
  • q(z|x) is fixed. No, the encoder is learned along with the model; training optimizes over both.
  • KL term as optional regularizer. It falls out of the derivation; it is not added by hand. Scaling it without understanding the trade-off produces posterior collapse and other VAE pathologies.
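The effect of scaling the KL term can be illustrated on the binary example. The sketch below is an assumption-laden toy: it introduces a weight `beta` on the KL term (not part of the original derivation), keeps the decoder fixed, and uses the closed-form maximizer of reconstruction - beta · KL(q || prior), which is q*(z) ∝ p(z) · p(x|z)^(1/beta). The name `optimal_q1` is hypothetical.

```python
import math

prior = [0.5, 0.5]     # p(z=0), p(z=1)
lik_x1 = [0.2, 0.8]    # p(x=1 | z=0), p(x=1 | z=1)

def optimal_q1(beta):
    """q(z=1 | x=1) maximizing reconstruction - beta * KL(q || prior), decoder fixed.

    Setting the derivative to zero (with a normalization constraint) gives
    q*(z) proportional to p(z) * p(x|z) ** (1 / beta).
    """
    unnorm = [pz * l ** (1.0 / beta) for pz, l in zip(prior, lik_x1)]
    return unnorm[1] / sum(unnorm)

for beta in (0.5, 1.0, 4.0, 16.0, 64.0):
    print(f"beta = {beta:5.1f}   optimal q(z=1 | x=1) = {optimal_q1(beta):.4f}")
```

At beta = 1 the optimum is the true posterior 0.8. As beta grows, the optimum drifts toward the prior value 0.5, which is the encoder ignoring x. In a real VAE the decoder adapts as well, so this only shows the direction of the effect.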

The ELBO is the tractable lower bound on log p_model(x) that we maximize when the marginal integral is intractable; it splits into reconstruction minus KL-to-prior, and the gap to the true log-likelihood is itself a KL from the variational posterior to the true posterior, which maximizing the ELBO over q shrinks (to zero only if q can match the true posterior exactly).