Lesson: Latent variables and the ELBO
Phase 1 gave you three likelihood-based paradigms (autoregressive, normalizing flows, and the maximum-likelihood / KL framework that unifies them). All three could write the model log-likelihood exactly. This lesson introduces the fourth paradigm in the likelihood family, the latent-variable model, which is more flexible architecturally but pays for that flexibility with an intractable integral. The mathematical response to that intractability is the evidence lower bound (ELBO): a tractable quantity that never exceeds the model log-likelihood. Maximizing it pushes the model toward forward-KL minimization, though only approximately.
By the end you will be able to derive the ELBO in two lines using Jensen’s inequality, split it into its two interpretable terms (reconstruction + KL), explain exactly how far it is from the model log-likelihood (the gap is itself a KL divergence), and verify all of this numerically on a small example with binary data and a binary latent.
The next lesson takes the ELBO to a real architecture (the variational autoencoder). This lesson is the math under the architecture.
The setup: a hidden code
A latent-variable model says that each observation is generated by first drawing a hidden code from a simple prior, then drawing the observation from a learned conditional (the decoder):
    z ~ p(z)        (prior, e.g. multivariate standard Gaussian)
    x ~ p(x | z)    (learned decoder)
The model distribution over the observation alone is then the integral of the joint over all possible latents:
    p_model(x) = integral over z of p(x | z) · p(z) dz
This is the marginal likelihood (also called the evidence). The “marginal” is over the latent; the “likelihood” is what you would maximize under forward KL from lesson 3. It is intractable in general for two reasons. First, the integral has no closed form when the decoder is a neural network. Second, naive Monte Carlo (averaging the decoder probability over latents drawn from the prior) is unhelpful, because most random latents give tiny decoder probability to any specific observation; you would need exponentially many samples to land near the modes that matter.
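To see the sampling problem concretely, here is a minimal sketch under assumptions that are not part of the lesson: a linear-Gaussian toy model (prior z ~ N(0, I), decoder x | z ~ N(z, 0.25 · I)), chosen only because its marginal has a closed form, N(0, 1.25 · I), to compare against. The function names, the noise variance, the test point, and the sample count are all illustrative. The code estimates log p(x) by averaging decoder probabilities over latents drawn from the prior.

```python
import math
import random


def log_normal_iso(x, mean, var):
    """Log density of an isotropic Gaussian N(mean, var * I) evaluated at x."""
    d = len(x)
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return -0.5 * d * math.log(2 * math.pi * var) - sq_dist / (2 * var)


def exact_log_marginal(x, noise_var):
    # Assumed toy model: z ~ N(0, I), x | z ~ N(z, noise_var * I),
    # so the marginal is x ~ N(0, (1 + noise_var) * I).
    return log_normal_iso(x, [0.0] * len(x), 1.0 + noise_var)


def naive_mc_log_marginal(x, noise_var, n_samples, rng):
    """log of (1/N) * sum_i p(x | z_i), with each z_i drawn from the prior."""
    log_weights = []
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in x]
        log_weights.append(log_normal_iso(x, z, noise_var))
    # Log-mean-exp, computed stably by factoring out the largest term.
    top = max(log_weights)
    total = sum(math.exp(lw - top) for lw in log_weights)
    return top + math.log(total) - math.log(n_samples)


rng = random.Random(0)
for d in (1, 30):
    x = [1.0] * d
    exact = exact_log_marginal(x, 0.25)
    estimate = naive_mc_log_marginal(x, 0.25, 20000, rng)
    print(f"d={d:2d}  exact={exact:9.3f}  naive MC={estimate:9.3f}")
```

With one latent dimension the estimate should land close to the exact value. With 30 dimensions and the same sample budget it is typically far too low, because almost none of the prior samples fall where the decoder gives the observation meaningful probability.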
So the lesson 3 plan, “minimize the empirical negative log-likelihood,” does not directly work for latent-variable models. We need a different objective that is tractable AND that pushes the model in the same direction.
The variational distribution: introduce an encoder
The trick is to introduce a second distribution: a variational distribution, parameterized by another network (the encoder), that approximates the true posterior over latents given the observation. With any such variational distribution (provided it puts positive probability wherever the joint does), we can rewrite the integral:
    log p(x) = log [ integral over z of p(x, z) dz ]
             = log [ integral over z of q(z | x) · (p(x, z) / q(z | x)) dz ]
             = log [ E_{z ~ q(z|x)}[ p(x, z) / q(z | x) ] ]
That is identically equal to the marginal log-likelihood; we have only multiplied and divided by the variational distribution inside the integral and reinterpreted the integral as an expectation under it. Nothing has changed yet.
Jensen’s inequality: turn the log of an expectation into an expectation of a log
The next step is the only non-obvious one in the derivation. Apply Jensen’s inequality: for a concave function, the function of an expectation is at least the expectation of the function (and log is concave). So:
    log E_{z ~ q(z|x)}[ p(x, z) / q(z | x) ] >= E_{z ~ q(z|x)}[ log( p(x, z) / q(z | x) ) ]
The left side is the marginal log-likelihood we cannot compute. The right side is something we can estimate: sample some latents from the variational distribution, evaluate the log-ratio at each, and average. Call the right side the evidence lower bound, the ELBO:
    ELBO(x; q) = E_{z ~ q(z|x)}[ log p(x, z) - log q(z | x) ]
By construction, the ELBO is at most the marginal log-likelihood, with equality if and only if the variational distribution equals the true posterior exactly.
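Both steps can be checked by brute force when the latent is discrete, since the expectation is then a finite sum. The sketch below borrows the toy numbers from this lesson's worked example (binary latent, uniform prior, decoder probabilities 0.2 and 0.8 for the observation x = 1); the function names and the set of encoder probabilities tried are illustrative choices.

```python
import math

prior = {0: 0.5, 1: 0.5}       # p(z)
lik_x1 = {0: 0.2, 1: 0.8}      # p(x = 1 | z)


def log_of_expectation(q1):
    """log E_q[ p(x, z) / q(z | x) ]: the rewritten marginal log-likelihood."""
    q = {0: 1.0 - q1, 1: q1}
    return math.log(sum(q[z] * (lik_x1[z] * prior[z] / q[z]) for z in (0, 1)))


def elbo(q1):
    """E_q[ log p(x, z) - log q(z | x) ]: the Jensen lower bound."""
    q = {0: 1.0 - q1, 1: q1}
    return sum(q[z] * (math.log(lik_x1[z] * prior[z]) - math.log(q[z]))
               for z in (0, 1))


for q1 in (0.1, 0.5, 0.7, 0.8, 0.9):
    print(f"q(z=1|x=1)={q1:.1f}  "
          f"log E_q[ratio]={log_of_expectation(q1):.4f}  ELBO={elbo(q1):.4f}")
```

The first column of results is the same for every encoder (it is always log p(x = 1)), while the ELBO column moves with the encoder and never rises above it.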
Splitting the ELBO into two interpretable terms
The ELBO as written is one expression. The standard move is to factor the joint as the decoder times the prior and split:
    ELBO(x; q) = E_{z ~ q(z|x)}[ log p(x | z) + log p(z) - log q(z | x) ]
               = E_{z ~ q(z|x)}[ log p(x | z) ] + E_{z ~ q(z|x)}[ log p(z) - log q(z | x) ]
               = E_{z ~ q(z|x)}[ log p(x | z) ] - KL( q(z | x) || p(z) )
The two terms have clean names and clean roles:
Term 1: reconstruction. The expected decoder log-likelihood under the encoder measures how well the decoder reconstructs the observation from latents drawn from the encoder. Maximizing this term pushes the decoder to assign high probability to the true observation when conditioned on encoder-produced latents.
Term 2: KL regularizer. The KL divergence from the encoder to the prior measures how far the encoder’s posterior diverges from the prior. Subtracting it (we want to maximize the ELBO, so this term gets pushed toward zero) keeps the encoder from collapsing the latent space onto a narrow region that the prior would never sample from.
These two pull in different directions. The reconstruction term wants the encoder to give the decoder enough information to rebuild the observation (which would mean a sharp, informative encoder). The KL term wants the encoder to stay close to the prior (which would mean a vague, prior-like encoder). Training balances them.
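The tension between the two terms shows up even in a two-state toy model. The sketch below reuses the numbers from this lesson's worked example (uniform prior over a binary latent, decoder probabilities 0.2 and 0.8 for x = 1) and sweeps the encoder probability from prior-like to sharp; the function name and the sweep values are illustrative.

```python
import math

prior = {0: 0.5, 1: 0.5}       # p(z)
lik_x1 = {0: 0.2, 1: 0.8}      # p(x = 1 | z)


def elbo_terms(q1):
    """Return (reconstruction, KL to prior) for encoder q(z = 1 | x = 1) = q1."""
    q = {0: 1.0 - q1, 1: q1}
    reconstruction = sum(q[z] * math.log(lik_x1[z]) for z in (0, 1))
    kl_to_prior = sum(q[z] * math.log(q[z] / prior[z]) for z in (0, 1))
    return reconstruction, kl_to_prior


for q1 in (0.5, 0.6, 0.7, 0.8, 0.9, 0.99):
    recon, kl = elbo_terms(q1)
    print(f"q1={q1:.2f}  reconstruction={recon:.4f}  KL={kl:.4f}  "
          f"ELBO={recon - kl:.4f}")
```

As the encoder sharpens toward z = 1, the reconstruction term keeps improving while the KL penalty keeps growing. Their difference peaks in between, at the encoder value that balances the two.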
The gap: how far is the ELBO from the log-likelihood?
Jensen’s inequality gives a bound but does not tell you how tight. The gap turns out to have a clean form. Expanding the true posterior with Bayes’ rule and rearranging gives:
    log p(x) - ELBO(x; q) = KL( q(z | x) || p(z | x) )
The gap is the KL divergence from the variational posterior (the encoder’s approximation) to the true posterior (what the model would induce if we could compute it). The gap is zero exactly when the encoder matches the true posterior, the case in which Jensen’s inequality is tight.
This is why the ELBO is “the closest thing to forward-KL minimization the latent-variable paradigm allows.” Because the ELBO equals the model log-likelihood minus the gap, every increase in the ELBO comes from raising the log-likelihood (the goal we cannot reach directly), from pulling the variational posterior closer to the true posterior (closing the bound’s gap), or from both. For the encoder the split is clean: the log-likelihood does not depend on the encoder, so any ELBO improvement from updating the encoder alone is pure gap reduction. Decoder updates can move both quantities, and training optimizes their difference rather than each one separately.
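The gap identity follows in a few lines from the definition of KL divergence and Bayes' rule, p(z | x) = p(x, z) / p(x). A sketch of the algebra, in the lesson's symbols:

```latex
\begin{aligned}
\mathrm{KL}\big(q(z \mid x)\,\|\,p(z \mid x)\big)
  &= \mathbb{E}_{z \sim q(z \mid x)}\big[\log q(z \mid x) - \log p(z \mid x)\big] \\
  &= \mathbb{E}_{z \sim q(z \mid x)}\big[\log q(z \mid x) - \log p(x, z) + \log p(x)\big] \\
  &= \log p(x) - \mathbb{E}_{z \sim q(z \mid x)}\big[\log p(x, z) - \log q(z \mid x)\big] \\
  &= \log p(x) - \mathrm{ELBO}(x; q).
\end{aligned}
```

The third line uses the fact that log p(x) does not depend on z, so it passes through the expectation unchanged. Since a KL divergence is never negative, this also gives a second proof that the ELBO is a lower bound, without invoking Jensen's inequality.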
A worked numerical example
Take a binary observation that takes value zero or one and a binary latent that takes value zero or one. Set up the model:
    Prior:    p(z = 0) = p(z = 1) = 0.5
    Decoder:  p(x = 1 | z = 0) = 0.2
              p(x = 1 | z = 1) = 0.8
The model’s marginal probability that the observation is one:
    p(x = 1) = p(x = 1 | z = 0) · p(z = 0) + p(x = 1 | z = 1) · p(z = 1)
             = 0.2 · 0.5 + 0.8 · 0.5
             = 0.5
So the log probability that the observation is one equals natural log of one-half, approximately negative 0.693. This is the target.
The true posterior, by Bayes’ rule:
    p(z = 1 | x = 1) = p(x = 1 | z = 1) · p(z = 1) / p(x = 1)
                     = 0.8 · 0.5 / 0.5
                     = 0.8
Now suppose our encoder gives an imperfect approximation: the encoder’s probability that the latent is one given the observation is one equals 0.7 (close to the true 0.8 but not exact). Compute the ELBO:
    Reconstruction term:
      E_{z ~ q}[ log p(x = 1 | z) ]
        = q(z = 0) · log p(x = 1 | z = 0) + q(z = 1) · log p(x = 1 | z = 1)
        = 0.3 · ln(0.2) + 0.7 · ln(0.8)
        ≈ 0.3 · (-1.6094) + 0.7 · (-0.2231)
        ≈ -0.4828 + -0.1562
        ≈ -0.6390
    KL term:
      KL(q(z | x = 1) || p(z))
        = 0.3 · ln(0.3 / 0.5) + 0.7 · ln(0.7 / 0.5)
        = 0.3 · ln(0.6) + 0.7 · ln(1.4)
        ≈ 0.3 · (-0.5108) + 0.7 · (0.3365)
        ≈ -0.1532 + 0.2355
        ≈ 0.0823
    ELBO = reconstruction - KL = -0.6390 - 0.0823 ≈ -0.7213
Now the gap to the marginal log-probability:
    gap = log p(x = 1) - ELBO = -0.6931 - (-0.7213) ≈ 0.0282
Verify the identity: the gap should equal the KL divergence from the encoder to the true posterior (which assigns 0.2 to the latent being zero and 0.8 to it being one):
    KL(q || p_posterior)
        = 0.3 · ln(0.3 / 0.2) + 0.7 · ln(0.7 / 0.8)
        = 0.3 · ln(1.5) + 0.7 · ln(0.875)
        ≈ 0.3 · (0.40547) + 0.7 · (-0.13353)
        ≈ 0.12164 + -0.09347
        ≈ 0.0282
Match. The identity (the log-likelihood equals the ELBO plus the KL gap) holds numerically, as the derivation required.
The tight case. Now suppose we set the encoder probability that the latent is one given the observation is one to 0.8, exactly the true posterior. Recompute:
    Reconstruction:
        0.2 · ln(0.2) + 0.8 · ln(0.8)
        ≈ 0.2 · (-1.6094) + 0.8 · (-0.2231)
        ≈ -0.3219 + -0.1785
        ≈ -0.5004
    KL(q || prior):
        0.2 · ln(0.4) + 0.8 · ln(1.6)
        ≈ 0.2 · (-0.9163) + 0.8 · (0.4700)
        ≈ -0.1833 + 0.3760
        ≈ 0.1927
    ELBO = -0.5004 - 0.1927 = -0.6931
Exactly the same as the marginal log-likelihood of negative 0.6931. When the variational posterior matches the true posterior, the ELBO equals the log-likelihood. Jensen’s inequality is tight, and the bound becomes exact.
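The whole example fits in a few lines of code, which makes it easy to try other encoder values. This is a minimal sketch using only the numbers stated in the example; the helper names `kl` and `report` are illustrative. It works at full floating-point precision, so the last printed digit can differ from hand arithmetic that rounds intermediate values.

```python
import math

prior = {0: 0.5, 1: 0.5}       # p(z)
lik_x1 = {0: 0.2, 1: 0.8}      # p(x = 1 | z)

# Marginal and true posterior for the observation x = 1.
p_x1 = sum(lik_x1[z] * prior[z] for z in (0, 1))
posterior = {z: lik_x1[z] * prior[z] / p_x1 for z in (0, 1)}


def kl(q, p):
    """KL(q || p) for two distributions over z in {0, 1}."""
    return sum(q[z] * math.log(q[z] / p[z]) for z in (0, 1))


def report(q1):
    """ELBO pieces for an encoder with q(z = 1 | x = 1) = q1."""
    q = {0: 1.0 - q1, 1: q1}
    reconstruction = sum(q[z] * math.log(lik_x1[z]) for z in (0, 1))
    kl_to_prior = kl(q, prior)
    elbo = reconstruction - kl_to_prior
    return {
        "reconstruction": reconstruction,
        "kl_to_prior": kl_to_prior,
        "elbo": elbo,
        "gap": math.log(p_x1) - elbo,
        "kl_to_posterior": kl(q, posterior),
    }


for q1 in (0.7, 0.8):
    print(f"q(z=1|x=1) = {q1}")
    for name, value in report(q1).items():
        print(f"  {name:16s} {value: .4f}")
```

For any encoder value, the `gap` and `kl_to_posterior` entries should agree to floating-point precision, and both should vanish when the encoder equals the true posterior.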
The same identity also produces a fact worth holding onto for later lessons: maximizing the ELBO is the same as maximizing the model log-likelihood minus the KL from the encoder to the true posterior. A single objective rewards a better model and a more accurate encoder together.
Why this matters for the rest of the track
The ELBO is the backbone of the latent-variable paradigm. The next lesson takes it to a concrete architecture (the VAE), shows how to parameterize the decoder and the encoder with neural networks, and introduces the reparameterization trick that makes the ELBO trainable by stochastic gradient descent through Monte Carlo sampling. From there, the same ELBO framework underlies:
Hierarchical VAEs, which stack multiple layers of latent variables and write a chain of ELBOs that telescope.
Diffusion models, surprisingly. Lesson 14 will show that the diffusion training objective, although it does not look like an ELBO at first, is mathematically equivalent to a particular ELBO derived from a multi-step latent-variable model where the latents are the noisy intermediate states.
Many representation-learning systems, where the encoder is the object of interest in itself (the latent code becomes the representation), and the decoder plus KL term are the regularizers that make the representation useful.
So this lesson’s two-line derivation has long arms.
Why this matters when you use AI
Two practical implications.
You cannot compare a VAE to an autoregressive model by likelihood. A VAE gives you the ELBO, which is a lower bound on the model log-likelihood, not the actual value. The gap (typically positive and meaningful) is the KL from the encoder to the true posterior, which depends on how good the encoder is. So a VAE’s reported “likelihood” is not directly comparable to an autoregressive model’s exact likelihood; the VAE’s number is conservative (lower than the truth) by an unknown amount. This is why the cross-paradigm comparison table in the L3 cheatsheet listed VAEs as “Lower bound (ELBO)” while autoregressive and flows are “Exact.”
The two ELBO terms predict VAE behavior. If a VAE is under-using its latent space (the famous “posterior collapse” problem in older VAEs), the KL term has driven the encoder to match the prior exactly, which means the encoder is ignoring the observation. Diagnosing this requires recognizing which ELBO term is responsible. Modern VAE variants (beta-VAE, KL annealing, free bits) adjust the relative weights of the two terms to control this behavior. Reading any VAE training-loss curve is reading the ELBO; recognizing which term is doing what comes from this derivation.
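The effect of reweighting the KL term can be seen on a toy model. The sketch below makes assumptions beyond the lesson: it takes the binary example's numbers (uniform prior, decoder probabilities 0.2 and 0.8 for x = 1), holds the decoder fixed, multiplies the KL term by a weight `beta` in the spirit of beta-VAE, and finds the best encoder probability by grid search. It is not a trained VAE; the names `weighted_objective` and `best_encoder_prob`, the grid, and the weights tried are illustrative.

```python
import math

prior = {0: 0.5, 1: 0.5}       # p(z)
lik_x1 = {0: 0.2, 1: 0.8}      # p(x = 1 | z), held fixed here


def weighted_objective(q1, beta):
    """Reconstruction minus beta times KL(q || prior), for q(z = 1 | x = 1) = q1."""
    q = {0: 1.0 - q1, 1: q1}
    reconstruction = sum(q[z] * math.log(lik_x1[z]) for z in (0, 1))
    kl_to_prior = sum(q[z] * math.log(q[z] / prior[z]) for z in (0, 1))
    return reconstruction - beta * kl_to_prior


def best_encoder_prob(beta, grid_size=999):
    """Grid search over q1 in {0.001, ..., 0.999} for the highest objective."""
    grid = [(i + 1) / (grid_size + 1) for i in range(grid_size)]
    return max(grid, key=lambda q1: weighted_objective(q1, beta))


for beta in (1.0, 4.0, 100.0):
    print(f"beta={beta:6.1f}  best q(z=1|x=1) = {best_encoder_prob(beta):.3f}")
```

With a weight of one the best encoder is the true posterior. As the weight grows, the best encoder drifts toward the prior's 0.5, which is the toy-scale version of an encoder that ignores the observation.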
Common pitfalls
Mistaking the ELBO for the likelihood. The ELBO is a lower bound on the model log-likelihood. It can be made tight in principle (when the encoder matches the true posterior) but never exceeds the true log-likelihood. Quoting an ELBO as a likelihood understates the model.
Forgetting that Jensen’s inequality depends on the concavity of log. The same derivation does not work with arbitrary functions of the expectation; it works specifically because log is concave, which is what fixes the direction of the inequality (the right side is the lower bound, not the upper). Replace log with a convex function and the inequality flips.
Treating the encoder as fixed. The variational distribution is a learnable component; training optimizes the ELBO over BOTH the model parameters (decoder plus prior, though prior is usually fixed) AND the encoder parameters. Forgetting the encoder is also trained is a common bug.
Reading “KL term” as a regularizer alone. The KL term in the ELBO is not a hand-added regularizer; it falls out of the derivation. Reading it as optional (or scaling it without understanding the trade-off) is what produces posterior collapse and other VAE pathologies.
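The pitfall about Jensen's inequality is easy to check by hand. The sketch below uses an arbitrary three-point distribution (the values and probabilities are made up for illustration) and compares a concave function, log, with a convex one, squaring.

```python
import math

# Hypothetical discrete random variable: three positive values and their probabilities.
values = [0.5, 1.0, 4.0]
probs = [0.2, 0.5, 0.3]

mean = sum(p * v for p, v in zip(probs, values))

# Concave case: log of the mean is at least the mean of the logs.
log_of_mean = math.log(mean)
mean_of_log = sum(p * math.log(v) for p, v in zip(probs, values))

# Convex case: the inequality flips, square of the mean is at most the mean of the squares.
square_of_mean = mean ** 2
mean_of_square = sum(p * v ** 2 for p, v in zip(probs, values))

print(f"log:    f(E[X]) = {log_of_mean:.4f}  >=  E[f(X)] = {mean_of_log:.4f}")
print(f"square: f(E[X]) = {square_of_mean:.4f}  <=  E[f(X)] = {mean_of_square:.4f}")
```

Only the concave case yields a lower bound on the function of the expectation, which is the direction the ELBO derivation needs.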
What you should remember
- A latent-variable model has a model distribution defined by an intractable marginal, the integral of the decoder times the prior over all latents. We cannot compute the model log-likelihood directly to train on it, so we maximize a tractable lower bound instead.
- The evidence lower bound (ELBO), derived in two lines using Jensen’s inequality, equals the expected decoder log-likelihood under the encoder minus the KL divergence from the encoder to the prior. Two terms: reconstruction (decoder fits the observation from encoder-produced latents) minus KL (encoder’s posterior stays close to the prior).
- The gap between the ELBO and the model log-likelihood equals the KL from the encoder to the true posterior. Maximizing the ELBO therefore maximizes the model log-likelihood minus that gap, rewarding a better model and a tighter bound together. Worked binary example: with an encoder probability of 0.7 (true posterior 0.8), the ELBO is approximately negative 0.7213, the marginal log-likelihood is negative 0.6931, and the gap is approximately 0.0282, which matches the KL from the encoder to the true posterior.
You now have the math behind the latent-variable paradigm. The next lesson takes the ELBO to a concrete architecture, the variational autoencoder, where the encoder and decoder are neural networks and the reparameterization trick makes the whole thing trainable by stochastic gradient descent.