Cheatsheet: Maximum likelihood and the KL view
KL divergence (definition)
`KL(p || q) = E_{x ~ p}[ log(p(x) / q(x)) ] = sum_x p(x) · log(p(x) / q(x))`

(Integral for continuous x.)

| Property | Statement |
|---|---|
| Non-negative | `KL(p \|\| q) ≥ 0` |
| Zero only at equality | `KL(p \|\| q) = 0` if and only if `p = q` |
| Asymmetric | `KL(p \|\| q) ≠ KL(q \|\| p)` in general |
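
The three properties can be checked numerically. The following is a minimal sketch in plain Python, assuming discrete distributions given as lists of probabilities; the helper name `kl_divergence` and the two example distributions are illustrative choices, not part of the original cheatsheet.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions given as probability lists."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # convention: 0 * log(0 / q) = 0
        if q_i == 0.0:
            return math.inf  # p puts mass where q has none
        total += p_i * math.log(p_i / q_i)
    return total

# Hypothetical example distributions over three outcomes.
p = [0.6, 0.3, 0.1]
q = [0.2, 0.5, 0.3]

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
kl_pp = kl_divergence(p, p)

print(f"KL(p || q) = {kl_pq:.4f}")  # non-negative
print(f"KL(q || p) = {kl_qp:.4f}")  # differs from KL(p || q)
print(f"KL(p || p) = {kl_pp:.4f}")  # zero at equality
```

Swapping the arguments changes the value, which is why the direction of the divergence has to be stated explicitly.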
The derivation (one line)
`KL(p_data || p_model) = E_{p_data}[ log p_data(x) ] - E_{p_data}[ log p_model(x) ]`

`= -H(p_data) - E_{p_data}[ log p_model(x) ]`

The second term, `E_{p_data}[ log p_model(x) ]`, is the only model-dependent term. The first term is the negative entropy of the data, a model-independent constant. Therefore:

`minimize KL(p_data || p_model) == maximize E_{p_data}[ log p_model(x) ]`
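
The equivalence can be seen on a toy problem. The sketch below assumes a Bernoulli data distribution and a grid of candidate Bernoulli models (both invented for illustration), then picks the best model once by minimum KL and once by maximum expected log-likelihood.

```python
import math

# Assumed toy setup: a two-outcome "data" distribution and a grid of models.
p_data = [0.3, 0.7]
thetas = [i / 100 for i in range(1, 100)]  # candidate values of p_model(first outcome)

def expected_log_lik(p, q):
    """E_{x ~ p}[ log q(x) ]."""
    return sum(p_i * math.log(q_i) for p_i, q_i in zip(p, q))

def kl(p, q):
    """KL(p || q) = E_p[log p] - E_p[log q] = -H(p) - E_p[log q]."""
    return expected_log_lik(p, p) - expected_log_lik(p, q)

best_by_kl = min(thetas, key=lambda t: kl(p_data, [t, 1.0 - t]))
best_by_ll = max(thetas, key=lambda t: expected_log_lik(p_data, [t, 1.0 - t]))

print(best_by_kl, best_by_ll)  # both criteria select the same theta
```

Because the entropy term is the same for every candidate, it shifts all KL values by one constant and cannot change which model wins.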
From expectation to dataset (Monte Carlo)
With N samples `{x_1, ..., x_N}` from `p_data`:

`E_{p_data}[ log p_model(x) ] ≈ (1/N) · sum_i log p_model(x_i)`

Maximizing this is maximum likelihood. Equivalently:

`minimize -(1/N) · sum_i log p_model(x_i)` ← empirical NLL

This is exactly the loss from L2.
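
A small simulation makes the Monte Carlo step concrete. This is a sketch under assumed toy distributions (two outcomes, a fixed model, a fixed random seed), comparing the exact expectation with its sample average.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Assumed toy distributions over two outcomes (index 0 and 1).
p_data = [0.5, 0.5]
p_model = [0.7, 0.3]

# Exact expectation: E_{p_data}[ -log p_model(x) ].
exact_nll = -sum(p * math.log(q) for p, q in zip(p_data, p_model))

# Monte Carlo estimate from N samples drawn from p_data.
N = 20_000
samples = random.choices([0, 1], weights=p_data, k=N)
empirical_nll = -sum(math.log(p_model[x]) for x in samples) / N

print(f"exact:     {exact_nll:.4f}")
print(f"empirical: {empirical_nll:.4f}")
```

The two numbers agree up to sampling noise, which shrinks roughly as `1/sqrt(N)`.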
Three names, one objective
| Name | Formula | Differs from forward KL by |
|---|---|---|
| Forward KL | `KL(p_data \|\| p_model)` | nothing (reference) |
| Cross-entropy | `H(p_data, p_model) = E_{p_data}[-log p_model]` | the additive constant `H(p_data)` |
| Empirical NLL | `-(1/N) sum_i log p_model(x_i)` | the same constant, plus finite-sample MC error (it estimates the cross-entropy) |
Identity: KL(p_data || p_model) = H(p_data, p_model) - H(p_data).
Worked numerical example
x ∈ {A, B}, data uniform p_data = [0.5, 0.5].

| Model | p_model(A) | p_model(B) | KL (nats) |
|---|---|---|---|
| Untrained | 0.7 | 0.3 | 0.5·ln(5/7) + 0.5·ln(5/3) ≈ -0.168 + 0.255 ≈ 0.087 |
| Matched | 0.5 | 0.5 | 0.5·ln(1) + 0.5·ln(1) = 0 |
Positive when mismatched; exactly 0 when matched.
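
The table entries can be reproduced in a few lines. This is a minimal sketch; the function name `kl_nats` is an illustrative choice, and it assumes all probabilities are strictly positive.

```python
import math

def kl_nats(p, q):
    """KL(p || q) in nats; assumes all probabilities are strictly positive."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q))

p_data = [0.5, 0.5]
models = {"Untrained": [0.7, 0.3], "Matched": [0.5, 0.5]}

results = {name: kl_nats(p_data, q) for name, q in models.items()}
for name, value in results.items():
    print(f"{name}: KL = {value:.3f} nats")
# Untrained: KL = 0.087 nats
# Matched: KL = 0.000 nats
```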
Forward vs reverse KL (why forward wins for training)
- Forward KL `KL(p_data || p_model)`: expectations under `p_data` (we have samples); we only need to evaluate `p_model`. Mass-covering: penalizes assigning low probability where data has high probability.
- Reverse KL `KL(p_model || p_data)`: expectations under `p_model` (fine, can sample), but the integrand needs `log p_data(x)` (we cannot evaluate). Mode-seeking: penalizes assigning high probability where data has low probability.
Forward KL fits sample-only access and the usual generative-modeling preference for covering all of p_data.
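
The mass-covering versus mode-seeking contrast shows up when a single-bump model is fit to bimodal data. The sketch below is a toy illustration under several assumptions: a discrete grid `0..60`, a synthetic two-bump `p_data` that can be evaluated exactly (unlike real data), and a brute-force search over a hypothetical family of discretized Gaussians.

```python
import math

GRID = range(61)  # assumed discrete support 0..60
SIGMAS = (1.0, 1.5, 2.0, 3.0, 4.0, 6.0, 8.0, 10.0, 12.0, 15.0)

def discretized_gaussian(mu, sigma):
    """Gaussian bump evaluated on GRID and renormalized to sum to 1."""
    w = [math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) for x in GRID]
    z = sum(w)
    return [w_i / z for w_i in w]

def kl(p, q):
    """KL(p || q) in nats; returns inf if p has mass where q underflows to 0."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue
        if q_i == 0.0:
            return math.inf
        total += p_i * (math.log(p_i) - math.log(q_i))
    return total

# Assumed bimodal "data": equal mixture of bumps centered at 20 and 40.
left = discretized_gaussian(20, 2.0)
right = discretized_gaussian(40, 2.0)
p_data = [0.5 * a + 0.5 * b for a, b in zip(left, right)]

# Single-bump model family, searched by brute force.
candidates = [(mu, sigma) for mu in GRID for sigma in SIGMAS]

forward_fit = min(candidates, key=lambda c: kl(p_data, discretized_gaussian(*c)))
reverse_fit = min(candidates, key=lambda c: kl(discretized_gaussian(*c), p_data))

print("forward KL fit (mu, sigma):", forward_fit)  # wide bump spanning both modes
print("reverse KL fit (mu, sigma):", reverse_fit)  # narrow bump on a single mode
```

The forward fit spreads out to cover both modes, while the reverse fit commits to one of them. The reverse direction is only computable here because the synthetic `p_data` can be evaluated pointwise.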
Which paradigms share this objective
| Paradigm | Trains on forward KL? |
|---|---|
| Autoregressive (L2) | Yes (NLL = empirical forward KL) |
| Normalizing flows (L4) | Yes (exact density + same empirical NLL) |
| VAEs (L5-6) | Yes-but-bounded (the ELBO is a lower bound on the log-likelihood, so training minimizes an upper bound on the NLL; the gap closes when the variational posterior is exact) |
| GANs (L7-8) | No (adversarial divergence: JS, Wasserstein, etc.) |
| Diffusion (L12-14) | Equivalent through a chain (denoising score-matching ↔ weighted KL bound) |
Why it matters for AI
- Cross-architecture comparison. Two autoregressive LLMs are comparable by perplexity = `exp(NLL/token)` (given the same tokenizer and evaluation data); two flow image models by bits-per-pixel. Both are forward-KL estimates, up to the constant `H(p_data)`, in interpretable units. GANs are not comparable this way.
- Diagnosing training. Good training NLL + bad validation NLL = overfit on the empirical KL. Mode collapse is mode-seeking behavior, which forward KL does not encourage.
- Reading papers. “Cross-entropy loss,” “NLL,” and “forward KL minimization” are the same objective in different traditions; recognizing the equivalence lets you bridge papers.
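
The two reporting units mentioned above are simple rescalings of the NLL. The following sketch assumes the NLL is a total in nats summed over tokens or pixels; the function names and the numbers are hypothetical, chosen only to illustrate the conversions. Some papers normalize image NLL per color channel rather than per pixel, so the denominator is a convention to check.

```python
import math

def perplexity(total_nll_nats, num_tokens):
    """Perplexity = exp(average NLL per token), with NLL measured in nats."""
    return math.exp(total_nll_nats / num_tokens)

def bits_per_pixel(total_nll_nats, num_pixels):
    """Convert a total NLL in nats to bits per pixel (nats / ln 2, per pixel)."""
    return total_nll_nats / (num_pixels * math.log(2))

# Hypothetical numbers, not taken from any real model.
ppl = perplexity(total_nll_nats=2302.585, num_tokens=1000)      # about 2.30 nats/token
bpp = bits_per_pixel(total_nll_nats=543.43, num_pixels=28 * 28)  # a 28x28 grayscale image

print(f"perplexity:     {ppl:.2f}")
print(f"bits per pixel: {bpp:.2f}")
```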
Pitfalls to dodge
- KL as a distance. No. KL is asymmetric and not a metric; the choice of forward vs reverse matters.
- Cross-entropy = entropy. No. `H(p, q)` (cross-entropy) depends on both distributions; `H(p)` (entropy) depends only on `p`.
- NLL is a probability. No. NLL is a number in nats (or bits with log_2). It is non-negative for discrete models but can be negative for continuous densities. Only relative NLL values across models/epochs on the same data are meaningful.
- Forgetting the constant. `H(p_data)` does not depend on the model, so absolute NLL does not tell you how much KL remains; differences in NLL between models equal differences in KL, and those are what training acts on.
The one-line version
Maximum likelihood is the empirical version of minimizing the forward KL from data to model, and three of the four paradigms in this track (autoregressive, flow, VAE-via-ELBO) reduce to this same objective in different costumes.