Summary: Maximum likelihood and the KL view

The previous lesson said “minimize the negative log-likelihood” without justifying it. This lesson supplies the justification, and it reduces to one line: maximum likelihood is the empirical version of minimizing the forward KL divergence from the data distribution to the model. That single derivation explains why every likelihood-based paradigm in this track (autoregressive, flow, VAE-via-ELBO) shares the same training objective. This is the scan-it-in-five-minutes version.

Core ideas

A generative model’s training problem is to make p_model(x) close to the data distribution p_data(x), given only samples from p_data. The information-theoretic measure of “close” is the KL divergence.
The KL divergence is KL(p || q) = E_{x ~ p}[ log(p(x)/q(x)) ]. It is non-negative, equals zero if and only if p = q (almost everywhere), and is asymmetric (KL(p || q) != KL(q || p) in general). It is not a metric: it is neither symmetric nor does it satisfy the triangle inequality, and the choice of direction changes which model is preferred.
The forward KL is KL(p_data || p_model), with data on the left. Expand the log of the ratio and split with linearity of expectation: KL(p_data || p_model) = E_{p_data}[log p_data(x)] - E_{p_data}[log p_model(x)]. The first term is -H(p_data), a model-independent constant. The second term is the negative of the expected log-likelihood of the model under the data, and it is the only term the model can affect.
Conclusion (the one-line derivation): minimizing forward KL is equivalent to maximizing the expected log-likelihood of the model under the data. On a finite training set x_1, ..., x_N, the Monte Carlo estimate of that expectation is the sample mean (1/N) sum_i log p_model(x_i); maximizing it is the same as minimizing the empirical NLL, -(1/N) sum_i log p_model(x_i), the same loss from L2.
Three names, one objective. Forward KL, cross-entropy H(p_data, p_model) = E_{p_data}[-log p_model], and empirical NLL differ only by constants (the data’s entropy) and a finite-sample approximation. The identity KL(p_data || p_model) = H(p_data, p_model) - H(p_data) links them.
Worked anchor: p_data = [0.5, 0.5], p_model = [0.7, 0.3] gives KL = 0.5·ln(5/7) + 0.5·ln(5/3) ≈ -0.168 + 0.255 ≈ 0.087 nats. Match the data (p_model = [0.5, 0.5]) and KL = 0 exactly.
Why forward KL, not reverse. Forward KL takes its expectation under p_data, which we can estimate from samples, and only requires evaluating log p_model(x), which a likelihood-based model provides. Reverse KL, KL(p_model || p_data), takes its expectation under p_model and would need log p_data(x), which we cannot evaluate. Forward KL is also mass-covering (it penalizes assigning low probability where the data has high probability), the usual generative-modeling preference. Reverse KL is mode-seeking (it tends to concentrate on a few modes), a qualitatively different behavior.
Across the paradigms: autoregressive models and flows train on forward KL directly; VAEs train on it through a bound (the ELBO is a lower bound on the log-likelihood, so maximizing it minimizes an upper bound on the NLL); GANs explicitly do not (adversarial divergence: JS or Wasserstein); diffusion is connected through a chain of equivalences (lesson 14). This is why perplexity (= exp(NLL/token)) and bits-per-pixel are usable cross-model metrics for likelihood-based models evaluated on the same data, and why GANs, which expose no likelihood to evaluate, cannot be compared this way.
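
To make the core ideas concrete, here is a minimal Python sketch that reproduces the worked anchor and checks the identities numerically. The helper names (kl_divergence, entropy, cross_entropy), the sample size, and the random seed are illustrative choices, not part of the lesson; the distributions are the ones from the worked anchor.

```python
import math
import random

def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions given as lists."""
    # Terms with p_i == 0 contribute 0 by the usual convention.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """H(p) = -sum_i p_i log p_i, in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i log q_i, in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

p_data = [0.5, 0.5]
p_model = [0.7, 0.3]

# Worked anchor: forward KL is about 0.087 nats.
forward_kl = kl_divergence(p_data, p_model)

# Swapping the arguments gives a different value: KL is asymmetric.
reverse_kl = kl_divergence(p_model, p_data)

# Identity: KL(p_data || p_model) = H(p_data, p_model) - H(p_data).
identity_gap = forward_kl - (cross_entropy(p_data, p_model) - entropy(p_data))

# Empirical NLL: draw samples from p_data and average -log p_model(x_i).
# With enough samples this approaches the cross-entropy, so it differs
# from the forward KL only by the constant H(p_data).
rng = random.Random(0)  # fixed seed, an arbitrary choice for repeatability
samples = rng.choices([0, 1], weights=p_data, k=100_000)
empirical_nll = -sum(math.log(p_model[x]) for x in samples) / len(samples)

print(f"forward KL    = {forward_kl:.4f} nats")
print(f"reverse KL    = {reverse_kl:.4f} nats")
print(f"identity gap  = {identity_gap:.2e}")
print(f"empirical NLL = {empirical_nll:.4f}, cross-entropy = "
      f"{cross_entropy(p_data, p_model):.4f}")
```

Note that the empirical NLL never touches log p_data(x): it needs only samples from the data and evaluations of the model, which is the practical reason the forward direction is the trainable one.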

What changes for you

Before this lesson, “maximum likelihood,” “cross-entropy,” and “KL minimization” probably felt like three different things, with NLL as a black-box loss inherited from the deep-learning toolkit. Now they are one objective seen from three sides, with NLL the empirical-sample form of forward KL minimization. When you next read a paper that talks about “minimizing the forward KL” or “minimizing the cross-entropy” or “maximizing the log-likelihood,” you can translate freely. When you compare two language models by perplexity, you know what makes the comparison meaningful (shared forward-KL objective on shared data) and why it does not transfer across paradigms (a GAN has nothing to compare). The next lesson applies the same objective, NLL as forward-KL minimization, to normalizing flows, where the change-of-variables formula from Track 4 makes p_model(x) exactly evaluable through an invertible transformation that is then trained with the same loss.