Summary: Maximum likelihood and the KL view
The previous lesson said “minimize the negative log-likelihood” without justifying it. This lesson supplies the justification. It reduces to one line: maximum likelihood is the empirical version of minimizing the forward KL divergence from the data distribution to the model. That single derivation explains why every likelihood-based paradigm in this track (autoregressive, flow, VAE-via-ELBO) shares the same training objective. This is the scan-it-in-five-minutes version.
Core ideas
- A generative model’s training problem is to make p_model(x) close to the data distribution p_data(x), given only samples from p_data. The information-theoretic measure of “close” is the KL divergence.
- The KL divergence is KL(p || q) = E_{x ~ p}[ log(p(x)/q(x)) ]. It is non-negative, equals zero only when p = q (almost everywhere), and is asymmetric (KL(p || q) != KL(q || p)). It is not a metric, and the asymmetry has practical consequences: the two directions reward different model behavior.
- The forward KL is KL(p_data || p_model), with data on the left. Expand and split with linearity of expectation: KL(p_data || p_model) = E_{p_data}[log p_data(x)] - E_{p_data}[log p_model(x)]. The first term is -H(p_data), a model-independent constant. The second term is the expected log-likelihood of the model under the data, the only term we can affect.
- Conclusion (the one-line derivation): minimizing forward KL is equivalent to maximizing the expected log-likelihood of the model under the data. On a finite training set, the Monte Carlo estimate is the sample mean (1/N) sum_i log p_model(x_i), so maximum likelihood becomes empirical NLL minimization, the same loss from L2.
- Three names, one objective. Forward KL, cross-entropy H(p_data, p_model) = E_{p_data}[-log p_model], and empirical NLL differ only by constants (the data’s entropy) and a finite-sample approximation. The identity KL(p_data || p_model) = H(p_data, p_model) - H(p_data) links them.
- Worked anchor: p_data = [0.5, 0.5], p_model = [0.7, 0.3] gives KL = 0.5·ln(5/7) + 0.5·ln(5/3) ≈ -0.168 + 0.255 ≈ 0.087 nats. Match the data (p_model = [0.5, 0.5]) and KL = 0 exactly.
- Why forward KL, not reverse. Forward KL takes its expectation under p_data (estimated for free from samples) and only requires evaluating p_model (which we can do). Reverse KL would need log p_data(x), which we cannot evaluate. Forward KL is also mass-covering (it penalizes assigning low probability where the data has high probability), the usual generative-modeling preference. Reverse KL is mode-seeking (it concentrates on a few modes), a different qualitative behavior.
- Across the four paradigms: autoregressive, flow, and VAE-via-ELBO all train on forward KL (the last as a lower bound, the ELBO); GANs explicitly do not (adversarial divergence: JS or Wasserstein); diffusion is connected through a chain of equivalences (lesson 14). This is why perplexity (= exp(NLL/token)) and bits-per-pixel are usable cross-model metrics for likelihood-based models, and why GAN samples cannot be compared by likelihood.
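The worked anchor and the identity linking the three names can be checked numerically. The sketch below is illustrative only: the helper names `entropy`, `cross_entropy`, and `kl` are made up for this example and are not from the lesson, and it assumes small discrete distributions given as lists of probabilities.

```python
import math

def entropy(p):
    """Shannon entropy H(p) in nats (hypothetical helper)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross-entropy H(p, q) = E_{x ~ p}[-log q(x)] in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """KL(p || q) = E_{x ~ p}[log(p(x) / q(x))] in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# The distributions from the worked anchor.
p_data = [0.5, 0.5]
p_model = [0.7, 0.3]

forward = kl(p_data, p_model)   # data on the left
reverse = kl(p_model, p_data)   # model on the left
via_identity = cross_entropy(p_data, p_model) - entropy(p_data)

print(f"forward KL           = {forward:.4f} nats")
print(f"reverse KL           = {reverse:.4f} nats")
print(f"H(p, q) - H(p)       = {via_identity:.4f} nats")
print(f"KL when model = data = {kl(p_data, p_data):.4f} nats")
```

The forward value agrees with the hand calculation of roughly 0.087 nats and with the cross-entropy-minus-entropy route. The reverse value comes out different, which is the asymmetry in a two-outcome case.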
What changes for you
Before this lesson, “maximum likelihood” and “cross-entropy” and “KL minimization” probably felt like three different things, with NLL as a black-box loss inherited from the deep-learning toolkit. Now they are one objective seen from three sides, with NLL the empirical-sample form of forward KL minimization. When you next read a paper that talks about “minimizing the forward KL” or “minimizing the cross-entropy” or “maximizing the log-likelihood,” you can translate freely. When you compare two language models by perplexity, you know what makes the comparison meaningful (a shared forward-KL objective on shared data) and why it does not transfer across paradigms (a GAN defines no likelihood to compare). The next lesson carries this view of NLL as forward-KL minimization over to normalizing flows, where the change-of-variables formula from Track 4 makes p_model(x) exactly evaluable through an invertible transformation, and then trains the flow with the same loss.
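To make “NLL is the empirical-sample form of forward KL” concrete, here is a minimal sketch that reuses the two-outcome anchor from the core ideas. Everything in it is an assumption for illustration: the sample size `N`, the fixed seed, and the treatment of each sample as a single “token” for the perplexity line. It draws samples from p_data, averages -log p_model over them, and compares the result with the exact cross-entropy.

```python
import math
import random

random.seed(0)  # fixed seed so the run is repeatable (illustrative choice)

p_data = [0.5, 0.5]    # true distribution; in practice it is only sampled, never evaluated
p_model = [0.7, 0.3]   # model distribution, which we can evaluate

# Draw N samples from p_data. A real training set plays this role.
N = 100_000
samples = random.choices([0, 1], weights=p_data, k=N)

# Empirical NLL: the sample mean of -log p_model(x_i).
nll = -sum(math.log(p_model[x]) for x in samples) / N

# Exact quantities, computable here only because p_data is known in this toy setup.
cross_entropy = -sum(p * math.log(q) for p, q in zip(p_data, p_model))
entropy = -sum(p * math.log(p) for p in p_data)

print(f"empirical NLL       = {nll:.4f} nats")
print(f"exact cross-entropy = {cross_entropy:.4f} nats")
print(f"NLL - H(p_data)     = {nll - entropy:.4f} nats (estimate of forward KL)")
print(f"perplexity          = {math.exp(nll):.4f} (per sample)")
```

With this many samples the empirical NLL should sit close to the exact cross-entropy. Subtracting the data entropy, a constant the model cannot change, recovers an estimate of the forward KL.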