Maximum likelihood and the KL view: brief

What you’ll learn

This is lesson 3 of Track 19 (Generative Models and Diffusion), and it gives the theoretical justification for the training objective the previous lesson minimized without explanation. By the end you will be able to define the KL divergence, show in one line of algebra that minimizing the forward KL from the data distribution to the model is the same as maximizing the expected log-likelihood, recognize the empirical NLL you have been minimizing as the Monte Carlo estimate of the negative of that expectation on a finite training set, and translate freely between the three names this same objective travels under (forward KL, cross-entropy, NLL). The derivation is the spine of every likelihood-based paradigm in the track: autoregressive and flow training minimize the forward KL directly, and VAE training minimizes an upper bound on it through the ELBO. GANs explicitly use a different objective; diffusion is connected to this objective only indirectly, through a chain of equivalences.

Where this fits

This is lesson 3 of 15, the third step of Phase 1 (generative foundations). It justifies the NLL objective the previous lesson took on faith. The next lesson, Normalizing flows, applies this same objective to a new architecture: an invertible transformation parameterizes p_model(x) exactly via the change-of-variables formula from Track 4, and the model is trained by minimizing the same empirical NLL, which is forward-KL minimization. The KL view explains why two such different architectures (autoregressive and flow) share an objective, and it sets up the ELBO derivation in lesson 5 as the closest available substitute for forward-KL minimization when the exact log-likelihood is intractable.

Before you start

Prerequisites: the previous lesson (Autoregressive models, factoring by the chain rule), which introduced the NLL loss this lesson justifies. Math background: comfort with expectations (sums weighted by probabilities), logs, and one-line algebraic manipulations (linearity of expectation, splitting log(a/b) into log a - log b). One short Monte Carlo argument is invoked: a sample average estimates an expectation. No calculus is used.
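
As a warm-up for the Monte Carlo argument, here is a minimal sketch of a sample average approximating an expectation. The 3-outcome distributions, the sample size n, and the fixed seed are illustrative assumptions, not values from the lesson.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical 3-outcome data distribution and model (illustration only).
outcomes = [0, 1, 2]
p_data = [0.2, 0.5, 0.3]
p_model = [0.3, 0.4, 0.3]

# Exact expectation E_{x ~ p_data}[-log p_model(x)]: a probability-weighted sum.
exact = -sum(p * math.log(q) for p, q in zip(p_data, p_model))

# Monte Carlo estimate: average -log p_model(x) over samples drawn from p_data.
n = 100_000
samples = random.choices(outcomes, weights=p_data, k=n)
estimate = sum(-math.log(p_model[x]) for x in samples) / n

print(f"exact = {exact:.4f}, estimate = {estimate:.4f}")
```

The estimate is exactly the empirical NLL of p_model on the sampled "training set"; with more samples it tends to land closer to the exact expectation.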

About the math

The lesson uses three formal ingredients: the KL divergence (one definition), linearity of expectation (one algebraic step), and a Monte Carlo estimate (one approximation move). The arithmetic is one worked numerical KL computation on a 2-outcome distribution; the practice extends it to 3 outcomes and numerically verifies the identity cross-entropy = KL + entropy. This lesson is denser conceptually than computationally; the work is in following one derivation carefully, not in pushing many numbers around.
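
To give a feel for the scale of the arithmetic, the following sketch computes a KL divergence on a 2-outcome distribution and checks the identity cross-entropy = KL + entropy. The probabilities are made-up illustrative numbers and are not necessarily those of the lesson's worked example; the helper names are likewise assumptions.

```python
import math

def entropy(p):
    """H(p) = -sum_x p(x) log p(x), in nats."""
    return -sum(px * math.log(px) for px in p if px > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x), in nats."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

def kl(p, q):
    """KL(p || q) = sum_x p(x) log(p(x) / q(x)), in nats."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

# Illustrative 2-outcome distributions (assumed numbers).
p_data = [0.5, 0.5]
p_model = [0.25, 0.75]

forward = kl(p_data, p_model)   # about 0.1438 nats
reverse = kl(p_model, p_data)   # about 0.1308 nats, so KL is asymmetric

# Identity check: H(p_data, p_model) = KL(p_data || p_model) + H(p_data).
lhs = cross_entropy(p_data, p_model)
rhs = forward + entropy(p_data)

print(f"forward KL = {forward:.4f}, reverse KL = {reverse:.4f}")
print(f"cross-entropy = {lhs:.4f}, KL + entropy = {rhs:.4f}")
```

Swapping the arguments changes the value, and setting p_model equal to p_data gives exactly zero, which matches the properties the lesson asks you to state.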

By the end, you’ll be able to

Write the definition of the KL divergence and state its three key properties (non-negative, zero only at equality, asymmetric)
Show in one line of algebra that minimizing the forward KL from data to model is equivalent to maximizing the expected log-likelihood under the data
Translate from the expected log-likelihood to the empirical NLL via a Monte Carlo estimate, and recognize this as the same loss from the autoregressive lesson
Apply the identity KL(p_data || p_model) = H(p_data, p_model) - H(p_data) to translate freely between forward KL, cross-entropy, and NLL
Explain why forward KL (not reverse KL) is the practical choice for training generative models, and which paradigms do and do not share this objective
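
The derivation these objectives refer to can be sketched compactly as follows. The notation is an assumption made for this sketch: x is discrete (for continuous x the sums inside the expectations become integrals), θ denotes the model's parameters, and x_1, ..., x_N are training examples assumed to be drawn i.i.d. from p_data.

```latex
\begin{align*}
\mathrm{KL}(p_{\text{data}} \,\|\, p_{\text{model}})
  &= \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log \frac{p_{\text{data}}(x)}{p_{\text{model}}(x;\theta)}\right] \\
  &= \underbrace{\mathbb{E}_{x \sim p_{\text{data}}}\big[\log p_{\text{data}}(x)\big]}_{=\,-H(p_{\text{data}}),\ \text{independent of } \theta}
     \;-\; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log p_{\text{model}}(x;\theta)\big] \\
  &= H(p_{\text{data}}, p_{\text{model}}) - H(p_{\text{data}}).
\end{align*}

% The first term does not depend on theta, so the minimizer is unchanged:
\begin{align*}
\arg\min_{\theta}\, \mathrm{KL}(p_{\text{data}} \,\|\, p_{\text{model}})
  &= \arg\max_{\theta}\, \mathbb{E}_{x \sim p_{\text{data}}}\big[\log p_{\text{model}}(x;\theta)\big], \\
\mathbb{E}_{x \sim p_{\text{data}}}\big[-\log p_{\text{model}}(x;\theta)\big]
  &\approx -\frac{1}{N} \sum_{i=1}^{N} \log p_{\text{model}}(x_i;\theta)
  \quad \text{(empirical NLL)}.
\end{align*}
```

Read top to bottom, the three names line up: forward KL equals cross-entropy minus a constant, and the empirical NLL is the sample-average estimate of that cross-entropy.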

Time and difficulty

Read time: about 12 minutes
Practice time: about 16 minutes (a six-question self-check, a compute-the-KL exercise on a 3-outcome case, a cross-entropy = KL + entropy verification, and flashcards)
Difficulty: standard (a Phase 1 lesson; light arithmetic, but the one-line derivation deserves careful reading)