
Maximum likelihood and the KL view

This is lesson 3 of Track 19 (Generative Models and Diffusion). It gives the theoretical justification for the training objective that the previous lesson minimized without explanation. By the end you will be able to define the KL divergence; show in one line of algebra that minimizing the forward KL from the data distribution to the model is the same as maximizing the expected log-likelihood; recognize the empirical NLL you have been minimizing as the Monte Carlo estimate of that expectation on a finite training set; and translate freely between the three names this objective travels under (forward KL, cross-entropy, NLL). The derivation is the spine of every likelihood-based paradigm in the track: autoregressive and flow training minimize the forward KL directly, and VAE training does so approximately, by maximizing the ELBO, a lower bound on the log-likelihood. GANs explicitly use a different objective; diffusion is connected to this one less directly, through a chain of equivalences.
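The one-line derivation promised above can be sketched as follows. The notation is an assumption of this sketch rather than something fixed by the lesson: \theta stands for the model parameters, and x_1, ..., x_N are the training examples, assumed to be drawn independently from p_data.

```latex
\begin{align*}
D_{\mathrm{KL}}\!\left(p_{\text{data}} \,\|\, p_{\text{model}}\right)
  &= \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log \frac{p_{\text{data}}(x)}{p_{\text{model}}(x)}\right] \\
  &= \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log p_{\text{data}}(x)\right]
   - \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log p_{\text{model}}(x)\right] \\
  &= H\!\left(p_{\text{data}}, p_{\text{model}}\right) - H\!\left(p_{\text{data}}\right).
\end{align*}

% The first term does not depend on \theta, so
\[
\arg\min_{\theta} D_{\mathrm{KL}}\!\left(p_{\text{data}} \,\|\, p_{\text{model}}\right)
  = \arg\max_{\theta} \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log p_{\text{model}}(x)\right],
\qquad
\mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log p_{\text{model}}(x)\right]
  \approx \frac{1}{N} \sum_{i=1}^{N} \log p_{\text{model}}(x_i).
\]
```

Negating the sample average on the right gives the empirical NLL, so minimizing the NLL is the finite-sample version of minimizing the forward KL.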

Within the track, this is lesson 3 of 15 and the third step of Phase 1 (generative foundations). It justifies the NLL objective that the previous lesson took on faith. The next lesson, Normalizing flows, applies the same objective to a new architecture: an invertible transformation parameterizes p_model(x) exactly via the change-of-variables formula from Track 4, and the model is trained by minimizing the same empirical NLL, which is forward-KL minimization. The KL view explains why two such different architectures (autoregressive and flow) share an objective, and it sets up the ELBO derivation in lesson 5 as the closest thing to forward-KL minimization when the exact log-likelihood is intractable.

Prerequisites: the previous lesson, "Autoregressive models, factoring by the chain rule", which introduced the NLL loss that this lesson justifies. Math background: comfort with expectations (sums weighted by probabilities), logarithms, and a one-line algebraic manipulation (linearity of expectation, and splitting log(a/b) into log a - log b). One short Monte Carlo argument is invoked (a sample average estimates an expectation). No calculus is used.

The lesson uses three formal ingredients: the KL divergence (one definition), linearity of expectation (one algebraic step), and a Monte Carlo estimate (one approximation move). The arithmetic is one worked numerical KL computation on a 2-outcome distribution; the practice extends it to 3 outcomes and numerically verifies the identity cross-entropy = KL + entropy. This lesson is denser conceptually than computationally: the work is in following one derivation carefully, not in pushing many numbers around.
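A minimal sketch of the kind of arithmetic involved, using only the Python standard library. The two distributions below are illustrative values chosen for this sketch, not the numbers used in the lesson's worked example, and the function names are hypothetical helpers. Natural logarithms are used, so the results are in nats.

```python
import math

def entropy(p):
    """H(p) = -sum_x p(x) log p(x), skipping zero-probability outcomes."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """KL(p || q) = sum_x p(x) log(p(x) / q(x))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative 2-outcome distributions (assumed values, not from the lesson).
p_data = [0.5, 0.5]
p_model = [0.9, 0.1]

kl_forward = kl(p_data, p_model)   # KL(p_data || p_model)
kl_reverse = kl(p_model, p_data)   # KL(p_model || p_data)

# Gap in the identity: cross-entropy = KL + entropy (should be ~0).
identity_gap = abs(
    cross_entropy(p_data, p_model) - (kl_forward + entropy(p_data))
)

print(f"forward KL: {kl_forward:.4f}")
print(f"reverse KL: {kl_reverse:.4f}")
print(f"identity gap: {identity_gap:.2e}")
```

The forward and reverse values differ (roughly 0.51 versus 0.37 nats here), which illustrates the asymmetry, and the identity gap is zero up to floating-point rounding.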

  • Write the definition of the KL divergence and state its three key properties (non-negative, zero only at equality, asymmetric)
  • Show in one line of algebra that minimizing the forward KL from data to model is equivalent to maximizing the expected log-likelihood under the data
  • Translate from the expected log-likelihood to the empirical NLL via a Monte Carlo estimate, and recognize this as the same loss from the autoregressive lesson
  • Apply the identity KL(p_data || p_model) = H(p_data, p_model) - H(p_data) to translate freely between forward KL, cross-entropy, and NLL
  • Explain why forward KL (not reverse KL) is the practical choice for training generative models, and which paradigms do and do not share this objective
  • Read time: about 12 minutes
  • Practice time: about 16 minutes (a six-question self-check, a compute-the-KL exercise on a 3-outcome case, a cross-entropy = KL + entropy verification, and flashcards)
  • Difficulty: standard (a Phase 1 lesson; light arithmetic, but the one-line derivation deserves careful reading)
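The Monte Carlo step named in the objectives can be previewed with a small simulation. This is a sketch under stated assumptions: the distributions, the seed, and the sample size are arbitrary illustrative choices, and in real training p_data is unknown and only the samples are available.

```python
import math
import random

# Illustrative distributions over outcomes {0, 1} (assumed values).
p_data = [0.7, 0.3]
p_model = [0.6, 0.4]

# Exact expectation: cross-entropy H(p_data, p_model) = -E_{x~p_data}[log p_model(x)].
exact_cross_entropy = -sum(p * math.log(q) for p, q in zip(p_data, p_model))

# Finite "training set": samples drawn from p_data with a fixed seed.
rng = random.Random(0)
samples = rng.choices([0, 1], weights=p_data, k=100_000)

# Empirical NLL: the sample average of -log p_model(x_i).
empirical_nll = -sum(math.log(p_model[x]) for x in samples) / len(samples)

print(f"exact cross-entropy: {exact_cross_entropy:.4f}")
print(f"empirical NLL:       {empirical_nll:.4f}")
```

With this many samples the two numbers should agree to about two decimal places; the empirical NLL is a noisy estimate of the cross-entropy, which differs from the forward KL only by the constant H(p_data).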