References: Maximum likelihood and the KL view

Source curricula (multi-source structural mirror; cited as further study):
PRIMARY (this lesson follows its framing most directly)
• Stanford CS236, "Deep Generative Models", Lecture 4: Maximum Likelihood Learning
Instructor: Stefano Ermon
Course URL: https://deepgenerativemodels.github.io/
Syllabus: https://deepgenerativemodels.github.io/syllabus.html
License: standard course-page link-out; cited as further study
SECONDARY (parallel framing where applicable; CS294-158's parallel content
appears throughout L1-L6 of that course rather than as a single lecture on MLE)
• Berkeley CS294-158, "Deep Unsupervised Learning" (Spring 2024)
Instructors: Pieter Abbeel, Wilson Yan, Kevin Frans, Philipp Wu
Course URL: https://sites.google.com/view/berkeley-cs294-158-sp24/
License: standard course-page link-out; cited as further study
Clawdemy's lessons are original prose that follows the pedagogical arc of these
two courses, anchored on CS236's lecture order with CS294-158 framing pulled in
where its slide deck and recording are stronger. We do not reproduce or
transcribe the lectures; we cite them as the recommended companions. All rights
to the original course materials remain with the respective instructors and
institutions.
  • Stanford CS236 (Stefano Ermon), course homepage. The primary anchor for this lesson; Lecture 4 (Maximum Likelihood Learning) walks through the KL derivation, the forward-vs-reverse choice, and the connection to cross-entropy that this lesson mirrors. The course notes at deepgenerativemodels.github.io/notes include a written treatment with worked algebra that complements the lecture slides.

  • Berkeley CS294-158 Sp24 (Pieter Abbeel et al.), course homepage. The course’s parallel treatment of MLE appears across lectures L1-L4 rather than in a single dedicated lecture; the autoregressive (L2), flow (L3), and latent-variable (L4) lectures each invoke the same forward-KL = NLL identity that this lesson derives from first principles.
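
For orientation, the forward-KL = NLL identity that both courses rely on can be written out in a few lines. This is a sketch in our own notation, not a transcription of either lecture: p_\theta stands for the lesson's p_model with its parameters \theta made explicit, H is the entropy of the data distribution, and x_1, ..., x_N are assumed to be i.i.d. samples from p_data.

```latex
\begin{aligned}
D_{\mathrm{KL}}\!\left(p_{\text{data}} \,\|\, p_\theta\right)
  &= \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log p_{\text{data}}(x) - \log p_\theta(x)\right] \\
  &= -H(p_{\text{data}}) \;-\; \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log p_\theta(x)\right], \\[4pt]
\arg\min_\theta \, D_{\mathrm{KL}}\!\left(p_{\text{data}} \,\|\, p_\theta\right)
  &= \arg\min_\theta \, -\mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log p_\theta(x)\right]
  \;\approx\; \arg\min_\theta \, -\frac{1}{N}\sum_{i=1}^{N} \log p_\theta(x_i).
\end{aligned}
```

The entropy term does not depend on \theta, so it drops out of the minimization; what remains is the expected negative log-likelihood, and its Monte Carlo estimate over the training set is the empirical NLL.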

A short, durable list. Each link is a specific next step, not a generic pile.

  • “Elements of Information Theory” by Thomas M. Cover and Joy A. Thomas (Wiley), especially Chapter 2 (Entropy, Relative Entropy, and Mutual Information). The canonical graduate-level treatment of KL divergence, cross-entropy, and the inequalities (Jensen, Gibbs) that make KL >= 0 go through. Widely available through libraries and academic bookstores. Read after this lesson if you want the inequalities derived rather than asserted, and if you want the connection to coding theory.

  • “Pattern Recognition and Machine Learning” by Christopher Bishop (Springer), Section 1.6 (Information Theory). A more ML-oriented treatment of the same material, with cleaner notation and direct connections to the loss functions used in classification and regression. Section 1.6.1 covers relative entropy (KL) and mutual information. Widely available; a PDF has been made freely downloadable in recent years (search for the book title and “PDF” to find the current canonical link).

Where this sits in the track.

  • Autoregressive models (previous lesson). Lesson 2 asserted “minimize the NLL” as the training objective without justifying it. This lesson derives that objective from minimizing the forward KL. The same NLL, with the same per-piece chain-rule decomposition, comes out either way; this lesson supplies the justification.

  • Normalizing flows (next lesson). Flows parameterize p_model(x) exactly through the change-of-variables formula and an invertible transformation. They are trained by minimizing the same empirical NLL, which this lesson showed is equivalent to minimizing the forward KL. The KL view explains why flows and autoregressive models share an objective despite radically different architectures.

  • Variational autoencoders (lessons 5 and 6). VAEs cannot evaluate log p_model(x) directly because of an intractable integral over latents. Instead they maximize a lower bound (the ELBO) on log p_model(x). Maximizing the ELBO is the closest available substitute for forward-KL minimization when the exact log-likelihood is unavailable; lesson 5 derives it.

  • The four-paradigm landscape (lesson 15). This lesson sharpens the L1 map: three of the four paradigms (autoregressive, flow, VAE-via-ELBO) reduce to forward-KL minimization, the VAE only through its lower bound; GANs explicitly do not; diffusion is connected through a chain of equivalences. The capstone returns to this synthesis.
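
The two relationships this track map leans on can be checked numerically on toy discrete distributions. The sketch below is illustrative only: the distributions, the names `p_data`, `model_a`, `model_b`, and the helper `elbo_and_log_evidence` are all made up for this example and do not come from either course. It shows (1) that expected NLL and forward KL differ by the entropy of the data, a constant no model can change, and (2) that for a single observation the ELBO sits below the log-likelihood by exactly KL(q || posterior).

```python
import math


def entropy(p):
    """Shannon entropy H(p) in nats for a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """Expected NLL of model q under distribution p: -E_p[log q]."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl(p, q):
    """KL(p || q) = E_p[log p - log q] for discrete distributions."""
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)


# Toy data distribution and two candidate models (hypothetical values).
p_data = [0.5, 0.3, 0.2]
model_a = [0.4, 0.4, 0.2]
model_b = [0.1, 0.1, 0.8]

# (1) Expected NLL minus forward KL equals H(p_data) for every model,
# so ranking models by NLL is the same as ranking them by forward KL.
gap_a = cross_entropy(p_data, model_a) - kl(p_data, model_a)
gap_b = cross_entropy(p_data, model_b) - kl(p_data, model_b)


def elbo_and_log_evidence(prior, likelihood, q):
    """For one fixed observation x with a discrete latent z.

    prior[z] = p(z), likelihood[z] = p(x | z), q[z] = approximate posterior q(z).
    Returns (ELBO, log p(x), KL(q || p(z | x))).
    """
    joint = [pz * px for pz, px in zip(prior, likelihood)]  # p(x, z)
    evidence = sum(joint)                                   # p(x)
    log_px = math.log(evidence)
    elbo = sum(qz * (math.log(j) - math.log(qz)) for qz, j in zip(q, joint) if qz > 0)
    posterior = [j / evidence for j in joint]               # p(z | x)
    return elbo, log_px, kl(q, posterior)


# (2) The ELBO lower-bounds log p(x); the slack is KL(q || posterior).
elbo, log_px, slack = elbo_and_log_evidence(
    prior=[0.6, 0.4], likelihood=[0.2, 0.7], q=[0.5, 0.5]
)

if __name__ == "__main__":
    print("H(p_data):", entropy(p_data), "gaps:", gap_a, gap_b)
    print("log p(x):", log_px, "ELBO:", elbo, "slack:", slack)
```

In this toy setting the latent sum is tractable, so the bound can be compared against the exact log-likelihood; in a real VAE the sum (or integral) is the intractable part, which is why only the ELBO is optimized.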