References: Maximum likelihood and the KL view

Source material

Source curricula (multi-source structural mirror; cited as further study):

PRIMARY (this lesson follows its framing most directly)
• Stanford CS236, "Deep Generative Models", Lecture 4: Maximum Likelihood Learning
  Instructor: Stefano Ermon
  Course URL: https://deepgenerativemodels.github.io/
  Syllabus: https://deepgenerativemodels.github.io/syllabus.html
  License: standard course-page link-out; cited as further study

SECONDARY (parallel framing where applicable; CS294-158's parallel content
appears throughout L1-L6 of that course rather than as a single lecture on MLE)
• Berkeley CS294-158, "Deep Unsupervised Learning" (Spring 2024)
  Instructors: Pieter Abbeel, Wilson Yan, Kevin Frans, Philipp Wu
  Course URL: https://sites.google.com/view/berkeley-cs294-158-sp24/
  License: standard course-page link-out; cited as further study

Clawdemy's lessons are original prose that follows the pedagogical arc of these
two courses, anchored on CS236's lecture order with CS294-158 framing pulled in
where its slide deck and recording are stronger. We do not reproduce or
transcribe the lectures; we cite them as the recommended companions. All rights
to the original course materials remain with the respective instructors and
institutions.

Watch this next

Stanford CS236 (Stefano Ermon), course homepage. The primary anchor for this lesson; Lecture 4 (Maximum Likelihood Learning) walks the KL derivation, the forward-vs-reverse choice, and the connection to cross-entropy that this lesson mirrors. The course notes at deepgenerativemodels.github.io/notes include a written treatment with worked algebra that complements the lecture slides.
Berkeley CS294-158 Sp24 (Pieter Abbeel et al.), course homepage. The course’s parallel treatment of MLE appears across lectures L1-L4 rather than in a single dedicated lecture; the autoregressive (L2), flow (L3), and latent-variable (L4) lectures each invoke the same identity that this lesson derives from first principles: forward KL equals expected NLL up to a constant that does not depend on the model.

Going deeper

A short, durable list. Each entry is a specific next step, not a generic pile.

“Elements of Information Theory” by Thomas M. Cover and Joy A. Thomas (Wiley), especially Chapter 2 (Entropy, Relative Entropy, and Mutual Information). The canonical graduate-level treatment of KL divergence, cross-entropy, and the inequalities (Jensen, Gibbs) that make KL >= 0 go through. Widely available through libraries and academic bookstores. Read after this lesson if you want the inequalities derived rather than asserted, and if you want the connection to coding theory.
“Pattern Recognition and Machine Learning” by Christopher Bishop (Springer), Section 1.6 (Information Theory). A more ML-oriented treatment of the same material, with cleaner notation and direct connections to the loss functions used in classification and regression. Section 1.6.1 covers both relative entropy (KL) and mutual information. Widely available; a PDF has been made freely downloadable in recent years (search for the book title and “PDF” to find the current canonical link).

Adjacent topics

Where this sits in the track.

Autoregressive models (previous lesson). Lesson 2 asserted “minimize the NLL” as the training objective without deriving it. This lesson derives it from minimizing the forward KL. The same NLL, with the same per-piece chain-rule decomposition, comes out either way; this lesson supplies the justification.
Normalizing flows (next lesson). Flows parameterize p_model(x) exactly through the change-of-variables formula and an invertible transformation. They train by minimizing the same empirical NLL, which this lesson shows is equivalent to minimizing the forward KL. The KL view explains why flows and autoregressive models share an objective despite radically different architectures.
Variational autoencoders (lessons 5 and 6). VAEs cannot evaluate log p_model(x) directly because of an intractable integral over latents. They maximize a lower bound (the ELBO) on log p_model(x) instead. Because the negative ELBO upper-bounds the NLL, maximizing the ELBO minimizes an upper bound on the forward KL (up to the model-independent constant); it is the closest available substitute when the exact log-likelihood cannot be computed. Lesson 5 derives it.
The four-paradigm landscape (lesson 15). This lesson sharpens the lesson 1 map: three of the four paradigms (autoregressive, flow, VAE-via-ELBO) all reduce to forward-KL minimization, exactly for the first two and through a bound for VAEs; GANs explicitly do not; diffusion is connected through a chain of equivalences. The capstone returns to this synthesis.
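
The shared objective that the entries above keep pointing to can be checked numerically. The following is a minimal sketch, not taken from either course: the three-outcome data distribution and the one-parameter `model` family are hypothetical values chosen only for illustration. It shows that the expected NLL (cross-entropy) and the forward KL differ by the entropy of the data distribution, which does not depend on the model, so both are minimized by the same parameter.

```python
import math

# Toy "data" distribution over three outcomes (illustrative values, an assumption).
p_data = [0.5, 0.3, 0.2]

def entropy(p):
    # H(p) = -sum_x p(x) log p(x)
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    # Expected NLL of model q under data p: -sum_x p(x) log q(x)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def forward_kl(p, q):
    # KL(p || q) = sum_x p(x) log(p(x) / q(x))
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def model(theta):
    # Hypothetical one-parameter family: mass theta on outcome 0,
    # the remainder split evenly between outcomes 1 and 2.
    return [theta, (1 - theta) / 2, (1 - theta) / 2]

# Grid of candidate parameters strictly inside (0, 1).
thetas = [i / 100 for i in range(1, 100)]

# Gap between expected NLL and forward KL for every candidate model.
gaps = [cross_entropy(p_data, model(t)) - forward_kl(p_data, model(t)) for t in thetas]

# Minimizers of the two objectives over the grid.
best_nll = min(thetas, key=lambda t: cross_entropy(p_data, model(t)))
best_kl = min(thetas, key=lambda t: forward_kl(p_data, model(t)))

print("entropy of p_data:", entropy(p_data))
print("max deviation of gap from entropy:", max(abs(g - entropy(p_data)) for g in gaps))
print("argmin NLL:", best_nll, "argmin forward KL:", best_kl)
```

Because this toy family cannot represent p_data exactly, the forward KL at the minimizer stays strictly positive; the two objectives still agree on which parameter is best, which is the sense in which minimizing NLL and minimizing forward KL are the same problem.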