References: Maximum likelihood and the KL view
Source material
Source curricula (multi-source structural mirror; cited as further study):
PRIMARY (this lesson follows its framing most directly)
• Stanford CS236, "Deep Generative Models", Lecture 4: Maximum Likelihood Learning
  Instructor: Stefano Ermon
  Course URL: https://deepgenerativemodels.github.io/
  Syllabus: https://deepgenerativemodels.github.io/syllabus.html
  License: standard course-page link-out; cited as further study
SECONDARY (parallel framing where applicable; CS294-158's parallel content appears throughout L1-L6 of that course rather than as a single lecture on MLE)
• Berkeley CS294-158, "Deep Unsupervised Learning" (Spring 2024)
  Instructors: Pieter Abbeel, Wilson Yan, Kevin Frans, Philipp Wu
  Course URL: https://sites.google.com/view/berkeley-cs294-158-sp24/
  License: standard course-page link-out; cited as further study
Clawdemy's lessons are original prose that follows the pedagogical arc of these two courses, anchored on CS236's lecture order with CS294-158 framing pulled in where its slide deck and recording are stronger. We do not reproduce or transcribe the lectures; we cite them as the recommended companions. All rights to the original course materials remain with the respective instructors and institutions.
Watch this next
- Stanford CS236 (Stefano Ermon), course homepage. The primary anchor for this lesson; Lecture 4 (Maximum Likelihood Learning) walks through the KL derivation, the forward-vs-reverse choice, and the connection to cross-entropy that this lesson mirrors. The course notes at deepgenerativemodels.github.io/notes include a written treatment with worked algebra that complements the lecture slides.
- Berkeley CS294-158 Sp24 (Pieter Abbeel et al.), course homepage. The course’s parallel treatment of MLE appears across lectures L1-L4 rather than in a single dedicated lecture; the autoregressive (L2), flow (L3), and latent-variable (L4) lectures each invoke the same forward-KL = NLL identity that this lesson derives from first principles.
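For reference while watching, the forward-KL = NLL identity that both courses invoke can be sketched in a few lines. The notation is an assumption of this page rather than a quotation from either course: p_data is the data distribution, p_model is the model with parameters θ, and x_1, ..., x_N are i.i.d. samples from p_data.

```latex
\begin{aligned}
D_{\mathrm{KL}}\big(p_{\text{data}} \,\|\, p_{\text{model}}\big)
  &= \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log \frac{p_{\text{data}}(x)}{p_{\text{model}}(x)}\right] \\
  &= \underbrace{\mathbb{E}_{x \sim p_{\text{data}}}\big[\log p_{\text{data}}(x)\big]}_{\text{independent of } \theta}
     \;-\; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log p_{\text{model}}(x)\big], \\[4pt]
\arg\min_{\theta} D_{\mathrm{KL}}\big(p_{\text{data}} \,\|\, p_{\text{model}}\big)
  &= \arg\min_{\theta} \; -\mathbb{E}_{x \sim p_{\text{data}}}\big[\log p_{\text{model}}(x)\big]
  \;\approx\; \arg\min_{\theta} \; -\frac{1}{N}\sum_{i=1}^{N} \log p_{\text{model}}(x_i).
\end{aligned}
```

The first term does not depend on θ, so minimizing the forward KL and minimizing the expected negative log-likelihood pick out the same parameters; the final step replaces the expectation with its Monte Carlo estimate over the training set.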
Going deeper
A short, durable list. Each link is a specific next step, not a generic pile.
- “Elements of Information Theory” by Thomas M. Cover and Joy A. Thomas (Wiley), especially Chapter 2 (Entropy, Relative Entropy, and Mutual Information). The canonical graduate-level treatment of KL divergence, cross-entropy, and the inequalities (Jensen, Gibbs) that make KL >= 0 go through. Widely available through libraries and academic bookstores. Read after this lesson if you want the inequalities derived rather than asserted, and if you want the connection to coding theory.
- “Pattern Recognition and Machine Learning” by Christopher Bishop (Springer), Section 1.6 (Information Theory). A more ML-oriented treatment of the same material, with cleaner notation and direct connections to the loss functions used in classification and regression. Section 1.6.1 covers relative entropy (KL); Section 1.6.2 covers mutual information. Widely available; the publisher has made a PDF freely downloadable in recent years (search for the book title and “PDF” to find the current canonical link).
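Before opening either book, it can help to see the two facts they prove checked numerically on a small example. The sketch below is illustrative only: the distributions `p` and `q` and the helper names `entropy`, `cross_entropy`, and `kl` are made up for this page, and a numerical check on one pair of distributions is no substitute for the proofs in the references.

```python
import math


def entropy(p):
    """Shannon entropy H(p) in nats for a discrete distribution p."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """Cross-entropy H(p, q) = -sum_x p(x) log q(x), in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl(p, q):
    """Relative entropy KL(p || q) = sum_x p(x) log(p(x) / q(x)), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical distributions over three outcomes (chosen for illustration).
p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]

forward = kl(p, q)   # KL(p || q)
reverse = kl(q, p)   # KL(q || p), generally a different number

print("KL(p||q)          =", forward)
print("KL(q||p)          =", reverse)
print("H(p,q) - H(p)     =", cross_entropy(p, q) - entropy(p))
print("KL(p||p)          =", kl(p, p))
```

On this example the forward KL is non-negative, matches cross-entropy minus entropy up to floating-point error, vanishes when both arguments are the same distribution, and differs from the reverse KL, which is why the forward-vs-reverse choice matters.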
Adjacent topics
Where this sits in the track.
- Autoregressive models (previous lesson). Lesson 2 asserted “minimize the NLL” as the training objective. This lesson derived it from minimizing the forward KL. The same NLL, with the same per-piece chain-rule decomposition, comes out either way; this lesson gives the justification.
- Normalizing flows (next lesson). Flows parameterize p_model(x) exactly through the change-of-variables formula and an invertible transformation. They train on the same empirical NLL = forward KL minimization this lesson derived. The KL view explains why flows and autoregressive models share an objective despite radically different architectures.
- Variational autoencoders (lessons 5 and 6). VAEs cannot evaluate log p_model(x) directly because of an intractable integral over latents. They maximize a lower bound (the ELBO) on log p_model. The ELBO is the closest thing to forward-KL minimization when the exact log-likelihood is unavailable; lesson 5 derives it.
- The four-paradigm landscape (lesson 15). This lesson sharpens the lesson 1 map: three of the four paradigms (autoregressive, flow, VAE-via-ELBO) all reduce to forward-KL minimization; GANs explicitly do not; diffusion is connected through a chain of equivalences. The capstone returns to this synthesis.
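As a preview of the flow lesson, the change-of-variables formula and the shared NLL objective can be sketched with the simplest invertible map, a one-dimensional affine transformation x = mu + sigma * z with z drawn from a standard normal. Everything here is an assumption made for illustration: the function names, the toy `data` list, and the choice of an affine map are not taken from the lessons or the cited courses.

```python
import math


def standard_normal_logpdf(z):
    """Log density of the base distribution N(0, 1)."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))


def affine_flow_logpdf(x, mu, sigma):
    """log p_model(x) via change of variables for x = mu + sigma * z.

    The inverse map is z = (x - mu) / sigma, so |dz/dx| = 1 / sigma and
    log p_model(x) = log p_z(z) - log(sigma).
    """
    z = (x - mu) / sigma
    return standard_normal_logpdf(z) - math.log(sigma)


def gaussian_logpdf(x, mu, sigma):
    """Closed-form log density of N(mu, sigma^2), used as a cross-check."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2.0 * sigma ** 2)


def mean_nll(data, mu, sigma):
    """Empirical negative log-likelihood, the quantity the flow is trained on."""
    return -sum(affine_flow_logpdf(x, mu, sigma) for x in data) / len(data)


# Toy data set (hypothetical values).
data = [0.8, 1.9, 2.4, 3.1, 1.3]

# Closed-form maximum-likelihood estimates for a Gaussian.
mu_hat = sum(data) / len(data)
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in data) / len(data))

print("NLL at the MLE:         ", mean_nll(data, mu_hat, sigma_hat))
print("NLL at shifted mean:    ", mean_nll(data, mu_hat + 0.5, sigma_hat))
print("NLL at doubled scale:   ", mean_nll(data, mu_hat, 2.0 * sigma_hat))
```

The flow's log density agrees with the closed-form Gaussian log density, and the empirical NLL is lowest at the maximum-likelihood parameters. A real flow replaces the affine map with a learned invertible network and finds the minimizer by gradient descent, but the objective is the same empirical NLL.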