References: Evaluating generative models

Source material

Source curricula (multi-source structural mirror; cited as further study):

PRIMARY (this lesson follows its framing most directly)
• Stanford CS236, "Deep Generative Models", Lecture 15: Evaluation of Generative Models
  Instructor: Stefano Ermon
  Course URL: https://deepgenerativemodels.github.io/
  Syllabus: https://deepgenerativemodels.github.io/syllabus.html
  License: standard course-page link-out; cited as further study

SECONDARY (parallel framing where applicable; CS294-158's coverage of evaluation
appears across the GAN, flow, and diffusion lectures rather than as a single
dedicated lecture)
• Berkeley CS294-158, "Deep Unsupervised Learning" (Spring 2024)
  Instructors: Pieter Abbeel, Wilson Yan, Kevin Frans, Philipp Wu
  Course URL: https://sites.google.com/view/berkeley-cs294-158-sp24/
  License: standard course-page link-out; cited as further study

Clawdemy's lessons are original prose that follows the pedagogical arc of these
two courses, anchored on CS236's lecture order with CS294-158 framing pulled in
where its slide deck and recording are stronger. We do not reproduce or
transcribe the lectures; we cite them as the recommended companions. All rights
to the original course materials remain with the respective instructors and
institutions.

Watch this next

Stanford CS236 (Stefano Ermon), course homepage. Lecture 15 (Evaluation of Generative Models) is the primary anchor for this lesson; it covers IS, FID, precision/recall, and the cross-paradigm comparison framing. The course notes at deepgenerativemodels.github.io/notes include the FID derivation in more detail than the slides.
Berkeley CS294-158 Sp24 (Pieter Abbeel et al.), course homepage. CS294-158 distributes evaluation across the flow lecture (Lecture 3), the GAN lecture (Lecture 5), and the diffusion lecture (Lecture 6); reading those sections gives the paradigm-by-paradigm view this lesson abstracts into a single fingerprint table.

Going deeper

A short, durable list. Each link is a specific next step, not a generic pile.

“Improved Techniques for Training GANs” (Salimans et al., 2016). The paper that introduced the Inception Score (IS) along with the collection of training-and-evaluation tricks that came to define the early GAN era. IS is in Section 4; the rest of the paper covers training-stability techniques (feature matching, minibatch discrimination, virtual batch normalization) that preceded WGAN-GP’s stability fixes.
“GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium” (Heusel et al., 2017). The paper that introduced FID (Fréchet Inception Distance) alongside the two-time-scale GAN training rule. FID is introduced as the performance measure in the experiments section, together with the original empirical comparison against IS; the convergence analysis earlier in the paper concerns the two-time-scale update rule itself rather than FID.
“Improved Precision and Recall Metric for Assessing Generative Models” (Kynkäänniemi et al., 2019). The current standard precision-recall metric for distributions, which improved on the original Sajjadi et al. (2018) formulation. Useful when you need to separate quality from coverage rather than collapsing them into a single FID number.
“A Note on the Inception Score” (Barratt and Sharma, 2018). A short, sharp critique of IS that explains why it can be gamed and what it misses. Read after the original IS paper to calibrate expectations; this is the standard reference for “do not use IS as your only metric.”
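
To make the two headline metrics in this list concrete, here is a minimal NumPy sketch of their standard formulas: IS = exp(E_x KL(p(y|x) || p(y))) and FID = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^(1/2)). The function names inception_score and frechet_distance are hypothetical, and the inputs are random stand-ins: a real evaluation would feed class probabilities and pooled features from a pretrained Inception network, which this sketch does not load.

```python
import numpy as np


def inception_score(probs, eps=1e-12):
    """exp of the mean KL(p(y|x) || p(y)) over an (N, classes) array of class probabilities."""
    probs = np.asarray(probs, dtype=np.float64)
    marginal = probs.mean(axis=0, keepdims=True)  # p(y), averaged over samples
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))


def frechet_distance(feats_real, feats_gen):
    """Frechet distance between Gaussians fitted to two (N, D) feature arrays."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.atleast_2d(np.cov(feats_real, rowvar=False))
    cov_g = np.atleast_2d(np.cov(feats_gen, rowvar=False))
    # Tr((cov_r cov_g)^(1/2)) is the sum of square roots of the eigenvalues of
    # cov_r @ cov_g; for PSD covariances these are real and non-negative, so
    # clipping only removes numerical noise.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)


# Stand-in data (assumption: random vectors, NOT Inception features).
rng = np.random.default_rng(0)
real = rng.normal(size=(2000, 8))
same = rng.normal(size=(2000, 8))               # same distribution as `real`
shifted = rng.normal(loc=1.0, size=(2000, 8))   # mean shifted by 1 in every dimension

fid_same = frechet_distance(real, same)         # should be close to 0
fid_shifted = frechet_distance(real, shifted)   # should be much larger

uniform = np.full((100, 10), 0.1)               # uninformative classifier output
onehot = np.eye(10)[np.arange(100) % 10]        # confident and evenly spread over 10 classes

is_uniform = inception_score(uniform)           # lowest possible score
is_onehot = inception_score(onehot)             # approaches the number of classes
```

The toy inputs show the behavior the papers above discuss: IS rewards confident, class-diverse predictions but never looks at real data, while the Fréchet distance compares generated features directly against real ones.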

Adjacent topics

Where this sits in the track.

Maximum likelihood and the KL view (L3). L3 established likelihood as the natural training objective for the paradigms that can compute it. This lesson takes up the complementary question: how do you compare paradigms when not all of them can give you a likelihood number? The answer, sample-based metrics plus paradigm-specific instruments, is the toolkit L3 said would be needed when likelihood is unavailable.
VAE training in practice (L6). L6 noted that a VAE’s reported likelihood is the ELBO, a lower bound on the log-likelihood. This lesson formalizes the consequence: VAE evaluation needs the ELBO plus sample-based metrics, because the bound alone does not tell you how good the samples are.
GANs (L7) and WGAN-GP (L8). Both lessons named FID and IS as the evaluation methods for adversarial training (and named the Wasserstein critic estimate as the WGAN training-stability instrument). This lesson builds out those metrics with full formulas and worked examples.
Diffusion models (L12-14, coming). Diffusion adds a step-count axis to image-generation evaluation: FID at different step counts and the sample-quality-vs-step-count Pareto frontier are the standard way to characterize a diffusion model’s trade-offs. This lesson’s evaluation-methods-as-paradigm-fingerprint framing makes those diffusion-specific metrics easier to read when they appear.
The four-paradigm landscape (L15). The capstone returns to L1’s map with all paradigms fully unpacked. The evaluation table from this lesson is one of the cross-paradigm tools the capstone will lean on to compare modern systems (Stable Diffusion, autoregressive LLMs, GAN-family models) on a common footing.