References: Evaluating generative models
Source material
Source curricula (multi-source structural mirror; cited as further study):
PRIMARY (this lesson follows its framing most directly)
• Stanford CS236, "Deep Generative Models", Lecture 15: Evaluation of Generative Models
  Instructor: Stefano Ermon
  Course URL: https://deepgenerativemodels.github.io/
  Syllabus: https://deepgenerativemodels.github.io/syllabus.html
  License: standard course-page link-out; cited as further study

SECONDARY (parallel framing where applicable; CS294-158's coverage of evaluation appears across the GAN, flow, and diffusion lectures rather than as a single dedicated lecture)
• Berkeley CS294-158, "Deep Unsupervised Learning" (Spring 2024)
  Instructors: Pieter Abbeel, Wilson Yan, Kevin Frans, Philipp Wu
  Course URL: https://sites.google.com/view/berkeley-cs294-158-sp24/
  License: standard course-page link-out; cited as further study
Clawdemy's lessons are original prose that follows the pedagogical arc of these two courses, anchored on CS236's lecture order with CS294-158 framing pulled in where its slide deck and recording are stronger. We do not reproduce or transcribe the lectures; we cite them as the recommended companions. All rights to the original course materials remain with the respective instructors and institutions.
Watch this next
- Stanford CS236 (Stefano Ermon), course homepage. Lecture 15 (Evaluation of Generative Models) is the primary anchor for this lesson; it covers IS, FID, precision/recall, and the cross-paradigm comparison framing. The course notes at deepgenerativemodels.github.io/notes include the FID derivation in more detail than the slides.
- Berkeley CS294-158 Sp24 (Pieter Abbeel et al.), course homepage. CS294-158 distributes evaluation across the GAN lecture (L5), the flow lecture (L3), and the diffusion lecture (L6); reading those sections gives the paradigm-by-paradigm view this lesson abstracts into a single fingerprint table.
Going deeper
A short, durable list. Each link is a specific next step, not a generic pile.
- “Improved Techniques for Training GANs” (Salimans et al., 2016). The paper that introduced the Inception Score and the broader collection of training-and-evaluation tricks that came to define the early GAN era. IS is in Section 4; the rest of the paper covers training stability moves (feature matching, minibatch discrimination, virtual batch normalization) that preceded WGAN-GP’s stability fixes.
- “GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium” (Heusel et al., 2017). The paper that introduced FID (Fréchet Inception Distance) alongside the two-time-scale GAN training rule. Section 6 has the FID derivation and the original empirical comparison; Section 7 has the convergence results that justified the FID’s sensitivity properties.
- “Improved Precision and Recall Metric for Assessing Generative Models” (Kynkäänniemi et al., 2019). The current standard precision-recall metric for distributions, which improved on the original Sajjadi et al. (2018) formulation. Useful when you need to separate quality from coverage rather than collapsing them into a single FID number.
- “A Note on the Inception Score” (Barratt and Sharma, 2018). A short, sharp critique of IS that explains why it can be gamed and what it misses. Read after the original IS paper to calibrate expectations; this is the standard reference for “do not use IS as your only metric.”
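As a quick orientation before opening the IS and FID papers above, the two metrics are usually written as follows. This is a notation sketch, not a transcription of either paper: the symbols are assumptions chosen here, with p_g the generator's sample distribution, p(y | x) the Inception classifier's label distribution for image x, and (μ_r, Σ_r), (μ_g, Σ_g) the mean and covariance of Inception features for real and generated images.

```latex
% Inception Score (Salimans et al., 2016): higher is better.
\mathrm{IS} = \exp\!\Big( \mathbb{E}_{x \sim p_g} \big[ D_{\mathrm{KL}}\big( p(y \mid x) \,\|\, p(y) \big) \big] \Big),
\qquad p(y) = \mathbb{E}_{x \sim p_g}\big[ p(y \mid x) \big]

% Frechet Inception Distance (Heusel et al., 2017): lower is better.
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\big( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \big)
```

IS only looks at generated samples, which is part of why the Barratt and Sharma note cautions against using it alone; FID compares generated features against real ones, but still folds quality and coverage into one number, which is the gap the precision-recall metric addresses.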
Adjacent topics
Where this sits in the track.
- Maximum likelihood and the KL view (L3). L3 established likelihood as the natural training objective for the paradigms that can compute it. This lesson covers the other side: how do you compare paradigms that cannot all give you a likelihood number? The answer (sample-based metrics plus paradigm-specific instruments) is exactly the kind of tooling L3 said would be needed when likelihood is unavailable.
- VAE training in practice (L6). L6 noted that a VAE’s reported likelihood is the ELBO (a lower bound). This lesson formalizes that: VAE evaluation needs the ELBO plus sample-based metrics, because the bound alone is not enough information.
- GANs (L7) and WGAN-GP (L8). Both lessons named FID and IS as the evaluation methods for adversarial training (and named the Wasserstein critic estimate as the WGAN training-stability instrument). This lesson builds out those metrics with full formulas and worked examples.
- Diffusion models (L12-14, coming). Diffusion adds a step-count axis to image-generation evaluation: FID at different step counts and the sample-quality-vs-step-count Pareto frontier are the standard way to characterize a diffusion model’s trade-offs. This lesson’s evaluation-methods-as-paradigm-fingerprint framing makes those diffusion-specific metrics easier to read when they appear.
- The four-paradigm landscape (L15). The capstone returns to L1’s map with all paradigms fully unpacked. The evaluation table from this lesson is one of the cross-paradigm tools the capstone will lean on to compare modern systems (Stable Diffusion, autoregressive LLMs, GAN-family models) on a common footing.