References: Evaluating generative models

Source curricula (multi-source structural mirror; cited as further study):

PRIMARY (this lesson follows its framing most directly)
  • Stanford CS236, "Deep Generative Models", Lecture 15: Evaluation of Generative Models
    Instructor: Stefano Ermon
    Course URL: https://deepgenerativemodels.github.io/
    Syllabus: https://deepgenerativemodels.github.io/syllabus.html
    License: standard course-page link-out; cited as further study

SECONDARY (parallel framing where applicable; CS294-158's coverage of evaluation appears across the GAN, flow, and diffusion lectures rather than as a single dedicated lecture)
  • Berkeley CS294-158, "Deep Unsupervised Learning" (Spring 2024)
    Instructors: Pieter Abbeel, Wilson Yan, Kevin Frans, Philipp Wu
    Course URL: https://sites.google.com/view/berkeley-cs294-158-sp24/
    License: standard course-page link-out; cited as further study

Clawdemy's lessons are original prose that follows the pedagogical arc of these two courses, anchored on CS236's lecture order with CS294-158 framing pulled in where its slide deck and recording are stronger. We do not reproduce or transcribe the lectures; we cite them as the recommended companions. All rights to the original course materials remain with the respective instructors and institutions.

A short, durable list. Each link is a specific next step, not a generic pile.

Where this sits in the track.

  • Maximum likelihood and the KL view (L3). L3 established likelihood as the natural training objective for the paradigms that can compute it. This lesson covers the other side: how do you compare paradigms that cannot all give you a likelihood number? The answer, sample-based metrics plus paradigm-specific instruments, is the toolkit L3 said would be needed when likelihood is unavailable.

  • VAE training in practice (L6). L6 noted that a VAE’s reported likelihood is the ELBO, a lower bound on the true log-likelihood. This lesson formalizes the consequence: VAE evaluation needs the ELBO plus sample-based metrics, because the bound alone is not enough information.

  • GANs (L7) and WGAN-GP (L8). Both lessons named FID (Fréchet Inception Distance) and IS (Inception Score) as the evaluation metrics for adversarially trained models, and named the Wasserstein critic estimate as the WGAN training-stability instrument. This lesson builds out those metrics with full formulas and worked examples.

  • Diffusion models (L12-14, coming). Diffusion adds a step-count axis to image-generation evaluation: FID at different step counts and the sample-quality-vs-step-count Pareto frontier are the standard way to characterize a diffusion model’s trade-offs. This lesson’s evaluation-methods-as-paradigm-fingerprint framing makes those diffusion-specific metrics easier to read when they appear.

  • The four-paradigm landscape (L15). The capstone returns to L1’s map with all paradigms fully unpacked. The evaluation table from this lesson is one of the cross-paradigm tools the capstone will lean on to compare modern systems (Stable Diffusion, autoregressive LLMs, GAN-family models) on a common footing.
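
The diffusion item above mentions the sample-quality-vs-step-count Pareto frontier. As a minimal sketch of what that means, the Python below filters a list of (step count, FID) pairs down to the settings that no cheaper setting matches or beats. The function name pareto_frontier and every number in measurements are hypothetical, chosen only for illustration; they are not results from any model or from the cited courses. Lower FID is treated as better.

```python
def pareto_frontier(points):
    """Return the (steps, fid) pairs that are not dominated.

    A pair is dominated when another pair uses no more steps and
    reaches an FID that is at least as low.
    """
    frontier = []
    best_fid = float("inf")
    # Sort by step count (ties broken by FID) so cheaper settings come first.
    for steps, fid in sorted(points):
        # Keep a setting only if it strictly improves on every cheaper one.
        if fid < best_fid:
            frontier.append((steps, fid))
            best_fid = fid
    return frontier


# Hypothetical (sampling steps, FID) measurements, made up for this example.
measurements = [(10, 30.0), (25, 12.0), (50, 12.5), (100, 8.0), (250, 8.0)]

for steps, fid in pareto_frontier(measurements):
    print(f"{steps:>4} steps  FID {fid:.1f}")
```

With these made-up inputs, the 50-step and 250-step settings drop out: each costs more steps than an earlier setting without lowering FID. The remaining points are the trade-off curve a reader would compare across samplers.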