Evaluating generative models: brief

What you’ll learn

This is lesson 9 of Track 19 (Generative Models and Diffusion), and it closes Phase 2 (latent-variable and adversarial paradigms). By the end you will know why likelihood is not a universal cross-paradigm metric: different paradigms compute it differently, or not at all. You will know how the standard sample-based image metrics (Inception Score, Fréchet Inception Distance) work and what each measures and misses, how precision and recall for distributions separate sample quality from coverage, and where human preference studies sit in the metric hierarchy. You will also be able to read the evaluation-methods-as-paradigm-fingerprint pattern across all four paradigms in the track. The source curriculum is Stanford CS236 Lecture 15.

Where this fits

This is lesson 9 of 15, the fifth and last lesson of Phase 2. Phase 2 covered the latent-variable and adversarial paradigms (VAEs in lessons 5-6, GANs in lessons 7-8); this lesson closes the phase by handling the cross-paradigm comparison question those paradigms surface. Phase 3 opens with lesson 10 (energy-based models, the partition-function problem), then score matching (L11), full diffusion in three lessons (L12-14, where the evaluation toolkit from this lesson will return with diffusion-specific instruments), and the synthesis capstone at L15.

Before you start

Prerequisites: all of Phase 2, especially L6 (VAE, ELBO as lower bound) and L8 (WGAN-GP, the critic’s Wasserstein estimate as training-stability instrument). L3’s forward-KL framework is reused implicitly. Math background: comfort with expectations, KL divergence, and multivariate-Gaussian parameter intuition (mean and covariance); no calculus or new derivations are introduced in this lesson.

About the math

This lesson has the lightest math density in Phase 2. The IS formula is one equation; the FID formula reduces in 1D to a two-term expression that you can compute by hand; the practice extends both to slightly larger cases. The remaining content is conceptual placement: which metric measures what, which paradigm uses which metric, when to use multiple metrics, and what each metric misses. The deeper pedagogical work is on the paradigm-fingerprint framing, which is conceptual rather than algebraic.
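
For orientation, here is a sketch of the two formulas mentioned above, assuming the lesson uses the standard definitions of IS and FID. The symbols p_g (the generator's sample distribution), p(y | x) (the classifier's class distribution for sample x), p(y) (the marginal over generated samples), and Σ_r, Σ_g (feature covariances) are notation introduced here for illustration; only μ_r, μ_g, σ_r, σ_g appear in this brief.

```latex
\begin{align*}
\mathrm{IS} &= \exp\Big( \mathbb{E}_{x \sim p_g}\big[ D_{\mathrm{KL}}\big( p(y \mid x) \,\|\, p(y) \big) \big] \Big),
\qquad p(y) = \mathbb{E}_{x \sim p_g}\big[ p(y \mid x) \big] \\[4pt]
\mathrm{FID} &= \lVert \mu_r - \mu_g \rVert^2
  + \operatorname{Tr}\big( \Sigma_r + \Sigma_g - 2 (\Sigma_r \Sigma_g)^{1/2} \big) \\[4pt]
% In 1D the covariances are scalars: \Sigma_r = \sigma_r^2, \Sigma_g = \sigma_g^2
\mathrm{FID}_{\mathrm{1D}} &= (\mu_r - \mu_g)^2 + \sigma_r^2 + \sigma_g^2 - 2\,\sigma_r \sigma_g
  = (\mu_r - \mu_g)^2 + (\sigma_r - \sigma_g)^2
\end{align*}
```

The last line is why the 1D case can be done by hand: the matrix square root becomes an ordinary product of standard deviations, and the trace term collapses to a perfect square.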

By the end, you’ll be able to

Explain why likelihood is not a universal cross-paradigm evaluation metric (paradigms differ in whether and how they compute it; units differ; quality can decouple from likelihood)
Compute the Inception Score (IS) from per-sample class distributions and a marginal, and explain what it measures and misses
Compute the Fréchet Inception Distance (FID) using the 1D closed-form FID = (μ_r − μ_g)² + (σ_r − σ_g)², where σ is a standard deviation (not a variance), and explain what FID measures and misses
Distinguish precision (sample quality) from recall (coverage) for distributions, and place human preference studies in the metric hierarchy
Apply the evaluation-methods-as-paradigm-fingerprint framing to match the right metric suite to a given paradigm and deployment question
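
The two computational objectives above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the standard Inception Score definition (the exponential of the mean KL divergence from each per-sample class distribution to the marginal) and the 1D FID form stated in the list. The function names `inception_score` and `fid_1d` and the example inputs are hypothetical, chosen here for illustration and not taken from the lesson.

```python
import math


def inception_score(cond_probs):
    """Inception Score from per-sample class distributions.

    cond_probs: list of lists; each inner list is p(y|x) for one sample
    and is assumed to sum to 1. (Hypothetical helper, not a library API.)
    """
    n = len(cond_probs)
    k = len(cond_probs[0])
    # Marginal p(y): average the per-sample distributions class by class.
    marginal = [sum(p[j] for p in cond_probs) / n for j in range(k)]
    # Mean over samples of KL(p(y|x) || p(y)); zero-probability terms contribute 0.
    mean_kl = 0.0
    for p in cond_probs:
        kl = sum(pj * math.log(pj / mj) for pj, mj in zip(p, marginal) if pj > 0)
        mean_kl += kl / n
    return math.exp(mean_kl)


def fid_1d(mu_r, sigma_r, mu_g, sigma_g):
    """1D FID; sigma_r and sigma_g are standard deviations, not variances."""
    return (mu_r - mu_g) ** 2 + (sigma_r - sigma_g) ** 2


# Made-up inputs for illustration only.
confident_and_diverse = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
uninformative = [[1 / 3, 1 / 3, 1 / 3]] * 4

print(inception_score(confident_and_diverse))  # close to 3, the number of classes
print(inception_score(uninformative))          # close to 1, the minimum
print(fid_1d(0.0, 1.0, 1.0, 2.0))              # 2.0
```

The two IS inputs mark the extremes: confident, evenly spread predictions push the score toward the number of classes, while predictions identical to the marginal give the minimum of 1. If a practice problem states a variance, take its square root before passing it to `fid_1d`.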

Time and difficulty

Read time: about 14 minutes
Practice time: about 16 minutes (a six-question self-check, four 1D FID computations on different mean/variance settings, an Inception Score computation on a 3-class 4-sample case, and flashcards)
Difficulty: standard (the lightest math density in Phase 2; one IS formula, one closed-form 1D FID; the work is conceptual placement and cross-paradigm fingerprint reading)