Summary: Evaluating generative models

Phase 2 closes here, with the question that has been hovering over the last four lessons: how do you compare a VAE to a GAN to a diffusion model when likelihood is exact for some paradigms, bounded for others, and unavailable for the rest? The answer is a toolkit of sample-based and paradigm-specific metrics, not a single universal number. The whole lesson reduces to one line: likelihood is not a universal metric, so generative models are evaluated by a paradigm-specific suite (perplexity / FID / IS / precision-recall / human studies); each paradigm has its own fingerprint of instruments, and matching the metric to the deployment question matters more than reaching for the most famous one. This is the scan-it-in-five-minutes version.

Core ideas

Likelihood is not a universal cross-paradigm metric. Exact for autoregressive (chain rule) and flows (change of variables). Lower bound only for VAEs (the ELBO undershoots log p_model(x) by KL(q || p_posterior)). Unavailable for GANs (no density anywhere in the training pipeline). Indirect for diffusion (ELBO bound from training, exact NLL only at extra ODE-based cost). Even when present, likelihood can decouple from sample quality, and units differ across paradigms (per-token NLL for autoregressive models vs bits-per-dim for image flows), so the raw numbers are not directly comparable.
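The VAE gap mentioned above can be written out as a single identity. The notation here is an assumption layered on the lesson's shorthand: z is the latent variable, q(z | x) is the encoder, and p_model(z | x) is the true posterior that the text calls p_posterior.

```latex
\log p_{\text{model}}(x)
  = \underbrace{\mathbb{E}_{q(z \mid x)}\!\left[\log p_{\text{model}}(x, z) - \log q(z \mid x)\right]}_{\text{ELBO}}
  + \underbrace{\mathrm{KL}\!\left(q(z \mid x) \,\|\, p_{\text{model}}(z \mid x)\right)}_{\geq\, 0}
```

Because the KL term is non-negative, the ELBO can only sit at or below the true log-likelihood, and the two coincide only when the encoder matches the true posterior exactly.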
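The Inception Score (IS) scores generated samples through the class predictions p(y | x) of a pretrained Inception classifier: IS = exp(E[KL(p(y | x) || p(y))]), where the expectation runs over generated samples x and p(y) is the class distribution averaged over those samples. High when samples are individually classifiable AND the marginal class distribution is spread (diverse). Limits: does not compare to real data; gameable by Inception-friendly samples.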
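A minimal sketch of the IS formula above in plain Python, assuming the per-sample class probabilities p(y | x) have already been produced by some classifier. The function name `inception_score` and the toy two-class inputs are illustrative and not part of the lesson.

```python
import math


def inception_score(probs):
    """Compute exp(E_x[KL(p(y|x) || p(y))]) from rows of class probabilities.

    probs: list of rows, one per generated sample, each row summing to 1.
    """
    n = len(probs)
    k = len(probs[0])
    # Marginal class distribution p(y): average of p(y|x) over the samples.
    marginal = [sum(row[j] for row in probs) / n for j in range(k)]
    # Mean KL divergence; terms with p = 0 contribute 0 and are skipped.
    mean_kl = 0.0
    for row in probs:
        mean_kl += sum(p * math.log(p / m) for p, m in zip(row, marginal) if p > 0)
    mean_kl /= n
    return math.exp(mean_kl)


# Toy cases (illustrative inputs, not real Inception outputs).
confident_diverse = [[1.0, 0.0], [0.0, 1.0]]    # sharp predictions, both classes used
confident_collapsed = [[1.0, 0.0], [1.0, 0.0]]  # sharp predictions, one class only
uncertain = [[0.5, 0.5], [0.5, 0.5]]            # classifier cannot tell classes apart

print(inception_score(confident_diverse))
print(inception_score(confident_collapsed))
print(inception_score(uncertain))
```

With two classes the score is bounded by 2, reached only in the confident-and-diverse case; both the collapsed and the uncertain case fall to the minimum of 1. Note that no real data enters the computation anywhere, which is the first limit listed above.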
The Fréchet Inception Distance (FID) fits a Gaussian to the Inception features of the real data and another to those of the generated samples, then measures the Fréchet distance between the two: FID = ||μ_r − μ_g||² + Tr(Σ_r + Σ_g − 2·sqrt(Σ_r·Σ_g)), where sqrt is the matrix square root. In 1D this collapses cleanly to FID_1D = (μ_r − μ_g)² + (σ_r − σ_g)², the sum of the squared mean gap and the squared standard-deviation gap. Limits: Inception bias (the feature extractor was trained on ImageNet); needs ~10k+ samples to stabilize.
Worked 1D FID anchors, with each pair written as (μ, σ): (0,1) vs (2,1) → FID = 4 (mean shifted by 2); (0,1) vs (0,2) → FID = 1 (standard deviation one unit wider); (0,1) vs (2,2) → FID = 5 (both); (0,1) vs (0,1) → FID = 0 (match).
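The anchors above can be checked with a few lines of Python. This is a sketch that reads each pair as (mean, standard deviation). The helper names `fid_1d` and `fid_general_1d` are illustrative; the second one evaluates the general formula with 1×1 covariances to show that it reduces to the 1D shortcut.

```python
import math


def fid_1d(mu_r, sigma_r, mu_g, sigma_g):
    """1D shortcut: squared mean gap plus squared standard-deviation gap."""
    return (mu_r - mu_g) ** 2 + (sigma_r - sigma_g) ** 2


def fid_general_1d(mu_r, var_r, mu_g, var_g):
    """General FID formula with 1x1 covariances (trace and matrix sqrt become scalar ops)."""
    return (mu_r - mu_g) ** 2 + var_r + var_g - 2.0 * math.sqrt(var_r * var_g)


# ((mu_r, sigma_r), (mu_g, sigma_g), expected FID from the worked anchors)
anchors = [
    ((0, 1), (2, 1), 4),  # mean shifted by 2
    ((0, 1), (0, 2), 1),  # standard deviation one unit wider
    ((0, 1), (2, 2), 5),  # both
    ((0, 1), (0, 1), 0),  # exact match
]

for (mu_r, s_r), (mu_g, s_g), expected in anchors:
    shortcut = fid_1d(mu_r, s_r, mu_g, s_g)
    general = fid_general_1d(mu_r, s_r ** 2, mu_g, s_g ** 2)
    assert shortcut == expected
    assert math.isclose(general, expected, abs_tol=1e-12)
    print((mu_r, s_r), "vs", (mu_g, s_g), "->", shortcut)
```

The agreement between the two functions follows from σ_r² + σ_g² − 2·σ_r·σ_g = (σ_r − σ_g)².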
Precision and recall for distributions separate sample quality (precision = fraction of generated samples in the real-data region) from coverage (recall = fraction of the real-data region covered by generated samples). FID collapses the two into one number; precision-recall keeps them distinct.
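One way to make "region" concrete is a k-nearest-neighbour estimate: treat the union of balls around each point of a sample set as that set's region. The lesson does not specify an estimator, so this choice is an assumption made for illustration, as are the function names and the 1D toy data. In practice the points would be feature vectors rather than scalars.

```python
def knn_radius(points, k):
    """Distance from each point to its k-th nearest neighbour within the same set."""
    radii = []
    for i, p in enumerate(points):
        dists = sorted(abs(p - q) for j, q in enumerate(points) if j != i)
        radii.append(dists[k - 1])
    return radii


def coverage(queries, support, k=1):
    """Fraction of queries lying inside at least one k-NN ball around a support point."""
    radii = knn_radius(support, k)
    hits = sum(
        any(abs(q - s) <= r for s, r in zip(support, radii))
        for q in queries
    )
    return hits / len(queries)


def precision_recall(real, generated, k=1):
    precision = coverage(generated, real, k)  # generated samples inside the real region
    recall = coverage(real, generated, k)     # real samples inside the generated region
    return precision, recall


# Toy 1D data (illustrative): real data has two modes, the generator found only one.
real = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
generated = [0.5, 1.0, 1.5]

precision, recall = precision_recall(real, generated)
print(precision, recall)  # 1.0 0.5
```

Every generated sample is plausible (precision 1.0), but half of the real data is never covered (recall 0.5). This is the mode-collapse signature that a single combined number would blur.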
Human preference studies are the ground truth for “does this look real?” Raters are shown (real, generated) pairs and asked to pick the real one; if they do no better than chance, the generator has succeeded. Expensive, but free of the feature-extractor bias built into the automated metrics.
Evaluation-methods-as-paradigm-fingerprint. Each paradigm has a characteristic suite: autoregressive → perplexity = exp(mean NLL per token); flow → NLL (bits-per-dim for images); VAE → ELBO + FID; GAN → FID/IS/precision-recall/human studies, plus (for WGANs) the critic’s Wasserstein estimate as a training-stability instrument; diffusion → FID across step counts + ELBO-bound NLL + ODE-based exact NLL + CLIP score for text-image. Reading the suite reveals which questions a paradigm is positioned to answer.
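The two likelihood-based entries in that suite are simple rescalings of the negative log-likelihood. This sketch assumes NLL is measured in nats (natural log); the function names and toy inputs are illustrative.

```python
import math


def perplexity(token_probs):
    """exp of the mean per-token NLL, given the probability assigned to each observed token."""
    nll_per_token = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll_per_token)


def bits_per_dim(nll_nats, num_dims):
    """Convert a total NLL in nats into bits per dimension (e.g. per pixel channel)."""
    return nll_nats / (num_dims * math.log(2))


# A model that spreads probability evenly over 4 choices at every step
# has perplexity close to 4: it is "as confused as" a uniform 4-way guess.
print(perplexity([0.25, 0.25, 0.25, 0.25]))

# Hypothetical total NLL for one 32x32x3 image, chosen so the result is about 1 bit per dim.
num_dims = 32 * 32 * 3
print(bits_per_dim(num_dims * math.log(2), num_dims))
```

Both conversions are monotone in NLL, so they rank models of the same paradigm identically. They do not make a per-token number comparable to a per-dimension one.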
Two practical patterns: match the metric to the deployment question (sample-realism vs coverage vs next-token surprise vs text-image alignment), and use multiple metrics to guard against gaming (any single metric can be over-fit if it’s the only training signal).

What changes for you

Before this lesson, evaluating a generative model probably meant “look at the test loss.” Now it means “what is the deployment question, which metric answers it for this paradigm, and how do I cross-check with a second metric to guard against gaming?” When you next read a paper or model release that reports a single headline metric (a FID of X, a perplexity of Y, an IS of Z), you can ask the right follow-up: what does this metric not measure, and what would a complementary metric have caught? Phase 3 opens next with energy-based models, the warm-up to score matching and the diffusion paradigm; the evaluation toolkit you just built will return in lesson 14 with the diffusion-specific instruments (FID across step counts, sample-quality-vs-step-count Pareto frontiers) the diffusion paradigm requires.