Summary: Evaluating generative models
Phase 2 closes here, with the question that has been hovering over the last four lessons: how do you compare a VAE to a GAN to a diffusion model when likelihood is exact for some paradigms, bounded for others, and unavailable for the rest? The answer is a toolkit of sample-based and paradigm-specific metrics, not a single universal number. The whole lesson reduces to one line: likelihood is not a universal metric, so generative models are evaluated by a paradigm-specific suite (perplexity / FID / IS / precision-recall / human studies); each paradigm has its own fingerprint of instruments, and matching the metric to the deployment question matters more than reaching for the most famous one. This is the scan-it-in-five-minutes version.
Core ideas
- Likelihood is not a universal cross-paradigm metric. Exact for autoregressive models (chain rule) and flows (change of variables). Lower bound only for VAEs (the ELBO undershoots `log p_model(x)` by `KL(q || p_posterior)`). Unavailable for GANs (no density anywhere in the training pipeline). Indirect for diffusion (ELBO bound from training, exact NLL only at extra ODE-based cost). Even when present, likelihood can decouple from sample quality, and units differ across paradigms.
- The Inception Score (IS) measures Inception’s view of generated samples: `IS = exp(E[KL(p(y | x) || p(y))])`. High when samples are individually classifiable AND the marginal class distribution is spread (diverse). Limits: does not compare to real data; gameable by Inception-friendly samples.
- The Fréchet Inception Distance (FID) compares the distributions of generated and real Inception features as Gaussians: `FID = ||μ_r − μ_g||² + Tr(Σ_r + Σ_g − 2·sqrt(Σ_r·Σ_g))`, where `sqrt` is the matrix square root. In 1D this collapses cleanly to `FID_1D = (μ_r − μ_g)² + (σ_r − σ_g)²`, the sum of the squared mean gap and the squared standard-deviation gap. Limits: Inception bias (trained on ImageNet); needs ~10k+ samples to stabilize.
- Worked 1D FID anchors, each Gaussian written as (μ, σ): `(0,1)` vs `(2,1)` → `FID = 4` (mean shifted by 2); `(0,1)` vs `(0,2)` → `FID = 1` (standard deviation one unit wider); `(0,1)` vs `(2,2)` → `FID = 5` (both); `(0,1)` vs `(0,1)` → `FID = 0` (match).
- Precision and recall for distributions separate sample quality (precision = fraction of generated samples in the real-data region) from coverage (recall = fraction of the real-data region covered). FID collapses the two; precision-recall keeps them distinct.
- Human preference studies are the ground truth for “does this look real?” Pairs of (real, generated) are shown to raters; if they do no better than chance, the generator has succeeded. Expensive, but not subject to a feature extractor’s bias.
- Evaluation-methods-as-paradigm-fingerprint. Each paradigm has a characteristic suite: autoregressive → perplexity = `exp(NLL/token)`; flow → NLL (bits-per-dim for images); VAE → ELBO + FID; GAN → FID/IS/precision-recall/human studies + (WGAN) critic’s Wasserstein estimate as training-stability instrument; diffusion → FID across step counts + ELBO-bound NLL + ODE-based exact NLL + CLIP score for text-image. Reading the suite reveals which questions a paradigm is positioned to answer.
- Two practical patterns: match the metric to the deployment question (sample-realism vs coverage vs next-token surprise vs text-image alignment), and use multiple metrics to guard against gaming (any single metric can be over-fit if it’s the only training signal).
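The Inception Score formula from the list above can be made concrete on toy classifier outputs. This is a minimal sketch, not a reference implementation: the function name `inception_score` and the hand-written probability rows are illustrative assumptions standing in for real Inception predictions.

```python
import math
from typing import Sequence


def inception_score(cond_probs: Sequence[Sequence[float]]) -> float:
    """exp of the mean KL(p(y|x) || p(y)) over samples.

    cond_probs[i][k] is a classifier's probability of class k for sample i
    (hypothetical values here, not real Inception outputs).
    """
    n = len(cond_probs)
    num_classes = len(cond_probs[0])
    # Marginal p(y): average the per-sample class distributions.
    marginal = [sum(row[k] for row in cond_probs) / n for k in range(num_classes)]
    total_kl = 0.0
    for row in cond_probs:
        # Terms with p(y|x) = 0 contribute nothing to the KL sum.
        total_kl += sum(p * math.log(p / m) for p, m in zip(row, marginal) if p > 0)
    return math.exp(total_kl / n)


# Confident AND diverse: each sample is a different class.
confident_diverse = [[1.0, 0.0], [0.0, 1.0]]
# Confident but collapsed: every sample is the same class.
collapsed = [[1.0, 0.0], [1.0, 0.0]]
# Diverse marginal but individually unclassifiable samples.
unsure = [[0.5, 0.5], [0.5, 0.5]]

for name, probs in [("diverse", confident_diverse), ("collapsed", collapsed), ("unsure", unsure)]:
    print(name, inception_score(probs))
```

With two classes the score tops out near 2 for the confident-and-diverse case and falls to 1 when either confidence or diversity is missing. Note that no real data enters the computation, which is the first limit named in the list.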
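The 1D FID anchors from the list above can be checked in a few lines. The sketch below assumes each Gaussian is given as (mean, standard deviation); the function names are illustrative. The second function evaluates the general trace formula with 1×1 covariances to show that it agrees with the collapsed 1D form.

```python
import math


def fid_1d(mu_r: float, sigma_r: float, mu_g: float, sigma_g: float) -> float:
    """Collapsed 1D form: squared mean gap plus squared std-dev gap."""
    return (mu_r - mu_g) ** 2 + (sigma_r - sigma_g) ** 2


def fid_1d_trace_form(mu_r: float, sigma_r: float, mu_g: float, sigma_g: float) -> float:
    """General formula with 1x1 covariances (variances) in place of matrices."""
    var_r, var_g = sigma_r ** 2, sigma_g ** 2
    # In 1D the trace is the scalar itself and the matrix square root is math.sqrt.
    return (mu_r - mu_g) ** 2 + var_r + var_g - 2.0 * math.sqrt(var_r * var_g)


# (real, generated) pairs, each as (mean, std-dev).
anchors = [((0, 1), (2, 1)), ((0, 1), (0, 2)), ((0, 1), (2, 2)), ((0, 1), (0, 1))]
for real, gen in anchors:
    print(real, "vs", gen, "->", fid_1d(*real, *gen), fid_1d_trace_form(*real, *gen))
```

Both forms give 4, 1, 5, and 0 for the four anchors, since `σ_r² + σ_g² − 2·σ_r·σ_g = (σ_r − σ_g)²`.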
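To see how precision and recall separate quality from coverage, here is a deliberately simplified sketch. It assumes the "region" of a sample set is approximated by a union of k-nearest-neighbour balls around its points, and it works on raw 1D values instead of learned features. The helper names and the toy data are hypothetical.

```python
from typing import List, Sequence, Tuple


def knn_radii(points: Sequence[float], k: int) -> List[float]:
    """Distance from each point to its k-th nearest neighbour in the same set."""
    radii = []
    for i, p in enumerate(points):
        dists = sorted(abs(p - q) for j, q in enumerate(points) if j != i)
        radii.append(dists[k - 1])
    return radii


def fraction_covered(queries: Sequence[float], support: Sequence[float], k: int = 1) -> float:
    """Fraction of queries that fall inside some support point's k-NN ball."""
    radii = knn_radii(support, k)
    hits = sum(
        any(abs(q - s) <= r for s, r in zip(support, radii)) for q in queries
    )
    return hits / len(queries)


def precision_recall(real: Sequence[float], generated: Sequence[float], k: int = 1) -> Tuple[float, float]:
    precision = fraction_covered(generated, real, k)  # generated inside real region
    recall = fraction_covered(real, generated, k)     # real inside generated region
    return precision, recall


# Toy mode collapse: real data spans four modes, the generator hugs one.
real = [0.0, 1.0, 2.0, 3.0]
generated = [0.0, 0.1, 0.2, 0.3]
print(precision_recall(real, generated))  # (1.0, 0.25)
```

Every generated point lands in the real region (precision 1.0), but only one of the four real points is covered (recall 0.25). A single distance such as FID would fold both facts into one number.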
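The likelihood-based entries in the fingerprint differ mainly in units. As a small sketch, assuming natural-log NLL throughout, perplexity and bits-per-dim can be computed as below; the function names and the toy inputs are illustrative.

```python
import math
from typing import Sequence


def perplexity(token_probs: Sequence[float]) -> float:
    """exp(mean negative log-likelihood per token), with NLL in nats."""
    nll = -sum(math.log(p) for p in token_probs)
    return math.exp(nll / len(token_probs))


def bits_per_dim(total_nll_nats: float, num_dims: int) -> float:
    """Convert a total NLL in nats into bits per dimension."""
    return total_nll_nats / (num_dims * math.log(2))


# A model that assigns probability 1/4 to every observed token is as
# uncertain as a uniform choice among 4 tokens.
print(perplexity([0.25, 0.25, 0.25, 0.25]))

# A 32x32x3 image (3072 dims) whose total NLL is 3072 * ln(2) nats
# costs one bit per dimension.
print(bits_per_dim(3072 * math.log(2), 3072))
```

The two numbers are not comparable with each other: one is per token on an exponential scale, the other per pixel channel in bits. This is the "units differ across paradigms" point in the first bullet.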
What changes for you
Before this lesson, evaluating a generative model probably meant “look at the test loss.” Now it means “what is the deployment question, which metric answers it for this paradigm, and how do I cross-check with a second metric to guard against gaming?” When you next read a paper or model release that reports a single headline metric (a FID of X, a perplexity of Y, an IS of Z), you can ask the right follow-up: what does this metric not measure, and what would a complementary metric have caught? Phase 3 opens next with energy-based models, the warm-up to score matching and the diffusion paradigm; the evaluation toolkit you just built will return in lesson 14 with the diffusion-specific instruments (FID across step counts, sample-quality-vs-step-count Pareto frontiers) the diffusion paradigm requires.