Cheatsheet: Evaluating generative models
Why likelihood is not a universal metric
| Paradigm | Likelihood available? |
|---|---|
| Autoregressive | Exact (chain rule) |
| Normalizing flows | Exact (change of variables) |
| VAE | Lower bound only (the ELBO) |
| GAN | Not at all (no explicit density to evaluate) |
| Diffusion | Indirect (ELBO from training; ODE-based exact likelihood at extra cost) |
Even when a likelihood is available, it can decouple from sample quality, and its units differ across paradigms (per-token NLL for text vs bits-per-dim for images). Cross-paradigm likelihood comparisons are usually meaningless.
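
To make the unit mismatch concrete, here is a minimal sketch of the two conventions. The function names and the numbers are illustrative assumptions, not taken from any particular library; log-probabilities are assumed to be in nats.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean NLL per token), with log-probs in nats."""
    nll_per_token = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll_per_token)

def bits_per_dim(total_nll_nats, num_dims):
    """Convert a total NLL in nats into bits per dimension."""
    return total_nll_nats / (num_dims * math.log(2))

# Hypothetical inputs, for illustration only.
uniform_over_4 = [math.log(0.25)] * 10    # every token gets probability 1/4
ppl = perplexity(uniform_over_4)           # exp(ln 4), i.e. about 4
# A 32x32x3 image (3072 dims) whose total NLL is 3072 * ln 2 nats.
bpd = bits_per_dim(total_nll_nats=3072 * math.log(2), num_dims=3072)
print(ppl, bpd)
```

A perplexity of about 4 and a bits-per-dim of about 1 describe different quantities over different spaces, which is why the two numbers cannot be ranked against each other.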
Inception Score (IS)
IS = exp( E_{x ~ p_model}[ KL( p(y | x) || p(y) ) ] )
p(y | x) is Inception’s class distribution for sample x; p(y) is the marginal. Higher when samples are individually classifiable AND collectively diverse.
| Measures well | Misses |
|---|---|
| Per-sample classifiability (rough quality proxy) | Realism vs the actual data distribution |
| Class diversity (rough coverage proxy) | Easily gamed by Inception-friendly samples |
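
The IS formula can be sketched directly from a list of per-sample class distributions. This is a toy illustration under stated assumptions: the probabilities below are hand-made stand-ins for Inception's softmax outputs, and `inception_score` is a hypothetical helper name.

```python
import math

def inception_score(class_probs):
    """IS = exp(mean over x of KL(p(y|x) || p(y))) from per-sample class distributions."""
    n = len(class_probs)
    num_classes = len(class_probs[0])
    # Marginal p(y): average of the per-sample distributions.
    marginal = [sum(p[c] for p in class_probs) / n for c in range(num_classes)]
    total_kl = 0.0
    for p in class_probs:
        # Terms with p[c] == 0 contribute 0 to the KL sum by convention.
        total_kl += sum(pc * math.log(pc / marginal[c])
                        for c, pc in enumerate(p) if pc > 0)
    return math.exp(total_kl / n)

# Toy two-class inputs (assumed, not real classifier outputs).
confident_diverse = [[1.0, 0.0], [0.0, 1.0]]    # sharp and spread over classes
confident_collapsed = [[1.0, 0.0], [1.0, 0.0]]  # sharp but one class only
uncertain = [[0.5, 0.5], [0.5, 0.5]]            # classifier cannot decide

is_diverse = inception_score(confident_diverse)
is_collapsed = inception_score(confident_collapsed)
is_uncertain = inception_score(uncertain)
print(is_diverse, is_collapsed, is_uncertain)
```

Only the first case scores above the minimum of 1: the score rises when predictions are both sharp per sample and spread across classes. Nothing in the computation refers to real data.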
Fréchet Inception Distance (FID)
Extract Inception features from real and generated images; fit Gaussians; compute Fréchet distance between them.
FID = ||μ_r − μ_g||² + Tr( Σ_r + Σ_g − 2 · sqrt(Σ_r · Σ_g) )
Here sqrt is the matrix square root. Lower is better; FID = 0 iff the Gaussians fitted to the real and generated features have the same mean and covariance.
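
A minimal NumPy sketch of the formula follows, assuming the Inception features have already been extracted into `(n_samples, n_features)` arrays; random numbers stand in for real features here. The function names are hypothetical. The trace of the matrix square root is computed from eigenvalues, which is one of several possible routes; implementations often call a matrix square root routine such as `scipy.linalg.sqrtm` instead.

```python
import numpy as np

def frechet_distance(mu_r, sigma_r, mu_g, sigma_g):
    """Frechet distance between N(mu_r, sigma_r) and N(mu_g, sigma_g)."""
    mu_r, mu_g = np.atleast_1d(mu_r), np.atleast_1d(mu_g)
    sigma_r, sigma_g = np.atleast_2d(sigma_r), np.atleast_2d(sigma_g)
    mean_term = float(np.sum((mu_r - mu_g) ** 2))
    # Tr(sqrt(Sigma_r Sigma_g)) equals the sum of square roots of the
    # eigenvalues of Sigma_r Sigma_g, which are real and non-negative for
    # PSD inputs; clip tiny negative values caused by round-off.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    tr_sqrt = float(np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None))))
    return mean_term + float(np.trace(sigma_r) + np.trace(sigma_g)) - 2.0 * tr_sqrt

def fid_from_features(feats_real, feats_gen):
    """Fit a Gaussian to each (n_samples, n_features) array, then compare."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    return frechet_distance(mu_r, sigma_r, mu_g, sigma_g)

rng = np.random.default_rng(0)
real = rng.normal(size=(2000, 4))   # stand-in for Inception features (assumption)
same = frechet_distance(np.zeros(2), np.eye(2), np.zeros(2), np.eye(2))
shifted = frechet_distance(np.zeros(2), np.eye(2), np.ones(2), np.eye(2))
self_fid = fid_from_features(real, real)
print(same, shifted, self_fid)
```

Identical Gaussians give a distance of 0 up to round-off, and shifting the mean by one unit in each of two dimensions adds 2 through the mean term alone.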
1D FID collapse (worked example)
In 1D: sqrt(σ_r² · σ_g²) = σ_r·σ_g, so the trace term becomes σ_r² + σ_g² − 2·σ_r·σ_g = (σ_r − σ_g)².
FID_1D = (μ_r − μ_g)² + (σ_r − σ_g)²
| Real (μ_r, σ_r) | Generated (μ_g, σ_g) | FID_1D |
|---|---|---|
| (0, 1) | (0, 1) | 0 (match) |
| (0, 1) | (1, 1) | 1 (mean shifted) |
| (0, 1) | (1, 1.5) | 1.25 (mean shifted + variance wider) |
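
The three rows above can be reproduced with a few lines of arithmetic. This is only a check of the worked example; `fid_1d` is a hypothetical helper name.

```python
def fid_1d(mu_r, sigma_r, mu_g, sigma_g):
    """1D Frechet distance between N(mu_r, sigma_r^2) and N(mu_g, sigma_g^2)."""
    return (mu_r - mu_g) ** 2 + (sigma_r - sigma_g) ** 2

rows = [
    ((0.0, 1.0), (0.0, 1.0)),   # match
    ((0.0, 1.0), (1.0, 1.0)),   # mean shifted
    ((0.0, 1.0), (1.0, 1.5)),   # mean shifted + variance wider
]
table = [fid_1d(*real, *gen) for real, gen in rows]
print(table)
```
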
| Measures well | Misses |
|---|---|
| Distance between generated and real feature distributions | Inception bias (trained on ImageNet; out-of-distribution domains less reliable) |
| Sensitive to mode dropping (covariance mismatch) | Needs ~10k+ samples to stabilize |
Beyond image quality
| Tool | What it adds |
|---|---|
| Precision / Recall for distributions | Separates sample quality (precision: fraction of generated samples inside the real-data region) from coverage (recall: fraction of the real-data region reached by generated samples) |
| Human preference studies | Ground truth for “does this look real?”; expensive, but free of feature-extractor bias |
| Task-specific metrics | BLEU/ROUGE (text), FAD (audio), CLIP score (text-image alignment) |
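
One common way to estimate precision and recall for distributions is to approximate each "region" by k-nearest-neighbour balls around the points of a set. The sketch below assumes that estimator and uses hand-made 2D points in place of real feature vectors; all function names are hypothetical, and the quadratic-time loops are only meant for small toy inputs.

```python
import math

def knn_radii(points, k):
    """Distance from each point to its k-th nearest neighbour in the same set."""
    radii = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        radii.append(dists[k - 1])
    return radii

def fraction_covered(queries, support, k):
    """Fraction of queries that fall inside any k-NN ball around a support point."""
    radii = knn_radii(support, k)
    hits = sum(
        any(math.dist(q, s) <= r for s, r in zip(support, radii))
        for q in queries
    )
    return hits / len(queries)

def precision_recall(real, generated, k=1):
    precision = fraction_covered(generated, real, k)   # generated inside real region
    recall = fraction_covered(real, generated, k)      # real region reached
    return precision, recall

# Toy data (assumed): the real set has two modes, the generator kept only one.
mode_a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
mode_b = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
real = mode_a + mode_b
generated = list(mode_a)
precision, recall = precision_recall(real, generated)
print(precision, recall)
```

Here every generated point lies in the real region, so precision is 1.0, while only half of the real points are reached, so recall is 0.5. A single distance score would blend those two effects into one number.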
Evaluation methods as a paradigm fingerprint
| Paradigm | Primary evaluation | Secondary | Cannot directly measure |
|---|---|---|---|
| Autoregressive (LLMs, PixelRNN) | Perplexity = exp(mean NLL per token) | BLEU/ROUGE; bits-per-dim | Sample quality (can decouple from likelihood) |
| Normalizing flows | NLL (bits-per-dim for images) | FID | Inference latency in real-time cases |
| VAEs | ELBO (lower bound on log-likelihood, so −ELBO upper-bounds NLL) | FID, reconstruction MSE | Exact likelihood |
| GANs (original + WGAN-GP) | FID, IS, precision/recall, human studies | WGAN critic’s Wasserstein estimate (training stability) | Any likelihood number |
| Diffusion | FID across step counts; CLIP score; quality-vs-step Pareto | ELBO-based upper bound on NLL; ODE-based exact NLL (extra cost) | Single-step sampling speed |
Match the metric to the deployment question. Use multiple metrics to guard against gaming.
How to choose a metric
- What paradigm is the model? Determines which metrics are available at all.
- What deployment question are you answering? Sample-realism (FID + human studies); coverage (FID + precision/recall); next-token surprise (perplexity); text-image alignment (CLIP score).
- How many samples can you generate? FID needs ~10k+; IS is more forgiving; human studies need careful sample selection.
- Are you comparing across paradigms? Use only paradigm-agnostic metrics (FID, IS, human studies); never raw likelihood numbers.
Pitfalls to dodge
- Comparing likelihoods across paradigms. Units differ; some paradigms have only a bound; some have none.
- Treating one metric as ground truth. Each measures one aspect; use multiple.
- FID without enough samples. Below ~10k samples the estimate is noisy and biased upward; the estimation error can swamp real differences between models.
- Equating high training likelihood with good generation. Memorization yields high training likelihood with no useful generalization. Use held-out likelihood AND sample quality on novel inputs.
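
The sample-size pitfall can be illustrated with a small simulation, using the 1D form of FID as a stand-in for the full high-dimensional metric (an assumption made to keep the sketch dependency-free). Both samples are drawn from the same N(0, 1), so the true distance is 0 and any positive score is pure estimation error. The helper names, sample sizes, and trial counts are arbitrary choices.

```python
import random
import statistics

def fid_1d_from_samples(xs, ys):
    """1D Frechet distance between Gaussians fitted to two samples."""
    mean_gap = statistics.fmean(xs) - statistics.fmean(ys)
    std_gap = statistics.stdev(xs) - statistics.stdev(ys)
    return mean_gap ** 2 + std_gap ** 2

def mean_self_fid(n, trials, rng):
    """Average FID_1D between two independent N(0, 1) samples of size n."""
    scores = []
    for _ in range(trials):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        ys = [rng.gauss(0.0, 1.0) for _ in range(n)]
        scores.append(fid_1d_from_samples(xs, ys))
    return statistics.fmean(scores)

rng = random.Random(0)   # fixed seed so the run is repeatable
small_n = mean_self_fid(n=50, trials=20, rng=rng)
large_n = mean_self_fid(n=2000, trials=20, rng=rng)
print(small_n, large_n)
```

The small-sample average sits well above the large-sample one even though the two distributions are identical. In practice this means FID values are only comparable when computed with the same number of samples.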
The one-line version
Likelihood is not a universal metric, so generative models are evaluated by a paradigm-specific suite (perplexity / FID / IS / precision-recall / human studies); each paradigm has its own fingerprint of instruments, and matching the metric to the deployment question is more important than reaching for the most famous one.