
Cheatsheet: Evaluating generative models

| Paradigm | Likelihood available? |
| --- | --- |
| Autoregressive | Exact (chain rule) |
| Normalizing flows | Exact (change of variables) |
| VAE | Lower bound only (the ELBO) |
| GAN | Not at all (no density anywhere) |
| Diffusion | Indirect (ELBO bound from training; ODE-based exact at extra cost) |

Even when a likelihood is available, it can decouple from sample quality, and its units differ across paradigms (per-token NLL vs per-pixel bits-per-dim). Cross-paradigm likelihood comparisons are usually meaningless.
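
To make the unit mismatch concrete, here is a minimal sketch of the two conversions usually applied to a raw negative log-likelihood measured in nats. The function names and the example numbers are illustrative assumptions, not values from any real model.

```python
import math


def bits_per_dim(total_nll_nats: float, num_dims: int) -> float:
    """Convert a total NLL in nats into bits per dimension."""
    return total_nll_nats / (num_dims * math.log(2))


def perplexity(total_nll_nats: float, num_tokens: int) -> float:
    """Perplexity = exp(mean per-token NLL in nats)."""
    return math.exp(total_nll_nats / num_tokens)


# Hypothetical numbers, chosen only to show the different scales.
image_bpd = bits_per_dim(total_nll_nats=7000.0, num_dims=32 * 32 * 3)
text_ppl = perplexity(total_nll_nats=300.0, num_tokens=100)
print(f"image model: {image_bpd:.2f} bits/dim")
print(f"text model:  perplexity {text_ppl:.2f}")
```

The two outputs are normalized by different quantities (dimensions vs tokens) and live on different scales, so putting them side by side says nothing about which model is better.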

IS = exp( E_{x ~ p_model}[ KL( p(y | x) || p(y) ) ] )

p(y | x) is Inception’s class distribution for sample x; p(y) is the marginal class distribution, averaged over all generated samples. IS is higher when samples are individually classifiable AND collectively diverse.
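
A minimal sketch of the IS formula, assuming the class probabilities p(y | x) have already been computed by a classifier and stacked into an array of shape (num_samples, num_classes). The `eps` smoothing term is an implementation assumption to avoid log(0).

```python
import numpy as np


def inception_score(probs: np.ndarray, eps: float = 1e-12) -> float:
    """IS = exp( mean over samples of KL( p(y|x) || p(y) ) )."""
    probs = np.asarray(probs, dtype=np.float64)
    marginal = probs.mean(axis=0, keepdims=True)  # p(y), averaged over samples
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))


num_classes = 4
uniform = np.full((8, num_classes), 1.0 / num_classes)  # unclassifiable samples
confident_diverse = np.eye(num_classes)                 # one confident sample per class
collapsed = np.tile(np.eye(num_classes)[0], (8, 1))     # confident, but a single class

print(inception_score(uniform))            # close to 1 (the minimum)
print(inception_score(confident_diverse))  # close to num_classes (the maximum)
print(inception_score(collapsed))          # close to 1 again: no diversity
```

The collapsed case shows why both conditions matter: confident predictions alone do not raise the score if every sample lands in the same class.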

| Measures well | Misses |
| --- | --- |
| Per-sample classifiability (rough quality proxy) | Realism vs the actual data distribution |
| Class diversity (rough coverage proxy) | Easily gamed by Inception-friendly samples |

Extract Inception features from real and generated images; fit Gaussians; compute Fréchet distance between them.

FID = ||μ_r − μ_g||² + Tr( Σ_r + Σ_g − 2 · sqrt(Σ_r · Σ_g) )

Lower is better; FID = 0 iff the Gaussians fitted to the two feature sets match (same mean and same covariance).
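
The FID formula can be sketched in NumPy as follows, assuming the features have already been extracted into two arrays of shape (num_samples, feature_dim) with feature_dim ≥ 1. This sketch does not compute a full matrix square root: it uses the fact that Tr(sqrt(Σ_r · Σ_g)) equals the sum of the square roots of the eigenvalues of Σ_r · Σ_g. The function name and the random stand-in features are illustrative assumptions.

```python
import numpy as np


def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two feature sets."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.atleast_2d(np.cov(feats_real, rowvar=False))
    sigma_g = np.atleast_2d(np.cov(feats_gen, rowvar=False))

    # Eigenvalues of a product of two PSD matrices are real and non-negative
    # in exact arithmetic; clip tiny negative values caused by round-off.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    mean_term = float(np.sum((mu_r - mu_g) ** 2))
    return mean_term + float(np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * trace_sqrt)


# Random vectors standing in for real Inception features (an assumption).
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 4))
shifted = real + np.array([1.0, 0.0, 0.0, 0.0])  # same covariance, mean moved by 1

print(frechet_distance(real, real))     # approximately 0
print(frechet_distance(real, shifted))  # approximately 1 (squared mean shift)
```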

In 1D: sqrt(σ_r² · σ_g²) = σ_r·σ_g, so the trace term becomes (σ_r − σ_g)².

FID_1D = (μ_r − μ_g)² + (σ_r − σ_g)²

| Real (μ_r, σ_r) | Generated (μ_g, σ_g) | FID_1D |
| --- | --- | --- |
| (0, 1) | (0, 1) | 0 (match) |
| (0, 1) | (1, 1) | 1 (mean shifted) |
| (0, 1) | (1, 1.5) | 1.25 (mean shifted + variance wider) |
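
The 1D values can be checked with a few lines of Python. This is only a sketch of the FID_1D formula; the function name is an illustrative choice.

```python
def fid_1d(mu_r: float, sigma_r: float, mu_g: float, sigma_g: float) -> float:
    """1D Fréchet distance: squared mean gap plus squared std-dev gap."""
    return (mu_r - mu_g) ** 2 + (sigma_r - sigma_g) ** 2


cases = [
    ((0.0, 1.0), (0.0, 1.0)),  # identical distributions
    ((0.0, 1.0), (1.0, 1.0)),  # mean shifted
    ((0.0, 1.0), (1.0, 1.5)),  # mean shifted and variance wider
]
for real, gen in cases:
    print(real, gen, fid_1d(*real, *gen))
```
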

| Measures well | Misses |
| --- | --- |
| Distance between generated and real feature distributions | Inception bias (trained on ImageNet; out-of-distribution domains less reliable) |
| Sensitive to mode dropping (covariance mismatch) | Needs ~10k+ samples to stabilize |

| Tool | What it adds |
| --- | --- |
| Precision / Recall for distributions | Separates sample quality (precision: fraction inside real region) from coverage (recall: fraction of real region reached) |
| Human preference studies | Ground truth for “does this look real?”; expensive but unbiased |
| Task-specific metrics | BLEU/ROUGE (text), FAD (audio), CLIP score (text-image alignment) |
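
One way to make "inside the real region" operational is to approximate each region by a union of balls around the sample points, with each ball's radius set to the distance to that point's k-th nearest neighbour. The sketch below assumes that k-NN construction, which is one common choice and not necessarily the estimator this cheatsheet has in mind. All function names and the toy data are illustrative.

```python
import numpy as np


def _pairwise_dist(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Euclidean distances between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def _knn_radii(points: np.ndarray, k: int) -> np.ndarray:
    """Distance from each point to its k-th nearest neighbour in the same set."""
    dists = np.sort(_pairwise_dist(points, points), axis=1)
    return dists[:, k]  # column 0 is the zero distance to itself


def _fraction_covered(queries: np.ndarray, support: np.ndarray, k: int) -> float:
    """Fraction of queries lying inside at least one k-NN ball of the support set."""
    radii = _knn_radii(support, k)
    inside = _pairwise_dist(queries, support) <= radii[None, :]
    return float(inside.any(axis=1).mean())


def precision_recall(real: np.ndarray, gen: np.ndarray, k: int = 3):
    precision = _fraction_covered(gen, real, k)  # generated samples inside real region
    recall = _fraction_covered(real, gen, k)     # real samples reached by generated region
    return precision, recall


# Toy mode-dropping case: real data has two clusters, the generator covers one.
rng = np.random.default_rng(0)
cluster_a = rng.normal(0.0, 1.0, size=(100, 2))
cluster_b = rng.normal(50.0, 1.0, size=(100, 2))
real = np.vstack([cluster_a, cluster_b])
gen = cluster_a.copy()
print(precision_recall(real, gen))  # high precision, recall around one half
```

In this toy case a single distance number would blur the two effects together, while the pair reports that the samples are plausible but half of the real data is never reached.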

Evaluation-methods-as-paradigm-fingerprint

| Paradigm | Primary evaluation | Secondary | Cannot directly measure |
| --- | --- | --- | --- |
| Autoregressive (LLMs, PixelRNN) | Perplexity = exp(NLL/token) | BLEU/ROUGE; bits-per-dim | Sample quality decoupled from likelihood |
| Normalizing flows | NLL (bits-per-dim for images) | FID | Inference latency in real-time cases |
| VAEs | ELBO (lower bound on log-likelihood, i.e. upper bound on NLL) | FID, reconstruction MSE | Exact likelihood |
| GANs (original + WGAN-GP) | FID, IS, precision/recall, human studies | WGAN critic’s Wasserstein estimate (training stability) | Any likelihood number |
| Diffusion | FID across step counts; CLIP score; quality-vs-step Pareto | ELBO bound on NLL; ODE-based exact NLL (extra cost) | Single-step sampling speed |

Match the metric to the deployment question. Use multiple metrics to guard against gaming.

  1. What paradigm is the model? Determines which metrics are available at all.
  2. What deployment question are you answering? Sample-realism (FID + human studies); coverage (FID + precision/recall); next-token surprise (perplexity); text-image alignment (CLIP score).
  3. How many samples can you generate? FID needs ~10k+; IS is more forgiving; human studies need careful sample selection.
  4. Are you comparing across paradigms? Use paradigm-agnostic metrics only (FID, IS, human studies); never compare likelihood numbers.
  • Comparing likelihoods across paradigms. Units differ; some paradigms have only a bound; some have none.
  • Treating one metric as ground truth. Each measures one aspect; use multiple.
  • FID without enough samples. Below ~10k, the metric is noisy. Variance can swamp signal.
  • Assuming high training likelihood means good generation. It does not: memorization achieves high training likelihood with no useful generalization. Use held-out likelihood AND sample quality on novel inputs.
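
The sample-size pitfall can be illustrated with a toy 1D simulation: both sample sets are drawn from the same Gaussian, so the true distance is 0 and any positive value is pure estimation noise. This is a sketch under simplifying assumptions (1D data, standard-library random numbers, arbitrary sample sizes); real Inception features are high-dimensional, which is where the ~10k guidance comes from, so only the trend carries over.

```python
import math
import random


def _mean_std(xs):
    """Sample mean and (n - 1)-normalized standard deviation."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    return mean, math.sqrt(var)


def fid_1d_from_samples(xs, ys) -> float:
    """1D Fréchet distance estimated from two finite samples."""
    mu_x, sd_x = _mean_std(xs)
    mu_y, sd_y = _mean_std(ys)
    return (mu_x - mu_y) ** 2 + (sd_x - sd_y) ** 2


def mean_fid_same_distribution(n: int, repeats: int = 100, seed: int = 0) -> float:
    """Average estimated distance between two size-n samples of the SAME N(0, 1)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        ys = [rng.gauss(0.0, 1.0) for _ in range(n)]
        scores.append(fid_1d_from_samples(xs, ys))
    return sum(scores) / len(scores)


small = mean_fid_same_distribution(n=20)
large = mean_fid_same_distribution(n=2000)
print(f"n=20:   mean estimated distance {small:.5f}")
print(f"n=2000: mean estimated distance {large:.5f}")
```

The small-sample estimate sits well above zero even though the two distributions are identical, which is the kind of noise floor that can swamp a real difference between models.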

Likelihood is not a universal metric, so generative models are evaluated by a paradigm-specific suite (perplexity / FID / IS / precision-recall / human studies); each paradigm has its own fingerprint of instruments, and matching the metric to the deployment question is more important than reaching for the most famous one.