Evaluating generative models: cheatsheet

Why likelihood is not a universal metric

Paradigm	Likelihood available?
Autoregressive	Exact (chain rule)
Normalizing flows	Exact (change of variables)
VAE	Lower bound only (the ELBO)
GAN	None (implicit model; no explicit density)
Diffusion	Indirect (ELBO bound from training; ODE-based exact at extra cost)

Even when a likelihood is available, it can decouple from sample quality, and its units differ across paradigms (per-token NLL vs per-pixel bits-per-dim). Cross-paradigm likelihood comparisons are usually meaningless.
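
To make the unit mismatch concrete, here is a minimal Python sketch that turns a total NLL (in nats) into per-token perplexity and into bits-per-dim. The function names and the example numbers are illustrative assumptions, not taken from any particular library or experiment.

```python
import math

def perplexity(total_nll_nats: float, num_tokens: int) -> float:
    """Perplexity = exp(average NLL per token), with NLL measured in nats."""
    return math.exp(total_nll_nats / num_tokens)

def bits_per_dim(total_nll_nats: float, num_dims: int) -> float:
    """Average NLL per dimension, converted from nats to bits."""
    return total_nll_nats / (num_dims * math.log(2))

# Hypothetical numbers, for illustration only.
text_ppl = perplexity(total_nll_nats=300.0, num_tokens=100)      # exp(3.0)
image_bpd = bits_per_dim(total_nll_nats=6000.0, num_dims=3072)   # 32x32x3 image
print(f"perplexity = {text_ppl:.2f}, bits-per-dim = {image_bpd:.2f}")
```

The two outputs live on different scales and normalize by different counts (tokens vs dimensions), which is why placing them side by side says little about which model is better.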

Inception Score (IS)

IS = exp(  E_{x ~ p_model}[ KL( p(y | x)  ||  p(y) ) ]  )

p(y | x) is Inception’s class distribution for sample x; p(y) = E_{x ~ p_model}[ p(y | x) ] is the marginal over generated samples. IS is higher when samples are individually classifiable AND collectively diverse.

Measures well	Misses
Per-sample classifiability (rough quality proxy)	Realism vs the actual data distribution
Class diversity (rough coverage proxy)	Easily gamed by Inception-friendly samples
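
The IS formula above can be sketched in a few lines of NumPy. This is a minimal illustration that assumes you already have an (N, C) array of class probabilities; in practice those rows would come from an Inception classifier's softmax output, whereas the toy arrays below are hand-built assumptions chosen to show the two failure modes.

```python
import numpy as np

def inception_score(probs: np.ndarray, eps: float = 1e-12) -> float:
    """IS from an (N, C) array whose rows are p(y | x) for each sample."""
    p_y = probs.mean(axis=0, keepdims=True)  # marginal p(y) over the samples
    # Per-sample KL( p(y | x) || p(y) ); eps guards against log(0).
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

# Confident and diverse: each sample is a different class with certainty.
sharp_diverse = np.eye(4)
# Confident but collapsed: every sample is the same class.
sharp_collapsed = np.tile(np.array([1.0, 0.0, 0.0, 0.0]), (4, 1))
# Every sample is ambiguous across all classes.
blurry = np.full((4, 4), 0.25)

for name, p in [("sharp+diverse", sharp_diverse),
                ("collapsed", sharp_collapsed),
                ("blurry", blurry)]:
    print(name, round(inception_score(p), 3))
```

With C classes the score ranges from 1 (collapsed or ambiguous) up to C (confident and evenly spread), and nothing in the computation ever looks at real data.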

Fréchet Inception Distance (FID)

Extract Inception features from real and generated images; fit Gaussians; compute Fréchet distance between them.

FID = ||μ_r − μ_g||²  +  Tr( Σ_r + Σ_g − 2 · sqrt(Σ_r · Σ_g) )

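Lower is better. Here sqrt(·) is the matrix square root. FID = 0 iff the Gaussians fitted to the two feature sets match (equal means and covariances); the underlying distributions can still differ in higher moments.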
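
A minimal NumPy sketch of the FID formula follows. It assumes features have already been extracted into (N, D) arrays; the random arrays stand in for Inception features and are purely illustrative. Instead of forming the matrix square root explicitly, the sketch uses the fact that its trace equals the sum of the square roots of the eigenvalues of Σ_r · Σ_g, which is a design choice of this example rather than the only way to compute it.

```python
import numpy as np

def frechet_distance(mu_r, sigma_r, mu_g, sigma_g) -> float:
    """Fréchet distance between N(mu_r, sigma_r) and N(mu_g, sigma_g)."""
    mean_term = float(np.sum((mu_r - mu_g) ** 2))
    # Tr(sqrt(sigma_r @ sigma_g)) is the sum of square roots of the product's
    # eigenvalues, which are real and non-negative for PSD inputs; clip tiny
    # negative values caused by round-off.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g).real
    trace_sqrt = float(np.sum(np.sqrt(np.clip(eigvals, 0.0, None))))
    return mean_term + float(np.trace(sigma_r) + np.trace(sigma_g)) - 2.0 * trace_sqrt

def fid_from_features(feats_r: np.ndarray, feats_g: np.ndarray) -> float:
    """FID from two (N, D) feature arrays."""
    mu_r, mu_g = feats_r.mean(axis=0), feats_g.mean(axis=0)
    sigma_r = np.cov(feats_r, rowvar=False)
    sigma_g = np.cov(feats_g, rowvar=False)
    return frechet_distance(mu_r, sigma_r, mu_g, sigma_g)

rng = np.random.default_rng(0)
feats_real = rng.standard_normal((500, 8))   # stand-in for real-image features
feats_shifted = feats_real + 0.5             # same covariance, shifted mean
print(fid_from_features(feats_real, feats_real))      # approximately 0
print(fid_from_features(feats_real, feats_shifted))   # approximately 8 * 0.5**2 = 2.0
```

Shifting every feature by 0.5 leaves the covariance untouched, so only the mean term contributes.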

1D FID collapse (worked example)

In 1D: sqrt(σ_r² · σ_g²) = σ_r·σ_g, so the trace term becomes (σ_r − σ_g)².

FID_1D  =  (μ_r − μ_g)²  +  (σ_r − σ_g)²

Real (μ_r, σ_r)	Generated (μ_g, σ_g)	FID_1D
(0, 1)	(0, 1)	`0` (match)
(0, 1)	(1, 1)	`1` (mean shifted)
(0, 1)	(1, 1.5)	`1.25` (mean shifted + variance wider)
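
The table rows can be reproduced with a few lines of Python. The helper name fid_1d is an assumption made for this sketch.

```python
def fid_1d(mu_r, sigma_r, mu_g, sigma_g):
    """1D FID: squared mean gap plus squared standard-deviation gap."""
    return (mu_r - mu_g) ** 2 + (sigma_r - sigma_g) ** 2

rows = [((0, 1), (0, 1)), ((0, 1), (1, 1)), ((0, 1), (1, 1.5))]
results = [fid_1d(*real, *gen) for real, gen in rows]
for (real, gen), value in zip(rows, results):
    print(real, gen, value)
# prints 0, 1, and 1.25 in the last column
```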

Measures well	Misses
Distance between generated and real feature distributions	Inception bias (trained on ImageNet; out-of-distribution domains less reliable)
Sensitive to mode dropping (covariance mismatch)	Needs ~10k+ samples to stabilize

Beyond image quality

Tool	What it adds
Precision / Recall for distributions	Separates sample quality (precision: fraction of generated samples that fall inside the real region) from coverage (recall: fraction of real samples that fall inside the generated region)
Human preference studies	Closest thing to ground truth for “does this look real?”; expensive, and results depend on the rater pool and protocol
Task-specific metrics	BLEU/ROUGE (text), FAD (audio), CLIP score (text-image alignment)
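
Precision and recall need some estimate of the "real region" and the "generated region". One common choice is to cover each set with k-nearest-neighbour balls in feature space; the sketch below assumes that estimator, uses raw 2D toy points in place of learned features, and all function names are hypothetical. It simulates mode dropping: the real data has two modes and the generator covers only one.

```python
import numpy as np

def pairwise_dist(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Euclidean distances between every row of a and every row of b."""
    return np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1))

def knn_radii(points: np.ndarray, k: int) -> np.ndarray:
    """Distance from each point to its k-th nearest neighbour in the same set."""
    d = pairwise_dist(points, points)
    # After sorting, column 0 is the zero distance to the point itself,
    # so column k is the k-th nearest other point.
    return np.sort(d, axis=1)[:, k]

def coverage(queries: np.ndarray, support: np.ndarray, k: int) -> float:
    """Fraction of queries inside at least one k-NN ball around a support point."""
    radii = knn_radii(support, k)
    d = pairwise_dist(queries, support)
    return float((d <= radii[None, :]).any(axis=1).mean())

def precision_recall(real: np.ndarray, gen: np.ndarray, k: int = 3):
    precision = coverage(gen, real, k)  # generated samples inside the real region
    recall = coverage(real, gen, k)     # real samples inside the generated region
    return precision, recall

rng = np.random.default_rng(0)
real = np.concatenate([
    rng.normal(loc=(-5.0, 0.0), scale=1.0, size=(100, 2)),  # mode A
    rng.normal(loc=(5.0, 0.0), scale=1.0, size=(100, 2)),   # mode B
])
gen = rng.normal(loc=(5.0, 0.0), scale=1.0, size=(200, 2))  # only mode B
precision, recall = precision_recall(real, gen)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Precision stays high because the generated points look like real mode-B points, while recall drops to roughly half because mode A is never reached. A single FID number would blend these two effects.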

Evaluation methods as a paradigm fingerprint

Paradigm	Primary evaluation	Secondary	Cannot directly measure
Autoregressive (LLMs, PixelRNN)	Perplexity = exp(NLL/token)	BLEU/ROUGE; bits-per-dim	Sample quality (can decouple from likelihood)
Normalizing flows	NLL (bits-per-dim for images)	FID	Inference latency in real-time cases
VAEs	ELBO (lower bound on NLL)	FID, reconstruction MSE	Exact likelihood
GANs (original + WGAN-GP)	FID, IS, precision/recall, human studies	WGAN critic’s Wasserstein estimate (training stability)	Any likelihood number
Diffusion	FID across step counts; CLIP score; quality-vs-step Pareto	ELBO bound on NLL; ODE-based exact NLL (extra cost)	Single-step sampling speed

Match the metric to the deployment question. Use multiple metrics to guard against gaming.

How to choose a metric

What paradigm is the model? Determines which metrics are available at all.
What deployment question are you answering? Sample-realism (FID + human studies); coverage (FID + precision/recall); next-token surprise (perplexity); text-image alignment (CLIP score).
How many samples can you generate? FID needs ~10k+; IS is more forgiving; human studies need careful sample selection.
Are you comparing across paradigms? Use only paradigm-agnostic metrics (FID, IS, human studies); never compare likelihood numbers.

Pitfalls to dodge

Comparing likelihoods across paradigms. Units differ; some paradigms have only a bound; some have none.
Treating one metric as ground truth. Each measures one aspect; use multiple.
FID without enough samples. Below ~10k samples the estimate is noisy and biased upward, so sampling error can swamp real differences between models.
Assuming high training likelihood means good generation. It does not: memorization achieves high training likelihood with no useful generalization. Use held-out likelihood AND sample quality on novel inputs.
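
The sample-size pitfall is easy to see in a toy experiment. The sketch below draws both "real" and "generated" features from the same Gaussian, so the true FID is 0, and reports the estimate at several sample counts. The 64-dimensional features are an assumption made for speed and are far smaller than typical Inception features, so the absolute numbers are only illustrative.

```python
import numpy as np

def fid(feats_r: np.ndarray, feats_g: np.ndarray) -> float:
    """FID between two (N, D) feature arrays."""
    mu_r, mu_g = feats_r.mean(axis=0), feats_g.mean(axis=0)
    s_r = np.cov(feats_r, rowvar=False)
    s_g = np.cov(feats_g, rowvar=False)
    # Trace of the matrix square root via eigenvalues of the product.
    eig = np.linalg.eigvals(s_r @ s_g).real
    tr_sqrt = np.sqrt(np.clip(eig, 0.0, None)).sum()
    return float(((mu_r - mu_g) ** 2).sum() + np.trace(s_r) + np.trace(s_g) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
dim = 64  # assumed toy dimensionality
scores = {}
for n in (100, 1000, 10000):
    a = rng.standard_normal((n, dim))
    b = rng.standard_normal((n, dim))  # same distribution, so the true FID is 0
    scores[n] = fid(a, b)
    print(n, round(scores[n], 3))
```

The estimate shrinks toward 0 as the sample count grows, which is why FID values computed at different sample counts should not be compared directly.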

The one-line version

Likelihood is not a universal metric, so generative models are evaluated by a paradigm-specific suite (perplexity / FID / IS / precision-recall / human studies); each paradigm has its own fingerprint of instruments, and matching the metric to the deployment question is more important than reaching for the most famous one.