Generative model paradigms: cheatsheet

The distinction

Model type	Learns	Lets you
Discriminative	`p(y | x)` (label given input)	Classify / predict
Generative	`p(x)` (or `p(x | c)`)	Sample new data and, in most paradigms, score likelihood

A model is generative whenever it can sample new examples from a learned distribution. “Generate an image of X” = “sample from p(image | text = X).”

The four-paradigm map

Paradigm	Training objective	Sampling procedure	Likelihood?	Famous in
Autoregressive	Maximize next-piece log-likelihood (chain rule)	Sequential, one piece at a time	Yes, exact	Modern LLMs; PixelRNN/WaveNet
Latent-variable	Maximize the ELBO (lower bound on likelihood)	Sample latent `z`, run decoder once	Lower bound only	VAEs; pre-diffusion image generators
Adversarial (GAN)	Minimax game (G vs D)	Sample latent, run generator once	No	StyleGAN; older image / face generators
Score-based / diffusion	Noise-prediction MSE at every step	Multi-step denoising from pure noise	Indirectly (via score)	Stable Diffusion; modern image / video / audio

Each paradigm’s training objective and sampling procedure are tightly coupled to how it represents the underlying distribution.

Chain rule (autoregressive)

p(x_1, ..., x_n) = p(x_1) · p(x_2 | x_1) · ... · p(x_n | x_1, ..., x_{n-1})

Next-token prediction implements this term by term. Sampling: emit one token at a time; latency scales with output length.
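
A minimal sketch of both halves (exact likelihood and one-at-a-time sampling), assuming a hypothetical bigram table in place of a neural network. The table, the token names, and the `<s>` / `</s>` markers are illustrative only. A bigram conditions on just the previous token, which is a simplification of the full chain rule above.

```python
import math
import random

# Hypothetical toy model: each row is p(next token | previous token).
# "<s>" marks the start of a sequence and "</s>" ends it.
BIGRAM = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"a": 0.1, "b": 0.6, "</s>": 0.3},
    "b": {"a": 0.5, "b": 0.1, "</s>": 0.4},
}


def log_likelihood(tokens):
    """Exact log-probability of the sequence: a sum of conditional log-probs."""
    total = 0.0
    prev = "<s>"
    for tok in list(tokens) + ["</s>"]:
        total += math.log(BIGRAM[prev][tok])
        prev = tok
    return total


def sample(rng, max_len=50):
    """Ancestral sampling: draw one token at a time, conditioned on the last one."""
    out = []
    prev = "<s>"
    for _ in range(max_len):
        choices, weights = zip(*BIGRAM[prev].items())
        tok = rng.choices(choices, weights=weights, k=1)[0]
        if tok == "</s>":
            break
        out.append(tok)
        prev = tok
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    seq = sample(rng)
    print(seq, log_likelihood(seq))
```

The loop in `sample` runs once per emitted token, which is why latency grows with output length.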

Sample-and-decode (latent-variable)

z ~ p(z)          (e.g., standard Gaussian)
x = decoder(z)

Training is harder because the marginal p(x) = ∫ p(x|z) p(z) dz is intractable for a neural decoder; the ELBO (lesson 5) is a lower bound on log p(x) that is tractable to estimate and optimize.
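
To see the bound numerically, here is a small sketch under a deliberately simple assumption: z ~ N(0, 1) and x | z ~ N(z, 1), so the marginal is known in closed form (x ~ N(0, 2)) and the ELBO can be checked against it. The function names and the Gaussian approximate posterior q(z | x) = N(m, s²) are illustrative choices, not part of the lesson.

```python
import math


def log_marginal(x):
    """Exact log p(x) for the toy model: x ~ N(0, 2)."""
    return -0.5 * math.log(2.0 * math.pi * 2.0) - x * x / 4.0


def elbo(x, m, s):
    """Closed-form ELBO with approximate posterior q(z | x) = N(m, s^2).

    ELBO = E_q[log p(x | z)] + E_q[log p(z)] + entropy of q.
    """
    expected_log_lik = -0.5 * math.log(2.0 * math.pi) - 0.5 * ((x - m) ** 2 + s * s)
    expected_log_prior = -0.5 * math.log(2.0 * math.pi) - 0.5 * (m * m + s * s)
    entropy = 0.5 * math.log(2.0 * math.pi * math.e * s * s)
    return expected_log_lik + expected_log_prior + entropy


if __name__ == "__main__":
    x = 1.3
    print("log p(x)          :", log_marginal(x))
    print("ELBO, poor q      :", elbo(x, 0.0, 1.0))
    # The true posterior of this toy model is N(x / 2, 1 / 2); the bound is tight there.
    print("ELBO, posterior q :", elbo(x, x / 2.0, math.sqrt(0.5)))
```

In a real VAE neither the marginal nor the true posterior is available, so only the ELBO side of this comparison can be computed.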

Two-network game (GAN)

Generator tries to fool a Discriminator; Discriminator tries to tell fakes from reals. At equilibrium the generator’s samples are indistinguishable from real data. No likelihood is ever computed.
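
The game can be written as a single value function that the Discriminator maximizes and the Generator minimizes. The sketch below assumes the standard minimax form, V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))], and takes plain lists of discriminator outputs in (0, 1) rather than real networks; the function names are hypothetical.

```python
import math


def gan_value(d_real, d_fake):
    """V(D, G): mean log D(x) on real data plus mean log(1 - D(G(z))) on fakes."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def discriminator_loss(d_real, d_fake):
    """The Discriminator maximizes V, so its loss is -V."""
    return -gan_value(d_real, d_fake)


def generator_loss(d_fake):
    """Minimax form: the Generator minimizes log(1 - D(G(z)))."""
    return sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)


if __name__ == "__main__":
    # A Discriminator that cannot tell real from fake outputs 0.5 everywhere.
    print(gan_value([0.5] * 4, [0.5] * 4), -math.log(4.0))
```

Nothing here evaluates p(x): the only training signal is the Discriminator's output on real and generated samples.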

Denoise from noise (diffusion)

Forward:  data + noise + noise + noise + ... -> pure noise (cheap to simulate)
Reverse:  pure noise - predicted_noise - predicted_noise - ... -> data (learned)

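The network learns to predict the noise added at each step. Up to a scale factor, the predicted noise is the negative score of the noised data distribution, so each denoising step approximately follows the score ∇ log p(x).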
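
A sketch of the forward process and the training target, assuming a DDPM-style variance-preserving setup with a linear noise schedule. The schedule values, the step count `T`, and all function names are illustrative assumptions; no network is trained here, and the true noise stands in for a perfect prediction.

```python
import math
import random

T = 1000
# Assumed linear noise schedule (illustrative values).
BETAS = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]

# ALPHA_BAR[t] is the fraction of signal variance remaining after t + 1 noising steps.
ALPHA_BAR = []
_running = 1.0
for _beta in BETAS:
    _running *= 1.0 - _beta
    ALPHA_BAR.append(_running)


def forward_noise(x0, t, rng):
    """Jump straight to step t: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a = math.sqrt(ALPHA_BAR[t])
    b = math.sqrt(1.0 - ALPHA_BAR[t])
    return [a * x + b * e for x, e in zip(x0, eps)], eps


def noise_prediction_loss(eps_pred, eps):
    """Training objective: mean squared error between predicted and actual noise."""
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps)) / len(eps)


def estimate_x0(x_t, t, eps_pred):
    """Invert the forward formula given a noise estimate."""
    a = math.sqrt(ALPHA_BAR[t])
    b = math.sqrt(1.0 - ALPHA_BAR[t])
    return [(x - b * p) / a for x, p in zip(x_t, eps_pred)]


def score_from_noise(eps_pred, t):
    """Score of the noised distribution implied by a noise estimate."""
    b = math.sqrt(1.0 - ALPHA_BAR[t])
    return [-p / b for p in eps_pred]


if __name__ == "__main__":
    rng = random.Random(0)
    x0 = [1.0, -2.0, 0.5]
    x_t, eps = forward_noise(x0, 500, rng)
    print("signal left at final step:", ALPHA_BAR[-1])
    print("x0 recovered with perfect noise prediction:", estimate_x0(x_t, 500, eps))
```

The forward side needs no learning at all, which is why it is cheap to simulate; the learned part is whatever produces `eps_pred`.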

Place a modern system on the map

System	Training objective	Sampling procedure	Paradigm
Chat-style LLM	Next-token cross-entropy	One token at a time	Autoregressive
Stable Diffusion	Noise prediction at each step	Multi-step denoising	Diffusion
StyleGAN face gen	Adversarial loss	One generator pass	GAN

Read the abstract for the training objective; read the architecture for the sampling procedure; the pair identifies the paradigm.

Where the trade-offs sit

Sampling speed: Autoregressive scales with output length; Diffusion scales with step count; GAN/VAE are one pass.
Likelihood: Exact for autoregressive models and normalizing flows; lower bound for VAEs; not available for GANs (the density is only implicit); via score for diffusion.
Failure modes: Autoregressive drifts on long outputs; GANs collapse to few modes; Diffusion needs many steps to get the last increment of quality.
Hybrids: Latent diffusion = VAE encoder and decoder + diffusion in latent space. Decompose hybrids paradigm by paradigm; do not throw the map away.

Pitfalls to dodge

“Generative AI” as one thing. No, it is four paradigms with shared output behavior.
Paradigm tied to modality. No, each modality has been generated by every paradigm.
Sample quality as sole criterion. No, speed / controllability / latent structure / exact likelihood are real trade-offs.
Hybrid as contradiction. No, decompose into its component paradigms.

Words to use precisely

Discriminative vs generative: labels-given-input vs learns-the-distribution.
Autoregressive: factor by the chain rule; next-piece prediction.
Latent variable (z): a hidden code, typically in a lower-dimensional space than the data.
ELBO: evidence lower bound, the trainable surrogate for log-likelihood in latent-variable models.
Discriminator vs generator (GAN roles).
Score: ∇ log p(x), the gradient of log-density; the object diffusion implicitly learns.
Forward / reverse process (diffusion’s noising and denoising chains).

The one-line version

A generative model learns a distribution well enough to sample from it, and there are four ways to do that (autoregressive, latent-variable, adversarial, score-based / diffusion); every modern system you have heard of is one of them or a hybrid of them.