Cheatsheet: What a generative model is, and the four-paradigm map
The distinction
| Model type | Learns | Lets you |
|---|---|---|
| Discriminative | p(y \| x) (label given input) | Classify / predict |
| Generative | p(x) (or p(x \| c)) | Sample new data and, in some paradigms, compute likelihood |
A model is generative whenever it can sample new examples from a learned distribution. “Generate an image of X” = “sample from p(image \| text = X).”
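As a minimal sketch of "learn a distribution, then score and sample", the snippet below fits a 1-D Gaussian by maximum likelihood. The Gaussian family, the function names, and the toy data are illustrative assumptions, not part of the lesson; real generative models swap the Gaussian for a neural network.

```python
import math
import random

def fit_gaussian(data):
    """Maximum-likelihood mean and standard deviation of 1-D data."""
    mu = sum(data) / len(data)
    var = sum((x - mu) ** 2 for x in data) / len(data)
    return mu, math.sqrt(var)

def log_density(x, mu, sigma):
    """log p(x) under the fitted Gaussian N(mu, sigma^2)."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def sample(mu, sigma, n, rng):
    """Draw n new points from the learned distribution."""
    return [rng.gauss(mu, sigma) for _ in range(n)]

# Toy data (made up for illustration).
data = [4.8, 5.1, 5.0, 4.9, 5.2]
mu, sigma = fit_gaussian(data)

rng = random.Random(0)
new_points = sample(mu, sigma, 3, rng)      # the "generate" capability
score_near = log_density(5.0, mu, sigma)    # the "likelihood" capability
score_far = log_density(6.0, mu, sigma)     # lower: 6.0 is unlike the data
```

A discriminative model has no counterpart to `sample`: it only maps an input to a label distribution.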
The four-paradigm map
| Paradigm | Training objective | Sampling procedure | Likelihood? | Famous in |
|---|---|---|---|---|
| Autoregressive | Maximize next-piece log-likelihood (chain rule) | Sequential, one piece at a time | Yes, exact | Modern LLMs; PixelRNN/WaveNet |
| Latent-variable | Maximize the ELBO (lower bound on likelihood) | Sample latent z, run decoder once | Yes, bounded | VAEs; pre-diffusion image gens |
| Adversarial (GAN) | Minimax game (G vs D) | Sample latent, run generator once | No | StyleGAN; older image / face gens |
| Score-based / diffusion | Noise-prediction MSE at every step | Multi-step denoising from pure noise | Indirectly (via score) | Stable Diffusion; modern image / video / audio |
Each paradigm’s training objective and sampling procedure are tightly coupled to how it represents the underlying distribution.
Chain rule (autoregressive)
p(x_1, ..., x_n) = p(x_1) · p(x_2 | x_1) · ... · p(x_n | x_1, ..., x_{n-1})
Next-token prediction implements this term by term. Sampling: emit one token at a time; latency scales with output length.
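The factorization can be sketched with a toy model. The bigram table below is hypothetical and hand-written, and it conditions each token only on the previous one, which simplifies the full chain rule (a real autoregressive model conditions on the entire prefix). `^` and `$` are assumed start and end markers.

```python
import math
import random

# Hypothetical bigram table: BIGRAM[prev][tok] = p(tok | prev).
BIGRAM = {
    "^": {"a": 0.5, "b": 0.5},
    "a": {"a": 0.25, "b": 0.5, "$": 0.25},
    "b": {"a": 0.5, "b": 0.25, "$": 0.25},
}

def log_likelihood(tokens):
    """Sum the per-token log-probabilities (chain rule, one-token context)."""
    total, prev = 0.0, "^"
    for tok in tokens:
        total += math.log(BIGRAM[prev][tok])
        prev = tok
    return total

def sample(rng, max_len=20):
    """Emit one token at a time, each conditioned on the previous token."""
    out, prev = [], "^"
    for _ in range(max_len):
        choices = BIGRAM[prev]
        tok = rng.choices(list(choices), weights=list(choices.values()))[0]
        if tok == "$":          # end marker: stop generating
            break
        out.append(tok)
        prev = tok
    return out

# p("a", "b", "$") = 0.5 * 0.5 * 0.25
ll = log_likelihood(["a", "b", "$"])
generated = sample(random.Random(0))
```

The loop in `sample` runs once per emitted token, which is why latency grows with output length.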
Sample-and-decode (latent-variable)
z ~ p(z) (e.g., standard Gaussian)
x = decoder(z)
Training is harder because p(x) = ∫ p(x|z) p(z) dz is intractable; the ELBO (lesson 5) is a tractable lower bound on log p(x).
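For reference, the standard form of the bound can be written as follows. The symbol q(z | x) is notation introduced here for the approximate posterior (the encoder in a VAE); it does not appear in the text above.

```latex
\log p(x) \;=\; \log \int p(x \mid z)\, p(z)\, dz
\;\ge\; \mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big]
\;-\; \mathrm{KL}\big(q(z \mid x)\,\|\,p(z)\big)
```

The first term rewards accurate reconstruction; the second keeps the encoder's codes close to the prior p(z), so that sampling z ~ p(z) at generation time lands in a region the decoder has seen.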
Two-network game (GAN)
The generator tries to fool the discriminator; the discriminator tries to tell fakes from real samples. At the idealized equilibrium the generator’s samples are indistinguishable from real data. No likelihood is ever computed.
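The game is usually written as the minimax objective below (the original GAN formulation). Here D(x) is the discriminator's estimated probability that x is real, G(z) is the generator's output for latent z, and p_data and p_z are assumed names for the data and latent distributions.

```latex
\min_{G} \max_{D} \;
\mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
\;+\;
\mathbb{E}_{z \sim p_{z}}\big[\log\big(1 - D(G(z))\big)\big]
```

Note that p(x) under the generator never appears in the objective; only samples do, which is why a GAN yields no likelihood.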
Denoise from noise (diffusion)
Forward: data + noise + noise + noise + ... -> pure noise (cheap to simulate)
Reverse: pure noise - predicted_noise - predicted_noise - ... -> data (learned)
The network learns to predict the noise at each step. Up to a scale factor, the predicted noise is an estimate of the score ∇ log p(x) of the noised data, so the reverse process approximately follows the score.
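A minimal sketch of the forward process and the noise-prediction loss, assuming a DDPM-style formulation with a linear beta schedule (the schedule, its constants, and all function names are assumptions, not taken from the lesson). In that formulation the repeated noising collapses into one closed-form jump, x_t = sqrt(abar_t) · x_0 + sqrt(1 - abar_t) · eps.

```python
import math
import random

def alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    bars, prod = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        prod *= 1.0 - beta
        bars.append(prod)
    return bars

def add_noise(x0, abar, rng):
    """Forward process in closed form: jump from x0 straight to step t."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(abar) * x + math.sqrt(1.0 - abar) * e for x, e in zip(x0, eps)]
    return xt, eps

def noise_prediction_loss(predicted_eps, true_eps):
    """Mean squared error between predicted and true noise (the training loss)."""
    return sum((p - e) ** 2 for p, e in zip(predicted_eps, true_eps)) / len(true_eps)

def recover_x0(xt, eps_estimate, abar):
    """Invert the forward jump given a noise estimate."""
    return [(x - math.sqrt(1.0 - abar) * e) / math.sqrt(abar)
            for x, e in zip(xt, eps_estimate)]

rng = random.Random(0)
bars = alpha_bars(1000)
x0 = [1.0, -2.0, 0.5]
xt, eps = add_noise(x0, bars[499], rng)

# A perfect noise predictor would have zero loss and recover x0 exactly.
perfect_loss = noise_prediction_loss(eps, eps)
x0_hat = recover_x0(xt, eps, bars[499])
```

In training, a network would replace the "perfect predictor" used here; sampling then applies many small denoising steps rather than the single inversion shown in `recover_x0`.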
Place a modern system on the map
| System | Training objective | Sampling procedure | Paradigm |
|---|---|---|---|
| Chat-style LLM | Next-token cross-entropy | One token at a time | Autoregressive |
| Stable Diffusion | Noise prediction at each step | Multi-step denoising | Diffusion |
| StyleGAN face gen | Adversarial loss | One generator pass | GAN |
Read the abstract for the training objective; read the architecture for the sampling procedure; the pair identifies the paradigm.
Where the trade-offs sit
- Sampling speed: Autoregressive scales with output length; Diffusion scales with step count; GAN/VAE are one pass.
- Likelihood: Exact for autoregressive + flows; lower bound for VAEs; undefined for GANs; via score for diffusion.
- Failure modes: Autoregressive drifts on long outputs; GANs collapse to few modes; Diffusion needs many steps for the last quality fraction.
- Hybrids: Latent diffusion = VAE encoder + diffusion in latent space. Decompose hybrids paradigm by paradigm; do not throw the map away.
Pitfalls to dodge
- “Generative AI” as one thing. No, it is four paradigms with shared output behavior.
- Paradigm tied to modality. No, each modality has been generated by every paradigm.
- Sample quality as sole criterion. No, speed / controllability / latent structure / exact likelihood are real trade-offs.
- Hybrid as contradiction. No, decompose into its component paradigms.
Words to use precisely
- Discriminative vs generative: labels-given-input vs learns-the-distribution.
- Autoregressive: factor by the chain rule; next-piece prediction.
- Latent variable (z): a hidden code in a lower-dimensional space.
- ELBO: evidence lower bound, the trainable surrogate for log-likelihood in latent-variable models.
- Discriminator vs generator (GAN roles).
- Score: ∇ log p(x), the gradient of log-density; the object diffusion implicitly learns.
- Forward / reverse process (diffusion’s noising and denoising chains).
The one-line version
A generative model learns a distribution well enough to sample from it, and there are four ways to do that (autoregressive, latent-variable, adversarial, score-based / diffusion); every modern system you have heard of is one of them or a hybrid of them.