Cheatsheet: The four-paradigm landscape and where modern systems sit
The four paradigms in one table
| Paradigm | Training objective | Sampling | Primary trade-off |
|---|---|---|---|
| Autoregressive | Next-token cross-entropy (chain-rule NLL) | Sequential: one forward pass per piece | Sequential sampling, exact likelihood |
| Latent-variable (VAE-style) | ELBO (reconstruction + KL) | Parallel: one prior draw + decoder forward pass | Bounded likelihood, parallel sampling, structured latent space |
| Adversarial (GAN / WGAN) | Minimax game (Jensen-Shannon or Wasserstein) | Parallel: one generator forward pass | No likelihood, sharp samples, training stability tricks |
| Score-based / diffusion | Noise-prediction MSE (denoising score matching) | Multi-step: tens of steps for production samplers | Multi-step sampling, indirect (but tractable via probability flow ODE) likelihood, broad mode coverage |
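For reference, the four training objectives in the table can be written in their standard textbook forms. The notation ($p_\theta$, $q_\phi$, $G$, $D$, $\epsilon_\theta$, $\bar\alpha_t$) is an assumption of this sketch, not notation fixed by the cheatsheet, and the adversarial line shows the original minimax form rather than the Wasserstein variant.

```latex
\begin{align*}
\text{Autoregressive:}\quad
  & \mathcal{L}_{\mathrm{AR}}(\theta)
    = -\,\mathbb{E}_{x}\Big[\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})\Big] \\
\text{VAE (ELBO):}\quad
  & \log p_\theta(x) \;\ge\;
    \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
    - \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big) \\
\text{GAN:}\quad
  & \min_{G}\max_{D}\;
    \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x)\big]
    + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z))\big)\big] \\
\text{Diffusion:}\quad
  & \mathcal{L}_{\mathrm{DM}}(\theta)
    = \mathbb{E}_{t,\,x_0,\,\epsilon}
      \Big[\big\|\epsilon - \epsilon_\theta\big(\sqrt{\bar\alpha_t}\,x_0
      + \sqrt{1-\bar\alpha_t}\,\epsilon,\; t\big)\big\|^2\Big]
\end{align*}
```

The first two are likelihood-based (exact and bounded, respectively), the third has no likelihood term at all, and the fourth is a regression onto the injected noise $\epsilon \sim \mathcal{N}(0, I)$.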
Where widely-discussed modern systems sit
| System | Paradigm | Notes |
|---|---|---|
| Modern autoregressive language model | Autoregressive | Causal-attention transformer + next-token cross-entropy; KV cache for inference |
| Stable Diffusion / latent diffusion | Latent-variable + Score-based hybrid | VAE encoder for compression to latent space + diffusion model in latent space + DDIM sampler + classifier-free guidance |
| GLIDE | Score-based | Pixel-space diffusion with classifier-free guidance, no VAE compression |
| StyleGAN-family face generators | Adversarial | Stable-training GAN variant (non-saturating logistic loss + R1 regularization, not WGAN-GP); sharp samples; latent-space controllability |
| Text-to-video diffusion | Score-based extended to video | Same noise-prediction MSE; spatio-temporal U-Net or transformer |
| Multimodal language + image | Autoregressive + Score-based hybrid | Language model conditions diffusion image generator |
| Normalizing-flow density estimator (scientific applications) | Flow (within likelihood-based family) | Exact likelihood via change-of-variables, one-pass sampling |
The paradigm-fluency procedure
To read any new generative-AI system release:
- Identify the training objective. Next-token cross-entropy → autoregressive. ELBO → VAE-family. Adversarial minimax → GAN. Noise-prediction MSE → diffusion. Combination → hybrid (name the components).
- Identify the sampling procedure. One pass per output piece → autoregressive. One pass total → flow, VAE, or GAN. Multi-step with noise schedule → diffusion. Two-stage with inner sampling → latent-diffusion-style hybrid.
- Predict the trade-offs. Sequential sampling means latency scales with output length. Adversarial means no likelihood. Diffusion means a multi-step cost. Hybrids inherit the trade-offs of their components.
- Place on the map. Pick the paradigm or named combination. If the system claims an unusual property (a diffusion model that samples in one step; an autoregressive model with parallel sampling), the system is doing something non-standard; the paper will describe what.
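The procedure above can be sketched as a small lookup, assuming the system's training objectives have already been read off the release. The objective labels, the function name `place_on_map`, and the one-line trade-off strings are illustrative choices for this sketch, not terms defined elsewhere in the track.

```python
# Hypothetical labels: objective -> (paradigm, headline trade-off).
OBJECTIVE_TO_PARADIGM = {
    "next-token cross-entropy": ("autoregressive", "latency scales with output length"),
    "elbo": ("vae-family", "likelihood is only bounded"),
    "adversarial minimax": ("gan", "no likelihood"),
    "noise-prediction mse": ("diffusion", "multi-step sampling cost"),
}


def place_on_map(objectives):
    """Return (label, trade_offs) for a system given its training objectives.

    A single objective yields that paradigm; several yield a named hybrid
    that inherits the trade-offs of every component.
    """
    if not objectives:
        raise ValueError("need at least one training objective")
    paradigms, trade_offs = [], []
    for objective in objectives:
        paradigm, trade_off = OBJECTIVE_TO_PARADIGM[objective.lower()]
        if paradigm not in paradigms:  # keep first-seen order, no duplicates
            paradigms.append(paradigm)
            trade_offs.append(trade_off)
    if len(paradigms) == 1:
        label = paradigms[0]
    else:
        label = "hybrid: " + " + ".join(paradigms)
    return label, trade_offs
```

For a latent-diffusion-style system, `place_on_map(["ELBO", "noise-prediction MSE"])` names both components and carries both trade-offs forward. An objective outside the table raises `KeyError`, which is the signal to go read what the paper is doing differently.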
What every paradigm shares
- A trained network that approximates the function the paradigm needs (next-piece distribution, base-distribution map, encoder + decoder pair, generator + discriminator, score function).
- A divergence- or distance-based objective tied to the data distribution (forward KL for likelihood-based, Jensen-Shannon or Wasserstein for GANs, score matching / Fisher divergence for score-based).
- A sampling procedure with paradigm-specific cost (sequential, one-pass, multi-step).
- Trade-offs that cannot be jointly optimized (exact likelihood vs parallel sampling; sharp samples vs likelihood evaluation; mode coverage vs sample sharpness).
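The paradigm-specific sampling cost can be made concrete by counting network forward passes per sample. This is a rough sketch: the function name `forward_passes`, the paradigm labels, and the default of 50 diffusion steps are assumptions for illustration, and the count ignores differences in network size as well as extras such as classifier-free guidance.

```python
def forward_passes(paradigm, output_length=1, num_steps=50):
    """Rough number of network forward passes needed to draw one sample."""
    if paradigm == "autoregressive":
        # Sequential: one pass per output piece.
        return output_length
    if paradigm in ("vae", "gan", "flow"):
        # One-pass: a single decoder / generator / flow evaluation.
        return 1
    if paradigm == "diffusion":
        # Multi-step: one denoiser pass per sampler step.
        return num_steps
    if paradigm == "latent-diffusion":
        # Two-stage: inner diffusion steps, then one VAE decoder pass.
        return num_steps + 1
    raise ValueError(f"unknown paradigm: {paradigm!r}")
```

Under these assumptions a 200-piece autoregressive output costs 200 passes, a GAN sample costs 1, and a 30-step diffusion sample costs 30, which is the sequential / one-pass / multi-step split in the list above.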
Reading-rules summary
- “Trained on next-token prediction” → autoregressive.
- “Trained with the ELBO” or “variational autoencoder” → VAE-family.
- “Generator and discriminator” or “Wasserstein-1 critic” → GAN-family.
- “Noise predictor” or “denoising score matching” or “DDPM-style training” → diffusion.
- “Combines a VAE and a diffusion model” → latent-diffusion hybrid.
- “Language model conditioning a diffusion model” → multimodal autoregressive-plus-diffusion hybrid.
- “Probability flow ODE for likelihood” → diffusion with the lesson 14 tractable-likelihood machinery.
- “Classifier-free guidance” → diffusion with the lesson 13 conditioning trick.
- “DDIM-style sampler” → diffusion with the lesson 13 deterministic non-Markovian sampler.
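As a rough illustration, the single-phrase rules above can be applied mechanically to a release blurb with plain substring matching. The rule table `READING_RULES` and the function `paradigms_mentioned` are hypothetical names for this sketch, and naive matching will misfire on near-misses (for example, "elbo" also matches "elbow"), so treat the result as a first guess to confirm against the paper.

```python
# (phrase to look for, paradigm it signals); phrases are lowercase.
READING_RULES = [
    ("next-token prediction", "autoregressive"),
    ("elbo", "VAE-family"),
    ("variational autoencoder", "VAE-family"),
    ("generator and discriminator", "GAN-family"),
    ("wasserstein-1 critic", "GAN-family"),
    ("noise predictor", "diffusion"),
    ("denoising score matching", "diffusion"),
    ("ddpm-style training", "diffusion"),
    ("probability flow ode", "diffusion"),
    ("classifier-free guidance", "diffusion"),
    ("ddim-style sampler", "diffusion"),
]


def paradigms_mentioned(text):
    """Return the paradigms signalled by `text`, in rule order, without repeats."""
    lowered = text.lower()
    found = []
    for phrase, paradigm in READING_RULES:
        if phrase in lowered and paradigm not in found:
            found.append(paradigm)
    return found
```

A blurb that triggers more than one paradigm is a candidate hybrid: a sentence mentioning both a variational autoencoder and a noise predictor returns `["VAE-family", "diffusion"]`, which corresponds to the latent-diffusion rule.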
What this track did NOT cover (carries to other tracks)
- Systems engineering of training large models (distributed training, hardware, MLOps).
- The policy, governance, and societal questions around generative AI (the §6 watch-territory framing in lessons 7, 12, 13, and 14 named the categories; navigating them requires expertise in those forums).
- Frontier research directions (new objectives, new architectures, new sampling procedures appear continuously).
The track gives you paradigm fluency for reading the math; the rest is what other tracks and other expertise cover.
Closing line
The map you opened the track with is the map you close the track with. The math underneath has been filled in. Generative models are not magic; they are math, and the math is the same math you already have.