Four-paradigm landscape: cheatsheet

The four paradigms in one table

Paradigm	Training objective	Sampling	Primary trade-off
Autoregressive	Next-token cross-entropy (chain-rule NLL)	Sequential: one forward pass per piece	Sequential sampling, exact likelihood
Latent-variable (VAE-style)	ELBO (reconstruction + KL)	Parallel: one prior draw + decoder forward pass	Bounded likelihood, parallel sampling, structured latent space
Adversarial (GAN / WGAN)	Minimax game (Jensen-Shannon or Wasserstein)	Parallel: one generator forward pass	No likelihood, sharp samples, training that depends on stability tricks
Score-based / diffusion	Noise-prediction MSE (denoising score matching)	Multi-step: tens of steps for production samplers	Multi-step sampling; likelihood available only indirectly (tractable via the probability flow ODE); broad mode coverage
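
For reference, the four training objectives named in the table can be written out as below. This is a sketch in common textbook notation, not notation defined in this cheatsheet: the symbols \theta, \phi, q_\phi, G, D, \epsilon_\theta, and \bar\alpha_t are all assumptions. The adversarial row shows only the original minimax form (the Jensen-Shannon case); the Wasserstein variant replaces the log terms with a critic difference.

```latex
\begin{align*}
\text{Autoregressive:}\quad
  & \mathcal{L}_{\mathrm{AR}}(\theta)
    = -\,\mathbb{E}_{x \sim p_{\mathrm{data}}}
      \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}) \\
\text{Latent-variable:}\quad
  & \log p_\theta(x) \;\ge\;
    \mathbb{E}_{q_\phi(z \mid x)}\bigl[\log p_\theta(x \mid z)\bigr]
    - \mathrm{KL}\bigl(q_\phi(z \mid x)\,\|\,p(z)\bigr) \\
\text{Adversarial:}\quad
  & \min_{G} \max_{D}\;
    \mathbb{E}_{x \sim p_{\mathrm{data}}}\bigl[\log D(x)\bigr]
    + \mathbb{E}_{z \sim p(z)}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr] \\
\text{Diffusion:}\quad
  & \mathcal{L}_{\mathrm{simple}}(\theta)
    = \mathbb{E}_{t,\,x_0,\,\epsilon}
      \bigl\|\epsilon - \epsilon_\theta\bigl(\sqrt{\bar\alpha_t}\,x_0
      + \sqrt{1-\bar\alpha_t}\,\epsilon,\; t\bigr)\bigr\|^2
\end{align*}
```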

Where widely-discussed modern systems sit

System	Paradigm	Notes
Modern autoregressive language model	Autoregressive	Causal-attention transformer + next-token cross-entropy; KV cache for inference
Stable Diffusion / latent diffusion	Latent-variable + Score-based hybrid	VAE encoder for compression to latent space + diffusion model in latent space + DDIM sampler + classifier-free guidance
GLIDE	Score-based	Pixel-space diffusion with classifier-free guidance, no VAE compression
StyleGAN-family face generators	Adversarial	Stable-training GAN variant (non-saturating logistic loss + R1 regularization, not WGAN-GP); sharp samples; latent-space controllability
Text-to-video diffusion	Score-based extended to video	Same noise-prediction MSE; spatio-temporal U-Net or transformer
Multimodal language + image	Autoregressive + Score-based hybrid	Language model conditions diffusion image generator
Normalizing-flow density estimator (scientific applications)	Flow (within likelihood-based family)	Exact likelihood via change-of-variables, one-pass sampling
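
The "exact likelihood via change-of-variables, one-pass sampling" entry in the last row can be made concrete with the smallest possible flow. The sketch below assumes a one-dimensional affine map with a standard normal base distribution; the function names and the parameters shift and log_scale are hypothetical and chosen only for illustration.

```python
import math


def standard_normal_logpdf(z):
    """Log density of the base distribution N(0, 1)."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))


def affine_flow_logpdf(x, shift, log_scale):
    """Exact log p_x(x) for the flow x = shift + exp(log_scale) * z.

    Change of variables: log p_x(x) = log p_z(z) + log |dz/dx|.
    """
    z = (x - shift) * math.exp(-log_scale)  # inverse map from data to base
    log_det = -log_scale                    # log |dz/dx| for an affine map
    return standard_normal_logpdf(z) + log_det


def affine_flow_sample(z, shift, log_scale):
    """One-pass sampling: push a base draw z through the forward map."""
    return shift + math.exp(log_scale) * z
```

Because this particular flow is just a reparameterized Gaussian, its output can be checked against the closed-form density of N(shift, exp(log_scale)^2). Real flows stack many invertible layers and sum their log-determinants in the same way.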

The paradigm-fluency procedure

To read any new generative-AI system release:

Identify the training objective. Next-token cross-entropy → autoregressive. ELBO → VAE-family. Adversarial minimax → GAN. Noise-prediction MSE → diffusion. Combination → hybrid (name the components).
Identify the sampling procedure. One pass per output piece → autoregressive. One pass total → flow, VAE, or GAN. Multi-step with noise schedule → diffusion. Two-stage with inner sampling → latent-diffusion-style hybrid.
Predict the trade-offs. Sequential sampling means latency scales with output length. Adversarial means no likelihood. Diffusion means a multi-step cost. Hybrids inherit the trade-offs of their components.
Place on the map. Pick the paradigm or named combination. If the system claims an unusual property (a diffusion model that samples in one step; an autoregressive model with parallel sampling), the system is doing something non-standard; the paper will describe what.
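
The first and last steps of the procedure above amount to a lookup followed by a merge. The sketch below is one hypothetical way to write that down; the label strings and the function name place_on_map are assumptions made for illustration, and real release notes will need human reading before they reduce to labels like these.

```python
# Objective label -> paradigm, following the "Identify the training objective" step.
OBJECTIVE_TO_PARADIGM = {
    "next-token cross-entropy": "autoregressive",
    "elbo": "vae-family",
    "adversarial minimax": "gan",
    "noise-prediction mse": "diffusion",
}


def place_on_map(objectives):
    """Return a paradigm name, or a named hybrid when objectives combine."""
    paradigms = []
    for objective in objectives:
        key = objective.strip().lower()
        if key not in OBJECTIVE_TO_PARADIGM:
            # Unknown objective: the system is non-standard, read the paper.
            raise ValueError(f"unrecognized objective: {objective!r}")
        paradigm = OBJECTIVE_TO_PARADIGM[key]
        if paradigm not in paradigms:
            paradigms.append(paradigm)
    if not paradigms:
        raise ValueError("at least one objective is required")
    if len(paradigms) == 1:
        return paradigms[0]
    # Hybrids are named by their components, in the order given.
    return "hybrid (" + " + ".join(paradigms) + ")"
```

For example, place_on_map(["ELBO", "Noise-prediction MSE"]) names the latent-diffusion combination as a hybrid of its two components.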

What every paradigm shares

A trained network that approximates the function the paradigm needs (next-piece distribution, map to a base distribution, encoder + decoder pair, generator + discriminator, score function).
An objective that measures a divergence or distance from the data distribution (forward KL for likelihood-based, Jensen-Shannon or Wasserstein for GANs, score matching / Fisher divergence for score-based).
A sampling procedure with paradigm-specific cost (sequential, one-pass, multi-step).
Trade-offs that cannot be jointly optimized (exact likelihood vs parallel sampling; sharp samples vs likelihood evaluation; mode coverage vs sample sharpness).
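
The paradigm-specific sampling cost can be summarized as a count of network evaluations per generated sample. The sketch below is a deliberately rough model under stated assumptions: the function name and paradigm labels are hypothetical, every evaluation is treated as equally expensive, and extras such as guidance (which can add evaluations per step) are ignored.

```python
def network_evaluations(paradigm, output_length=1, num_steps=1):
    """Rough count of forward passes needed to draw one sample."""
    if paradigm == "autoregressive":
        # One pass per output piece, so cost grows with output length.
        return output_length
    if paradigm in ("vae", "gan", "flow"):
        # One pass total through the decoder, generator, or flow.
        return 1
    if paradigm == "diffusion":
        # One denoiser pass per sampler step.
        return num_steps
    if paradigm == "latent-diffusion":
        # Denoiser passes in latent space plus one decoder pass at the end.
        return num_steps + 1
    raise ValueError(f"unknown paradigm: {paradigm!r}")
```

Under this model a 100-piece autoregressive output costs 100 passes while a GAN sample costs one, which is the sequential versus one-pass contrast in the list above.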

Reading-rules summary

“Trained on next-token prediction” → autoregressive.
“Trained with the ELBO” or “variational autoencoder” → VAE-family.
“Generator and discriminator” or “Wasserstein-1 critic” → GAN-family.
“Noise predictor” or “denoising score matching” or “DDPM-style training” → diffusion.
“Combines a VAE and a diffusion model” → latent-diffusion hybrid.
“Language model conditioning a diffusion model” → multimodal autoregressive-plus-diffusion hybrid.
“Probability flow ODE for likelihood” → diffusion with the lesson 14 tractable-likelihood machinery.
“Classifier-free guidance” → diffusion with the lesson 13 conditioning trick.
“DDIM-style sampler” → diffusion with the lesson 13 deterministic non-Markovian sampler.
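
The last two rules name mechanisms that are short enough to sketch. The code below assumes noise predictions are plain lists of floats, uses the guidance convention in which a scale of 1 recovers the conditional prediction (other write-ups use a shifted scale), and takes alpha_bar to be the cumulative signal-retention coefficient of the noise schedule. The function names are hypothetical.

```python
import math


def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Blend unconditional and conditional noise predictions.

    Scale 0 gives the unconditional prediction, scale 1 the conditional one,
    and scale > 1 extrapolates past the conditional prediction.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (the eta = 0 case, no fresh noise)."""
    sqrt_ab_t = math.sqrt(alpha_bar_t)
    sqrt_noise_t = math.sqrt(1.0 - alpha_bar_t)
    sqrt_ab_prev = math.sqrt(alpha_bar_prev)
    sqrt_noise_prev = math.sqrt(1.0 - alpha_bar_prev)
    x_prev = []
    for x, eps in zip(x_t, eps_pred):
        # Estimate the clean sample implied by the current noise prediction.
        x0_pred = (x - sqrt_noise_t * eps) / sqrt_ab_t
        # Re-noise that estimate to the earlier (less noisy) level.
        x_prev.append(sqrt_ab_prev * x0_pred + sqrt_noise_prev * eps)
    return x_prev
```

With a perfect noise prediction, ddim_step maps a sample at noise level alpha_bar_t exactly onto the same clean sample and noise at level alpha_bar_prev, which is why the sampler can skip steps without injecting randomness.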

What this track did NOT cover (carries to other tracks)

Systems engineering of training large models (distributed training, hardware, MLOps).
The policy, governance, and societal questions around generative AI (the §6 watch-territory framing in lessons 7, 12, 13, and 14 named the categories; navigating them requires the expertise found in those forums).
Frontier research directions (new objectives, new architectures, new sampling procedures appear continuously).

The track gives you paradigm fluency for reading the math; the rest is what other tracks and other expertise cover.

Closing line

The map you opened the track with is the map you close the track with. The math underneath has been filled in. Generative models are not magic; they are math, and the math is the same math you already have.