
Cheatsheet: The four-paradigm landscape and where modern systems sit

| Paradigm | Training objective | Sampling | Primary trade-off |
|---|---|---|---|
| Autoregressive | Next-token cross-entropy (chain-rule NLL) | Sequential: one forward pass per piece | Sequential sampling, exact likelihood |
| Latent-variable (VAE-style) | ELBO (reconstruction + KL) | Parallel: one prior draw + decoder forward pass | Bounded likelihood, parallel sampling, structured latent space |
| Adversarial (GAN / WGAN) | Minimax game (Jensen-Shannon or Wasserstein) | Parallel: one generator forward pass | No likelihood, sharp samples, training stability tricks |
| Score-based / diffusion | Noise-prediction MSE (denoising score matching) | Multi-step: tens of steps for production samplers | Multi-step sampling, indirect (but tractable via probability flow ODE) likelihood, broad mode coverage |
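
For reference, the four training objectives named in the table can be written out in one common textbook form. This is a sketch under assumed notation, not notation fixed by this cheatsheet: $p_{\mathrm{data}}$ is the data distribution, $p_\theta$ the model, $q_\phi$ the VAE encoder, $G$ and $D$ the generator and discriminator, $\epsilon_\theta$ the noise predictor, and $\bar\alpha_t$ the cumulative noise-schedule coefficient. The GAN line shows the original (Jensen-Shannon) game; the Wasserstein variant replaces the log terms with a critic difference.

```latex
\begin{align*}
\text{Autoregressive:}\quad & \mathcal{L}_{\mathrm{AR}}(\theta) = -\,\mathbb{E}_{x \sim p_{\mathrm{data}}} \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}) \\
\text{VAE (ELBO):}\quad & \log p_\theta(x) \ge \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big) \\
\text{GAN:}\quad & \min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z))\big)\big] \\
\text{Diffusion:}\quad & \mathcal{L}_{\mathrm{simple}}(\theta) = \mathbb{E}_{t,\,x_0,\,\epsilon}\,\big\| \epsilon - \epsilon_\theta\big(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\; t\big) \big\|^2
\end{align*}
```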

| System | Paradigm | Notes |
|---|---|---|
| Modern autoregressive language model | Autoregressive | Causal-attention transformer + next-token cross-entropy; KV cache for inference |
| Stable Diffusion / latent diffusion | Latent-variable + Score-based hybrid | VAE encoder for compression to latent space + diffusion model in latent space + DDIM sampler + classifier-free guidance |
| GLIDE | Score-based | Pixel-space diffusion with classifier-free guidance, no VAE compression |
| StyleGAN-family face generators | Adversarial | Stable-training GAN variant (non-saturating logistic loss + R1 regularization, not WGAN-GP); sharp samples; latent-space controllability |
| Text-to-video diffusion | Score-based extended to video | Same noise-prediction MSE; spatio-temporal U-Net or transformer |
| Multimodal language + image | Autoregressive + Score-based hybrid | Language model conditions diffusion image generator |
| Normalizing-flow density estimator (scientific applications) | Flow (within likelihood-based family) | Exact likelihood via change-of-variables, one-pass sampling |
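
The Sampling column of the paradigm table can be read as a count of network forward passes per sample. The sketch below makes that concrete; the function name `forward_passes`, the paradigm labels, and the default of 50 sampler steps are illustrative assumptions, and the count ignores per-pass cost (a KV cache, for example, makes each autoregressive pass cheaper without reducing the number of passes).

```python
def forward_passes(paradigm: str, output_length: int = 1, sampler_steps: int = 50) -> int:
    """Rough number of network forward passes needed to draw one sample.

    Hypothetical helper for illustration; labels and defaults are assumptions.
    """
    if paradigm == "autoregressive":
        # Sequential: one forward pass per output piece (token, patch, ...).
        return output_length
    if paradigm in {"vae", "gan", "flow"}:
        # Parallel: a single decoder / generator / inverse-flow pass.
        return 1
    if paradigm == "diffusion":
        # Multi-step: one denoiser pass per sampler step.
        return sampler_steps
    if paradigm == "latent_diffusion":
        # Two-stage: denoise in latent space, then one VAE decoder pass.
        return sampler_steps + 1
    raise ValueError(f"unknown paradigm: {paradigm!r}")


if __name__ == "__main__":
    for name in ("autoregressive", "vae", "gan", "diffusion", "latent_diffusion"):
        print(name, forward_passes(name, output_length=200, sampler_steps=30))
```

Only the autoregressive count depends on `output_length`, which is why its latency grows with the length of the output while the other paradigms pay a fixed number of passes.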

To read any new generative-AI system release:

  1. Identify the training objective. Next-token cross-entropy → autoregressive. ELBO → VAE-family. Adversarial minimax → GAN. Noise-prediction MSE → diffusion. Combination → hybrid (name the components).
  2. Identify the sampling procedure. One pass per output piece → autoregressive. One pass total → flow, VAE, or GAN. Multi-step with noise schedule → diffusion. Two-stage with inner sampling → latent-diffusion-style hybrid.
  3. Predict the trade-offs. Sequential sampling means latency scales with output length. Adversarial means no likelihood. Diffusion means a multi-step cost. Hybrids inherit the trade-offs of their components.
  4. Place on the map. Pick the paradigm or named combination. If the system claims an unusual property (a diffusion model that samples in one step; an autoregressive model with parallel sampling), the system is doing something non-standard; the paper will describe what.
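
Steps 1, 3, and 4 of the procedure above amount to a lookup followed by a union. A minimal sketch, assuming the objective names are supplied by the reader after looking at the release; the dictionaries and the function `place_on_map` are hypothetical names introduced here, and the trade-off strings paraphrase the table rather than quote any paper:

```python
# Step 1: training objective -> paradigm (keys are lower-cased objective names).
OBJECTIVE_TO_PARADIGM = {
    "next-token cross-entropy": "autoregressive",
    "elbo": "vae",
    "adversarial minimax": "gan",
    "noise-prediction mse": "diffusion",
}

# Step 3: headline trade-off per paradigm.
TRADE_OFFS = {
    "autoregressive": {"latency scales with output length"},
    "vae": {"likelihood is only bounded (ELBO)"},
    "gan": {"no likelihood"},
    "diffusion": {"multi-step sampling cost"},
}


def place_on_map(objectives):
    """Return (paradigm or named combination, sorted list of inherited trade-offs)."""
    if not objectives:
        raise ValueError("name at least one training objective")
    components = []
    for objective in objectives:
        paradigm = OBJECTIVE_TO_PARADIGM[objective.lower()]
        if paradigm not in components:
            components.append(paradigm)
    # Step 4: a single paradigm, or a hybrid named by its components.
    name = components[0] if len(components) == 1 else " + ".join(components) + " hybrid"
    # Hybrids inherit the union of their components' trade-offs.
    trade_offs = sorted(set().union(*(TRADE_OFFS[c] for c in components)))
    return name, trade_offs


if __name__ == "__main__":
    print(place_on_map(["ELBO", "Noise-prediction MSE"]))
```

An objective that is missing from the lookup raises a `KeyError`, which matches the non-standard case in step 4: the system is doing something the map does not list, and the paper has to be read for what.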

Every paradigm on the map has the same four parts:

  • A trained network that approximates the function the paradigm needs (next-piece distribution, base-distribution map, encoder + decoder pair, generator + discriminator, score function).
  • An information-theoretic objective tied to the data distribution (forward KL for likelihood-based, Jensen-Shannon or Wasserstein for GANs, score matching / Fisher divergence for score-based).
  • A sampling procedure with paradigm-specific cost (sequential, one-pass, multi-step).
  • Trade-offs that cannot be jointly optimized (exact likelihood vs parallel sampling; sharp samples vs likelihood evaluation; mode coverage vs sample sharpness).

Phrases to look for in a paper or release, and the paradigm each one signals:

  • “Trained on next-token prediction” → autoregressive.
  • “Trained with the ELBO” or “variational autoencoder” → VAE-family.
  • “Generator and discriminator” or “Wasserstein-1 critic” → GAN-family.
  • “Noise predictor” or “denoising score matching” or “DDPM-style training” → diffusion.
  • “Combines a VAE and a diffusion model” → latent-diffusion hybrid.
  • “Language model conditioning a diffusion model” → multimodal autoregressive-plus-diffusion hybrid.
  • “Probability flow ODE for likelihood” → diffusion with the tractable-likelihood machinery from lesson 14.
  • “Classifier-free guidance” → diffusion with the conditioning trick from lesson 13.
  • “DDIM-style sampler” → diffusion with the deterministic non-Markovian sampler from lesson 13.
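
The phrase list above can be applied mechanically as a first pass over an abstract or release note. The sketch below is a naive substring matcher, not a reliable classifier: the table `PHRASE_TO_PARADIGM` and the function `spot_paradigms` are hypothetical names, and plain substring matching will produce false hits (for example "elbo" inside an unrelated word).

```python
# (phrase, paradigm) pairs taken from the list above, lower-cased for matching.
PHRASE_TO_PARADIGM = [
    ("next-token prediction", "autoregressive"),
    ("elbo", "VAE-family"),
    ("variational autoencoder", "VAE-family"),
    ("generator and discriminator", "GAN-family"),
    ("wasserstein-1 critic", "GAN-family"),
    ("noise predictor", "diffusion"),
    ("denoising score matching", "diffusion"),
    ("ddpm-style training", "diffusion"),
    ("probability flow ode", "diffusion"),
    ("classifier-free guidance", "diffusion"),
    ("ddim-style sampler", "diffusion"),
]


def spot_paradigms(description):
    """Return the paradigms whose marker phrases occur in the text, without duplicates."""
    text = description.lower()
    found = []
    for phrase, paradigm in PHRASE_TO_PARADIGM:
        if phrase in text and paradigm not in found:
            found.append(paradigm)
    return found


if __name__ == "__main__":
    abstract = (
        "We train a noise predictor in the latent space of a variational "
        "autoencoder and sample with classifier-free guidance."
    )
    print(spot_paradigms(abstract))
```

More than one paradigm in the result signals a hybrid; here the VAE and diffusion markers together point to the latent-diffusion combination.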

What this track did NOT cover (carries to other tracks)

  • Systems engineering of training large models (distributed training, hardware, MLOps).
  • The policy, governance, and societal questions around generative AI. The §6 watch-territory framing in lessons 7, 12, 13, and 14 named the categories; navigating them requires expertise in those forums.
  • Frontier research directions (new objectives, new architectures, new sampling procedures appear continuously).

The track gives you paradigm fluency for reading the math; the rest is what other tracks and other expertise cover.

The map you opened the track with is the map you close the track with. The math underneath has been filled in. Generative models are not magic; they are math, and the math is the same math you already have.