Cheatsheet: What a generative model is, and the four-paradigm map

| Model type | Learns | Lets you |
| --- | --- | --- |
| Discriminative | p(y \| x) (label given input) | Classify / predict |
| Generative | p(x) (or p(x \| c)) | Sample new data; in some paradigms, also compute a likelihood |

A model is generative if it can sample new examples from a learned distribution. “Generate an image of X” means “sample from p(image | text = X).”
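As a minimal sketch of "learn a distribution, then sample from it", the snippet below fits a one-dimensional Gaussian to toy data. The dataset, the Gaussian model family, and the helper names `log_prob` and `sample` are assumptions made for illustration; real generative models replace the Gaussian with a neural network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "dataset": 1,000 scalar observations (assumed for illustration).
data = rng.normal(loc=3.0, scale=2.0, size=1000)

# "Training": maximum-likelihood fit of p(x) = N(mu, sigma^2).
mu, sigma = data.mean(), data.std()

def log_prob(x):
    """Log-density of x under the fitted Gaussian."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

def sample(n):
    """Draw n new examples from the learned distribution."""
    return rng.normal(loc=mu, scale=sigma, size=n)

new_points = sample(5)
print("fitted mu, sigma:", mu, sigma)
print("log p(3.0):", log_prob(3.0))
print("new samples:", new_points)
```

Here both operations are available: `log_prob` scores existing data and `sample` produces new data. The four paradigms differ mainly in which of these two they make cheap, exact, or possible at all.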

| Paradigm | Training objective | Sampling procedure | Likelihood? | Famous in |
| --- | --- | --- | --- | --- |
| Autoregressive | Maximize next-piece log-likelihood (chain rule) | Sequential, one piece at a time | Yes, exact | Modern LLMs; PixelRNN / WaveNet |
| Latent-variable | Maximize the ELBO (lower bound on log-likelihood) | Sample latent z, run decoder once | Lower bound only | VAEs; pre-diffusion image generators |
| Adversarial (GAN) | Minimax game (G vs D) | Sample latent z, run generator once | No | StyleGAN; older image / face generators |
| Score-based / diffusion | Noise-prediction MSE at each noise step | Multi-step denoising from pure noise | Indirectly (via score) | Stable Diffusion; modern image / video / audio |

Each paradigm’s training objective and sampling procedure are tightly coupled to how it represents the underlying distribution.

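Autoregressive: factor the joint distribution with the chain rule.

p(x_1, ..., x_n) = p(x_1) · p(x_2 | x_1) · ... · p(x_n | x_1, ..., x_{n-1})

Next-token prediction implements this factorization term by term, so the training objective is an exact log-likelihood. Sampling emits one token at a time, so latency scales with output length.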
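The chain-rule factorization can be sketched with a tiny hand-written model. The vocabulary and probability tables below are hypothetical, and the model conditions only on the previous token (a bigram simplification); a real autoregressive model conditions on the whole prefix.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["a", "b", "<eos>"]

# Hypothetical parameters of a first-order (bigram) autoregressive model.
p_first = np.array([0.6, 0.4, 0.0])   # p(x_1)
p_next = np.array([                   # p(x_t | x_{t-1}); rows index the previous token
    [0.5, 0.3, 0.2],                  # after "a"
    [0.1, 0.6, 0.3],                  # after "b"
    [0.0, 0.0, 1.0],                  # after "<eos>" (absorbing)
])

def log_likelihood(tokens):
    """Exact log p(x_1..x_n): a sum of per-token conditional log-probabilities."""
    ids = [vocab.index(t) for t in tokens]
    logp = np.log(p_first[ids[0]])
    for prev, cur in zip(ids, ids[1:]):
        logp += np.log(p_next[prev, cur])
    return logp

def sample(max_len=20):
    """Emit one token at a time; the loop runs once per output token."""
    ids = [rng.choice(len(vocab), p=p_first)]
    while vocab[ids[-1]] != "<eos>" and len(ids) < max_len:
        ids.append(rng.choice(len(vocab), p=p_next[ids[-1]]))
    return [vocab[i] for i in ids]

seq = ["a", "a", "b", "<eos>"]
print(log_likelihood(seq))   # equals log(0.6 * 0.5 * 0.3 * 0.3)
print(sample())
```

The `while` loop in `sample` is the reason latency grows with output length: each new token needs the previous one before it can be drawn.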

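Latent-variable: sample a latent code, then decode it.

z ~ p(z) (e.g., standard Gaussian)
x = decoder(z)

Sampling is easy, but training is harder: the marginal p(x) = ∫ p(x|z) p(z) dz is intractable, so the model maximizes the ELBO (lesson 5), a lower bound on log p(x) that is tractable.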
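To see the bound in action, here is a sketch on a toy linear-Gaussian model chosen because its marginal is known in closed form, which is not the case for a real VAE. The model (z ~ N(0, 1), x | z ~ N(z, SIGMA2)), the value of `SIGMA2`, and the function names are assumptions for illustration.

```python
import numpy as np

SIGMA2 = 0.5  # decoder noise variance (assumed for this toy model)

def log_marginal(x):
    """Exact log p(x); for this toy model x ~ N(0, 1 + SIGMA2)."""
    var = 1.0 + SIGMA2
    return -0.5 * np.log(2 * np.pi * var) - x**2 / (2 * var)

def elbo(x, m, s2):
    """ELBO with q(z|x) = N(m, s2): E_q[log p(x|z)] - KL(q || p(z))."""
    recon = -0.5 * np.log(2 * np.pi * SIGMA2) - ((x - m) ** 2 + s2) / (2 * SIGMA2)
    kl = 0.5 * (s2 + m**2 - 1.0 - np.log(s2))
    return recon - kl

x = 1.0
exact = log_marginal(x)
loose = elbo(x, m=0.0, s2=1.0)                                  # q set to the prior
tight = elbo(x, m=x / (1 + SIGMA2), s2=SIGMA2 / (1 + SIGMA2))   # q set to the true posterior
print("log p(x):", exact)
print("ELBO, q = prior:", loose)
print("ELBO, q = posterior:", tight)

# Sampling is a single pass: draw z from the prior, then decode.
rng = np.random.default_rng(0)
z = rng.standard_normal(5)
x_new = z + np.sqrt(SIGMA2) * rng.standard_normal(5)  # decoder mean is z in this toy
```

The ELBO never exceeds log p(x), and the gap closes only when q(z|x) matches the true posterior. Training a VAE pushes the bound up by improving both the decoder and q.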

Adversarial (GAN): a generator tries to fool a discriminator, while the discriminator tries to tell generated samples from real ones. At the ideal equilibrium the generator’s samples are indistinguishable from real data. No likelihood is ever computed.
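The minimax game can be made concrete with the value function from the original GAN formulation, V(D, G) = E_data[log D(x)] + E_gen[log(1 - D(x))]. The sketch below evaluates it on a three-outcome toy distribution. The probability vectors are hypothetical, and the generator's distribution is written out explicitly only so that V can be computed; a real GAN estimates both expectations from samples and never has access to p_gen.

```python
import numpy as np

def value(p_data, p_gen, d):
    """V(D, G) = E_data[log D(x)] + E_gen[log(1 - D(x))] over a finite set of outcomes."""
    return np.sum(p_data * np.log(d)) + np.sum(p_gen * np.log(1.0 - d))

def best_discriminator(p_data, p_gen):
    """For a fixed generator, V is maximized by D*(x) = p_data / (p_data + p_gen)."""
    return p_data / (p_data + p_gen)

p_data = np.array([0.5, 0.3, 0.2])
p_bad = np.array([0.8, 0.1, 0.1])   # generator that misses the data distribution
p_good = p_data.copy()              # generator that matches it exactly

v_bad = value(p_data, p_bad, best_discriminator(p_data, p_bad))
v_good = value(p_data, p_good, best_discriminator(p_data, p_good))
print("V against the best D, mismatched generator:", v_bad)
print("V against the best D, matched generator:   ", v_good)
```

When the generator matches the data, the best discriminator outputs 0.5 everywhere and V falls to its minimum of -log 4, which is the "indistinguishable" equilibrium described above.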

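Score-based / diffusion: corrupt data with noise step by step, then learn to undo the corruption.

Forward: data + noise + noise + noise + ... -> pure noise (cheap to simulate)
Reverse: pure noise - predicted_noise - predicted_noise - ... -> data (learned)

The network learns to predict the noise added at each step. That prediction is a rescaled estimate of the score, ∇_x log p(x), of the noised data distribution, so each denoising step approximately follows the score toward regions of higher density.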
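The forward and reverse chains can be sketched on one-dimensional Gaussian data, where the optimal noise prediction is known analytically and can stand in for a trained network. Everything specific here is an assumption for illustration: the data distribution N(MU, S^2), the linear noise schedule, the DDPM-style update rule, and the function names.

```python
import numpy as np

rng = np.random.default_rng(0)

MU, S = 2.0, 0.5                      # toy data distribution: x0 ~ N(MU, S^2)
T = 1000
betas = np.linspace(1e-4, 0.02, T)    # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def forward(x0, t):
    """Jump straight to step t: x_t = sqrt(ab_t) * x0 + sqrt(1 - ab_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps, eps

def predicted_noise(x_t, t):
    """Stand-in for the trained network: the optimal noise estimate for Gaussian data.
    It equals -sqrt(1 - ab_t) times the score of the noised marginal at step t."""
    ab = alpha_bars[t]
    mean_t = np.sqrt(ab) * MU
    var_t = ab * S**2 + (1 - ab)
    score = -(x_t - mean_t) / var_t
    return -np.sqrt(1 - ab) * score

def reverse_sample(n):
    """DDPM-style ancestral sampling: start from pure noise, denoise T times."""
    x = rng.standard_normal(n)
    for t in range(T - 1, -1, -1):
        eps_hat = predicted_noise(x, t)
        x = (x - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:  # no fresh noise is added on the final step
            x += np.sqrt(betas[t]) * rng.standard_normal(n)
    return x

x_T, _ = forward(np.full(20_000, MU), T - 1)   # forward chain ends near N(0, 1)
samples = reverse_sample(20_000)               # reverse chain ends near N(MU, S^2)
print("forward end:  mean, std =", x_T.mean(), x_T.std())
print("reverse end:  mean, std =", samples.mean(), samples.std())
```

The forward process needs no learning at all, while the reverse loop calls the noise predictor once per step. That loop is why sampling cost scales with the number of denoising steps.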

| System | Training objective | Sampling procedure | Paradigm |
| --- | --- | --- | --- |
| Chat-style LLM | Next-token cross-entropy | One token at a time | Autoregressive |
| Stable Diffusion | Noise prediction at each step | Multi-step denoising | Diffusion |
| StyleGAN face gen | Adversarial loss | One generator pass | GAN |

To classify an unfamiliar system, read the abstract for the training objective and the architecture for the sampling procedure; the pair identifies the paradigm.

Trade-offs:
  • Sampling speed: autoregressive scales with output length; diffusion scales with step count; GANs and VAEs need one pass.
  • Likelihood: exact for autoregressive models and normalizing flows; a lower bound for VAEs; undefined for GANs; available indirectly via the score for diffusion.
  • Failure modes: autoregressive models drift on long outputs; GANs collapse to a few modes; diffusion needs many steps to recover the last increment of quality.
  • Hybrids: latent diffusion = a VAE (encoder and decoder) plus diffusion in its latent space. Decompose hybrids paradigm by paradigm; do not throw the map away.

Common misconceptions:
  • “Generative AI” is one thing. No: it is four paradigms that share output behavior.
  • Each paradigm is tied to one modality. No: the major modalities have each been generated with every paradigm.
  • Sample quality is the sole criterion. No: speed, controllability, latent structure, and exact likelihood are real trade-offs.
  • A hybrid is a contradiction. No: decompose it into its component paradigms.

Key terms:
  • Discriminative vs generative: learns labels given input, p(y | x), vs learns the data distribution, p(x).
  • Autoregressive: factor the joint distribution by the chain rule; train by next-piece prediction.
  • Latent variable (z): a hidden code, typically in a lower-dimensional space.
  • ELBO: evidence lower bound, the trainable surrogate for log-likelihood in latent-variable models.
  • Discriminator vs generator (GAN roles): the discriminator separates real from generated samples; the generator produces samples meant to fool it.
  • Score: ∇_x log p(x), the gradient of the log-density with respect to x; the object diffusion implicitly learns.
  • Forward / reverse process: diffusion’s noising and denoising chains.

A generative model learns a distribution well enough to sample from it, and there are four main ways to do that (autoregressive, latent-variable, adversarial, score-based / diffusion); nearly every modern system you have heard of is one of them or a hybrid of them.