Summary: What a generative model is, and the four-paradigm map

You have used a generative model this week. Chat assistants, image generators, voice transcription, even phone-keyboard autocomplete are all generative models in the technical sense, and they all sit in one of four paradigms. The whole lesson reduces to this: a generative model learns a distribution well enough to sample new data from it, and there are four ways to do that, each with its own training objective and sampling procedure. Place a system on the map and you have read the trade-offs it inherits. This is the scan-it-in-five-minutes version.

Core ideas

A discriminative model learns p(y | x), the label given the input, and draws a class boundary. A generative model learns p(x) (or p(x | c) when conditioned on some context c), the distribution over the data itself, which lets you sample new examples and, in the paradigms that expose a likelihood, score how probable a given example is. “Generate an image of X” = “draw a sample from p(image | text = X).”
The four paradigms: each is a different way to represent the distribution, with a tightly coupled training objective and sampling procedure.
Autoregressive: factor the joint by the chain rule of probability, p(x) = ∏_t p(x_t | x_<t), and predict one piece at a time. Train on next-piece log-likelihood (next-token cross-entropy). Sample sequentially. Exact likelihood. Nearly every modern chat-style language model is here.
Latent-variable: introduce a hidden code z sampled from a simple distribution, and run a learned decoder to produce data. Train with the ELBO (a lower bound on log-likelihood). Sample with one decoder pass. VAEs are the canonical example.
Adversarial (GAN): no likelihood at all. A generator and a discriminator play a minimax game until the generator’s samples become indistinguishable from real data. Sample with one generator pass. Famous for sharp images and notorious for training instability and mode collapse.
Score-based / diffusion: generate by reversing a noising process. The forward process turns data into Gaussian noise step by step; the reverse process uses a learned network to denoise step by step back to data. Mathematically, the network estimates the score ∇_x log p(x) of the noised data distribution. Sample with many denoising steps. Stable Diffusion and many modern image / video / audio generators live here.
Read a system in seconds: training objective + sampling procedure identify the paradigm. Next-token cross-entropy + sequential sampling = autoregressive. ELBO + one decoder pass = latent-variable. Adversarial loss + one generator pass = GAN. Noise-prediction MSE + multi-step denoising = diffusion. Worked anchors: GPT-style LLM = autoregressive; Stable Diffusion = diffusion; StyleGAN = GAN.
Trade-offs follow the paradigm, not the architecture. Sampling speed: GANs/VAEs are one pass (fastest), autoregressive is sequential (scales with length), diffusion is multi-step (scales with step count). Likelihood: exact for autoregressive models and normalizing flows, lower bound for VAEs, none for GANs, indirect (via score) for diffusion. Failure modes: drift on long outputs (autoregressive), blurry or over-smoothed samples (VAE), mode collapse (GAN), many denoising steps needed to reach the last increment of quality (diffusion).
Hybrids are decomposable. Latent diffusion is a VAE-style autoencoder (an encoder into a latent space and a decoder back out) plus a diffusion process that runs in that latent space; place it by the generative step (diffusion) and note the inherited latent-variable component.
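
The "read a system in seconds" rule above can be written down as a small lookup. What follows is a minimal Python sketch, not part of the original lesson: the Paradigm class, the identify function, and the exact label strings are hypothetical names chosen for illustration. The table entries only restate the objective, sampling, and likelihood pairings listed in the core ideas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paradigm:
    name: str
    objective: str   # training objective named in the summary
    sampling: str    # sampling procedure named in the summary
    likelihood: str  # what kind of likelihood the paradigm gives you


# Entries restate the summary; the label strings themselves are illustrative.
PARADIGMS = (
    Paradigm("autoregressive", "next-token cross-entropy", "sequential", "exact"),
    Paradigm("latent-variable", "ELBO", "one decoder pass", "lower bound"),
    Paradigm("GAN", "adversarial loss", "one generator pass", "none"),
    Paradigm("diffusion", "noise-prediction MSE", "multi-step denoising",
             "indirect (via score)"),
)


def identify(objective: str, sampling: str) -> Paradigm:
    """Return the paradigm whose objective and sampling procedure both match."""
    for p in PARADIGMS:
        if p.objective == objective and p.sampling == sampling:
            return p
    raise KeyError(f"no paradigm matches ({objective!r}, {sampling!r})")


if __name__ == "__main__":
    p = identify("noise-prediction MSE", "multi-step denoising")
    print(p.name, "| likelihood:", p.likelihood)
```

Matching on both fields at once mirrors the point that the objective and the sampling procedure are tightly coupled. A hybrid such as latent diffusion would be looked up by its generative step, with the inherited autoencoder noted separately.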

What changes for you

Before this lesson, “generative AI” was probably one buzzword that flattened a sprawling field. Now it is four distinct mathematical paradigms, each with a specific training objective and sampling procedure, and you can place any modern system (chat-style LLM, Stable Diffusion, StyleGAN, latent diffusion hybrid) on the map at a glance. When you next read a paper title or a model card, you have two precise questions you can ask, “what training objective?” and “what sampling procedure?”, and the pair points at exactly one paradigm. The rest of the track is the math behind each paradigm, in order: autoregressive next (lesson 2), then maximum likelihood, normalizing flows, VAEs, GANs, evaluation, energy-based models, score matching, diffusion in full, the unifying SDE view, and a closing synthesis at lesson 15 that places today’s most-used systems on this same map with all the math filled in.