Generative model paradigms: brief

What you’ll learn

This is the opening lesson of Track 19 (Generative Models and Diffusion), and the whole track is organized around the map this lesson builds. By the end you will be able to look at any modern AI system that generates something (text, images, audio, video) and place it in one of four buckets at a glance: autoregressive (predict the next piece, one at a time), latent-variable (sample a code, run a decoder), adversarial (two networks compete), or score-based / diffusion (denoise step by step from pure noise). You will see why “generative” is a precise technical category (learning p(x) well enough to sample from it), how each paradigm couples a specific training objective to a specific sampling procedure, and why reading the paradigm tells you the sampling speed, the kinds of likelihood you get, and the failure modes you inherit. The source curricula are Stanford’s CS236 (Stefano Ermon), the primary anchor, and Berkeley’s CS294-158 (Pieter Abbeel et al.), both freely available at the course pages linked in References.
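To make the link between paradigm and sampling procedure concrete, here is a minimal sketch of the sampling loops, under loud assumptions. Every name in it (`toy_conditional`, `toy_network`, `toy_score`, and the three sampler functions) is hypothetical and hand-written for illustration; none of it comes from CS236 or CS294-158, and nothing here is trained. The latent-variable and adversarial paradigms share one sampler in this sketch because they differ mainly in how the network is trained, not in how a sample is drawn. The score-based sampler uses unadjusted Langevin dynamics with a known analytic score, standing in for the learned denoiser a real diffusion model would use.

```python
import random

def sample_autoregressive(conditional, length, rng):
    """Autoregressive: draw one piece at a time, each conditioned on the prefix."""
    seq = []
    for _ in range(length):
        probs = conditional(seq)  # dict mapping token -> p(token | prefix)
        tokens = list(probs)
        weights = [probs[t] for t in tokens]
        seq.append(rng.choices(tokens, weights=weights)[0])
    return seq

def toy_conditional(prefix):
    """Hypothetical hand-written conditional: favors repeating the last token."""
    if not prefix:
        return {"a": 0.5, "b": 0.5}
    last = prefix[-1]
    other = "b" if last == "a" else "a"
    return {last: 0.8, other: 0.2}

def sample_from_code(network, rng):
    """Latent-variable and adversarial: sample a code z, then one forward pass."""
    z = rng.gauss(0.0, 1.0)  # code drawn from a standard normal prior
    return network(z)

def toy_network(z):
    """Hypothetical stand-in for a trained decoder or generator."""
    return 3.0 * z + 1.0

def sample_langevin(score, steps, step_size, rng):
    """Score-based: start from pure noise, then refine over many small steps."""
    x = rng.gauss(0.0, 1.0)
    for _ in range(steps):
        noise = rng.gauss(0.0, 1.0)
        # Unadjusted Langevin update: drift along the score, plus fresh noise.
        x = x + 0.5 * step_size * score(x) + (step_size ** 0.5) * noise
    return x

def toy_score(x, mean=2.0, var=1.0):
    """Analytic score (gradient of the log density) of a 1-D Gaussian N(mean, var)."""
    return -(x - mean) / var

if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_autoregressive(toy_conditional, 8, rng))
    print(sample_from_code(toy_network, rng))
    print(sample_langevin(toy_score, 200, 0.1, rng))
```

The loop structure is the point of the sketch: the autoregressive sampler runs once per piece, the code-based sampler runs a single forward pass, and the score-based sampler runs many refinement steps. That is one way to see why the paradigm alone already says something about sampling speed.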

Where this fits

This is lesson 1 of 15 and the entry point of the track. There is no previous lesson here; the technical prerequisites are math (see “Before you start”), not Clawdemy content. The next lesson opens up the autoregressive paradigm, the one every modern large language model lives in. The rest of Phase 1 covers maximum likelihood and normalizing flows. Phase 2 covers VAEs, GANs, and how generative models are evaluated. Phase 3 builds energy-based, score-based, and diffusion models in full and closes at lesson 15 with a synthesis that returns to this map and explicitly places today’s most-used systems on it.

Before you start

Prerequisites: the math gate is real. You should be comfortable with what Track 4 (Visual Math: Linear Algebra) builds (vectors, matrices, eigenvectors, change of basis) and what Track 8 (Visual Math: Calculus) builds (derivatives, gradients, the basic chain rule), and willing to keep up with basic probability (distributions, expectation, conditioning, KL divergence). The track will fill in any specific probability concept you have not met as it appears, but the underlying linear algebra and calculus will be assumed. This opening lesson is the gentlest in the track: it is mostly conceptual placement, with no derivations. Later lessons unfold the math behind each paradigm.

About the math

This lesson is the conceptual opener and carries almost no math: a one-line statement of the chain rule of probability, a sample-and-decode pseudocode block, and a sentence describing the noising / denoising chain. There are no derivations, no formulas to manipulate, and no exercises that require pen-and-paper arithmetic. The track’s math density steps up immediately in lesson 2 (chain-rule factorization and next-token log-likelihood) and again in lessons 3 and 4 (maximum likelihood / KL, and change of variables for normalizing flows). This lesson exists to give the map; the math hangs on the map in the lessons that follow.
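For orientation, the chain rule meant here is presumably the chain rule of probability, not the calculus chain rule from the prerequisites. A sketch in generic notation follows; the symbols x_1, ..., x_T for the pieces of one sample are an assumption, not necessarily the lesson's own notation.

```latex
p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})
```

The identity holds for any joint distribution. Autoregressive models take it as their parameterization and learn each conditional factor on the right-hand side.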

By the end, you’ll be able to

State the technical distinction between discriminative and generative models, and define generative as learning p(x) well enough to sample from it
Name the four paradigms of generative modeling (autoregressive, latent-variable, adversarial, score-based / diffusion) and give the one-line description of each
State the training objective and sampling procedure for each paradigm
Place a modern system (chat-style LLM, Stable Diffusion, GAN-based face generator, latent-diffusion hybrid) on the four-paradigm map by reading its training objective and sampling procedure
Predict a system’s sampling speed, the kind of likelihood it provides (exact, a bound, or none), and its primary failure modes from its paradigm

Time and difficulty

Read time: about 13 minutes
Practice time: about 15 minutes (a six-question self-check, a place-the-system exercise on five system descriptions, a paradigm-trade-off drill on three use cases, and flashcards)
Difficulty: standard (a conceptual orientation lesson in a Stage D math-heavy track; this opener is the gentlest in the track, with no derivations, but the audience is assumed technical)