Autoregressive models: brief

What you’ll learn

This is lesson 2 of Track 19 (Generative Models and Diffusion), and the first lesson in the track where the math density steps up. The opener placed autoregressive models as paradigm 1 on the four-paradigm map; this lesson works through the math behind that paradigm. By the end you will be able to write any joint distribution as a product of conditionals using the chain rule of probability (an exact identity, not an approximation), state the objective that modern autoregressive large language models are pretrained on (negative log-likelihood, which for text is next-token cross-entropy), compute the joint probability and NLL of a short sequence by hand from given conditionals, and recognize the chain-rule factorization in the architecture of any autoregressive system, whether it models text, images, or audio. The source curricula are Stanford CS236 (primary anchor) and Berkeley CS294-158 (secondary framing).

Where this fits

This is lesson 2 of 15, the second step of Phase 1 (generative foundations). It opens up paradigm 1 from the lesson-1 map. The next lesson, "Maximum likelihood and the KL view", supplies the formal framework that makes NLL the natural objective: the same NLL you compute here falls out of minimizing the forward KL divergence between the data distribution and the model. Lesson 4 covers normalizing flows, which are autoregressive in spirit but use a parameterization that lets you sample in parallel. The cost is a set of architectural constraints: the transformation must be invertible and have a tractable Jacobian determinant so that the change-of-variables formula can be evaluated.

Before you start

Prerequisites: the previous lesson, "What a generative model is, and the four-paradigm map", which places autoregressive models on the four-paradigm map. The math background is the same as for the rest of T19: comfort with basic probability (conditional probability, joint distributions, and the product rule p(A, B) = p(A) · p(B | A)) and a willingness to take a log and a product. No new mathematical machinery is introduced; the chain rule is derived in one line from the definition of conditional probability.
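As a preview of that one-line derivation, the sketch below applies the product rule above repeatedly. The symbols x_1, ..., x_T (the pieces of a sequence) and T (its length) are notation assumed here for illustration; the lesson itself may use different names.

```latex
\begin{align*}
p(x_1, x_2) &= p(x_1)\, p(x_2 \mid x_1) \\
p(x_1, x_2, x_3) &= p(x_1, x_2)\, p(x_3 \mid x_1, x_2)
                  = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2) \\
p(x_1, \dots, x_T) &= \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})
\end{align*}
```

Each line treats everything seen so far as A and the next piece as B in p(A, B) = p(A) · p(B | A). No independence assumption is made, which is why the factorization is exact.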

About the math

The lesson uses three pieces of notation: the chain rule (a product of conditionals), the negative log-likelihood (a sum of negative log-probabilities), and softmax conditionals (probabilities normalized over a vocabulary). The arithmetic is genuinely small; a worked example multiplies three numbers and takes a log. The architectural moves (masked attention, masked convolution) are described in prose with reference to the canonical papers in References. The reading is harder than lesson 1 not because the math is harder but because there is more of it.
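To give a sense of how small the arithmetic is, here is a minimal Python sketch of a three-number example. The conditional probabilities 0.5, 0.4, and 0.25 are made up for illustration and are not taken from the lesson; the helper names are likewise hypothetical. The softmax helper only illustrates how a model would turn raw scores into a normalized conditional over a vocabulary.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 over the vocabulary."""
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def joint_probability(conditionals):
    """Chain rule: the joint is the product of the per-piece conditionals."""
    return math.prod(conditionals)

def negative_log_likelihood(conditionals):
    """NLL: the sum of negative log-probabilities (natural log, so units are nats)."""
    return -sum(math.log(p) for p in conditionals)

# Hypothetical values for p(x1), p(x2 | x1), p(x3 | x1, x2).
conditionals = [0.5, 0.4, 0.25]

joint = joint_probability(conditionals)
nll = negative_log_likelihood(conditionals)

print(f"joint = {joint:.4f}")   # joint = 0.0500
print(f"NLL   = {nll:.4f}")     # NLL   = 2.9957

# A softmax conditional over a toy 3-symbol vocabulary (scores are made up).
print(softmax([2.0, 1.0, 0.0]))
```

The NLL equals the negative log of the joint, because the log of a product is the sum of the logs. Summing logs is also the numerically safer route for long sequences, where the raw product underflows.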

By the end, you’ll be able to

State the chain rule of probability and apply it to factor a joint distribution into a product of per-piece conditionals
Write the negative log-likelihood as a sum of per-piece log-probabilities and recognize it as next-token cross-entropy for text
Compute the joint probability and NLL of a short sequence by hand from given conditionals
Explain why causality must be enforced architecturally (masked attention, masked convolution) rather than in the loss
Describe the sampling cost of an autoregressive model (one forward pass per piece) and the KV-caching optimization
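As a rough illustration of what enforcing causality architecturally can look like, the sketch below builds a lower-triangular mask and applies it inside a row-wise softmax, so that each position gives exactly zero weight to later positions. It is a dependency-free toy, not any particular library's attention implementation. The function names and scores are hypothetical, and whether the diagonal is included (j <= i, as assumed here) depends on how inputs and targets are shifted in a given model.

```python
import math

def causal_mask(T):
    """mask[i][j] is True when position i may look at position j (assumes j <= i)."""
    return [[j <= i for j in range(T)] for i in range(T)]

def masked_softmax_rows(scores, mask):
    """Row-wise softmax in which disallowed entries receive exactly zero weight."""
    out = []
    for row, allowed in zip(scores, mask):
        kept = [s for s, ok in zip(row, allowed) if ok]
        m = max(kept)                    # the diagonal is always allowed, so kept is non-empty
        exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Made-up attention scores for a sequence of length 3.
scores = [
    [0.1, 2.0, -1.0],
    [0.5, 0.3, 4.0],
    [1.0, 1.0, 1.0],
]

weights = masked_softmax_rows(scores, causal_mask(3))
for row in weights:
    print([round(w, 3) for w in row])
```

Because the mask sits inside the computation, no choice of scores can leak information from a later position into an earlier one. A loss term cannot give that guarantee: it only penalizes outputs after they have been computed.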

Time and difficulty

Read time: about 13 minutes
Practice time: about 15 minutes (a six-question self-check, a compute-the-joint-and-NLL exercise on a fresh sequence, a spot-the-causality-bug drill on four architecture descriptions, and flashcards)
Difficulty: standard (the first math-density lesson of the track; the arithmetic is small, but the architectural reasoning about causality requires care)