Autoregressive models, factoring by the chain rule

This is lesson 2 of Track 19 (Generative Models and Diffusion), and the first lesson where the math density steps up. The opener placed autoregressive models as paradigm 1 on the four-paradigm map; this lesson opens up the math. By the end you will be able to write any joint distribution as a product of conditionals (using the chain rule of probability, an exact identity), state the training objective every modern large language model is trained on (negative log-likelihood, equivalent to next-token cross-entropy), compute the joint probability and NLL of a short sequence by hand from given conditionals, and recognize the chain-rule factorization in the architecture of any autoregressive system (text, images, or audio). The source curricula are Stanford CS236 (primary anchor) and Berkeley CS294-158 (secondary framing).

Within the 15-lesson track, this is the second step of Phase 1 (generative foundations). The next lesson, Maximum likelihood and the KL view, lifts to the formal framework that makes NLL the natural objective: the same NLL you compute here, derived from minimizing the forward KL divergence. Lesson 4 covers normalizing flows, which are autoregressive in spirit but use a parameterization that allows parallel sampling, at the cost of architectural constraints (the change-of-variables formula and its Jacobian).

Prerequisites: the previous lesson (What a generative model is, and the four-paradigm map), which supplies the four-paradigm context. The math background is the same as for the rest of T19: comfort with basic probability (conditional probability, joint distributions, and the rule p(A, B) = p(A) · p(B | A)) and a willingness to take a log and a product. No new mathematical machinery is introduced; the chain rule follows in one line from the definition of conditional probability.
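
As a quick preview of that one-line derivation, here is a sketch in LaTeX. The sequence symbols x_1, ..., x_T are an assumed notation for illustration and may differ from what the lesson body uses. The first line is the definition of conditional probability rearranged; the remaining lines apply it repeatedly.

```latex
\begin{align*}
p(B \mid A) &= \frac{p(A, B)}{p(A)}
  \quad\Longrightarrow\quad
  p(A, B) = p(A)\, p(B \mid A), \qquad p(A) > 0 \\
p(x_1, x_2, x_3) &= p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2) \\
p(x_1, \dots, x_T) &= \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})
\end{align*}
```

Nothing is approximated at any step, which is why the factorization is an exact identity rather than a modeling assumption.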

The lesson uses three pieces of notation: the chain rule (a product of conditionals), the negative log-likelihood (a sum of negative log-probabilities), and softmax conditionals (probabilities normalized over a vocabulary). The arithmetic is genuinely small; a worked example multiplies three numbers and takes a log. The architectural moves (masked attention, masked convolution) are described in prose with reference to the canonical papers in References. The reading is harder than lesson 1 not because the math is harder but because there is more of it.
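
For orientation, the second and third pieces of notation can be written roughly as follows. This is a sketch under assumed symbols, not necessarily the lesson's own: p_θ is the model, x_{<t} abbreviates x_1, ..., x_{t-1}, z_{t,v} is the score (logit) the model assigns to vocabulary entry v at position t, and 𝒱 is the vocabulary.

```latex
\begin{align*}
\mathrm{NLL}(x_1, \dots, x_T)
  &= -\log p_\theta(x_1, \dots, x_T)
   = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}) \\
p_\theta(x_t = v \mid x_{<t})
  &= \frac{\exp(z_{t,v})}{\sum_{v' \in \mathcal{V}} \exp(z_{t,v'})}
\end{align*}
```

The log turns the chain-rule product into a sum, so the loss decomposes into one term per position; for text, each term is the cross-entropy between the observed next token and the softmax output at that position.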

  • State the chain rule of probability and apply it to factor a joint distribution into a product of per-piece conditionals
  • Write the negative log-likelihood as a sum of per-piece log-probabilities and recognize it as next-token cross-entropy for text
  • Compute the joint probability and NLL of a short sequence by hand from given conditionals
  • Explain why causality must be enforced architecturally (masked attention, masked convolution) rather than in the loss
  • Describe the sampling cost of an autoregressive model (one forward pass per piece) and the KV-caching optimization
  • Read time: about 13 minutes
  • Practice time: about 15 minutes (a six-question self-check, a compute-the-joint-and-NLL exercise on a fresh sequence, a spot-the-causality-bug drill on four architecture descriptions, and flashcards)
  • Difficulty: standard (the first math-density lesson of the track; the arithmetic is small, but the architectural reasoning about causality requires care)
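
The compute-by-hand objective above can be checked with a few lines of Python. This is a minimal sketch: the three conditional probabilities are made-up numbers chosen for easy arithmetic, not values from the lesson, and the function names are hypothetical.

```python
import math

# Hypothetical conditionals for a three-token sequence (made-up numbers):
# p(x1), p(x2 | x1), p(x3 | x1, x2)
conditionals = [0.5, 0.4, 0.25]


def joint_probability(conds):
    """Chain rule: the joint is the product of the per-token conditionals."""
    return math.prod(conds)


def negative_log_likelihood(conds):
    """NLL in nats: the sum of the negative log of each conditional."""
    return -sum(math.log(p) for p in conds)


joint = joint_probability(conditionals)
nll = negative_log_likelihood(conditionals)

print(f"joint = {joint:.4f}")                                  # joint = 0.0500
print(f"NLL = {nll:.4f} nats")                                 # NLL = 2.9957 nats
print(f"per-token NLL = {nll / len(conditionals):.4f} nats")   # per-token NLL = 0.9986 nats
```

The two routes agree: taking the log of the product (0.5 · 0.4 · 0.25 = 0.05) gives the same number as summing the three negative logs, which is the identity the NLL objective rests on.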