Summary: Autoregressive models, factoring by the chain rule

The last lesson called autoregressive models paradigm 1 and stopped at the one-line description. This lesson opened up the math. The whole thing reduces to one sentence: an autoregressive model factors any joint distribution by the chain rule of probability, learns each conditional with a neural network that respects causality, and trains by minimizing the negative log-likelihood, which on text is next-token cross-entropy. Every modern LLM follows this recipe. This is the scan-it-in-five-minutes version.

Core ideas

The chain rule of probability is an exact identity, not an approximation: p(x_1, ..., x_n) = p(x_1) · p(x_2 | x_1) · ... · p(x_n | x_1, ..., x_{n-1}). It holds for every joint distribution. An autoregressive model represents each conditional with a neural network and reassembles the joint as this product.
The model is generative in the lesson-1 sense for free: sampling runs the factors in order (draw x_1, then x_2 | x_1, then x_3 | x_1, x_2, …); likelihood evaluation plugs an example into each factor and multiplies. No new machinery is needed beyond the per-piece conditionals.
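As a minimal sketch of those two operations, the toy below swaps the neural network for a hand-written rule. The function `conditional` and its probabilities are hypothetical, chosen only so the example runs; sampling and likelihood evaluation both reuse the same per-piece conditionals.

```python
import random

VOCAB = ["A", "B", "C"]

def conditional(prefix):
    """Hypothetical stand-in for a trained network: p(next piece | prefix).

    Toy rule (an assumption for illustration): the most recent symbol gets
    probability 0.2 and the other symbols share the remaining 0.8 equally.
    """
    if not prefix:
        return {s: 1.0 / len(VOCAB) for s in VOCAB}
    last = prefix[-1]
    rest = 0.8 / (len(VOCAB) - 1)
    return {s: (0.2 if s == last else rest) for s in VOCAB}

def sample(n, rng):
    """Run the factors in order: draw x_1, then x_2 | x_1, and so on."""
    seq = []
    for _ in range(n):
        probs = conditional(seq)
        seq.append(rng.choices(VOCAB, weights=[probs[s] for s in VOCAB])[0])
    return seq

def likelihood(seq):
    """Plug the sequence into each factor and multiply."""
    p = 1.0
    for i, x in enumerate(seq):
        p *= conditional(seq[:i])[x]
    return p

rng = random.Random(0)
drawn = sample(3, rng)
print(drawn, likelihood(drawn))
```

Because every conditional sums to one, the likelihoods of all 27 length-3 sequences also sum to one, which is a quick sanity check that the product really defines a joint distribution.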
The training objective is negative log-likelihood (NLL). By the chain rule, -log p(x_1, ..., x_n) = sum_i -log p(x_i | x_1, ..., x_{i-1}), a sum of per-piece terms. For text this is next-token cross-entropy, the loss every modern large language model is trained on.
Worked anchor: vocabulary {A, B, C}, sequence BAC, model conditionals p(B)=0.4, p(A|B)=0.5, p(C|B,A)=0.6. Joint p(BAC) = 0.4·0.5·0.6 = 0.12. With the natural log, NLL = -log(0.12) ≈ 2.12 nats, which equals the per-piece sum -log(0.4) - log(0.5) - log(0.6) ≈ 0.916 + 0.693 + 0.511 ≈ 2.12. Same answer both ways because log turns a product into a sum.
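The arithmetic in the worked anchor can be checked in a few lines. This sketch uses only the three conditionals given above and assumes the natural log; the variable names are illustrative.

```python
import math

# Model conditionals for the sequence BAC, taken from the worked anchor.
factors = [0.4, 0.5, 0.6]  # p(B), p(A|B), p(C|B,A)

joint = math.prod(factors)
nll_from_joint = -math.log(joint)                   # natural log, so nats
nll_per_piece = sum(-math.log(p) for p in factors)  # sum of per-piece terms

print(f"joint          = {joint:.2f}")
print(f"NLL from joint = {nll_from_joint:.4f}")
print(f"NLL per piece  = {nll_per_piece:.4f}")
```

The two NLL values agree up to floating-point rounding. In training, the per-piece form is the one that matters, since each term is one next-token cross-entropy contribution.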
Causality must be architectural, not enforced in the loss. If the network can read positions > i while predicting x_i, the optimizer will exploit it; training looks fine but sampling breaks. The architectural moves are masked self-attention (causal transformer, the engine under every LLM) and masked convolution (PixelCNN for images, WaveNet for audio): each removes the connectivity that would let the network peek at the future.
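To make the masking idea concrete, here is a minimal pure-Python sketch of causally masked attention over scalar values. It is not a real transformer layer: the scores are fixed numbers rather than learned query-key products, and it assumes the common convention that the output at position i may use positions up to and including i (with inputs shifted by one, that output predicts the next piece).

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over scores with every entry j > i masked out.

    scores[i][j] is the raw attention score from position i to position j.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        # Keep only positions 0..i; masked positions get weight exactly 0.
        visible = scores[i][: i + 1]
        m = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in visible]
        z = sum(exps)
        weights.append([e / z for e in exps] + [0.0] * (n - i - 1))
    return weights

def attend(scores, values):
    """Weighted sum of scalar values under the causal mask."""
    w = causal_attention_weights(scores)
    return [sum(wij * v for wij, v in zip(row, values)) for row in w]

# Illustrative scores; a real model would compute these from its inputs.
scores = [[0.1, 2.0, -1.0],
          [0.5, 0.3, 4.0],
          [1.0, 0.0, 0.2]]
out_a = attend(scores, [1.0, 2.0, 3.0])
out_b = attend(scores, [1.0, 2.0, 99.0])  # change only the last position
print(out_a[:2] == out_b[:2])  # earlier outputs cannot see the change
```

The check at the end is the property that matters: changing a later position leaves every earlier output untouched, so there is nothing for the optimizer to exploit.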
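Sampling costs one forward pass per piece. Without caching, each pass recomputes attention over the whole prefix, so its cost grows quadratically with prefix length; with KV caching, each pass handles only the new piece against cached keys and values, so its attention cost grows roughly linearly with prefix length. Either way the number of sequential passes equals the output length, so total generation cost still scales with output length: the autoregressive paradigm’s signature trade-off in exchange for exact likelihood.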
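A rough way to see what KV caching saves is to count query-key dot products under a deliberately simplified cost model. The model is an assumption: one attention layer, one head, no feed-forward or memory costs, and the function name is hypothetical.

```python
def attention_dot_products(n_steps, kv_cache):
    """Count query-key dot products needed to generate n_steps pieces.

    Simplified cost model (an assumption): without a cache, step t recomputes
    causal attention for the whole prefix of length t; with a cache, step t
    computes one new query against t cached keys.
    """
    total = 0
    for t in range(1, n_steps + 1):
        if kv_cache:
            total += t                 # one new query vs. t keys
        else:
            total += t * (t + 1) // 2  # every query i vs. keys 1..i
    return total

for n in (10, 100, 1000):
    print(n,
          attention_dot_products(n, kv_cache=False),
          attention_dot_products(n, kv_cache=True))
```

Under this count, caching makes each step much cheaper, but the number of sequential steps is still the output length in both cases.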
Paradigm properties that show up downstream when you use an LLM: streaming output falls out of one-piece-at-a-time sampling; exact likelihood scores are available (among the four paradigms, only autoregressive models and flows offer this); long-context drift is paradigm-level (early errors cascade through the chain).
The same paradigm runs on images (PixelCNN), audio (WaveNet), and text (the transformer-based LLMs). Any given architecture is one of many options; the math (chain rule + per-conditional networks + masked connectivity) is the paradigm.

What changes for you

Before this lesson, “an LLM does next-token prediction” was likely a phrase that named the right thing without explaining it. Now it is a specific math identity (the chain rule), a specific training objective (per-piece NLL, equivalent to next-token cross-entropy), and a specific architectural constraint (causal masking). When you next read about a new language-model architecture, you can ignore the flashy headline and ask three concrete questions: how is each conditional parameterized, how is causality enforced, and is the loss next-piece NLL? The answers to those three questions place any autoregressive system unambiguously on the map from lesson 1. The next lesson moves up one level to the formal framework that makes NLL the natural objective: maximum likelihood and the KL divergence.