References: Autoregressive models, factoring by the chain rule

Source material

Source curricula (this lesson mirrors the structure of more than one course; each is cited as further study):

PRIMARY (this lesson follows its framing most directly)
• Stanford CS236, "Deep Generative Models", Lecture 3: Autoregressive Models
  Instructor: Stefano Ermon
  Course URL: https://deepgenerativemodels.github.io/
  Syllabus: https://deepgenerativemodels.github.io/syllabus.html
  License: standard course-page link-out; cited as further study

SECONDARY (also contributed to this lesson's framing)
• Berkeley CS294-158, "Deep Unsupervised Learning" (Spring 2024), Lecture 2: Autoregressive Models
  Instructors: Pieter Abbeel, Wilson Yan, Kevin Frans, Philipp Wu
  Course URL: https://sites.google.com/view/berkeley-cs294-158-sp24/
  License: standard course-page link-out; cited as further study

Clawdemy's lessons are original prose that follows the pedagogical arc of these
two courses, anchored on CS236's lecture order with CS294-158 framing pulled in
where its slide deck and recording are stronger. We do not reproduce or
transcribe the lectures; we cite them as the recommended companions. All rights
to the original course materials remain with the respective instructors and
institutions.

Watch this next

Stanford CS236 (Stefano Ermon), course homepage. The primary anchor for this lesson; Lecture 3 (Autoregressive Models) walks through the chain rule, the NLL objective, and the architectural moves (masked attention, causal convolutions) that this lesson mirrors. The course notes at deepgenerativemodels.github.io/notes include a written treatment of autoregressive models that is more careful than the lecture slides about how each conditional is parameterized.
Berkeley CS294-158 Sp24 (Pieter Abbeel et al.), course homepage. Lecture 2 (Autoregressive Models) is the secondary source. The slide deck includes worked examples on text and audio (WaveNet) that complement CS236’s image-heavy emphasis.

Going deeper

A short, durable list. Each link is a specific next step, not a generic pile.

“Pixel Recurrent Neural Networks” (van den Oord, Kalchbrenner, Kavukcuoglu, 2016). The foundational paper for autoregressive image modeling. It introduces both PixelRNN and the masked-convolution PixelCNN that the lesson mentions. The paper is concise; the introduction alone is a good complement to this lesson.
“Attention Is All You Need” (Vaswani et al., 2017). The transformer paper. Section 3 introduces the masked self-attention that makes a transformer decoder causal, and the architecture is the engine under most modern autoregressive language models. Read with the chain rule in mind: every masked-attention diagram in the paper is a picture of “predict x_i from x_1, ..., x_{i-1}.”
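
The masked self-attention mentioned above can be sketched in a few lines. This is a minimal illustration in plain Python, not the paper's implementation: the function names causal_mask and masked_softmax and the score values are hypothetical, and the sketch covers only the masking step, not queries, keys, values, or multiple heads. Row i may attend to positions 0 through i; when the inputs are shifted right by one, as in the paper's decoder, that row is the one used to predict the next token.

```python
import math

def causal_mask(n):
    """Return an n x n mask: mask[i][j] is True when position i may attend to j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax over allowed positions only; masked entries get weight 0."""
    weights = []
    for row_scores, row_mask in zip(scores, mask):
        allowed = [s for s, m in zip(row_scores, row_mask) if m]
        top = max(allowed)  # subtract the row max for numerical stability
        exps = [math.exp(s - top) if m else 0.0
                for s, m in zip(row_scores, row_mask)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical raw attention scores for a 4-token sequence (arbitrary values).
scores = [
    [0.5, 1.2, -0.3, 2.0],
    [0.1, 0.4, 0.9, -1.0],
    [1.5, -0.2, 0.3, 0.7],
    [0.0, 0.8, -0.5, 0.2],
]
mask = causal_mask(4)
weights = masked_softmax(scores, mask)

for row in weights:
    print([round(w, 3) for w in row])
```

Every entry above the diagonal comes out as zero, so no position can draw information from a later one. That zero pattern is the chain-rule ordering expressed as a matrix.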

Adjacent topics

Where this sits in the track.

What a generative model is, and the four-paradigm map (previous lesson). Autoregressive modeling was paradigm 1, and this lesson examined it in detail. The chain-rule recipe (factor the joint, model each conditional, multiply the conditionals back together) is specific to autoregressive models; the other three paradigms use other tricks.
Maximum likelihood and the KL view (next lesson). This lesson stated “minimize the negative log-likelihood” as the training objective without justifying why it is the natural choice. The next lesson derives it from first principles, showing that minimizing the average NLL over the training set amounts to minimizing an empirical estimate of the forward KL divergence from the data distribution to the model, up to a constant that does not depend on the model. The same one-line identity unifies the training objective for autoregressive models, flows, and (via the ELBO) latent-variable models.
Normalizing flows (lesson 4). Flows are autoregressive in spirit, with a different parameterization that, for some architectures, lets you sample in parallel instead of sequentially. The cost is architectural constraints: each layer must be invertible and have a tractable Jacobian determinant so that the change-of-variables formula can be evaluated. The chain-rule scaffolding here will help when flows generalize it.
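
The NLL and KL connection in the maximum-likelihood entry above can be checked numerically on a toy example. This is a sanity check under stated assumptions, not the derivation the next lesson gives: the two three-outcome distributions are hypothetical, and the exact expectation under p_data stands in for an average over training samples.

```python
import math

# Hypothetical toy distributions over three outcomes (illustrative values).
p_data = [0.5, 0.3, 0.2]
p_model = [0.4, 0.4, 0.2]

def cross_entropy(p, q):
    """Expected negative log-likelihood of q under p: -sum_x p(x) log q(x)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """Entropy of p, i.e. the cross-entropy of p with itself."""
    return cross_entropy(p, p)

def forward_kl(p, q):
    """KL(p || q) = sum_x p(x) log(p(x) / q(x))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

nll = cross_entropy(p_data, p_model)
kl = forward_kl(p_data, p_model)

# The gap between expected NLL and forward KL is the entropy of the data,
# which does not depend on the model.
gap = nll - kl
print(abs(gap - entropy(p_data)) < 1e-12)
```

Because the gap is the same for every model, the model that minimizes expected NLL also minimizes the forward KL divergence.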