References: Autoregressive models, factoring by the chain rule
Source material
Source curricula (multi-source structural mirror; cited as further study):

PRIMARY (this lesson follows its framing most directly)
• Stanford CS236, "Deep Generative Models", Lecture 3: Autoregressive Models
  Instructor: Stefano Ermon
  Course URL: https://deepgenerativemodels.github.io/
  Syllabus: https://deepgenerativemodels.github.io/syllabus.html
  License: standard course-page link-out; cited as further study

SECONDARY (also contributed to this lesson's framing)
• Berkeley CS294-158, "Deep Unsupervised Learning" (Spring 2024), Lecture 2: Autoregressive Models
  Instructors: Pieter Abbeel, Wilson Yan, Kevin Frans, Philipp Wu
  Course URL: https://sites.google.com/view/berkeley-cs294-158-sp24/
  License: standard course-page link-out; cited as further study
Clawdemy's lessons are original prose that follows the pedagogical arc of these two courses, anchored on CS236's lecture order with CS294-158 framing pulled in where its slide deck and recording are stronger. We do not reproduce or transcribe the lectures; we cite them as the recommended companions. All rights to the original course materials remain with the respective instructors and institutions.

Watch this next
- Stanford CS236 (Stefano Ermon), course homepage. The primary anchor for this lesson; Lecture 3 (Autoregressive Models) walks the chain rule, the NLL objective, and the architectural moves (masked attention, causal convolutions) this lesson mirrors. The course notes at deepgenerativemodels.github.io/notes include a written treatment of autoregressive models that is more careful than the lecture slides on the conditional-parameterization details.
- Berkeley CS294-158 Sp24 (Pieter Abbeel et al.), course homepage. Lecture 2 (Autoregressive Models) is the secondary source. The slide deck includes worked examples on text and audio (WaveNet) that complement CS236’s image-heavy emphasis.
Going deeper
A short, durable list. Each link is a specific next step, not a generic pile.
- “Pixel Recurrent Neural Networks” (van den Oord, Kalchbrenner, Kavukcuoglu, 2016). The PixelRNN paper, a foundational reference for autoregressive image modeling. It introduces both PixelRNN and the masked-convolution PixelCNN that the lesson mentions. The paper is famously crisp; the introduction alone is a good complement to this lesson.
- “Attention Is All You Need” (Vaswani et al., 2017). The transformer paper. Section 3 introduces the masked self-attention that makes a transformer causal, and the architecture is the engine under nearly every modern autoregressive language model. Read with the chain rule in mind: every masked-attention diagram in the paper is a picture of “predict x_i from x_1, ..., x_{i-1}.”
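
The masking idea behind that last entry can be made concrete with a few lines of NumPy. This is a minimal single-head sketch rather than the paper's full architecture: the function name `causal_self_attention`, the random weight matrices, and the sequence sizes are all hypothetical choices for illustration. It checks that perturbing later positions leaves earlier outputs untouched. In a language model the targets are additionally shifted by one position, so the output at position i is used to predict the next token.

```python
import numpy as np

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention with a causal mask (illustrative sketch).

    x has shape (T, d). Row i of the output depends only on rows 0..i of x.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])               # (T, T) attention logits
    T = x.shape[0]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)    # True strictly above the diagonal
    scores = np.where(future, -np.inf, scores)            # hide positions j > i from row i
    scores = scores - scores.max(axis=-1, keepdims=True)  # stabilize the softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Hypothetical toy inputs: 5 positions, 4 features, random projections.
rng = np.random.default_rng(0)
T, d = 5, 4
x = rng.normal(size=(T, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

out = causal_self_attention(x, w_q, w_k, w_v)

# Change only the last two positions; the first three outputs should not move.
x_perturbed = x.copy()
x_perturbed[3:] += 10.0
out_perturbed = causal_self_attention(x_perturbed, w_q, w_k, w_v)

prefix_unchanged = np.allclose(out[:3], out_perturbed[:3])
suffix_changed = not np.allclose(out[3:], out_perturbed[3:])
print(prefix_unchanged, suffix_changed)
```

The mask is applied to the logits before the softmax, so masked positions receive exactly zero weight instead of merely a small one.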
Adjacent topics
Where this sits in the track.
- What a generative model is, and the four-paradigm map (previous lesson). Autoregressive models were paradigm 1; this lesson opened that paradigm up. The chain-rule recipe (factor the joint, model each conditional, reassemble) is specific to autoregressive models; the other three paradigms use other tricks.
- Maximum likelihood and the KL view (next lesson). This lesson stated “minimize the negative log-likelihood” as the training objective without justifying why it is the natural choice. The next lesson derives it from first principles, showing that the average NLL over the training set is a sample estimate of the cross-entropy between the data distribution and the model, and that minimizing it is equivalent to minimizing the forward KL divergence KL(p_data || p_model), because the two differ only by the entropy of the data, which does not depend on the model. The same one-line identity unifies the training objective for autoregressive models, flows, and (via the ELBO) latent-variable models.
- Normalizing flows (lesson 4). Flows are autoregressive in spirit, with a different parameterization that can allow parallel rather than sequential sampling, at the cost of architectural constraints (the transformation must be invertible, and the Jacobian determinant in the change-of-variables formula must be cheap to evaluate). The chain-rule scaffolding here will help when flows generalize it.
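
The identity referred to in the maximum-likelihood entry above can be checked numerically on a toy example. The sketch below assumes NumPy, and the two four-outcome distributions are hypothetical values chosen for illustration, not taken from either course. It verifies that the expected NLL equals the data entropy plus the forward KL divergence, and that a sample average of the NLL approximates the expectation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy distributions over 4 outcomes (illustration only).
p_data = np.array([0.50, 0.25, 0.15, 0.10])
p_model = np.array([0.40, 0.30, 0.20, 0.10])

entropy = -np.sum(p_data * np.log(p_data))              # H(p_data), independent of the model
forward_kl = np.sum(p_data * np.log(p_data / p_model))  # KL(p_data || p_model)
expected_nll = -np.sum(p_data * np.log(p_model))        # E_{x ~ p_data}[-log p_model(x)]

# Exact identity: expected NLL = entropy + forward KL.
identity_gap = abs(expected_nll - (entropy + forward_kl))

# The training loss is a sample average, i.e. a Monte Carlo estimate of expected_nll.
samples = rng.choice(len(p_data), size=200_000, p=p_data)
empirical_nll = -np.mean(np.log(p_model[samples]))

print(identity_gap, expected_nll, empirical_nll)
```

Because the entropy term does not depend on the model, lowering the average NLL and lowering the forward KL divergence move the model in the same direction.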