References: Autoregressive models, factoring by the chain rule

Source curricula (multi-source structural mirror; cited as further study):
PRIMARY (this lesson follows its framing most directly)
• Stanford CS236, "Deep Generative Models", Lecture 3: Autoregressive Models
Instructor: Stefano Ermon
Course URL: https://deepgenerativemodels.github.io/
Syllabus: https://deepgenerativemodels.github.io/syllabus.html
License: standard course-page link-out; cited as further study
SECONDARY (also contributed to this lesson's framing)
• Berkeley CS294-158, "Deep Unsupervised Learning" (Spring 2024), Lecture 2: Autoregressive Models
Instructors: Pieter Abbeel, Wilson Yan, Kevin Frans, Philipp Wu
Course URL: https://sites.google.com/view/berkeley-cs294-158-sp24/
License: standard course-page link-out; cited as further study
Clawdemy's lessons are original prose that follows the pedagogical arc of these
two courses, anchored on CS236's lecture order with CS294-158 framing pulled in
where its slide deck and recording are stronger. We do not reproduce or
transcribe the lectures; we cite them as the recommended companions. All rights
to the original course materials remain with the respective instructors and
institutions.

A short, durable list. Each link is a specific next step, not a generic pile.

  • “Pixel Recurrent Neural Networks” (van den Oord, Kalchbrenner, Kavukcuoglu, 2016). The PixelRNN paper, the foundational paper for autoregressive image modeling. It introduces both PixelRNN and the masked-convolution PixelCNN that the lesson mentions. The paper is famously crisp; the introduction alone is a good complement to this lesson.

  • “Attention Is All You Need” (Vaswani et al., 2017). The transformer paper. Section 3 describes the masked self-attention in the decoder that makes a transformer causal, and the architecture is the engine under nearly every modern autoregressive language model. Read with the chain rule in mind: the masked attention in the decoder is a picture of “predict x_i from x_1, ..., x_{i-1}.”
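
To make the causal mask concrete, here is a minimal sketch in plain Python. It is an illustration only, not the paper's implementation: the function name `causal_attention_weights` and the list-of-lists score format are assumptions made for this example, and it covers only the masking and softmax step, not the full attention layer. It follows the usual convention that position i may attend to positions up to and including i, with the inputs shifted by one so that each output still predicts the next token.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax over raw attention scores with a causal mask.

    scores[i][j] is the raw score of query position i on key position j.
    Position i is allowed to attend only to positions j <= i.
    (Hypothetical helper for illustration; not from any library.)
    """
    n = len(scores)
    weights = []
    for i in range(n):
        # Mask future positions (j > i) by setting their score to -inf,
        # so they receive exactly zero weight after the softmax.
        masked = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        # Numerically stable softmax: subtract the row maximum first.
        row_max = max(masked)
        exps = [math.exp(s - row_max) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


if __name__ == "__main__":
    raw = [[0.0, 1.0, 2.0],
           [0.0, 0.0, 5.0],
           [1.0, 1.0, 1.0]]
    for row in causal_attention_weights(raw):
        print(row)
```

Every entry above the diagonal comes out as zero, so the representation at position i never depends on later positions. That is the property that lets a single forward pass score all the chain-rule conditionals at once during training.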

Where this sits in the track.

  • What a generative model is, and the four-paradigm map (previous lesson). Autoregressive modeling was paradigm 1; this lesson opened it up. The chain-rule recipe (factor + model each conditional + reassemble) is specific to autoregressive models; the other three paradigms use other tricks.

  • Maximum likelihood and the KL view (next lesson). This lesson stated “minimize the negative log-likelihood” as the training objective without justifying why it is the natural choice. The next lesson derives it from first principles, showing that minimizing the average NLL on the training set is the same as minimizing an empirical estimate of the forward KL divergence from the data distribution to the model, up to a constant that does not depend on the model. The same one-line identity unifies the training objective for autoregressive models, flows, and (via the ELBO) latent-variable models.

  • Normalizing flows (lesson 4). Flows are autoregressive in spirit, with a different parameterization that can let you sample in parallel instead of sequentially, at the cost of architectural constraints: every layer must be invertible and have a tractable Jacobian determinant, because the change-of-variables formula needs both. The chain-rule scaffolding here will help when flows generalize it.
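
For reference, the chain-rule recipe and the identity the next lesson builds on can be written out as below. This is a sketch in assumed notation, not a transcription from either course: \theta stands for the model parameters, n for the number of dimensions or tokens in one example x, and x^{(1)}, ..., x^{(N)} for the N training examples.

```latex
% Chain-rule factorization, and the per-example negative log-likelihood it gives
p_\theta(x) = \prod_{i=1}^{n} p_\theta(x_i \mid x_1, \dots, x_{i-1}),
\qquad
-\log p_\theta(x) = -\sum_{i=1}^{n} \log p_\theta(x_i \mid x_1, \dots, x_{i-1})

% Forward KL from the data distribution to the model
D_{\mathrm{KL}}\left(p_{\mathrm{data}} \,\|\, p_\theta\right)
  = \mathbb{E}_{x \sim p_{\mathrm{data}}}\left[\log p_{\mathrm{data}}(x)\right]
  - \mathbb{E}_{x \sim p_{\mathrm{data}}}\left[\log p_\theta(x)\right]

% The first term does not depend on \theta; the second is estimated by the average NLL
\mathbb{E}_{x \sim p_{\mathrm{data}}}\left[-\log p_\theta(x)\right]
  \approx \frac{1}{N} \sum_{k=1}^{N} -\log p_\theta\left(x^{(k)}\right)
```

Since the first term of the KL divergence is fixed by the data, minimizing the average NLL over \theta and minimizing the estimated forward KL pick out the same parameters.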