References: The Transformer architecture

Source curriculum (structural mirror, cited as further study):
• Stanford CS336, "Language Modeling from Scratch", Lecture 3:
Architectures, hyperparameters
Instructors: Tatsunori Hashimoto and Percy Liang (Stanford)
Course page: https://cs336.stanford.edu/
Lecture videos: YouTube playlist
https://www.youtube.com/playlist?list=PLoROMvodv4rMqXOcazWaTUHhq-yembLCV
License: no explicit license is published on the course site; lecture
videos are on YouTube under standard terms; slides are public on GitHub
without a stated license.
Required attribution: "Based on the structure of Stanford CS336,
'Language Modeling from Scratch,' by Tatsunori Hashimoto and Percy Liang
(cs336.stanford.edu). This is an independent structural mirror in
original prose; it reproduces no course materials, and Stanford does
not endorse it."
This lesson mirrors the structure of Lecture 3 (architecture and
hyperparameters). Clawdemy's lessons are original prose that follows the
pedagogical arc of the course. Because the source publishes no explicit
license, we cite it as a recommended companion and reproduce none of its
materials. All rights to the original course materials remain with their
creators.

A short, durable list. Each link is a specific next step, not a generic pile.

Where this connects inside the track.

  • Counting the cost (lesson 2). The parameter formula here, N ≈ 12 * n_layers * d_model^2, supplies the N in lesson 2’s 6ND (compute) and 16N (memory) accounting, where D is the number of training tokens.

  • Attention alternatives and mixture of experts (lesson 4). The next lesson varies the attention sublayer of this skeleton and adds the mixture-of-experts FFN variant, closing Phase 1.

  • Scaling laws (lesson 9). How to set the hyperparameters here (depth versus width versus data for a fixed budget) is the question scaling laws answer with evidence.
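
As a rough illustration of how the parameter formula feeds the lesson 2 accounting, here is a minimal sketch. The function names, the example model shape (12 layers, d_model = 768), and the token count are hypothetical choices for illustration. It also assumes the 16N figure is in bytes, and that the 12 * n_layers * d_model^2 estimate ignores embeddings, biases, and normalization parameters.

```python
# Back-of-the-envelope accounting; all names and example values are illustrative.

def approx_params(n_layers: int, d_model: int) -> int:
    """Approximate non-embedding parameter count: 12 * n_layers * d_model^2."""
    return 12 * n_layers * d_model ** 2


def train_flops(n_params: int, n_tokens: int) -> int:
    """Approximate training compute using the 6ND rule of thumb."""
    return 6 * n_params * n_tokens


def train_memory_bytes(n_params: int) -> int:
    """Approximate training memory using the 16N rule (assumed to be bytes)."""
    return 16 * n_params


# Hypothetical example: a 12-layer model with d_model = 768,
# trained on 1 billion tokens.
n = approx_params(n_layers=12, d_model=768)
flops = train_flops(n, n_tokens=1_000_000_000)
mem = train_memory_bytes(n)

print(f"parameters N  ~ {n:,}")
print(f"compute 6ND   ~ {flops:.3e} FLOPs")
print(f"memory 16N    ~ {mem / 1e9:.2f} GB")
```

Doubling d_model in this sketch quadruples N, and with it both the compute and the memory estimate, while doubling n_layers only doubles them. That asymmetry is part of the depth-versus-width trade-off that the scaling-laws lesson examines.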