References: The Transformer architecture
Source material
Source curriculum (structural mirror, cited as further study):

- Stanford CS336, "Language Modeling from Scratch", Lecture 3: Architectures, hyperparameters
  Instructors: Tatsunori Hashimoto and Percy Liang (Stanford)
  Course page: https://cs336.stanford.edu/
  Lecture videos: YouTube playlist https://www.youtube.com/playlist?list=PLoROMvodv4rMqXOcazWaTUHhq-yembLCV
  License: no explicit license is published on the course site; lecture videos are on YouTube under standard terms; slides are public on GitHub without a stated license.
  Required attribution: "Based on the structure of Stanford CS336, 'Language Modeling from Scratch,' by Tatsunori Hashimoto and Percy Liang (cs336.stanford.edu). This is an independent structural mirror in original prose; it reproduces no course materials, and Stanford does not endorse it."

This lesson mirrors the structure of Lecture 3 (architecture and hyperparameters). Clawdemy's lessons are original prose that follows the pedagogical arc of the course. Because the source publishes no explicit license, we cite it as a recommended companion and reproduce none of its materials. All rights to the original course materials remain with their creators.

Watch this next
- Stanford CS336, Lecture 3: Architectures and hyperparameters by Hashimoto and Liang. The lecture this lesson mirrors. It surveys the modern architecture choices and their justifications in more depth, with the empirical evidence behind each.
Going deeper
A short, durable list. Each link is a specific next step, not a generic pile.
- “RoFormer: Enhanced Transformer with Rotary Position Embedding” by Su et al. (2021). The paper that introduced RoPE, now the standard positional scheme. Read it for why rotating queries and keys encodes relative position.
- “GLU Variants Improve Transformer” by Noam Shazeer (2020). The short paper behind gated FFN activations like SwiGLU, including why the hidden dimension is reduced to keep parameters matched.
- The Llama model card and config. A concrete modern open model whose config shows exactly these fields (RMSNorm, RoPE, SwiGLU, no biases). The best way to see the converged choices in a real released model.
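
The RoFormer entry above says that rotating queries and keys encodes relative position. A minimal pure-Python sketch of that property follows. It is an illustration, not code from the paper or from any library: the helper names `rope` and `dot`, the toy vectors, and the base of 10000 are assumptions chosen for the demo, and real implementations may pair dimensions differently.

```python
import cmath
import math


def rope(vec, pos, base=10000.0):
    """Rotate each consecutive pair of `vec` by an angle proportional to `pos`.

    Hypothetical helper for illustration: each pair (vec[i], vec[i+1]) is
    treated as one complex number and multiplied by exp(i * pos * theta),
    where theta shrinks for later pairs.
    """
    d = len(vec)
    assert d % 2 == 0, "dimension must be even so it splits into pairs"
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)  # per-pair rotation frequency
        z = complex(vec[i], vec[i + 1]) * cmath.exp(1j * pos * theta)
        out.extend([z.real, z.imag])
    return out


def dot(a, b):
    """Plain dot product, standing in for one attention score."""
    return sum(x * y for x, y in zip(a, b))


# Toy query and key vectors (arbitrary values).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset (3) at two different absolute positions.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 105), rope(k, 102))

# A different offset (2) for comparison.
score_c = dot(rope(q, 5), rope(k, 3))

print(math.isclose(score_a, score_b, abs_tol=1e-9))  # scores match when the offset matches
print(math.isclose(score_a, score_c, abs_tol=1e-6))  # and differ when the offset changes
```

Shifting both positions by the same amount leaves the score unchanged up to floating-point error, which is the sense in which the attention score depends only on the relative offset.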
Adjacent topics
Where this connects inside the track.
- Counting the cost (lesson 2). The `12 * n_layers * d_model^2` parameter formula here feeds straight into lesson 2’s `6ND` (compute) and `16N` (memory) accounting.
- Attention alternatives and mixture of experts (lesson 4). The next lesson varies the attention sublayer of this skeleton and adds the mixture-of-experts FFN variant, closing Phase 1.
- Scaling laws (lesson 9). Choosing the hyperparameters here (depth vs width vs data for a fixed budget) is exactly what scaling laws decide with evidence.
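
The parameter formula in the "Counting the cost" entry can be turned into a few lines of arithmetic. The sketch below is illustrative only. The layer count, width, and token count are hypothetical values, the function names are made up for this example, and it reads `6ND` as training FLOPs for N parameters and D tokens and `16N` as bytes of training state, which is an assumption about how lesson 2 uses those terms. The `12 * n_layers * d_model^2` estimate covers the attention and FFN weight matrices only, so embeddings and norm parameters are not counted.

```python
def approx_params(n_layers: int, d_model: int) -> int:
    """Approximate non-embedding parameter count: 12 * n_layers * d_model^2."""
    return 12 * n_layers * d_model ** 2


def training_flops(n_params: int, n_tokens: int) -> int:
    """Rule-of-thumb training compute: 6 * N * D (assumed reading of 6ND)."""
    return 6 * n_params * n_tokens


def training_memory_bytes(n_params: int) -> int:
    """Rule-of-thumb training state: 16 bytes per parameter (assumed reading of 16N)."""
    return 16 * n_params


# Hypothetical small configuration, chosen only to make the arithmetic concrete.
n_layers, d_model = 12, 768
n_tokens = 2_000_000_000

N = approx_params(n_layers, d_model)
print(f"parameters:      {N:,}")
print(f"training FLOPs:  {training_flops(N, n_tokens):.3e}")
print(f"training memory: {training_memory_bytes(N) / 2**30:.2f} GiB")
```

Because the parameter count grows with the square of `d_model` but only linearly with `n_layers`, doubling the width costs about four times as many parameters while doubling the depth costs about twice as many, which is the depth-versus-width trade-off the scaling-laws lesson revisits.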