References: Attention alternatives and mixture of experts
Source material
Source curriculum (structural mirror, cited as further study):
• Stanford CS336, "Language Modeling from Scratch", Lecture 4: Attention alternatives and mixture of experts
  Instructors: Tatsunori Hashimoto and Percy Liang (Stanford)
  Course page: https://cs336.stanford.edu/
  Lecture videos: YouTube playlist https://www.youtube.com/playlist?list=PLoROMvodv4rMqXOcazWaTUHhq-yembLCV
  License: no explicit license is published on the course site; lecture videos are on YouTube under standard terms; slides are public on GitHub without a stated license.
  Required attribution: "Based on the structure of Stanford CS336, 'Language Modeling from Scratch,' by Tatsunori Hashimoto and Percy Liang (cs336.stanford.edu). This is an independent structural mirror in original prose; it reproduces no course materials, and Stanford does not endorse it."

This lesson mirrors the structure of Lecture 4 (attention alternatives and mixture of experts). Clawdemy's lessons are original prose that follows the pedagogical arc of the course. Because the source publishes no explicit license, we cite it as a recommended companion and reproduce none of its materials. All rights to the original course materials remain with their creators.
Watch this next
- Stanford CS336, Lecture 4: Attention alternatives and mixture of experts by Hashimoto and Liang. The lecture this lesson mirrors. It covers the attention variants and MoE routing in more depth, including the load-balancing details this lesson only names.
Going deeper
A short, durable list. Each link is a specific next step, not a generic pile.
- “GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints” by Ainslie et al. (2023). The paper that introduced grouped-query attention, including how it interpolates between multi-head and multi-query and why the quality loss is small.
- “Mixtral of Experts” by Jiang et al. (2024). A clear, concrete sparse-mixture-of-experts model whose report makes the total-versus-active-parameters bargain explicit. A good worked example of MoE in a real released model.
- “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity” by Fedus, Zoph, and Shazeer (2021). A foundational MoE-at-scale paper, useful for the routing and load-balancing mechanics behind the idea.
Adjacent topics
Where this connects inside the track.
- Counting the cost (lesson 2). Both variations are best read through lesson 2: GQA/MQA target the KV cache (memory and bandwidth), and MoE separates total parameters (memory) from active parameters (the 6ND compute).
- The Transformer architecture (lesson 3). These are variations on lesson 3’s two sublayers (attention and FFN); the skeleton and residual stream are unchanged.
- Inference (lesson 8). The KV cache that GQA shrinks is introduced fully in the inference lesson, where serving a trained model fast is the whole topic.
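
The accounting behind the "Counting the cost" connection can be sketched in a few lines of Python. This is a rough illustration, not code from the course or from any of the papers listed here: the function names, the model dimensions, and the two-matrices-per-expert FFN are all assumptions chosen for easy arithmetic, and router parameters are ignored.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Approximate KV cache size: one K and one V tensor per layer.

    bytes_per_value=2 assumes 16-bit storage (an assumption, not a rule).
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


def moe_ffn_params(d_model, d_ff, n_experts, top_k, matrices_per_expert=2):
    """Total vs. active FFN parameters for one MoE layer.

    matrices_per_expert=2 assumes a plain up/down FFN; gated variants use 3.
    Router weights are left out to keep the sketch small.
    """
    per_expert = matrices_per_expert * d_model * d_ff
    total = n_experts * per_expert   # what must sit in memory
    active = top_k * per_expert      # what each token actually computes with
    return total, active


def training_flops(active_params, n_tokens):
    """The 6ND rule of thumb, with N read as active parameters."""
    return 6 * active_params * n_tokens


# Hypothetical model: 32 layers, 32 query heads of dimension 128, 4096-token context.
layers, q_heads, head_dim, seq_len = 32, 32, 128, 4096

mha = kv_cache_bytes(layers, n_kv_heads=q_heads, head_dim=head_dim, seq_len=seq_len)
gqa = kv_cache_bytes(layers, n_kv_heads=8, head_dim=head_dim, seq_len=seq_len)
mqa = kv_cache_bytes(layers, n_kv_heads=1, head_dim=head_dim, seq_len=seq_len)

for name, size in [("MHA", mha), ("GQA (8 KV heads)", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**20:.0f} MiB per sequence")

# Hypothetical MoE layer: 8 experts, 2 routed per token.
total, active = moe_ffn_params(d_model=4096, d_ff=14336, n_experts=8, top_k=2)
print(f"MoE FFN layer: {total:,} total params, {active:,} active per token")
print(f"Training FLOPs per token for this layer (6N): {training_flops(active, 1):,}")
```

The only thing that changes between the three attention variants is `n_kv_heads`, so the cache shrinks in direct proportion to it. For MoE, memory scales with `n_experts` while per-token compute scales with `top_k`.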