References: Attention alternatives and mixture of experts

Source material

Source curriculum (structural mirror, cited as further study):
• Stanford CS336, "Language Modeling from Scratch", Lecture 4:
    Attention alternatives and mixture of experts
  Instructors: Tatsunori Hashimoto and Percy Liang (Stanford)
  Course page: https://cs336.stanford.edu/
  Lecture videos: YouTube playlist
    https://www.youtube.com/playlist?list=PLoROMvodv4rMqXOcazWaTUHhq-yembLCV
  License: no explicit license is published on the course site; lecture
    videos are on YouTube under standard terms; slides are public on GitHub
    without a stated license.
  Required attribution: "Based on the structure of Stanford CS336,
    'Language Modeling from Scratch,' by Tatsunori Hashimoto and Percy Liang
    (cs336.stanford.edu). This is an independent structural mirror in
    original prose; it reproduces no course materials, and Stanford does
    not endorse it."
This lesson mirrors the structure of Lecture 4 (attention alternatives and
mixture of experts). Clawdemy's lessons are original prose that follows the
pedagogical arc of the course. Because the source publishes no explicit
license, we cite it as a recommended companion and reproduce none of its
materials. All rights to the original course materials remain with their
creators.

Watch this next

Stanford CS336, Lecture 4: Attention alternatives and mixture of experts by Hashimoto and Liang. The lecture this lesson mirrors. It covers the attention variants and MoE routing in more depth, including the load-balancing details this lesson only names.

Going deeper

A short, durable list. Each entry is a specific next step, not a generic pile.

“GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints” by Ainslie et al. (2023). The paper that introduced grouped-query attention, including how it interpolates between multi-head and multi-query attention and why the quality loss is small.
“Mixtral of Experts” by Jiang et al. (2024). A clear, concrete sparse-mixture-of-experts model whose report makes the total-versus-active-parameters bargain explicit. A good worked example of MoE in a real released model.
“Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity” by Fedus, Zoph, and Shazeer (2021). A foundational MoE-at-scale paper, useful for the routing and load-balancing mechanics behind the idea.

Adjacent topics

Where this connects inside the track.

Counting the cost (lesson 2). Both variations are best read through lesson 2’s cost accounting: GQA and MQA shrink the KV cache (memory and bandwidth), while MoE separates total parameters (which set memory) from active parameters (which set the N in the 6ND compute estimate).
The Transformer architecture (lesson 3). These are variations on lesson 3’s two sublayers (attention and FFN); the skeleton and residual stream are unchanged.
Inference (lesson 8). The KV cache that GQA shrinks is introduced fully in the inference lesson, where serving a trained model fast is the whole topic.
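
A small numeric sketch of the cost accounting named under "Counting the cost" above, for readers who want to see the two bargains as arithmetic. Everything here is an illustrative assumption rather than anything taken from the course or the cited papers: the function names, the toy configuration (12 layers, 16 query heads, head dimension 64, 2-byte elements), and the choice of three weight matrices per expert (a gated FFN). Router and attention parameters are left out of the MoE count.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_elem=2):
    """Bytes held by the KV cache for one model configuration.

    The leading factor 2 is one key vector plus one value vector per
    cached position, per layer, per KV head.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


def moe_ffn_params(n_layers, d_model, d_ff, n_experts, top_k,
                   mats_per_expert=3):
    """Return (total, active) FFN parameter counts for top-k routing.

    Assumption: each expert is a gated FFN with three d_model x d_ff
    matrices. Router and attention parameters are ignored.
    """
    per_expert = mats_per_expert * d_model * d_ff
    total = n_layers * n_experts * per_expert   # must sit in memory
    active = n_layers * top_k * per_expert      # touched per token
    return total, active


def train_flops(active_params, tokens):
    """The 6ND estimate, with N taken as the active parameter count."""
    return 6 * active_params * tokens


if __name__ == "__main__":
    # Hypothetical toy model: 16 query heads; only the KV head count varies.
    shape = dict(n_layers=12, head_dim=64, seq_len=2048)
    for name, kv_heads in [("MHA", 16), ("GQA", 4), ("MQA", 1)]:
        mib = kv_cache_bytes(n_kv_heads=kv_heads, **shape) / 2**20
        print(f"{name}: {kv_heads:2d} KV heads, cache = {mib:.0f} MiB")

    total, active = moe_ffn_params(n_layers=12, d_model=1024, d_ff=4096,
                                   n_experts=8, top_k=2)
    print(f"MoE FFN params: total = {total:,}, active = {active:,}")
    print(f"Training FLOPs for 1e9 tokens: {train_flops(active, 10**9):.3e}")
```

Under these assumptions the cache scales linearly with the number of KV heads, so moving from 16 to 4 KV heads cuts it to a quarter, and an 8-expert, top-2 layer stores four times as many FFN parameters as it uses per token.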