Attention alternatives and mixture of experts

This lesson closes Phase 1 with the two variations that make modern LLMs efficient, one for each sublayer of the lesson-3 skeleton. The source curriculum is Stanford CS336, Lecture 4, by Tatsunori Hashimoto and Percy Liang, with lectures freely available on YouTube and the course at cs336.stanford.edu.

You will learn standard attention’s two cost problems (quadratic in sequence length, and the KV cache that dominates inference); how multi-query and grouped-query attention shrink that cache and how sliding-window attention bounds long-context cost; what a mixture-of-experts layer adds; the distinction between total and active parameters and MoE’s costs (memory and load balancing); and how to read these choices off a model’s spec.
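The cache savings listed above can be made concrete with a back-of-the-envelope sketch. It assumes the cache stores one key and one value vector per KV head, per layer, per cached position, in 16-bit precision, and that a sliding window (if any) applies to every layer. The function name `kv_cache_bytes` and the model shape are hypothetical, chosen only for illustration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2, window=None):
    """Approximate KV-cache size in bytes (illustrative accounting only)."""
    # Sliding-window attention keeps at most `window` past positions.
    cached_positions = seq_len if window is None else min(seq_len, window)
    # Leading 2 = one key tensor plus one value tensor per layer.
    return (2 * n_layers * n_kv_heads * head_dim
            * cached_positions * batch_size * bytes_per_value)


# Hypothetical model shape: 32 layers, 32 query heads, head dimension 128.
N_LAYERS, N_QUERY_HEADS, HEAD_DIM, SEQ_LEN = 32, 32, 128, 4096

variants = {
    "MHA (32 KV heads)": kv_cache_bytes(N_LAYERS, N_QUERY_HEADS, HEAD_DIM, SEQ_LEN),
    "GQA (8 KV heads)": kv_cache_bytes(N_LAYERS, 8, HEAD_DIM, SEQ_LEN),
    "MQA (1 KV head)": kv_cache_bytes(N_LAYERS, 1, HEAD_DIM, SEQ_LEN),
    "GQA + window of 1024": kv_cache_bytes(N_LAYERS, 8, HEAD_DIM, SEQ_LEN, window=1024),
}

for name, size in variants.items():
    print(f"{name}: {size / 2**20:.0f} MiB")
```

The number of query heads does not appear in the formula: MQA and GQA shrink the cache only by reducing the number of KV heads, while a sliding window shrinks it by capping the number of cached positions.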

This is lesson 4 of 14, the last lesson of Phase 1 (the model). It varies lesson 3’s two sublayers (attention and FFN) without changing the skeleton, and it is best understood through lesson 2’s cost accounting (KV cache as memory, total vs active parameters as memory vs compute). After it, Phase 2 turns to the systems that make all of this run fast: hardware, kernels, parallelism, and inference, where the KV cache introduced here returns in full.

Prerequisites: lesson 3 (the skeleton and its attention and FFN sublayers, which this lesson varies) and lesson 2 (the cost accounting these variations are read through). Familiarity with multi-head attention (Track 5 or equivalent) helps, since MQA and GQA are modifications to it.

Math required: none. The variations are explained by what they do and which resource they save, with the reasoning grounded in lesson 2’s accounting (memory, bandwidth, compute). There are no new formulas; the one quantitative idea is the total-versus-active parameter split, which is arithmetic.
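As a rough illustration of that arithmetic, the sketch below counts the parameters of a single MoE feed-forward layer. It rests on several assumptions not stated in the lesson: each expert is a gated FFN with three weight matrices, there are no biases or shared experts, and the router is a single linear layer. The function name `moe_ffn_params` and the dimensions are hypothetical.

```python
def moe_ffn_params(d_model, d_ff, n_experts, top_k, n_matrices=3):
    """Return (total, active) parameter counts for one MoE FFN layer.

    Assumes a gated FFN expert with `n_matrices` weight matrices of
    shape d_model x d_ff, no biases, and a linear router.
    """
    per_expert = n_matrices * d_model * d_ff
    router = d_model * n_experts
    total = n_experts * per_expert + router   # must all sit in memory
    active = top_k * per_expert + router      # used to compute one token
    return total, active


# Hypothetical layer: 8 experts, 2 of them routed to per token.
total, active = moe_ffn_params(d_model=4096, d_ff=14336, n_experts=8, top_k=2)

print(f"total params per layer:  {total:,}")
print(f"active params per layer: {active:,}")
print(f"active fraction:         {active / total:.2f}")
```

Total parameters set the memory bill, since every expert must be stored; active parameters set the per-token compute bill, since only the routed experts run.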

The single capability this lesson builds: explain the main alternatives to standard attention and what a mixture-of-experts layer adds. Concretely, you will be able to:

  • Explain standard attention’s quadratic and KV-cache cost problems
  • Distinguish MQA, GQA, and sliding-window attention and what each saves
  • Explain what a mixture-of-experts layer adds
  • Distinguish total from active parameters and MoE’s costs
  • Read attention and MoE choices off a model’s spec

  • Read time: about 13 minutes
  • Practice time: about 10 minutes (interpret total-vs-active specs and a GQA cache calculation, plus flashcards)
  • Difficulty: deep (Stage C; conceptual, read through lesson 2’s resource accounting)