Attention alternatives and MoE: brief

What you’ll learn

This lesson closes Phase 1 with the two variations that make modern LLMs efficient, one for each sublayer of the lesson-3 skeleton. The source curriculum is Stanford CS336, Lecture 4, by Tatsunori Hashimoto and Percy Liang; the lectures are freely available on YouTube, and the course site is cs336.stanford.edu.

You will learn standard attention’s two cost problems (cost that grows quadratically with sequence length, and a KV cache that dominates inference memory); how multi-query and grouped-query attention shrink that cache, and how sliding-window attention bounds long-context cost; what a mixture-of-experts layer adds; how total parameters differ from active parameters, and what MoE costs in memory and load balancing; and how to read these choices off a model’s spec.
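As a preview of the cache comparison, here is a minimal sketch of the KV-cache arithmetic. The model dimensions, the 2-byte (fp16) value size, and the function name `kv_cache_bytes` are illustrative assumptions, not figures from the lecture. It treats sliding-window attention as keeping only the most recent `window` tokens in the cache.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_value=2, window=None):
    """Approximate bytes needed to cache keys and values for one model.

    Hypothetical helper: ignores implementation overhead such as padding
    or paged allocation.
    """
    # With a sliding window, only the last `window` tokens stay cached.
    cached_tokens = seq_len if window is None else min(seq_len, window)
    # Factor of 2: one tensor for keys and one for values, per layer.
    return (2 * n_layers * n_kv_heads * head_dim
            * cached_tokens * batch * bytes_per_value)


# Made-up model: 32 layers, 32 query heads, head_dim 128, 8192-token context.
N_LAYERS, N_QUERY_HEADS, HEAD_DIM, SEQ_LEN = 32, 32, 128, 8192

cache_sizes = {
    # Multi-head attention: one K/V head per query head.
    "MHA": kv_cache_bytes(N_LAYERS, N_QUERY_HEADS, HEAD_DIM, SEQ_LEN),
    # Grouped-query attention: 8 K/V heads shared across the 32 query heads.
    "GQA (8 KV heads)": kv_cache_bytes(N_LAYERS, 8, HEAD_DIM, SEQ_LEN),
    # Multi-query attention: a single K/V head shared by all query heads.
    "MQA": kv_cache_bytes(N_LAYERS, 1, HEAD_DIM, SEQ_LEN),
    # Sliding window on top of MHA: cache capped at 4096 tokens.
    "MHA + 4096 window": kv_cache_bytes(N_LAYERS, N_QUERY_HEADS, HEAD_DIM,
                                        SEQ_LEN, window=4096),
}

for name, size in cache_sizes.items():
    print(f"{name}: {size / 2**30:.3f} GiB")
```

With these made-up numbers the full multi-head cache is 4 GiB per sequence, grouped-query attention with 8 KV heads cuts it to 1 GiB, multi-query attention to 0.125 GiB, and a 4096-token window halves the multi-head figure to 2 GiB. The query-head count never enters the formula; only the number of key/value heads does, which is why MQA and GQA save cache without removing query heads.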

Where this fits

This is lesson 4 of 14, the last lesson of Phase 1 (the model). It varies lesson 3’s two sublayers (attention and FFN) without changing the skeleton, and it is best understood through lesson 2’s cost accounting (KV cache as memory, total vs active parameters as memory vs compute). After it, Phase 2 turns to the systems that make all of this run fast: hardware, kernels, parallelism, and inference, where the KV cache introduced here returns in full.

Before you start

Prerequisites: lesson 3 (the skeleton and its attention and FFN sublayers, which this lesson varies) and lesson 2 (the cost accounting these variations are read through). Familiarity with multi-head attention (Track 5 or equivalent) helps, since MQA and GQA are modifications to it.

About the math

None. The variations are explained by what they do and which resource they save, with the reasoning grounded in lesson 2’s accounting (memory, bandwidth, compute). No new formulas; the one quantitative idea is the total-versus-active parameter split, which is arithmetic.
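To show how small that arithmetic is, here is a minimal sketch of the total-versus-active split for the FFN sublayers of a hypothetical MoE model. All dimensions are made up, the function name `moe_ffn_params` is an illustration, and the sketch assumes a plain two-matrix FFN per expert with no biases, ignoring the router and every non-FFN parameter.

```python
def moe_ffn_params(d_model, d_ff, n_experts, top_k, n_layers):
    """Return (total, active) FFN parameter counts for a simple MoE stack.

    Assumptions (hypothetical, for illustration only): each expert is an
    up-projection plus a down-projection, no biases, and router weights
    and attention parameters are left out of the count.
    """
    per_expert = 2 * d_model * d_ff          # up + down projection
    total = n_layers * n_experts * per_expert  # must all sit in memory
    active = n_layers * top_k * per_expert     # used per token (compute)
    return total, active


# Made-up spec: 12 layers, 8 experts per layer, 2 experts routed per token.
total, active = moe_ffn_params(d_model=1024, d_ff=4096,
                               n_experts=8, top_k=2, n_layers=12)

print(f"total FFN params:  {total:,}")
print(f"active FFN params: {active:,}")
print(f"total / active:    {total / active:.1f}x")
```

Under these assumptions the total count is 805,306,368 and the active count is 201,326,592, a ratio of 4 (8 experts, 2 used per token). In lesson 2’s terms, the total sets the memory bill and the active count sets the per-token compute bill.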

By the end, you’ll be able to

The single capability this lesson builds: explain the main alternatives to standard attention and what a mixture-of-experts layer adds. Concretely, you will be able to:

Explain standard attention’s quadratic and KV-cache cost problems
Distinguish MQA, GQA, and sliding-window attention and what each saves
Explain what a mixture-of-experts layer adds
Distinguish total from active parameters and MoE’s costs
Read attention and MoE choices off a model’s spec

Time and difficulty

Read time: about 13 minutes
Practice time: about 10 minutes (interpret total-vs-active specs and a GQA cache calculation, plus flashcards)
Difficulty: deep (Stage C; conceptual, read through lesson 2’s resource accounting)