Attention efficiency, in brief

What you’ll learn

This is lesson 6 of Phase 2 (How models think: the transformer architecture) in Track 5 (AI Foundations). The previous lessons covered the architectural changes the field made after 2017: position embeddings (RoPE) and normalization (pre-norm and RMSNorm). This lesson covers the third area of change: attention efficiency. Course materials are at cme295.stanford.edu.

The lesson keeps two distinct problems cleanly separated. The first is the compute problem: standard self-attention costs O(n^2) in sequence length n, which becomes a real bottleneck at the long contexts modern LLMs run on. The fix is sliding window attention: each token attends only to a fixed-size local neighborhood, and the receptive field grows as layers stack, the same way it does in convolutional networks.

The second is the memory problem: the KV cache that makes decoding fast grows linearly with sequence length, layer depth, and the number of key/value heads, so it gets large quickly. The fix is sharing key and value projections across attention heads, in the MHA -> MQA -> GQA progression (with DeepSeek’s MLA as a fourth design point in 2026 production models).

The lecture treats the two as orthogonal efficiency moves, often combined in the same model. The lesson closes with how the 2026 context-window mainstream (1M-2M tokens as the norm, Llama 4 Scout’s 10M as the exception) depends on these techniques working together.
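To make the compute side concrete, here is a minimal counting sketch in Python. It assumes a causal sliding window in which each token attends to itself and at most window - 1 preceding tokens; the function names are hypothetical and chosen for illustration, not taken from the lecture or any library.

```python
# Illustrative sketch only: names and the causal-window convention are assumptions.

def full_attention_pairs(n: int) -> int:
    """Score entries computed by full (unmasked) self-attention over n tokens."""
    return n * n


def sliding_window_pairs(n: int, window: int) -> int:
    """Score entries when each token attends to itself and at most
    window - 1 preceding tokens (a causal sliding window)."""
    return sum(min(i + 1, window) for i in range(n))


def receptive_field(window: int, layers: int) -> int:
    """Upper bound on how many positions (including the token itself) can
    influence a token after stacking `layers` sliding-window layers."""
    return layers * (window - 1) + 1


if __name__ == "__main__":
    n, window = 8, 3
    print(full_attention_pairs(n))            # 64
    print(sliding_window_pairs(n, window))    # 21
    print(receptive_field(window, layers=4))  # 9
```

The full count grows quadratically in n, while the windowed count grows roughly as n times the window size. The receptive-field bound grows linearly with depth, which is the CNN-style stacking argument; in practice it is capped by the sequence length.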

Where this fits

This is lesson 6 of Phase 2, “How models think: the transformer architecture.” The previous lesson covered normalization (LayerNorm, pre-norm, RMSNorm). This lesson covers attention efficiency and closes the three-lesson “post-2017 changes that stuck” arc (position embeddings, normalization, attention efficiency). The next lesson, “How transformers turn input into output: encoder-decoder and T5’s span corruption,” opens the architectural-variants arc: what different kinds of transformer have been built and why.

Before you start

Prerequisites: the multi-head attention lesson and the transformer block lesson are required. We assume you understand what an attention head is, what the Q, K, and V projection matrices are, and what the attention computation looks like. If those terms feel unfamiliar, read the corresponding lessons first.

By the end, you’ll be able to

Explain why standard self-attention is O(n^2) in compute and where that becomes a real bottleneck
Describe sliding window attention and use the CNN analogy (the receptive field grows with depth) to explain how stacked local layers still see distant tokens
Distinguish the compute problem (attention complexity) from the memory problem (KV cache size during decoding) and explain why they need different fixes
Walk through the MHA -> MQA -> GQA progression and explain why most modern LLMs land on GQA (and where DeepSeek’s MLA fits in)
Put 2026 context-window claims in perspective (1M-2M is mainstream, 10M is the Llama 4 Scout exception, anything under 256K counts as short)
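
The KV-cache objective in this list can be previewed with a back-of-the-envelope calculation. The sketch below assumes the usual accounting (one key vector and one value vector per token, per layer, per key/value head) and a made-up configuration of 32 layers, 32 query heads, head dimension 128, an 8,192-token sequence, and 2-byte (fp16) values. The function name and the configuration are illustrative assumptions, not figures from the lecture or any specific model.

```python
# Illustrative sketch only: the configuration below is hypothetical.

def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2,
                   batch_size: int = 1) -> int:
    """Approximate KV-cache size in bytes for one decoding session.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return (2 * batch_size * num_layers * num_kv_heads
            * head_dim * seq_len * bytes_per_value)


if __name__ == "__main__":
    layers, head_dim, seq_len = 32, 128, 8192
    # Only the number of key/value heads changes across the three schemes;
    # the number of query heads (32 here) does not enter the cache size.
    schemes = {"MHA": 32, "GQA": 8, "MQA": 1}
    for name, kv_heads in schemes.items():
        size = kv_cache_bytes(layers, kv_heads, head_dim, seq_len)
        print(f"{name}: {size / 2**30:.3f} GiB")
    # MHA: 4.000 GiB
    # GQA: 1.000 GiB
    # MQA: 0.125 GiB
```

Under these assumptions the cache shrinks in direct proportion to the number of key/value heads, which is why GQA sits between MHA and MQA on memory. MLA is left out because it compresses the cache differently and does not fit this simple head-count formula.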

Time and difficulty

Read time: about 22 minutes
Practice time: about 15 minutes (a complexity-counting exercise on a small attention matrix, plus a comparison of KV-cache sizes across MHA, MQA, and GQA configurations)
Difficulty: standard