How transformers scale to real-world data: sliding windows and KV-cache savings
What you’ll learn
This is lesson 6 of Phase 2 (How models think: the transformer architecture) in Track 5 (AI Foundations). The previous lessons covered the architectural changes the field made after 2017: position embeddings (RoPE) and normalization (pre-norm and RMSNorm). This lesson covers the third area of change: attention efficiency. Course materials are at cme295.stanford.edu.
The lesson keeps two distinct problems cleanly separated. The first is the compute problem: standard self-attention costs O(n^2) in sequence length n, which becomes a real bottleneck at the long contexts modern LLMs run on. The fix is sliding window attention: each token attends only to its local neighborhood, and the receptive field grows as layers are stacked, the same way it does in convolutional networks. The second is the memory problem: the KV cache that makes decoding fast grows quickly with sequence length, head count, and layer depth. The fix is sharing key and value projections across attention heads, in the MHA -> MQA -> GQA progression (with DeepSeek’s MLA as a fourth point in 2026 production). The lecture treats the two as orthogonal efficiency moves that are often combined in the same model. The lesson closes with how the 2026 context-window mainstream (1M-2M tokens as the norm, Llama 4 Scout’s 10M as the exception) depends on both techniques working together.
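The compute side of this argument can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the lecture: the function names are hypothetical, and it assumes causal attention in which a window of size w covers the current token plus its w - 1 predecessors (other windowing conventions exist).

```python
def sliding_window_mask(n: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when query i may attend to key j.

    Assumed convention: causal, and the window includes the current token,
    so query i sees keys i - window + 1 .. i.
    """
    return [[0 <= i - j < window for j in range(n)] for i in range(n)]


def scored_pairs(mask: list[list[bool]]) -> int:
    """Count the (query, key) pairs that actually get an attention score."""
    return sum(sum(row) for row in mask)


def receptive_field(window: int, layers: int) -> int:
    """Tokens (including the current one) whose information can reach a
    position after stacking `layers` sliding-window layers."""
    return layers * (window - 1) + 1


if __name__ == "__main__":
    n, window = 8, 3
    full = scored_pairs(sliding_window_mask(n, n))        # window = n is full causal attention
    local = scored_pairs(sliding_window_mask(n, window))
    print(full)                         # 36: grows like n^2 / 2
    print(local)                        # 21: grows like n * window
    print(receptive_field(window, 4))   # 9: four local layers already span all 8 tokens
```

With a fixed window the number of scored pairs grows linearly in n rather than quadratically, while the reach of each position still grows by window - 1 tokens per stacked layer.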
Where this fits
This is lesson 6 of Phase 2, How models think: the transformer architecture. The previous lesson covered normalization (LayerNorm, pre-norm, RMSNorm). This lesson covers attention efficiency, closing the three-lesson “post-2017 changes that stuck” arc (position embeddings, normalization, attention efficiency). The next lesson, How transformers turn input into output: encoder-decoder and T5’s span corruption, opens the architectural-variants arc: what different kinds of transformer have been built and why.
Before you start
Prerequisites: the multi-head attention lesson and the transformer block lesson are required. We assume you understand what an attention head is, what Q, K, V projection matrices are, and what the attention computation looks like. If those terms feel unfamiliar, read the corresponding lesson first.
By the end, you’ll be able to
- Explain why standard self-attention is O(n^2) in compute and where that becomes a real bottleneck
- Describe sliding window attention and the receptive-field-grows-with-depth analogy from CNNs that explains how local layers still see distant tokens through stacking
- Distinguish the compute problem (attention complexity) from the memory problem (KV cache size during decoding); explain why they need different fixes
- Walk through the MHA -> MQA -> GQA progression and explain why most modern LLMs land on GQA (and where DeepSeek’s MLA fits in)
- Recognize 2026 context-window claims correctly (1M-2M is mainstream, 10M is the Llama 4 Scout exception, sub-256K is short)
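To make the memory objective concrete, here is a rough sketch of how KV-cache size changes across MHA, GQA, and MQA. The function name and the configuration (32 layers, 32 query heads, head dimension 128, 8192 tokens, 2 bytes per value) are illustrative assumptions, not the settings of any particular model, and the count ignores implementation overheads such as padding or paged allocation.

```python
def kv_cache_bytes(
    seq_len: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,   # assumed 16-bit storage (fp16 / bf16)
    batch_size: int = 1,
) -> int:
    """Approximate KV-cache size: one key and one value vector per token,
    per layer, per KV head."""
    keys_and_values = 2
    return (keys_and_values * batch_size * seq_len * n_layers
            * n_kv_heads * head_dim * bytes_per_value)


if __name__ == "__main__":
    # Hypothetical model shape, chosen only to keep the arithmetic easy to follow.
    shape = dict(seq_len=8192, n_layers=32, head_dim=128)
    n_query_heads = 32

    mha = kv_cache_bytes(n_kv_heads=n_query_heads, **shape)  # every head keeps its own K/V
    gqa = kv_cache_bytes(n_kv_heads=8, **shape)              # 4 query heads share each K/V head
    mqa = kv_cache_bytes(n_kv_heads=1, **shape)              # all query heads share one K/V head

    print(mha // 2**20)  # 4096 (MiB)
    print(gqa // 2**20)  # 1024 (MiB)
    print(mqa // 2**20)  # 128 (MiB)
```

The cache scales with the number of KV heads, not the number of query heads, which is why sharing K/V projections shrinks memory without changing how many query heads the model has.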
Time and difficulty
- Read time: about 22 minutes
- Practice time: about 15 minutes (a complexity-counting exercise on a small attention matrix, plus a comparison of KV-cache sizes across MHA, MQA, and GQA configurations)
- Difficulty: standard