Cheatsheet: How transformers scale to real-world data: sliding windows and KV-cache savings

Self-attention has two costs.
Compute → O(n^2) attention matrix → Sliding window attention
Memory → KV cache grows with n*H*L → MHA → MQA → GQA progression
Two problems, two fixes, often combined.
| | Problem 1: Compute | Problem 2: Memory |
|---|---|---|
| Where it bites | Training and prefill | Decoding (autoregressive generation) |
| What scales badly | Attention matrix: n^2 entries | KV cache: n * H * L vectors per K, same per V |
| Fix family | Sliding window attention | Sharing K and V across heads (MHA, MQA, GQA) |
| Lecturer’s emphasis | Spends time on this; gives Mistral example | Brief; defers full coverage to the next Stanford lecture |

Fix 1: Sliding window attention

| Property | Detail |
|---|---|
| First widely-cited paper | Longformer (2020) |
| Mechanism | Each token attends only to a local neighborhood (a window of nearby tokens) instead of the full sequence |
| Window size in production | Several thousand tokens (small in slide illustrations) |
| Implementation | Tiling-based; never materializes the full n * n matrix |
| Compute scaling | O(n * w) instead of O(n^2), where w is the window size |
| Receptive field | Grows with layer stacking, just like in convolutional neural networks |
| Layer mixing | Some architectures interleave local and global attention layers |
| Lecture’s example | Mistral uses sliding window attention at every layer |
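
To make the O(n * w) versus O(n^2) comparison concrete, here is a minimal sketch that counts how many (query, key) pairs get scored. It assumes a causal window in which token i attends to itself and the w - 1 tokens before it; the function names and the example sizes are illustrative, not taken from the lecture.

```python
from typing import List


def sliding_window_pairs(n: int, w: int) -> int:
    """Count (query, key) pairs scored under a causal window of size w."""
    # Assumption: token i attends to positions max(0, i - w + 1) .. i, inclusive.
    return sum(min(i + 1, w) for i in range(n))


def full_causal_pairs(n: int) -> int:
    """Count (query, key) pairs scored under full causal attention."""
    return n * (n + 1) // 2


def window_mask(n: int, w: int) -> List[List[int]]:
    """Small 0/1 mask for illustration; 1 means query i may attend to key j."""
    return [[1 if 0 <= i - j < w else 0 for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    # A tiny mask, as in slide illustrations.
    for row in window_mask(6, 3):
        print(row)

    # Hypothetical sizes: a long sequence with a window of a few thousand tokens.
    n, w = 32_768, 4_096
    print("windowed pairs:", sliding_window_pairs(n, w))
    print("full pairs:    ", full_causal_pairs(n))
```

The windowed count is bounded by n * w, so it grows linearly in n once n exceeds w. A tiled implementation would compute only the blocks that intersect this band rather than building the mask explicitly.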

Fix 2: KV-cache-friendly attention head sharing

Standard MHA: each head has its own (K, V) projections
cache = H * n * L * d_k values per K (and the same per V)
MQA: all H heads share one (K, V) projection
cache = n * L * d_k values per K (factor of H smaller)
GQA: G groups of H/G heads each share (K, V)
cache = G * n * L * d_k values per K (factor of H/G smaller)
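
The formulas above can be turned into a quick back-of-the-envelope calculation. This is a sketch under stated assumptions: the configuration (H = 32, L = 32, d_k = 128, n = 8192, G = 8) and the 2 bytes per value are hypothetical numbers chosen for illustration, not figures from the lecture.

```python
def kv_cache_values(n: int, num_layers: int, num_kv_heads: int, d_k: int) -> int:
    """Total scalar values held in the KV cache (K and V together)."""
    per_tensor = n * num_layers * num_kv_heads * d_k  # values for K alone
    return 2 * per_tensor  # same again for V


# Hypothetical configuration, chosen only to make the ratios easy to read.
H, L, D_K, N = 32, 32, 128, 8192
G = 8
BYTES_PER_VALUE = 2  # assumption: 16-bit storage


def gib(values: int) -> float:
    """Convert a count of cached values to GiB under the assumed precision."""
    return values * BYTES_PER_VALUE / 2**30


if __name__ == "__main__":
    for name, kv_heads in [("MHA", H), ("GQA", G), ("MQA", 1)]:
        size = kv_cache_values(N, L, kv_heads, D_K)
        print(f"{name}: {kv_heads:2d} KV heads -> {gib(size):.3f} GiB per sequence")
```

Only the number of K/V heads changes between the three variants, which is why the savings are exactly H for MQA and H/G for GQA.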
| Variant | K projections per layer | Cache size relative to MHA | Lecturer’s framing |
|---|---|---|---|
| MHA (multi-head) | H | 1× (baseline) | The original transformer |
| MQA (multi-query) | 1 | 1/H | Most aggressive savings; can cost quality |
| GQA (group-query) | G (much smaller than H) | G/H | “Typically what you would see” in modern LLMs, with a hedge |
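
One way to see that MHA, MQA, and GQA are the same idea with a different group count is to map each query head to the K/V projection it reads. The sketch below assumes contiguous grouping (heads 0..H/G-1 share group 0, and so on); that layout and the function name are illustrative assumptions, and real architectures may arrange heads differently.

```python
from typing import List


def kv_group_for_head(head: int, num_heads: int, num_groups: int) -> int:
    """Index of the shared (K, V) projection used by a given query head."""
    if num_heads % num_groups != 0:
        raise ValueError("num_heads must be divisible by num_groups")
    heads_per_group = num_heads // num_groups
    # Assumption: heads are grouped contiguously.
    return head // heads_per_group


def head_to_group_map(num_heads: int, num_groups: int) -> List[int]:
    """K/V group index for every query head."""
    return [kv_group_for_head(h, num_heads, num_groups) for h in range(num_heads)]


if __name__ == "__main__":
    H = 8
    print("MHA (G = H):", head_to_group_map(H, H))
    print("GQA (G = 2):", head_to_group_map(H, 2))
    print("MQA (G = 1):", head_to_group_map(H, 1))
```

Every query head keeps its own Q projection in all three cases; only the K/V index it points to changes.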
Why K and V are shared while Q stays per-head:

| Reason | Detail |
|---|---|
| Diversity preservation | Queries ask “what am I looking for”; different heads asking different questions is valuable. Keys and values are what the model is looking at; diversity loss from sharing them is smaller. |
| Practical motivation | The KV cache (not the Q projections) is what gets large at inference. Sharing K and V is where the memory savings actually land. |

| Phrase | What it means |
|---|---|
| Sliding window attention | Each token attends to a local window; layers may interleave local and global; cheaper compute |
| GQA or Group-query attention | K and V shared within groups of heads; smaller KV cache; modern default per the lecturer |
| MQA or Multi-query attention | All heads share K and V; aggressive savings; can cost quality |
| Standard MHA or Multi-head attention | Original transformer; every head has its own K and V; biggest cache |
| “128K context” or “long context” | Almost always pairs with one or both efficiency tricks above |

| Pitfall | Reality |
|---|---|
| Sliding window and MQA/GQA are the same efficiency story | No. Sliding window is compute; MQA/GQA is memory. Different problems, can be combined. |
| Sliding window means the model can never see beyond the window | Stacking expands the receptive field beyond a single window, just like in CNNs. |
| MQA is always better because it’s cheaper | MQA can cost some quality. GQA is the modern compromise. |
| These tricks affect training and inference equally | The KV cache is a decode-time concept, so MQA/GQA primarily improve inference memory and latency, even though the model is trained with the same shared K/V heads. Sliding window attention, by contrast, also reduces compute in training and prefill. |
| Vendors invented these | They are research-driven (Longformer, multi-query attention, group-query attention papers) and adopted across many architectures. |
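
The receptive-field pitfall above can be checked with a small simulation. This sketch assumes every layer uses the same causal window of size w (the token itself plus the w - 1 tokens before it) and tracks which input positions can influence a given token after several layers; the function name is illustrative.

```python
def receptive_field(num_layers: int, w: int, position: int) -> int:
    """Number of input positions that can influence `position` after stacking."""
    reach = {position}
    for _ in range(num_layers):
        # Each reachable token i reads positions max(0, i - w + 1) .. i
        # from the layer below, so the reachable set widens by w - 1.
        reach = {j for i in reach for j in range(max(0, i - w + 1), i + 1)}
    return len(reach)


if __name__ == "__main__":
    w, position = 4, 100
    for layers in (1, 2, 3, 8):
        seen = receptive_field(layers, w, position)
        print(f"{layers} layer(s): token {position} has seen {seen} positions")
```

Under these assumptions the count follows min(position + 1, k * (w - 1) + 1) for k layers, so depth extends reach linearly even though each individual layer stays local.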
  • Self-attention complexity: O(n^2) in sequence length, because the attention matrix has n^2 entries.
  • Sliding window attention: restricting each token’s attention to a local neighborhood of nearby tokens.
  • Tiling: an implementation pattern that computes only the entries inside the attention window, never materializing the full n * n matrix.
  • Receptive field (in this context): the set of tokens a token has effectively seen through the chain of layer-by-layer attention. Grows with stacking even when each layer is local.
  • KV cache: decode-time storage of K and V vectors from previous tokens, so they do not have to be recomputed at every generation step.
  • MHA (multi-head attention): every attention head has its own Q, K, V projections; the original transformer.
  • MQA (multi-query attention): all heads share one K and one V projection; aggressive memory savings.
  • GQA (group-query attention, also written “grouped-query attention”): heads are split into G groups, and each group shares one K and one V projection; the modern compromise.
  • G (group count in GQA): typically much smaller than H (the head count); exact value depends on the architecture.
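
To illustrate the KV cache entry in the list above, here is a toy single-head decode loop written two ways: one recomputes K and V for every earlier token at each step, the other appends to a cache. It is a minimal sketch with hypothetical helper names and fixed toy inputs, not a real model; in an actual decoder each new input would depend on the previously generated token.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def project(x: Vector, w: Matrix) -> Vector:
    """Matrix-vector product; each row of w produces one output component."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def attend(q: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Scaled dot-product attention for one query over the given keys/values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(exps)
    return [sum(e / total * v[c] for e, v in zip(exps, values))
            for c in range(len(values[0]))]


def decode_with_cache(xs, wq, wk, wv):
    k_cache, v_cache, outputs = [], [], []
    for x in xs:
        k_cache.append(project(x, wk))  # K and V computed once per token
        v_cache.append(project(x, wv))
        outputs.append(attend(project(x, wq), k_cache, v_cache))
    return outputs


def decode_without_cache(xs, wq, wk, wv):
    outputs = []
    for t in range(1, len(xs) + 1):
        keys = [project(x, wk) for x in xs[:t]]    # recomputed at every step
        values = [project(x, wv) for x in xs[:t]]
        outputs.append(attend(project(xs[t - 1], wq), keys, values))
    return outputs


if __name__ == "__main__":
    xs = [[1.0, 0.0], [0.5, -1.0], [2.0, 1.5]]
    wq = [[0.2, -0.1], [0.4, 0.3]]
    wk = [[0.5, 0.1], [-0.3, 0.8]]
    wv = [[1.0, 0.0], [0.0, 1.0]]
    print(decode_with_cache(xs, wq, wk, wv))
```

Both loops produce the same outputs; the cached version trades memory (one K and one V vector per past token) for not repeating the projections, which is the storage that MQA and GQA shrink.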

Sliding window attention is about compute.
MHA, MQA, and GQA are about memory.
Two problems, two fixes, often combined.