Multi-head attention: many lenses on the same sentence
What you’ll learn
This is lesson 2 of Phase 2 (How models think: the transformer architecture) in Track 5 (AI Foundations). The full Stanford CME 295 course materials are at cme295.stanford.edu. The previous lesson covered single-head attention: the query, key, and value mechanism that lets the model work out which words relate to which. But a single head produces only one set of attention weights per token, so it has to choose which structure to track in a sentence that has several running through it at once.
Real transformers run many heads in parallel (typically 8 to 32 per layer), each with its own Q, K, V projections and its own perspective on the input. This lesson builds the intuition for why one head isn’t enough (returning to the animal-street-it example) and walks through the split-run-concatenate pattern that makes multi-head attention work: project into h lower-dimensional subspaces, run h attentions, concatenate, and project through W_O. It then traces the dimension flow on a 12-head, 768-dimension example, and closes with how to read head counts on model cards and the production variants in use in 2026 beyond vanilla MHA (MQA, GQA, and DeepSeek’s MLA).
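As a preview, the split-run-concatenate pattern can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `multi_head_attention`, the random weights, and the 5-token input are all hypothetical, and the sketch omits masking, biases, and batching. The 12-head, 768-dimension sizes match the example traced in this lesson.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads):
    """Hypothetical helper: unmasked multi-head self-attention for one sequence."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads  # 768 // 12 = 64 in the example below

    # Project: one big matrix per role, each (seq_len, d_model).
    q, k, v = x @ W_q, x @ W_k, x @ W_v

    # Split: carve the last axis into heads -> (n_heads, seq_len, d_head).
    def split(t):
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    q, k, v = split(q), split(k), split(v)

    # Run: scaled dot-product attention independently in every head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, seq_len, seq_len)
    weights = softmax(scores, axis=-1)
    per_head = weights @ v                                # (n_heads, seq_len, d_head)

    # Concatenate: stitch the heads back together -> (seq_len, d_model).
    concat = per_head.transpose(1, 0, 2).reshape(seq_len, d_model)

    # Final output projection through W_O.
    return concat @ W_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 768, 12  # 5 tokens is an arbitrary choice
x = rng.standard_normal((seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(4))

out, weights = multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads)
print(out.shape)      # (5, 768)
print(weights.shape)  # (12, 5, 5)
```

The output has the same shape as the input, which is what lets the block be stacked layer after layer, while `weights` holds one separate attention pattern per head.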
Where this fits
This is lesson 2 of Phase 2, How models think: the transformer architecture. The previous lesson introduced the single-head attention mechanism. This lesson extends it to many heads running in parallel. The next lesson is The transformer block, which wraps multi-head attention with the remaining pieces (feed-forward network, residual connections, normalization) to complete one full layer of a real transformer.
Before you start
Prerequisites: the attention lesson is required. We build directly on query, key, and value vectors; if those terms feel unfamiliar, read the attention lesson first. The Phase 1 lessons (tokens, embeddings) give you the full picture of what flows into attention, but the attention lesson’s recap is sufficient if you are starting from Phase 2.
By the end, you’ll be able to
- Explain why a single attention head is structurally limited at tracking multiple kinds of context at once
- Describe the split-run-concatenate pattern that turns one big projection into h smaller parallel attention computations
- Walk through the dimension flow in a multi-head computation, from input embedding to per-head outputs to the final concatenated result
- Recognize multi-head configurations in real model specifications (e.g., “12 heads at 64 dim each, total 768”)
- Identify the 2026 production attention variants (MQA, GQA, MLA) and what each one trades off relative to vanilla multi-head attention
Time and difficulty
- Read time: about 22 minutes
- Practice time: about 15 minutes (dimension arithmetic on paper, plus a quick inspection of a real model’s head count)
- Difficulty: standard