
References: Multi-head attention: many lenses on the same sentence

Source material:
• Stanford CME 295: Transformers & Large Language Models, Autumn 2025, Lecture 1
Instructors: Afshine Amidi & Shervine Amidi, Stanford University
YouTube: https://www.youtube.com/watch?v=Ub3GoFaUcds
Course site: https://cme295.stanford.edu/
Cheatsheet: https://cme295.stanford.edu/cheatsheet/
License (lecture video): as published on Stanford's public YouTube channel
License (Amidi cheatsheets): MIT
Clawdemy provides original notes, summaries, and quizzes derived from this material
for educational purposes. All rights to the original lecture remain with Stanford and
the instructors.

A short list, chosen for durability. Each link is a specific next step, not a generic “learn more.”

  • The Illustrated Transformer by Jay Alammar. The single best visual treatment of multi-head attention on the public web. Where this lesson keeps the math minimal, Alammar shows the actual matrix multiplications side by side, and the “Why are heads important?” section spells out the same intuition with diagrams. Read it after this lesson if you want to see every matrix shape drawn out.

  • The Annotated Transformer from Harvard NLP. Section “Multi-head attention” reproduces the multi-head implementation in PyTorch alongside the original paper’s text. The clearest path from the formula in this lesson to working code you can read line by line.

  • BertViz by Jesse Vig. An interactive tool that visualizes attention weights per head in real transformer models. Open it on a sentence and click through individual heads. Some will attend to recognizable patterns; most will not. Useful as the empirical companion to the lesson’s hedged claim about head interpretability.

  • 3Blue1Brown’s transformer series. Grant Sanderson’s animations cover attention and multi-head attention with the rigor you would expect; episode 6 in particular shows multiple heads attending to different patterns in the same sentence. Watch after reading this lesson for the geometric intuition.

  • Stanford CME 295 cheatsheet by the Amidi twins. MIT-licensed. Useful as a single-page recall sheet right before the next lesson on the full transformer block; the architecture diagram puts multi-head in context with the rest of the layer.

Topics that build on or sit beside this one.

  • The full transformer block. The next lesson in this Stanford-adapted course. Once multi-head attention is in your head, the full block (attention + feed-forward + residual connections + layer normalization + position information) is what wraps around it to make a complete transformer layer.

  • Head interpretability research. Whether individual heads correspond to human-readable functions (syntax, coreference, etc.) is an active and unresolved question. The canonical entry point is “Are Sixteen Heads Really Better than One?” (Michel et al. 2019), the paper that showed a large fraction of heads can be removed at test time without much performance loss. Read it for the empirical case behind this lesson’s hedged stance on head interpretability.

  • Inference-cost optimizations: MQA and GQA. When you read about a recent open-source model architecture and the model card lists num_key_value_heads smaller than num_attention_heads, that is MQA or GQA in action. The two foundational papers below are the reference.
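
To make the model-card reading above concrete, here is a minimal sketch that names the attention variant implied by the two head counts. The field names num_attention_heads and num_key_value_heads come from the text; the helper functions and the example values are hypothetical and not taken from any specific model.

```python
def classify_attention(num_attention_heads: int, num_key_value_heads: int) -> str:
    """Name the attention variant implied by the two head counts."""
    if num_key_value_heads <= 0 or num_attention_heads % num_key_value_heads != 0:
        raise ValueError("query heads must split evenly across key/value heads")
    if num_key_value_heads == num_attention_heads:
        return "MHA"  # every query head has its own K and V
    if num_key_value_heads == 1:
        return "MQA"  # all query heads share one K and one V
    return "GQA"      # groups of query heads share K and V


def kv_cache_ratio(num_attention_heads: int, num_key_value_heads: int) -> float:
    """Size of the cached K and V relative to full multi-head attention,
    assuming head dimension, layer count, and sequence length are unchanged."""
    return num_key_value_heads / num_attention_heads


# Hypothetical values for illustration only.
config = {"num_attention_heads": 32, "num_key_value_heads": 8}
print(classify_attention(**config), kv_cache_ratio(**config))
```

The ratio is only a rough guide to the inference saving: it counts cached keys and values and ignores everything else that affects latency.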

The primary sources for the mechanism this lesson covers.

  • “Attention Is All You Need”, Vaswani et al., NeurIPS 2017. The transformer paper. Section 3.2.2 (“Multi-Head Attention”) is the formal definition of the mechanism this lesson worked through; the MultiHead definition in that section is the one we wrote out as MultiHead(X) = Concat(head_1, ..., head_h) · W_O.

  • “Fast Transformer Decoding: One Write-Head is All You Need”, Shazeer 2019. The multi-query attention (MQA) paper. Argues that you can share K and V across all heads (only Q stays per-head) and recover most of the performance at much lower inference cost. The first major rethinking of multi-head since the 2017 paper.

  • “GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints”, Ainslie et al., 2023. The grouped-query attention (GQA) paper. Sits between full multi-head and MQA: K and V are shared across groups of heads rather than across all heads, recovering more performance than MQA at slightly higher cost. The variant many recent open-source models use.
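
The three papers above describe one mechanism with a single dial: how many K/V heads the query heads share. The NumPy sketch below is an illustration under simplifying assumptions (one sequence, no masking, no batching, no position information), not the implementation from any of the papers. The function name grouped_attention and all shapes are hypothetical; the last line is the Concat(head_1, ..., head_h) · W_O step from the formula.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def grouped_attention(X, W_Q, W_K, W_V, W_O, num_heads, num_kv_heads):
    """Self-attention over X of shape (seq_len, d_model).

    num_kv_heads == num_heads  -> standard multi-head attention
    num_kv_heads == 1          -> multi-query attention (MQA)
    anything in between        -> grouped-query attention (GQA)
    """
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    group_size = num_heads // num_kv_heads  # query heads per shared K/V head

    # Project, then split the last axis into heads.
    Q = (X @ W_Q).reshape(seq_len, num_heads, d_head)
    K = (X @ W_K).reshape(seq_len, num_kv_heads, d_head)
    V = (X @ W_V).reshape(seq_len, num_kv_heads, d_head)

    heads = []
    for h in range(num_heads):
        g = h // group_size  # index of the K/V head this query head uses
        scores = Q[:, h] @ K[:, g].T / np.sqrt(d_head)
        weights = softmax(scores, axis=-1)  # (seq_len, seq_len), rows sum to 1
        heads.append(weights @ V[:, g])     # (seq_len, d_head)

    # Concat(head_1, ..., head_h) · W_O
    return np.concatenate(heads, axis=-1) @ W_O


# Usage with random weights; only the K/V projection width changes.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 16, 4
d_head = d_model // num_heads
X = rng.normal(size=(seq_len, d_model))
W_Q = rng.normal(size=(d_model, d_model))
W_O = rng.normal(size=(d_model, d_model))
for num_kv_heads in (4, 2, 1):  # MHA, GQA, MQA
    W_K = rng.normal(size=(d_model, num_kv_heads * d_head))
    W_V = rng.normal(size=(d_model, num_kv_heads * d_head))
    out = grouped_attention(X, W_Q, W_K, W_V, W_O, num_heads, num_kv_heads)
    print(num_kv_heads, W_K.shape, out.shape)
```

The output shape is the same in all three cases; what shrinks is the width of the K and V projections, which is where the inference-cost saving comes from.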

None selected for this lesson. The public discussion of multi-head attention has consolidated into the Alammar visual post, the Annotated Transformer, and the academic papers above; the marginal Reddit or Hacker News thread does not add durable value over those. If a canonical thread surfaces, it will be added at the next quarterly review.