Multi-head attention, in brief

What you’ll learn

This is lesson 2 of Phase 2 (How models think: the transformer architecture) in Track 5 (AI Foundations). The full Stanford CME 295 course materials are at cme295.stanford.edu. The previous lesson covered single-head attention: the query, key, and value mechanism that lets the model work out which words relate to which. But a single head produces only one set of attention weights per token, so it has to pick one kind of relationship to track in a sentence that has several running through it at once.

Real transformers run many heads in parallel (typically 8 to 32 per layer), each with its own Q, K, V projections and its own perspective on the input. The lesson builds the one-head-isn’t-enough intuition (returning to the animal-street-it example), walks through the split-run-concatenate pattern that makes multi-head attention work (project the input into h lower-dimensional subspaces, run attention separately in each of the h heads, concatenate the results, project through W_O), traces the dimension flow on a 12-head, 768-dim example, and closes with what model cards mean by head counts and the 2026 production variants beyond vanilla MHA (MQA, GQA, and DeepSeek’s MLA).
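
As a preview, the split-run-concatenate pattern and its dimension flow can be sketched in a few lines of NumPy. This is a minimal illustration, not the lesson's reference implementation: the function name multi_head_attention, the random weights, the 5-token sequence length, and the 0.02 weight scale are all assumptions made for the example. It also assumes standard scaled dot-product attention (dividing scores by the square root of the per-head dimension) and omits masking, biases, and batching.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads):
    """Hypothetical helper: unmasked multi-head self-attention for one sequence."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads  # 768 // 12 = 64 in the example below

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    # Split: one big projection each for Q, K, V, then carve into h heads.
    q = split_heads(x @ W_q)
    k = split_heads(x @ W_k)
    v = split_heads(x @ W_v)

    # Run: each head computes its own attention weights independently.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (h, seq, seq)
    weights = softmax(scores, axis=-1)
    per_head = weights @ v                               # (h, seq, d_head)

    # Concatenate: stitch the heads back to d_model, then project through W_O.
    concat = per_head.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 768, 12  # seq_len is an arbitrary choice
x = rng.standard_normal((seq_len, d_model))
W_q, W_k, W_v, W_o = (
    rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(4)
)

out, weights = multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads)
print(out.shape)      # (5, 768)
print(weights.shape)  # (12, 5, 5)
```

The output has the same shape as the input, which is what lets the layer be stacked, while the weights array holds twelve separate 5-by-5 attention patterns, one per head.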

Where this fits

This is lesson 2 of Phase 2, How models think: the transformer architecture. The previous lesson introduced the single-head attention mechanism. This lesson extends it to many heads running in parallel. The next lesson is The transformer block, which wraps multi-head attention with the remaining pieces (feed-forward network, residual connections, normalization) to complete one full layer of a real transformer.

Before you start

Prerequisites: the attention lesson is required. We build directly on query, key, and value vectors; if those terms feel unfamiliar, read the attention lesson first. The Phase 1 lessons (tokens, embeddings) give you the full picture of what flows into attention, but the attention lesson’s recap is sufficient if you are starting from Phase 2.

By the end, you’ll be able to

Explain why a single attention head is structurally limited in tracking multiple kinds of context at once
Describe the split-run-concatenate pattern that turns one big projection into h smaller parallel attention computations
Walk through the dimension flow in a multi-head computation, from input embedding to per-head outputs to the final concatenated result
Recognize multi-head configurations in real model specifications (e.g., “12 heads at 64 dim each, total 768”)
Identify the 2026 production attention variants (MQA, GQA, MLA) and what each one trades off relative to vanilla multi-head attention
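
To make the last two objectives concrete, here is a rough sketch of the arithmetic behind a spec like “12 heads at 64 dim each, total 768” and of how sharing key/value heads shrinks the per-token cache. The helper name kv_cache_values_per_token and the choice of 4 KV heads for GQA are hypothetical, picked only for illustration. MLA is left out because it caches a compressed latent rather than per-head keys and values, so this simple count does not apply to it.

```python
def kv_cache_values_per_token(num_kv_heads, d_head):
    # Hypothetical helper: one key vector and one value vector
    # are cached per KV head, per token, per layer.
    return 2 * num_kv_heads * d_head

num_query_heads, d_head = 12, 64
d_model = num_query_heads * d_head
print(d_model)  # 768

# Number of key/value heads under each variant (query heads stay at 12).
variants = {
    "MHA": 12,  # every query head has its own K/V head
    "GQA": 4,   # assumed grouping: 3 query heads share each K/V head
    "MQA": 1,   # all query heads share a single K/V head
}

cache = {
    name: kv_cache_values_per_token(n_kv, d_head)
    for name, n_kv in variants.items()
}
for name, size in cache.items():
    print(name, size)
# MHA 1536
# GQA 512
# MQA 128
```

The numbers count cached values per token per layer; actual memory use also depends on the number of layers, the sequence length, and the numeric precision.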

Time and difficulty

Read time: about 22 minutes
Practice time: about 15 minutes (dimension arithmetic on paper, plus a quick inspection of a real model’s head count)
Difficulty: standard