input → split into h heads → run h attentions → concat → W_O → output
(each head at d_k = d_model / h)
Real transformer attention is multi-head. h heads run in parallel; outputs concatenate and project. The single-head story from the attention lesson was the simplification.

| | One head | Many heads (h) |
|---|---|---|
| Weightings produced per token | 1 | h |
| Captures simultaneous structures (syntactic, coreference, positional, semantic) | One at a time | Up to h at once |
| Q, K, V dimension | d_model | d_k = d_model / h per head |
| Combined output | One vector | Concat of h vectors, then W_O |


| Stage | Shape |
|---|---|
| Input embedding | 768 |
| Per head Q, K, V | 64 each |
| Per head attention output | 64 |
| Concatenated (12 × 64) | 768 |
| After W_O | 768 |

Shape in equals shape out. That is what lets transformers stack many such layers.
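The shape bookkeeping above can be traced in a few lines of NumPy. This is only a sketch of the split and concat steps, using the 768 and 12 from the table. The sequence length of 5 and the random W_O are assumptions for illustration, and no attention is actually computed here.

```python
import numpy as np

d_model, h = 768, 12            # values from the table above
d_k = d_model // h              # 64 per head

seq_len = 5                     # assumed sequence length, illustration only
rng = np.random.default_rng(0)
x = rng.standard_normal((seq_len, d_model))             # (5, 768)

# Split the last axis into h heads of width d_k, heads first.
per_head = x.reshape(seq_len, h, d_k).transpose(1, 0, 2)          # (12, 5, 64)

# Stand-in for "each head produced an output of the same width":
# move heads back next to the feature axis and flatten them together.
concat = per_head.transpose(1, 0, 2).reshape(seq_len, h * d_k)    # (5, 768)

w_o = rng.standard_normal((d_model, d_model))           # hypothetical random W_O
out = concat @ w_o                                      # (5, 768)

print(per_head.shape, concat.shape, out.shape)
# (12, 5, 64) (5, 768) (5, 768)
```

Because `out` has the same shape as `x`, it could be fed straight into another layer of the same kind.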
head_i = Attention(X · W_Q^i, X · W_K^i, X · W_V^i)
MultiHead(X) = Concat(head_1, ..., head_h) · W_O
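A minimal NumPy sketch of these two formulas follows. It assumes `Attention` is the usual scaled dot-product form, softmax(Q Kᵀ / sqrt(d_k)) V, with no masking. The tiny dimensions, the random weights, and the helper names (`softmax`, `attention`, `multi_head`) are illustrative choices, not part of the lesson.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention for a single head (assumed form, no mask)."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)           # (seq, seq)
    return softmax(scores) @ v                # (seq, d_k)

def multi_head(x, w_q, w_k, w_v, w_o):
    """w_q, w_k, w_v: lists of h matrices, each (d_model, d_k); w_o: (h * d_k, d_model)."""
    heads = [
        attention(x @ wq, x @ wk, x @ wv)     # head_i
        for wq, wk, wv in zip(w_q, w_k, w_v)
    ]
    return np.concatenate(heads, axis=-1) @ w_o   # Concat(head_1..head_h) · W_O

# Toy sizes chosen for illustration only.
rng = np.random.default_rng(0)
seq_len, d_model, h = 4, 16, 4
d_k = d_model // h

x = rng.standard_normal((seq_len, d_model))
w_q = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
w_k = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
w_v = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
w_o = rng.standard_normal((h * d_k, d_model))

out = multi_head(x, w_q, w_k, w_v, w_o)
print(out.shape)   # (4, 16)
```

The explicit Python loop over heads keeps the code close to the formula. Real implementations usually fuse the per-head projections into one matrix multiply and reshape instead.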

| Term in a model card | What it means | Why you care |
|---|---|---|
| num_attention_heads | h, the number of parallel attention heads per layer | Sets representational capacity per layer |
| hidden_size (or d_model, n_embd) | The main embedding dimension | Combined with h, gives d_k |
| num_key_value_heads | If smaller than num_attention_heads, K and V are shared across heads (MQA or GQA) | Inference-cost optimization; keys and values are shared, queries are not |

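As a rough sketch, the model-card fields above can be turned into d_k and an attention variant like this. The config values are hypothetical and the helper `describe_attention` is made up for illustration. It assumes d_k = d_model / h holds for the model in question, and that a missing `num_key_value_heads` means ordinary multi-head attention.

```python
# Hypothetical config values, chosen only for illustration.
config = {
    "hidden_size": 4096,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
}

def describe_attention(cfg):
    h = cfg["num_attention_heads"]
    d_model = cfg["hidden_size"]
    # Assumption: if the field is absent, every query head has its own K and V.
    kv_heads = cfg.get("num_key_value_heads", h)

    if d_model % h != 0:
        raise ValueError("hidden_size is not divisible by num_attention_heads")
    if h % kv_heads != 0:
        raise ValueError("num_attention_heads is not divisible by num_key_value_heads")

    if kv_heads == h:
        kind = "MHA"      # no sharing
    elif kv_heads == 1:
        kind = "MQA"      # one K/V shared by all query heads
    else:
        kind = "GQA"      # K/V shared within groups of query heads

    return {
        "d_k": d_model // h,
        "kind": kind,
        "query_heads_per_kv_head": h // kv_heads,
    }

info = describe_attention(config)
print(info)
# {'d_k': 128, 'kind': 'GQA', 'query_heads_per_kv_head': 4}
```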

| Pitfall | Reality |
|---|---|
| Heads equal layers | Heads run in parallel inside one layer; layers stack vertically. 12 layers × 12 heads = 144 attention computations per forward pass. |
| Each head is human-interpretable | Some are; most are not. Treat heads as structural mechanism, not as named lenses. |
| More heads is always better | d_k shrinks as h grows. Past a point, each head has too little to work with. |
| Multi-head only applies to self-attention | Works equally for self-attention and cross-attention. The trick is orthogonal to the self-versus-cross distinction. |
| Multi-head is mixture of experts | Multi-head varies the attention; MoE varies the feed-forward network. Different mechanisms, different parts of the layer. |

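The arithmetic behind the "heads equal layers" and "more heads is always better" pitfalls is easy to check. The sketch below fixes d_model at 768, as in the shape walkthrough. The list of head counts is an arbitrary choice for illustration.

```python
d_model = 768

# d_k = d_model / h: the per-head width shrinks as the head count grows.
d_k_by_h = {h: d_model // h for h in (1, 4, 12, 48, 192)}
for h, d_k in d_k_by_h.items():
    print(f"h={h:>3}  d_k={d_k}")

# Heads run in parallel inside a layer; layers stack on top of each other.
layers, heads = 12, 12
total_attention_computations = layers * heads
print(total_attention_computations)   # 144
```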
Head: one independent attention computation, with its own W_Q, W_K, W_V.
h: the number of heads in an attention layer. Many practical models use 8 to 32 heads; the largest models use more.
d_model: the main embedding dimension carried into and out of the layer.
d_k: the per-head dimension; d_k = d_model / h.
W_O: the final output projection that mixes head outputs into the layer’s overall output.
MultiHead(X): the operation as a whole; Concat(head_1, ..., head_h) · W_O.
MQA / GQA: multi-query and grouped-query attention; share keys and values across heads (or groups of heads) to cut inference cost.
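To make the MQA / GQA entry concrete, here is a hedged NumPy sketch of grouped-query attention. Each key/value head is reused by a group of consecutive query heads. The function name, the toy sizes, and the scaled dot-product form without masking are all assumptions for illustration. Setting `kv_heads = 1` gives the MQA case.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(q, k, v):
    """q: (h, seq, d_k); k, v: (kv_heads, seq, d_k); h must be divisible by kv_heads."""
    h, kv_heads = q.shape[0], k.shape[0]
    if h % kv_heads != 0:
        raise ValueError("number of query heads must be a multiple of kv_heads")
    group = h // kv_heads

    # Repeat each K/V head so it lines up with its group of query heads.
    k = np.repeat(k, group, axis=0)           # (h, seq, d_k)
    v = np.repeat(v, group, axis=0)           # (h, seq, d_k)

    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])   # (h, seq, seq)
    return softmax(scores) @ v                # (h, seq, d_k)

# Toy sizes chosen for illustration only.
rng = np.random.default_rng(0)
h, kv_heads, seq_len, d_k = 8, 2, 5, 16
q = rng.standard_normal((h, seq_len, d_k))
k = rng.standard_normal((kv_heads, seq_len, d_k))
v = rng.standard_normal((kv_heads, seq_len, d_k))

out = grouped_query_attention(q, k, v)
print(out.shape)   # (8, 5, 16)
```

Only `kv_heads` sets of keys and values need to be stored per token, while the number of query heads and the output shape are unchanged. That is where the inference saving comes from.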
One head asks one question.
Many heads ask many, all at once.