Queries ask “what am I looking for?”, and having different heads ask different questions is valuable. Keys and values describe what the model is looking at, so sharing them across heads costs less diversity.
Practical motivation
The KV cache, not the Q projections, is what gets large at inference: each query is used once at its own step, while K and V are kept for every previous token. Sharing K and V is where the memory savings actually land.
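As a rough sketch of the arithmetic, the cache size scales with the number of distinct K/V heads and does not depend on the number of query heads. The function name `kv_cache_bytes` and the model shape below (32 layers, head dimension 128, fp16, 4096 tokens) are illustrative assumptions, not figures from any particular architecture.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_value=2, batch_size=1):
    """Approximate bytes needed to cache K and V for all previous tokens."""
    # Leading factor of 2: one K vector and one V vector
    # per token, per K/V head, per layer.
    return (2 * batch_size * seq_len * n_layers
            * n_kv_heads * head_dim * bytes_per_value)


# Hypothetical model shape, chosen only to make the ratios visible.
full = kv_cache_bytes(4096, 32, n_kv_heads=32, head_dim=128)    # one K/V per head
grouped = kv_cache_bytes(4096, 32, n_kv_heads=8, head_dim=128)  # 8 shared K/V groups
single = kv_cache_bytes(4096, 32, n_kv_heads=1, head_dim=128)   # one K/V for all heads

print(f"{full / 2**20:.0f} MiB")     # 2048 MiB
print(f"{grouped / 2**20:.0f} MiB")  # 512 MiB
print(f"{single / 2**20:.0f} MiB")   # 64 MiB
```

The query head count never appears in the formula, which is why sharing K and V, rather than Q, is what shrinks the cache.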
Self-attention complexity: O(n^2) in sequence length n, because the attention matrix has n^2 entries.
Sliding window attention: restricting each token’s attention to a fixed-size window of nearby tokens.
Tiling: an implementation pattern that computes only the entries inside the attention window, never materializing the full n × n matrix.
Receptive field (in this context): the set of tokens a token has effectively seen through the chain of layer-by-layer attention. Grows with stacking even when each layer is local.
KV cache: decode-time storage of K and V vectors from previous tokens, so they do not have to be recomputed at every generation step.
MHA (multi-head attention): every attention head has its own Q, K, V projections; the original transformer.
MQA (multi-query attention): all heads share one K and one V projection; aggressive memory savings.
GQA (grouped-query attention): query heads are split into G groups, and the heads in each group share one K and one V projection; the modern compromise.
G (group count in GQA): typically much smaller than H (the head count); G = H recovers MHA and G = 1 recovers MQA. The exact value depends on the architecture.
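The three variants differ only in which K/V set each query head reads from. Below is a minimal pure-Python sketch of that idea, assuming consecutive heads are assigned to the same group (a common but not universal layout). The function names are hypothetical, the causal mask is omitted for brevity, and nested lists stand in for tensors.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def grouped_query_attention(q, k, v, n_groups):
    """q: [H][n][d] per-head queries; k, v: [G][n][d] per-group keys/values.

    Returns per-head outputs of shape [H][n][d].
    n_groups == H behaves like MHA; n_groups == 1 behaves like MQA.
    """
    n_heads = len(q)
    assert n_heads % n_groups == 0, "H must be divisible by G"
    assert len(k) == len(v) == n_groups
    heads_per_group = n_heads // n_groups
    d = len(q[0][0])
    out = []
    for h in range(n_heads):
        g = h // heads_per_group  # K/V group shared by this query head
        head_out = []
        for q_i in q[h]:
            # Scaled dot product of this query against every key in the group.
            scores = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d)
                      for k_j in k[g]]
            weights = softmax(scores)
            # Weighted sum of the group's value vectors.
            head_out.append([sum(w * v_j[c] for w, v_j in zip(weights, v[g]))
                             for c in range(d)])
        out.append(head_out)
    return out
```

Only `k` and `v` need to be cached during decoding, and they have G entries rather than H, so the cache shrinks by a factor of H / G while every head keeps its own queries.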
Sliding window attention is about compute. MHA, MQA, and GQA are about memory. Two problems, two fixes, often combined.
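To make the compute side concrete, here is a small sketch that counts attention scores with and without a window and estimates how far information can travel through stacked local layers. It assumes a causal window in which each token attends to itself and the `window - 1` tokens before it; the notes above do not fix the window shape, so treat that and the function names as assumptions.

```python
def attended_positions(i, window):
    """Positions token i attends to under a causal sliding window."""
    return range(max(0, i - window + 1), i + 1)


def count_scores(n, window=None):
    """Number of attention scores computed for a causal sequence of length n."""
    if window is None:
        # Full causal attention: token i attends to i + 1 positions.
        return n * (n + 1) // 2
    return sum(len(attended_positions(i, window)) for i in range(n))


def receptive_field(n_layers, window):
    """Upper bound on how many tokens (including itself) a token has seen.

    Each local layer extends the reach by window - 1 positions, so the
    bound grows linearly with depth; it is capped by the sequence length
    in practice.
    """
    return n_layers * (window - 1) + 1


print(count_scores(8))             # 36
print(count_scores(8, window=3))   # 21
print(receptive_field(4, window=3))  # 9
```

The windowed count grows roughly as n times the window size instead of n^2, while the receptive field still widens with every added layer.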