Skip to content

Cheatsheet: Inside the transformer: how attention decides which word goes with which

The formula (memorize the shape, not the symbols)

Section titled “The formula (memorize the shape, not the symbols)”
Attention(Q, K, V) = softmax( (Q · K^T) / √d_k ) · V

Three things in sequence: similarity, scale, softmax-weighted sum.

StepOperationWhat it does
1. SimilarityQ · K^TDot-product the current token’s query against every token’s key. High score = “this token’s key matches my search.”
2. Scale÷ √d_kDivide by the square root of the key dimension. Numerical-stability fix. Without it, softmax saturates and training breaks.
3. Softmax-weighted sumsoftmax(...) · VConvert scaled scores into weights summing to 1.0, then take the weighted blend of every token’s value vector.
VectorSymbolComes fromJob
QueryQembedding × W_QWhat this token is asking about
KeyKembedding × W_KThe label other tokens match against
ValueVembedding × W_VThe information that gets blended in once judged relevant

W_Q, W_K, W_V are learned during training. Each token gets its own Q, K, V by passing its embedding through them.

ConceptLibrary object
QueryYour search index card
KeyThe catalog card on the spine of every book (built to be matched)
ValueThe content card you read once you have decided this book is relevant
The librarianThe attention computation itself: scores every catalog card against your query, hands back a weighted blend of content cards
Where Q comes fromWhere K, V come from
Self-attentionSame sequenceSame sequence
Cross-attentionOne sequenceA different sequence

The mechanic is identical. Only the source differs. (Classic cross-attention example: a translation decoder querying the encoded source language.)

The worked numbers (sentence: “The animal didn’t cross the street because it was too tired”)

Section titled “The worked numbers (sentence: “The animal didn’t cross the street because it was too tired”)”

For the token it against tokens animal, street, it, with 4-D vectors:

TokenRaw scoreScaled (÷ √4 = 2)Softmax weight
animal31.50.51
street10.50.19
it21.00.31

Output vector for it: [1.33, 0.82, 0.50, 0.50]. The model “decided” that it refers most strongly to animal, just as a human reader would.

What “stacked” means in a real transformer

Section titled “What “stacked” means in a real transformer”

Same mechanism, more of it.

  • Many layers stacked vertically (the output of one attention layer becomes the input to the next).
  • Multiple heads running in parallel inside each layer (each head learns its own W_Q, W_K, W_V and so attends to a different pattern).
  • Every token at once, not one at a time. This is the parallelism that killed RNNs.

The arithmetic does not change. Only the scale does.

  • Confusing self-attention with cross-attention. Just check where Q, K, V come from.
  • Reading attention weights as faithful explanations of model behavior. They are part of the computation, not a guaranteed explanation of it.
  • Thinking “more attention weight = more important.” It means “more contribution to this output vector at this layer,” nothing more.
  • Thinking attention is the entire transformer. It is not. Layers also have feed-forward networks, residual connections, layer normalization, and positional encodings.
  • Thinking the model is “remembering” past tokens across calls. It is not. The transformer is stateless across calls; conversation history is re-sent as input tokens by the chat UI on every request.
  • Token: the unit the model actually processes. Often a whole word, sometimes a fragment.
  • Embedding: the numeric vector that represents a token. Comes from a lookup table at the start of the model.
  • Attention weight: a scalar in [0, 1] that tells the model how much one token’s value should contribute to another token’s updated representation.
  • Stateless across calls: each API call starts fresh; no “memory” persists unless the chat UI re-sends prior turns as input tokens.