Summary: Inside the transformer: how attention decides which word goes with which

Self-attention is the mechanism that lets every word in a sentence look at every other word at once and decide how much each one matters. It is the heart of the transformer architecture, which the language models behind ChatGPT, Claude, Gemini, and similar assistants are built on. The lesson walks through what the mechanism does in three steps, then works the math by hand on a tiny example. This summary is the scan-it-in-five-minutes version.

  • Self-attention is the answer to one question, asked once per word: for me, how much should I pay attention to every other word in this sentence? The answer is a set of weights that sum to 1.0 and tell the model how to blend the surrounding context into a refreshed version of me.
  • The architecture before transformers was the recurrent neural network (RNN), which processed a sentence one word at a time while carrying a running summary in a hidden state. Two structural problems limited it at scale: information about early words faded as it passed through many sequential steps, and because each step depended on the previous one, the computation could not be parallelized across the sentence on modern GPU hardware.
  • Transformers fix both problems with the same move: instead of a running summary passed forward word by word, every word directly looks at every other word in parallel.
  • Each word gets three vectors, derived from its embedding by three trained weight matrices (W_Q, W_K, W_V). The three vectors are the query (Q), the key (K), and the value (V).
  • The library analogy: query is your search index card, key is the catalog card on the spine of every book (designed to be matched against), value is the content card (designed to be read once you have decided this book is relevant). The librarian compares your query to every catalog card, scores the matches, and hands you back a weighted blend of the content cards.
  • The full attention formula is Attention(Q, K, V) = softmax( (Q · K^T) / √d_k ) · V. It looks dense; it is doing exactly three things in sequence.
  • Step 1, similarity: dot-product the current word’s query against every word’s key. The dot product is a scalar that grows when two vectors point in the same direction; high score means “this word’s key matches my search.”
  • Step 2, scale: divide every score by √d_k, the square root of the key dimension. This is a numerical-stability fix, not a conceptual move. Without it, dot products in high-dimensional spaces grow large enough that the softmax in step 3 saturates: nearly all the weight lands on one word, gradients shrink toward zero, and training stalls.
  • Step 3, softmax-weighted sum: softmax turns the scaled scores into weights that sum to 1.0 and emphasize the largest values. Multiply each word’s value vector by its weight and add the results. The output is the refreshed representation of the current word, blended from the context that turned out to be relevant.
  • Self-attention has all three vectors come from the same sentence (every word looks at every other word in the same sequence, including itself). Cross-attention has the queries come from one sequence and the keys and values from another (the classic example is a translation model’s decoder querying the encoded source language). The mechanic is identical; only the source of Q, K, and V differs.
  • The lesson’s worked example computes self-attention for the word “it” in “The animal didn’t cross the street because it was too tired” using made-up 4-dimensional vectors. The result: “it” pays roughly 51% of its attention to “animal,” 31% to itself, and 19% to “street” (rounded figures, which is why they total 101%). The ranking matches how a human reader resolves the pronoun.
  • A real transformer stacks many attention layers, runs several attention heads in parallel inside each layer, and processes every word at once rather than one at a time. The mechanic does not change; only the scale does.
  • Pitfalls worth naming: confusing self-attention with cross-attention (just check where Q, K, V come from); reading attention weights as faithful explanations of model behavior (they are part of the computation, not a guaranteed explanation of it); thinking attention is the entire transformer (it is not; layers also have feed-forward networks, residual connections, layer normalization, and positional encodings); thinking the model is “remembering” past tokens across calls (it is not; the transformer is stateless across calls).
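
The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the lesson's worked example: the function names (softmax, attend) and the 4-dimensional vectors are hypothetical, chosen only so the arithmetic is easy to follow, and the query, keys, and values are written out directly rather than produced by trained W_Q, W_K, and W_V matrices.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1.0."""
    peak = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d_k = len(query)
    # Step 1, similarity: dot product of the query with every key.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Step 2, scale: divide every score by the square root of d_k.
    scaled = [s / math.sqrt(d_k) for s in scores]
    # Step 3, softmax-weighted sum of the value vectors.
    weights = softmax(scaled)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output


# Hypothetical vectors for illustration only (not the lesson's numbers).
words = ["animal", "street", "it"]
keys = [
    [1.0, 0.0, 1.0, 0.0],  # animal
    [0.0, 1.0, 0.0, 0.0],  # street
    [1.0, 0.0, 0.0, 0.0],  # it
]
values = [
    [1.0, 0.0, 0.0, 0.0],  # animal
    [0.0, 1.0, 0.0, 0.0],  # street
    [0.0, 0.0, 1.0, 0.0],  # it
]
query_it = [2.0, 0.0, 2.0, 0.0]  # the query vector for "it"

weights, output = attend(query_it, keys, values)
for word, weight in zip(words, weights):
    print(f"{word}: {weight:.2f}")
```

With these made-up vectors, “animal” gets the largest weight, “it” comes second, and “street” comes last. The exact percentages differ from the lesson's because the vectors differ. Passing keys and values taken from a different sequence than the query would turn the same function into cross-attention.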

Before this lesson, “attention” was a buzzword you saw in articles about AI. Now it is a specific three-step computation you can describe out loud. When the next news cycle frames an AI capability or limitation in terms of “attention,” you can read it critically instead of taking it on faith. And when the next four lessons in this course walk you through tokenization, embeddings, multi-head attention, and the full transformer block, you will already know where the new piece slots in: every one of them is supporting infrastructure for the mechanism you just understood.