Skip to content

Summary: self-attention from scratch

TL;DR. The final phase builds the transformer, and this lesson builds its heart, self-attention, in the GPT setting. To predict the next token, each token must gather information from the earlier tokens, but only the earlier ones (causal: never peek at the future, that would be seeing the answer). The crude version averages the past uniformly; self-attention makes the weights learned: each token emits a query and a key, their dot product is an affinity, future affinities are masked to -inf, softmax turns affinities into weights, and the output is a weighted sum of the tokens’ values. This learned, content-based routing is the mechanism behind every large language model.

  • Tokens must communicate, causally. Each token gathers context from earlier tokens to predict the next one, and may attend only to itself and the past. The “past only” rule is what makes the model autoregressive, a next-token generator.

  • The crude version averages the past. Each token becomes the uniform mean of itself and all previous tokens, a causal weighted sum with dumb (equal) weights. Self-attention keeps the shape and learns the weights.

  • Query, key, value. Each token produces three learned projections: a query (“what am I looking for?”), a key (“what do I contain?”), and a value (“what I contribute”). The affinity between tokens i and j is query_i · key_j; the output is a weighted sum of values. Key is for matching; value is what gets summed.

  • The causal mask makes it GPT. Set every future affinity to -inf before softmax, so e^(-inf) = 0 gives future tokens zero weight. Worked once: token 3 of 4 with affinities [1, 1, 1, 8] masks token 4 to get weights [0.333, 0.333, 0.333, 0], so the high-affinity future token contributes nothing. Affinities are scaled by 1/sqrt(dimension) to keep softmax trainable.

  • Learned routing is why transformers won. Unlike a uniform average or WaveNet’s fixed tree, each token decides from the data which earlier tokens matter (a pronoun attends to its noun). That content-based routing is the core of every LLM.

“Attention,” the word at the center of every conversation about modern AI, stops being a mystery and becomes a concrete computation: query-key dot products, masked so the future is unreachable, softmaxed into weights, used to blend values. When a model connects a detail at the end of a long prompt to something near the start, you can picture exactly how, every token reaching back over the context and pulling in what its query finds relevant. The AI Foundations track shows this from the outside; you have built the inside. The final lesson assembles this single attention computation into the full GPT, multiple heads in parallel, stacked into blocks with feed-forward layers and normalization, and trains it to generate text.