Self-attention from scratch: brief

What you’ll learn

This is the opener of Phase 3 (Building a transformer) in the Build Neural Networks from Scratch track, which follows the arc of Andrej Karpathy’s Neural Networks: Zero to Hero series. Phase 2 built a language model three ways; this phase builds the architecture that beat them all and powers every modern large language model: the transformer, in the form of a decoder-only GPT.

This lesson builds the transformer’s heart, self-attention. To predict the next token, each token must gather information from the earlier tokens, and (the rule that makes it a GPT) it may attend only to the past, never the future. The lesson starts from a crude uniform average, then derives real attention in three steps. Each token emits a query and a key, and the dot product of one token’s query with another token’s key is the affinity between them. Future affinities are masked to negative infinity so that softmax gives them zero weight. The output is then a weighted sum of the tokens’ values. The lesson works a masked attention step by hand (a high-affinity future token gets erased by the mask) and shows that this learned, content-based routing, in which each token chooses who to listen to, is the mechanism behind every large language model.
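The steps in that paragraph can be sketched in a few lines of plain Python. This is a minimal illustration, not the lesson’s own code: the function names, the toy query, key, and value numbers, and the choice to let each token attend to itself as well as to earlier tokens are all assumptions made for the example. The 1/sqrt(dimension) scaling mentioned in the prerequisites is applied to each affinity.

```python
import math

def softmax(row):
    """Numerically stable softmax; entries of -inf receive exactly zero weight."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(queries, keys, values):
    """Single-head causal self-attention over a short sequence.

    queries, keys, values: lists of T vectors (lists of floats).
    Returns (weights, outputs); weights[t][s] is how much token t listens to token s.
    """
    T = len(queries)
    d = len(queries[0])  # query/key dimension used for the 1/sqrt(d) scaling
    weights, outputs = [], []
    for t in range(T):
        row = []
        for s in range(T):
            if s > t:
                # Future token: mask the affinity to negative infinity.
                row.append(float("-inf"))
            else:
                # Assumption: a token may attend to itself and to earlier tokens.
                dot = sum(q * k for q, k in zip(queries[t], keys[s]))
                row.append(dot / math.sqrt(d))
        w = softmax(row)
        # Output is the weighted sum of the value vectors.
        out = [sum(w[s] * values[s][j] for s in range(T))
               for j in range(len(values[0]))]
        weights.append(w)
        outputs.append(out)
    return weights, outputs

# Hypothetical toy numbers: token 0's query matches token 2's key strongly,
# but token 2 lies in the future, so the mask erases that affinity.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [4.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]

weights, outputs = causal_self_attention(queries, keys, values)
for t, w in enumerate(weights):
    print(f"token {t} weights: {[round(x, 3) for x in w]}")
# The first row printed is: token 0 weights: [1.0, 0.0, 0.0]
```

In this toy run, token 0 would have had its largest affinity with token 2, yet its weight on token 2 is exactly zero and its output equals its own value vector.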

Where this fits

This is lesson 1 of Phase 3, Building a transformer, and the first lesson to build the architecture behind today’s chatbots. Phase 2 ended with a hierarchical model (WaveNet) that combined context through a fixed tree; this lesson replaces that fixed combination with learned, data-dependent attention. It reuses softmax from the bigram lesson and the scaling idea from the BatchNorm lesson. The next lesson assembles this single attention computation into the full GPT (multiple heads in parallel, stacked into blocks with feed-forward layers and normalization) and trains it.

Before you start

Prerequisite (within this track): lesson 3, Your first language model: makemore (the bigram model). That lesson established softmax and the predict-the-next-token framing, and attention builds directly on both. Two optional aids: the BatchNorm lesson (lesson 5) explains the saturation problem that the 1/sqrt(dimension) affinity scaling guards against, and the AI Foundations track’s How attention works lesson describes the same mechanism from the user’s side (the query-key-value analogy), in case you want the intuition before the construction. Neither is required; the lesson builds attention from scratch. No coding is needed to follow along, though Karpathy’s nanoGPT is a clean implementation to read afterward.
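For a feel of what the 1/sqrt(dimension) scaling guards against, here is a small sketch under stated assumptions: the query and keys are random unit-variance Gaussian vectors, the dimension is taken to be the length of the query and key vectors (64 here), and the sequence has 8 tokens. None of these numbers come from the lesson. Without scaling, the dot products grow with the dimension and softmax tends to put nearly all its weight on one token; dividing by sqrt(dimension) keeps the weights more spread out.

```python
import math
import random

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)  # fixed seed so the run is repeatable
d = 64          # assumed query/key dimension
T = 8           # assumed number of tokens

# Hypothetical random query and keys with unit-variance entries.
query = [random.gauss(0, 1) for _ in range(d)]
keys = [[random.gauss(0, 1) for _ in range(d)] for _ in range(T)]

# Raw affinities: dot product of the query with each key.
raw = [sum(q * k for q, k in zip(query, key)) for key in keys]
# Scaled affinities: the same dot products divided by sqrt(d).
scaled = [a / math.sqrt(d) for a in raw]

unscaled_weights = softmax(raw)
scaled_weights = softmax(scaled)

print("largest weight without scaling:", round(max(unscaled_weights), 3))
print("largest weight with scaling:   ", round(max(scaled_weights), 3))
```

Dividing every affinity by the same constant greater than 1 can only flatten the softmax, so the largest weight with scaling is never bigger than the largest weight without it.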

By the end, you’ll be able to

Explain why tokens must gather information from earlier tokens, and why the gathering must be causal (past only) in a GPT
Describe the crude averaging baseline and how self-attention makes its weights learned and data-dependent
Define query, key, and value and state how an attention weight is computed from them
Run a masked self-attention step by hand, showing how the causal mask zeroes out a high-affinity future token
Recognize that learned, content-based routing is the mechanism behind every large language model
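As a companion to the second objective, the sketch below shows one way to read the crude averaging baseline. It assumes the baseline is a running mean over the current token and all earlier ones, and it uses made-up token vectors; the function names are hypothetical. It also shows that the same uniform weights fall out of a masked softmax when every allowed affinity is zero, which is the slot that self-attention fills with learned, data-dependent affinities.

```python
import math

def softmax(row):
    """Numerically stable softmax; entries of -inf receive exactly zero weight."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_uniform_average(xs):
    """For each position t, average the vectors at positions 0..t (never the future)."""
    outputs = []
    for t in range(len(xs)):
        past = xs[: t + 1]
        outputs.append([sum(col) / len(past) for col in zip(*past)])
    return outputs

def uniform_weights_via_softmax(T):
    """Same weights, written as softmax over all-zero affinities with a causal mask."""
    rows = []
    for t in range(T):
        row = [0.0 if s <= t else float("-inf") for s in range(T)]
        rows.append(softmax(row))
    return rows

# Hypothetical token vectors for illustration.
xs = [[2.0, 0.0], [4.0, 2.0], [0.0, 4.0]]

averaged = causal_uniform_average(xs)
uniform = uniform_weights_via_softmax(len(xs))

print(averaged)    # [[2.0, 0.0], [3.0, 1.0], [2.0, 2.0]]
print(uniform[1])  # [0.5, 0.5, 0.0]
```

Every earlier token gets the same weight here regardless of its content; replacing the zeros with query-key affinities is what makes the weights depend on the data.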

Time and difficulty

Read time: about 13 minutes
Practice time: about 20 minutes (a masked self-attention step by hand, optionally confirmed in nanoGPT, plus flashcards)
Difficulty: standard