Inside the transformer: how attention decides which word goes with which
What you’ll learn
This is the opener of Phase 2 (How models think: the transformer architecture) in Track 5 (AI Foundations). The Stanford CME 295 course materials (syllabus, schedule, the Amidi cheatsheets) are at cme295.stanford.edu. Phase 1 left you with a sequence of dense, position-aware vectors, one per token, ready to flow into the model.
This lesson covers the mechanism that turns those vectors into a model that knows which words go with which: self-attention. It opens on the canonical example “the animal didn’t cross the street because it was too tired” (your reading brain connects “it” to “animal”, not “street”, without conscious effort) and traces where RNNs structurally fell short (long-range decay, no parallelism). It builds the query-key-value (Q-K-V) library analogy, walks through the three-step formula (similarity, dividing by √d_k, softmax-weighted sum), distinguishes self-attention from cross-attention by which sequence supplies each vector, and works one full attention computation by hand on three tokens so the formula stops being a black box.
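The three steps named above (similarity, scaling, softmax-weighted sum) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the lesson's own worked example: the three 2-dimensional token vectors are made-up values, the function names are hypothetical, and the same vectors stand in for queries, keys, and values (a real model would first pass them through learned projection matrices).

```python
import math

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention on lists of vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Step 1: similarity of this query with every key (dot product).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        # Step 2: divide by sqrt(d_k) so scores do not grow with dimension.
        scaled = [s / math.sqrt(d_k) for s in scores]
        # Step 3: softmax turns scores into weights that sum to 1,
        # then the output is the weighted sum of the value vectors.
        weights = softmax(scaled)
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical toy vectors for three tokens (assumed, not from the lesson).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)

for row in weights:
    # Each row reads as "how much this token attends to each token".
    print([round(w, 3) for w in row])
```

Each printed row sums to 1, so it can be read as percentages of attention that one token spreads across all three.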
Where this fits
This is lesson 1 of Phase 2, How models think: the transformer architecture, and the Phase 2 opener. Phase 1 traced a sentence from raw text through tokens, embeddings, and positional information into a sequence of dense vectors. This lesson covers what the model does with those vectors: the attention mechanism. The next lesson is Multi-head attention, which extends this single-head computation to many running in parallel. The rest of Phase 2 then builds out the wrapping pieces (transformer block, position embeddings inside attention via RoPE, normalization, attention efficiency tricks, encoder-decoder/T5, and BERT in two passes).
Before you start
Prerequisites: the Phase 1 lessons, especially How AI reads tokens and Embeddings. This lesson assumes you know what a token ID is and what an embedding vector represents. You don’t need prior ML background beyond that. If you’re rusty on what a dot product does, watch 3Blue1Brown’s “Dot products and duality” (about 14 minutes) before you start. It’s the one piece of math intuition the lesson assumes; everything else is explained inline.
By the end, you’ll be able to
- Explain in plain language what attention does and why it replaced the token-by-token sequential approach RNNs used
- Distinguish self-attention from cross-attention by which sequence each of Q, K, and V comes from
- Decompose the attention formula into the role each of its three inputs (query, key, value) plays in producing the output
- Run the attention computation by hand on a small worked matrix of three tokens, and read the resulting softmax weights as percentages of attention
- Recognize that attention weights are part of the computation, not a courtroom-quality explanation of why a model said what it said
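The self- versus cross-attention distinction in the objectives above comes down to which sequence is passed in for each argument. The sketch below assumes a hypothetical helper `attend` and two made-up sequences, and it omits the learned projections a real model applies to produce Q, K, and V.

```python
import math

def attend(queries, keys, values):
    """Scaled dot-product attention; returns one output vector per query."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scaled = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        m = max(scaled)
        exps = [math.exp(s - m) for s in scaled]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Hypothetical sequences (assumed values): three tokens and two tokens.
seq_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
seq_b = [[0.5, 0.5], [2.0, 0.0]]

# Self-attention: Q, K, and V all come from the same sequence.
self_out = attend(seq_a, seq_a, seq_a)

# Cross-attention: Q comes from one sequence, K and V from another.
cross_out = attend(seq_a, seq_b, seq_b)
```

In both calls there is one output vector per query token; what changes is which sequence those outputs are mixed from.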
Time and difficulty
- Read time: about 25 minutes
- Practice time: about 20 minutes (a worked attention computation on paper, plus flashcards)
- Difficulty: standard