Self-attention, in brief

What you’ll learn

This is the opener of Phase 2 (How models think: the transformer architecture) in Track 5 (AI Foundations). The Stanford CME 295 course materials (syllabus, schedule, the Amidi cheatsheets) are at cme295.stanford.edu. Phase 1 left you with a sequence of dense, position-aware vectors, one per token, ready to flow into the model.

This lesson is the mechanism that turns those vectors into a model that knows which words go with which: self-attention. It opens on the canonical example, “The animal didn’t cross the street because it was too tired” (your reading brain connects “it” to “animal,” not “street,” without conscious effort), and traces where RNNs structurally fell short (signal decays over long ranges, no parallelism). It builds the query-key-value (Q-K-V) library analogy, walks the three-step formula (similarity, scaling by √d_k, softmax-weighted sum), distinguishes self-attention from cross-attention by which sequence supplies each vector, and works one full attention computation by hand on three tokens so the formula stops being a black box.

Where this fits

This is lesson 1 of Phase 2 (How models think: the transformer architecture). Phase 1 traced a sentence from raw text through tokens, embeddings, and positional information into a sequence of dense vectors. This lesson covers what the model does with those vectors: the attention mechanism. The next lesson is Multi-head attention, which extends this single-head computation to many heads running in parallel. The rest of Phase 2 then builds out the wrapping pieces (transformer block, position embeddings inside attention via RoPE, normalization, attention efficiency tricks, encoder-decoder/T5, and BERT in two passes).

Before you start

Prerequisites: the Phase 1 lessons, especially How AI reads tokens and Embeddings. This lesson assumes you know what a token ID is and what an embedding vector represents. You don’t need prior ML background beyond that. If you’re rusty on what a dot product does, watch 3Blue1Brown’s “Dot products and duality” (about 14 minutes) before you start. It’s the one piece of math intuition the lesson assumes; everything else is explained inline.

By the end, you’ll be able to

Explain in plain language what attention does and why it replaced the token-by-token approach RNNs used
Distinguish self-attention from cross-attention by which sequence each of Q, K, and V comes from
Decompose the attention formula into the role each of its three inputs (query, key, value) plays: queries and keys produce the scores, and values are what the resulting weights combine into the output
Run the attention computation by hand on a small worked matrix of three tokens, and read the resulting softmax weights as percentages of attention
Recognize that attention weights are part of the computation, not a courtroom-quality explanation of why a model said what it said

Time and difficulty

Read time: about 25 minutes
Practice time: about 20 minutes (a worked attention computation on paper, plus flashcards)
Difficulty: standard