Cheatsheet: Attention and transformers, in brief
The one idea that matters
- recurrence: read in order, carry a memory → slow + forgetful over distance
- attention: every position looks at every other at once, weighs relevance → fast + direct
- transformer = a network built from attention (no recurrence)

This is the brief tour. Track 5 (Transformers and LLMs) goes deep on the mechanics.
Why recurrence had to go
| Cost | Cause |
|---|---|
| Slow | Each step waits for the previous one, so the sequence cannot be processed in parallel on modern hardware |
| Forgetful | Distant information fades as it is passed hand-to-hand through every step |
Both come from processing one element at a time, in order.
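A toy sketch of both costs, under loud assumptions: the update rule and the `keep` factor are hypothetical, chosen only to make the fading visible, and are not how any particular recurrent network is defined. The loop has to run in order, and the first input's share of the final memory shrinks with every extra step.

```python
def run_recurrence(inputs, keep=0.5):
    """Toy recurrent pass: each step needs the memory left by the previous step."""
    memory = 0.0
    for x in inputs:  # strictly in order: step t cannot start before step t-1 ends
        memory = keep * memory + (1.0 - keep) * x
    return memory


def first_input_share(length, keep=0.5):
    """Weight the first input still carries in the final memory after `length` steps."""
    return (1.0 - keep) * keep ** (length - 1)


# The longer the sequence, the less of the first element survives the relay.
for n in (2, 10, 50):
    print(n, first_input_share(n))
```

With `keep=0.5`, each extra step halves what remains of the earliest input, which is the "fading relay" in miniature.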
The attention idea
For each position: look at all positions at once, score how relevant each is, and build understanding as a weighted blend that favors the relevant ones.
Worked intuition: “the animal didn’t cross the street because it was too tired.” To understand “it,” attention links it directly to “animal” (high weight) and gives the other words little weight. One direct link, not a fading relay, and every word’s relevance to every other is computed at the same time.
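The score-then-blend recipe can be sketched in a few lines. This is a minimal illustration, not the full mechanism: the two-number word vectors are hand-picked and hypothetical, relevance is a plain dot product, and the learned queries/keys/values that real transformers use (Track 5) are left out.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Relevance score used here: a plain dot product (a simplifying assumption)."""
    return sum(x * y for x, y in zip(a, b))


def attend(vectors):
    """For every position: score all positions, then blend them by weight."""
    blended, all_weights = [], []
    # Each position's row is independent of the others, so the rows could all
    # be computed at the same time; the loop is only for readability.
    for query in vectors:
        scores = [dot(query, other) for other in vectors]
        weights = softmax(scores)
        mix = [sum(w * v[d] for w, v in zip(weights, vectors))
               for d in range(len(query))]
        blended.append(mix)
        all_weights.append(weights)
    return blended, all_weights


# Hypothetical vectors, chosen so that "it" resembles "animal".
words = ["animal", "street", "it"]
vectors = [[2.0, 0.0], [0.0, 2.0], [1.5, 0.2]]

blended, weights = attend(vectors)
for word, w in zip(words, weights[2]):
    print(f"it -> {word}: {w:.2f}")
```

In this toy setup "animal" gets the largest weight for "it" and "street" the smallest. No word is dropped outright; each just contributes in proportion to its weight.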
From attention to the transformer
- A transformer = layers of attention + ordinary neurons (weighted sums + squishes), no recurrent loop.
- Direct long-range links + full parallelism → trains fast, handles long context, and scales up into modern large language models.
- Fits Lesson 1’s story: attention is what let sequence models finally use parallel compute.
The catch: not free
The cost of all-to-all attention grows with the square of sequence length (double the length, ~4x the work). That cost is the main reason models have a context window (a limit on text considered at once). Much current research is about widening it cheaply.
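The square law is easy to check by counting. A rough sketch, assuming one relevance score per ordered pair of positions and ignoring every other cost in a real model (`pair_count` is a name made up for this example):

```python
def pair_count(length):
    """Relevance scores needed when every position looks at every position."""
    return length * length


base = pair_count(1_000)
for length in (1_000, 2_000, 4_000):
    # Print the length, the number of scores, and the growth relative to 1,000.
    print(length, pair_count(length), pair_count(length) / base)
```

Each doubling of the length multiplies the count by four, so going from 1,000 to 4,000 positions means 16 times as many scores.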
Pitfalls to dodge
- “Attention is focus/awareness.” No. It is computing relevance weights and a weighted blend. Just arithmetic.
- “Transformers read left to right.” No. They process the whole sequence in parallel. (Word order is tracked another way, see Track 5.)
- “Attention replaced neural networks.” No. A transformer is still neurons + weights + squishes, wired so positions attend to each other.
- “I can learn the full mechanics here.” No, by design. Queries/keys/values/multi-head live in Track 5.
Words to use precisely
- Attention: scoring how relevant each position is to a given one, then blending by those weights.
- Transformer: a network built from stacked attention layers, no recurrence.
- Context window: how much text a model can attend to at once; limited by attention’s growing cost.
The one-line version
Recurrence whispers a message down a line and hopes it survives; attention lets everyone in the room look at everyone else at once.