Practice: Attention and transformers, in brief
Self-check
Seven short questions. Try to answer each in your head (or on paper) before opening the collapsible. Active retrieval is where the learning sticks; rereading is comfortable but does much less.
1. Recurrence has two costs that both come from processing a sequence one step at a time. Name them.
Show answer
Speed: because each step waits on the previous one, the work cannot be spread across many processors at once (it cannot be parallelized). Distance: information from an early word has to survive being passed hand to hand through every step, so it fades the further it has to travel. Attention fixes both at once.
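To put rough numbers on the distance cost, here is a toy sketch. It assumes each hand-off keeps a fixed 90% of the signal; that figure and the name `relayed_strength` are made up for illustration, and real recurrent networks do not fade this tidily.

```python
def relayed_strength(steps, keep_per_step=0.9):
    """Signal left after `steps` hand-offs, if each hand-off keeps a fixed fraction.

    Toy assumption: the 0.9 default is invented for illustration only.
    """
    return keep_per_step ** steps


# The further the message travels, the less of it survives.
for steps in (1, 10, 50):
    print(steps, "hand-offs ->", round(relayed_strength(steps), 3))
```

A direct link, by contrast, is a single hop however far apart the two words are.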
2. State the core idea of attention in one sentence.
Show answer
For each position in a sequence, the network looks at all the other positions at once, decides how relevant each one is, and builds its understanding of that position as a weighted blend of the others, paying most attention to the ones that matter.
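If it helps to see the weighted blend as arithmetic, here is a minimal sketch in plain Python. It assumes the common scaled dot-product recipe (score by dot product, then turn scores into weights with a softmax), which this lesson does not spell out. The names `query`, `keys`, `values` and all the numbers are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw relevance scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Blend `values`, weighting each one by how well its key matches `query`."""
    scale = math.sqrt(len(query))
    # One relevance score per position: dot product of query and key.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # Weighted blend: multiply each value by its weight and add them up.
    blend = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, blend


# Made-up two-number vectors for three positions (illustration only).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [2.0, 0.0]  # mostly "asks about" the first feature

weights, blend = attend(query, keys, values)
print("weights:", [round(w, 3) for w in weights])
print("blend:  ", [round(b, 3) for b in blend])
```

The second position's key does not match the query, so it gets the smallest weight and contributes least to the blend.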
3. In “the animal didn’t cross the street because it was too tired,” what does attention link “it” to, and why is that better than how recurrence would handle it?
Show answer
To animal (a street does not get tired). Attention forms a single direct link from “it” to “animal,” regardless of how many words sit between them, instead of passing a fading memory hand to hand through every intermediate word the way recurrence would. Direct, not a relay.
4. What is a transformer, in one sentence?
Show answer
A network built from layers of attention (plus the ordinary neurons, weights, and squishes you already know), with no recurrent loop. Every position attends to every other at each layer, in parallel.
5. Why do transformers train faster than recurrent networks on modern hardware?
Show answer
Because attention is not sequential. Every position’s relevance to every other can be computed at the same time, so the work spreads across parallel hardware (GPUs). Recurrence has to march one step after another, so it cannot use that parallelism, and parallel work (tying back to lesson 1) is exactly what the hardware behind deep learning is best at.
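One way to see the difference is in the shape of the code. In this rough sketch the recurrent loop needs the previous step's result before it can continue, while each pairwise score depends only on the inputs. The function names and the `mix` blending rule are invented for illustration, not taken from any real network.

```python
def recurrent_pass(inputs, mix=0.5):
    """Sequential: step t cannot start until step t-1 has produced `state`."""
    state = 0.0
    states = []
    for x in inputs:
        state = mix * state + (1 - mix) * x  # depends on the previous state
        states.append(state)
    return states


def all_pair_scores(inputs):
    """Independent: each score uses only the inputs, never another score."""
    return [[a * b for b in inputs] for a in inputs]


print(recurrent_pass([1.0, 0.0]))
print(all_pair_scores([1, 2]))
```

Plain Python still runs both functions one operation at a time. The point is only that nothing in `all_pair_scores` waits on anything else, which is what lets parallel hardware compute the scores together.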
6. Why do language models have a “context window,” a limit on how much text they can consider at once?
Show answer
Because all-to-all attention gets expensive: for every position to weigh every other, the work grows with the square of the sequence length (double the length, roughly quadruple the comparisons). The window sits roughly where that cost becomes impractical; it is a practical limit, not an arbitrary cap.
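The square law is easy to check with a few lines of arithmetic. As an assumption, this sketch counts every position against every position, itself included; `pairwise_comparisons` is an illustrative name, not a real library function.

```python
def pairwise_comparisons(length):
    """How many (position, position) pairs all-to-all attention has to score."""
    return length * length


for length in (1_000, 2_000, 4_000):
    print(length, "positions ->", pairwise_comparisons(length), "comparisons")
# 1000 positions -> 1000000 comparisons
# 2000 positions -> 4000000 comparisons
# 4000 positions -> 16000000 comparisons
```

Each doubling of the length multiplies the comparison count by four.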
7. Fill in the blank. “Recurrence ______ a message down a long line of people and hopes it survives. Attention lets everyone in the room ______ at once.”
Show answer
Whispers and look at everyone else. The shift from marching (passing a message along) to looking (everyone seeing everyone directly) is what put transformers at the center of modern AI.
Try it yourself: be the attention mechanism
No tool, no math, just the it→animal idea applied with a pen. About 10 minutes. The point is to feel what attention has to do: for a tricky word, look across the whole sentence and decide which other word it should link to.
Part A: resolve the pronoun. For each sentence, which earlier word should “it” link to? (These are deliberately tricky; the answer depends on meaning, not position.)
- “The trophy didn’t fit in the suitcase because it was too big.”
- “The trophy didn’t fit in the suitcase because it was too small.”
What you should notice
In the first sentence, “it” is the trophy (too big to fit). In the second, which uses the same sentence frame, “it” is the suitcase (too small to hold the trophy). One word changed and the correct link flipped. That is precisely the job attention does: look across the whole sentence and weigh which word “it” relates to, using the surrounding meaning, not just nearness. A purely position-based or short-memory approach would struggle; a direct, content-weighted link handles it.
Part B: spot the long-range link. In this sentence, which earlier word does “their” depend on, and how many words back is it?
“The researchers who published the controversial study last spring, after years of quiet work, finally agreed to defend their findings in public.”
What you should notice
“Their” links back to researchers, the second word of the sentence and 17 words earlier, with a long distracting clause in between. Recurrence would have to carry that link through every intervening word (and risk it fading); attention connects “their” to “researchers” directly, in one hop, no matter the distance. This is the “distance” cost from the lesson, solved.
Part C: why not infinite context? If attention is so good at long-range links, why can’t you just feed a model an entire library at once?
What you should notice
Because every position attends to every other, the work grows with the square of the length. A page is fine; a whole library would be astronomically expensive. That square-law cost is the reason a context window exists at all, and why widening it cheaply is an active research problem.
Flashcards
Ten cards.
Q. What two costs of recurrence does attention fix?
Speed (recurrence is sequential and cannot be parallelized) and distance (information fades as it passes step to step). Attention removes both by letting every position look at every other directly, all at once.
Q. What is the core idea of attention?
For each position, look at all positions at once, score how relevant each is, and build understanding as a weighted blend that favors the relevant ones.
Q. What does the it→animal example show?
In “the animal didn’t cross the street because it was too tired,” attention links “it” directly to “animal” (a street does not get tired), forming a single direct link rather than a fading relay through intermediate words.
Q. What is a transformer?
A network built from stacked layers of attention (plus ordinary neurons, weights, and squishes), with no recurrence. Every position attends to every other, in parallel.
Q. Why do transformers train faster than RNNs?
Attention is parallel, not sequential: every position’s relevance to every other can be computed at the same time, so it uses parallel hardware (GPUs) that recurrence’s step-by-step march could not.
Q. Why does all-to-all attention have a cost?
For every position to weigh every other, the work grows with the square of the sequence length (double the length, about quadruple the comparisons). That cost is why context windows exist.
Q. What is a context window?
The limit on how much text a model can consider at once. It exists because attention’s all-to-all cost grows with the square of length and becomes impractical for very long inputs.
Q. Is attention a kind of consciousness or focus?
No. The word is a metaphor. Attention computes relevance weights and takes a weighted blend: just arithmetic, the same multiply-and-add underneath.
Q. Did attention replace neural networks?
No. A transformer is still neurons, weights, and squishes; attention is a particular way of wiring them so positions can look at each other. Same engine, new arrangement.
Q. What is the one-sentence takeaway of this lesson?
Recurrence whispers a message down a long line and hopes it survives; attention lets everyone in the room look at everyone else at once.