Attention and transformers: cheatsheet

The one idea that matters

recurrence: read in order, carry a memory  → slow + forgetful over distance
attention:  every position looks at every other at once, weighs relevance  → fast + direct
transformer = a network built from attention (no recurrence)

This is the brief tour. Track 5 (Transformers and LLMs) goes deep on the mechanics.

Why recurrence had to go

Cost	Cause
Slow	Each step waits on the previous one, so the work cannot be spread across parallel hardware
Forgetful	Distant information fades as it is passed hand-to-hand through every step

Both come from processing one element at a time, in order.
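A toy sketch of both costs, not a real recurrent network. The function name read_in_order and the 0.9 retention factor are assumptions made up for illustration. Each step needs the memory from the step before, and a signal seen early shrinks every time it is handed on.

```python
def read_in_order(inputs, retention=0.9):
    """Carry one running memory through the inputs, strictly in order."""
    memory = 0.0
    for x in inputs:
        # This step needs the previous memory, so steps cannot run side by side.
        memory = retention * memory + x
    return memory

# The same signal (1.0) placed at the start vs. the end of a 50-step sequence.
signal_seen_early = read_in_order([1.0] + [0.0] * 49)
signal_seen_late = read_in_order([0.0] * 49 + [1.0])

print(signal_seen_early)  # small: the signal was handed on 49 times
print(signal_seen_late)   # untouched: the signal arrived at the last step
```

The loop is the slow part (one step at a time); the repeated multiplication by the retention factor is the forgetful part.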

The attention idea

For each position: look at all positions at once, score how relevant each is, and build that position’s representation as a weighted blend that favors the relevant ones.

Worked intuition: “the animal didn’t cross the street because it was too tired.” To understand “it,” attention links it directly to “animal” (high weight) and gives the other words low weight. That is one direct link, not a fading relay. Every word’s relevance to every other word is computed at the same time.
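The score-then-blend step for “it” can be sketched in a few lines. Everything numeric here is an assumption: the relevance scores are hand-picked rather than learned, the two-number word vectors are placeholders, and softmax is used as one common way to turn scores into weights that sum to 1.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def blend(weights, vectors):
    """Weighted average of the vectors, one weight per vector."""
    size = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(size)]

words = ["the", "animal", "didn't", "cross", "the", "street",
         "because", "it", "was", "too", "tired"]

# Hypothetical relevance of every word to "it" (hand-picked, not learned).
scores = [0.1, 4.0, 0.2, 0.5, 0.1, 1.5, 0.2, 1.0, 0.3, 0.2, 1.2]

# Placeholder vectors: [position in sentence, constant 1.0].
vectors = [[float(i), 1.0] for i in range(len(words))]

weights = softmax(scores)
top_word = words[weights.index(max(weights))]
blended = blend(weights, vectors)

print(top_word)  # the word with the highest weight
print(blended)   # the blended vector, pulled toward that word's vector
```

No word is dropped entirely: low-scoring words still contribute, just with small weights.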

From attention to the transformer

A transformer = layers of attention + ordinary neurons (weighted sums + squishes), no recurrent loop.
Direct long-range links + full parallelism → trains fast, handles long-range dependencies, and scales into modern large language models.
Fits Lesson 1’s story: attention is what let sequence models finally use parallel compute.
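A minimal sketch of “attention + ordinary neurons, no recurrent loop.” This is a toy under stated assumptions, not the real architecture: relevance is scored with a plain dot product, the squish is tanh, the weights are hand-picked, and the pieces Track 5 covers (queries/keys/values, multi-head, and the rest) are left out. The names attention_step, neuron_step, and toy_layer are invented for illustration.

```python
import math

def softmax(scores):
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_step(vectors):
    """Each position scores every position, then takes a weighted blend."""
    blended = []
    for current in vectors:  # positions are independent of each other here
        weights = softmax([dot(current, other) for other in vectors])
        blended.append([sum(w * v[i] for w, v in zip(weights, vectors))
                        for i in range(len(current))])
    return blended

def neuron_step(vectors, weight_rows, biases):
    """Ordinary neurons applied per position: weighted sum, then a squish."""
    return [[math.tanh(dot(row, v) + b) for row, b in zip(weight_rows, biases)]
            for v in vectors]

def toy_layer(vectors, weight_rows, biases):
    return neuron_step(attention_step(vectors), weight_rows, biases)

# Hypothetical inputs and weights, chosen only so the example runs.
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weight_rows = [[0.5, -0.5], [0.25, 0.75]]
biases = [0.0, 0.1]

output = sequence
for _ in range(2):  # stack two layers
    output = toy_layer(output, weight_rows, biases)
print(output)
```

Nothing in the loop over positions depends on an earlier position’s result, which is why this work can be done in parallel.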

The catch: not free

All-to-all attention compares every position with every other, so its cost grows with the square of sequence length (double the length, ~4x the work). That cost is a main reason models have a context window (a limit on how much text is considered at once). Much current research is about widening it cheaply.
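The arithmetic behind “double the length, ~4x the work,” as a small sketch. It counts only position-to-position comparisons and ignores everything else a real model does; pair_count is a name made up here.

```python
def pair_count(length):
    """Number of position-to-position comparisons in all-to-all attention."""
    return length * length

for length in (1_000, 2_000, 4_000):
    print(length, pair_count(length))

# Doubling the length multiplies the comparisons by 4.
ratio = pair_count(2_000) / pair_count(1_000)
print(ratio)
```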

Pitfalls to dodge

“Attention is focus/awareness.” No. It is computing relevance weights and a weighted blend. Just arithmetic.
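“Transformers read left to right.” No. They process the whole input sequence in parallel. (Word order is tracked another way, see Track 5.) When generating text, a model still emits one token at a time, but each step attends to all earlier positions at once.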
“Attention replaced neural networks.” No. A transformer is still neurons + weights + squishes, wired so positions attend to each other.
“I can learn the full mechanics here.” No, by design. Queries/keys/values/multi-head live in Track 5.

Words to use precisely

Attention: scoring how relevant each position is to a given one, then blending by those weights.
Transformer: a network built from stacked layers of attention plus ordinary neurons, no recurrence.
Context window: how much text a model can attend to at once; limited largely by attention’s cost, which grows with the square of the length.

The one-line version

Recurrence whispers a message down a line and hopes it survives; attention lets everyone in the room look at everyone else at once.