
Cheatsheet: Attention and transformers, in brief

recurrence: read in order, carry a memory → slow + forgetful over distance
attention: every position looks at every other at once, weighs relevance → fast + direct
transformer = a network built from attention (no recurrence)

This is the brief tour. Track 5 (Transformers and LLMs) goes deep on the mechanics.

Cost       Cause
Slow       Steps happen in order; cannot be parallelized on modern hardware
Forgetful  Distant information fades passing hand-to-hand through every step

Both come from processing one element at a time, in order.
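A toy sketch of both costs, assuming a one-number memory updated as `memory = keep * memory + x`. Real recurrent networks use learned weights and squishes; `run_recurrence`, `first_input_share`, and `keep` are illustrative names, not anything from the lessons. Each step needs the previous step's result, so the loop cannot be split across parallel workers, and the first input's share of the memory shrinks by a factor of `keep` at every later step.

```python
def run_recurrence(inputs, keep=0.5):
    """Read inputs in order, carrying a single-number memory (toy model)."""
    memory = 0.0
    for x in inputs:
        # Step t needs the memory produced by step t-1, so steps run one by one.
        memory = keep * memory + x
    return memory


def first_input_share(num_steps, keep=0.5):
    """Fraction of the first input still in memory after num_steps inputs."""
    return keep ** (num_steps - 1)


# One signal at the start, followed by nine empty steps.
signal_then_silence = [1.0] + [0.0] * 9
print(run_recurrence(signal_then_silence))  # 0.001953125
print(first_input_share(10))                # 0.001953125
```

After only ten steps, under these assumptions, less than 0.2% of the first input is left.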

For each position: look at all positions at once, score how relevant each is, and build understanding as a weighted blend that favors the relevant ones.
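The three steps above (look, score, blend) can be sketched in a few lines. This is a minimal illustration, not the full mechanism: it assumes relevance is a plain dot product between position vectors and skips the learned queries/keys/values that Track 5 covers. The names `softmax`, `dot`, and `attend` are chosen here for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Similarity of two vectors (assumed stand-in for a relevance score)."""
    return sum(x * y for x, y in zip(a, b))


def attend(vectors):
    """For each position, blend all positions, favoring the relevant ones."""
    dim = len(vectors[0])
    outputs = []
    for current in vectors:
        scores = [dot(current, other) for other in vectors]  # look at all, score each
        weights = softmax(scores)                            # relevance weights
        blend = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
        outputs.append(blend)
    return outputs


# Positions 0 and 2 are alike, so each weighs the other more than position 1.
outputs = attend([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
for row in outputs:
    print([round(x, 3) for x in row])
```

The outer loop is written sequentially for readability, but no iteration depends on another, which is what lets real implementations compute every position at once.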

Worked intuition: “The animal didn’t cross the street because it was too tired.” To understand “it,” attention links it directly to “animal” (high weight) and gives the other words little weight. One direct link, not a fading relay, and every word’s relevance to every other is computed at the same time.
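To make the weights concrete, here is a sketch with hand-picked, hypothetical relevance scores for the word “it”. The numbers are made up for illustration and do not come from a trained model; only the arithmetic that turns scores into weights is the point.

```python
import math

words = ["the", "animal", "didn't", "cross", "the", "street",
         "because", "it", "was", "too", "tired"]

# Hypothetical scores: how relevant each word is to "it" (invented, not learned).
scores = [0.0, 4.0, 0.0, 0.5, 0.0, 1.5, 0.0, 1.0, 0.0, 0.0, 1.0]

# Scores become weights: exponentiate, then divide by the total so they sum to 1.
exps = [math.exp(s) for s in scores]
total = sum(exps)
weights = [e / total for e in exps]

best = max(range(len(words)), key=lambda i: weights[i])
print(words[best])  # animal

for word, weight in zip(words, weights):
    print(f"{word:>8} {weight:.2f}")
```

With these invented scores, “animal” takes roughly three quarters of the weight and every other word gets a small share.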

  • A transformer = layers of attention + ordinary neurons (weighted sums + squishes), no recurrent loop.
  • Direct long-range links + full parallelism → trains fast, handles long context, scaled into modern large language models.
  • Fits Lesson 1’s story: attention is what let sequence models finally use parallel compute.
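A deliberately stripped-down sketch of “layers of attention + ordinary neurons”. It assumes dot-product relevance, tanh as the squish, and fixed made-up weights. Real transformers add learned projections and other parts covered in Track 5. All names (`attention`, `neurons`, `toy_layer`, `toy_transformer`) are illustrative.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(vectors):
    """Each position becomes a relevance-weighted blend of all positions."""
    blended = []
    for current in vectors:
        scores = [sum(a * b for a, b in zip(current, other)) for other in vectors]
        weights = softmax(scores)
        blended.append([sum(w * v[i] for w, v in zip(weights, vectors))
                        for i in range(len(current))])
    return blended


def neurons(vector, weight_rows):
    """Ordinary neurons: one weighted sum per row, then a squish (tanh here)."""
    return [math.tanh(sum(w * x for w, x in zip(row, vector))) for row in weight_rows]


def toy_layer(vectors, weight_rows):
    """Attention mixes across positions; neurons then process each position."""
    return [neurons(v, weight_rows) for v in attention(vectors)]


def toy_transformer(vectors, layers):
    # The loop runs over layers (depth), not over positions:
    # no memory is handed from one position to the next.
    for weight_rows in layers:
        vectors = toy_layer(vectors, weight_rows)
    return vectors


# Hypothetical fixed weights; a real model learns these during training.
layers = [[[0.5, -0.5], [0.25, 0.75]], [[1.0, 0.0], [0.0, 1.0]]]
out = toy_transformer([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], layers)
print(len(out), len(out[0]))  # 3 2
```

Three positions go in and three come out at every layer; stacking layers adds depth without adding any step-by-step pass along the sequence.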

The cost of all-to-all attention grows with the square of sequence length (double the length, ~4x the work). That cost is a main reason models have a context window (a limit on how much text is considered at once). Much current research is about widening it cheaply.
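The arithmetic behind “double the length, ~4x the work”, as a small sketch. It counts only the pairwise relevance scores and ignores everything else a model computes; `pairwise_scores` is an illustrative name.

```python
def pairwise_scores(seq_len):
    """Relevance scores needed when every position looks at every position."""
    return seq_len * seq_len


for n in (1_000, 2_000, 4_000):
    print(n, pairwise_scores(n))
# 1000 1000000
# 2000 4000000
# 4000 16000000
```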

  • “Attention is focus/awareness.” No. It is computing relevance weights and a weighted blend. Just arithmetic.
  • “Transformers read left to right.” No. They process all positions of the input in parallel; only text generation emits one token at a time. (Word order is tracked another way, see Track 5.)
  • “Attention replaced neural networks.” No. A transformer is still neurons + weights + squishes, wired so positions attend to each other.
  • “I can learn the full mechanics here.” No, by design. Queries/keys/values/multi-head live in Track 5.

  • Attention: scoring how relevant each position is to a given one, then blending by those weights.
  • Transformer: a network built from stacked attention layers, no recurrence.
  • Context window: how much text a model can attend to at once; limited by attention’s growing cost.

Recurrence whispers a message down a line and hopes it survives; attention lets everyone in the room look at everyone else at once.