Skip to content

Summary: Attention and transformers, in brief

Recurrence reads a sequence one step at a time, which makes it slow (it cannot be parallelized) and forgetful over distance (early information fades). Attention is the idea that replaced it: for each position, look at every other position at once, weigh how relevant each is, and blend them, so long-range links are direct and the whole thing computes in parallel. A network built from attention is a transformer, the architecture behind modern language models. This is the brief tour; the deep mechanics live in Track 5. Here is the scan-it-in-five-minutes version.

  • Recurrence’s two costs both come from processing in order: speed (each step waits on the last, so no parallelism) and distance (information fades as it passes step to step). Both would vanish if every position could look directly at every other.
  • Attention does exactly that. For each position, it looks at all positions at once, scores their relevance, and builds a weighted blend favoring the relevant ones. In “the animal didn’t cross the street because it was too tired,” attention links “it” directly to “animal,” one hop, no fading relay.
  • A transformer is a network built from attention (plus the ordinary neurons, weights, and squishes you already know), with no recurrent loop. Every position attends to every other at each layer, in parallel.
  • Two advantages explain its dominance: it trains fast (parallel, so it rides the GPU compute wave from lesson 1 that recurrence could not use), and it handles long-range links gracefully (every position reaches every other directly). That is why transformers replaced recurrent networks and scaled into large language models.
  • Attention is not free. All-to-all comparison grows with the square of the sequence length, so very long inputs get expensive fast. This is why models have a context window, a limit on how much text they can consider at once; it is where the cost becomes impractical, not an arbitrary cap.
  • This is the brief tour, by design. The real machinery (how relevance weights are computed, attending in several ways at once, tracking word order without marching) lives in Track 5. The load-bearing idea here is “look at everything at once and weigh what matters.”

Almost every large language model you use is a transformer, so attention is the engine under the hood of modern AI text. This explains things you can feel: these models handle long, cross-referencing prompts well because any part of your prompt can attend directly to any other part, and the “context window” you keep hearing about is a direct consequence of attention’s square-law cost. Knowing the mechanism gives you a real handle on where these tools are strong and why they have the limits they do. The next lesson leaves sequences for the second problem shape, images, and the idea of wiring a network to see by looking at small local patches: the convolution.

Recurrence whispers a message down a long line of people and hopes it survives. Attention lets everyone in the room look at everyone else at once.