
Lesson: Attention and transformers, in brief

Last lesson left us with a working idea and a nagging weakness. Recurrence handles sequences by carrying a memory forward one step at a time, which is clever, but it reads strictly in order, so it is slow, and the memory of something said early tends to fade by the time the network has marched to the end. We ended on a promise: there is a different answer to the same problem, one that does not march at all. This lesson is that answer, in brief. It is called attention, and it is the idea that modern language models are built on.

A heads-up on scope: this is the quick tour. Attention and the transformers built from it are big enough to deserve their own track, and they have one (Track 5, Transformers and LLMs, goes deep on the mechanics). Here we just want the core intuition: what attention does, and why it swept recurrence aside.

Recurrence has two costs, and both come from the same source: it processes a sequence one element at a time, in order.

The first cost is speed. Because step five cannot start until step four has produced its memory, the work is forced to happen in sequence. You cannot spread it across many processors working at once, which is exactly the kind of parallel work the hardware behind deep learning is best at. A long sequence means a long, unavoidable chain of waiting.

The second cost is distance. For information from the first word to reach the last, it has to survive being passed hand to hand through every step in between, and as we saw, it tends to get washed out along the way. The further apart two related words are, the weaker the link between them.
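To make both costs concrete, here is a toy sketch in Python. It is not a real recurrent network: the update rule, the `keep` parameter, and both function names are assumptions made up for illustration. It only shows two things from the paragraphs above: each step has to wait for the previous step's memory, and the first input's share of the final memory shrinks as the sequence gets longer.

```python
def run_recurrence(inputs, keep=0.5):
    """Toy recurrence: each step mixes the old memory with the new input.

    `keep` (a hypothetical knob) is the fraction of old memory carried forward.
    """
    memory = 0.0
    for x in inputs:
        # This step cannot start until the previous step has produced `memory`.
        memory = keep * memory + (1 - keep) * x
    return memory


def first_input_share(length, keep=0.5):
    """How much of the first input is left in the final memory."""
    signal = [1.0] + [0.0] * (length - 1)  # only the first position carries a signal
    return run_recurrence(signal, keep)


for length in (1, 5, 10, 20):
    print(length, first_input_share(length))
```

With `keep=0.5`, the share halves at every additional step, so after ten steps less than a thousandth of the first input remains. A real recurrent network is more sophisticated than this, but the shape of the problem is the same.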

Both problems would vanish if, instead of passing information down a long chain, every position could just look directly at every other position. No marching, no relay. That is the move attention makes.

The idea: look at everything, weigh what matters

Here is attention in one sentence. For each position in the sequence, the network looks at all the other positions at once, decides how relevant each one is, and builds its understanding of the current position as a weighted blend of the others, paying most attention to the ones that matter.

A concrete case makes it click. Read this sentence: “the animal didn’t cross the street because it was too tired.” What does “it” refer to? You know instantly: the animal (a street does not get tired). To understand the word “it,” your mind reached back and connected it to “animal,” several words away. Attention lets a network do exactly that: when processing “it,” the network looks across the whole sentence, assigns a high relevance weight to “animal” and low weights to the rest, and pulls in that meaning directly.

Notice what just happened. “It” connected to “animal” in a single, direct link, not by passing a fading memory through the five words in between. And the network can compute the relevance of every word to every other word all at the same time, rather than one step after another. The two costs of recurrence, distance and speed, are both gone in one stroke. That is why attention was such a leap.
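The "weighted blend" for the word “it” can be sketched in a few lines of Python. Everything numeric here is an assumption for illustration: the relevance scores are hand-picked rather than learned, the two-number "meanings" are invented, and softmax is used as one common way to turn scores into weights that sum to 1. How a trained network actually produces the scores is Track 5 material.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    top = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(scores, values):
    """Blend the value vectors using relevance weights derived from the scores."""
    weights = softmax(scores)
    dim = len(values[0])
    blended = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, blended


words = ["the", "animal", "didn't", "cross", "the", "street",
         "because", "it", "was", "too", "tired"]

# Made-up relevance of each word to "it"; a trained network would compute these.
scores = [0.0, 4.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.5, 0.0, 0.0, 1.0]

# Made-up 2-number "meanings": first slot = animal-ness, second = street-ness.
values = [[0.0, 0.0] for _ in words]
values[1] = [1.0, 0.0]  # "animal"
values[5] = [0.0, 1.0]  # "street"

weights, blended = attend(scores, values)
for word, weight in zip(words, weights):
    print(f"{word:8s} {weight:.2f}")
print("blended meaning for 'it':", [round(x, 2) for x in blended])
```

With these invented scores, most of the weight lands on “animal,” so the blended vector for “it” points mostly in the animal direction. The link is one step long no matter how far apart the two words sit.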

If attention is so powerful, what happens if you build a network almost entirely out of it and drop recurrence altogether? You get a transformer.

A transformer is, in essence, layers of attention (plus the ordinary neuron machinery you already know, the weighted sums and squishes) stacked up, with no recurrent loop anywhere. Every position attends to every other at each layer, in parallel, and the network builds richer and richer representations as the layers go up. Because there is no sequential marching, transformers train fast on modern hardware, and because every position can reach every other directly, they handle long-range connections gracefully. Those two advantages are most of why transformers replaced recurrent networks for language and why they scaled into the large language models behind today’s AI assistants.
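As a rough sketch of "every position attends to every other, layer after layer," the Python below stacks a heavily simplified attention layer. The simplifications are assumptions made to keep it short: relevance is just the dot product of the raw vectors, and there are no learned weights, no ordinary neuron sub-layers, and no word-order information, all of which a real transformer has. The names `self_attention` and `tiny_stack` are hypothetical.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """One simplified attention layer: each position blends all positions."""
    dim = len(vectors[0])
    outputs = []
    # No iteration of this loop depends on another, so in principle
    # all positions could be computed at the same time.
    for current in vectors:
        scores = [sum(a * b for a, b in zip(current, other)) for other in vectors]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
        )
    return outputs


def tiny_stack(vectors, num_layers=2):
    """Apply the simplified layer several times, feeding each output to the next."""
    for _ in range(num_layers):
        vectors = self_attention(vectors)
    return vectors


sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three positions, made-up vectors
print(tiny_stack(sequence))
```

The loop over layers is sequential, but the loop over positions inside a layer is not, which is the contrast with recurrence: the chain of waiting is as long as the number of layers, not the length of the sequence.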

It is worth seeing how neatly this fits the story from the start of the track. Lesson 1 said the modern era was unlocked by depth, data, and compute arriving together, and that the hardware behind deep learning is built to do many small calculations in parallel. Recurrence could not fully use that hardware, because its steps had to happen in order. Attention can: every position’s relevance to every other can be computed at the same time. The transformer is, in part, the architecture that finally let sequence models ride the compute wave the rest of deep learning was already riding.

There is real machinery underneath this: how the network computes those relevance weights, how it attends in several different ways at once, and how it keeps track of word order without marching through it. That machinery is exactly what Track 5 builds, piece by piece. For this survey, the load-bearing idea is enough: a transformer is a network that, instead of reading in order and remembering, looks at the whole sequence at once and weighs what matters.

Looking at everything at once has a price worth naming honestly. For every position to weigh every other position, the amount of work grows with the square of the sequence length: double the length and you roughly quadruple the comparisons. That is fine for a sentence or a page, but it gets expensive fast for very long inputs. This is a major reason models have a context window, a limit on how much text they can consider at once. The window is not an arbitrary cap; it sits roughly where the cost of all-to-all attention becomes impractical. A lot of recent research is about stretching that window without the cost exploding, which tells you how central this single tradeoff is to how modern AI behaves.
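The arithmetic is easy to check. This small Python sketch counts position-to-position comparisons for a few sequence lengths; the function name and the lengths are picked for illustration, and it counts comparisons only, not the real cost of any particular model.

```python
def pairwise_comparisons(length):
    """Every position weighs every position, so the count is length squared."""
    return length * length


for length in (1_000, 2_000, 4_000):
    print(f"{length:>6} positions -> {pairwise_comparisons(length):>12,} comparisons")
```

Going from 1,000 to 2,000 positions takes the count from 1,000,000 to 4,000,000, and doubling again to 4,000 positions takes it to 16,000,000.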

Almost every large language model you interact with is a transformer, so this lesson’s one idea, attention, is the engine under the hood of modern AI text. It explains a few things you can feel when you use these tools. They handle long, complex prompts with cross-references better than older systems did, because any part of your prompt can directly attend to any other part. And the much-discussed “context window” is a direct consequence of attention: every position attending to every other is powerful, but the cost climbs steeply as the sequence grows, which is part of why context windows have limits at all. Knowing that attention is the mechanism gives you a real handle on why these models are good at what they are good at.

Thinking attention is a kind of consciousness or focus. The word is a metaphor. “Attention” here is just computing relevance weights and taking a weighted blend of other positions. No awareness is involved, only arithmetic, the same multiply-and-add underneath.

Thinking transformers still read left to right like recurrence. They do not march. A transformer looks at the whole sequence at once and processes positions in parallel. (How it still knows word order is a real question, answered in Track 5.)

Thinking attention replaced neural networks. It did not. A transformer is still made of the neurons, weights, and squishes you already know; attention is a particular way of wiring them so positions can look at each other. Same engine, new arrangement.

Trying to learn the full mechanics from this lesson. This is the brief tour by design. The queries, keys, values, and multi-head details live in Track 5; reaching for them here is reaching past the lesson’s scope.

  • Recurrence is slow and forgetful because it processes a sequence in order: work cannot be parallelized, and distant information fades along the chain.
  • Attention fixes both at once. Each position looks at all positions directly and blends them by relevance, so long-range links are direct and the whole thing can be computed in parallel.
  • A transformer is a network built from attention (plus ordinary neurons), with no recurrence. That is the architecture behind modern large language models.
  • This is the brief tour. The actual mechanics of attention live in Track 5 (Transformers and LLMs); here the load-bearing idea is “look at everything at once and weigh what matters.”

Recurrence whispers a message down a long line of people and hopes it survives. Attention lets everyone in the room look at everyone else at once. That shift, from marching to looking, is what put transformers at the center of modern AI.

Next: we leave sequences behind and turn to the second problem shape, images. The next lesson is about how a network can be wired to see, looking at small local patches of an image instead of the whole thing at once. That idea is the convolution.