Attention and transformers: brief

What you’ll learn

This lesson directly answers the weakness the previous one ended on: recurrence processes a sequence one step at a time, so it cannot be parallelized and it loses track of long-range links. The fix is attention, the idea that lets every position in a sequence look at every other position at once and weigh what matters. The transformer is the architecture built from it. The source curriculum is MIT 6.S191, Lecture 2, by Alexander and Ava Amini, freely available at introtodeeplearning.com.

This is the survey treatment, by design. You will get the core intuition (what attention does, why it beat recurrence, what a transformer is, and why all-to-all attention gives rise to a context window) without the deep mechanics. Queries, keys, values, multi-head attention, and positional encoding are deliberately left to Track 5, which builds them piece by piece.

Where this fits

This is lesson 3 of 10, closing Phase 1 (Foundations and sequences). The previous lesson built recurrence and named its weaknesses; this lesson is the fix. For the full mechanics of attention and transformers, the dedicated track is Track 5 (Transformers and LLMs). The next lesson in this track leaves sequences for the second problem shape, images, and introduces convolution.

Before you start

Prerequisites: lesson 2 of this track (recurrence and its limits), which sets up the problem this lesson solves. The neural-network basics from the previous track are assumed. No new math is required.

About the math

None. This is the brief, intuition-level tour. The practice section is a pen-and-paper pronoun-resolution exercise (which word should “it” link to?), not arithmetic. The mathematical machinery of attention lives in Track 5.

By the end, you’ll be able to

Explain the two costs of recurrence (no parallelism, fading long-range links) that attention removes
Describe attention as each position looking at all positions at once and blending them by relevance
Define a transformer as a network built from attention with no recurrence, and explain why it scaled into modern language models
Explain why all-to-all attention’s cost gives rise to a context window
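
For readers who like to see the last objective in numbers, here is an optional sketch of why all-to-all attention gets expensive. It only counts position-to-position comparisons, under the simplifying assumption that every position looks at every position (itself included). The function name attention_pairs is hypothetical, chosen for this illustration, and the count says nothing about how any real model is implemented.

```python
def attention_pairs(sequence_length: int) -> int:
    """Count comparisons when every position looks at every position.

    Simplifying assumption: n positions each look at all n positions,
    so the count is n * n. Real systems differ in the details.
    """
    return sequence_length * sequence_length


# Each tenfold increase in length multiplies the number of pairs by 100.
for n in (10, 100, 1_000, 10_000):
    print(f"{n:>6} positions -> {attention_pairs(n):>12,} pairs")
```

Doubling the sequence length quadruples the work, which is one intuition for why a model caps how much text it attends to at once. That cap is the context window.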

Time and difficulty

Read time: about 8 minutes
Practice time: about 10 minutes (a pronoun-resolution exercise that mirrors how attention links words, plus flashcards)
Difficulty: standard (survey-level; no math)