References: Inside the transformer: how attention decides which word goes with which

Source material

• Stanford CME 295: Transformers & Large Language Models, Autumn 2025, Lecture 1
  Instructors: Afshine Amidi & Shervine Amidi, Stanford University
  YouTube: https://www.youtube.com/watch?v=Ub3GoFaUcds
  Course site: https://cme295.stanford.edu/
  Cheatsheet: https://cme295.stanford.edu/cheatsheet/
  License (lecture video): as published on Stanford's public YouTube channel
  License (Amidi cheatsheets): MIT
Clawdemy provides original notes, summaries, and quizzes derived from this material
for educational purposes. All rights to the original lecture remain with Stanford and
the instructors.

Going deeper

A short list, chosen for durability. Each link is for a specific next step, not a generic “learn more” pile.

The Illustrated Transformer by Jay Alammar. The most widely-shared visual walkthrough of the transformer architecture on the public web, and for good reason. Where this lesson keeps the math minimal, Alammar draws the matrices in full color and shows how attention scales up across heads and layers. If you want to see the same mechanism rendered with diagrams, start here.
But what is a GPT? Visual intro to transformers by 3Blue1Brown. The first of a multi-part series. Grant Sanderson is unmatched at building intuition for the linear-algebra moves underneath attention. Watch the attention episode (chapter 6 of the series) right after this lesson. Twenty-six minutes, no math prerequisites beyond what you have here.
The Annotated Transformer from Harvard NLP. The original Vaswani et al. paper, line by line, with runnable PyTorch beside each section. This is the bridge from “I understand the mechanism” to “I can read transformer code.” Skip the training section on a first pass; read the architecture and attention sections.
Stanford CME 295 syllabus. The full nine-lecture course this lesson is adapted from. Lectures 2 through 9 cover tokenization, embeddings, multi-head attention, decoder-only models, mixture of experts, fine-tuning, RLHF, and inference-time techniques. We will adapt several more of these lectures into Clawdemy lessons; the course is also worth following directly if you want the technical version end to end.

Adjacent topics

Topics that build on or sit beside this one. Some are upcoming Clawdemy lessons; some are pointers outside the course.

Tokenization, embeddings, multi-head attention, the full transformer block. The next four lessons in this Stanford-adapted course. Each one is the supporting infrastructure for the mechanism you just learned: tokenization is what produces the input units, embeddings are what give them their first numeric form, multi-head attention is “run this lesson’s mechanism in parallel many times per layer,” and the transformer block is what wraps attention together with feed-forward and normalization.
“The Bitter Lesson” by Rich Sutton (2019). One page. Argues that the methods that win in AI are the ones that scale with compute, not the ones that encode human cleverness. The transformer is the textbook case: a simple, parallelizable mechanism that scaled past everything more clever before it. Read it once a year.

Original sources

The primary sources this lesson draws from.

“Attention Is All You Need”, Vaswani et al., NeurIPS 2017. The original transformer paper. Section 3.2.1 (“Scaled Dot-Product Attention”) gives the formula this lesson worked through. Section 3.2.2 explains multi-head attention. The paper is dense but readable; if you have made it through this lesson, you can read sections 3.1 and 3.2.
“Neural Machine Translation by Jointly Learning to Align and Translate”, Bahdanau et al., 2014. The paper that introduced the attention mechanism (in the encoder-decoder RNN setting, before transformers existed). The Vaswani paper’s title is in dialogue with this one: Bahdanau showed attention could help an RNN; Vaswani showed attention could replace it.
“The Unreasonable Effectiveness of Recurrent Neural Networks” by Andrej Karpathy, 2015. A snapshot of the state of the art two years before the transformer paper landed. Worth reading for the contrast: this is what people thought the future of language modeling looked like.
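
As a quick reminder of the formula in the Vaswani paper, scaled dot-product attention computes softmax(Q K^T / sqrt(d_k)) V. The sketch below is a minimal pure-Python rendering of that formula, not code from the paper or the lecture. The function names and the toy inputs are assumptions chosen for illustration, and real implementations use batched matrix operations instead of loops.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(keys[0])
    dim_v = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)]
        )
    return outputs

# Toy example (illustrative values): two tokens, 2-dimensional vectors.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
result = scaled_dot_product_attention(Q, K, V)
```

Each row of `result` is a weighted average of the value vectors, with the largest weight on the value whose key best matches that row's query.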

Community discussion

None selected for this lesson. The public discussion of “how attention works” has been thoroughly absorbed into the resources above; the marginal Reddit or Hacker News thread does not add durable value over the Alammar post or the 3Blue1Brown video. If a canonical thread surfaces, it will be added at the next quarterly review.