
References: self-attention from scratch

Source curriculum (structural mirror, cited as further study):
• Andrej Karpathy, "Neural Networks: Zero to Hero", Lecture 7:
"Let's build GPT: from scratch, in code, spelled out."
Creator: Andrej Karpathy
Video: https://www.youtube.com/watch?v=kCc8FmEb1nY
Code repo (nanoGPT): https://github.com/karpathy/nanoGPT (MIT License)
Companion repo (ng-video-lecture): https://github.com/karpathy/ng-video-lecture (no explicit license)
Series page: https://karpathy.ai/zero-to-hero.html
License: nanoGPT is MIT-licensed; the ng-video-lecture companion repo carries no explicit license; the video is under the Standard YouTube License.
This lesson covers the first half of Lecture 7, where Karpathy builds
self-attention from scratch (the crude average, then query/key/value attention
with a causal mask) on a character-level Shakespeare GPT. Clawdemy's lessons are
original prose following the pedagogical arc of this series; we do not reproduce
or transcribe the video or code. The query-key dot-product example and the
causal-mask worked example here are ours, built to be checkable by hand. All
rights to the original video and code remain with the creator.
  • Let’s build GPT: from scratch, in code, spelled out (Andrej Karpathy). The lecture this lesson mirrors. Karpathy starts from the bigram baseline, builds the crude averaging “communication,” then derives query-key-value self-attention with the causal mask, all live in code on a tiny Shakespeare dataset. Watching the masked attention matrix fill in as a lower triangle, and the generated text improve once attention is added, is the clearest way to make this lesson concrete.
  • Attention Is All You Need (Vaswani et al., 2017) (arXiv). The paper that introduced the transformer and the scaled dot-product attention this lesson builds. The phrase “attention is all you need” comes from its title; it is one of the most influential papers in modern AI.

  • nanoGPT on GitHub (MIT License). Karpathy’s compact, readable GPT implementation, the cleaned-up version of what the lecture builds. The attention module is the part to read after this lesson.
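For readers who want something to run before opening any of the sources above, here is a minimal sketch of the mechanism they build: single-head self-attention with a causal mask. It is ours, not code from the lecture or from nanoGPT, and it uses plain Python lists rather than tensors so every number can be checked by hand. The function name `causal_self_attention` and the toy inputs are assumptions made for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats (entries may be -inf)."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(q, k, v):
    """Illustrative single-head causal self-attention (hypothetical helper).

    q, k, v are lists of T vectors (lists of floats); q and k share dimension d.
    Returns (weights, out): a T x T lower-triangular weight matrix and T outputs.
    """
    T, d = len(q), len(q[0])
    weights = []
    for t in range(T):
        # Affinity of position t with every position s, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d)
                  for s in range(T)]
        # Causal mask: future positions (s > t) get -inf, so softmax gives them 0.
        scores = [sc if s <= t else float("-inf") for s, sc in enumerate(scores)]
        weights.append(softmax(scores))
    # Each output is the attention-weighted sum of the value vectors.
    out = [[sum(weights[t][s] * v[s][j] for s in range(T))
            for j in range(len(v[0]))]
           for t in range(T)]
    return weights, out

# Toy inputs (assumed for illustration): three positions, all-zero queries.
q = [[0.0, 0.0] for _ in range(3)]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0], [2.0], [3.0]]
weights, out = causal_self_attention(q, k, v)
for row in weights:
    print(row)
```

With all-zero queries every unmasked score is equal, so each row of `weights` is a uniform average over the current and earlier positions. That is the crude average the lecture starts from; non-zero queries and keys are what let a position weight some past positions more than others.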

Where this sits in the curriculum.

  • How attention works (AI Foundations track). That lesson describes self-attention from the user’s side: the query-key-value library analogy, the similarity-scale-softmax formula, and self versus cross attention. This lesson builds the same mechanism from scratch and adds the piece specific to GPT: the causal mask that forces “past only.” Read together, one gives the intuition and the other the construction.

  • The bigram model and BatchNorm lessons (this track). Attention reuses softmax from the bigram lesson (turning affinities into weights) and the scaling idea from the BatchNorm lesson (dividing affinities by sqrt(dimension) to keep softmax out of its saturated regime). If either step felt fast, those lessons are the grounding.
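The scaling step mentioned above is easy to check numerically. The sketch below is ours and rests on stated assumptions: a hypothetical dimension `d = 64` and a hand-picked set of raw affinities. It compares softmax on the raw scores with softmax on the same scores divided by sqrt(d).

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

d = 64                          # hypothetical dimension of the query/key vectors
raw_scores = [8.0, 0.0, -8.0]   # illustrative unscaled dot products
scaled_scores = [s / math.sqrt(d) for s in raw_scores]  # sqrt(64) = 8

raw_weights = softmax(raw_scores)
scaled_weights = softmax(scaled_scores)

print("raw   :", [round(w, 4) for w in raw_weights])
print("scaled:", [round(w, 4) for w in scaled_weights])
```

On the raw scores nearly all the weight lands on the largest entry, so the row is close to one-hot and gradients through softmax are small. After dividing by sqrt(d) the same ordering is kept but the weights are spread across positions, which is the regime the lesson wants at initialization.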