References: self-attention from scratch
Source material
Source curriculum (structural mirror, cited as further study):
• Andrej Karpathy, "Neural Networks: Zero to Hero", Lecture 7: "Let's build GPT: from scratch, in code, spelled out."
  Creator: Andrej Karpathy
  Video: https://www.youtube.com/watch?v=kCc8FmEb1nY
  Code repo (nanoGPT): https://github.com/karpathy/nanoGPT (MIT License)
  Companion repo (ng-video-lecture): https://github.com/karpathy/ng-video-lecture (no explicit license)
  Series page: https://karpathy.ai/zero-to-hero.html
  License: nanoGPT is MIT-licensed; the ng-video-lecture companion repo carries no explicit license; the video is under the standard YouTube license.

This lesson covers the first half of Lecture 7, where Karpathy builds self-attention from scratch (the crude average, then query/key/value attention with a causal mask) on a character-level Shakespeare GPT. Clawdemy's lessons are original prose following the pedagogical arc of this series; we do not reproduce or transcribe the video or code. The query-key dot-product example and the causal-mask worked example here are ours, built to be checkable by hand. All rights to the original video and code remain with the creator.

Watch this next
- Let’s build GPT: from scratch, in code, spelled out (Andrej Karpathy). The lecture this lesson mirrors. Karpathy starts from the bigram baseline, builds the crude averaging “communication,” then derives query-key-value self-attention with the causal mask, all live in code on a tiny Shakespeare dataset. Watching the masked attention matrix fill in as a lower triangle, and the generated text improve once attention is added, is the clearest way to make this lesson concrete.
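The lower-triangular pattern described above can be reproduced in a few lines. What follows is a minimal sketch in plain Python, not code from the lecture or from nanoGPT: the helper names `softmax` and `causal_attention_weights` and the all-zero toy score matrix are our own assumptions, chosen so the result is checkable by hand.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats (entries may be -inf)."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(scores):
    """Hypothetical helper: hide future positions (column j > row i), then softmax each row."""
    T = len(scores)
    masked = [
        [scores[i][j] if j <= i else float("-inf") for j in range(T)]
        for i in range(T)
    ]
    return [softmax(row) for row in masked]

# Toy affinities (an assumption for illustration): every score is zero,
# so each position weights all allowed positions equally.
T = 4
scores = [[0.0] * T for _ in range(T)]
weights = causal_attention_weights(scores)

for row in weights:
    print([round(w, 3) for w in row])
```

With all scores equal, row t spreads its weight uniformly over positions 0 through t and puts zero weight on everything after t, so the matrix is lower triangular and the masked softmax reduces to the crude average. Replacing the zeros with query-key dot products keeps the same triangle but makes the weights data-dependent.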
Going deeper
- Attention Is All You Need (Vaswani et al., 2017) (arXiv). The paper that introduced the transformer and the scaled dot-product attention this lesson builds. The phrase “attention is all you need” comes from its title; it is one of the most influential papers in modern AI.
- nanoGPT on GitHub (MIT License). Karpathy’s compact, readable GPT implementation, the cleaned-up version of what the lecture builds. The attention module is the part to read after this lesson.
Adjacent topics
Where this sits in the curriculum.
- How attention works (AI Foundations track). That lesson describes self-attention from the user’s side: the query-key-value library analogy, the similarity-scale-softmax formula, and self versus cross attention. This lesson builds the same mechanism from scratch and adds the piece specific to GPT: the causal mask that forces “past only.” Read together, one gives the intuition and the other the construction.
- The bigram model and BatchNorm lessons (this track). Attention reuses softmax from the bigram lesson (turning affinities into weights) and the scaling idea from the BatchNorm lesson (dividing affinities by sqrt(dimension) to keep softmax out of its saturated regime). If either step felt fast, those lessons are the grounding.
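To make the scaling remark concrete, here is a small sketch of what dividing by sqrt(dimension) does to the softmax output. It is illustrative only: the dimension of 64 and the four raw affinities are assumed values picked so the arithmetic is easy to follow, not numbers from the lecture.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

d = 64                           # assumed key/query dimension
raw = [8.0, 0.0, -8.0, 4.0]      # assumed unscaled query-key dot products
scaled = [s / math.sqrt(d) for s in raw]  # becomes [1.0, 0.0, -1.0, 0.5]

raw_weights = softmax(raw)
scaled_weights = softmax(scaled)

print("unscaled:", [round(w, 3) for w in raw_weights])
print("scaled:  ", [round(w, 3) for w in scaled_weights])
```

Without scaling, the largest affinity takes roughly 0.98 of the weight, which is close to a one-hot choice; after dividing by sqrt(64) = 8 it takes roughly 0.47, and the other positions still contribute. That flatter distribution is what "out of its saturated regime" refers to.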