Assembling and training the full GPT
What you’ll learn
This is lesson 2 of Phase 3 (Building a transformer) in the Build Neural Networks from Scratch track, which follows the arc of Andrej Karpathy’s Neural Networks: Zero to Hero series. The previous lesson built one self-attention computation; this lesson assembles it into the full GPT and trains it, the last build of the track.
You will add the pieces that turn attention into a transformer. Multi-head attention runs several attention heads in parallel, each free to track a different relationship between tokens, then concatenates their outputs. The transformer block pairs multi-head attention (communication: tokens gather context) with a per-token feed-forward layer (computation: each token processes what it gathered). Residual connections and layer normalization make a deep stack of these blocks trainable, and positional embeddings give the otherwise order-blind model a sense of token order. The full GPT consists of token and position embeddings feeding a stack of blocks, followed by a final softmax head; it is trained on next-token cross-entropy and generates text autoregressively. The lesson traces the shapes through a block and shows that this is, at a much smaller scale, essentially the same architecture that underlies modern GPT-style chatbots.
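The shape bookkeeping described above can be sketched without any tensors at all. The snippet below is a minimal illustration, not the lesson's implementation: the function name `trace_block` and all sizes are hypothetical, and it assumes a pre-norm layout (layer norm before each sublayer) and a feed-forward hidden width of four times the embedding width, as in nanoGPT.

```python
# Shape bookkeeping only: no tensors, just the tuples each step would produce.
# Assumptions (not from the lesson text): pre-norm ordering, 4x feed-forward width.

def trace_block(batch, time, n_embd, n_head):
    """Return (step, shape) pairs for one pass through a transformer block."""
    if n_embd % n_head != 0:
        raise ValueError("n_embd must split evenly across the heads")
    head_size = n_embd // n_head
    return [
        ("block input x", (batch, time, n_embd)),
        ("layer norm before attention", (batch, time, n_embd)),
        ("per head: query, key, value", (batch, time, head_size)),
        ("per head: attention scores", (batch, time, time)),
        ("per head: weighted sum of values", (batch, time, head_size)),
        ("concatenate all heads", (batch, time, n_head * head_size)),
        ("output projection, then add to x", (batch, time, n_embd)),
        ("layer norm before feed-forward", (batch, time, n_embd)),
        ("feed-forward hidden layer", (batch, time, 4 * n_embd)),
        ("feed-forward output, then add residual", (batch, time, n_embd)),
    ]

# Illustrative sizes: 4 sequences, 8 tokens each, width 32 split across 4 heads.
for step, shape in trace_block(batch=4, time=8, n_embd=32, n_head=4):
    print(f"{step:42s} {shape}")
```

Two things are worth noticing in the printed trace: each head works at width `n_embd // n_head`, so concatenating the heads restores the full width, and the block's output shape equals its input shape, which is what allows blocks to be stacked and residuals to be added.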
Where this fits
This is lesson 2 of Phase 3, Building a transformer. The previous lesson built self-attention; this lesson wraps it into multi-head attention and the full block, then assembles and trains the GPT. It is the synthesis of the whole track: layer normalization echoes the BatchNorm lesson, stacking blocks echoes the WaveNet depth idea, residual connections lean on the autograd lesson’s fact that addition passes gradients through, and training uses the cross-entropy gradient from the backprop-ninja lesson. The final lesson builds the one remaining piece, the tokenizer, that turns raw text into the tokens a GPT consumes.
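The point about addition passing gradients through can be written out in one line. In the sketch below the symbols are illustrative notation rather than the lesson's own: $x$ is the input to a sublayer, $F$ stands for either sublayer (attention or feed-forward), $y$ is the output of the residual connection, and $\mathcal{L}$ is the loss.

```latex
y = x + F(x)
\qquad\Longrightarrow\qquad
\frac{\partial \mathcal{L}}{\partial x}
  = \frac{\partial \mathcal{L}}{\partial y}
  + \frac{\partial \mathcal{L}}{\partial y}\,\frac{\partial F}{\partial x}
```

The first term is the upstream gradient copied through unchanged, so even when the second term is small, each block in a deep stack still passes a usable gradient to the blocks below it.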
Before you start
Prerequisite (within this track): lesson 8, Building GPT: self-attention from scratch. This lesson assembles the single self-attention computation from there into multi-head attention and the full transformer block, so you need to know what the query, key, value, and causal mask are. Several earlier lessons return as components: the feed-forward MLP (Phase 1), layer normalization as the cousin of batch normalization (lesson 5), the depth-through-stacking idea (lesson 7, WaveNet), and the cross-entropy training gradient (lesson 6). If you can follow self-attention as a step-by-step procedure, you are ready. No coding is required to follow along, though Karpathy’s nanoGPT is the clean implementation to read afterward.
By the end, you’ll be able to
- Explain multi-head attention and why it runs heads in parallel rather than adding depth
- Describe the transformer block as communication (attention) followed by computation (feed-forward)
- Explain why residual connections and layer normalization are what make a deep stack trainable
- Explain why a GPT needs positional embeddings, given that attention is order-blind
- Lay out the full GPT from embeddings to softmax and describe how it is trained and used to generate text
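As a preview of the last objective, the training loss and the generation loop can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the lesson's code: `toy_logits` is a hypothetical stand-in for a trained GPT's forward pass (it simply favors "previous token plus one"), and the vocabulary size, context length, and seed are arbitrary.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities (max subtracted for numerical stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Next-token loss: negative log-probability given to the true next token."""
    return -math.log(softmax(logits)[target])

def toy_logits(context, vocab_size):
    """Hypothetical stand-in for a GPT: favors (last token + 1) mod vocab_size."""
    logits = [0.0] * vocab_size
    logits[(context[-1] + 1) % vocab_size] = 5.0
    return logits

def generate(context, n_new, vocab_size, block_size, rng):
    """Autoregressive sampling: predict, sample one token, append, repeat."""
    tokens = list(context)
    for _ in range(n_new):
        window = tokens[-block_size:]  # the model only sees the last block_size tokens
        probs = softmax(toy_logits(window, vocab_size))
        next_token = rng.choices(range(vocab_size), weights=probs, k=1)[0]
        tokens.append(next_token)
    return tokens

# The loss is low when the favored token is the true next token, higher otherwise.
print(cross_entropy(toy_logits([1], 4), target=2))
print(cross_entropy(toy_logits([1], 4), target=0))
print(generate([0], n_new=8, vocab_size=4, block_size=3, rng=random.Random(0)))
```

Swapping `toy_logits` for a real model's forward pass leaves the loop unchanged: each sampled token is appended to the context and fed back in, which is what "autoregressive" means.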
Time and difficulty
- Read time: about 13 minutes
- Practice time: about 18 minutes (sizing a multi-head layer and tracing shapes through a block, optionally reading nanoGPT, plus flashcards)
- Difficulty: standard