Assembling the full GPT: brief

What you’ll learn

This is lesson 2 of Phase 3 (Building a transformer) in the Build Neural Networks from Scratch track, which follows the arc of Andrej Karpathy’s Neural Networks: Zero to Hero series. The previous lesson built one self-attention computation; this lesson assembles it into the full GPT and trains it. It is the last model build of the track; only the tokenizer lesson follows.

You will add the pieces that turn attention into a transformer. Multi-head attention runs several attention heads in parallel, each free to track a different relationship between tokens, then concatenates their outputs. The transformer block pairs multi-head attention (communication: tokens gather context) with a per-token feed-forward layer (computation: each token processes what it gathered). Residual connections and layer normalization make a deep stack of these blocks trainable, and positional embeddings give the otherwise order-blind model a sense of token order. The full GPT consists of token plus position embeddings feeding a stack of blocks, followed by a final softmax head. It is trained on next-token cross-entropy and generates text autoregressively, one token at a time. The lesson traces the shapes through a block and shows that this is, in essence, the decoder-only architecture behind modern chatbots.
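The shape trace through a block can be sketched in a few lines of plain Python. This is only bookkeeping under stated assumptions, not the lesson's own code: the helper name trace_block_shapes, the example sizes (batch 4, context 8, embedding width 32, 4 heads), and the 4x feed-forward expansion (the convention used in nanoGPT and GPT-2) are illustrative choices.

```python
def trace_block_shapes(batch, time, n_embd, n_head):
    """Return (stage, shape) pairs for one transformer block.

    Hypothetical helper for illustration: it only tracks shapes and
    performs no tensor math.
    """
    if n_embd % n_head != 0:
        raise ValueError("n_embd must be divisible by n_head")
    head_size = n_embd // n_head  # each head works in a narrower subspace
    return [
        ("token + position embeddings", (batch, time, n_embd)),
        ("query/key/value per head", (batch, n_head, time, head_size)),
        # One time-by-time score matrix per head; the causal mask hides the future.
        ("attention scores per head", (batch, n_head, time, time)),
        ("weighted values per head", (batch, n_head, time, head_size)),
        # Concatenating the heads restores the full embedding width.
        ("heads concatenated", (batch, time, n_head * head_size)),
        ("after first residual add", (batch, time, n_embd)),
        # Assumption: 4x hidden expansion in the feed-forward layer.
        ("feed-forward hidden layer", (batch, time, 4 * n_embd)),
        ("after second residual add", (batch, time, n_embd)),
    ]


# Illustrative sizes only: batch 4, context 8, width 32, 4 heads.
for stage, shape in trace_block_shapes(batch=4, time=8, n_embd=32, n_head=4):
    print(f"{stage:32s} {shape}")
```

Because the per-head width is n_embd // n_head, concatenating the heads returns to the original width, which is what lets each residual addition line up with the block's input. The block's output has the same shape as its input, so blocks can be stacked freely.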

Where this fits

This is lesson 2 of Phase 3, Building a transformer. The previous lesson built self-attention; this lesson wraps it into multi-head attention and the full block, then assembles and trains the GPT. It is the synthesis of the whole track: layer normalization echoes the BatchNorm lesson, stacking blocks echoes the WaveNet depth idea, residual connections lean on the autograd lesson’s fact that addition passes gradients through unchanged, and training uses the cross-entropy gradient from the backprop-ninja lesson. The final lesson builds the one remaining piece, the tokenizer, which turns raw text into the tokens a GPT consumes.

Before you start

Prerequisite (within this track): lesson 8, Building GPT: self-attention from scratch. This lesson assembles the single self-attention computation from there into multi-head attention and the full transformer block, so you need to know what query, key, value, and the causal mask are. Several earlier lessons return as components: the feed-forward MLP (Phase 1), layer normalization as the cousin of batch normalization (lesson 5), the depth-through-stacking idea (lesson 7, WaveNet), and the cross-entropy training gradient (lesson 6). If you can read self-attention as a procedure (compute queries, keys, and values; score queries against keys; apply the causal mask; take the softmax; average the values with the resulting weights), you are ready. No coding is required to follow along, though Karpathy’s nanoGPT is the clean implementation to read afterward.

By the end, you’ll be able to

Explain multi-head attention and why it runs heads in parallel rather than adding depth
Describe the transformer block as communication (attention) followed by computation (feed-forward)
Explain why residual connections and layer normalization are what make a deep stack trainable
Explain why a GPT needs positional embeddings, given that attention is order-blind
Lay out the full GPT from embeddings to softmax and describe how it is trained and used to generate text

Time and difficulty

Read time: about 13 minutes
Practice time: about 18 minutes (sizing a multi-head layer and tracing shapes through a block, optionally reading nanoGPT, plus flashcards)
Difficulty: standard