References: assembling and training the full GPT
Source material
Source curriculum (structural mirror, cited as further study):
• Andrej Karpathy, "Neural Networks: Zero to Hero", Lecture 7: "Let's build GPT: from scratch, in code, spelled out."
  Creator: Andrej Karpathy
  Video: https://www.youtube.com/watch?v=kCc8FmEb1nY
  Code repo (nanoGPT): https://github.com/karpathy/nanoGPT (MIT License)
  Companion repo (ng-video-lecture): https://github.com/karpathy/ng-video-lecture (no explicit license)
  Series page: https://karpathy.ai/zero-to-hero.html
  License: nanoGPT is MIT-licensed; the ng-video-lecture companion repo carries no explicit license; the video is under the standard YouTube license.

This lesson covers the second half of Lecture 7, where Karpathy assembles self-attention into multi-head attention, the transformer block (with residual connections and layer normalization), positional embeddings, and the full GPT, then trains it on character-level Shakespeare. Clawdemy's lessons are original prose following the pedagogical arc of this series; we do not reproduce or transcribe the video or code. The multi-head dimension split and the shape-trace example here are ours. All rights to the original video and code remain with the creator.

Watch this next
- Let’s build GPT: from scratch, in code, spelled out, by Andrej Karpathy. The lecture this lesson mirrors. After deriving self-attention (the previous lesson), Karpathy builds multi-head attention, adds the feed-forward layer, wraps both in residual connections and layer normalization, adds positional embeddings, stacks the blocks into the full GPT, and trains it until it generates Shakespeare-flavored text. Watching the loss fall and the samples sharpen as each piece is added (residuals, layer norm, more blocks) makes every part’s contribution visible.
Going deeper
- nanoGPT on GitHub (MIT License). Karpathy’s clean, minimal GPT, the production-quality version of what the lecture builds. Reading `model.py` after this lesson is the fastest way to confirm that a GPT really is just embeddings, a stack of blocks, a final norm, and a softmax head.
- Attention Is All You Need (Vaswani et al., 2017) (arXiv). The paper that introduced the transformer block, multi-head attention, and positional encodings this lesson assembles. The architecture you just built is a decoder-only variant of the one in this paper.
- Neural Networks: Zero to Hero (full series) by Andrej Karpathy. The final lecture builds the tokenizer, the piece that turns raw text into the token IDs a GPT consumes.
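Before opening any of these, it can help to keep the overall tensor shapes in mind. The sketch below only does shape bookkeeping (no real tensors) for the "embeddings, stack of blocks, final norm, head" structure described above. The function name `gpt_shape_trace`, the example sizes, and the 4x feed-forward width are illustrative assumptions, not values taken from nanoGPT or the lecture.

```python
def gpt_shape_trace(batch, seq_len, vocab_size, n_embd, n_head, n_layer):
    """Return (stage, shape) pairs for one forward pass of a decoder-only GPT.

    Hypothetical helper for illustration: it tracks shapes only and
    performs no numerical computation.
    """
    if n_embd % n_head != 0:
        raise ValueError("n_embd must be divisible by n_head")
    head_size = n_embd // n_head  # each head gets an equal slice of the channels

    trace = [
        ("token ids", (batch, seq_len)),
        ("token embedding", (batch, seq_len, n_embd)),
        ("position embedding (broadcast over batch)", (seq_len, n_embd)),
        ("sum of embeddings", (batch, seq_len, n_embd)),
    ]
    for i in range(n_layer):
        trace.append((f"block {i}: per-head q, k, v", (batch, n_head, seq_len, head_size)))
        trace.append((f"block {i}: attention weights", (batch, n_head, seq_len, seq_len)))
        trace.append((f"block {i}: heads concatenated", (batch, seq_len, n_embd)))
        # Assumption: the feed-forward layer widens to 4 * n_embd, a common choice.
        trace.append((f"block {i}: feed-forward hidden", (batch, seq_len, 4 * n_embd)))
        trace.append((f"block {i}: output", (batch, seq_len, n_embd)))
    trace.append(("final layer norm", (batch, seq_len, n_embd)))
    trace.append(("logits from linear head", (batch, seq_len, vocab_size)))
    return trace


# Illustrative sizes only.
trace = gpt_shape_trace(batch=4, seq_len=8, vocab_size=65, n_embd=32, n_head=4, n_layer=2)
for stage, shape in trace:
    print(f"{stage:45s} {shape}")
```

Every block takes and returns the same `(batch, seq_len, n_embd)` shape, which is what allows blocks to be stacked and wrapped in residual additions.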
Adjacent topics
Where this sits in the curriculum.
- The previous lesson (self-attention). This lesson assembles the single attention computation from there into multi-head attention and the full block. If the head-splitting felt fast, that lesson is the grounding for what each head does.
- BatchNorm, WaveNet, and the backprop-ninja lessons (this track). The full GPT is a synthesis of the whole track: layer normalization is the per-token cousin of batch normalization; stacking blocks is the depth idea from WaveNet; residual connections lean on the fact that addition passes gradients through unchanged (autograd lesson); and training runs on the `p - y` cross-entropy gradient from the backprop-ninja lesson. This lesson is where the pieces come together.
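As a quick refresher on the `p - y` gradient mentioned above, the sketch below checks numerically that the gradient of softmax cross-entropy with respect to the logits equals the softmax probabilities minus the one-hot target. It uses only the Python standard library; the function names and the example logits are illustrative assumptions, not code from the lecture.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability of the target class."""
    return -math.log(softmax(logits)[target])


def analytic_grad(logits, target):
    """Gradient of the loss w.r.t. the logits: p - y, with y one-hot."""
    p = softmax(logits)
    return [p_i - (1.0 if i == target else 0.0) for i, p_i in enumerate(p)]


def numeric_grad(logits, target, eps=1e-6):
    """Central-difference estimate of the same gradient."""
    grads = []
    for i in range(len(logits)):
        up = list(logits)
        up[i] += eps
        down = list(logits)
        down[i] -= eps
        grads.append((cross_entropy(up, target) - cross_entropy(down, target)) / (2 * eps))
    return grads


# Illustrative values: four classes, the correct class is index 2.
logits = [2.0, -1.0, 0.5, 0.0]
target = 2
analytic = analytic_grad(logits, target)
numeric = numeric_grad(logits, target)
max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
print("largest difference between analytic and numeric gradient:", max_diff)
```

The entries of `p - y` sum to zero: probability mass is pushed toward the target class and away from the others by the same total amount.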