
References: assembling and training the full GPT

Source curriculum (structural mirror, cited as further study):
• Andrej Karpathy, "Neural Networks: Zero to Hero", Lecture 7:
"Let's build GPT: from scratch, in code, spelled out."
Creator: Andrej Karpathy
Video: https://www.youtube.com/watch?v=kCc8FmEb1nY
Code repo (nanoGPT): https://github.com/karpathy/nanoGPT (MIT License)
Companion repo (ng-video-lecture): https://github.com/karpathy/ng-video-lecture (no explicit license)
Series page: https://karpathy.ai/zero-to-hero.html
License: nanoGPT is MIT-licensed; the ng-video-lecture companion repo carries no explicit license; the video is under the Standard YouTube License.
This lesson covers the second half of Lecture 7, where Karpathy assembles
self-attention into multi-head attention, the transformer block (with residual
connections and layer normalization), positional embeddings, and the full GPT,
then trains it on character-level Shakespeare. Clawdemy's lessons are original
prose following the pedagogical arc of this series; we do not reproduce or
transcribe the video or code. The multi-head dimension split and the
shape-trace example here are ours. All rights to the original video and code
remain with the creator.
  • Let’s build GPT: from scratch, in code, spelled out, by Andrej Karpathy. The lecture this lesson mirrors. After deriving self-attention (the previous lesson), Karpathy builds multi-head attention, adds the feed-forward layer, wraps both in residual connections and layer normalization, adds positional embeddings, stacks the blocks into the full GPT, and trains it until it generates Shakespeare-flavored text. Watching the loss fall and the samples sharpen as each piece is added (residuals, layer norm, more blocks) makes every part’s contribution visible.
  • nanoGPT on GitHub (MIT License). Karpathy’s clean, minimal GPT, the production-quality version of what the lecture builds. Reading model.py after this lesson is the fastest way to confirm that a GPT really is just embeddings, a stack of blocks, a final norm, and a linear head whose logits feed a softmax.

  • Attention Is All You Need (Vaswani et al., 2017) (arXiv). The paper that introduced the transformer block, multi-head attention, and positional encodings this lesson assembles. The architecture you just built is a decoder-only variant of the one in this paper.

  • Neural Networks: Zero to Hero (full series) by Andrej Karpathy. The final lecture builds the tokenizer, the piece that turns raw text into the token IDs a GPT consumes.
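
As a companion to the architecture summarized in the entries above, here is a minimal shape trace of one forward pass through a decoder-only GPT. It is a sketch under stated assumptions, not code from the lecture or from nanoGPT: the function name trace_shapes, the stage labels, and the hyperparameter values are ours, and the 4x feed-forward width is an assumption taken from the common transformer convention. It only does shape arithmetic, so no tensors or libraries are involved.

```python
def trace_shapes(B, T, n_embd, n_head, n_layer, vocab_size):
    """Return (stage, shape) pairs for one forward pass of a decoder-only GPT.

    B = batch size, T = sequence length, n_embd = embedding width.
    All names here are illustrative, not taken from any particular codebase.
    """
    # Multi-head split: the embedding width is divided evenly across heads.
    assert n_embd % n_head == 0, "n_embd must divide evenly across heads"
    head_size = n_embd // n_head

    trace = [
        ("token ids", (B, T)),
        ("token embedding", (B, T, n_embd)),
        ("position embedding (broadcast over batch)", (T, n_embd)),
        ("sum of embeddings", (B, T, n_embd)),
    ]
    for layer in range(n_layer):
        trace += [
            (f"block {layer}: q, k, v per head", (B, n_head, T, head_size)),
            (f"block {layer}: attention weights", (B, n_head, T, T)),
            # Concatenating the heads restores the full embedding width.
            (f"block {layer}: heads concatenated", (B, T, n_head * head_size)),
            (f"block {layer}: after attention residual", (B, T, n_embd)),
            # Assumed 4x expansion inside the feed-forward layer.
            (f"block {layer}: feed-forward hidden", (B, T, 4 * n_embd)),
            (f"block {layer}: after feed-forward residual", (B, T, n_embd)),
        ]
    trace += [
        ("final layer norm", (B, T, n_embd)),
        ("logits from linear head", (B, T, vocab_size)),
    ]
    return trace


# Hypothetical small configuration, chosen only for illustration.
for stage, shape in trace_shapes(B=4, T=8, n_embd=32, n_head=4,
                                 n_layer=2, vocab_size=65):
    print(f"{stage:45s} {shape}")
```

Every block takes in and hands back a tensor of shape (B, T, n_embd). That is why blocks can be stacked to any depth and why the residual additions always line up.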

Where this sits in the curriculum.

  • The previous lesson (self-attention). This lesson takes the single attention computation derived there and assembles it into multi-head attention and the full block. If the head-splitting felt fast, that lesson is the grounding for what each head does.

  • BatchNorm, WaveNet, and the backprop-ninja lessons (this track). The full GPT is a synthesis of the whole track: layer normalization is the per-token cousin of batch normalization; stacking blocks is the depth idea from WaveNet; residual connections lean on the fact that addition passes gradients through (autograd lesson); and training is the p - y cross-entropy gradient from the backprop-ninja lesson. This lesson is where the pieces come together.
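
Two of the connections above can be checked numerically in a few lines. The sketch below is ours, not code from the lecture: it uses only the Python standard library, and the helper names (softmax, cross_entropy, analytic_grad, numeric_grad, layer_norm) and the sample values are illustrative assumptions. It compares the p - y gradient of softmax cross-entropy against a finite-difference estimate, and it normalizes a single token's feature vector the way layer normalization does (learnable scale and shift omitted).

```python
import math


def softmax(logits):
    # Subtract the max for numerical stability; the result is unchanged.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    # Negative log-probability assigned to the correct class.
    return -math.log(softmax(logits)[target])


def analytic_grad(logits, target):
    # The p - y form: softmax probabilities minus the one-hot target.
    p = softmax(logits)
    return [p_i - (1.0 if i == target else 0.0) for i, p_i in enumerate(p)]


def numeric_grad(logits, target, eps=1e-6):
    # Central finite differences, one logit at a time.
    grad = []
    for i in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[i] += eps
        down[i] -= eps
        diff = cross_entropy(up, target) - cross_entropy(down, target)
        grad.append(diff / (2 * eps))
    return grad


def layer_norm(x, eps=1e-5):
    # Normalize across the features of ONE token. Batch normalization
    # would instead normalize each feature across the batch.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


logits, target = [2.0, -1.0, 0.5], 0  # hypothetical values
a = analytic_grad(logits, target)
n = numeric_grad(logits, target)
print("max |analytic - numeric|:", max(abs(x - y) for x, y in zip(a, n)))
print("layer-normed token:", layer_norm([1.0, 2.0, 3.0, 4.0]))
```

The printed gap between the two gradients should be tiny, and the layer-normed vector should have mean near zero and variance near one, whatever the rest of the batch contains.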