References: assembling and training the full GPT

Source material

Source curriculum (structural mirror, cited as further study):
• Andrej Karpathy, "Neural Networks: Zero to Hero", Lecture 7:
  "Let's build GPT: from scratch, in code, spelled out."
  Creator: Andrej Karpathy
  Video: https://www.youtube.com/watch?v=kCc8FmEb1nY
  Code repo (nanoGPT): https://github.com/karpathy/nanoGPT (MIT License)
  Companion repo (ng-video-lecture): https://github.com/karpathy/ng-video-lecture (no explicit license)
  Series page: https://karpathy.ai/zero-to-hero.html
  License: nanoGPT is MIT-licensed; the ng-video-lecture companion repo carries no explicit license; the video is under the Standard YouTube License.

This lesson covers the second half of Lecture 7, where Karpathy assembles
self-attention into multi-head attention, the transformer block (with residual
connections and layer normalization), positional embeddings, and the full GPT,
then trains it on character-level Shakespeare. Clawdemy's lessons are original
prose following the pedagogical arc of this series; we do not reproduce or
transcribe the video or code. The multi-head dimension split and the
shape-trace example here are ours. All rights to the original video and code
remain with the creator.

Watch this next

Let’s build GPT: from scratch, in code, spelled out, by Andrej Karpathy. The lecture this lesson mirrors. After deriving self-attention (the previous lesson), Karpathy builds multi-head attention, adds the feed-forward layer, wraps both in residual connections and layer normalization, adds positional embeddings, stacks the blocks into the full GPT, and trains it until it generates Shakespeare-flavored text. Watching the loss fall and the samples sharpen as each piece is added (residuals, layer norm, more blocks) makes every part’s contribution visible.

Going deeper

nanoGPT on GitHub (MIT License). Karpathy’s clean, minimal GPT, the production-quality version of what the lecture builds. Reading model.py after this lesson is the fastest way to confirm that a GPT really is just token and positional embeddings, a stack of blocks, a final layer norm, and a linear head whose logits feed a softmax.
Attention Is All You Need (Vaswani et al., 2017) (arXiv). The paper that introduced the transformer block, multi-head attention, and positional encodings this lesson assembles. The architecture you just built is a decoder-only variant of the one in this paper.
Neural Networks: Zero to Hero (full series) by Andrej Karpathy. The final lecture builds the tokenizer, the piece that turns raw text into the token IDs a GPT consumes.
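
To make the structure named in the nanoGPT entry concrete, the sketch below traces tensor shapes through a GPT-style stack without doing any real computation. The Config class, the shape_trace helper, and every hyperparameter value are hypothetical and chosen only for illustration; they are not taken from nanoGPT or the lecture code.

```python
from dataclasses import dataclass


@dataclass
class Config:
    # Hypothetical hyperparameters, small enough to read at a glance.
    vocab_size: int = 65
    block_size: int = 8   # context length T
    n_embd: int = 32      # embedding width C
    n_head: int = 4
    n_layer: int = 3


def shape_trace(cfg, batch):
    """Return (stage, shape) pairs for one forward pass. No math is performed."""
    # The multi-head split only works if the width divides evenly across heads.
    assert cfg.n_embd % cfg.n_head == 0, "n_embd must be divisible by n_head"
    head_size = cfg.n_embd // cfg.n_head
    B, T, C = batch, cfg.block_size, cfg.n_embd

    trace = [
        ("token ids", (B, T)),
        ("token + positional embeddings", (B, T, C)),
    ]
    for i in range(cfg.n_layer):
        # Each head works on a head_size-wide slice; concatenating n_head
        # slices restores width C, so the residual addition lines up.
        trace.append(
            (f"block {i}: {cfg.n_head} heads x {head_size} dims, concat + residual", (B, T, C))
        )
        trace.append((f"block {i}: feed-forward + residual", (B, T, C)))
    trace.append(("final layer norm", (B, T, C)))
    trace.append(("linear head (logits)", (B, T, cfg.vocab_size)))
    trace.append(("softmax over the vocabulary", (B, T, cfg.vocab_size)))
    return trace


if __name__ == "__main__":
    for stage, shape in shape_trace(Config(), batch=4):
        print(f"{stage:<55} {shape}")
```

Every block maps (B, T, C) to (B, T, C), which is why blocks can be stacked to any depth; only the final linear head changes the last dimension, from the embedding width to the vocabulary size.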

Adjacent topics

Where this sits in the curriculum.

The previous lesson (self-attention). This lesson assembles the single attention computation from there into multi-head attention and the full block. If the head-splitting felt fast, that lesson is the grounding for what each head does.
BatchNorm, WaveNet, and the backprop-ninja lessons (this track). The full GPT is a synthesis of the whole track: layer normalization is the per-token cousin of batch normalization; stacking blocks is the depth idea from WaveNet; residual connections lean on the fact that addition passes gradients through (autograd lesson); and training is the p - y cross-entropy gradient from the backprop-ninja lesson. This lesson is where the pieces come together.
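
As a quick refresher on the p - y gradient mentioned above, the sketch below compares the analytic gradient of softmax cross-entropy with respect to the logits against a finite-difference estimate. It is a minimal pure-Python check for a single example; the function names and the logit values are made up for illustration.

```python
import math


def softmax(logits):
    # Subtract the max for numerical stability; the result is unchanged.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    # Negative log-probability assigned to the correct class.
    return -math.log(softmax(logits)[target])


def analytic_grad(logits, target):
    # d(loss)/d(logits) = p - y, where y is the one-hot target vector.
    p = softmax(logits)
    return [p_i - (1.0 if i == target else 0.0) for i, p_i in enumerate(p)]


def numeric_grad(logits, target, eps=1e-6):
    # Central finite differences, one logit at a time.
    grad = []
    for i in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[i] += eps
        down[i] -= eps
        grad.append((cross_entropy(up, target) - cross_entropy(down, target)) / (2 * eps))
    return grad


logits = [2.0, -1.0, 0.5]  # arbitrary example values
target = 0
ga = analytic_grad(logits, target)
gn = numeric_grad(logits, target)

if __name__ == "__main__":
    for a, n in zip(ga, gn):
        print(f"analytic {a:+.6f}   numeric {n:+.6f}")
```

The two columns should agree to several decimal places, and the gradient entries sum to zero because both p and y sum to one.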