References: becoming a backprop ninja

Source curriculum (structural mirror, cited as further study):
• Andrej Karpathy, "Neural Networks: Zero to Hero", Lecture 5:
"Building makemore Part 4: Becoming a Backprop Ninja"
Creator: Andrej Karpathy
Video: https://www.youtube.com/watch?v=q8SA3rM6ckI
Code repo (makemore): https://github.com/karpathy/makemore (MIT License)
Series repo: https://github.com/karpathy/nn-zero-to-hero (MIT License)
Series page: https://karpathy.ai/zero-to-hero.html
License: makemore and the series code are MIT-licensed; the video is under the standard YouTube license.
This lesson covers Lecture 5, an exercise lecture in which Karpathy
backpropagates through the entire MLP language model by hand, without the
autograd engine. We mirror its pedagogical arc as a reading lesson: we walk the
single most instructive derivation (softmax + cross-entropy) and leave a second
one for the practice. Clawdemy's lessons are original prose; we do not reproduce
or transcribe the video or code. The worked derivations and numbers here are
ours, built to be checkable by hand. All rights to the original video and code
remain with the creator.
  • Building makemore Part 4: Becoming a Backprop Ninja, by Andrej Karpathy. The lecture this lesson mirrors. It is built as a guided exercise: Karpathy backpropagates by hand through every operation of the MLP language model, in stages, checking each gradient against the autograd engine’s answer. If this lesson made you want to do the full thing, the lecture and its companion notebook are exactly that: the most hands-on episode in the series. Doing it yourself, gradient by gradient, is how “backprop ninja” stops being a phrase and becomes a skill.
  • makemore on GitHub (MIT License). The Part 4 notebook contains the manual-backprop exercises, with cells to fill in and automatic checks against the engine. It is the practice version of this lesson at full scale.

  • Neural Networks: Zero to Hero (full series) and its code repo by Andrej Karpathy. The next lecture returns to architecture and restructures the MLP into a deeper, hierarchical model in the style of WaveNet.

Where this sits in the curriculum.

  • The autograd engine (lesson 1). This lesson is lesson 1 done by hand for a real network: the same local derivatives (add passes through, mul swaps inputs, tanh gives 1 - tanh^2) chained backward. Lesson 1 built the machine; this lesson shows you can be the machine.

  • The bigram and MLP language models (lessons 3 and 4). The softmax-plus-cross-entropy gradient derived here is exactly the loss those models train on, so this lesson opens the lid on the backward pass you have been running since the bigram model. The p - y result explains, concretely, what every training step has been doing.
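
The tanh local derivative and the p - y result named above can both be checked numerically. What follows is a minimal sketch, not code from the lecture or its notebook: it uses plain Python rather than the autograd engine, takes a central finite difference as the reference answer, and runs on made-up logits with a made-up target index. The function names (softmax, cross_entropy, analytic_grad, numeric_grad) are hypothetical, chosen only for illustration.

```python
import math

def softmax(logits):
    # Subtract the max logit for numerical stability; the result is unchanged.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    # Negative log-probability assigned to the correct class.
    return -math.log(softmax(logits)[target])

def analytic_grad(logits, target):
    # The hand-derived gradient of the loss w.r.t. the logits: p - y,
    # where y is the one-hot vector for the target class.
    p = softmax(logits)
    return [p_i - (1.0 if i == target else 0.0) for i, p_i in enumerate(p)]

def numeric_grad(logits, target, eps=1e-6):
    # Central finite difference, one logit at a time (stands in for autograd).
    grad = []
    for i in range(len(logits)):
        up = list(logits)
        up[i] += eps
        down = list(logits)
        down[i] -= eps
        diff = cross_entropy(up, target) - cross_entropy(down, target)
        grad.append(diff / (2 * eps))
    return grad

# Illustrative values only (assumptions, not numbers from the lesson).
logits = [2.0, -1.0, 0.5]
target = 0

g_analytic = analytic_grad(logits, target)
g_numeric = numeric_grad(logits, target)
max_diff = max(abs(a - n) for a, n in zip(g_analytic, g_numeric))

# Same kind of check for one local rule: d/dx tanh(x) = 1 - tanh(x)^2.
x = 0.7
eps = 1e-6
tanh_analytic = 1 - math.tanh(x) ** 2
tanh_numeric = (math.tanh(x + eps) - math.tanh(x - eps)) / (2 * eps)
tanh_diff = abs(tanh_analytic - tanh_numeric)

print("p - y:          ", g_analytic)
print("finite diff:    ", g_numeric)
print("max difference: ", max_diff)
print("tanh difference:", tanh_diff)
```

For these example logits the gradient entry at the target index is negative, the other entries are positive, and the entries sum to zero. A gradient-descent step therefore raises the correct logit and lowers the others, each in proportion to the probability the model assigned to it.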