Becoming a backprop ninja: gradients by hand
What you’ll learn
This is lesson 4 of Phase 2 (Building a language model) in the Build Neural Networks from Scratch track, which follows the arc of Andrej Karpathy’s Neural Networks: Zero to Hero series. Every gradient so far came from the autograd engine; you called backward() and trusted the result. This lesson removes the safety net.
You will backpropagate by hand, which is just the chain rule from lesson 1 (each operation’s local derivative times the incoming gradient, walked backward) applied to a real network instead of toy nodes. The lesson walks the single most valuable derivation, the softmax-plus-cross-entropy output every classifier and language model shares, and arrives at one of the cleanest results in the field: the gradient on the logits is dL/dz_i = p_i - y_i, the predicted probability minus the true label. It works the derivation on numbers (logits [2, 1, 0], correct class 0, giving gradient [-0.335, 0.245, 0.090]), verifies it with the sum-to-zero check and a loss-reducing gradient step, and shows this is the exact signal that trains a model like GPT. The practice then has you do a second manual backward pass, through a tanh neuron and its linear step, yourself.
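The worked numbers above can be reproduced in a few lines. The following is a minimal sketch in plain Python, not the lesson's or the notebook's code: the helper names (softmax, cross_entropy, grad_logits) and the step size of 0.1 are assumptions chosen for illustration.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability; the probabilities are unchanged.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    # Negative log-probability assigned to the correct class.
    return -math.log(softmax(logits)[target])

def grad_logits(logits, target):
    # dL/dz_i = p_i - y_i, where y is the one-hot true label.
    probs = softmax(logits)
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]

logits = [2.0, 1.0, 0.0]
target = 0  # correct class

grad = grad_logits(logits, target)
loss_before = cross_entropy(logits, target)

# Sum-to-zero check: p sums to 1 and the one-hot y sums to 1, so p - y sums to 0.
grad_sum = sum(grad)

# One gradient-descent step on the logits (step size is an arbitrary assumption).
step = 0.1
new_logits = [z - step * g for z, g in zip(logits, grad)]
loss_after = cross_entropy(new_logits, target)

print([round(g, 3) for g in grad])  # [-0.335, 0.245, 0.09]
print(abs(grad_sum) < 1e-12)        # True
print(loss_after < loss_before)     # True
```

The negative entry sits on the correct class, so stepping against the gradient raises that logit and lowers the other two, which is why the loss drops.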
Where this fits
This is lesson 4 of Phase 2, Building a language model. The source lecture is an exercise (backprop through the whole MLP by hand); we mirror its arc as a reading lesson by walking one representative derivation and leaving a second for the practice. It is the capstone of the engine half of the track: lesson 1 built the autograd engine, and this lesson shows you can be the engine, computing the gradients yourself. The next lesson returns to architecture, restructuring the flat MLP into a deeper, hierarchical model in the style of WaveNet.
Before you start
Prerequisites (within this track): lesson 1, Building an autograd engine: micrograd (this lesson is that lesson done by hand: the same local derivatives, add passes through, mul swaps inputs, tanh gives 1 - tanh^2, chained backward), and lesson 4, Giving the model memory: the MLP language model (the network whose backward pass this opens). You should be comfortable that backward() multiplies each incoming gradient by a local derivative, and know that the model ends in a softmax and a negative-log-likelihood loss. If those read as familiar, you are ready. No coding is required to follow the lesson, though the makemore Part 4 notebook (MIT-licensed) is the full hands-on version of the exercise.
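As a refresher on those three local derivatives, here is a minimal sketch of a manual backward pass through a single two-input tanh neuron, checked against finite differences. The input values, the variable names, and the numeric helper are all assumptions chosen for illustration; none of them come from the lesson or the notebook.

```python
import math

def forward(x1, x2, w1, w2, b):
    # Linear step followed by tanh: out = tanh(x1*w1 + x2*w2 + b).
    return math.tanh(x1 * w1 + x2 * w2 + b)

# Illustrative values (assumptions, not from the lesson).
x1, x2, w1, w2, b = 1.0, -2.0, 0.5, 0.25, 0.1

# Forward pass, keeping the output that the backward pass needs.
n = x1 * w1 + x2 * w2 + b
out = math.tanh(n)

# Backward pass, starting from d(out)/d(out) = 1 and walking backward.
d_out = 1.0
d_n = (1.0 - out ** 2) * d_out  # tanh: local derivative is 1 - tanh^2
d_b = d_n                       # add: the gradient passes through unchanged
d_w1 = x1 * d_n                 # mul: each factor receives the other factor
d_w2 = x2 * d_n
d_x1 = w1 * d_n
d_x2 = w2 * d_n

def numeric(i, h=1e-6):
    # Central finite difference with respect to the i-th argument of forward.
    args = [x1, x2, w1, w2, b]
    hi = list(args)
    lo = list(args)
    hi[i] += h
    lo[i] -= h
    return (forward(*hi) - forward(*lo)) / (2 * h)

manual = [d_x1, d_x2, d_w1, d_w2, d_b]
max_error = max(abs(m - numeric(i)) for i, m in enumerate(manual))
print(max_error < 1e-6)  # True
```

Comparing hand-derived gradients with a finite-difference estimate is a cheap way to catch a sign or indexing slip before trusting a derivation.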
By the end, you’ll be able to
- Explain what backpropagating by hand means and how it is the same chain rule the autograd engine runs
- State and interpret the softmax-plus-cross-entropy gradient on the logits, dL/dz = p - y (predicted minus true)
- Derive and verify that gradient on a small worked example, including the sum-to-zero sanity check and a loss-reducing gradient step
- Backpropagate by hand through a tanh neuron and a linear step, chaining their local derivatives
- Recognize that p - y is the exact training signal of every classifier and large language model, and why owning it lets you debug training
Time and difficulty
- Read time: about 12 minutes
- Practice time: about 20 minutes (a manual backward pass through a tanh neuron and its linear step, optionally the full makemore notebook, plus flashcards)
- Difficulty: standard