Backprop by hand: brief

What you’ll learn

This is lesson 4 of Phase 2 (Building a language model) in the Build Neural Networks from Scratch track, which follows the arc of Andrej Karpathy’s Neural Networks: Zero to Hero series. Every gradient so far came from the autograd engine; you called backward() and trusted the result. This lesson removes the safety net.

You will backpropagate by hand. This is the chain rule from lesson 1 (each operation’s local derivative times the incoming gradient, walked backward) applied to a real network instead of toy nodes. The lesson walks through the single most valuable derivation, the softmax-plus-cross-entropy output that classifiers and language models share, and arrives at one of the cleanest results in the field: the gradient of the loss with respect to the logits is dL/dz_i = p_i - y_i, the predicted probability minus the one-hot true label. It works the derivation on numbers (logits [2, 1, 0], correct class 0, softmax probabilities of about [0.665, 0.245, 0.090], giving gradient [-0.335, 0.245, 0.090]), verifies it with the sum-to-zero check (the components of p - y always sum to zero because p and y each sum to one) and a loss-reducing gradient step, and shows that this is the same signal that trains a model like GPT. In the practice, you then do a second manual backward pass yourself, through a tanh neuron and its linear step.
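
The worked numbers above can be reproduced in a few lines. This is a minimal sketch using only the Python standard library; the function names and the learning rate of 0.1 are illustrative choices made here, not taken from the lesson or the makemore notebook.

```python
import math

def softmax(logits):
    # Subtracting the max keeps exp() from overflowing; the result is unchanged.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    # Loss is the negative log-probability assigned to the correct class.
    return -math.log(softmax(logits)[target])

def logit_gradient(logits, target):
    # dL/dz_i = p_i - y_i, where y is one-hot at the target index.
    p = softmax(logits)
    return [p_i - (1.0 if i == target else 0.0) for i, p_i in enumerate(p)]

logits = [2.0, 1.0, 0.0]
target = 0

grad = logit_gradient(logits, target)
print([round(g, 3) for g in grad])  # [-0.335, 0.245, 0.09]

# Sum-to-zero check: p sums to 1 and y sums to 1, so p - y sums to 0.
grad_sum = sum(grad)

# One gradient-descent step on the logits (learning rate is an assumed value).
lr = 0.1
stepped = [z - lr * g for z, g in zip(logits, grad)]
loss_before = cross_entropy(logits, target)
loss_after = cross_entropy(stepped, target)
print(loss_after < loss_before)  # True
```

The step raises the logit of the correct class and lowers the other two, which is why the loss falls.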

Where this fits

This is lesson 4 of Phase 2, Building a language model. The source lecture is an exercise (backprop through the whole MLP by hand); we mirror its arc as a reading lesson by walking through one representative derivation and leaving a second for the practice. It is the capstone of the engine half of the track: lesson 1 built the autograd engine, and this lesson shows that you can be the engine, computing the gradients yourself. The next lesson returns to architecture, restructuring the flat MLP into a deeper, hierarchical model in the style of WaveNet.

Before you start

Prerequisites (within this track): lesson 1, Building an autograd engine: micrograd, and lesson 4, Giving the model memory: the MLP language model. This lesson is lesson 1 done by hand, using the same local derivatives chained backward: addition passes the gradient through unchanged, multiplication scales it by the other input, and tanh contributes 1 - tanh^2. The MLP language model is the network whose backward pass this lesson opens up. You should be comfortable that backward() multiplies each incoming gradient by a local derivative, and know that the model ends in a softmax and a negative-log-likelihood loss. If both sound familiar, you are ready. No coding is required to follow the lesson, though the makemore Part 4 notebook (MIT-licensed) is the full hands-on version of the exercise.
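
As a quick self-test of those prerequisites, here is a sketch of a manual backward pass through a single tanh neuron, checked against a finite-difference estimate. The input, weight, and bias values are hypothetical, chosen only for illustration, and the two-input neuron is an assumed shape rather than the one used in the practice.

```python
import math

def neuron(x1, x2, w1, w2, b):
    # Linear step followed by tanh: out = tanh(x1*w1 + x2*w2 + b)
    return math.tanh(x1 * w1 + x2 * w2 + b)

# Hypothetical values, chosen only for illustration.
x1, x2, w1, w2, b = 1.0, -2.0, 0.5, -0.25, 0.1

# Forward pass.
out = neuron(x1, x2, w1, w2, b)

# Backward pass by hand, starting from d(out)/d(out) = 1.
d_out = 1.0
d_n = (1.0 - out ** 2) * d_out  # tanh: local derivative is 1 - tanh^2
d_b = d_n                       # add: gradient passes through unchanged
d_x1w1 = d_n                    # add: gradient passes through unchanged
d_x2w2 = d_n
d_w1 = x1 * d_x1w1              # mul: gradient is scaled by the other input
d_x1 = w1 * d_x1w1
d_w2 = x2 * d_x2w2
d_x2 = w2 * d_x2w2

manual = [d_x1, d_x2, d_w1, d_w2, d_b]

def numeric_gradient(index, h=1e-6):
    # Central difference on one argument of neuron(), holding the rest fixed.
    args = [x1, x2, w1, w2, b]
    up, down = list(args), list(args)
    up[index] += h
    down[index] -= h
    return (neuron(*up) - neuron(*down)) / (2 * h)

numeric = [numeric_gradient(i) for i in range(5)]
max_error = max(abs(m - n) for m, n in zip(manual, numeric))
print(max_error < 1e-6)  # True
```

If each line of the backward pass reads as obvious, the lesson's derivation will too.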

By the end, you’ll be able to

Explain what backpropagating by hand means and how it is the same chain rule the autograd engine runs
State and interpret the softmax-plus-cross-entropy gradient on the logits, dL/dz = p - y (predicted minus true)
Derive and verify that gradient on a small worked example, including the sum-to-zero sanity check and a loss-reducing gradient step
Backpropagate by hand through a tanh neuron and a linear step, chaining their local derivatives
Recognize that p - y is the exact training signal of softmax classifiers and large language models, and why owning it lets you debug training
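
For reference, the result named in these objectives follows in three short steps. This is a compact sketch of the standard derivation, assuming a single example with logits z, softmax probabilities p, correct class t, and one-hot label y (so y_i is 1 when i = t and 0 otherwise); the lesson's own notation may differ.

```latex
% Softmax and the negative log-likelihood of the correct class t
p_i = \frac{e^{z_i}}{\sum_j e^{z_j}}, \qquad
L = -\log p_t = -z_t + \log \sum_j e^{z_j}

% Differentiate with respect to each logit z_i
\frac{\partial L}{\partial z_i}
  = -\,\mathbb{1}[i = t] + \frac{e^{z_i}}{\sum_j e^{z_j}}
  = p_i - y_i

% Sum-to-zero check: p and y each sum to one
\sum_i \frac{\partial L}{\partial z_i} = \sum_i p_i - \sum_i y_i = 1 - 1 = 0
```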

Time and difficulty

Read time: about 12 minutes
Practice time: about 20 minutes (a manual backward pass through a tanh neuron and its linear step, optionally the full makemore notebook, plus flashcards)
Difficulty: standard