Summary: becoming a backprop ninja

TL;DR. Until now the autograd engine computed every gradient and you trusted it. This lesson removes the safety net and backpropagates by hand, applying the same chain rule from lesson 1 (local derivative times incoming gradient, walked backward) to a real network. It walks through the single most important derivation, softmax followed by cross-entropy, and lands on a famously clean result: the gradient on the logits is dL/dz_i = p_i - y_i, the predicted probability minus the one-hot true label. That is the signal at the top of the backward pass for softmax classifiers and language models alike. Owning it turns the engine from a black box into something you can reason about and debug.
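For reference, the derivation behind p_i - y_i fits in a few lines. This is a sketch under two assumptions: the label y is one-hot (so its entries sum to 1), and the loss is taken for a single example. The symbol \delta_{ki} (the Kronecker delta, 1 when k = i and 0 otherwise) is notation introduced here, not taken from the lesson.

```latex
\begin{aligned}
p_k &= \frac{e^{z_k}}{\sum_j e^{z_j}}, \qquad
L = -\sum_k y_k \log p_k \\[4pt]
\frac{\partial \log p_k}{\partial z_i}
  &= \frac{\partial}{\partial z_i}\Bigl(z_k - \log \sum_j e^{z_j}\Bigr)
   = \delta_{ki} - p_i \\[4pt]
\frac{\partial L}{\partial z_i}
  &= -\sum_k y_k \,(\delta_{ki} - p_i)
   = -y_i + p_i \sum_k y_k
   = p_i - y_i
\end{aligned}
```

The last step is where the one-hot assumption is used: because the y_k sum to 1, the p_i term comes out of the sum unchanged.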

  • By-hand backprop is the chain rule, not a new algorithm. Each operation has a local derivative; multiply the incoming gradient by it and walk backward. The engine automates this; doing it yourself for the key pieces is how you understand and debug gradient flow.

  • The gradient to know by heart is softmax-plus-cross-entropy. A classifier ends with logits, then softmax into probabilities, then a cross-entropy (negative log likelihood) loss against the true class. The gradient of that loss with respect to the logits is p_i - y_i.

  • p - y is just common sense made exact. The gradient is negative on the correct class (so descent pushes its logit up) and positive on the wrong classes (so they are pushed down), in proportion to the misplaced probability mass. Worked once: logits [2, 1, 0] with class 0 correct give p = [0.665, 0.245, 0.090], loss 0.41, gradient [-0.335, 0.245, 0.090]; one descent step with learning rate 1 lowers the loss to 0.26.

  • Sanity checks a ninja uses. Softmax-plus-cross-entropy gradients sum to zero; a small enough gradient step should lower the loss; the starting loss should sit near the uniform baseline -log(1/V) = log(V) for V classes. Each one catches a class of bug instantly.

  • This is the real training signal. Next-token prediction is softmax-plus-cross-entropy over the vocabulary, so “predicted probabilities minus the one-hot true token” is the number at the top of the backward pass on every training step of a model like GPT.
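The worked example and the sanity checks above can be reproduced in a few lines. What follows is a minimal sketch in plain Python rather than the lesson's own code: the helper names (`softmax`, `cross_entropy`, `analytic_grad`, `numeric_grad`) and the learning rate of 1.0 are assumptions chosen for illustration, and the finite-difference check is an extra verification step added here.

```python
import math

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    # Negative log likelihood of the true class.
    return -math.log(softmax(logits)[target])

def analytic_grad(logits, target):
    # The by-hand result: p_i - y_i, with y one-hot at `target`.
    p = softmax(logits)
    return [p_i - (1.0 if i == target else 0.0) for i, p_i in enumerate(p)]

def numeric_grad(logits, target, h=1e-5):
    # Central finite differences, used only to check the analytic gradient.
    grads = []
    for i in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[i] += h
        down[i] -= h
        diff = cross_entropy(up, target) - cross_entropy(down, target)
        grads.append(diff / (2 * h))
    return grads

logits = [2.0, 1.0, 0.0]
target = 0

p = softmax(logits)                      # roughly [0.665, 0.245, 0.090]
loss = cross_entropy(logits, target)     # roughly 0.41
grad = analytic_grad(logits, target)     # roughly [-0.335, 0.245, 0.090]
num = numeric_grad(logits, target)

# Check 1: the gradient sums to zero (up to floating-point error).
grad_sum = sum(grad)

# Check 2: one descent step lowers the loss. lr = 1.0 is an assumed value.
lr = 1.0
stepped = [z - lr * g for z, g in zip(logits, grad)]
new_loss = cross_entropy(stepped, target)  # roughly 0.26

# Check 3: equal logits give the uniform baseline log(V), here V = 3.
uniform_loss = cross_entropy([0.0, 0.0, 0.0], target)

print("p        =", [round(x, 3) for x in p])
print("loss     =", round(loss, 2))
print("grad     =", [round(g, 3) for g in grad])
print("numeric  =", [round(g, 3) for g in num])
print("grad sum =", grad_sum)
print("new loss =", round(new_loss, 2))
print("uniform  =", uniform_loss, "vs log(3) =", math.log(3))
```

If the analytic and numeric gradients disagree, or the gradient does not sum to zero, the bug is in the backward pass rather than in the data or the optimizer.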

The backward pass stops being something a library does behind a curtain and becomes something you can compute, check, and fix. You can sanity-check a training setup, reason about what the loss is actually pushing a model toward (more probability on the truth, less on everything else), and catch one of the most common silent bugs in machine learning, a wrong gradient, because you know what the right one should be. With the engine fully demystified, the next lesson returns to architecture and restructures the flat MLP into a deeper, hierarchical model in the style of WaveNet, so the network builds up its understanding in stages.