Cheatsheet: What backpropagation is really doing
The one idea that matters
backpropagation = compute the gradient (every knob's wish) in one backward sweep, by passing "what each layer wants" back to the layer before it

Backprop is not training. It is the gradient-computation step that gradient descent (L7) then uses.
Why not brute force
Nudging each knob and re-running the network would cost ~13,000 forward passes per image. Hopeless. Backprop gets all ~13,000 wishes at once, for about the cost of one extra pass.
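To make the brute-force cost concrete, here is a minimal sketch of the nudge-and-re-run approach on a tiny made-up network (2 inputs, 2 hidden neurons, 1 output, 9 knobs). The network shape, the squared-error cost, and every function name are illustrative assumptions, not taken from the lesson. The point is only the pass count: one baseline pass plus one re-run per knob.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(knobs, x):
    """Hypothetical tiny network: 2 inputs -> 2 hidden neurons -> 1 output.
    `knobs` is a flat list of 9 numbers (all weights and biases)."""
    w1 = knobs[0:4]   # hidden-layer weights, row-major 2x2
    b1 = knobs[4:6]   # hidden-layer biases
    w2 = knobs[6:8]   # output weights
    b2 = knobs[8]     # output bias
    h = [sigmoid(w1[2 * j] * x[0] + w1[2 * j + 1] * x[1] + b1[j]) for j in range(2)]
    return sigmoid(w2[0] * h[0] + w2[1] * h[1] + b2)

def cost(knobs, x, target):
    # Assumed cost: squared error on the single output.
    return (forward(knobs, x) - target) ** 2

def brute_force_gradient(knobs, x, target, eps=1e-6):
    """Estimate each knob's slope by nudging it and re-running the network."""
    base = cost(knobs, x, target)   # baseline forward pass
    passes = 1
    grad = []
    for i in range(len(knobs)):
        nudged = list(knobs)
        nudged[i] += eps
        grad.append((cost(nudged, x, target) - base) / eps)
        passes += 1                 # one extra forward pass per knob
    return grad, passes

knobs = [0.1, -0.2, 0.4, 0.3, 0.0, 0.1, 0.5, -0.6, 0.2]
grad, passes = brute_force_gradient(knobs, [1.0, 0.5], 1.0)
print(passes)  # 10: one baseline pass + 9 nudged passes
```

The pass count grows in lockstep with the knob count, which is why this approach stops being usable at ~13,000 knobs.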
The intuition, in four moves
- Read the output’s wishes. Each output neuron wants to move toward its desired value. (Image is “3”: the “3” neuron wants up, an over-firing “4” wants down, the rest are roughly content.)
- Three ways to grant a wish (from the L3 neuron formula): raise the bias, raise weights on already-bright inputs, or wish the previous layer were more/less active.
- The third wish rolls backward. A neuron cannot set earlier activations directly, so its wish becomes a request to the previous layer. Summed over all neurons, each previous-layer neuron gets one net wish.
- Repeat to the front. Those become the previous layer’s desires, propagated backward layer by layer. The literal meaning of “backpropagation.”
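The four moves above can be sketched by hand on a small network. This is a rough illustration under stated assumptions: a 2-2-2 network with sigmoid neurons, a sum-of-squares cost, and made-up weights, none of which come from the lesson. The code stores derivatives of the cost (`grad_*`); a "wish" in the text is the negative of the derivative, so a negative derivative means "wants up".

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(W, b, a_prev):
    """Activations of one layer: a_j = sigmoid(sum_k W[j][k] * a_prev[k] + b[j])."""
    return [sigmoid(sum(w * a for w, a in zip(row, a_prev)) + bj)
            for row, bj in zip(W, b)]

def cost(W1, b1, W2, b2, x, y):
    # Assumed cost: sum of squared differences between output and target.
    out = layer_forward(W2, b2, layer_forward(W1, b1, x))
    return sum((o - t) ** 2 for o, t in zip(out, y))

def layer_backward(W, a_prev, a, grad_a):
    """Turn dC/da for one layer into dC/db, dC/dW and dC/da_prev."""
    # dC/dz per neuron, using sigmoid'(z) = a * (1 - a).
    delta = [g * aj * (1.0 - aj) for g, aj in zip(grad_a, a)]
    # Way 1: the bias.
    grad_b = list(delta)
    # Way 2: the weights, scaled by how bright each incoming activation is.
    grad_W = [[d * ap for ap in a_prev] for d in delta]
    # Way 3: a request to the previous layer, summed over this layer's neurons.
    grad_prev = [sum(delta[j] * W[j][k] for j in range(len(W)))
                 for k in range(len(a_prev))]
    return grad_W, grad_b, grad_prev

# Made-up input, target and parameters (illustrative only).
x, y = [1.0, 0.5], [1.0, 0.0]
W1, b1 = [[0.1, -0.2], [0.4, 0.3]], [0.0, 0.1]
W2, b2 = [[0.5, -0.6], [0.2, 0.7]], [0.2, -0.1]

h = layer_forward(W1, b1, x)      # hidden activations
out = layer_forward(W2, b2, h)    # output activations

# Move 1: read the output's wishes (dC/d_out for the squared-error cost).
grad_out = [2.0 * (o - t) for o, t in zip(out, y)]
# Moves 2 and 3: grant the wishes at the output layer, pass the rest back.
grad_W2, grad_b2, grad_h = layer_backward(W2, h, out, grad_out)
# Move 4: repeat one layer closer to the front.
grad_W1, grad_b1, _ = layer_backward(W1, x, h, grad_h)
```

Here the first output neuron has target 1, so its derivative is negative (it wants up), while the second has target 0 and a positive derivative (it wants down). Comparing any entry of `grad_W1` against a nudge-and-re-run estimate using `cost` is a quick sanity check that the backward sweep is correct.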
The full training loop
1. Forward pass: run an image through → output
2. Cost: compare output to desired answer
3. Backward pass: backprop → every knob's wish (the gradient)
4. Step: nudge every knob per the gradient (gradient descent)
5. Next example, repeat

The averaging effect
One image’s wishes are self-serving (a “3” wants “see everything as 3”). The real step averages wishes over many examples: consistent pulls survive, quirks cancel. This is why learning needs lots of data.
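The training loop and the averaging described above can be sketched together. To keep it short, this assumes a single sigmoid neuron rather than a full network, a toy dataset (label 1 when the first input exceeds the second), and arbitrary choices for batch size, learning rate and epoch count. All names here are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def example_gradient(w, b, x, y):
    """Steps 1-3 for one example on a single sigmoid neuron."""
    a = sigmoid(w[0] * x[0] + w[1] * x[1] + b)   # 1. forward pass
    c = (a - y) ** 2                             # 2. cost
    delta = 2.0 * (a - y) * a * (1.0 - a)        # 3. backward pass: dC/dz
    return [delta * x[0], delta * x[1]], delta, c

def average_cost(w, b, data):
    return sum(example_gradient(w, b, x, y)[2] for x, y in data) / len(data)

def train(data, epochs=200, batch_size=8, lr=1.0, seed=0):
    rng = random.Random(seed)
    data = list(data)                 # copy so the caller's list is not shuffled
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            gw, gb = [0.0, 0.0], 0.0
            for x, y in batch:        # 5. next example, repeat
                dw, db, _ = example_gradient(w, b, x, y)
                gw = [g + d for g, d in zip(gw, dw)]
                gb += db
            n = len(batch)
            # 4. step: move against the gradient averaged over the batch,
            # so one example's quirks are diluted by the others.
            w = [wi - lr * g / n for wi, g in zip(w, gw)]
            b -= lr * gb / n
    return w, b

# Toy data (made up): label is 1 when the first input is larger.
rng = random.Random(1)
points = [[rng.random(), rng.random()] for _ in range(64)]
data = [(p, 1.0 if p[0] > p[1] else 0.0) for p in points]

before = average_cost([0.0, 0.0], 0.0, data)
w, b = train(data)
after = average_cost(w, b, data)
print(before, after)   # the average cost should drop after training
```

Setting `batch_size=1` in this sketch steps on each example's own wishes; larger batches average them before stepping.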
Why it is efficient
| Approach | Cost to get the gradient |
|---|---|
| Nudge each knob, re-run | ~13,000 forward passes per image |
| Backpropagation | 1 forward + 1 backward pass (~2x one pass) |
That efficiency is what makes training large networks possible at all.
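As a rough check on the numbers in the table, the arithmetic below assumes layer sizes of 784, 16, 16 and 10. That architecture is an assumption (the cheatsheet only says "~13,000"); it is a common layout for this kind of digit-recognition example and happens to give a matching knob count.

```python
# Assumed layer sizes (hypothetical; not stated in the cheatsheet).
layer_sizes = [784, 16, 16, 10]

# One weight per connection between adjacent layers, one bias per non-input neuron.
weights = sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
biases = sum(layer_sizes[1:])
knobs = weights + biases

brute_force_passes = knobs + 1   # one baseline pass plus one re-run per nudged knob
backprop_passes = 2              # one forward pass plus a backward pass of similar cost

print(knobs)                                 # 13002
print(brute_force_passes / backprop_passes)  # 6501.5
```

Under these assumptions backprop is several thousand times cheaper per image, and the gap widens as the network grows, because the brute-force cost scales with the number of knobs while backprop stays at roughly two passes.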
Pitfalls to dodge
- “Backprop is the learning.” No. It computes the gradient; gradient descent takes the step.
- “One example trains the network.” No. Wishes are averaged over many; consistent patterns win.
- “The flow is forward.” No. The output is computed forward; the gradient is computed backward, from the output layer toward the input.
- “Neurons set earlier activations.” No. They adjust their own bias and weights and wish the previous layer were different. That wish is what propagates back.
The one-line version
Backpropagation is the network passing the blame backward: each layer tells the one before it how it wishes things had been different, in a single sweep.