Backpropagation: cheatsheet

The one idea that matters

backpropagation = compute the gradient (every knob's wish) in one backward sweep
                  by passing "what each layer wants" back to the layer before it

Backprop is not training. It is the gradient-computation step that gradient descent (L7) then uses.

Why not brute force

Nudging each knob and re-running the network would cost ~13,000 forward passes per image. Hopeless. Backprop gets all ~13,000 wishes at once, for about the cost of one extra pass.

The intuition, in four moves

Read the output’s wishes. Each output neuron wants to move toward its desired value. (Image is “3”: the “3” neuron wants up, an over-firing “4” wants down, the rest are roughly content.)
Three ways to grant a wish (from the L3 neuron formula): raise the bias, raise weights on already-bright inputs, or wish the previous layer were more/less active.
The third wish rolls backward. A neuron cannot set earlier activations directly, so its wish becomes a request to the previous layer. Summed over all neurons, each previous-layer neuron gets one net wish.
Repeat to the front. Those become the previous layer’s desires, propagated backward layer by layer. The literal meaning of “backpropagation.”

The full training loop

1. Forward pass     run an image through → output
2. Cost             compare output to desired answer
3. Backward pass    backprop → every knob's wish (the gradient)
4. Step             nudge every knob per the gradient (gradient descent)
5. Next example, repeat

The averaging effect

One image’s wishes are self-serving (a “3” wants “see everything as 3”). The real step averages wishes over many examples: consistent pulls survive, quirks cancel. This is why learning needs lots of data.

Why it is efficient

Approach	Cost to get the gradient
Nudge each knob, re-run	~13,000 forward passes per image
Backpropagation	1 forward + 1 backward pass (~2x one pass)

That efficiency is what makes training large networks possible at all.

Pitfalls to dodge

“Backprop is the learning.” No. It computes the gradient; gradient descent takes the step.
“One example trains the network.” No. Wishes are averaged over many; consistent patterns win.
“The flow is forward.” No. Output is forward; the gradient is computed backward.
“Neurons set earlier activations.” No. They adjust their own bias and weights and wish the previous layer were different. That wish is what propagates back.

The one-line version

Backpropagation is the network passing the blame backward: each layer tells the one before it how it wishes things had been different, in a single sweep.