Skip to content

Cheatsheet: What backpropagation is really doing

backpropagation = compute the gradient (every knob's wish) in one backward sweep
by passing "what each layer wants" back to the layer before it

Backprop is not training. It is the gradient-computation step that gradient descent (L7) then uses.

Nudging each knob and re-running the network would cost ~13,000 forward passes per image. Hopeless. Backprop gets all ~13,000 wishes at once, for about the cost of one extra pass.

  1. Read the output’s wishes. Each output neuron wants to move toward its desired value. (Image is “3”: the “3” neuron wants up, an over-firing “4” wants down, the rest are roughly content.)
  2. Three ways to grant a wish (from the L3 neuron formula): raise the bias, raise weights on already-bright inputs, or wish the previous layer were more/less active.
  3. The third wish rolls backward. A neuron cannot set earlier activations directly, so its wish becomes a request to the previous layer. Summed over all neurons, each previous-layer neuron gets one net wish.
  4. Repeat to the front. Those become the previous layer’s desires, propagated backward layer by layer. The literal meaning of “backpropagation.”
1. Forward pass run an image through → output
2. Cost compare output to desired answer
3. Backward pass backprop → every knob's wish (the gradient)
4. Step nudge every knob per the gradient (gradient descent)
5. Next example, repeat

One image’s wishes are self-serving (a “3” wants “see everything as 3”). The real step averages wishes over many examples: consistent pulls survive, quirks cancel. This is why learning needs lots of data.

ApproachCost to get the gradient
Nudge each knob, re-run~13,000 forward passes per image
Backpropagation1 forward + 1 backward pass (~2x one pass)

That efficiency is what makes training large networks possible at all.

  • “Backprop is the learning.” No. It computes the gradient; gradient descent takes the step.
  • “One example trains the network.” No. Wishes are averaged over many; consistent patterns win.
  • “The flow is forward.” No. Output is forward; the gradient is computed backward.
  • “Neurons set earlier activations.” No. They adjust their own bias and weights and wish the previous layer were different. That wish is what propagates back.

Backpropagation is the network passing the blame backward: each layer tells the one before it how it wishes things had been different, in a single sweep.