What backpropagation is really doing

Lesson 7 ended on a confession: gradient descent rests on having the gradient, the slope of the cost with respect to every one of the network’s roughly 13,000 knobs, and we waved our hands about where it comes from. This lesson pays that debt with pure intuition (the calculus comes in the next lesson). The method is called backpropagation.

You will see why the obvious approach (nudge each weight, re-run the network, measure the cost change) is hopeless: it takes about 13,000 run-throughs per image, one for every knob. Then comes the reframe that unlocks everything: instead of asking how the cost depends on a buried weight, ask what each output neuron wants. You will learn the three ways to grant a neuron’s wish (raise its bias, raise weights on its bright inputs, or wish the previous layer were different), see why the third can only be a wish that gets passed backward, and watch those wishes roll layer by layer to the front, which is literally what “backpropagation” names. The payoff: one forward pass plus one backward sweep yields the whole gradient for roughly the cost of a few run-throughs instead of 13,000. The lesson closes on why the real step averages wishes over many examples, so consistent patterns win and single-image quirks cancel.
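To make the cost of the obvious approach concrete, here is a minimal sketch of nudge-and-re-run on a toy model. Everything in it is an assumption for illustration: the "network" is a single sigmoid neuron with four knobs, the cost is squared error, and the names `cost`, `gradient_by_nudging`, and `forward_passes` are hypothetical. The point is only the counting: one baseline run plus one re-run per knob.

```python
import math

forward_passes = 0  # counts how many times the toy "network" is run


def cost(knobs, inputs, target):
    """One forward pass of a hypothetical one-neuron network, then squared error."""
    global forward_passes
    forward_passes += 1
    *weights, bias = knobs  # last knob is the bias, the rest are weights
    z = sum(w * a for w, a in zip(weights, inputs)) + bias
    activation = 1.0 / (1.0 + math.exp(-z))  # sigmoid "squish" (assumed)
    return (activation - target) ** 2


def gradient_by_nudging(knobs, inputs, target, eps=1e-6):
    """Estimate each slope by nudging one knob at a time and re-running."""
    base = cost(knobs, inputs, target)  # one baseline run
    grad = []
    for i in range(len(knobs)):
        nudged = list(knobs)
        nudged[i] += eps
        # One extra full run-through for every single knob.
        grad.append((cost(nudged, inputs, target) - base) / eps)
    return grad


knobs = [0.5, -0.3, 0.8, 0.1]  # three weights and one bias
inputs = [1.0, 0.0, 0.5]
grad = gradient_by_nudging(knobs, inputs, target=1.0)
print(f"{len(knobs)} knobs needed {forward_passes} forward passes")
# prints: 4 knobs needed 5 forward passes
```

The same loop over roughly 13,000 knobs would mean roughly 13,000 re-runs for a single image, which is the expense backpropagation avoids. Note also that the slope for the weight attached to the zero input comes out as exactly zero: nudging a weight on a dark input changes nothing.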

This is lesson 8, the first of Phase 3 (How the gradient gets computed). Phase 2 built the learning loop but assumed the gradient was simply available; this lesson supplies the missing piece, intuitively. Lesson 9 then opens the hood on the “wishes,” showing they are derivatives and the backward sweep is the chain rule applied layer by layer (cross-referencing Track 8, Calculus). Lesson 10 zooms all the way out to assemble the whole mental model and point to where to go next. After this lesson, the training loop is conceptually complete.

Prerequisite (within this track): lesson 7, “Gradient descent, step by step,” since this lesson answers the exact question lesson 7 left open (where the gradient comes from). It also leans on the lesson-3 neuron formula (activation = weighted sum plus bias, squished), so being comfortable that a neuron’s activation is built from weights, a bias, and the previous layer’s activations will make the “three ways to grant a wish” land. No calculus is needed here; that arrives in lesson 9.
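As a refresher on that neuron formula, the sketch below computes one activation and then tries each of the three ways to raise it. It is illustrative only: the sigmoid squish, the specific numbers, and the names `neuron`, `prev`, and `step` are assumptions, not taken from the lessons.

```python
import math


def neuron(weights, bias, prev_activations):
    """activation = squish(weighted sum + bias), with sigmoid as the assumed squish."""
    z = sum(w * a for w, a in zip(weights, prev_activations)) + bias
    return 1.0 / (1.0 + math.exp(-z))


weights = [0.2, -0.4, 0.1]
bias = -0.5
prev = [0.9, 0.1, 0.0]  # previous layer: bright, dim, dark
base = neuron(weights, bias, prev)
step = 0.1

# Way 1: raise the bias.
via_bias = neuron(weights, bias + step, prev) - base

# Way 2: raise a weight. It only helps if the input it multiplies is bright.
via_bright_weight = neuron([weights[0] + step, weights[1], weights[2]], bias, prev) - base
via_dark_weight = neuron([weights[0], weights[1], weights[2] + step], bias, prev) - base

# Way 3: wish the previous layer were different. Activations are not knobs,
# so this can only be recorded as a wish and handed to the layer behind.
via_prev_pos = neuron(weights, bias, [prev[0] + step, prev[1], prev[2]]) - base
via_prev_neg = neuron(weights, bias, [prev[0], prev[1] + step, prev[2]]) - base

print("bias:", via_bias > 0)                    # raising the bias helps
print("bright weight:", via_bright_weight > 0)  # helps
print("dark weight:", via_dark_weight == 0.0)   # no effect at all
print("brighter prev, + weight:", via_prev_pos > 0)  # helps
print("brighter prev, - weight:", via_prev_neg < 0)  # hurts
```

In this toy setup a brighter previous neuron helps only where the connecting weight is positive, and hurts where it is negative, which is why the wish handed backward has a direction for each previous neuron.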

  • Explain why computing the gradient by nudging each knob and re-running the network is hopelessly expensive
  • Describe the reframe of asking what each output neuron wants rather than how the cost depends on a buried weight
  • Explain the three ways to grant a neuron’s wish and why one of them can only be a wish for the previous layer
  • Explain how wishes propagate backward layer by layer, and that one forward plus one backward pass yields the whole gradient
  • Explain why the real training step averages wishes over many examples
  • Read time: about 11 minutes
  • Practice time: about 14 minutes (reading output neurons’ wishes, tracing a wish backward, and flashcards)
  • Difficulty: standard