Backpropagation: brief

What you’ll learn

Lesson 7 ended on a confession: gradient descent rests on having the gradient, the slope of the cost with respect to every one of the network’s roughly 13,000 knobs, and we waved our hands about where it comes from. This lesson pays that debt with pure intuition (the calculus follows in lesson 9). The method is called backpropagation.

You will see why the obvious approach (nudge each knob, re-run the network, measure the change in cost) is hopeless: it needs one run-through per knob, about 13,000 per image. Then comes the reframe that unlocks everything: instead of asking how the cost depends on a buried weight, ask what each output neuron wants. You will learn the three ways to grant a neuron’s wish (raise its bias, raise weights on its bright inputs, or wish the previous layer were different), see why that third one can only be a wish that gets passed backward, and watch those wishes roll layer by layer to the front, which is literally what “backpropagation” names. The payoff: one forward pass plus one backward sweep yields the whole gradient for roughly the cost of a couple of runs through the network, not 13,000. The lesson closes on why the real step averages wishes over many examples, so consistent patterns win and single-image quirks cancel.
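To make the cost of the obvious approach concrete, here is a minimal sketch of nudge-and-re-run on a toy one-neuron "network" with three knobs. The class `CountingNet`, the function `nudge_gradient`, and all the numbers are made up for illustration; the only point is to count how many times the network has to run.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class CountingNet:
    """Hypothetical toy network: two inputs, one neuron, three knobs."""

    def __init__(self, knobs):
        self.knobs = list(knobs)      # [w1, w2, b]
        self.forward_passes = 0       # how many times the network was run

    def cost(self, x, target):
        self.forward_passes += 1
        w1, w2, b = self.knobs
        a = sigmoid(w1 * x[0] + w2 * x[1] + b)
        return (a - target) ** 2

def nudge_gradient(net, x, target, eps=1e-6):
    """Estimate each slope by nudging one knob and re-running the network."""
    base = net.cost(x, target)                    # one baseline run
    grad = []
    for i in range(len(net.knobs)):
        old = net.knobs[i]
        net.knobs[i] = old + eps                  # nudge one knob
        grad.append((net.cost(x, target) - base) / eps)
        net.knobs[i] = old                        # put it back
    return grad

net = CountingNet([0.5, -0.3, 0.1])
grad = nudge_gradient(net, x=[1.0, 0.0], target=1.0)
print(net.forward_passes)  # 4: one baseline run plus one run per knob
```

With three knobs that is four runs. With roughly 13,000 knobs it is roughly 13,000 runs, for a single image. Note also that the slope for the second weight comes out as zero, because its input is dark (0.0): changing a weight on a dark input changes nothing.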

Where this fits

This is lesson 8, the first of Phase 3 (How the gradient gets computed). Phase 2 built the learning loop but assumed the gradient was simply available; this lesson supplies the missing piece, intuitively. Lesson 9 then opens the hood on the “wishes,” showing they are derivatives and the backward sweep is the chain rule applied layer by layer (cross-referencing Track 8, Calculus). Lesson 10 zooms all the way out to assemble the whole mental model and point to where to go next. After this lesson, the training loop is conceptually complete.

Before you start

Prerequisite (within this track): lesson 7, Gradient descent, step by step, since this lesson answers the exact question lesson 7 left open (where the gradient comes from). It also leans on the lesson-3 neuron formula (activation = weighted sum plus bias, squished), so the “three ways to grant a wish” will land more easily if you are comfortable with the idea that a neuron’s activation is built from weights, a bias, and the previous layer’s activations. No calculus is needed here; that arrives in lesson 9.
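As a quick refresher, the neuron formula can be sketched in a few lines, along with a rough preview of the three ways to make a neuron brighter. This assumes the squish is a sigmoid; the weights, bias, and input values are invented for illustration, and no calculus is involved, only small nudges.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def activation(weights, bias, prev):
    """Weighted sum of the previous layer's activations, plus bias, squished."""
    return sigmoid(sum(w * a for w, a in zip(weights, prev)) + bias)

def bumped(values, i, step):
    """Copy of `values` with entry i raised by `step`."""
    return values[:i] + [values[i] + step] + values[i + 1:]

weights = [0.8, -0.5, 0.2]
bias = -0.1
prev = [0.9, 0.0, 0.4]        # a bright, a dark, and a dim input

base = activation(weights, bias, prev)
step = 0.1

# Way 1: raise the bias.
via_bias = activation(weights, bias + step, prev) - base

# Way 2: raise one weight at a time. Bright inputs give the most leverage.
via_weights = [activation(bumped(weights, i, step), bias, prev) - base
               for i in range(3)]

# Way 3: change one previous-layer activation at a time. These are not
# knobs, so this can only be recorded as a wish for the previous layer.
via_prev = [activation(weights, bias, bumped(prev, i, step)) - base
            for i in range(3)]

print(via_bias, via_weights, via_prev)
```

In this made-up example the weight on the bright input moves the activation most and the weight on the dark input moves it not at all. For the third way, the effect of changing a previous activation follows the weight attached to it, so the input behind the negative weight would have to get dimmer, not brighter, to help.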

By the end, you’ll be able to

Explain why computing the gradient by nudging each knob and re-running the network is hopelessly expensive
Describe the reframe of asking what each output neuron wants rather than how the cost depends on a buried weight
Explain the three ways to grant a neuron’s wish and why one of them can only be a wish for the previous layer
Explain how wishes propagate backward layer by layer, and that one forward plus one backward pass yields the whole gradient
Explain why the real training step averages wishes over many examples
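If you would like a concrete picture of the last two objectives before starting, the following is a minimal sketch on a toy network with one input, one hidden neuron, and one output neuron. It peeks ahead at the chain rule that lesson 9 explains, so treat it as optional. The function names, the sigmoid squish, the squared-error cost, and all numbers are assumptions made for illustration. Each returned value is a slope of the cost; the matching "wish" points the opposite way, toward lower cost.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward_backward(knobs, x, target):
    """One forward pass, then one backward sweep, for a 1-1-1 toy network."""
    w1, b1, w2, b2 = knobs
    # Forward pass: keep the activations, because the backward sweep reuses them.
    h = sigmoid(w1 * x + b1)
    o = sigmoid(w2 * h + b2)
    # Backward sweep, starting at the output neuron.
    d_o = 2.0 * (o - target)        # how the cost reacts to the output
    d_z2 = d_o * o * (1.0 - o)      # back through the output neuron's squish
    d_h = d_z2 * w2                 # handed back to the hidden neuron
    d_z1 = d_h * h * (1.0 - h)      # back through the hidden neuron's squish
    # Slopes in the same order as the knobs: w1, b1, w2, b2.
    return [d_z1 * x, d_z1, d_z2 * h, d_z2]

def average_gradient(knobs, examples):
    """Average the per-example slopes, so shared patterns outweigh quirks."""
    total = [0.0] * len(knobs)
    for x, target in examples:
        g = forward_backward(knobs, x, target)
        total = [t + gi for t, gi in zip(total, g)]
    return [t / len(examples) for t in total]

knobs = [0.4, -0.2, 0.7, 0.1]
examples = [(1.0, 1.0), (0.5, 0.0), (0.0, 1.0)]   # (input, target) pairs
avg = average_gradient(knobs, examples)
print(avg)
```

Every knob gets its slope from a single call to `forward_backward`, however many knobs there are, which is the contrast with re-running the network once per knob.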

Time and difficulty

Read time: about 11 minutes
Practice time: about 14 minutes (reading output neurons’ wishes, tracing a wish backward, and flashcards)
Difficulty: standard