Practice: What backpropagation is really doing
Self-check
Six short questions. Answer each one in your head (or on paper) before opening the collapsible. Trying to retrieve the answer is where the learning sticks; rereading feels productive but does much less.
1. Why can’t we get the gradient the obvious way (nudge each knob, re-run the network, see how the cost changed)?
Show answer
The count. There are about 13,000 knobs, so that is 13,000 full run-throughs of the network just for one training image, times tens of thousands of images. The arithmetic balloons into something hopeless. We need a way to get the slope for all 13,000 knobs at once, cheaply, which is what backpropagation provides.
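To make the count concrete, here is a minimal sketch of the arithmetic. The layer sizes (784 inputs, two hidden layers of 16, 10 outputs) and the training-set size are assumptions for illustration; this section only says "about 13,000 knobs" and "tens of thousands of images".

```python
# Assumed layer sizes (not stated in this section): 784 inputs,
# two hidden layers of 16 neurons each, 10 outputs.
layer_sizes = [784, 16, 16, 10]


def count_knobs(sizes):
    """Count every weight and bias in a fully connected network."""
    weights = sum(n_in * n_out for n_in, n_out in zip(sizes, sizes[1:]))
    biases = sum(sizes[1:])  # one bias per non-input neuron
    return weights + biases


knobs = count_knobs(layer_sizes)

# Hypothetical training-set size, for illustration only.
images = 10_000

# Brute force: one extra forward pass per knob, for every image.
brute_force_passes = knobs * images

# Backpropagation: one forward pass and one backward pass per image.
backprop_passes = 2 * images

print(knobs, brute_force_passes, backprop_passes)
```

Under these assumptions the brute-force route needs thousands of times more passes than the forward-plus-backward route, and the gap widens with every knob added.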
2. What is the reframe that unlocks backpropagation?
Show answer
Instead of asking “how does the cost depend on this buried weight,” ask a friendlier question at the output end: what does each output neuron want? Each one has a desired direction (its activation should go up or down) and a strength of feeling. That list of desires carries the same information as the cost, just in a more actionable form.
3. There are three ways to push a neuron’s activation up. What are they?
Show answer
From the lesson-3 formula (activation = squish of weighted sum plus bias): (1) raise its bias (adds directly to the sum); (2) raise the weights on its already-bright inputs (a weight matters in proportion to the activation feeding through it); (3) raise the activations of the previous-layer neurons it connects to with positive weights. The first two are directly adjustable knobs; the third is only a wish.
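The three levers can be tried out numerically. This is a small sketch, assuming the squish is a sigmoid; the weights, inputs, and step size are hypothetical numbers chosen for illustration.

```python
import math


def sigmoid(z):
    """Assumed squishing function: maps any number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def activation(weights, inputs, bias):
    """Activation = squish of (weighted sum plus bias)."""
    z = sum(w * a for w, a in zip(weights, inputs)) + bias
    return sigmoid(z)


# Hypothetical neuron with one bright input and one dim input.
weights = [0.5, 0.5]
inputs = [0.9, 0.1]
bias = 0.0
step = 0.1

base = activation(weights, inputs, bias)

# Lever 1: raise the bias.
lift_bias = activation(weights, inputs, bias + step) - base

# Lever 2: raise a weight. Compare the bright input with the dim one.
lift_bright_weight = activation([weights[0] + step, weights[1]], inputs, bias) - base
lift_dim_weight = activation([weights[0], weights[1] + step], inputs, bias) - base

# Lever 3 (only a wish): a previous-layer activation on a positive weight rises.
lift_prev_activation = activation(weights, [inputs[0] + step, inputs[1]], bias) - base

print(lift_bias, lift_bright_weight, lift_dim_weight, lift_prev_activation)
```

All four lifts come out positive, and the same-sized nudge to the bright input's weight lifts the neuron more than the nudge to the dim input's weight.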
4. How does an output neuron’s wish become the previous layer’s wishes? (This is the backward move.)
Show answer
A neuron cannot set earlier activations directly; it can only wish the previous layer had handed it different numbers (more from its positive-weight inputs, less from its negative-weight ones). Every output neuron registers such wishes for the same previous-layer neurons, and summing all those competing requests gives each previous-layer neuron one net wish: be a bit higher or a bit lower. The output layer’s desires have become the previous layer’s desires.
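The summing step can be sketched in a few lines. The weights and wish values below are hypothetical, and the sketch assumes each wish already accounts for the slope of the squishing function, a factor that full backpropagation handles explicitly.

```python
# Hypothetical weights: weights[j][k] connects previous-layer neuron k
# to output neuron j.
weights = [
    [0.8, -0.5, 0.1],   # into output neuron 0
    [-0.3, 0.9, 0.4],   # into output neuron 1
]

# Hypothetical wishes: positive means "wants to go up", negative "down".
output_wishes = [0.5, -0.3]


def previous_layer_wishes(weights, output_wishes):
    """Sum every output neuron's request, weighted by its connection."""
    n_prev = len(weights[0])
    return [
        sum(weights[j][k] * output_wishes[j] for j in range(len(weights)))
        for k in range(n_prev)
    ]


net_wishes = previous_layer_wishes(weights, output_wishes)
print(net_wishes)
```

Previous-layer neuron 0 ends up with a net wish to go up: output neuron 0 wants more through a positive weight, and output neuron 1 wants less through a negative weight, which amounts to the same request. Neuron 1 gets the opposite verdict from both.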
5. What does a single forward pass plus a single backward pass give you, and why does that matter?
Show answer
The entire gradient: the desired nudge for every weight and every bias at once. The backward pass costs about the same as the forward pass, not 13,000 times more. That efficiency is the quiet miracle that makes training large networks feasible at all.
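A toy comparison can make this tangible. The sketch below assumes a tiny hypothetical network (2 inputs, 2 hidden neurons, 1 output, 9 knobs), a sigmoid squish, and a squared-error cost, none of which this section specifies. It computes the gradient once by backpropagation and once by nudging every knob, then checks that the two agree.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# params is a flat list of 9 knobs:
# [w1_00, w1_01, w1_10, w1_11, b1_0, b1_1, w2_0, w2_1, b2]
def forward(params, x):
    w1 = [params[0:2], params[2:4]]
    b1 = params[4:6]
    w2 = params[6:8]
    b2 = params[8]
    hidden = [sigmoid(w1[j][0] * x[0] + w1[j][1] * x[1] + b1[j]) for j in range(2)]
    out = sigmoid(w2[0] * hidden[0] + w2[1] * hidden[1] + b2)
    return hidden, out


def cost(params, x, target):
    _, out = forward(params, x)
    return (out - target) ** 2


def backprop_gradient(params, x, target):
    """One forward pass, then one backward pass, yields all 9 slopes."""
    w2 = params[6:8]
    hidden, out = forward(params, x)
    # The output neuron's wish, scaled by the slope of its squish.
    delta_out = 2 * (out - target) * out * (1 - out)
    # Pass that wish backward through each weight to the hidden neurons.
    delta_hidden = [w2[j] * delta_out * hidden[j] * (1 - hidden[j]) for j in range(2)]
    grad = []
    for j in range(2):
        grad += [delta_hidden[j] * x[0], delta_hidden[j] * x[1]]  # first-layer weights
    grad += delta_hidden                                          # hidden biases
    grad += [delta_out * hidden[0], delta_out * hidden[1]]        # output weights
    grad.append(delta_out)                                        # output bias
    return grad


def brute_force_gradient(params, x, target, eps=1e-6):
    """Nudge each knob up and down in turn and re-run the network."""
    grad = []
    for i in range(len(params)):
        up = params[:i] + [params[i] + eps] + params[i + 1:]
        down = params[:i] + [params[i] - eps] + params[i + 1:]
        grad.append((cost(up, x, target) - cost(down, x, target)) / (2 * eps))
    return grad


# Hypothetical knob settings and training example.
params = [0.5, -0.3, 0.8, 0.2, 0.1, -0.1, 0.7, -0.6, 0.05]
x = [0.9, 0.4]
target = 1.0

fast = backprop_gradient(params, x, target)
slow = brute_force_gradient(params, x, target)
max_gap = max(abs(f - s) for f, s in zip(fast, slow))
print(max_gap)
```

The two gradients match to within numerical noise, but the brute-force version re-runs the network twice per knob while the backprop version runs it once and then sweeps backward once.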
6. Why does the real training step average the wishes over many examples?
Show answer
Because one image’s wishes are self-serving: the “3” image alone would shove the weights toward “see everything as a 3.” Averaging wishes across many examples lets the pulls that many images agree on survive and add up, while the quirks only one image cares about cancel out. That averaged signal is the true gradient, which is why learning needs lots of data.
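A minimal sketch of the averaging, using made-up wish values for three knobs across four images: the images agree about knob 0, while their wishes for knobs 1 and 2 are individual quirks.

```python
# Hypothetical per-image wishes for three knobs (one row per image).
per_image_wishes = [
    [0.4, 0.9, -0.1],
    [0.5, -0.8, 0.0],
    [0.3, 0.1, 0.2],
    [0.4, -0.2, -0.1],
]


def average_wishes(wishes):
    """Average each knob's wish across all images."""
    n = len(wishes)
    return [sum(column) / n for column in zip(*wishes)]


averaged = average_wishes(per_image_wishes)
print(averaged)
```

The shared pull on knob 0 survives the averaging at roughly its original size, while the conflicting pulls on knobs 1 and 2 cancel to roughly zero.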
Try it yourself, part 1: read the wishes
About 5 minutes, no calculation. The image is actually a 7, so the desired output is a 1 in slot 7 and 0 everywhere else. The network’s ten output neurons come back as:
digit:  0   1   2   3   4   5   6   7   8   9
output: 0.0 0.3 0.0 0.0 0.0 0.0 0.0 0.5 0.0 0.2
For each neuron that has a meaningful wish, say which direction it wants to move (up or down) and roughly how strongly. Which two neurons have the strongest wishes?
Show answer
Compare each output to its desired value (1 for slot 7, 0 for the rest):
- Digit 7 is at 0.5 but should be 1, so it wants to go up, and fairly strongly (it is the correct answer sitting only halfway).
- Digit 1 is at 0.3 but should be 0, so it wants to go down, moderately (it is the loudest wrong digit).
- Digit 9 is at 0.2 but should be 0, so it wants to go down, a little.
- All the others are already at 0, where they should be, so they are content and have essentially no wish.
The two strongest wishes are digit 7 (up) and digit 1 (down): the correct answer that is too quiet and the wrong answer that is too loud. Those are exactly the desires backpropagation will push hardest on.
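The same reading can be checked mechanically. This sketch takes each wish to be "desired minus actual", which is an assumption: the exact strength depends on the cost function, but the directions and the ranking here would not change under a squared-error cost.

```python
# Outputs from the exercise, indexed by digit.
outputs = [0.0, 0.3, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.0, 0.2]
correct_digit = 7

# Desired output: 1 in the correct slot, 0 everywhere else.
desired = [1.0 if d == correct_digit else 0.0 for d in range(10)]

# Positive wish = wants to go up, negative = wants to go down.
wishes = [want - got for want, got in zip(desired, outputs)]

# Rank digits by how strongly they want to move.
strongest = sorted(range(10), key=lambda d: abs(wishes[d]), reverse=True)[:2]

for d in range(10):
    if wishes[d] != 0:
        direction = "up" if wishes[d] > 0 else "down"
        print(f"digit {d}: wants to go {direction}, strength {abs(wishes[d])}")
print("strongest wishes:", strongest)
```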
Try it yourself, part 2: trace a wish backward
About 5 minutes, no calculation. A hidden neuron H has received a net wish to move up. It has two incoming connections from the previous layer:
from neuron A: weight +0.7, A's current activation 0.8 (bright)
from neuron B: weight -0.4, B's current activation 0.6
Two questions: (1) Name the three ways H’s “go up” wish could be granted. (2) For the third way, what does H wish of neuron A, and what does it wish of neuron B?
Show answer
(1) Three ways to push H up:
- Raise H’s bias (it adds directly to the weighted sum).
- Raise the weight on its bright positive input, the weight from A (+0.7): A is already very active (0.8), so increasing that weight gives the most lift per unit of change.
- Wish the previous layer were different, since H cannot set A and B directly.
(2) H’s wishes for the previous layer: A feeds in through a positive weight (+0.7), so making A more active would raise H, meaning H wishes A were more active (up). B feeds in through a negative weight (-0.4), so making B more active would lower H, meaning H wishes B were less active (down). Those two wishes, summed with the wishes of every other neuron that connects to A and B, become A’s and B’s own net desires, and the backward roll continues to the layer behind them.
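The trace for H can be written out with the numbers from the exercise. The strength of H's wish (1.0) is a hypothetical value, and the sketch leaves out the slope of the squishing function, which would scale every result by the same positive factor without changing any direction.

```python
# Hypothetical strength of H's wish; positive means "wants to go up".
wish_h = 1.0

# Connections from the exercise.
incoming = {
    "A": {"weight": 0.7, "activation": 0.8},
    "B": {"weight": -0.4, "activation": 0.6},
}


def trace_back(wish, connections):
    """Turn one neuron's wish into wishes about its inputs and weights."""
    result = {}
    for name, c in connections.items():
        result[name] = {
            # The wish for the input's activation passes back through the weight.
            "activation_wish": c["weight"] * wish,
            # The wish for the weight scales with how bright the input is.
            "weight_wish": c["activation"] * wish,
        }
    return result


traced = trace_back(wish_h, incoming)
for name, w in traced.items():
    direction = "up" if w["activation_wish"] > 0 else "down"
    print(f"H wishes {name} were {direction}; weight lever strength {w['weight_wish']}")
```

The signs match the answer above: A is wished up, B is wished down, and the weight from A is the stronger lever because A is the brighter input.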
Flashcards
Nine cards. Click any card to reveal the answer. Use the Print flashcards button to lay out the full set as one card per page, ready to print or save as a PDF for offline review.
Q. What does backpropagation compute, and is it the same as training?
It computes the gradient, every knob’s wish, in one backward sweep. It is not training by itself: backprop finds the gradient, then gradient descent (lesson 7) uses it to take the step. Backprop is one ingredient in the loop.
Q. Why is the brute-force gradient (nudge each knob, re-run) hopeless?
It would cost about 13,000 full forward passes per image (one per knob), times tens of thousands of images. Backpropagation instead gets all ~13,000 wishes at once, for about the cost of one extra pass.
Q. What is the reframe at the heart of backpropagation?
Instead of “how does the cost depend on this buried weight,” ask “what does each output neuron want?” Each has a desired direction and strength, which is the same information as the cost in a more actionable form.
Q. What are the three ways to push a neuron's activation up?
Raise its bias; raise the weights on its already-bright inputs; or raise the activations of the previous-layer neurons it connects to with positive weights. The first two are adjustable knobs; the third is only a wish.
Q. How does a neuron's wish become the previous layer's wishes?
A neuron cannot set earlier activations directly, so it wishes the previous layer were different (more from positive-weight inputs, less from negative). Summing all neurons’ competing wishes gives each previous-layer neuron one net wish. That is the backward propagation.
Q. What does one forward pass plus one backward pass give you?
The entire gradient: every weight’s and bias’s wished-for nudge at once. The backward pass costs about the same as one more run of the network, not 13,000 more. That efficiency is what makes training large networks possible.
Q. Why average the wishes over many examples?
One image’s wishes are self-serving and noisy (a “3” wants “see everything as 3”). Averaging across many examples lets consistent pulls survive and add up while quirks cancel. That averaged signal is the true gradient, so learning needs lots of data.
Q. Which direction does the gradient computation flow, and which direction is the forward pass?
The forward pass flows input to output (computing the answer). Backpropagation flows backward, output to input (computing the gradient), because desires start where the cost is felt and propagate back layer by layer.
Q. What is a trained model, in light of backpropagation?
Not designed but settled into: its parameters are the accumulated residue of countless small wishes, averaged over mountains of examples and applied over and over until the cost stopped falling. There is no author of those billions of numbers.