Backpropagation and the chain rule

Lesson 8 gave the intuition for backpropagation (desires propagating backward) but stayed deliberately vague about one phrase: it kept saying the network figures out how much each weight should change without ever computing that number. This lesson names the how-much. It is calculus, specifically the chain rule, applied through the layers, and by the end you will see that lesson 8’s backward flow of desires was the chain rule all along, just told as a story.

This is the most math-leaning lesson in the track, so a promise up front: we use the chain rule but do not teach it from scratch (Track 8, Visual Math: Calculus, does that). You only need its one-line summary, df/dx = (df/dg)·(dg/dx), which says that rates multiply along a chain, and the willingness to multiply a few small numbers. You will see why the cost is a deeply nested function (one nesting per layer), work the smallest chain by hand (for a one-neuron-per-layer network, dC/dw1 comes out as the product of four simple factors, 2 · 0.5 · 3 · 1 = 3), and recognize the first factor as the output neuron’s desire from lesson 8. Then you will see why running the chains backward lets shared output-side factors be computed once (so one sweep yields the whole gradient), how the squish adds just one factor per layer, and where the vanishing gradient problem comes from: long chains of small factors whose product shrinks toward zero.
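The hand-worked chain mentioned above can be sketched in a few lines of Python. The summary gives only the four factor values, so the network here is a hypothetical reconstruction: three layers with one neuron each, no squish, a squared-error cost, and an input, target, and weights (`x`, `y`, `w1`, `w2`, `w3`) chosen purely so the factors come out as 2, 0.5, 3, and 1. The lesson's own worked example may label or order them differently.

```python
import math

# Hypothetical one-neuron-per-layer network, no activation function:
#   a1 = w1 * x,  a2 = w2 * a1,  a3 = w3 * a2,  C = (a3 - y) ** 2
# All numbers are illustrative assumptions, not values from the lesson.
x, y = 1.0, 2.0
w1, w2, w3 = 2.0, 3.0, 0.5


def cost(w1_value):
    """Cost as a nested function of the first weight (one nesting per layer)."""
    a1 = w1_value * x
    a2 = w2 * a1
    a3 = w3 * a2
    return (a3 - y) ** 2


# Forward pass, keeping the intermediate activations.
a1 = w1 * x
a2 = w2 * a1
a3 = w3 * a2

# One rate per link in the chain, listed from the output side backward.
factors = [
    2 * (a3 - y),  # dC/da3: how the cost responds to the output
    w3,            # da3/da2
    w2,            # da2/da1
    x,             # da1/dw1
]
chain_grad = math.prod(factors)  # rates multiply along the chain

# Independent check: nudge w1 slightly and measure how the cost moves.
eps = 1e-6
numeric_grad = (cost(w1 + eps) - cost(w1 - eps)) / (2 * eps)

print("factors:", factors)
print("chain rule dC/dw1:", chain_grad)
print("finite-difference dC/dw1:", numeric_grad)
```

The first factor, `2 * (a3 - y)`, depends only on the output, so in a wider network it would be shared by every weight's chain. That sharing is what makes the backward sweep efficient.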

This is lesson 9, the second of Phase 3 (How the gradient gets computed) and the lesson that makes the whole training loop precise. Lesson 8 told the backprop story; this lesson supplies the arithmetic underneath it, closing the loop that lessons 5 through 7 set up (cost, landscape, gradient descent). It cross-references Track 8 (Visual Math: Calculus) for the chain rule itself. Lesson 10 then steps all the way back to assemble the entire mental model, from a handwritten 3 to a trained network, and points you toward building one.

Prerequisite (within this track): lesson 8, What backpropagation is really doing, since this lesson puts numbers on the “desires” and “wishes” it described. Helpful but not required: the chain rule, covered in Track 8 (Visual Math: Calculus); if “rates multiply along a chain” is already familiar, the worked example will feel like review, and if not, the one-line version in the lesson is enough to follow along. The arithmetic is multiplying small numbers; a calculator is optional.

  • State the chain rule in one line and explain why rates multiply along a chain
  • Explain why the cost is a nested function with one nesting per layer, making the chain rule the right tool
  • Work a small chain by hand to compute a weight’s effect on the cost as a product of per-layer rates
  • Explain how the chain-rule product is lesson 8’s backward flow of desires, and why running it backward reuses shared factors
  • Recognize that adding the activation function adds one factor per layer, and that long chains of small factors cause the vanishing gradient
  • Read time: about 12 minutes
  • Practice time: about 15 minutes (working a chain by hand and taking a gradient-descent step, a vanishing-gradient computation, and flashcards)
  • Difficulty: standard
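The vanishing-gradient objective listed above can be previewed numerically. This is a minimal sketch under stated assumptions: the squish is taken to be the sigmoid, every weight is set to 1, and every neuron sits at input 0, where the sigmoid's slope is at its largest (0.25). The function names are hypothetical, chosen for illustration.

```python
import math


def sigmoid(z):
    """The squish, assumed here to be the logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_slope(z):
    """Derivative of the sigmoid; it never exceeds 0.25."""
    s = sigmoid(z)
    return s * (1.0 - s)


def chain_product(depth, weight=1.0, z=0.0):
    """Multiply one (weight * slope) factor per layer along the chain."""
    product = 1.0
    for _ in range(depth):
        product *= weight * sigmoid_slope(z)
    return product


# Even in the best case for the sigmoid, the product shrinks quickly.
for depth in (1, 5, 10, 20):
    print(depth, chain_product(depth))
```

With these assumptions each layer contributes a factor of 0.25, so ten layers already scale the gradient by less than one millionth. Larger weights can offset this, which is why the sketch is an illustration of the mechanism, not a prediction for any particular network.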