
Summary: Backpropagation and the chain rule

Lesson 8 kept saying backprop figures out how much each knob should change, without ever computing that number. This lesson names the how-much: it is calculus, specifically the chain rule, applied through the layers. The headline is that lesson 8’s backward flow of “desires” was the chain rule all along, just told as a story. You only need one line of calculus and the willingness to multiply a few small numbers; Track 8 builds the chain rule properly if you want it. This is the scan-it-in-five-minutes version.

  • The chain rule, in one line: df/dx = (df/dg) · (dg/dx). When one function feeds into another, the overall rate of change is the product of the rates at each step. Rates multiply along a chain.
  • The cost is a deeply nested function. Cost depends on the output, which depends on the last weighted sum, which depends on the previous activations, and so on back to the input, one nesting per layer. A wiggle in an early weight ripples through every layer to the cost, so the chain rule is exactly the tool for that ripple.
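  • Worked small (no squish): a one-neuron-per-layer chain where each activation is just the weight times the previous activation and the cost is the squared error C = (a3 − y)². With a0=1, w1=2, w2=3, w3=0.5, y=2, the forward pass gives a1=2, a2=6, a3=3, cost 1. Then dC/dw1 = (dC/da3)·(da3/da2)·(da2/da1)·(da1/dw1) = 2 · 0.5 · 3 · 1 = 3, where the four factors are 2(a3 − y), w3, w2 and a0.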
  • This is lesson 8 in numbers. The first factor dC/da3 = 2 is the output’s desire; the weight factors w3 = 0.5 and w2 = 3 are that desire flowing back one layer at a time; the last factor a0 = 1 is the input feeding the weight. The story and the arithmetic are the same thing, and that 3 is one component of the gradient ∇C that gradient descent then steps against.
  • Run it backward to reuse work. The chains for dC/dw3, dC/dw2, dC/dw1 all share their output-side factors (dC/da3, then w3, …). Computing from the output backward calculates each shared factor once, so a single backward sweep yields every weight’s slope. That is precisely what “backpropagation” names.
  • Real networks and the squish. Deeper networks just mean longer chains (about 100 factors for 100 layers), same rule. Putting the squish back adds one factor per layer (the activation’s slope). And when a long chain’s factors are each below 1 in size, their product shrinks toward zero. That is the vanishing gradient problem, and the front layers of deep networks suffer it most because their chains are the longest.
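The worked chain and the single backward sweep from the bullets above can be sketched in a few lines of Python. This is a minimal illustration, not a real training loop: it assumes the no-squish chain (each activation is weight times previous activation) and the squared-error cost (a3 − y)², and the names `forward`, `backward` and `running` are made up for this sketch.

```python
def forward(a0, weights):
    """No-squish chain: each activation is weight * previous activation."""
    activations = [a0]
    for w in weights:
        activations.append(w * activations[-1])
    return activations


def backward(activations, weights, y):
    """One backward sweep that yields dC/dw for every weight.

    Assumes the cost C = (a_last - y)**2, so dC/da_last = 2 * (a_last - y).
    """
    running = 2 * (activations[-1] - y)   # the output-side factor, dC/da3
    grads = [0.0] * len(weights)
    for i in reversed(range(len(weights))):
        # This weight's slope: the shared factors so far times the
        # activation that feeds the weight.
        grads[i] = running * activations[i]
        # Extend the shared product one layer further back, once.
        running *= weights[i]
    return grads


weights = [2.0, 3.0, 0.5]            # w1, w2, w3
acts = forward(1.0, weights)         # a0, a1, a2, a3
cost = (acts[-1] - 2.0) ** 2         # y = 2
grads = backward(acts, weights, 2.0) # dC/dw1, dC/dw2, dC/dw3

print("activations:", acts)
print("cost:", cost)
print("gradient:", grads)
```

The first entry of `grads` is the 3 computed by hand above. The other two entries come out of the same sweep at no extra cost, because `running` carries the shared output-side factors backward instead of recomputing them for each weight.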

The chain rule is quietly why deep learning works at all: “deep” means many layers, which means long chains, and the chain rule, run backward so shared factors are reused, is what makes the gradient through all those layers computable at a sane cost. Every deep model you have used was trained by running exactly this, chains of rates multiplied backward through dozens or hundreds of layers, billions of times. It even shows you where a famous difficulty lives: stack many factors each below 1 in size and the product vanishes, so the front layers of a very deep network can stop learning, which is why so much architecture design exists to keep those backward chains healthy. With this, the training loop is complete end to end. The final lesson pulls the whole journey together, from a messy handwritten 3 to a trained network, and points you toward building one yourself.
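To see how quickly a long chain of small factors vanishes, here is a rough sketch. It assumes every layer contributes the same factor, which real networks do not, and it picks 0.25 only as an illustrative value (it is the largest slope a sigmoid squish can have, if the squish is a sigmoid and the weights are about 1 in size). The function name `chain_product` is hypothetical.

```python
def chain_product(factor, depth):
    """Multiply the same per-layer factor `depth` times, as the chain rule would."""
    product = 1.0
    for _ in range(depth):
        product *= factor
    return product


# Illustrative per-layer factor below 1 (assumption: sigmoid slope at its peak).
for depth in (1, 5, 10, 50, 100):
    print(depth, chain_product(0.25, depth))
```

A factor of exactly 1 would leave the product at 1 however deep the chain gets, which is the kind of healthy backward chain that architecture design tries to preserve.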