Cheatsheet: Backpropagation and the chain rule
The one idea that matters
backpropagation = the chain rule, applied through the layers, run backward
the "how much each knob should change" from L8 = a product of per-layer rates
Track 8 teaches the chain rule; Track 11 applies it. You only need one line of it.
The chain rule in one line
df/dx = (df/dg) · (dg/dx)
Rates multiply along a chain. (Nudge x → g moves by dg/dx → f moves by df/dg times that, so responses stack as a product.)
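A quick numerical check can make "rates multiply" concrete. The sketch below uses two example functions that are not from this cheatsheet, f(g) = g² and g(x) = sin(x), chosen only for illustration. It compares the chain-rule product with a finite-difference slope of the composed function.

```python
import math

# Hypothetical example functions (assumptions, not from the cheatsheet):
# g(x) = sin(x) and f(g) = g**2.
def g(x):
    return math.sin(x)

def f(g_val):
    return g_val ** 2

def chain_rule_slope(x):
    # df/dx = (df/dg) * (dg/dx)
    df_dg = 2 * g(x)       # slope of f with respect to g
    dg_dx = math.cos(x)    # slope of g with respect to x
    return df_dg * dg_dx

def finite_difference_slope(x, h=1e-6):
    # Central difference on the composed function f(g(x)).
    return (f(g(x + h)) - f(g(x - h))) / (2 * h)

x0 = 0.7  # arbitrary test point
analytic = chain_rule_slope(x0)
numeric = finite_difference_slope(x0)
print(analytic, numeric)  # the two values should agree to several decimals
```

The two numbers agree up to finite-difference error, which is the one-line rule in action.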
Why a chain at all
The cost is nested, one layer inside the next: cost ← output ← weighted sum ← previous activations ← … ← input. A weight’s effect on the cost ripples forward through every layer, so its slope is the product of the rate at each layer.
Worked chain (one neuron per layer, no squish)
a0=1, w1=2, w2=3, w3=0.5, y=2.
Forward: a1=2, a2=6, a3=3, C=(3−2)²=1.
dC/dw1 = (dC/da3)·(da3/da2)·(da2/da1)·(da1/dw1) = 2 · 0.5 · 3 · 1 = 3

| Factor | Value | Meaning |
|---|---|---|
| dC/da3 = 2(a3−y) | 2 | the output’s desire (L8) |
| da3/da2 = w3 | 0.5 | desire flows back through a weight |
| da2/da1 = w2 | 3 | and back another layer |
| da1/dw1 = a0 | 1 | the input feeding this weight |
That 3 is one component of the gradient ∇C. Gradient descent (L7) then does w1_new = w1 − learning_rate · 3.
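The worked chain can be reproduced in a few lines. This is a minimal sketch using the numbers above; the variable names and the learning rate of 0.1 are assumptions made for illustration.

```python
# Values from the worked chain.
a0, w1, w2, w3, y = 1.0, 2.0, 3.0, 0.5, 2.0

# Forward pass: one neuron per layer, no squish.
a1 = w1 * a0
a2 = w2 * a1
a3 = w3 * a2
cost = (a3 - y) ** 2

# The four factors from the table, multiplied along the chain.
dC_da3 = 2 * (a3 - y)   # the output's desire
da3_da2 = w3            # back through a weight
da2_da1 = w2            # and back another layer
da1_dw1 = a0            # the input feeding this weight
dC_dw1 = dC_da3 * da3_da2 * da2_da1 * da1_dw1

# One gradient descent step; learning_rate = 0.1 is an assumed value.
learning_rate = 0.1
w1_new = w1 - learning_rate * dC_dw1
print(cost, dC_dw1, w1_new)
```

With these numbers the cost is 1, the slope is 3, and the step moves w1 from 2 to about 1.7.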
Why run it backward
dC/dw3 = 2 · 6 (2 factors)
dC/dw2 = 2 · 0.5 · 2 (3 factors)
dC/dw1 = 2 · 0.5 · 3 · 1 (4 factors)
Every chain shares the output-side factors (dC/da3, then w3, …). Compute backward → calculate each shared factor once → one backward sweep yields every weight’s slope. That is what “backpropagation” names.
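The reuse of shared factors can be sketched as a single loop. The function name `backward_sweep` and the running variable `upstream` are hypothetical, and the sketch assumes the same setup as the worked chain: one neuron per layer, no squish, squared-error cost.

```python
def backward_sweep(a0, weights, y):
    # Forward pass, keeping every activation: activations[i] feeds weights[i].
    activations = [a0]
    for w in weights:
        activations.append(w * activations[-1])

    # Start at the output: dC/da_last = 2 * (a_last - y).
    upstream = 2 * (activations[-1] - y)

    grads = [0.0] * len(weights)
    for i in reversed(range(len(weights))):
        # dC/dw_i = (shared output-side product) * (activation feeding w_i).
        grads[i] = upstream * activations[i]
        # Extend the shared product one layer back, computed only once.
        upstream *= weights[i]
    return grads

grads = backward_sweep(1.0, [2.0, 3.0, 0.5], 2.0)
print(grads)  # [3.0, 2.0, 12.0] -> dC/dw1, dC/dw2, dC/dw3
```

Each pass through the loop does one multiplication to extend the shared product, so all three slopes come out of one sweep instead of three separate chains.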
In a real network
- Longer chains: ~5 factors for 4 layers, ~100 for 100 layers. Same rule.
- Add the squish back: each layer contributes one extra factor (the activation’s slope). Rule unchanged.
- Many small factors multiplied can shrink toward zero → the “vanishing gradient” difficulty in very deep nets.
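The shrinking product can be seen numerically. The sketch below assumes the squish is a sigmoid (the cheatsheet does not name one), with every weight set to 1.0 and a starting activation of 0.5; all three choices are illustrative. Each layer then contributes a factor of weight times sigmoid slope, and the sigmoid slope never exceeds 0.25.

```python
import math

def sigmoid(z):
    # Assumed squish function; the cheatsheet does not specify which one is used.
    return 1.0 / (1.0 + math.exp(-z))

def chain_product(depth, w=1.0, a0=0.5):
    # Product of the per-layer factors da_k/da_(k-1) = w * sigmoid'(z_k)
    # through `depth` single-neuron layers.
    a = a0
    product = 1.0
    for _ in range(depth):
        z = w * a
        a = sigmoid(z)
        slope = a * (1.0 - a)  # sigmoid'(z), at most 0.25
        product *= w * slope
    return product

for depth in (1, 5, 20):
    print(depth, chain_product(depth))
```

With these assumed values the product drops by several orders of magnitude between 5 and 20 layers, which is the vanishing-gradient effect in miniature.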
Pitfalls to dodge
- “I must master calculus first.” No. One line: rates multiply. T8 has the depth.
- “Backprop is separate from the chain rule.” No. It is the chain rule, run backward through the layers.
- “Direction does not matter.” It does. Backward reuses shared output-side factors; forward recomputes them.
- “The squish breaks it.” No. Each squish adds one factor (its slope). Same method.
The one-line version
Backpropagation is the chain rule walked backward through the layers, so the shared pieces are computed only once. Lesson 8’s wishes were derivatives all along.