Cheatsheet: Backpropagation and the chain rule

backpropagation = the chain rule, applied through the layers, run backward
the “how much each knob should change” from Lesson 8 (L8) = a product of per-layer rates

Track 8 teaches the chain rule; Track 11 applies it. Here you need only one line of the chain rule:

df/dx = (df/dg) · (dg/dx)

Rates multiply along a chain. (Nudge x → g moves by dg/dx → f moves by df/dg times that, so responses stack as a product.)
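
A quick numeric check of that one line, as a sketch. The functions g(x) = 3x and f(g) = g² and the helper name slope are illustrative choices, not taken from the lesson.

```python
# Hypothetical example functions (not from the lesson): g(x) = 3x, f(g) = g**2.
def g(x):
    return 3.0 * x


def f(u):
    return u ** 2


def slope(fn, at, h=1e-6):
    """Central-difference estimate of how fast fn changes at the point `at`."""
    return (fn(at + h) - fn(at - h)) / (2 * h)


x = 2.0
dg_dx = slope(g, x)                    # about 3
df_dg = slope(f, g(x))                 # about 2 * g(x) = 12
chained = df_dg * dg_dx                # chain rule: rates multiply
direct = slope(lambda t: f(g(t)), x)   # nudge x and watch f directly

print(chained, direct)                 # both close to 36
```

The two numbers agree up to rounding, which is all the chain rule claims: measuring the nested response directly gives the same rate as multiplying the two per-step rates.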

The cost is nested, one layer inside the next: cost ← output ← weighted sum ← previous activations ← … ← input. A weight’s effect on the cost ripples forward through every layer after it, so its slope is the product of the rates at each of those layers.

Worked chain (one neuron per layer, no squish)

a0=1, w1=2, w2=3, w3=0.5, y=2. Forward: a1=2, a2=6, a3=3, C=(3−2)²=1.

dC/dw1 = (dC/da3)·(da3/da2)·(da2/da1)·(da1/dw1)
= 2 · 0.5 · 3 · 1 = 3

Factor              Value   Meaning
dC/da3 = 2(a3−y)    2       the output’s desire (L8)
da3/da2 = w3        0.5     desire flows back through a weight
da2/da1 = w2        3       and back another layer
da1/dw1 = a0        1       the input feeding this weight

That 3 is one component of the gradient ∇C. Gradient descent (L7) then does w1_new = w1 − learning_rate · 3.
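
The worked chain can be replayed in a few lines, as a sketch. The numbers are the ones given above; learning_rate = 0.1 is an assumed value, since the cheatsheet does not specify one.

```python
# Values from the worked chain: one neuron per layer, no squish.
a0, w1, w2, w3, y = 1.0, 2.0, 3.0, 0.5, 2.0

# Forward pass.
a1 = w1 * a0          # 2
a2 = w2 * a1          # 6
a3 = w3 * a2          # 3
C = (a3 - y) ** 2     # 1

# The four factors of dC/dw1, output side first.
dC_da3 = 2 * (a3 - y)   # 2
da3_da2 = w3            # 0.5
da2_da1 = w2            # 3
da1_dw1 = a0            # 1
dC_dw1 = dC_da3 * da3_da2 * da2_da1 * da1_dw1   # 3

# One gradient-descent step. learning_rate is an assumption for illustration.
learning_rate = 0.1
w1_new = w1 - learning_rate * dC_dw1   # roughly 1.7

print(C, dC_dw1, w1_new)
```

With that assumed rate, rerunning the forward pass with w1_new gives a3 = 2.55 and a cost of about 0.30, down from 1.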

dC/dw3 = 2 · 6 = 12 (2 factors)
dC/dw2 = 2 · 0.5 · 2 = 2 (3 factors)
dC/dw1 = 2 · 0.5 · 3 · 1 = 3 (4 factors)

Every chain shares the output-side factors (dC/da3, then w3, …). Compute backward → calculate each shared factor once → one backward sweep yields every weight’s slope. That is what “backpropagation” names.
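
A minimal sketch of that single backward sweep on the same three-weight chain. The names delta and grads are illustrative: delta carries the running product of shared output-side factors, so each one is multiplied in exactly once.

```python
a0, w1, w2, w3, y = 1.0, 2.0, 3.0, 0.5, 2.0

# Forward pass, keeping every activation for the backward sweep.
a1 = w1 * a0
a2 = w2 * a1
a3 = w3 * a2

# Backward sweep. `delta` holds dC/d(activation) for the current layer.
grads = {}
delta = 2 * (a3 - y)        # dC/da3 = 2
grads["w3"] = delta * a2    # 2 * 6 = 12
delta *= w3                 # dC/da2 = 2 * 0.5 = 1
grads["w2"] = delta * a1    # 1 * 2 = 2
delta *= w2                 # dC/da1 = 1 * 3 = 3
grads["w1"] = delta * a0    # 3 * 1 = 3

print(grads)   # {'w3': 12.0, 'w2': 2.0, 'w1': 3.0}
```

Each weight’s slope costs one extra multiplication on top of the shared delta, rather than a fresh product over the whole chain.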

  • Longer chains: ~5 factors for 4 layers, ~100 for 100 layers. Same rule.
  • Add the squish back: each layer contributes one extra factor (the activation’s slope). Rule unchanged.
  • Many factors smaller than 1 in magnitude multiply to a product that shrinks toward zero → the “vanishing gradient” difficulty in very deep nets.
  • “I must master calculus first.” No. One line: rates multiply. T8 has the depth.
  • “Backprop is separate from the chain rule.” No. It is the chain rule, run backward through the layers.
  • “Direction does not matter.” It does. Working backward computes each shared output-side factor once; working forward, one weight at a time, recomputes them for every weight.
  • “The squish breaks it.” No. Each squish adds one factor (its slope). Same method.
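
The squish and vanishing-gradient points above can be seen together in a small sketch. Everything specific here is an assumption for illustration: the squish is taken to be a sigmoid, every weight is set to 1.0, and a0 = 1, y = 0; the function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(weights, a0):
    """Return [a0, a1, ..., aN], where each layer is a = sigmoid(w * previous a)."""
    acts = [a0]
    for w in weights:
        acts.append(sigmoid(w * acts[-1]))
    return acts


def cost(weights, a0, y):
    return (forward(weights, a0)[-1] - y) ** 2


def grad_first_weight(weights, a0, y):
    """dC/dw1 for C = (aN - y)**2, computed by one backward sweep."""
    acts = forward(weights, a0)
    delta = 2 * (acts[-1] - y)              # dC/daN
    # Walk back from the last layer to layer 2.
    for k in range(len(weights) - 1, 0, -1):
        a = acts[k + 1]                     # output of the layer using weights[k]
        delta *= a * (1 - a)                # extra factor: the sigmoid's slope
        delta *= weights[k]                 # then back through the weight
    a = acts[1]
    return delta * a * (1 - a) * a0         # layer 1: squish slope, then its input


a0, y = 1.0, 0.0                            # assumed values
grads_by_depth = {
    depth: grad_first_weight([1.0] * depth, a0, y) for depth in (1, 5, 20)
}
for depth, grad in grads_by_depth.items():
    print(depth, grad)
```

The method is unchanged from the no-squish chain; each layer simply multiplies in one more factor. Because a sigmoid’s slope never exceeds 0.25, every added layer in this setup contributes a factor of at most 0.25, so the slope for the first weight shrinks rapidly as the chain gets deeper.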

Backpropagation is the chain rule walked backward through the layers, so the shared pieces are computed only once. Lesson 8’s wishes were derivatives all along.