Backpropagation: cheatsheet

The one idea that matters

backpropagation = the chain rule, applied through the layers, run backward
the "how much each knob should change" from L8 = a product of per-layer rates

Track 8 teaches the chain rule; Track 11 applies it. You only need one line of it.

The chain rule in one line

df/dx = (df/dg) · (dg/dx)

Rates multiply along a chain. (Nudge x → g moves by dg/dx → f moves by df/dg times that, so responses stack as a product.)
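
A quick sanity check of the one-liner, as a minimal sketch. The functions g and f_of_g below are hypothetical examples picked only for illustration (they are not from the lesson); the point is that the product of the two rates matches a numerical estimate of the slope.

```python
# Hypothetical example functions, chosen only for illustration.
def g(x):
    return 3.0 * x + 1.0          # dg/dx = 3


def f_of_g(g_val):
    return g_val ** 2             # df/dg = 2 * g


def chain_rule_slope(x):
    """df/dx computed as (df/dg) * (dg/dx)."""
    df_dg = 2.0 * g(x)
    dg_dx = 3.0
    return df_dg * dg_dx


def numeric_slope(x, eps=1e-6):
    """Central finite-difference estimate of df/dx (nudge x, watch f)."""
    return (f_of_g(g(x + eps)) - f_of_g(g(x - eps))) / (2.0 * eps)


x = 2.0
print(chain_rule_slope(x))   # analytic product of rates: 14 * 3 = 42
print(numeric_slope(x))      # numerical estimate, very close to 42
```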

Why a chain at all

The cost is nested, one layer inside the next: cost ← output ← weighted sum ← previous activations ← … ← input. A weight’s effect on the cost ripples forward through every layer after it, so its slope is the product of the rates at each of those layers.

Worked chain (one neuron per layer, no squish)

a0=1, w1=2, w2=3, w3=0.5, y=2. Forward: a1=2, a2=6, a3=3, C=(3−2)²=1.

dC/dw1 = (dC/da3)·(da3/da2)·(da2/da1)·(da1/dw1)
       =    2    ·   0.5   ·    3    ·    1     = 3

Factor	Value	Meaning
dC/da3 = 2(a3−y)	2	the output’s desire (L8)
da3/da2 = w3	0.5	desire flows back through a weight
da2/da1 = w2	3	and back another layer
da1/dw1 = a0	1	the input feeding this weight

That 3 is one component of the gradient ∇C. Gradient descent (L7) then does w1_new = w1 − learning_rate · 3.
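
The worked chain above can be reproduced in a few lines. This is a sketch under the same assumptions (one neuron per layer, no squish, squared-error cost); the learning rate of 0.1 is an assumed value for illustration, not one given in the lesson.

```python
def forward(a0, w1, w2, w3):
    """Forward pass: each activation is weight times previous activation."""
    a1 = w1 * a0
    a2 = w2 * a1
    a3 = w3 * a2
    return a1, a2, a3


def cost(a3, y):
    return (a3 - y) ** 2


a0, w1, w2, w3, y = 1.0, 2.0, 3.0, 0.5, 2.0
a1, a2, a3 = forward(a0, w1, w2, w3)      # 2.0, 6.0, 3.0

# The four factors from the table, multiplied along the chain.
dC_da3 = 2.0 * (a3 - y)    # 2.0
da3_da2 = w3               # 0.5
da2_da1 = w2               # 3.0
da1_dw1 = a0               # 1.0
dC_dw1 = dC_da3 * da3_da2 * da2_da1 * da1_dw1
print(dC_dw1)              # 3.0

# One gradient-descent step on w1 (learning_rate is an assumed value).
learning_rate = 0.1
w1_new = w1 - learning_rate * dC_dw1      # about 1.7

# The cost after the step should be lower than the original cost of 1.
cost_before = cost(a3, y)
cost_after = cost(forward(a0, w1_new, w2, w3)[2], y)
print(cost_before, cost_after)
```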

Why run it backward

dC/dw3 = 2 · 6           = 12 (2 factors)
dC/dw2 = 2 · 0.5 · 2     = 2  (3 factors)
dC/dw1 = 2 · 0.5 · 3 · 1 = 3  (4 factors)

The last factor on each line is the activation feeding that weight: a2=6, a1=2, a0=1.

Every chain shares the output-side factors (dC/da3, then w3, …). Compute backward → calculate each shared factor once → one backward sweep yields every weight’s slope. That is what “backpropagation” names.
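
The single backward sweep can be sketched as follows, assuming the same no-squish, one-neuron-per-layer chain. The name backward_sweep and the variable upstream are hypothetical labels for illustration; upstream holds the shared output-side product, so each shared factor is multiplied in exactly once.

```python
def backward_sweep(a0, weights, y):
    """Return dC/dw for every weight in a linear one-neuron-per-layer chain."""
    # Forward pass: store every activation, since each weight's slope needs
    # the activation that feeds it.
    activations = [a0]
    for w in weights:
        activations.append(w * activations[-1])

    # Backward pass: start from the output's slope and walk toward the input.
    upstream = 2.0 * (activations[-1] - y)        # dC/da_last
    grads = [0.0] * len(weights)
    for i in reversed(range(len(weights))):
        grads[i] = upstream * activations[i]      # slope for this layer's weight
        upstream *= weights[i]                    # pass the slope back through the weight
    return grads


grads = backward_sweep(1.0, [2.0, 3.0, 0.5], 2.0)
print(grads)   # [3.0, 2.0, 12.0] -> dC/dw1, dC/dw2, dC/dw3
```

The loop does one multiplication per layer to extend the shared product, rather than rebuilding each chain from the output every time.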

In a real network

Longer chains: ~5 factors for 4 layers, ~100 for 100 layers. Same rule.
Add the squish back: each layer contributes one extra factor (the activation’s slope). Rule unchanged.
Many small factors multiplied can shrink toward zero → the “vanishing gradient” difficulty in very deep nets.
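
A small sketch of the last two points, under stated assumptions: the squish is taken to be the sigmoid (the lesson does not name one here), every weight is set to 1, and the function name input_to_output_rate is hypothetical. Each layer contributes its weight times the sigmoid's slope, and since that slope is never above 0.25, the product shrinks quickly with depth.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_slope(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # never larger than 0.25


def input_to_output_rate(depth, w=1.0, a0=0.5):
    """Rate of change of the last activation with respect to a0
    for the chain a_k = sigmoid(w * a_{k-1})."""
    a = a0
    rate = 1.0
    for _ in range(depth):
        z = w * a
        rate *= sigmoid_slope(z) * w   # extra squish factor times the weight factor
        a = sigmoid(z)
    return rate


for depth in (1, 4, 10, 20):
    print(depth, input_to_output_rate(depth))   # shrinks rapidly as depth grows
```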

Pitfalls to dodge

“I must master calculus first.” No. One line: rates multiply. Track 8 has the depth.
“Backprop is separate from the chain rule.” No. It is the chain rule, run backward through the layers.
“Direction does not matter.” It does. Backward reuses shared output-side factors; forward recomputes them.
“The squish breaks it.” No. Each squish adds one factor (its slope). Same method.

The one-line version

Backpropagation is the chain rule walked backward through the layers, so the shared pieces are computed only once. Lesson 8’s wishes were derivatives all along.