Cheatsheet: The chain rule
The rule
Section titled “The rule”d/dx( f(g(x)) ) = f'(g(x)) · g'(x)Outer derivative (evaluated at the inner function) times inner derivative.
The intuition: rates multiply through a pipeline
Section titled “The intuition: rates multiply through a pipeline”x --g--> u = g(x) --f--> f(u)Nudging x changes u at rate g'(x), and that change drives the output at rate f'(u). The rates compound: total rate = f'(g(x)) · g'(x). Each nested layer contributes one multiplying factor.
The gotcha: “evaluated at”
Section titled “The gotcha: “evaluated at””The outer derivative is f'(g(x)), not f'(x). For sin(x^2) it is cos(x^2), not cos(x), because sine acts on x^2. This is the number-one chain-rule error.
Worked examples
Section titled “Worked examples”| Composition | Outer ’ / inner ‘ | Chain rule | Result |
|---|---|---|---|
(3x+1)^2 | 2u / 3 | 2(3x+1)·3 | 6(3x+1) |
sin(x^2) | cos u / 2x | cos(x^2)·2x | 2x·cos(x^2) |
sin(cos x) | cos u / -sin x | cos(cos x)·(-sin x) | -sin(x)·cos(cos x) |
e^(2x) | e^u / 2 | e^(2x)·2 | 2e^(2x) (preview of L7) |
(3x+1)^2 checks out: expand to 9x^2+6x+1, derivative 18x+6 = 6(3x+1).
Why it matters for AI (the big one)
Section titled “Why it matters for AI (the big one)”A neural network is a deep composition of layers. The derivative of the loss with respect to a buried parameter is the chain rule applied layer by layer, which is exactly backpropagation: rates (gradients) multiply backward through the layers. Every framework (PyTorch, TensorFlow, JAX) implements this as automatic differentiation over the computation graph. It is the most-used calculus rule in ML, applied an astronomical number of times per training step. (Vanishing/exploding gradients = many chain-rule factors multiplied, collapsing below 1 or growing above it.)
Pitfalls to dodge
Section titled “Pitfalls to dodge”- Forgetting “evaluated at.” Outer derivative at the inner function:
cos(x^2), notcos(x). - Dropping the inner derivative. Two factors; do not stop after the outer one.
- Confusing with the product rule. Multiplied -> sum of two terms; nested -> product of rates.
- Stopping early on deep nests. One factor per layer; peel outside in.
The one-line version
Section titled “The one-line version”When functions nest, their rates multiply: one factor per layer, each outer derivative read at the layer below, which is exactly what backpropagation does through a network.