Cheatsheet: The chain rule

d/dx( f(g(x)) ) = f'(g(x)) · g'(x)

Outer derivative (evaluated at the inner function) times inner derivative.

The intuition: rates multiply through a pipeline

x --g--> u = g(x) --f--> f(u)

Nudging x changes u at rate g'(x), and that change drives the output at rate f'(u). The rates compound: total rate = f'(g(x)) · g'(x). Each nested layer contributes one multiplying factor.

The outer derivative is f'(g(x)), not f'(x). For sin(x^2) it is cos(x^2), not cos(x), because sine acts on x^2. This is the number-one chain-rule error.
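As a quick numerical sanity check of the point above, the sketch below compares the correct chain-rule derivative of sin(x^2) with the common mistake of evaluating the outer derivative at x. The helper names (`chain_rule`, `wrong_rule`, `numeric_derivative`) and the test point 1.3 are illustrative choices, not part of the cheatsheet.

```python
import math

def g(x):
    """Inner function: u = g(x) = x^2."""
    return x * x

def g_prime(x):
    return 2 * x

f = math.sin        # outer function f(u) = sin(u)
f_prime = math.cos  # its derivative f'(u) = cos(u)

def chain_rule(x):
    """Correct: outer derivative evaluated at the inner function g(x)."""
    return f_prime(g(x)) * g_prime(x)

def wrong_rule(x):
    """The common error: outer derivative evaluated at x instead of g(x)."""
    return f_prime(x) * g_prime(x)

def numeric_derivative(h, x, eps=1e-6):
    """Central-difference estimate of h'(x)."""
    return (h(x + eps) - h(x - eps)) / (2 * eps)

x = 1.3
numeric = numeric_derivative(lambda t: f(g(t)), x)
print("chain rule:", chain_rule(x))
print("wrong rule:", wrong_rule(x))
print("numeric   :", numeric)
```

The chain-rule value should agree with the numerical estimate to many digits, while the "wrong rule" value should be visibly different.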

Composition | Outer' / inner' | Chain rule | Result
(3x+1)^2 | 2u / 3 | 2(3x+1)·3 | 6(3x+1)
sin(x^2) | cos u / 2x | cos(x^2)·2x | 2x·cos(x^2)
sin(cos x) | cos u / -sin x | cos(cos x)·(-sin x) | -sin(x)·cos(cos x)
e^(2x) | e^u / 2 | e^(2x)·2 | 2e^(2x) (preview of L7)

(3x+1)^2 checks out: expand to 9x^2+6x+1, derivative 18x+6 = 6(3x+1).
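The other rows can be checked the same way without expanding anything. A minimal sketch, assuming a central-difference estimate is accurate enough for a sanity check; the `rows` list, the helper names, and the test point 0.7 are made up for illustration.

```python
import math

# Each entry: (label, composite function, derivative claimed in the table).
rows = [
    ("(3x+1)^2",   lambda x: (3 * x + 1) ** 2,
                   lambda x: 6 * (3 * x + 1)),
    ("sin(x^2)",   lambda x: math.sin(x ** 2),
                   lambda x: 2 * x * math.cos(x ** 2)),
    ("sin(cos x)", lambda x: math.sin(math.cos(x)),
                   lambda x: -math.sin(x) * math.cos(math.cos(x))),
    ("e^(2x)",     lambda x: math.exp(2 * x),
                   lambda x: 2 * math.exp(2 * x)),
]

def central_difference(h, x, eps=1e-6):
    """Numerical estimate of h'(x)."""
    return (h(x + eps) - h(x - eps)) / (2 * eps)

def max_error(x):
    """Largest gap between a table result and the numerical estimate at x."""
    return max(abs(central_difference(fn, x) - deriv(x)) for _, fn, deriv in rows)

for name, fn, deriv in rows:
    print(f"{name:12s} table: {deriv(0.7): .6f}  numeric: {central_difference(fn, 0.7): .6f}")
```

Each pair of printed values should match to roughly six decimal places.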

A neural network is a deep composition of layers. The derivative of the loss with respect to a buried parameter is the chain rule applied layer by layer, which is exactly backpropagation: rates (gradients) multiply backward through the layers. Every major framework (PyTorch, TensorFlow, JAX) implements this as automatic differentiation over the computation graph. It is the most-used calculus rule in ML, applied an astronomical number of times per training step. (Vanishing and exploding gradients come from multiplying many chain-rule factors: if the factors are mostly smaller than 1 in magnitude, the product shrinks toward zero; if they are mostly larger than 1, it blows up.)
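To make the backward multiplication concrete, here is a toy sketch in plain Python, not how any framework actually implements it. It assumes a hypothetical scalar "network" in which each layer computes h_k = tanh(w_k · h_(k-1)) and the output is the last activation; the weights and input are arbitrary illustrative values.

```python
import math

def forward(weights, x):
    """Return all activations h_0..h_n, where h_k = tanh(w_k * h_(k-1))."""
    activations = [x]
    for w in weights:
        activations.append(math.tanh(w * activations[-1]))
    return activations

def backward(weights, activations):
    """Gradient of the final activation with respect to each weight."""
    grads = [0.0] * len(weights)
    upstream = 1.0  # d(output)/d(output)
    for k in reversed(range(len(weights))):
        h_in, h_out = activations[k], activations[k + 1]
        local = 1.0 - h_out ** 2             # tanh'(z) = 1 - tanh(z)^2
        grads[k] = upstream * local * h_in   # chain rule down to w_k
        upstream = upstream * local * weights[k]  # pass one more factor back
    return grads

weights = [0.8, -1.1, 0.5, 1.3]  # illustrative values only
x = 0.6
grads = backward(weights, forward(weights, x))

def output_with_first_weight(w0):
    """Network output as a function of the most deeply buried weight."""
    return forward([w0] + weights[1:], x)[-1]

eps = 1e-6
numeric = (output_with_first_weight(weights[0] + eps)
           - output_with_first_weight(weights[0] - eps)) / (2 * eps)
print("backprop:", grads[0])
print("numeric :", numeric)
```

The gradient for the first weight is a product of one local factor per layer above it, which is why long chains of small or large factors can shrink or inflate it.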

  • Forgetting “evaluated at.” Outer derivative at the inner function: cos(x^2), not cos(x).
  • Dropping the inner derivative. Two factors; do not stop after the outer one.
  • Confusing with the product rule. Functions multiplied together -> product rule (sum of two terms); functions nested -> chain rule (product of rates).
  • Stopping early on deep nests. One factor per layer; peel outside in.
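The "one factor per layer" point can be sketched for a three-layer nest. The example sin(cos(x^2)) is not in the table above; it is chosen here only for illustration, as are the `layers` list and the `nested_derivative` helper.

```python
import math

# Layers listed innermost first, as (function, derivative) pairs.
layers = [
    (lambda x: x * x, lambda x: 2 * x),         # innermost: x^2
    (math.cos,        lambda u: -math.sin(u)),  # middle: cos(u)
    (math.sin,        math.cos),                # outermost: sin(v)
]

def nested_derivative(layers, x):
    """Return (derivative, factors), with factors ordered outermost first."""
    inputs = []
    value = x
    for fn, _ in layers:
        inputs.append(value)  # the point at which this layer's derivative is read
        value = fn(value)
    factors = [deriv(inp) for (_, deriv), inp in zip(layers, inputs)]
    factors.reverse()         # peel from the outside in
    return math.prod(factors), factors

total, factors = nested_derivative(layers, 0.9)
print("factors (outside in):", factors)
print("derivative:", total)
```

For this nest the three factors are cos(cos(x^2)), -sin(x^2), and 2x; stopping after the first two would drop the 2x.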

When functions nest, their rates multiply: one factor per layer, each outer derivative read at the layer below, which is exactly what backpropagation does through a network.