Skip to content

Summary: The chain rule

The product rule handled functions multiplied together; the chain rule handles functions nested one inside another, like sin(x²) or (3x + 1)². It says something simple: rates multiply through a composition. This is the single most-used rule in machine learning, because backpropagation, the algorithm that trains every neural network, is exactly this rule applied through a network’s layers. This is the scan-it-in-five-minutes version.

  • The chain rule: d/dx(f(g(x))) = f'(g(x)) · g'(x), the outer derivative (evaluated at the inner function) times the inner derivative.
  • A composition is a pipeline. x -> g -> u = g(x) -> f -> f(u). A nudge in x drives u at rate g'(x), and that drives the output at rate f'(u). The rates compound (multiply): if u moves 3× as fast and f amplifies by 2×, the output moves 2·3 = 6 times the original rate. Each nested layer contributes one multiplying factor.
  • The “evaluated at” gotcha. The outer derivative is f'(g(x)), not f'(x): the outer function acts on u = g(x), so its rate is read there. For sin(x²) it is cos(x²), not cos(x). This is the number-one chain-rule error.
  • Worked. (3x+1)² gives 2(3x+1)·3 = 6(3x+1) (matches expanding to 18x+6); sin(x²) gives cos(x²)·2x; (sin x)³ gives 3 sin²x cos x (opposite nesting, different answer); sin(cos x) gives -sin x·cos(cos x); and e^(2x) gives 2e^(2x). Deeper nests apply the rule once per layer, one factor each.
  • Product rule vs chain rule. Multiplied (f · g) gives a sum of two terms; nested (f(g(x))) gives a product of rates. Check the structure first.

You gain the rule that, more than any other in calculus, makes machine learning work, and the connection is direct, not analogical. A neural network is a deep composition of functions, layer feeding layer, and training needs the derivative of the loss with respect to every buried parameter, which is the chain rule applied layer by layer. “Rates multiply through a composition” is, word for word, what backpropagation does: it sends the rate of change backward through the network, multiplying in each layer’s contribution. Every framework (PyTorch, TensorFlow, JAX) implements this as automatic differentiation, applied an astronomical number of times per training step. It even explains a famous difficulty: vanishing and exploding gradients are just many chain-rule factors multiplied together, collapsing below 1 or growing above it. (Track 11’s backpropagation lesson is this same rule from the network side.) The next lesson examines the one function that is its own derivative, e^x, and why that makes Euler’s number special.