Summary: The chain rule
The product rule handled functions multiplied together; the chain rule handles functions nested one inside another, like sin(x²) or (3x + 1)². It says something simple: rates multiply through a composition. This is the single most-used rule in machine learning, because backpropagation, the algorithm that trains every neural network, is exactly this rule applied through a network’s layers. This is the scan-it-in-five-minutes version.
Core ideas
Section titled “Core ideas”- The chain rule:
d/dx(f(g(x))) = f'(g(x)) · g'(x), the outer derivative (evaluated at the inner function) times the inner derivative. - A composition is a pipeline.
x -> g -> u = g(x) -> f -> f(u). A nudge inxdrivesuat rateg'(x), and that drives the output at ratef'(u). The rates compound (multiply): ifumoves 3× as fast andfamplifies by 2×, the output moves2·3 = 6times the original rate. Each nested layer contributes one multiplying factor. - The “evaluated at” gotcha. The outer derivative is
f'(g(x)), notf'(x): the outer function acts onu = g(x), so its rate is read there. Forsin(x²)it iscos(x²), notcos(x). This is the number-one chain-rule error. - Worked.
(3x+1)²gives2(3x+1)·3 = 6(3x+1)(matches expanding to18x+6);sin(x²)givescos(x²)·2x;(sin x)³gives3 sin²x cos x(opposite nesting, different answer);sin(cos x)gives-sin x·cos(cos x); ande^(2x)gives2e^(2x). Deeper nests apply the rule once per layer, one factor each. - Product rule vs chain rule. Multiplied (
f · g) gives a sum of two terms; nested (f(g(x))) gives a product of rates. Check the structure first.
What changes for you
Section titled “What changes for you”You gain the rule that, more than any other in calculus, makes machine learning work, and the connection is direct, not analogical. A neural network is a deep composition of functions, layer feeding layer, and training needs the derivative of the loss with respect to every buried parameter, which is the chain rule applied layer by layer. “Rates multiply through a composition” is, word for word, what backpropagation does: it sends the rate of change backward through the network, multiplying in each layer’s contribution. Every framework (PyTorch, TensorFlow, JAX) implements this as automatic differentiation, applied an astronomical number of times per training step. It even explains a famous difficulty: vanishing and exploding gradients are just many chain-rule factors multiplied together, collapsing below 1 or growing above it. (Track 11’s backpropagation lesson is this same rule from the network side.) The next lesson examines the one function that is its own derivative, e^x, and why that makes Euler’s number special.