The chain rule: rates multiply

Last lesson handled functions multiplied together, f times g, with the product rule. This lesson handles the other way functions combine: nested one inside another, like sine of x squared or the quantity 3x plus 1, squared, where one function’s output is fed into another. The rule for this is the chain rule, and it is worth caring about beyond calculus class for one reason: it is the most-used calculus rule in machine learning. Backpropagation, the algorithm used to train nearly every modern neural network, is the chain rule applied through the layers of a network. Get this one solid.

Here is the rule:

d/dx( f(g(x)) ) = f'(g(x)) · g'(x)

The derivative of the composition equals the outer function’s derivative (evaluated at the inner function) times the inner function’s derivative. Two pieces, multiplied. This lesson explains why they multiply and why the “evaluated at the inner function” part is the step everyone trips on.

Nested functions are sequential transformations

Read f of g of the input as a two-step pipeline. The input goes into the inner function g, which produces an intermediate value; call it u. Then u goes into the outer function f, which produces the final output. Two transformations in a row: the input becomes the intermediate value, which becomes the output.

x  --g-->  u = g(x)  --f-->  f(u)

Now ask how the final output changes when you nudge the input. The nudge first passes through the inner function g, which changes the intermediate value at the inner function’s rate of change. Then that change in the intermediate value passes through the outer function f, which changes the output at the outer function’s rate of change. The total effect is the two rates multiplied: the rate of the output with respect to the input equals the rate of the outer function with respect to the intermediate value, times the rate of the intermediate value with respect to the input.

Why the rates multiply

The multiplication is a compounding of speeds. Suppose the intermediate value changes three times as fast as the input (the inner derivative is 3), and suppose the outer function changes its output twice as fast as its own input changes (the outer derivative is 2). Then a nudge in the input moves the intermediate value 3 times as far, and that motion in the intermediate value moves the output 2 times as far again, so the output moves 2 times 3, which is 6, times as far as the original nudge. Each transformation in the chain contributes its own multiplying factor, and the factors compound. That is all the chain rule is: the rate through a composition is the product of the rates of each step.

Put numbers on a real composition. Take the quantity 3x plus 1, squared, at an input of 1. The inner function, 3x plus 1, equals 4 there and changes at a rate of 3. The outer square has a derivative of twice the inner function, which at an intermediate value of 4 is 8. Multiply the two stage-rates: 8 times 3 is 24. Check it directly: the quantity 3x plus 1, squared, expands to 9 x squared plus 6x plus 1, which has derivative 18x plus 6, which at an input of 1 is 24. The two factors, 8 and 3, are exactly the rates of the outer and inner steps, and their product is the rate of the whole composition.
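The same arithmetic can be checked in a few lines of Python. This is a minimal sketch, not part of the lesson's method: the names inner, outer, and h are chosen here for illustration, and the nudge size is an assumption that happens to work well for this function.

```python
def inner(x):
    return 3 * x + 1      # inner step: g(x) = 3x + 1

def outer(u):
    return u ** 2         # outer step: f(u) = u squared

x = 1.0
u = inner(x)              # intermediate value: 4.0
inner_rate = 3.0          # g'(x) = 3 everywhere
outer_rate = 2 * u        # f'(u) = 2u, read at u = 4, so 8.0
chain_rule_rate = outer_rate * inner_rate   # 8 * 3

# Independent check: nudge the input a little each way and measure the slope.
h = 1e-6                  # illustrative nudge size (an assumption)
numeric_rate = (outer(inner(x + h)) - outer(inner(x - h))) / (2 * h)

print(chain_rule_rate)    # 24.0
print(numeric_rate)       # very close to 24
```

The measured slope agrees with the product of the two stage-rates, which is the point of the paragraph above: no expansion of the square was needed.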

Read it left to right: a small nudge in x is first amplified by the inner rate g'(x), turning into a nudge in u = g(x); that nudge is then amplified by the outer rate f'(u). Two amplifications in a row multiply, so the total rate of change is f'(g(x)) times g'(x). That product is the chain rule.

The step everyone trips on: “evaluated at”

The outer factor is the outer function’s derivative evaluated at the inner function’s value, g(x), not at the bare input x. Evaluating it at the bare input is the classic chain-rule mistake, and it matters. The outer function f does its work on the intermediate value produced by the inner function, so its rate of change must be measured there, at that intermediate value, not back at the input. When you write the outer derivative, you plug the whole inner function into it.

Concretely, for sine of x squared: the outer function is sine, whose derivative is cosine, but it is cosine of x squared, not cosine of the input, because the sine is acting on x squared. Forgetting to evaluate the outer derivative at the inner function is the single most common chain-rule error. Keep the pipeline picture in mind: the outer function lives downstream, operating on the intermediate value, so its rate is read at that intermediate value.
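To see how far off the mistake lands, here is a small Python sketch that compares the correct formula and the mistaken one against a measured slope. The helper name slope, the test point 1.5, and the nudge size are all illustrative assumptions.

```python
import math

def composed(x):
    return math.sin(x ** 2)          # sine of x squared

def slope(func, x, h=1e-6):
    """Approximate the derivative with a small symmetric nudge."""
    return (func(x + h) - func(x - h)) / (2 * h)

x = 1.5
correct = math.cos(x ** 2) * 2 * x   # outer derivative read at x squared
mistake = math.cos(x) * 2 * x        # outer derivative read at the bare input

measured = slope(composed, x)
print("correct formula:", correct)
print("mistaken formula:", mistake)
print("measured slope:", measured)
```

At this input the mistaken version does not even get the sign right, while the correct version matches the measured slope to many decimal places.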

Worked examples

A polynomial composition. Differentiate the quantity 3x plus 1, squared. The outer function is “square it” (derivative twice the inner function), the inner is 3x plus 1 (derivative 3). The chain rule gives:

d/dx( (3x+1)^2 ) = 2(3x+1) · 3 = 6(3x+1)

Check it by expanding first: the quantity 3x plus 1, squared, equals 9 x squared plus 6x plus 1, whose derivative is 18x plus 6, which factors as 6 times the quantity 3x plus 1. The two routes agree, and the chain rule got there without expanding.

A trig function composed with a power. Differentiate sine of x squared. Outer is sine (derivative cosine, from the trig lesson), inner is x squared (derivative 2x, from the power rule). The chain rule gives:

d/dx( sin(x^2) ) = cos(x^2) · 2x = 2x·cos(x^2)

Notice the outer derivative is cosine of x squared, evaluated at the inner function x squared, not cosine of the input. Three earlier lessons cooperate in one line: the chain rule, the trig derivative, and the power rule.

A power of a function. Differentiate sine of the input, cubed. Here the power is on the outside and the function is inside, the mirror image of sine of x squared. Outer is “cube it” (derivative three times the inner function squared), inner is sine of the input (derivative cosine of the input):

d/dx( (sin(x))^3 ) = 3(sin(x))^2 · cos(x) = 3·sin^2(x)·cos(x)

Compare with sine of x squared, where sine was the outer function and the power was inside. Same two ingredients, opposite nesting, different answers. Identifying which function is outer and which is inner is the first move every time, before you reach for any derivative.
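One way to make "identify outer and inner first" concrete is to write the rule as a helper that takes exactly those pieces. The sketch below is an illustration only; the helper chain_rule and its argument names are invented here, and the two examples reuse the formulas worked out above.

```python
import math

def chain_rule(d_outer, inner, d_inner):
    """Return the derivative of outer(inner(x)) as a new function.

    d_outer: derivative of the outer function
    inner:   the inner function
    d_inner: derivative of the inner function
    """
    return lambda x: d_outer(inner(x)) * d_inner(x)

# sine of x squared: outer is sine, inner is x squared
d_sin_of_square = chain_rule(math.cos, lambda x: x ** 2, lambda x: 2 * x)

# sine of x, cubed: outer is "cube it", inner is sine
d_sin_cubed = chain_rule(lambda u: 3 * u ** 2, math.sin, math.cos)

x = 0.8
print(d_sin_of_square(x))   # matches 2x * cos(x^2)
print(d_sin_cubed(x))       # matches 3 * sin^2(x) * cos(x)
```

Swapping which function is passed as outer and which as inner changes the result, which is why the identification step has to come first.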

A trig function inside a trig function. Differentiate sine of cosine of the input, where cosine is nested inside sine. Outer is sine (derivative cosine), inner is cosine of the input (derivative negative sine of the input):

d/dx( sin(cos(x)) ) = cos(cos(x)) · (-sin(x)) = -sin(x)·cos(cos(x))

The outer derivative, cosine of cosine of the input, is evaluated at the inner cosine of the input. For deeper nests, you apply the chain rule once per layer, multiplying a rate for each step in the pipeline. The mechanical recipe: differentiate the outermost function (evaluated at everything inside it), multiply by the derivative of what was inside, and repeat, peeling one layer at a time until you reach the bare input. Each peel contributes exactly one factor to the product.
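The one-factor-per-layer recipe can be sketched as a loop. This version is an assumption-laden illustration, not a standard library routine: the function name chain_derivative and the list-of-pairs layout are invented here. It walks the pipeline from the inside out rather than peeling from the outside in, which collects the same factors in the opposite order and gives the same product.

```python
import math

def chain_derivative(layers, x):
    """Derivative of a nested composition at x.

    layers: list of (function, derivative) pairs, innermost first.
    """
    value = x      # the value flowing through the pipeline
    rate = 1.0     # the running product of per-layer rates
    for func, d_func in layers:
        rate *= d_func(value)   # this layer's rate, read at its own input
        value = func(value)     # pass the value on to the next layer
    return rate

# sine of cosine of x: inner layer is cosine, outer layer is sine
sin_of_cos = [
    (math.cos, lambda u: -math.sin(u)),
    (math.sin, math.cos),
]

x = 0.7
from_loop = chain_derivative(sin_of_cos, x)
from_formula = -math.sin(x) * math.cos(math.cos(x))
print(from_loop, from_formula)
```

Adding a third pair to the list handles a three-layer nest with no other change, since each layer only ever contributes one more factor.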

A preview of Euler’s number. Differentiate Euler’s number raised to 2x. Accept for now (the next lesson proves it) that Euler’s number raised to the input is its own derivative. Then the outer function, Euler’s number raised to the intermediate value, has itself as its derivative, and the inner 2x has derivative 2. So the derivative of Euler’s number raised to 2x is Euler’s number raised to 2x, times 2, which is 2 times Euler’s number raised to 2x. The chain rule is what lets exponentials with a rate baked into the exponent differentiate cleanly: the rate comes down as a factor, so the function’s rate of change is proportional to its current value, which is why exponentials model growth and decay so well.
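The claim can be checked without assuming anything about the derivative of the exponential, by measuring slopes directly. In this sketch, math.exp stands in for Euler's number raised to the input, and the helper slope, the sample inputs, and the nudge size are illustrative assumptions.

```python
import math

def growth(x):
    return math.exp(2 * x)           # Euler's number raised to 2x

def slope(func, x, h=1e-6):
    """Approximate the derivative with a small symmetric nudge."""
    return (func(x + h) - func(x - h)) / (2 * h)

# If the derivative is 2 times the function itself, this ratio is 2 everywhere.
ratios = [slope(growth, x) / growth(x) for x in (0.0, 1.0, 2.0)]
print(ratios)
```

The ratio of slope to value stays at about 2 at every sampled input, which is the factor of 2 that the chain rule pulls out of the exponent.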

Why this matters when you use AI

This is the rule, more than any other in calculus, that makes machine learning work, and the connection is direct rather than analogical.

A neural network is a deep composition of functions: the input passes through layer one, whose output passes through layer two, and so on through dozens or hundreds of layers, each a function of the one before. Training requires the derivative of the final loss with respect to every parameter buried in those layers. That is a derivative of a deeply nested composition, and computing it is exactly the chain rule applied layer by layer. The rule’s “rates multiply through a composition” is, word for word, what backpropagation does: it sends the rate of change backward through the network, multiplying in each layer’s contribution as it goes. (Track 11’s lesson on backpropagation is this same rule, viewed from the network side; the rate-multiplication here is what “gradients flow backward through the layers” means there.)
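To make that concrete, here is a toy two-layer "network" with one number per layer, differentiated by hand with the chain rule and then checked by nudging each weight. Everything in it is a hypothetical illustration: the architecture, the tanh activation, the squared-error loss, and the specific numbers are assumptions chosen to keep the sketch short, not a description of any real framework.

```python
import math

def forward(w1, w2, x, target):
    z = w1 * x                   # layer one: scale the input
    a = math.tanh(z)             # layer one: squash with tanh
    y = w2 * a                   # layer two: scale again
    loss = (y - target) ** 2     # squared error
    return z, a, y, loss

def backward(w1, w2, x, target):
    """Send the rate of change backward, one chain-rule factor per step."""
    z, a, y, loss = forward(w1, w2, x, target)
    d_loss_d_y = 2 * (y - target)             # outermost step
    d_loss_d_w2 = d_loss_d_y * a              # y = w2 * a, so dy/dw2 = a
    d_loss_d_a = d_loss_d_y * w2              # y = w2 * a, so dy/da = w2
    d_loss_d_z = d_loss_d_a * (1 - a ** 2)    # derivative of tanh(z) is 1 - tanh(z)^2
    d_loss_d_w1 = d_loss_d_z * x              # z = w1 * x, so dz/dw1 = x
    return d_loss_d_w1, d_loss_d_w2

w1, w2, x, target = 0.5, -1.5, 2.0, 1.0
grad_w1, grad_w2 = backward(w1, w2, x, target)

# Check each gradient by nudging one weight and re-running the forward pass.
h = 1e-6
numeric_w1 = (forward(w1 + h, w2, x, target)[3] - forward(w1 - h, w2, x, target)[3]) / (2 * h)
numeric_w2 = (forward(w1, w2 + h, x, target)[3] - forward(w1, w2 - h, x, target)[3]) / (2 * h)

print(grad_w1, numeric_w1)
print(grad_w2, numeric_w2)
```

Each line of backward multiplies the rate received from the layer above by one local derivative, which is the rate-multiplication described in the paragraph above carried out in reverse order.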

The major deep learning frameworks, including PyTorch, TensorFlow, and JAX, implement automatic differentiation, which is the chain rule applied programmatically across a model’s computation graph. On a large model, the chain rule is applied an astronomical number of times per training step. It is, by a wide margin, the most-used calculus rule in the field. The reason long chains can make gradients explode or vanish, a central difficulty in training deep networks, is also this rule: a product of many factors grows very large if the factors are consistently bigger than 1 in size, and shrinks toward zero if they are consistently smaller than 1.
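The size of the effect is easy to underestimate. The sketch below multiplies the same per-layer rate through 50 layers; the rates 1.5, 0.5, and 1.0 and the depth of 50 are made-up numbers for illustration, and real networks have a different factor at every layer.

```python
def product_of_rates(rate, layers):
    """Multiply one per-layer rate through a chain of the given depth."""
    total = 1.0
    for _ in range(layers):
        total *= rate
    return total

growing = product_of_rates(1.5, 50)    # every factor a bit above 1
shrinking = product_of_rates(0.5, 50)  # every factor a bit below 1
steady = product_of_rates(1.0, 50)     # every factor exactly 1

print(growing)     # hundreds of millions
print(shrinking)   # less than a trillionth
print(steady)      # 1.0
```

Factors that look harmless one layer at a time compound into a gradient that is either enormous or effectively zero by the time it reaches the early layers.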

Common pitfalls

Forgetting “evaluated at the inner function.” The outer derivative is the outer function’s derivative evaluated at the inner function, not evaluated at the bare input. For sine of x squared it is cosine of x squared, not cosine of the input. The outer function acts on the inner function’s output, so its rate is read there. This is the chain rule’s number-one error.

Dropping the inner derivative. The chain rule has two factors. Writing the derivative of sine of x squared as just cosine of x squared and stopping forgets the times 2x. Every layer of nesting contributes a factor; miss one and the answer is wrong.

Confusing it with the product rule. The product rule is for functions multiplied (f times g); the chain rule is for functions nested (f of g of the input). Multiplied gives a sum of two terms; nested gives a product of rates. Check which structure you actually have before reaching for a rule.

Applying it to a single layer when there are several. For sine of cosine of x squared, meaning sin(cos(x^2)), there are three layers, so three factors: cos(cos(x^2)) times -sin(x^2) times 2x. Peel from the outside in, one chain-rule factor per layer, and do not stop early.
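The product-rule confusion in particular is easy to demonstrate with the same two ingredients, sine and x squared, combined both ways. This is a sketch for illustration: the test point 1.2, the helper slope, and the nudge size are assumptions.

```python
import math

def slope(func, x, h=1e-6):
    """Approximate the derivative with a small symmetric nudge."""
    return (func(x + h) - func(x - h)) / (2 * h)

def square(x):
    return x ** 2

def d_square(x):
    return 2 * x

x = 1.2

# Multiplied: sin(x) * x^2 uses the product rule, a sum of two terms.
product_rule = math.cos(x) * square(x) + math.sin(x) * d_square(x)
measured_product = slope(lambda t: math.sin(t) * square(t), x)

# Nested: sin(x^2) uses the chain rule, a product of two rates.
chain_rule = math.cos(square(x)) * d_square(x)
measured_nested = slope(lambda t: math.sin(square(t)), x)

print(product_rule, measured_product)
print(chain_rule, measured_nested)
```

Each rule matches the measured slope of its own structure and not the other, so checking whether the functions are multiplied or nested has to come before choosing a rule.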

What you should remember

The chain rule says the derivative of f of g of the input equals the outer function’s derivative evaluated at the inner function, times the inner function’s derivative. It comes from reading a composition as a pipeline of sequential transformations whose rates multiply.

Rates multiply through a composition. If the inner step changes the intermediate value at its own rate and the outer step amplifies that change at its rate, the whole composition changes at the outer rate times the inner rate. Each nested layer contributes one factor; deeper nests multiply more factors.

The classic error is “evaluated at”: the outer derivative is taken at the inner function’s value, so the derivative of sine of x squared is cosine of x squared, times 2x, with cosine of x squared, not cosine of the input. This rule is the engine of backpropagation, the chain rule run through a network’s layers, and the most-used calculus rule in machine learning.

When functions nest, their rates multiply: one factor per layer, each outer derivative read at the layer below it. That single idea scales from sine of x squared to a hundred-layer network, where it goes by the name backpropagation. The next lesson examines the one function that is its own derivative, Euler’s number raised to the input, and why that makes Euler’s number special.