The power rule, from geometry

Last lesson, computing the derivatives of t squared and t cubed from scratch produced two answers: the derivative of t squared is 2t, and the derivative of t cubed is 3 t squared. Lined up, they show a pattern. The exponent comes down to the front as a multiplier, and the exponent itself drops by one:

d/dt(t^2) = 2·t^1
d/dt(t^3) = 3·t^2

The conjecture writes itself: the derivative of t to the n is n times t to the n minus one. This is the power rule, the single most-used fact in differentiation. But a pattern guessed from two cases is not understanding. This lesson shows why it is true, by reasoning about geometry rather than expanding binomials, and the reason is so clean you will never need to grind through the binomial t plus the nudge, all raised to the n, again.

Why the derivative of t² is 2t: a growing square

Picture t squared as the literal area of a square with side length t. Now nudge the side out by a tiny amount, the nudge (written dt in the formulas), making it t plus the nudge. How much area did you add?

The square grew along two edges. You added a thin strip along the top (length t, width the nudge, area t times the nudge) and an identical strip along the right side (another t times the nudge). Where those two strips meet in the corner, you also added a tiny square of side the nudge, area the nudge squared. So the total area added is:

added area = 2·(t · dt) + dt^2 = 2t·dt + dt^2

The rate of change is the added area divided by the nudge: 2t plus the nudge. That leftover nudge is the corner’s share (the nudge squared divided by the nudge), a tiny square against two long strips, and as the nudge shrinks to zero it vanishes, leaving 2t. The derivative of t squared is 2t, and now you can see where the pieces come from: the 2 is the two strips that grew, and the t is the length of each strip.

Why is the corner safe to drop? Compare sizes. The two strips together add 2t times the nudge, an amount proportional to the nudge. The corner adds the nudge squared, the product of two shrinking quantities. When t is 1 and the nudge is 0.001, each strip contributes 0.001 while the corner contributes 0.000001, a thousand times smaller, and the gap only widens as the nudge shrinks further. In the limit the corner’s share of the rate is exactly zero. Discarding the higher-power nudge terms is not sloppiness; it is the precise statement that they vanish faster than the term you keep.

Grow a square of side t by a small dt. You add two thin strips (one on top, one on the right), each of area t times dt, plus a tiny dt-by-dt corner square. As dt shrinks, the corner shrinks faster than the strips do, so it vanishes from the rate. The derivative of t² is 2t, two strips' worth of growth per unit of new side.
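The strip-and-corner bookkeeping is easy to check with numbers. The sketch below is only an illustration: the function name `added_area_pieces` and the sample values of t and dt are assumptions chosen for the demonstration, not part of the lesson.

```python
def added_area_pieces(t, dt):
    """Split the area gained by a square of side t, nudged by dt, into strips and corner."""
    strips = 2 * t * dt   # two thin strips, each t by dt
    corner = dt * dt      # the small dt-by-dt square where the strips meet
    return strips, corner


t = 3.0
for dt in (0.1, 0.01, 0.001):
    strips, corner = added_area_pieces(t, dt)
    exact = (t + dt) ** 2 - t ** 2   # true added area
    rate = exact / dt                # added area per unit of nudge
    # strips + corner matches the exact added area (up to float rounding),
    # and the rate approaches 2 * t as dt shrinks.
    print(f"dt={dt}: strips={strips:.6f} corner={corner:.6f} rate={rate:.6f}")
```

Each time dt shrinks by a factor of ten, the strips shrink by ten but the corner shrinks by a hundred, which is the size comparison the argument relies on.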

Why the derivative of t³ is 3t²: a growing cube

The same picture, one dimension up. Let t cubed be the volume of a cube with side t, and nudge the side to t plus the nudge. The cube grows by adding a thin slab on each of three faces, each slab a square of area t squared and thickness the nudge, so each adds t squared times the nudge. Three faces means 3 times t squared times the nudge. There are also slivers along the edges and a tiny corner cube, but those involve the nudge squared and the nudge cubed, which vanish faster than the slabs as the nudge shrinks. So:

added volume ≈ 3·(t^2 · dt) = 3t^2·dt   (plus terms that vanish)

Divide by the nudge and let the nudge go to zero: the derivative of t cubed is 3 t squared. Again the structure is visible: the 3 is the three faces that grew, and the t squared is the area of each face.

One dimension up: a cube of side t grows by dt on three faces, adding three thin slabs of area t² and thickness dt. To leading order, the added volume is 3 times t² times dt, so the derivative of t³ is 3t². The geometric pattern is identical to the square: n thin pieces of one less dimension, where n is the power.
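The cube's added volume can be split the same way. In this sketch the function name `added_volume_pieces` and the sample values are illustrative assumptions; the three pieces are the face slabs, the edge slivers, and the corner cube described above.

```python
def added_volume_pieces(t, dt):
    """Split the volume gained by a cube of side t, nudged by dt, into its pieces."""
    slabs = 3 * t ** 2 * dt    # three face slabs, each t by t by dt
    slivers = 3 * t * dt ** 2  # three edge slivers, each t by dt by dt
    corner = dt ** 3           # one dt by dt by dt corner cube
    return slabs, slivers, corner


t = 2.0
for dt in (0.1, 0.01, 0.001):
    slabs, slivers, corner = added_volume_pieces(t, dt)
    exact = (t + dt) ** 3 - t ** 3   # true added volume
    # The three pieces account for all of the added volume (up to float rounding);
    # only the slabs are first-order in dt, so exact / dt approaches 3 * t**2.
    print(f"dt={dt}: pieces={slabs + slivers + corner:.9f} exact={exact:.9f} rate={exact / dt:.6f}")
```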

The general power rule

Now the pattern means something. For t to the n, picture an n-dimensional cube with one corner held fixed. Nudge the side by the nudge, and it grows by a slab on each of the n faces that move outward (the others stay put, just as the square grew along only two of its four edges), each slab having the “size” of one face, t to the n minus one, times the thickness the nudge. That gives n times t to the n minus one times the nudge of new content, plus higher-order pieces that vanish, and dividing by the nudge leaves:

d/dt(t^n) = n · t^(n-1)

The n counts the faces that grow when you bump a single dimension; the t to the n minus one is the size of each face. You cannot draw a 5-dimensional cube, but the algebra of “how many faces, how big each” carries over exactly, which is the payoff of having a reason rather than a memorized rule.
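You cannot draw the higher-dimensional cubes, but you can compare the rule against a direct "nudge and divide" computation. This is a numerical sanity check, not a proof; the helper names `difference_quotient` and `power_rule` and the choice of t and dt are assumptions made for the example.

```python
def difference_quotient(n, t, dt=1e-6):
    """Added content of t**n after a nudge dt, divided by the nudge."""
    return ((t + dt) ** n - t ** n) / dt


def power_rule(n, t):
    """The rule's prediction: n faces, each of size t**(n - 1)."""
    return n * t ** (n - 1)


t = 1.5
for n in range(1, 7):
    # The two columns agree to several decimal places; the small gap is the
    # higher-order "corner" content, which shrinks with dt.
    print(n, difference_quotient(n, t), power_rule(n, t))
```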

The rule is not limited to whole-number powers. It holds for negative and fractional exponents too, although the cube picture no longer applies there and the proof takes other tools. For one over t (with t not zero), which is t to the negative one, the rule gives negative one times t to the negative two, or negative one over t squared (a negative derivative, since one over t decreases as t grows). For the square root of t (with t positive), which is t to the one-half, it gives one-half times t to the negative one-half, or one over two root t. One rule, every power.
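Since there is no cube to picture for these exponents, a quick numerical comparison is reassuring. The sketch below estimates both slopes with a symmetric nudge; the helper `slope` and the sample point t = 4 are assumptions for illustration, and t is kept positive so the square root is defined.

```python
import math


def slope(f, t, dt=1e-6):
    """Central-difference estimate of the rate of change of f at t."""
    return (f(t + dt) - f(t - dt)) / (2 * dt)


def reciprocal(t):
    return 1 / t


t = 4.0
reciprocal_slope = slope(reciprocal, t)   # power rule predicts -1 / t**2
root_slope = slope(math.sqrt, t)          # power rule predicts 1 / (2 * sqrt(t))

print(reciprocal_slope, -1 / t ** 2)
print(root_slope, 1 / (2 * math.sqrt(t)))
```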

Two rules that make derivatives composable

The power rule handles single powers. Two more rules let you differentiate any combination of them, and both are intuitive.

The constant-multiple rule. Scaling a function by a constant just scales its rate of change by the same constant:

d/dt( c · f(t) ) = c · d/dt( f(t) )

Geometrically, multiplying a function by a constant stretches its graph vertically by that constant, which stretches every slope by it too. If a function changes at some rate, then five times that function changes five times as fast.

The sum rule. The rate of change of a sum is the sum of the rates of change:

d/dt( f(t) + g(t) ) = d/dt( f(t) ) + d/dt( g(t) )

If two quantities each grow at their own rate, the rate of their total is just the two rates added. Stacking one changing quantity on another adds their slopes.
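Both linearity rules can be checked numerically. In this sketch the functions `f` and `g`, the constant `c`, and the helper `slope` are all illustrative choices, not anything fixed by the lesson.

```python
def slope(func, t, dt=1e-6):
    """Central-difference estimate of the rate of change of func at t."""
    return (func(t + dt) - func(t - dt)) / (2 * dt)


def f(t):
    return t ** 3


def g(t):
    return t ** 2


c = 5.0
t = 2.0

# Constant-multiple rule: the slope of c * f should equal c times the slope of f.
scaled = slope(lambda x: c * f(x), t)
print(scaled, c * slope(f, t))

# Sum rule: the slope of f + g should equal the slope of f plus the slope of g.
summed = slope(lambda x: f(x) + g(x), t)
print(summed, slope(f, t) + slope(g, t))
```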

Putting it together

With those three rules, differentiating a polynomial becomes a fast, mechanical scan, no binomial expansion anywhere. Take 3 t to the fourth plus 2 t squared minus 7:

d/dt(3t^4)  = 3 · 4t^3 = 12t^3     (power rule + constant multiple)
d/dt(2t^2)  = 2 · 2t   = 4t        (power rule + constant multiple)
d/dt(-7)    = 0                    (a constant never changes, so its rate is 0)

Sum the pieces (sum rule): the derivative of 3 t to the fourth plus 2 t squared minus 7 is 12 t cubed plus 4t. What took a page of binomial expansion last lesson is now three quick lines. The constant negative 7 has derivative zero, which makes sense: a constant has no rate of change, and shifting a graph up or down does not change any of its slopes.
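The mechanical scan is simple enough to write as a few lines of code. This is a minimal sketch under one assumed representation: a polynomial is stored as a list where position k holds the coefficient of t^k. The function name `differentiate_polynomial` is hypothetical.

```python
def differentiate_polynomial(coefficients):
    """Differentiate a polynomial given as [c0, c1, c2, ...] meaning c0 + c1*t + c2*t^2 + ...

    Each term c*t^k becomes (k*c)*t^(k-1): the power rule with the constant-multiple
    rule, applied term by term (sum rule). The constant term drops out.
    """
    return [k * c for k, c in enumerate(coefficients)][1:]


# 3t^4 + 2t^2 - 7, lowest power first
poly = [-7, 0, 2, 0, 3]
derivative = differentiate_polynomial(poly)
print(derivative)  # [0, 4, 0, 12], which reads as 4t + 12t^3
```

Slicing off the first entry is the code's version of the constant having derivative zero; a polynomial that is only a constant comes back as an empty list.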

The same scan works when the powers are not whole numbers. Take 4 root t plus 6 over t, rewritten with exponents as 4 t to the one-half plus 6 t to the negative one:

d/dt(4t^(1/2)) = 4 · (1/2)·t^(-1/2) = 2·t^(-1/2) = 2/√t
d/dt(6t^(-1))  = 6 · (-1)·t^(-2)    = -6·t^(-2)  = -6/t^2

So the derivative of 4 root t plus 6 over t is 2 over root t minus 6 over t squared. The fractional power gave a fractional power back; the negative power gave a negative derivative (since 6 over t falls as t grows). No new machinery, just the one power rule applied term by term.
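The term-by-term scan carries over directly if each term stores its own exponent. The sketch below uses exact fractions so that one-half stays one-half; the (coefficient, exponent) pair representation and the name `differentiate_terms` are assumptions made for this example.

```python
from fractions import Fraction


def differentiate_terms(terms):
    """Differentiate a sum of terms, each given as (coefficient, exponent) for coefficient * t**exponent."""
    result = []
    for coefficient, exponent in terms:
        if exponent == 0:
            continue  # a constant term has rate of change zero
        # Power rule with constant multiple: c * t^n becomes (c * n) * t^(n - 1).
        result.append((coefficient * exponent, exponent - 1))
    return result


# 4 * t^(1/2) + 6 * t^(-1)
terms = [(Fraction(4), Fraction(1, 2)), (Fraction(6), Fraction(-1))]
for coefficient, exponent in differentiate_terms(terms):
    print(f"{coefficient} * t^({exponent})")
```

The two printed terms are 2 times t to the negative one-half and negative 6 times t to the negative two, matching the hand computation.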

The same geometric style of reasoning extends past powers entirely. The derivatives of the trig functions, for instance, come from the geometry of a point moving around a circle (the next lessons take up more rules, including those). The lesson here is the method: a derivative rule is not a fact to memorize but a consequence of how a quantity grows when you nudge its input.

Why this matters when you use AI

These rules are exactly what a machine-learning framework applies, automatically and at enormous scale, to compute the gradients that train a model. Automatic differentiation works by knowing the derivative of each elementary operation (the power rule, the constant-multiple rule, the sum rule, and a few more) and chaining them through the network. Every gradient in training is built from rules like these.
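To make "knowing the derivative of each elementary operation" concrete, here is a toy sketch that carries a rate of change alongside every value. It is not how any particular framework is implemented (training typically uses reverse-mode differentiation, while this sketch runs forward), and the class `Dual` and helper `derivative` are hypothetical names for this example.

```python
class Dual:
    """A value paired with its rate of change with respect to the input."""

    def __init__(self, value, rate=0.0):
        self.value = value
        self.rate = rate

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Sum rule: rates add.
        return Dual(self.value + other.value, self.rate + other.rate)

    __radd__ = __add__

    def __rmul__(self, constant):
        # Constant-multiple rule: scaling the value scales the rate.
        return Dual(constant * self.value, constant * self.rate)

    def __sub__(self, other):
        # Subtraction is a sum with a constant multiple of -1.
        return self + (-1) * other

    def __pow__(self, n):
        # Power rule, multiplied by the rate of the quantity being raised.
        return Dual(self.value ** n, n * self.value ** (n - 1) * self.rate)


def derivative(f, t):
    """Evaluate f at t, starting the input with rate 1 (it changes one-for-one with itself)."""
    return f(Dual(t, 1.0)).rate


def poly(t):
    return 3 * t ** 4 + 2 * t ** 2 - 7


print(derivative(poly, 2.0))  # 104.0, matching 12 * 2**3 + 4 * 2
```

Nothing in `poly` mentions derivatives; the rate falls out because each operation knows its own rule, which is the idea automatic differentiation scales up.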

The power rule in particular is everywhere, because squaring is everywhere. One of the most common training losses, especially for regression, is mean squared error, built from terms like the quantity prediction minus target, squared. The derivative of one such term with respect to the prediction is, by the power rule, 2 times the quantity prediction minus target: the gradient is proportional to the error itself. That single application of the power rule is why squared-error training nudges parameters in proportion to how wrong the prediction is, which is most of the intuition behind why it works.
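The claim that the gradient is proportional to the error can be checked on a single squared-error term. This is a minimal sketch with made-up numbers; the function names and the sample predictions are assumptions for illustration.

```python
def squared_error(prediction, target):
    """One term of a squared-error loss."""
    return (prediction - target) ** 2


def squared_error_gradient(prediction, target):
    """Power rule on (prediction - target)**2; the error changes one-for-one with the prediction."""
    return 2 * (prediction - target)


target = 3.0
for prediction in (3.5, 4.0, 5.0):
    dp = 1e-6
    numeric = (squared_error(prediction + dp, target)
               - squared_error(prediction - dp, target)) / (2 * dp)
    # Doubling the error (0.5 -> 1.0 -> 2.0) doubles the gradient.
    print(prediction, squared_error_gradient(prediction, target), numeric)
```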

Common pitfalls

Memorizing the power rule without the picture. The n is the number of faces that grow when you nudge one dimension; the t to the n minus one is each face’s size. Hold the growing-square and growing-cube images and you can rederive it any time, and you will not misremember it as n times t to the n, or n minus one times t to the n minus one.

Dropping the strips but keeping the corner. When the nudge shrinks, the corner term (the nudge squared, the nudge cubed) vanishes faster than the strips, so it is the corner you discard, not the strips. The strips are first-order in the nudge and survive after dividing; the corner is higher-order and dies.

Forgetting a constant’s derivative is zero. A constant does not change, so its rate of change is zero. In a polynomial, the constant term simply drops out when you differentiate.

Thinking the rule is only for whole-number powers. The rule, the derivative of t to the n is n times t to the n minus one, holds for negative and fractional n too: one over t differentiates to negative one over t squared, and root t to one over two root t.

What you should remember

The power rule, the derivative of t to the n is n times t to the n minus one, comes from geometry: nudging the side of an n-dimensional cube grows it by a slab on each of n faces, each face of size t to the n minus one. The 2t for a square is two growing strips; the 3 t squared for a cube is three growing slabs. It holds for negative and fractional powers too.
Two linearity rules make derivatives composable: the constant-multiple rule (the derivative of a constant times a function is the constant times the derivative of the function, scaling a function scales its slope) and the sum rule (the derivative of a sum of two functions is the sum of their derivatives, rates of sums are sums of rates).
Together they differentiate any polynomial in a few mechanical lines, turning last lesson’s page of binomial expansion into a quick scan. The derivative of 3 t to the fourth plus 2 t squared minus 7 is 12 t cubed plus 4t, with the constant dropping to zero.

You now have a reason for the power rule, not just the rule, and two linearity rules that let you differentiate any polynomial on sight. The next lessons add more rules: the trig derivatives from circle geometry, then the product and chain rules for combining functions in trickier ways, each derived from a picture rather than handed to you to memorize.