Summary: The power rule from geometry

Last lesson computed d/dt(t²) = 2t and d/dt(t³) = 3t² by grinding through binomial expansions. Lined up, they reveal a pattern, the power rule, and this lesson shows why it is true by reasoning about growing squares and cubes rather than expanding binomials. The payoff is that you can differentiate any polynomial on sight, and you have a reason for the rule instead of a memorized formula. This is the scan-it-in-five-minutes version.

  • The power rule: d/dt(t^n) = n · t^(n-1). Geometrically, t^n is an n-dimensional cube of side t; nudging the side grows it by a slab on each of n faces, each face of size t^(n-1). The n counts the faces; the t^(n-1) is each face’s size.
  • The growing square. t² is the area of a square of side t. Nudging the side to t + dt adds two strips (t·dt each) and a corner (dt²), so the added area is 2t·dt + dt². Dividing by dt gives 2t + dt, and as dt -> 0 the corner's contribution vanishes, leaving 2t. The 2 is the two strips, the t each strip’s length.
  • The growing cube. t³ is the volume of a cube of side t. Nudging the side to t + dt adds a slab (t²·dt) on each of three faces, plus edge and corner terms in dt² and dt³ that vanish faster. So d/dt(t³) = 3t²: three faces, each of area t².
  • Why higher-order terms drop. The strips/slabs are first order in dt and survive after dividing; the corner/edge pieces are higher order and vanish faster as dt -> 0. Dropping them is precise, not sloppy.
  • Every power. The rule holds for negative and fractional exponents too: d/dt(1/t) = d/dt(t^(-1)) = -t^(-2) = -1/t², and d/dt(√t) = d/dt(t^(1/2)) = (1/2)·t^(-1/2) = 1/(2√t).
  • Two linearity rules. Constant-multiple (d/dt(c·f) = c·d/dt(f), stretching a graph scales every slope) and sum (d/dt(f+g) = d/dt(f) + d/dt(g), rates of sums are sums of rates). A constant has derivative 0.
  • Polynomials on sight. Together the three rules turn last lesson’s page of expansion into a quick scan: d/dt(3t⁴ + 2t² - 7) = 12t³ + 4t, with the constant dropping to zero.
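
The claims in the list above can be checked numerically. The following is a minimal sketch, not part of the original lesson: it compares the "nudge" quotient (f(t + dt) - f(t)) / dt against the power rule's prediction. The helper names (difference_quotient, power_rule, poly, poly_derivative), the test point t = 3, and the step sizes are assumptions chosen for illustration.

```python
def difference_quotient(f, t, dt):
    """Growth of f per unit nudge: (f(t + dt) - f(t)) / dt."""
    return (f(t + dt) - f(t)) / dt


def power_rule(n, t):
    """Derivative of t**n as predicted by the power rule: n * t**(n - 1)."""
    return n * t ** (n - 1)


t = 3.0  # arbitrary test point (assumption)

# Growing square: the quotient is 2t + dt, so the leftover beyond 2t
# is the corner's contribution, and it shrinks along with dt.
for dt in (0.1, 0.01, 0.001):
    slope = difference_quotient(lambda x: x ** 2, t, dt)
    print(f"dt={dt}: quotient={slope:.6f}, leftover={slope - 2 * t:.6f}")

# Every power: integer, negative, and fractional exponents.
for n in (2, 3, -1, 0.5):
    approx = difference_quotient(lambda x: x ** n, t, 1e-6)
    print(f"n={n}: quotient={approx:.6f}, power rule={power_rule(n, t):.6f}")


# Polynomials on sight: 3t^4 + 2t^2 - 7 differentiates to 12t^3 + 4t.
def poly(x):
    return 3 * x ** 4 + 2 * x ** 2 - 7


def poly_derivative(x):
    return 12 * x ** 3 + 4 * x


print(difference_quotient(poly, t, 1e-6), poly_derivative(t))
```

The quotient never equals the derivative exactly for a nonzero dt; what the printout should show is the gap closing as dt shrinks, which is the numerical face of the higher-order terms dropping out.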

You stop expanding binomials and start reading derivatives off directly, and (more importantly) you carry a picture that lets you rederive the rule any time instead of misremembering it. That method, treating a derivative rule as a consequence of how a quantity grows when you nudge its input, is the through-line for every rule still to come. It also matters for AI: automatic differentiation computes gradients by knowing the derivative of each elementary operation (the power rule, constant-multiple, sum, and a few more) and chaining them through the network. The power rule is especially load-bearing because squaring is everywhere: mean squared error has terms (prediction - target)², whose derivative with respect to the prediction is 2·(prediction - target). The gradient is therefore proportional to the error itself, which is most of the intuition for why squared-error training works. The next lessons add the rules for trickier combinations (trig functions, then the product and chain rules), each derived from a picture rather than handed to you to memorize.
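
The squared-error claim can be illustrated the same way. This is a hedged sketch for a single prediction and target, not a description of how any particular autodiff library is implemented; the function names and the sample values are assumptions made for the example.

```python
def squared_error(prediction, target):
    """One squared-error term: (prediction - target)**2."""
    return (prediction - target) ** 2


def squared_error_grad(prediction, target):
    """Power rule applied to the error: 2 * (prediction - target).

    Nudging the prediction nudges the error by the same amount,
    so no extra factor appears.
    """
    return 2 * (prediction - target)


def numerical_grad(prediction, target, dp=1e-6):
    """Nudge the prediction by dp and measure the growth in the loss."""
    bumped = squared_error(prediction + dp, target)
    return (bumped - squared_error(prediction, target)) / dp


target = 1.0  # illustrative value (assumption)
for prediction in (0.5, 1.0, 3.0):
    exact = squared_error_grad(prediction, target)
    approx = numerical_grad(prediction, target)
    print(f"prediction={prediction}: rule={exact:.4f}, nudge={approx:.4f}")
```

The sign of the gradient says which way the prediction missed and its size says by how much, so a step against the gradient moves the prediction toward the target, with larger steps for larger errors.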