Summary: The power rule from geometry
Last lesson computed d/dt(t²) = 2t and d/dt(t³) = 3t² by grinding through binomial expansions. Lined up, they reveal a pattern, the power rule, and this lesson shows why it is true by reasoning about growing squares and cubes rather than expanding binomials. The payoff is that you can differentiate any polynomial on sight, and you have a reason for the rule instead of a memorized formula. This is the scan-it-in-five-minutes version.
Core ideas
- The power rule: `d/dt(t^n) = n · t^(n-1)`. Geometrically, `t^n` is an `n`-dimensional cube of side `t`; nudging the side grows it by a slab on each of `n` faces, each face of size `t^(n-1)`. The `n` counts the faces; the `t^(n-1)` is each face’s size.
- The growing square. `t²` is a square of side `t`. Nudging to `t + dt` adds two strips (`t·dt` each) and a corner (`dt²`): added area `= 2t·dt + dt²`, over `dt` is `2t + dt`, and as `dt -> 0` the corner vanishes, leaving `2t`. The 2 is the two strips, the `t` each strip’s length.
- The growing cube. `t³` adds a slab (`t²·dt`) on each of three faces, plus edge/corner terms in `dt²`, `dt³` that vanish faster. So `d/dt(t³) = 3t²`: three faces, each of area `t²`.
- Why higher-order terms drop. The strips/slabs are first order in `dt` and survive after dividing; the corner/edge pieces are higher order and vanish faster as `dt -> 0`. Dropping them is precise, not sloppy.
- Every power. The rule holds for negative and fractional exponents: `1/t = t^(-1) -> -1/t²`, and `√t = t^(1/2) -> 1/(2√t)`.
- Two linearity rules. Constant-multiple (`d/dt(c·f) = c·d/dt(f)`, stretching a graph scales every slope) and sum (`d/dt(f+g) = d/dt(f) + d/dt(g)`, rates of sums are sums of rates). A constant has derivative `0`.
- Polynomials on sight. Together the three rules turn last lesson’s page of expansion into a quick scan: `d/dt(3t⁴ + 2t² - 7) = 12t³ + 4t`, with the constant dropping to zero.
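The ideas in this list can be checked numerically. The following is a minimal Python sketch, not part of the lesson itself: the helper names `difference_quotient`, `power_rule`, and `poly_derivative` are hypothetical, and the polynomial is represented as an assumed `{exponent: coefficient}` dictionary purely for illustration.

```python
def difference_quotient(f, t, dt):
    """Growth of f when t is nudged by dt, divided by dt."""
    return (f(t + dt) - f(t)) / dt


def power_rule(n, t):
    """Slope of t**n predicted by the power rule: n * t**(n - 1)."""
    return n * t ** (n - 1)


def poly_derivative(coeffs):
    """Differentiate a polynomial given as {exponent: coefficient}.

    Applies the power rule to each term, the constant-multiple rule to its
    coefficient, and the sum rule across terms. Constant terms (exponent 0)
    drop out because their derivative is 0.
    """
    return {n - 1: c * n for n, c in coeffs.items() if n != 0}


# The growing square: the quotient is 2t + dt, so the gap to 2t shrinks with dt.
t = 3.0
for dt in (0.1, 0.01, 0.001):
    quotient = difference_quotient(lambda x: x ** 2, t, dt)
    print(f"dt={dt}: quotient={quotient:.6f}, gap to 2t={quotient - power_rule(2, t):.6f}")

# Polynomials on sight: 3t^4 + 2t^2 - 7 becomes 12t^3 + 4t.
print(poly_derivative({4: 3, 2: 2, 0: -7}))
```

The printed gap is the leftover corner term: it tracks `dt` and shrinks toward zero, which is the sense in which dropping it is exact in the limit. `power_rule` also accepts the negative and fractional exponents mentioned above, for example `power_rule(-1, 2.0)` or `power_rule(0.5, 4.0)`.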
What changes for you
You stop expanding binomials and start reading derivatives off directly, and (more importantly) you carry a picture that lets you rederive the rule any time instead of misremembering it. That method, treating each derivative rule as a consequence of how a quantity grows when you nudge its input, is the through-line for every rule still to come. It also matters for AI: automatic differentiation computes gradients by knowing the derivative of each elementary operation (the power rule, constant-multiple, sum, and a few more) and chaining them through the network. The power rule is especially load-bearing because squaring is everywhere: mean squared error has terms (prediction - target)², whose derivative with respect to the prediction is 2·(prediction - target). The gradient is therefore proportional to the error itself, which is most of the intuition for why squared-error training works: large errors get large corrections and small errors get small ones. The next lessons add the rules for trickier combinations (trig functions, then the product and chain rules), each derived from a picture rather than handed to you to memorize.
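The squared-error claim can be sketched in a few lines of plain Python. This is an illustrative check under stated assumptions, not how any particular autodiff library works: the function names `mse`, `mse_gradient`, and `numerical_gradient` are hypothetical, and because the loss here is a mean over `n` terms, the constant-multiple rule adds a factor of `1/n` to each `2·(prediction - target)`.

```python
def mse(predictions, targets):
    """Mean squared error: the average of (prediction - target)**2."""
    n = len(predictions)
    return sum((p - y) ** 2 for p, y in zip(predictions, targets)) / n


def mse_gradient(predictions, targets):
    """Slope of the loss with respect to each prediction.

    Power rule on each square gives 2 * (p - y); the constant-multiple rule
    carries the 1/n from the mean; the sum rule lets each term be handled
    on its own.
    """
    n = len(predictions)
    return [2 * (p - y) / n for p, y in zip(predictions, targets)]


def numerical_gradient(predictions, targets, h=1e-6):
    """Nudge each prediction by h and measure how much the loss grows."""
    base = mse(predictions, targets)
    grads = []
    for i in range(len(predictions)):
        bumped = list(predictions)
        bumped[i] += h
        grads.append((mse(bumped, targets) - base) / h)
    return grads


predictions = [2.0, 0.5, -1.0]
targets = [1.0, 0.5, 1.0]
print(mse_gradient(predictions, targets))
print(numerical_gradient(predictions, targets))
```

The two printed lists should agree to several decimal places, and each entry is proportional to that prediction's error: the entry for the prediction that already matches its target is zero.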