Lesson: The derivative as a rate at an instant
In the first lesson, deriving the area of a circle, we glimpsed the rate side of calculus: the accumulated area, pi R squared, grew at the rate two pi R, its circumference. We waved a hand at “the rate at which the area grows” and moved on. This lesson stops and makes that idea precise, because hiding inside it is a genuine paradox, and resolving the paradox is what the derivative actually is.
Here is the paradox. A derivative is described as the “rate of change at a single instant.” But change is something that happens over a span: to measure how fast a thing is moving, you compare where it is now to where it was a moment ago. Over a single instant of zero duration, nothing moves, nothing changes. So how can there be a rate of change at one frozen point in time? It sounds like a contradiction, and pretending it is not is where most confusion about calculus begins.
Rise over run, with the run shrinking
The resolution is to stop asking for the rate “at” an instant and ask instead what value the rate approaches as we measure over shorter and shorter spans.
Take any quantity that changes over time: a position, a temperature, an account balance. Over an interval from time t to time t + dt, where dt is a tiny step, the average rate of change is the familiar rise over run:
average rate = (change in the quantity) / dt
Now shrink that tiny step. As the interval gets smaller, this average rate settles toward a specific number. That number, the value the rise-over-run ratio approaches as the step shrinks toward zero, is the derivative. The derivative is not “change over an instant,” which is paradoxical. It is “the limit the average rate approaches as the interval shrinks,” which is perfectly well defined. The whole trick is replacing an impossible “at an instant” with a sensible “as the span vanishes.”
A falling object
Make it concrete with a falling object, whose distance fallen after t seconds is sixteen t squared (in feet, roughly, under gravity). How fast is it moving at a given instant t?
Measure the average velocity over a small interval from time t to time t + dt, where dt is the tiny step. The distance covered in that interval is the position at the end minus the position at the start, and the average velocity is that distance divided by the tiny step:
average velocity = ( 16(t + dt)^2 - 16t^2 ) / dt
Expand the top and simplify, and it reduces to thirty-two t times the tiny step, plus sixteen times the tiny step squared. Divide by the tiny step:
average velocity = 32t + 16·dt
Now let the tiny step shrink toward zero. The sixteen-times-the-step term shrinks away to nothing, and the average velocity approaches thirty-two t. So the instantaneous velocity at time t is thirty-two t. At the 2-second instant, the object is falling at thirty-two times 2, which is 64, feet per second. We never divided by zero or evaluated anything “at” a frozen instant; we watched a well-behaved expression settle as the interval vanished.
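The “expand the top and simplify” step goes by quickly, so here it is written out in full. This is only the algebra already described in words, with dt standing for the tiny step:

```latex
\begin{align*}
16(t + dt)^2 - 16t^2 &= 16\left(t^2 + 2t\,dt + dt^2\right) - 16t^2 \\
                     &= 32t\,dt + 16\,dt^2, \\[4pt]
\frac{32t\,dt + 16\,dt^2}{dt} &= 32t + 16\,dt .
\end{align*}
```

Every term of the rise carries at least one factor of dt, which is why the division goes through cleanly before the step is ever allowed to shrink.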
To see “approaches” as a concrete fact rather than a word, fix the 2-second instant and watch the average velocity, sixty-four plus sixteen times the tiny step, as that step gets smaller:
dt = 1.0 -> 80
dt = 0.5 -> 72
dt = 0.1 -> 65.6
dt = 0.01 -> 64.16
The numbers march toward 64 and keep closing the gap as the tiny step shrinks further, but they never need that step to actually reach zero. Sixty-four is the value they approach, and that limit is the instantaneous velocity. The derivative is that target number, read off from how the averages behave, not from any single measurement over a real interval.
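If you would rather watch the numbers settle than take them on trust, a short script can regenerate them. This is a minimal sketch: the function names `distance_fallen` and `average_velocity` are made up for illustration, and the extra step size 0.001 is added only to push one row further.

```python
def distance_fallen(t):
    """Distance fallen in feet after t seconds, using the lesson's 16 t^2 model."""
    return 16 * t ** 2


def average_velocity(t, dt):
    """Rise over run: distance covered between t and t + dt, divided by dt."""
    return (distance_fallen(t + dt) - distance_fallen(t)) / dt


# Fix the 2-second instant and shrink the step.
steps = [1.0, 0.5, 0.1, 0.01, 0.001]
averages = [average_velocity(2.0, dt) for dt in steps]

for dt, v in zip(steps, averages):
    # Four decimals is enough to see the gap to 64 closing.
    print(f"dt = {dt:<6} -> {v:.4f}")
```

Note that the loop never passes dt = 0; the averages close in on 64 without the step ever having to reach zero.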
The geometric picture: secant to tangent
The same move has a clean picture on a graph. Plot position against time as a curve. The average velocity over the interval from time t to time t + dt is the slope of the straight line connecting two points on the curve, at the start and at the end of that step. That connecting line is called a secant line.
As the tiny step shrinks, the second point slides toward the first, and the secant line pivots. In the limit, when the two points have merged, the secant has rotated into the tangent line, the straight line that just grazes the curve at that single instant. So the derivative is the slope of the tangent line at that point: the steepness of the curve exactly where you are standing. “Rate at an instant” becomes “slope at a point,” and the slope of a tangent is a perfectly ordinary thing.
This is the same event as the numbers from a moment ago. Those converging averages, 80, 72, 65.6, were the slopes of successive secant lines, each cutting the sixteen t squared curve over a shorter interval around the 2-second instant. As the interval shrank the secants rotated toward the tangent, and their slopes closed in on 64, the tangent’s slope. The numerical limit and the geometric limit are one thing seen two ways.
Computing a derivative from scratch
Let us do one in full, with no example to lean on, to see the machinery work. Take position equal to t cubed and find its derivative.
Consider the interval from time t to time t + dt, where dt is a tiny step. The change over that interval is the position at the end minus the position at the start, which is t plus the tiny step, all cubed, minus t cubed. Expand that cube, as the display shows:
(t + dt)^3 = t^3 + 3t^2·dt + 3t·dt^2 + dt^3
Subtract t cubed, leaving three terms, each carrying a factor of the tiny step. That is the rise. Divide by the run, the tiny step:
rise / run = 3t^2 + 3t·dt + dt^2
Now shrink the tiny step toward zero. The two trailing terms both contain a factor of the tiny step, so they vanish, and what remains is three t squared. The derivative of t cubed is three t squared.
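One way to double-check the algebra is to compare the unsimplified ratio with the simplified one using exact rational arithmetic, so that no rounding can hide a mistake. The sketch below assumes Python's standard `fractions` module; the function names and the choice of t = 2 are purely illustrative.

```python
from fractions import Fraction


def rise_over_run(t, dt):
    """Average rate of change of t^3 over the interval from t to t + dt."""
    return ((t + dt) ** 3 - t ** 3) / dt


def simplified(t, dt):
    """The same ratio after expanding the cube and cancelling dt."""
    return 3 * t ** 2 + 3 * t * dt + dt ** 2


t = Fraction(2)
for dt in [Fraction(1, 10), Fraction(1, 100), Fraction(1, 1000)]:
    ratio = rise_over_run(t, dt)
    # Exact arithmetic: the two forms must agree with no rounding error.
    assert ratio == simplified(t, dt)
    # What is left over beyond 3t^2 is exactly the two trailing terms.
    gap = ratio - 3 * t ** 2
    print(f"dt = {dt}: ratio = {float(ratio)}, gap from 3t^2 = {float(gap)}")
```

The gap printed on each row is exactly 3t·dt + dt^2, the two trailing terms, and it shrinks along with the step.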
Notice what happened: shrinking the tiny step did not make the calculation harder or more delicate. It made it cleaner, sweeping away the messy step terms and leaving a tidy result. The cubic t cubed has the quadratic three t squared as its rate of change.
Run the same machine on t squared to see a pattern start. The change is t plus the tiny step, all squared, minus t squared, which works out to two t times the step plus the step squared; divide by the step to get two t plus the step; let the step go to zero and you are left with two t. So t squared has derivative two t, and t cubed has derivative three t squared. Two data points: the old exponent comes down to become a coefficient out front, and the new exponent is one less. That is a specific instance of a pattern (the power rule) that the next lesson will name and explain geometrically, so you will not have to expand binomials every time.
What the notation really means
This is the right moment to demystify the notation d-y over d-x (or d-s over d-t, or d-A over d-R). It is not a fraction of two tiny “infinitesimal” numbers that you do strange arithmetic with. It is shorthand for the whole limiting process you just performed: the value that rise over run approaches as the run shrinks to zero. The letter d is a label meaning “the limit of a small change in.” When you read d-s over d-t in a paper, read it as “the rate at which the position s changes with time t,” computed as a limit, not as a literal division.
And the derivative is itself a function. Differentiating the position sixteen t squared did not give a single number; it gave thirty-two t, a new function with a value at every instant. The derivative takes a function (position over time) and produces another function (velocity over time), assigning to every instant the slope of the original curve there.
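The “function in, function out” idea maps neatly onto code. The sketch below is an approximation, not the true limit: it assumes a small fixed step of 1e-6 stands in for “as the step shrinks,” and the names `derivative`, `position`, and `velocity` are chosen here for illustration.

```python
def derivative(f, dt=1e-6):
    """Return a new function that approximates the slope of f at any t.

    Assumption for illustration: a small fixed dt stands in for the limit,
    so the result is an approximation, not the exact derivative.
    """
    def slope_at(t):
        return (f(t + dt) - f(t)) / dt
    return slope_at


def position(t):
    """The lesson's falling-object model: 16 t^2 feet after t seconds."""
    return 16 * t ** 2


velocity = derivative(position)  # a function, not a number

for t in [0.0, 1.0, 2.0, 3.0]:
    # Compare the approximate slope with the exact answer 32t from the text.
    print(t, round(velocity(t), 3), 32 * t)
```

Calling `derivative(position)` gives back something you can evaluate at any instant, while `velocity(2.0)` is a single number close to 64.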
Why this matters when you use AI
The first lesson said training a model means following the derivative of its loss downhill. This lesson is what that derivative actually is: the limit of rise over run, the slope of the loss surface in each parameter’s direction. When training nudges a parameter, it asks “if I change this parameter by a tiny amount, how much does the loss change,” which is exactly the rise-over-run-as-the-run-shrinks question, answered for millions of parameters at once.
The “derivative is a function” idea matters too. The gradient of a loss is not one number; it is a whole vector of derivatives, one slope per parameter, recomputed at every training step because the slope changes as the parameters move. And the way frameworks compute these derivatives, automatic differentiation, does not shrink a step numerically. It applies the derivative rules to each elementary operation in the network and chains the results, so it gets the value of the limit itself, exact up to floating-point rounding. That is precisely why the next lesson, on those rules, is the practical workhorse.
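To get a feel for how a program can obtain a derivative without shrinking any step, here is a toy sketch of forward-mode automatic differentiation using “dual numbers.” It is an illustration only: the `Dual` class is hypothetical, it supports just addition and multiplication, it takes the sum and product rules as given (they belong to the next lesson), and real frameworks are far more elaborate and typically use a different (reverse) mode.

```python
class Dual:
    """A value paired with its derivative; a toy forward-mode autodiff number."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def lift(x):
        """Treat a plain number as a constant, whose derivative is zero."""
        return x if isinstance(x, Dual) else Dual(x)

    def __add__(self, other):
        other = Dual.lift(other)
        # Sum rule: (u + v)' = u' + v'
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule: (u v)' = u' v + u v'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def derivative_at(f, t):
    """Evaluate f at t while carrying the derivative through every operation."""
    return f(Dual(t, 1.0)).deriv  # seed: the derivative of t with respect to t is 1


def position(t):
    return 16 * t * t


def cube(t):
    return t * t * t


print(derivative_at(position, 2.0))  # slope of 16 t^2 at t = 2
print(derivative_at(cube, 2.0))      # slope of t^3 at t = 2
```

No dt appears anywhere in the sketch: the derivative rides along with the value, one rule per operation, which is why the result matches 32t and 3t^2 at t = 2 rather than merely approximating them.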
Common pitfalls
Reading d-y over d-x as a fraction of infinitesimals. It is the notation for a limit, not a division of two tiny numbers. Treating it as literal infinitesimal arithmetic is the old, confusing way; the limit is the clear one.
Thinking the derivative is the rate “at” an instant. Strictly, nothing changes in zero time. The derivative is the value the average rate approaches as the interval shrinks, the rate “around” the instant, captured as a limit. That distinction is the entire resolution of the paradox.
Forgetting the derivative is a function. The derivative, s-prime, is not one number; it is a new function giving the slope at every point. Asking “what is the derivative” yields a function; asking “what is the derivative at the 2-second instant” yields a number.
Plugging the tiny step to zero directly. You cannot set the step to zero in rise over the step, because that is zero divided by zero. You simplify first (cancel the step in the denominator), and only then let the step shrink to zero. The order matters.
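The last pitfall is easy to reproduce. In the sketch below (function names are illustrative), the unsimplified ratio has no value when the step is exactly zero, while the simplified form, with the step already cancelled, evaluates without trouble.

```python
def rise_over_run(t, dt):
    """Unsimplified average velocity for the lesson's 16 t^2 example."""
    return (16 * (t + dt) ** 2 - 16 * t ** 2) / dt


def simplified(t, dt):
    """The same ratio after cancelling dt: 32t + 16 dt."""
    return 32 * t + 16 * dt


try:
    rise_over_run(2.0, 0.0)
    direct = "no error"
except ZeroDivisionError:
    # 0 / 0: the unsimplified ratio is not defined at dt = 0.
    direct = "undefined (zero divided by zero)"

print(direct)
print(simplified(2.0, 0.0))
```

Python raises `ZeroDivisionError` for the first call, which is the programming counterpart of “you cannot set the step to zero before simplifying.”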
What you should remember
- A derivative is the limit of rise over run as the run shrinks to zero. It resolves the paradox of an “instantaneous rate” by replacing the impossible “rate at an instant” with the sensible “value the average rate approaches as the interval vanishes.”
- Geometrically, the derivative is the slope of the tangent line at a point: the limit of the slope of the secant line through two nearby points as those points merge. Compute it by forming the change in position over a tiny step, dividing by that step, simplifying, and letting the step go to zero, as in t cubed becoming three t squared.
- The notation d-y over d-x is shorthand for that limit, not a fraction, and the derivative is itself a function that gives the rate of change at every point. Differentiating position gives velocity at every instant.
The paradox dissolves once you stop demanding a rate “at” a frozen instant and instead watch what the rate approaches as the span shrinks. You computed each derivative here from scratch by expanding a binomial, which works but is tedious. The next lesson finds the patterns, the derivative rules, that let you read off most derivatives without re-deriving them every time.