Summary: The derivative as a rate

A derivative is supposed to be the rate of change at a single instant, but over an instant nothing changes, so how can there be a rate? That paradox is where most confusion about calculus begins, and this lesson dissolves it with one idea: the derivative is not a rate measured over zero duration but the value the average rate approaches as the measuring interval shrinks toward zero. With that, “rate at an instant” becomes “slope at a point,” and you can compute a derivative from scratch. This is the scan-it-in-five-minutes version.

Core ideas

The paradox and the fix. Change happens over a span; over zero duration nothing moves. So instead of the rate “at” an instant, take the value the average rate approaches as the interval dt shrinks toward zero. That limit is the derivative, and it is perfectly well defined.
Rise over run, with the run shrinking. The average rate over [t, t + dt] is (change in the quantity) / dt. Let dt shrink and the ratio settles on a number; that number is the derivative.
A worked rate. For free fall s(t) = 16t² (s in feet, t in seconds), the change over [t, t + dt] is 16(t + dt)² - 16t² = 32t·dt + 16·dt², so the average velocity is (32t·dt + 16·dt²)/dt = 32t + 16·dt, which approaches 32t as dt -> 0. At t = 2, that is 64 ft/s, and for dt = 1, 0.5, 0.1, 0.01 the averages 80, 72, 65.6, 64.16 visibly march toward 64 as dt shrinks, without ever needing dt = 0.
Secant to tangent. Geometrically, the average rate is the slope of the secant line through two points; as dt shrinks the points merge and the secant pivots into the tangent line. The derivative is the slope of the tangent at a point.
Computing one from scratch. Form (s(t + dt) - s(t))/dt, simplify, then let dt -> 0. For t³: (t + dt)³ = t³ + 3t²·dt + 3t·dt² + dt³, so ((t + dt)³ - t³)/dt = 3t² + 3t·dt + dt² -> 3t². (The same steps take t² to 2t.) In both cases the old exponent steps out front as a coefficient and the power drops by one, a first glimpse of the power rule.
What dy/dx means. The notation dy/dx (or ds/dt, for position s against time t) is shorthand for that limit, not a fraction of infinitesimals. And the derivative is itself a function: differentiating position gives velocity at every instant, a slope at every point.
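
The recipe above can be checked numerically. What follows is a minimal sketch, not part of the lesson itself: the helper name average_rate, the function names s and cube, and the list steps are chosen here for illustration. It computes rise over run for the free-fall example and for t³, both at t = 2, and prints the averages as dt shrinks, so you can watch them settle toward 32t = 64 and 3t² = 12.

```python
def average_rate(f, t, dt):
    """Average rate of change of f over [t, t + dt]: rise over run."""
    return (f(t + dt) - f(t)) / dt


def s(t):
    """Free-fall distance in feet after t seconds, s(t) = 16t^2."""
    return 16 * t**2


def cube(t):
    """The function t^3 from the from-scratch computation."""
    return t**3


# Shrinking run lengths; dt never has to reach 0.
steps = [1, 0.5, 0.1, 0.01, 0.001]

fall_averages = [average_rate(s, 2, dt) for dt in steps]
cube_averages = [average_rate(cube, 2, dt) for dt in steps]

for dt, v, c in zip(steps, fall_averages, cube_averages):
    print(f"dt = {dt:<6} free fall: {v:.4f}   t cubed: {c:.6f}")
```

The loop only evaluates averages over nonzero intervals; the derivative is the number those averages close in on, which is why the code never divides by zero.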

What changes for you

The derivative stops being a mysterious “instantaneous” quantity and becomes a concrete recipe: rise over run, with the run shrinking, read off as the value the averages approach. That is the quantity training a machine-learning model relies on, millions of times. When training nudges a parameter, it asks “if I change this by a tiny amount, how much does the loss change,” which is the rise-over-run-as-the-run-shrinks question; the gradient is a whole vector of these derivatives, one slope per parameter, recomputed at every step because the slopes move as the parameters do. The notation ds/dt that looked like decoration now reads as “the rate at which s changes,” computed as a limit. This lesson computed its derivatives the slow way, by expanding binomials; the next finds the patterns (the derivative rules) that let you read most derivatives off directly, starting with the power rule.
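
The parameter-nudging question can be sketched in a few lines. This is an illustration under stated assumptions, not how real training code works: the toy loss, the names numerical_gradient, params, and rate, and the step size dt = 1e-6 are all hypothetical, and practical training computes exact gradients with automatic differentiation rather than by nudging each parameter. The sketch only shows that a gradient is one rise-over-run slope per parameter, recomputed after every update.

```python
def loss(params):
    """Hypothetical toy loss, smallest at w = 3, b = -1."""
    w, b = params
    return (w - 3) ** 2 + (b + 1) ** 2


def numerical_gradient(f, params, dt=1e-6):
    """One slope per parameter: nudge it by dt, measure rise over run."""
    base = f(params)
    grad = []
    for i in range(len(params)):
        nudged = list(params)
        nudged[i] += dt
        grad.append((f(nudged) - base) / dt)
    return grad


params = [0.0, 0.0]

# Slopes at the starting point; the exact derivatives there are -6 and 2.
first_grad = numerical_gradient(loss, params)

# Gradient descent: the slopes are recomputed at every step because
# they change as the parameters move.
rate = 0.1  # assumed step size for this sketch
for _ in range(50):
    grad = numerical_gradient(loss, params)
    params = [p - rate * g for p, g in zip(params, grad)]

print("first gradient:", first_grad)
print("final params:", params)
```

With a small but nonzero dt, each entry of the gradient is an average rate very close to the true derivative, which is the same approximation the averages 80, 72, 65.6, 64.16 make to 64.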