Summary: The derivative as a rate
A derivative is supposed to be the rate of change at a single instant, but over an instant nothing changes, so how can there be a rate? That paradox is where most confusion about calculus begins, and this lesson dissolves it with one idea: the derivative is not the rate at an instant but the value the average rate approaches as the measuring interval shrinks to zero. With that, “rate at an instant” becomes “slope at a point,” and you can compute a derivative from scratch. This is the scan-it-in-five-minutes version.
Core ideas
- The paradox and the fix. Change happens over a span; over zero duration nothing moves. So instead of the rate “at” an instant, take the value the average rate approaches as the interval `dt` shrinks toward zero. That limit is the derivative, and it is perfectly well defined.
- Rise over run, with the run shrinking. The average rate over `[t, t + dt]` is (change in the quantity) / `dt`. Let `dt` shrink and the ratio settles on a number; that number is the derivative.
- A worked rate. For free fall `s(t) = 16t²`, the average velocity over `[t, t + dt]` is `(32t·dt + 16·dt²)/dt = 32t + 16·dt`, which approaches `32t` as `dt -> 0`. At `t = 2`, that is `64` ft/s, and the averages `80, 72, 65.6, 64.16` (for `dt = 1, 0.5, 0.1, 0.01`) visibly march toward 64 as `dt` shrinks, without ever needing `dt = 0`.
- Secant to tangent. Geometrically, the average rate is the slope of the secant line through two points; as `dt` shrinks the points merge and the secant pivots into the tangent line. The derivative is the slope of the tangent at a point.
- Computing one from scratch. Form `(s(t + dt) - s(t))/dt`, simplify, then let `dt -> 0`. For `t³`: `((t+dt)³ - t³)/dt = 3t² + 3t·dt + dt²`, which approaches `3t²`. (The same steps turn `t²` into `2t`.) The old exponent steps out front as a coefficient and the exponent drops by one, a first glimpse of the power rule.
- What `dy/dx` means. Shorthand for that limit, not a fraction of infinitesimals. And the derivative is itself a function: differentiating position gives velocity at every instant, a slope at every point.
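The recipe in the list above can be checked numerically. The following is a minimal sketch, not part of the lesson itself: the helper name `average_rate` and the function `cube` are assumptions introduced here for illustration, and the `dt` values are the ones that reproduce the averages 80, 72, 65.6, 64.16 for free fall at `t = 2`.

```python
def average_rate(f, t, dt):
    """Average rate of change of f over [t, t + dt]: rise over run."""
    return (f(t + dt) - f(t)) / dt


def s(t):
    """Free-fall distance in feet after t seconds: s(t) = 16t^2."""
    return 16 * t**2


def cube(t):
    """The second worked example: t^3."""
    return t**3


# Shrink the run and watch the average velocity at t = 2 settle toward 64 ft/s.
for dt in (1, 0.5, 0.1, 0.01):
    print(f"dt = {dt:<5} average velocity = {average_rate(s, 2, dt):.2f} ft/s")

# For t^3 the average rate equals 3t^2 + 3t*dt + dt^2, which approaches 3t^2.
t = 2
for dt in (0.1, 0.01, 0.001):
    print(f"dt = {dt:<6} average rate = {average_rate(cube, t, dt):.6f} (3t^2 = {3 * t**2})")
```

Note that `dt` never reaches zero in either loop; the derivative is read off as the value the printed averages close in on.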
What changes for you
The derivative stops being a mysterious “instantaneous” quantity and becomes a concrete recipe: rise over run, with the run shrinking, read off as the value the averages approach. That is exactly what training a model computes, millions of times. When training nudges a parameter, it asks “if I change this by a tiny amount, how much does the loss change,” which is the rise-over-run-as-the-run-shrinks question; the gradient is a whole vector of these derivatives, one slope per parameter, recomputed at every step because the slopes move as the parameters do. The notation ds/dt that looked like decoration now reads as “the rate at which s changes,” computed as a limit. The lesson computed one derivative the slow way, by expanding a binomial; the next finds the patterns (the derivative rules) that let you read most derivatives off directly, starting with the power rule.
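To make the "one slope per parameter" picture concrete, here is a small sketch under stated assumptions: the toy `loss`, the helper `numerical_gradient`, the step size `h`, and the learning rate `0.1` are all hypothetical choices made for illustration. Real training frameworks typically obtain these slopes by automatic differentiation rather than by nudging each parameter in turn; the finite-difference version below only mirrors the rise-over-run question.

```python
def loss(params):
    """Hypothetical toy loss: squared distance from the point (3, -1)."""
    w, b = params
    return (w - 3) ** 2 + (b + 1) ** 2


def numerical_gradient(f, params, h=1e-6):
    """One slope per parameter: nudge each by h and take rise over run."""
    base = f(params)
    grad = []
    for i in range(len(params)):
        nudged = list(params)
        nudged[i] += h  # change only parameter i by a tiny amount
        grad.append((f(nudged) - base) / h)
    return grad


params = [0.0, 0.0]
for step in range(50):
    # The slopes are recomputed every step because they move as the parameters do.
    grad = numerical_gradient(loss, params)
    params = [p - 0.1 * g for p, g in zip(params, grad)]

print(params)  # close to [3, -1], where the loss is smallest
```

Each entry of `grad` answers the question for one parameter: if only this value changes by a tiny amount, how fast does the loss change?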