
Lesson: Taylor series, approximating anything with polynomials

This track opened with a question: you know the area of a circle is pi times the radius squared, but why? Thirteen lessons later, you can answer it (slice into rings, integrate two pi times the radius, get pi times the radius squared), and along the way you built the entire machine that makes the answer work: rates and accumulations, the rules for differentiating powers, trig functions, exponentials, products, compositions, and the fundamental theorem binding it all together. This final lesson is where that machine produces its most powerful single result.

The result is the Taylor series, and the idea is audacious: take any well-behaved function, no matter how complicated, and rebuild it near a point as a plain polynomial, using nothing but the function’s derivatives at that point. Sine, the exponential, the logarithm, all of them become sums of powers of the input, as easy to compute as arithmetic. It is the capstone because it requires everything you have learned, and it is the bridge to the math that machine learning runs on.

Polynomials are the friendliest functions there are. To evaluate three plus two times the input plus the input squared you only multiply and add; a computer, or a person, can do it directly. Functions like sine of the input or Euler’s number raised to the input are not like that: there is no finite arithmetic recipe for the sine of an arbitrary number. So the question is irresistible: can we approximate a complicated function by a polynomial, closely enough to be useful? Taylor’s answer is yes, and the polynomial is built entirely from the function’s derivatives.

Near a point a, a function f is approximated by:

f(x) ≈ f(a) + f'(a)·(x-a) + (f''(a)/2!)·(x-a)^2 + (f'''(a)/3!)·(x-a)^3 + ...

Each term uses one more derivative of the function f, evaluated at the center a, and attaches it to a power of the input minus a. The first term pins the value, the second pins the slope, the third pins the concavity, and each further term matches one more order of the function’s behavior at the center a.
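As a minimal sketch of the formula above, the following Python function sums the terms from a list of derivative values at the center. The name taylor_poly and the example (the exponential centered at a = 1, where every derivative equals e) are illustrative assumptions, not part of the lesson.

```python
import math

def taylor_poly(derivs_at_a, a, x):
    """Evaluate the sum over k of derivs_at_a[k] / k! * (x - a)**k.

    derivs_at_a[k] is the k-th derivative of f at the center a,
    with index 0 holding f(a) itself. The caller supplies these values.
    """
    return sum(d / math.factorial(k) * (x - a) ** k
               for k, d in enumerate(derivs_at_a))

# Illustrative choice: f(x) = e^x centered at a = 1, where every
# derivative equals e. Four terms means derivatives 0 through 3.
derivs = [math.e] * 4
approx = taylor_poly(derivs, a=1.0, x=1.2)
print(approx, math.exp(1.2))   # the two values should be close
```

The function knows nothing about f beyond the numbers it is handed, which is the point: the derivatives at the center are the only ingredients.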

Why the factorials? They are bookkeeping that makes the matching exact. Differentiate the input minus a raised to the power k once and you get k times the input minus a raised to the k minus one; differentiate again and you get k times k minus one times the input minus a raised to the k minus two; after k differentiations you are left with the constant k factorial, that is k times k minus one and on down to two times one. Dividing the k-th term by k factorial cancels exactly that buildup, so that the k-th derivative of the k-th term, evaluated at the center a, comes out to precisely the k-th derivative of the function at a. The factorial is what guarantees the matching property: the Taylor polynomial and the function f share the same value, the same slope, the same second derivative, the same third derivative, and so on, all the way up, at the point a. That is what it means for the polynomial to approximate the function there.
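Written out in symbols, the factorial argument is short. The notation f^{(k)}(a) for the k-th derivative of f at a is an assumption of this sketch; the lesson itself states the same thing in words.

```latex
\begin{aligned}
\frac{d}{dx}(x-a)^k &= k\,(x-a)^{k-1},\\
\frac{d^2}{dx^2}(x-a)^k &= k(k-1)\,(x-a)^{k-2},\\
&\;\;\vdots\\
\frac{d^k}{dx^k}(x-a)^k &= k(k-1)\cdots 2\cdot 1 = k!,
\end{aligned}
\qquad\text{so}\qquad
\frac{d^k}{dx^k}\left[\frac{f^{(k)}(a)}{k!}\,(x-a)^k\right] = f^{(k)}(a).
```

Terms of lower degree are wiped out by k differentiations, and terms of higher degree still carry a factor of (x - a) that vanishes at x = a, so the k-th term is the only one that contributes to the k-th derivative at the center.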

Watching the approximation build is the clearest way to feel it. Keep only the first two terms, the value of the function at a plus the first derivative at a times the input minus a, and you have the tangent line at the center a, the best straight-line fit (this is the small-angle and Newton’s-method object we will return to). Add the third term and the straight line bends into a parabola that now matches the curve’s concavity at a. Add the fourth and it bends again to match the rate at which the concavity changes. Each new term lets the polynomial hug the curve through one more order of agreement, and near the center a the fit gets visibly tighter with every term.

Figure: cos(x) and its Taylor approximations centered at a = 0, plotted for x from -3 to 3: the 1st-order tangent line y = 1, the 2nd-order parabola y = 1 - x²/2, and the 4th-order curve y = 1 - x²/2 + x⁴/24.
Centered at a = 0, each Taylor polynomial of cos(x) hugs the curve over a wider range than the one before it. The 1st-order term is just the tangent line y = 1 (cos has zero slope at 0). The 2nd-order parabola y = 1 - x²/2 bends downward with the curve near 0, then dives away. The 4th-order polynomial y = 1 - x²/2 + x⁴/24 tracks cos far longer before separating. Add more terms and the polynomial matches the function over an ever wider interval around the center.

Far from the center a the approximation can drift away (and for some functions it only stays accurate within a limited distance), but close enough to the center, for the well-behaved functions in this lesson, more terms means a better match.
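To put numbers on the picture, here is a small sketch that measures how far each cosine approximation sits from the true curve at a few inputs. The function name cos_taylor and the sample points are choices made for this illustration.

```python
import math

def cos_taylor(x, order):
    """Taylor polynomial of cos at 0, keeping powers up to `order`."""
    total = 0.0
    for k in range(0, order + 1, 2):      # only even powers appear
        total += (-1) ** (k // 2) * x ** k / math.factorial(k)
    return total

for x in (0.5, 1.5, 3.0):
    errs = [abs(cos_taylor(x, n) - math.cos(x)) for n in (1, 2, 4)]
    print(x, ["%.4f" % e for e in errs])
```

Near the center the error drops sharply with each added term. At x = 3 the parabola is actually further from cos(3) than the flat line is, which is the drift described above; the 4th-order curve then recovers some of the ground.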

Build the Taylor series at the center a equal to zero for the track’s recurring functions, and watch earlier lessons pay off.

The exponential. Every derivative of Euler’s number raised to the input is itself (the Euler’s-number lesson), and Euler’s number raised to the zero is one, so every derivative at zero is one and the k-th coefficient is simply one over k factorial. The series is the cleanest in mathematics:

e^x = 1 + x + x^2/2! + x^3/3! + x^4/4! + ...

Test it at the input equal to one: one plus one plus one-half plus one-sixth plus one over twenty-four plus one over one-hundred-twenty equals 2.7167, already close to Euler’s number, about 2.71828, and tightening with each term. The series is also consistent with Euler’s number raised to the input being its own derivative: differentiate the series term by term and it reproduces itself.
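The arithmetic above can be checked with a short sketch that accumulates the series one term at a time. The function name exp_partial_sums is hypothetical; each new term is built from the previous one, so no factorial is computed explicitly.

```python
import math

def exp_partial_sums(x, n_terms):
    """Return the running partial sums of 1 + x + x^2/2! + ..."""
    sums, total, term = [], 0.0, 1.0
    for k in range(n_terms):
        total += term
        sums.append(total)
        term *= x / (k + 1)   # next term: x^(k+1) / (k+1)!
    return sums

# Columns: number of terms, partial sum, remaining gap to e.
for n, s in enumerate(exp_partial_sums(1.0, 8), start=1):
    print(n, round(s, 5), round(math.e - s, 5))
```

The sixth row is the six-term sum worked out by hand above, and the gap to Euler’s number keeps shrinking in the rows after it.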

Sine. The derivatives of sine at zero cycle: sine of zero is zero, cosine of zero is one, negative sine of zero is zero, negative cosine of zero is negative one, and then repeat. The derivative values are zero, one, zero, negative one, zero, one, and on, so after dividing by the factorials only odd powers survive, with alternating signs:

sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ...

Test at the input equal to pi over two, about 1.5708: 1.5708 minus 0.6460 plus 0.0797 minus 0.0047, which is about 0.9998, converging to the sine of pi over two, which is one.

Watch the fit tighten numerically at the input equal to one, where the true value is the sine of one, about 0.84147:

1 term (x) -> 1.00000
2 terms (x - x^3/6) -> 0.83333
3 terms (+ x^5/120) -> 0.84167
4 terms (- x^7/5040) -> 0.84147

Each term pulls the estimate closer, and by the fourth it matches the sine of one to five decimals. The polynomial is reconstructing the sine, one derivative at a time.
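The table can be reproduced with a few lines of Python. This is a sketch; the name sin_taylor is chosen for the illustration.

```python
import math

def sin_taylor(x, n_terms):
    """Sum the first n_terms nonzero terms of x - x^3/3! + x^5/5! - ..."""
    return sum((-1) ** j * x ** (2 * j + 1) / math.factorial(2 * j + 1)
               for j in range(n_terms))

for n in range(1, 5):
    print(n, f"{sin_taylor(1.0, n):.5f}")
print("math.sin(1) =", f"{math.sin(1.0):.5f}")
```

Calling sin_taylor(math.pi / 2, 4) gives the four-term estimate at pi over two from the earlier test.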

Cosine. By the same cycle starting from the cosine of zero, which is one, only even powers survive:

cos(x) = 1 - x^2/2! + x^4/4! - x^6/6! + ...
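One way to check the two trig series against each other is to differentiate the sine series term by term and see that the cosine series comes out. The sketch below does this with exact fractions; the helper names are hypothetical, and the derivative cycles are the ones described in the text.

```python
from fractions import Fraction
from math import factorial

def sin_coeffs(n):
    """Coefficients c[0..n-1] of the sine series: c[k] multiplies x^k."""
    cycle = [0, 1, 0, -1]                 # sin, cos, -sin, -cos at 0
    return [Fraction(cycle[k % 4], factorial(k)) for k in range(n)]

def cos_coeffs(n):
    """Coefficients c[0..n-1] of the cosine series."""
    cycle = [1, 0, -1, 0]                 # cos, -sin, -cos, sin at 0
    return [Fraction(cycle[k % 4], factorial(k)) for k in range(n)]

def differentiate(coeffs):
    """Term-by-term derivative: d/dx of c*x^k is k*c*x^(k-1)."""
    return [k * c for k, c in enumerate(coeffs)][1:]

print(differentiate(sin_coeffs(8)) == cos_coeffs(7))
```

Using Fraction rather than floats keeps the comparison exact, so the equality test is a genuine check and not a rounding coincidence.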

Three things from earlier in the track were stated on intuition and are now provable.

The small-angle approximation. The trig lesson claimed that the sine of the input is approximately the input itself for small inputs, because the slope of sine at zero is one. That is exactly sine’s first-order Taylor approximation at zero: the value there is zero and the slope is one, so the sine of the input is about the input. The next term reveals how good the approximation is and how to improve it: the sine of the input is about the input minus the input cubed over six. At the input equal to 0.5, plain input gives 0.5, while the input minus the input cubed over six is 0.5 minus 0.0208, which is 0.4792, almost exactly the true sine of 0.5, which is 0.4794. The “approximation” was the first slice of a series all along.
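A quick numerical sketch shows that the size of the small-angle error is tracked closely by the next Taylor term. The function name and the sample inputs are assumptions of this illustration.

```python
import math

def small_angle_errors(x):
    """Return (error of x, error of x - x^3/6) as approximations of sin(x)."""
    true = math.sin(x)
    return abs(x - true), abs(x - x ** 3 / 6 - true)

for x in (0.1, 0.25, 0.5):
    e1, e3 = small_angle_errors(x)
    print(f"x={x}: error of x is {e1:.6f}, x^3/6 is {x**3/6:.6f}, "
          f"error of x - x^3/6 is {e3:.6f}")
```

In each row the error of the plain small-angle rule is close to the input cubed over six, which is why subtracting that term removes most of it.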

L’Hôpital’s rule. The limits lesson said L’Hôpital works by keeping each function’s leading behavior near a point. That leading behavior is precisely the first-order Taylor approximation, the value of the function at a plus the first derivative at a times the input minus a. When numerator and denominator both vanish at the center a, the value terms drop out and the shared factor of the input minus a cancels, so the ratio is governed by the first derivative of the numerator divided by the first derivative of the denominator (provided the denominator’s derivative at a is not zero). L’Hôpital is first-order Taylor wearing a different hat.
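In symbols, with f as the numerator and g as the denominator (names assumed here for illustration), the first-order argument reads:

```latex
\frac{f(x)}{g(x)} \approx \frac{f(a) + f'(a)\,(x-a)}{g(a) + g'(a)\,(x-a)}
= \frac{f'(a)\,(x-a)}{g'(a)\,(x-a)} = \frac{f'(a)}{g'(a)}
\qquad\text{when } f(a) = g(a) = 0,\; g'(a) \neq 0.
```

For example, sine of x over x near zero has f(x) = sin(x) and g(x) = x, both zero at the center, so the limit is cos(0)/1 = 1.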

Higher-order derivatives. The previous lesson built the tower of derivatives, the first, second, third, and on, and showed a degree-four polynomial differentiating down to a constant and then zero. That tower is exactly the list of ingredients Taylor needs: the k-th term uses the k-th derivative. The reason a polynomial equals its own Taylor series is that its tower terminates, so the series is finite and exact.
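The claim that a polynomial equals its own Taylor series can be checked directly. In the sketch below, a polynomial is stored as a list of coefficients and differentiated until nothing is left. The specific polynomial, the center, and the helper names are hypothetical choices for illustration.

```python
from math import factorial

def poly_eval(coeffs, x):
    """Evaluate the sum of coeffs[k] * x**k."""
    return sum(c * x ** k for k, c in enumerate(coeffs))

def poly_deriv(coeffs):
    """Coefficients of the derivative polynomial."""
    return [k * c for k, c in enumerate(coeffs)][1:]

def taylor_of_poly(coeffs, a, x):
    """Sum p^(k)(a)/k! * (x-a)^k until the derivative tower runs out."""
    total, k, current = 0, 0, list(coeffs)
    while current:                       # empty list means the zero polynomial
        total += poly_eval(current, a) / factorial(k) * (x - a) ** k
        current = poly_deriv(current)
        k += 1
    return total

p = [2, 0, -3, 0, 1]                     # hypothetical example: 2 - 3x^2 + x^4
print(poly_eval(p, 2.5), taylor_of_poly(p, a=1, x=2.5))
```

The loop stops after five terms because the fifth derivative of a degree-four polynomial is zero, and the two printed values agree even though 2.5 is well away from the center.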

Worked example: Newton’s method is Taylor at work

The first-order Taylor term is not just for approximating values; it powers a fast way to solve equations. Newton’s method finds a zero of a function by repeatedly replacing the function with its tangent line (its first-order Taylor approximation) and jumping to where that line crosses zero. The update is the current guess minus the function value divided by the first derivative, which is exactly what you get by setting the tangent line (the function value plus the first derivative times the difference between the new guess and the current guess) equal to zero and solving for the new guess.

Use it to compute the square root of two, the positive zero of the function the input squared minus two (so the first derivative is two times the input). Start at the initial guess of 1.5:

x_1 = 1.5 - (1.5^2 - 2) / (2·1.5) = 1.5 - 0.25/3 = 1.41667
x_2 = 1.41667 - (1.41667^2 - 2) / (2·1.41667) ≈ 1.414216

After two steps the estimate is within about two millionths of the square root of two, 1.414214. Each step uses the first-order Taylor model of the function at the current guess, and the convergence is rapid because the tangent line is an excellent local stand-in. One of the most widely used root-finders in scientific computing is, underneath, this lesson’s first-order term.
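The iteration is easy to run for a few more steps. The sketch below hard-codes the function x squared minus two and its derivative two x from the worked example; the name newton_sqrt2 is chosen for this illustration.

```python
def newton_sqrt2(x0, steps):
    """Newton iterates for f(x) = x^2 - 2, whose derivative is 2x."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x - (x * x - 2) / (2 * x))   # jump to the tangent line's zero
    return xs

for i, x in enumerate(newton_sqrt2(1.5, 4)):
    print(i, f"{x:.10f}")
```

Printing more digits makes the speed visible: the number of correct digits roughly doubles with each step once the guess is close.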

Taylor is the most structurally important calculus idea for machine learning, because the field’s core algorithms are Taylor approximations of a loss function.

Gradient descent is first-order Taylor. Every training step approximates the loss near the current parameters by its tangent plane, the loss at theta plus the step is about the loss at theta plus the gradient dotted with the step, and moves in the steepest-downhill direction of that linear model. The whole of gradient-based training is repeated first-order Taylor.
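Here is a minimal sketch of that idea on a made-up two-parameter loss. The loss function, learning rate, and step count are all assumptions chosen for illustration; nothing here comes from a real training setup.

```python
def loss(theta):
    x, y = theta
    return (x - 1.0) ** 2 + 2.0 * (y + 0.5) ** 2   # hypothetical loss

def grad(theta):
    x, y = theta
    return [2.0 * (x - 1.0), 4.0 * (y + 0.5)]

def linear_model(theta, step):
    """First-order Taylor prediction: loss(theta) + grad(theta) . step."""
    g = grad(theta)
    return loss(theta) + sum(gi * si for gi, si in zip(g, step))

def gradient_descent(theta, lr=0.1, steps=100):
    for _ in range(steps):
        g = grad(theta)
        theta = [t - lr * gi for t, gi in zip(theta, g)]   # step downhill
    return theta

theta0 = [0.0, 0.0]
step = [-0.01 * g for g in grad(theta0)]
# Tangent-plane prediction versus the actual loss after a small step.
print(linear_model(theta0, step), loss([t + s for t, s in zip(theta0, step)]))
print(gradient_descent(theta0))
```

For a small step the tangent-plane prediction and the true loss nearly coincide, which is what justifies trusting the linear model one step at a time.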

Newton’s method for optimization is second-order Taylor. It keeps the next term too, building a parabola (using the second-derivative information, the Hessian from the previous lesson) and jumping to that parabola’s bottom. The trade-off between gradient descent and Newton’s method is exactly the trade-off between a first-order and a second-order Taylor model. (Newton’s method for finding a zero of a function is the same idea: set the first-order Taylor approximation to zero and solve, giving the update the current guess minus the function value divided by the first derivative. Applying that root-finder to the derivative of the loss produces exactly the second-order step.)
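A one-parameter sketch makes the comparison concrete. The loss here, e^theta minus two theta, is a hypothetical choice with a known minimizer at the natural log of two; the learning rate and step counts are likewise assumptions.

```python
import math

def d1(theta):
    return math.exp(theta) - 2.0         # first derivative of the loss

def d2(theta):
    return math.exp(theta)               # second derivative of the loss

def newton_minimize(theta, steps):
    """Jump to the bottom of the local parabola: theta - L'(theta)/L''(theta)."""
    for _ in range(steps):
        theta = theta - d1(theta) / d2(theta)
    return theta

def gradient_descent(theta, lr, steps):
    """First-order step for comparison: theta - lr * L'(theta)."""
    for _ in range(steps):
        theta = theta - lr * d1(theta)
    return theta

target = math.log(2.0)                   # exact minimizer of e^theta - 2*theta
print(abs(newton_minimize(0.0, 5) - target))
print(abs(gradient_descent(0.0, 0.1, 5) - target))
```

On this example five second-order steps land far closer to the minimizer than five first-order steps, at the price of needing the second derivative at every step.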

The neural tangent kernel is a first-order Taylor expansion of a network’s output with respect to its parameters at initialization; in the infinite-width limit it makes training analytically tractable and has become a central tool for understanding why deep networks train the way they do. And at the lowest level, when a math library or processor evaluates sine, cosine, the exponential, or the logarithm, it typically reduces the input to a small range and then evaluates a polynomial there, a truncated Taylor-style series, because addition and multiplication are what the hardware does directly. Taylor is not an incidental tool in machine learning; it is the shape of how the field reasons about functions it cannot handle whole.
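As a rough sketch of the polynomial-evaluation idea, the code below computes sine by first folding the input into the interval from minus pi over two to pi over two and then summing a short Taylor polynomial. This is an illustration only: production math libraries use more carefully tuned polynomials and more careful range reduction, and the names sin_poly and sin_approx are hypothetical.

```python
import math

def sin_poly(x, n_terms=6):
    """Truncated Taylor series of sine; accurate only for small |x|."""
    return sum((-1) ** j * x ** (2 * j + 1) / math.factorial(2 * j + 1)
               for j in range(n_terms))

def sin_approx(x):
    """Reduce x to [-pi/2, pi/2], then evaluate the polynomial there."""
    r = math.remainder(x, 2 * math.pi)   # r lies in [-pi, pi]
    if r > math.pi / 2:
        r = math.pi - r                  # sin(pi - r) = sin(r)
    elif r < -math.pi / 2:
        r = -math.pi - r                 # sin(-pi - r) = sin(r)
    return sin_poly(r)

for x in (0.3, 2.0, 10.0, -7.5, 100.0):
    print(x, sin_approx(x), math.sin(x))
```

The range reduction is what lets a six-term polynomial serve every input: the series is only ever asked about points close to its center.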

Forgetting the derivatives are evaluated at the center. Every coefficient is the k-th derivative of the function at a, the derivative at the expansion point a, not at the variable input. The series approximates the function near a, and the derivatives that build it are all measured there.

Dropping the factorials. The k-th term is the k-th derivative of the function at a divided by k factorial, not the k-th derivative alone. Omitting the factorial breaks the matching property and the series no longer reproduces the function.

Expecting accuracy far from the center. Taylor approximations are local: excellent near the center a, and for some functions they diverge once the input gets too far away. More terms help near the center, not necessarily far from it.

Confusing the approximation with the function. A truncated Taylor series (a few terms) is an approximation; the full infinite series, where it converges, equals the function. Be clear about whether you are using a finite stand-in or the exact expansion.
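The locality warning can be seen numerically with the logarithm. The series for the natural log of one plus x centered at zero is x - x^2/2 + x^3/3 - ..., and it converges only for x greater than minus one and at most one. The sketch below compares a point inside that range with one outside it; the function name and sample points are assumptions of this illustration.

```python
import math

def log1p_taylor(x, n_terms):
    """Partial sum of x - x^2/2 + x^3/3 - ... (series of ln(1 + x) at 0)."""
    return sum((-1) ** (k + 1) * x ** k / k for k in range(1, n_terms + 1))

# Columns: number of terms, error at x = 0.5, error at x = 2.0.
for n in (5, 10, 20, 40):
    inside = abs(log1p_taylor(0.5, n) - math.log(1.5))
    outside = abs(log1p_taylor(2.0, n) - math.log(3.0))
    print(n, f"{inside:.2e}", f"{outside:.2e}")
```

At x = 0.5 the error shrinks with every added batch of terms; at x = 2 it grows without bound, so adding terms makes the finite stand-in worse, not better.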

  • A Taylor series rebuilds a function near a point from its derivatives there: the function at the input is about the value at a, plus the first derivative at a times the input minus a, plus the second derivative at a over two factorial times the input minus a squared, and on. Each term matches one more order of the function’s behavior at the center a; the factorials are what make that matching exact.
  • The classic series follow from earlier lessons: Euler’s number raised to the input is one plus the input plus the input squared over two factorial and on (because every derivative of Euler’s number raised to the input is itself), and sine and cosine give the alternating odd and even-power series. The small-angle rule that the sine of the input is about the input is just the first-order Taylor term, and L’Hôpital’s rule is the first-order Taylor ratio.
  • Taylor is the calculus idea machine learning leans on most: gradient descent is a first-order Taylor step, Newton’s method is second-order, the neural tangent kernel is a first-order Taylor expansion of a network, and processors compute sine, the exponential, and friends as Taylor-style polynomials.

You opened this track unable to say why a circle’s area is pi times the radius squared. You can now derive it, and far more: you have both halves of calculus, differentiation and integration, in the form of the rules that differentiate the functions you meet, the theorem that ties rates to accumulation, and finally the Taylor series that reconstructs those functions from their derivatives at a point. That last idea is the one the machine-learning papers assume you carry: when they write a gradient step, a second-order method, or a tangent-kernel argument, they are speaking Taylor. The arc that began with a circle closes here, with a single polynomial standing in for any well-behaved function you like, built from nothing but the rates you now know how to find.