Why e is special: its own derivative

You have met Euler’s number, about 2.71828, and probably been told it is “important,” without ever being told why. Why does this specific irrational number get its own letter, when 2.71828 looks no more special than any other decimal? The answer has nothing to do with its digits. Euler’s number is defined by a behavior, and once you see the behavior, the number stops being arbitrary and starts being inevitable.

The behavior is this: Euler’s number is the one base for which the exponential function is its own derivative.

d/dx( e^x ) = e^x

The rate of change of Euler’s number raised to the input, at every point, equals its value at that point. No other base does this: every other base picks up a constant factor different from 1. This lesson shows why such a base must exist, why it lands between 2 and 3, and why this one property makes Euler’s number appear across nature and across machine learning.

The derivative of any exponential

Start general. Take any positive base, call it a, and differentiate a^x (the base raised to the input) from the limit definition:

d/dx( a^x ) = lim (h->0) ( a^(x+h) - a^x ) / h

The key move is that the base raised to the input plus a small step equals the base raised to the input times the base raised to that small step, so the base raised to the input factors out of the whole expression:

d/dx( a^x ) = a^x · lim (h->0) ( a^h - 1 ) / h

That remaining limit does not depend on the input at all; it is just some constant determined by the base. Call it the multiplier of the base, written M(a). So for every exponential:

d/dx( a^x ) = M(a) · a^x

The derivative of an exponential is the exponential itself, times a constant that depends only on the base. The shape is always “a copy of the function,” and the only thing the base controls is the size of the multiplier out front.
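The factoring step can be checked numerically. The sketch below is an illustration rather than part of the derivation: it estimates the slope of a^x with a small symmetric step (a central difference, slightly different from the one-sided limit above) and divides by a^x. The helper names `slope` and `ratio_at` are assumptions made up for this example. If the factoring is right, the ratio should come out the same at every input for a given base.

```python
def slope(f, x, h=1e-6):
    """Central-difference estimate of the derivative of f at x."""
    return (f(x + h) - f(x - h)) / (2 * h)

def ratio_at(a, x):
    """Estimated derivative of a**x divided by a**x itself."""
    f = lambda t: a ** t
    return slope(f, x) / f(x)

for a in (2.0, 3.0, 10.0):
    # One ratio per input; the inputs are arbitrary sample points.
    ratios = [ratio_at(a, x) for x in (-1.0, 0.0, 1.0, 2.5)]
    print(a, [round(r, 5) for r in ratios])
```

Each printed row should repeat a single value: the multiplier for that base, independent of the input.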

The multiplier lands on 1 somewhere between 2 and 3

Now look at that multiplier for a few bases (these constants turn out to be the natural logarithm of the base, but you can read them off numerically):

For base 2: the multiplier is about 0.693. So the derivative of 2 raised to the input is about 0.693 times 2 raised to the input. The multiplier is less than 1.
For base 3: the multiplier is about 1.099. So the derivative of 3 raised to the input is about 1.099 times 3 raised to the input. The multiplier is more than 1.

The multiplier is below 1 at base 2 and above 1 at base 3, and it increases smoothly as the base increases, so somewhere between them it passes through exactly 1, and it does so only once. The base at which the multiplier equals 1 is what we call Euler’s number, and it comes out to about 2.71828. At that base, and only that base:

d/dx( e^x ) = 1 · e^x = e^x

That is the definition of Euler’s number. Not “2.71828 because someone said so,” but “the base that makes the multiplier exactly 1, so that the exponential is its own derivative.” The decimal is a consequence; the behavior is the definition.
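One way to watch the decimal fall out of the behavior is to search for the base numerically. The following is a minimal sketch, assuming a symmetric small-step estimate of the multiplier and a plain bisection between 2 and 3; the names `multiplier` and `find_base` are hypothetical, and this is not how the constant is computed in practice.

```python
def multiplier(a, h=1e-6):
    """Estimate lim (a**h - 1) / h using a small symmetric step."""
    return (a ** h - a ** (-h)) / (2 * h)

def find_base(lo=2.0, hi=3.0, steps=40):
    """Bisect for the base whose multiplier equals 1."""
    for _ in range(steps):
        mid = (lo + hi) / 2
        if multiplier(mid) < 1.0:
            lo = mid   # multiplier too small: the base must be larger
        else:
            hi = mid   # multiplier too large: the base must be smaller
    return (lo + hi) / 2

print(multiplier(2.0), multiplier(3.0))
print(find_base())
```

The search never mentions 2.71828; it only asks for a multiplier of 1, and the familiar digits appear as the result.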

See it in numbers

Test the self-derivative property with the limit and a small step. The slope of Euler’s number raised to the input, at any input, should equal that same value itself.

At input 0: approximate the slope as Euler’s number raised to a small step, minus 1, all divided by that small step. With a step of 0.001, Euler’s number raised to 0.001 is about 1.0010005, so the slope is about 0.0010005 divided by 0.001, which is about 1.0005, essentially 1. And Euler’s number raised to 0 is 1. Slope equals value.
At input 1: the same calculation gives a slope of about 2.7196, and Euler’s number raised to 1 is Euler’s number itself, about 2.7183. Slope equals value again, up to the same small error from using a finite step.

Contrast base 2. At input 0, the slope of 2 raised to the input is 2 raised to 0.001, minus 1, divided by 0.001, which is about 0.693, but 2 raised to 0 is 1. The slope 0.693 does not equal the value 1: base 2 is not its own derivative. That mismatch, present for every base except Euler’s number, is exactly what Euler’s number is engineered to remove.
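The same checks can be run in a few lines. This sketch uses the one-sided step of 0.001 from the text; `forward_slope` is a name chosen for this example, not something defined elsewhere in the lesson.

```python
import math

def forward_slope(base, x, h=0.001):
    """Slope estimate (base**(x + h) - base**x) / h, as in the text."""
    return (base ** (x + h) - base ** x) / h

for base, name in ((math.e, "e"), (2.0, "2")):
    for x in (0.0, 1.0):
        s = forward_slope(base, x)
        v = base ** x
        print(f"base {name}, input {x}: slope {s:.4f}, value {v:.4f}")
```

For base e the slope and the value agree to within about 0.05 percent, and that gap shrinks as the step shrinks. For base 2 the gap stays put no matter how small the step is, because the slope is converging to about 0.693 times the value.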

Three exponential curves through (0, 1), each with its tangent line at that point. The tangent slope there equals the natural log of the base: about 0.69 for 2^x (less than 1), about 1.10 for 3^x (greater than 1), and exactly 1 for e^x. That last fact is what makes e special. The curve whose slope equals its own value at every point is e^x, so its rate of growth is always its current size.

The chain rule makes e the answer to a whole family of problems

Combine the self-derivative property with the chain rule from two lessons ago. For Euler’s number raised to a constant times the input, the outer function (Euler’s number raised to the inner) has itself as its derivative, and the inner (the constant times the input) has the constant as its derivative, so:

d/dx( e^(kx) ) = e^(kx) · k = k · e^(kx)

For example, the derivative of Euler’s number raised to 3 times the input is 3 times Euler’s number raised to 3 times the input. This is the formula that makes Euler’s number indispensable, because it is the answer to one of the most common equations in all of science:

f'(x) = k · f(x)

In words: “the rate of change is proportional to the current value.” The function that maps the input to Euler’s number raised to a constant times the input satisfies it exactly, since its derivative is the constant times itself. (As a compound chain-rule check, the derivative of Euler’s number raised to the input squared is 2 times the input, times Euler’s number raised to the input squared: the self-derivative property and the power rule cooperating.)
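Both chain-rule results can be compared against a numerical slope. The sketch below is only a spot check at one arbitrarily chosen input (0.7) with a constant of 3; the helper `slope` and the variable names are assumptions for illustration.

```python
import math

def slope(f, x, h=1e-6):
    """Central-difference estimate of the derivative of f at x."""
    return (f(x + h) - f(x - h)) / (2 * h)

x = 0.7   # arbitrary test input
k = 3.0   # the constant in the exponent

# d/dx e^(kx) should equal k * e^(kx)
numeric = slope(lambda t: math.exp(k * t), x)
formula = k * math.exp(k * x)
print(numeric, formula)

# d/dx e^(x^2) should equal 2x * e^(x^2)
numeric_sq = slope(lambda t: math.exp(t * t), x)
formula_sq = 2 * x * math.exp(x * x)
print(numeric_sq, formula_sq)
```

In each pair the two numbers should agree to many decimal places. Dropping the factor k (or 2x) from the formula makes the mismatch obvious.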

Why “rate proportional to value” is everywhere

That equation, the derivative equals a constant times the function, describes an enormous range of real processes, and every one of them is governed by Euler’s number:

Continuously compounded interest: money grows at a rate proportional to how much you have, so a balance follows Euler’s number raised to a constant times time.
Population growth: more individuals means more reproduction, so an unchecked population grows like Euler’s number raised to a constant times time.
Radioactive decay: atoms decay at a rate proportional to how many remain, giving Euler’s number raised to a negative constant times time (the constant is negative because the amount is shrinking).
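A capacitor discharging through a resistor: the voltage falls at a rate proportional to the voltage that remains, again an exponential in time. (While charging, it is the gap between the supply voltage and the capacitor voltage that shrinks this way.)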

Anywhere the rate of change is tied to the current amount, Euler’s number raised to something times time is the natural shape of the answer. That is the real reason Euler’s number is “important”: it is the fingerprint of self-proportional change, which is one of the most common patterns in nature.
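To see the exponential emerge from nothing but the rule “rate proportional to value,” you can step the rule forward in tiny increments and compare with the closed form. This is a rough sketch using simple Euler stepping; the decay constant, starting amount, and duration are hypothetical values picked for the example, and `euler_solve` is an illustrative name.

```python
import math

def euler_solve(k, f0, t_end, steps):
    """Step f' = k * f forward from f0 using many small time steps."""
    dt = t_end / steps
    f = f0
    for _ in range(steps):
        f += k * f * dt   # change = rate * dt, where rate = k * current value
    return f

k, f0, t_end = -0.5, 100.0, 4.0   # hypothetical decay constant, start amount, duration
approx = euler_solve(k, f0, t_end, steps=100_000)
exact = f0 * math.exp(k * t_end)   # closed form: f0 * e^(k t)
print(approx, exact)
```

The stepping loop contains no exponential at all, yet its result should land very close to the closed-form value, and closer still as the number of steps grows.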

A worked example that machine learning uses directly

Put Euler’s number and the chain rule together on a function that appears in real neural networks: the sigmoid, σ(x) = 1 / (1 + e^(-x)), that is, 1 divided by the quantity 1 plus Euler’s number raised to the negative input. Write it as that quantity raised to the negative-1 power and differentiate with the chain rule. The outer function (the inner raised to the negative-1 power) has derivative negative the inner raised to the negative-2 power; the inner (1 plus Euler’s number raised to the negative input) has derivative negative Euler’s number raised to the negative input (the self-derivative of the exponential, times the derivative of the negative input, which is negative 1). Multiplying:

σ'(x) = -(1 + e^(-x))^(-2) · (-e^(-x)) = e^(-x) / (1 + e^(-x))^2

Now the elegant part. Since the sigmoid equals 1 over the quantity 1 plus Euler’s number raised to the negative input, and 1 minus the sigmoid equals Euler’s number raised to the negative input over that same quantity, their product is Euler’s number raised to the negative input over that quantity squared, which is exactly the sigmoid’s derivative. So:

σ'(x) = σ(x) · (1 - σ(x))

The sigmoid’s derivative is expressible entirely in terms of the sigmoid itself, which is why it was so cheap to train with: once you have computed the sigmoid in the forward pass, its derivative costs almost nothing. That convenience is a direct gift of the self-derivative property of Euler’s number combined with the chain rule.
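The identity is easy to confirm numerically. The sketch below computes the derivative two ways: from the forward-pass output alone, and directly from the chain-rule expression. The function names are assumptions for this example, and the code is not hardened against overflow for very large negative inputs.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad_from_output(s):
    """Derivative written only in terms of s = sigmoid(x)."""
    return s * (1.0 - s)

def sigmoid_grad_direct(x):
    """Chain-rule form e^(-x) / (1 + e^(-x))^2."""
    ex = math.exp(-x)
    return ex / (1.0 + ex) ** 2

for x in (-2.0, 0.0, 1.5):
    s = sigmoid(x)   # forward pass
    print(x, sigmoid_grad_from_output(s), sigmoid_grad_direct(x))
```

The two derivative columns should match at every input. The first needs only one multiplication and one subtraction once the forward value is known, which is the saving the text describes.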

Why this matters when you use AI

Euler’s number is woven through machine learning more deeply than almost any other constant, because probability and exponentials are inseparable.

Softmax, the function that turns a vector of scores into a probability distribution, is built from Euler’s number: each score becomes Euler’s number raised to that score, and is then normalized by the sum of Euler’s number raised to all the scores. It sits at the output of essentially every classification model, including the next-token prediction in a language model.
The sigmoid, 1 divided by the quantity 1 plus Euler’s number raised to the negative input, squashes any number into the range 0 to 1 and was the classic neuron activation. Its derivative, the sigmoid times 1 minus the sigmoid, falls straight out of the chain rule applied to Euler’s number raised to the negative input, which is why it was so convenient to train with.
Continuous-time models, such as neural differential equations and the score-based diffusion models behind modern image generators, are defined by differential equations. Wherever such an equation contains a term of the form “the derivative equals something times the function,” as in the gradual shrinking of the signal in common diffusion noising processes, the solution is an exponential, so Euler’s number shows up at the architecture level.

Almost every place a model expresses a probability or a smooth proportional change, Euler’s number is underneath, and it is there because of the one property this lesson defined: the exponential that is its own derivative.
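The softmax recipe described above fits in a few lines. This is a minimal sketch on plain Python lists with made-up scores. One addition that is not in the description: it subtracts the largest score before exponentiating, a common stability trick that leaves the result unchanged because the shared factor cancels in the ratio.

```python
import math

def softmax(scores):
    """Turn scores into probabilities: e**score divided by the sum of all e**score."""
    shift = max(scores)   # assumption: stability shift, cancels in the ratio
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]

probs = softmax([2.0, 1.0, 0.1])   # hypothetical scores
print(probs, sum(probs))
```

The outputs are all positive, keep the ordering of the scores, and sum to 1 up to rounding, which is what makes them usable as a probability distribution.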

Common pitfalls

Thinking Euler’s number is defined by its digits. The value, about 2.71828, is a consequence, not a definition. Euler’s number is defined as the base where the exponential is its own derivative; the decimal expansion is just what that base happens to equal.

Confusing Euler’s number raised to the input with a power like the input raised to a fixed exponent. In the input raised to a fixed exponent, the variable is the base and the exponent is fixed (use the power rule). In Euler’s number raised to the input, the base is fixed and the variable is the exponent (use the self-derivative property). They are different kinds of function with different derivative rules; do not apply the power rule to Euler’s number raised to the input.

Forgetting the chain-rule factor on Euler’s number raised to a constant times the input. The derivative is the constant times Euler’s number raised to a constant times the input, not Euler’s number raised to a constant times the input alone. The constant in the exponent comes down as a factor, exactly as the chain rule requires. Only the bare Euler’s number raised to the input (with the constant equal to 1) is its own derivative unchanged.

Assuming every exponential is its own derivative. Only base Euler’s number is. For base 2, the derivative of 2 raised to the input is the natural log of 2 times 2 raised to the input, with an extra factor of about 0.693. The self-derivative property is what singles Euler’s number out from all other bases.

What you should remember

Euler’s number is defined by behavior, not by its digits: it is the unique base for which the derivative of Euler’s number raised to the input equals Euler’s number raised to the input, the exponential that is its own derivative. The derivative of any exponential is the multiplier times the base raised to the input; Euler’s number is the base that makes that multiplier exactly 1.
The chain rule extends this so that the derivative of Euler’s number raised to a constant times the input is the constant times Euler’s number raised to a constant times the input, which is the solution to “the derivative equals a constant times the function,” that is, “rate of change proportional to current value.” That equation governs compound interest, population growth, radioactive decay, and circuits, so Euler’s number is the natural shape of self-proportional change everywhere.
Euler’s number is central to machine learning through softmax (every classifier’s output), the sigmoid activation and its clean derivative, and continuous-time models, because probability and smooth proportional change both run on the exponential.

Euler’s number was never a magic decimal to memorize. It is the answer to a question, “which base makes the exponential its own derivative,” and that single property is why it threads through growth, decay, probability, and the inner workings of the models you use. The next lesson turns to implicit differentiation, where the derivative rules so far get applied to relationships that are not neatly solved for one variable.