Practice: Why e is special
Self-check
Six short questions. Answer each one in your head (or on paper) before opening the collapsible. Trying to retrieve the answer is where the learning sticks; rereading feels productive but does much less.
1. What actually defines e? (Not its digits.)
Show answer
e is the unique base for which the exponential is its own derivative: d/dx(e^x) = e^x, so the rate of change equals the value at every point. The decimal e ≈ 2.71828 is a consequence of that property, not the definition.
2. What is the derivative of a general exponential a^x?
Show answer
d/dx(a^x) = M(a) · a^x: the function itself times a constant M(a) determined only by the base (it works out to M(a) = ln a). Factoring a^(x+h) = a^x · a^h pulls a^x out of the limit, leaving M(a) = lim as h → 0 of (a^h - 1)/h, a constant that does not depend on x. The shape is always “a copy of the function”; the base only sets the multiplier.
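To see the multiplier numerically, here is a minimal sketch that estimates the leftover limit factor (a^h - 1)/h with a small fixed h and compares it to ln a. The function name `multiplier` and the step size h = 1e-6 are illustrative choices, not part of the lesson.

```python
import math

def multiplier(a: float, h: float = 1e-6) -> float:
    """Estimate M(a), the limit of (a**h - 1) / h as h -> 0, using a small fixed h."""
    return (a**h - 1.0) / h

# Compare the estimate with ln(a) for a few bases.
for a in (2.0, math.e, 3.0, 10.0):
    print(f"a = {a:.5f}   M(a) ~ {multiplier(a):.6f}   ln a = {math.log(a):.6f}")
```

The estimate and ln a should agree to several decimal places, and the row for base e should show a multiplier very close to 1.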
3. Why does e land between 2 and 3?
Show answer
The multiplier M(a) is below 1 at base 2 (M(2) ≈ 0.693) and above 1 at base 3 (M(3) ≈ 1.099), so somewhere between them it passes through exactly 1. The base where the multiplier equals 1, making the exponential its own derivative, is e ≈ 2.71828.
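The "somewhere between them" step can be made concrete with bisection. The sketch below assumes M(a) increases with the base (true, since M(a) = ln a) and estimates it with a small finite h; `find_self_derivative_base` is a hypothetical helper name.

```python
import math

def multiplier(a: float, h: float = 1e-6) -> float:
    """Finite-h estimate of M(a) = lim (a**h - 1) / h."""
    return (a**h - 1.0) / h

def find_self_derivative_base(lo: float = 2.0, hi: float = 3.0, steps: int = 40) -> float:
    """Bisect on the base until the multiplier crosses 1."""
    for _ in range(steps):
        mid = (lo + hi) / 2
        if multiplier(mid) < 1.0:
            lo = mid   # multiplier still below 1: the crossing is to the right
        else:
            hi = mid   # multiplier at or above 1: the crossing is to the left
    return (lo + hi) / 2

base = find_self_derivative_base()
print(f"crossing base ~ {base:.5f}   math.e = {math.e:.5f}")
```

Because the multiplier is only approximated, the result matches e to about five or six digits rather than to full precision.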
4. Using the chain rule, what is d/dx(e^(kx)), and what equation does e^(kx) solve?
Show answer
d/dx(e^(kx)) = k · e^(kx) (chain rule: the outer e^u gives e^u, the inner kx gives k). So e^(kx) is the solution to f'(x) = k · f(x), “the rate of change is proportional to the current value.” That equation governs compound interest, population growth, radioactive decay, and charging and discharging circuits.
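One way to see that e^(kx) solves f' = k·f is to step the equation forward directly and watch the result approach the exponential. This is a rough sketch using Euler's method (a swapped-in numerical technique, not something the lesson itself uses); `euler_solve` and the values k = 0.5, x = 2 are illustrative.

```python
import math

def euler_solve(k: float, x_end: float, steps: int, f0: float = 1.0) -> float:
    """Step f' = k*f forward from f(0) = f0 to x_end with Euler's method."""
    dx = x_end / steps
    f = f0
    for _ in range(steps):
        f += k * f * dx   # the change over dx is (rate) * dx, and rate = k * f
    return f

k, x_end = 0.5, 2.0
exact = math.exp(k * x_end)
for steps in (10, 100, 10_000):
    approx = euler_solve(k, x_end, steps)
    print(f"steps = {steps:>6}   Euler ~ {approx:.6f}   e^(k*x) = {exact:.6f}")
```

As the step count grows, the stepped solution closes in on e^(kx), which is what "e^(kx) solves f' = k·f" means in practice.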
5. Why is e everywhere in machine learning?
Show answer
Because probability and smooth proportional change both run on the exponential. Softmax (e^(x_i) normalized by the sum of e^(x_j)) is the standard output layer of neural classifiers; the sigmoid (1/(1 + e^(-x))) was the classic activation, with a clean derivative σ(1-σ) from the chain rule on e^(-x); and continuous-time models (neural ODEs, diffusion) are built on differential equations, where the linear case f' = k·f has exponential solutions.
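A small sketch of the two formulas mentioned above, in plain Python. Subtracting the largest logit before exponentiating is a common numerical-stability trick and is an addition here, not something the lesson states; the function names are illustrative.

```python
import math

def softmax(logits: list[float]) -> list[float]:
    """e^(x_i) divided by the sum of e^(x_j), shifted by max(logits) for stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [v / total for v in exps]

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x: float) -> float:
    """Derivative via the identity sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

probs = softmax([1.0, 2.0, 3.0])
print("softmax:", [round(p, 4) for p in probs], "sum =", round(sum(probs), 6))

# Check the sigmoid identity against a central-difference slope at x = 1.
h = 1e-5
numeric = (sigmoid(1.0 + h) - sigmoid(1.0 - h)) / (2 * h)
print(f"sigmoid'(1): identity = {sigmoid_grad(1.0):.8f}   numeric ~ {numeric:.8f}")
```

The softmax outputs are positive and sum to 1, and the identity-based derivative agrees with the numerical slope.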
6. Why can’t you use the power rule on e^x?
Show answer
Because e^x and x^n are different kinds of function. In x^n the variable is the base and the exponent is fixed (power rule applies). In e^x the base is fixed and the variable is the exponent (the self-derivative property applies). Applying the power rule to e^x is a category error.
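The category error shows up immediately in numbers. The sketch below measures the slope of x³ and of e^x at x = 2, then compares each with what the power rule would predict; the misapplied formula x·e^(x-1) is what you get by treating e^x as if it were x^n, and the helper name `central_diff` is illustrative.

```python
import math

def central_diff(f, x: float, h: float = 1e-5) -> float:
    """Symmetric finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

x = 2.0

# Variable in the base: the power rule works.
slope_cube = central_diff(lambda t: t**3, x)
print(f"x^3:  measured slope ~ {slope_cube:.4f}   power rule 3x^2 = {3 * x**2:.4f}")

# Variable in the exponent: the power rule gives the wrong number.
slope_exp = central_diff(math.exp, x)
misapplied = x * math.exp(x - 1)   # "bring the exponent down, subtract one"
print(f"e^x:  measured slope ~ {slope_exp:.4f}   misapplied power rule = {misapplied:.4f}"
      f"   e^x itself = {math.exp(x):.4f}")
```

For x³ the measured slope matches the power rule; for e^x it matches e^x itself and is well away from the misapplied formula.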
Try it yourself, part 1: check the property, then differentiate
Pen and paper (a calculator helps), about 7 minutes.
(a) Confirm the self-derivative property numerically at x = 2: approximate the slope as (e^(2+h) - e^2)/h with h = 0.001, and compare to e^2. (Use e^2 ≈ 7.389.)
(b) Differentiate each: e^(5x), e^(-2x), e^(x³).
Show answer
(a) e^2 ≈ 7.389 (7.389056 to more digits; keep the extra digits, because the subtraction cancels most of them), and e^(2.001) = e^2 · e^(0.001) ≈ 7.389056 · 1.0010005 ≈ 7.396449.
slope ≈ (7.396449 - 7.389056) / 0.001 ≈ 0.007393 / 0.001 ≈ 7.393
That is close to e^2 ≈ 7.389 (the small excess, about 0.004, is just the forward-difference error; it shrinks as h does). Slope equals value, exactly the self-derivative property.
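To watch the excess shrink as h does, the same forward difference can be run for several step sizes. This is a minimal sketch; `forward_slope` is an illustrative name.

```python
import math

def forward_slope(x: float, h: float) -> float:
    """Forward-difference slope of e^x at x: (e^(x+h) - e^x) / h."""
    return (math.exp(x + h) - math.exp(x)) / h

target = math.exp(2.0)
for h in (0.1, 0.01, 0.001, 0.0001):
    s = forward_slope(2.0, h)
    print(f"h = {h:<7} slope ~ {s:.5f}   excess over e^2 = {s - target:.5f}")
```

Each tenfold cut in h cuts the excess by roughly a factor of ten, which is the expected behavior of a forward difference.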
(b) Use d/dx(e^(kx)) = k·e^(kx) (and the chain rule for the third):
d/dx(e^(5x)) = 5·e^(5x)
d/dx(e^(-2x)) = -2·e^(-2x)
d/dx(e^(x³)) = 3x²·e^(x³)
The constant in the exponent comes down as a factor; only the bare e^x (with k = 1) is unchanged.
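If you want to double-check answers like these without redoing the algebra, a numerical slope works as a sanity check. The sketch below compares each claimed derivative with a central-difference estimate at x = 0.5 (an arbitrary test point); all names are illustrative.

```python
import math

def central_diff(f, x: float, h: float = 1e-6) -> float:
    """Symmetric finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

# (label, function, claimed derivative)
cases = [
    ("e^(5x)",  lambda x: math.exp(5 * x),  lambda x: 5 * math.exp(5 * x)),
    ("e^(-2x)", lambda x: math.exp(-2 * x), lambda x: -2 * math.exp(-2 * x)),
    ("e^(x^3)", lambda x: math.exp(x**3),   lambda x: 3 * x**2 * math.exp(x**3)),
]

x0 = 0.5
for label, f, df in cases:
    print(f"{label:<8} claimed = {df(x0):.6f}   numeric ~ {central_diff(f, x0):.6f}")
```

Matching numbers do not prove a formula, but a mismatch reliably flags a slipped sign or a forgotten inner derivative.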
Try it yourself, part 2: rate proportional to value
About 3 minutes. A bank balance grows continuously at 5% per year, meaning its rate of change is 0.05 times the current balance: f'(t) = 0.05·f(t). Starting from $100, the balance is f(t) = 100·e^(0.05t). Verify that this f satisfies the equation, and say in words what it means.
Show answer
Differentiate f(t) = 100·e^(0.05t) using d/dt(e^(kt)) = k·e^(kt):
f'(t) = 100 · 0.05 · e^(0.05t) = 5·e^(0.05t)
0.05·f(t) = 0.05 · 100·e^(0.05t) = 5·e^(0.05t)
They are equal, so f'(t) = 0.05·f(t) and the equation holds. In words: at every instant the balance grows by 5% of whatever it currently is, so a bigger balance grows faster, which is exactly self-proportional (exponential) growth. The same e^(kt) shape with k < 0 describes radioactive decay or a discharging capacitor; with k > 0, population growth. e is the fingerprint of “rate proportional to value.”
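The same check can be done numerically at several points in time. This sketch tabulates the balance, its measured rate of change, and 0.05 times the balance; the constant names and the sample years are illustrative.

```python
import math

RATE = 0.05     # continuous growth rate per year
START = 100.0   # starting balance in dollars

def balance(t: float) -> float:
    """f(t) = 100 * e^(0.05 t)."""
    return START * math.exp(RATE * t)

def balance_rate(t: float, h: float = 1e-5) -> float:
    """Central-difference estimate of f'(t)."""
    return (balance(t + h) - balance(t - h)) / (2 * h)

for t in (0, 1, 10, 20):
    print(f"t = {t:>2}   balance = {balance(t):8.2f}   "
          f"rate ~ {balance_rate(t):7.4f}   0.05 * balance = {RATE * balance(t):7.4f}")
```

In every row the last two columns agree, and both grow as the balance grows: the larger the balance, the faster it rises.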
Flashcards
Nine cards.
Q. What defines e (not its digits)?
e is the unique base for which the exponential is its own derivative: d/dx(e^x) = e^x. The decimal e ≈ 2.71828 is a consequence of that behavior, not the definition.
Q. What is the derivative of a general exponential a^x?
d/dx(a^x) = M(a)·a^x, the function times a base-dependent constant M(a) = ln a. Factoring a^(x+h) = a^x·a^h pulls a^x out of the limit; the leftover is a constant independent of x.
Q. Why does e land between 2 and 3?
The multiplier M(a) is below 1 at base 2 (≈ 0.693) and above 1 at base 3 (≈ 1.099), so it crosses exactly 1 somewhere between. That crossing base, where the exponential is its own derivative, is e ≈ 2.71828.
Q. What is d/dx(e^(kx)), and what equation does e^(kx) solve?
d/dx(e^(kx)) = k·e^(kx) (chain rule: outer e^u gives e^u, inner kx gives k). So e^(kx) solves f'(x) = k·f(x), “rate of change proportional to current value.”
Q. Why is 'rate proportional to value' everywhere?
Because that pattern (f' = k·f) describes compound interest, population growth, radioactive decay (k < 0), and charging/discharging circuits. Every such process follows e^(kt), so e is the natural shape of self-proportional change.
Q. What is the sigmoid's derivative, and why is it cheap?
σ(x) = 1/(1 + e^(-x)) has derivative σ'(x) = σ(x)·(1 - σ(x)), which falls out of the chain rule on e^(-x). Once σ(x) is computed in the forward pass, its derivative costs almost nothing, which made it convenient to train with.
Q. Why can't you use the power rule on e^x?
x^n has the variable in the base (power rule); e^x has the variable in the exponent (self-derivative property). They are different kinds of function with different rules; applying the power rule to e^x is a category error.
Q. Where is e central in machine learning?
Softmax (e^(x_i) normalized) is the standard output layer of neural classifiers, including next-token prediction; the sigmoid activation is built from e^(-x); and continuous-time models (neural ODEs, diffusion) are built on differential equations, where the linear case f' = k·f has exponential solutions.
Q. Is every exponential its own derivative?
No, only base e. For base 2, d/dx(2^x) = ln(2)·2^x, with an extra factor of about 0.693. The self-derivative property is exactly what singles e out from all other bases.