The bell curve: the normal distribution

Back in the histogram lesson you met the bell shape and were told it was important enough to get its own lesson. This is that lesson. The bell curve, properly the normal distribution, is the most important distribution in statistics: it describes heights, measurement errors, test scores, and, as you will see in the next phase, the averages of almost anything. Here we make it precise, learn the one rule that makes it instantly usable, and connect it to the standardization you already met.

Probability as area under a curve

First, a shift from the previous lesson. A discrete random variable (a die, a count) has listable values, each with its own probability. A continuous random variable (a height, a time, a measurement) can take any value in a range. There are infinitely many possible values, so the probability of hitting any single exact value is zero. Instead, probability is area under a curve.

The curve is called a density curve, and the rule is simple: the total area under it is 1 (something happens), and the probability that the value falls in a range is the area over that range. You stop asking “what is the probability of exactly this value” and start asking “what is the probability of landing in this interval,” which is an area.
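To make "probability is area" concrete, here is a minimal sketch that approximates areas numerically. The triangular density on the range 0 to 2 is a hypothetical example chosen only because its areas are easy to check by hand, and the helper names density and area are illustrative, not from any library.

```python
def density(x: float) -> float:
    """A hypothetical triangular density on [0, 2], used only for illustration."""
    if 0.0 <= x <= 1.0:
        return x
    if 1.0 < x <= 2.0:
        return 2.0 - x
    return 0.0


def area(f, lo: float, hi: float, steps: int = 100_000) -> float:
    """Approximate the area under f between lo and hi with the midpoint rule."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width


total = area(density, 0.0, 2.0)         # area under the whole curve
middle = area(density, 0.5, 1.5)        # probability of landing in [0.5, 1.5]
single_point = area(density, 1.0, 1.0)  # zero-width interval, so zero area

print(round(total, 4), round(middle, 4), single_point)
```

The whole curve has area 1, the interval from 0.5 to 1.5 has area 0.75, and a single exact value has area 0 even though the curve is at its highest there.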

The normal distribution

The normal distribution is a specific, symmetric, bell-shaped density curve:

            .-"""-.
          .'       '.
         /           \
        /             \
      .'               '.
   _.'                   '._
  --------|-------|-------|--------
        mean-s   mean   mean+s

It is pinned down by just two numbers:

the mean (the center, where the peak sits), and
the standard deviation (how wide the bell is).

Change the mean and the whole bell slides left or right; change the standard deviation and it gets wider and flatter or narrower and taller. Every normal distribution has the same shape, just shifted and scaled. The mean and standard deviation here are exactly the expected value and spread from the previous lesson, now describing a continuous bell.
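The slide-and-scale behavior can be checked with NormalDist from Python's standard statistics module (Python 3.8 or later). This is a small sketch; the particular means and standard deviations are arbitrary choices for illustration.

```python
from statistics import NormalDist

base = NormalDist(mu=0, sigma=1)
shifted = NormalDist(mu=5, sigma=1)  # same width, center moved to 5
wider = NormalDist(mu=0, sigma=2)    # same center, twice as wide

# The peak height is the density at the mean.
print(base.pdf(0), shifted.pdf(5), wider.pdf(0))

# The area within one standard deviation of the mean is the same for all three.
for dist in (base, shifted, wider):
    inside = dist.cdf(dist.mean + dist.stdev) - dist.cdf(dist.mean - dist.stdev)
    print(dist.mean, dist.stdev, round(inside, 4))
```

Shifting the mean leaves the peak height unchanged, and doubling the standard deviation halves it, because the total area has to stay at 1.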

The 68-95-99.7 rule

The single most useful fact about the normal distribution is how its area is divided up by standard deviations from the mean. This is the empirical rule:

about 68% of the values fall within 1 standard deviation of the mean
about 95% fall within 2 standard deviations
about 99.7% fall within 3 standard deviations

That is almost everything you need to reason about a normal distribution without a calculator. Take test scores that are normally distributed with a mean of 500 and a standard deviation of 100:

about 68% of scores fall between 400 and 600   (mean +/- 1 sd)
about 95% fall between 300 and 700             (mean +/- 2 sd)
about 99.7% fall between 200 and 800           (mean +/- 3 sd)

So a score of 700 is two standard deviations above the mean, out at the edge where only about 2.5% of scores are higher (half of the 5% that fall outside two standard deviations). The rule turns “how unusual is this value” into quick arithmetic.
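The rule's round numbers can be compared against exact areas. The sketch below uses the test-score example from the text (mean 500, standard deviation 100) with NormalDist from the standard library; the helper name within is an illustrative choice.

```python
from statistics import NormalDist

scores = NormalDist(mu=500, sigma=100)


def within(dist: NormalDist, k: float) -> float:
    """Probability of landing within k standard deviations of the mean."""
    return dist.cdf(dist.mean + k * dist.stdev) - dist.cdf(dist.mean - k * dist.stdev)


for k in (1, 2, 3):
    print(k, round(within(scores, k), 4))

# Share of scores above 700, two standard deviations above the mean.
above_700 = 1 - scores.cdf(700)
print(round(above_700, 4))
```

The exact areas are about 68.27%, 95.45%, and 99.73%, and the exact share above 700 is about 2.3%. The rule's 2.5% is a rounding that is close enough for quick reasoning.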

Pictured on the bell, the rule is three nested regions centered on the mean, reaching one, two, and three standard deviations to each side, with areas of about 68%, 95%, and 99.7%. The curve carries its own ruler, and this holds for every normal distribution, not only the standard one.

The z-score: how many standard deviations from the mean

To make “two standard deviations above the mean” into a precise, comparable number, you compute the z-score: how many standard deviations a value sits above (positive) or below (negative) the mean.

z = (value - mean) / standard deviation

This is exactly the standardization from the center-and-spread lesson, now with a name and a use. For the test scores (mean 500, standard deviation 100):

score 600:  z = (600 - 500) / 100 = +1.0   (one sd above the mean)
score 700:  z = (700 - 500) / 100 = +2.0   (two sd above)
score 450:  z = (450 - 500) / 100 = -0.5   (half an sd below)

The z-score’s power is comparability. For roughly normal data, a z of +2 means “near the top, about the 97.5th percentile” whether the underlying scale is test scores, heights, or dollars, because it strips away the original units. Combine a z-score with the empirical rule and you can read off percentiles: a value at z = +1 has about 84% of the distribution below it (the 50% below the mean plus the 34% between the mean and one standard deviation above).
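Here is a short sketch of that comparability. The test-score numbers come from the text; the height figures (mean 176 cm, standard deviation 7 cm) are hypothetical values made up for the comparison, and z_score is an illustrative helper name.

```python
def z_score(value: float, mean: float, sd: float) -> float:
    """How many standard deviations value sits above (+) or below (-) the mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (value - mean) / sd


# Test scores from the text: mean 500, standard deviation 100.
for score in (600, 700, 450):
    print(score, z_score(score, 500, 100))

# Hypothetical heights in centimeters: mean 176, standard deviation 7.
z_test = z_score(700, 500, 100)
z_height = z_score(190, 176, 7)
print(z_test, z_height)  # different units, same z
```

A score of 700 and a height of 190 cm both come out at z = +2, so on their own scales they are equally unusual, provided both quantities are roughly normal.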

To standardize is to relabel the x-axis. The raw curve N(500, 100²) and the standard normal N(0, 1) have exactly the same shape. The transformation z = (x - μ)/σ slides and rescales the axis so the mean lands at 0 and one standard deviation becomes one unit. A value at x = 650 on the raw scale lands at z = 1.5 on the standard scale: same spot on the bell, same percentile, different number.
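The claim that relabeling the axis leaves the percentile unchanged can be checked directly. This sketch uses the x = 650 example from the text and NormalDist from the standard library.

```python
from statistics import NormalDist

raw = NormalDist(mu=500, sigma=100)  # the raw test-score scale
standard = NormalDist()              # mean 0, standard deviation 1

x = 650
z = (x - raw.mean) / raw.stdev       # the relabeled position

percentile_raw = raw.cdf(x)
percentile_standard = standard.cdf(z)

print(z, round(percentile_raw, 4), round(percentile_standard, 4))
```

Both calls give about 0.9332, so 650 on the raw scale and 1.5 on the standard scale mark the same point of the distribution.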

Why this matters when you use AI

The normal distribution and the z-score are woven through machine learning.

Standardizing features. The standardization from Phase 1, subtract the mean and divide by the standard deviation, is literally computing a z-score for each feature value. Many models train better when features are standardized this way, and the result is read as “how many standard deviations from typical.”
The default model of noise and initialization. Random error in measurements and many natural quantities are modeled as normal (“Gaussian”). The initial weights of a neural network are often drawn from a normal distribution. When a system needs a generic “random,” the normal is the usual choice.
Outlier detection. A simple, common rule flags values more than two or three standard deviations from the mean (a large z-score) as unusual, straight from the empirical rule.
Why it is everywhere. The deep reason is the next phase’s punchline: the averages and sums of many independent random things tend toward a normal distribution, almost regardless of what the individual things looked like. That is why measurement errors and aggregated quantities are so often bell-shaped, and it is what makes the normal the workhorse it is.
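Feature standardization and z-score outlier flagging can be sketched in a few lines of plain Python. The feature values below are made up, the cutoff of 2 is one of the thresholds mentioned above, and using the population standard deviation (pstdev) rather than the sample version is an assumption of this sketch.

```python
from statistics import mean, pstdev


def standardize(values: list[float]) -> list[float]:
    """Turn each value into a z-score using the list's own mean and spread."""
    center = mean(values)
    spread = pstdev(values)  # population standard deviation (an assumption here)
    if spread == 0:
        raise ValueError("cannot standardize a constant feature")
    return [(v - center) / spread for v in values]


# Hypothetical feature column with one suspicious value.
feature = [52.0, 48.0, 50.0, 51.0, 49.0, 50.0, 47.0, 53.0, 50.0, 90.0]
z_values = standardize(feature)

# Flag anything more than 2 standard deviations from the mean.
outliers = [v for v, z in zip(feature, z_values) if abs(z) > 2]
print(outliers)
```

After standardizing, the column has mean 0 and standard deviation 1. The value 90 is flagged at a cutoff of 2 but would narrowly escape a cutoff of 3, because the extreme value itself inflates the standard deviation it is measured against.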

A caution that matters: not everything is normal. The skewed distributions from the histogram lesson (incomes, response times) are not bell-shaped, and applying the 68-95-99.7 rule to them gives wrong answers. The normal is powerful and common, but it is a model you check, not an assumption you make blindly.
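To see the rule fail on skewed data, the sketch below draws simulated values from an exponential distribution, a standard example of a right-skewed shape. The data are simulated stand-ins, not real response times, and the seed is fixed only so the run is repeatable.

```python
import random
from statistics import mean, pstdev

rng = random.Random(0)  # fixed seed so the run is repeatable

# Simulated right-skewed data: exponential values with mean 1.
samples = [rng.expovariate(1.0) for _ in range(100_000)]

center = mean(samples)
spread = pstdev(samples)


def share_within(k: float) -> float:
    """Fraction of samples within k standard deviations of the sample mean."""
    lo, hi = center - k * spread, center + k * spread
    return sum(lo <= x <= hi for x in samples) / len(samples)


for k in (1, 2, 3):
    print(k, round(share_within(k), 3))
```

For this shape the share within one standard deviation should come out near 86%, not 68%, so a bell-curve reading of the same mean and standard deviation would be noticeably wrong.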

Common pitfalls

Assuming everything is normal. Skewed and bimodal data exist and are common; the empirical rule and z-score interpretations only hold for roughly normal distributions. Check the shape (a histogram) first.
Confusing the curve’s height with a probability. For a continuous distribution, probability is area over an interval, not the height at a point. Any single exact value has probability zero, however tall the curve is there.
Forgetting a z-score needs both the mean and the standard deviation. “Two points above the mean” means nothing until you know the standard deviation; z = (value - mean) / standard deviation requires both.
Reading 99.7% as “all.” The normal’s tails extend forever; values beyond three standard deviations are rare, not impossible, which matters when rare extreme events are the ones you care about.
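The last pitfall is easy to quantify. This sketch computes the two-sided tail area of a normal distribution beyond 3 and 4 standard deviations, and the count you would expect in a hypothetical batch of one million values.

```python
from statistics import NormalDist

standard = NormalDist()  # mean 0, standard deviation 1


def tail_beyond(k: float) -> float:
    """Probability of landing more than k standard deviations from the mean."""
    return 2 * (1 - standard.cdf(k))


n = 1_000_000  # hypothetical batch size
for k in (3, 4):
    print(k, round(n * tail_beyond(k)))
```

Out of a million normally distributed values, roughly 2,700 are expected to fall beyond three standard deviations and roughly 63 beyond four: rare for any one value, routine at scale.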

What you should remember

For a continuous distribution, probability is area under a density curve; the total area is 1 and a range’s probability is its area.
The normal distribution is the symmetric bell, pinned down by its mean (center) and standard deviation (width); change them to slide and scale the same shape.
The 68-95-99.7 rule: about 68%, 95%, and 99.7% of values fall within 1, 2, and 3 standard deviations of the mean. It makes the normal usable at a glance.
The z-score z = (value - mean) / standard deviation is how many standard deviations a value is from the mean, the same standardization as before, and it makes values comparable across different scales.
In AI the normal underlies feature standardization (z-scores), the default model of noise and weight initialization, and outlier detection; but not all data is normal, so check the shape before trusting the bell.