
From sample to population: sampling and the central limit theorem

The opening lesson made a quiet promise: the reason AI reasons under uncertainty is that it learns from a sample, never the whole world. This final phase cashes that promise in. Every number you measure on a sample (a model’s accuracy on a test set, the average of a survey, a conversion rate) is an estimate. Measure it on a different sample and you would get a slightly different answer. The questions of this phase are: how much does an estimate vary, and what can it tell you about the truth behind it? The single idea that makes those questions answerable is the central limit theorem, and this lesson builds up to it.

Four words, two pairs, and the whole phase rests on keeping them straight.

  • The population is everything you care about: all users, all possible inputs a model could see, every voter.
  • A sample is the subset you actually measure: the 1,000 users you surveyed, the 5,000 examples in your test set.
  • A parameter is a true number about the population: the true average, the true accuracy. Usually unknown.
  • A statistic is the matching number computed from your sample: the sample average, the measured accuracy. Known, but only an estimate of the parameter.

The entire game of inference is using a statistic (what you can measure) to say something about a parameter (what you actually want to know). The catch is that the statistic is not the parameter; it is an estimate that comes with uncertainty.
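The four terms can be made concrete with a small simulation. The sketch below is illustrative only: the population is a hypothetical list of 100,000 made-up values (mean near 50, standard deviation near 20), and the names `population`, `parameter`, `sample`, and `statistic` are chosen here to mirror the vocabulary above. In real work you would never have the full population in hand, so the parameter would stay unknown.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: 100,000 made-up values (for example, session
# lengths in minutes). In practice you never get to see all of this.
population = [rng.gauss(50, 20) for _ in range(100_000)]

# Parameter: a true number about the whole population (normally unknown).
parameter = statistics.fmean(population)

# Sample: the subset you actually measure.
sample = rng.sample(population, 1_000)

# Statistic: the matching number computed from the sample.
statistic = statistics.fmean(sample)

print(f"parameter (population mean): {parameter:.2f}")
print(f"statistic (sample mean):     {statistic:.2f}")
print(f"estimation error:            {statistic - parameter:+.2f}")
```

The statistic lands near the parameter but not on it, which is the gap that the rest of the lesson quantifies.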

Sampling variability and the sampling distribution


Here is the key realization: a sample statistic is itself a random variable. Draw a different random sample and the sample mean comes out a little different, just by the luck of who landed in the sample. This sample-to-sample wobble is sampling variability.

If you imagine taking every possible sample of a given size and computing the statistic for each, the distribution of all those values is the sampling distribution of the statistic. It has two features that matter enormously:

  • Its center sits at the true parameter, at least for the sample mean. On average, the sample mean equals the population mean; the estimate is not systematically too high or too low (it is unbiased).
  • Its spread measures how much the estimate bounces around. That spread is called the standard error, and it is the heart of inference: a small standard error means your estimate is precise, a large one means it is shaky.
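Taking every possible sample is not feasible, but taking many random samples approximates the same picture. The following is a minimal sketch under stated assumptions: a normal population with mean 50 and standard deviation 20 (illustrative values), samples of size 100, and 5,000 repetitions. The helper `sample_mean` is a name introduced here for illustration.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable

MU, SIGMA = 50.0, 20.0  # assumed population mean and standard deviation
N = 100                 # size of each sample
REPS = 5_000            # how many samples to draw

def sample_mean(n):
    """Draw one random sample of size n and return its mean."""
    return statistics.fmean([rng.gauss(MU, SIGMA) for _ in range(n)])

# Each entry is the statistic from one sample; together they approximate
# the sampling distribution of the sample mean.
means = [sample_mean(N) for _ in range(REPS)]

center = statistics.fmean(means)   # should sit close to MU (unbiased)
spread = statistics.stdev(means)   # the standard error, measured empirically

print(f"center of the sampling distribution: {center:.2f} (population mean {MU})")
print(f"spread of the sampling distribution: {spread:.2f}")
```

The spread printed here is far smaller than the spread of the individual data points (20), because each value in `means` is already an average of 100 observations.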

The standard error and the square-root law


For a sample mean, the standard error has a clean formula. If the population has standard deviation sigma and your sample has size n, then

standard error of the mean = sigma / square root of n

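In words: the standard error is the population standard deviation divided by the square root of the sample size. (In practice sigma is rarely known, so the sample standard deviation stands in for it.) Two consequences fall out of that, both important, and a quick calculation shows them: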

Population standard deviation sigma = 20, sample size n = 100:
standard error = 20 / sqrt(100) = 20 / 10 = 2
Quadruple the sample to n = 400:
standard error = 20 / sqrt(400) = 20 / 20 = 1 (halved)

First, bigger samples give more precise estimates: the standard error shrinks as n grows. Second, it shrinks with the square root of n, not n itself, so to halve your error you must quadruple your sample. This square-root law is why the first bit of data helps a lot and the millionth data point barely moves the needle. It is the mathematics behind “more data helps, but with diminishing returns.”
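The square-root law is easy to tabulate. The sketch below reuses sigma = 20 from the worked numbers above; the function names `standard_error` and `sample_size_for` are introduced here for illustration. The second function simply inverts the formula to n = (sigma / target)^2.

```python
import math

def standard_error(sigma, n):
    """Standard error of the mean: sigma divided by the square root of n."""
    return sigma / math.sqrt(n)

def sample_size_for(sigma, target_se):
    """Smallest n whose standard error is at or below target_se."""
    return math.ceil((sigma / target_se) ** 2)

sigma = 20.0
for n in (100, 400, 1_600, 10_000, 1_000_000):
    print(f"n = {n:>9,}  standard error = {standard_error(sigma, n):.3f}")

# Halving the target error quadruples the required sample each time.
for target in (2.0, 1.0, 0.5):
    print(f"target SE = {target:.1f}  needs n = {sample_size_for(sigma, target):,}")
```

Reading down the second loop, each halving of the target error multiplies the required sample size by four, which is the diminishing-returns pattern described above.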

[Figure: three sampling distributions of the sample mean (density of x̄), in side-by-side panels for n = 5, n = 30, and n = 100, each centered on the true mean μ = 5 marked by a vertical dashed line.]
The sampling distribution of the sample mean narrows as the sample size grows. From a population with μ = 5 and σ = 4, samples of size 5 yield x̄ values scattered with SE ≈ 1.79; at n = 30 the scatter shrinks to 0.73; at n = 100 it tightens to 0.40. All three bells center on the same true mean. Larger samples produce a closer estimate, and the bell shape emerges regardless of the original population shape.

Now the payoff, and the reason the normal distribution kept reappearing. The central limit theorem (CLT) says:

For a large enough sample of independent observations, the sampling distribution of the mean is approximately normal, no matter what shape the original population has (as long as its variance is finite).

Read that again, because it is close to magical. The population can be wildly skewed (incomes), lumpy, or bimodal; take samples and average them, and the averages pile up into a bell curve centered on the true mean with standard error sigma over root n. This is the deep answer to the question left open in the normal-distribution lesson (why is the bell everywhere?): so many quantities in the world are sums or averages of many small independent pieces, and the CLT pulls all of those toward the normal.
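The theorem can be watched at work on a deliberately skewed population. The sketch below assumes an exponential population with rate 1 (mean 1, standard deviation 1, strongly right-skewed), chosen here purely for illustration. It tracks the skewness of the sample means as n grows: a value near 0 indicates a symmetric, bell-like shape. The helpers `skewness` and `sample_means` are names introduced for this example.

```python
import random
import statistics

rng = random.Random(2)  # fixed seed so the run is repeatable

def skewness(xs):
    """Average cubed z-score: about 0 for a symmetric distribution."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return statistics.fmean([((x - m) / s) ** 3 for x in xs])

def sample_means(n, reps=4_000):
    """Means of `reps` samples of size n from an exponential(1) population."""
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(reps)
    ]

results = {}
for n in (1, 5, 30, 100):
    means = sample_means(n)
    results[n] = (
        statistics.fmean(means),   # center: should stay near 1
        statistics.stdev(means),   # spread: should track 1 / sqrt(n)
        skewness(means),           # shape: should drift toward 0
    )
    center, spread, skew = results[n]
    print(f"n = {n:>3}  center = {center:.3f}  spread = {spread:.3f}  skewness = {skew:.2f}")
```

At n = 1 the "means" are just raw draws, so the skewness reflects the population itself. As n grows the center stays put, the spread shrinks, and the skewness fades, which is the CLT in miniature.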

[Figure: standard error as a function of sample size, SE = σ/√n with σ = 4. Horizontal axis: n from 1 to 200; vertical axis: SE from 0 to about 4. Checkpoints are marked at n = 5, 30, and 100. Legend: doubling n multiplies SE by 1/√2 ≈ 0.71; quadrupling n multiplies it by 1/2.]
Standard error falls as one over the square root of n, so the curve drops steeply at first and then levels off. To halve your uncertainty, you need to quadruple your sample. The three amber checkpoints (n = 5, 30, 100) match the three bells in the companion CLT picture; that is where 1.79, 0.73, and 0.40 come from.

The CLT is what makes the rest of this phase work. Because sample means are approximately normal, you can use the 68-95-99.7 rule and z-scores on them, which is exactly what confidence intervals (next lesson) and hypothesis tests (the one after) do. Without the CLT, inference would need a different theory for every oddly-shaped population; with it, the normal handles them all.
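As a rough check that the 68-95-99.7 rule carries over to sample means, the sketch below converts simulated sample means to z-scores using the standard error and counts how many fall within 1, 2, and 3 standard errors of the true mean. The assumptions are illustrative: an exponential population with rate 1 (mean 1, standard deviation 1), samples of size 50, and 5,000 repetitions. `share_within` is a helper name introduced here.

```python
import math
import random
import statistics

rng = random.Random(3)  # fixed seed so the run is repeatable

MU, SIGMA = 1.0, 1.0    # mean and SD of an exponential population with rate 1
N = 50                  # size of each sample
REPS = 5_000            # number of samples
se = SIGMA / math.sqrt(N)

# z-score of each sample mean: distance from the true mean in standard errors.
z_scores = []
for _ in range(REPS):
    sample = [rng.expovariate(1.0) for _ in range(N)]
    z_scores.append((statistics.fmean(sample) - MU) / se)

def share_within(k):
    """Fraction of sample means within k standard errors of the true mean."""
    return sum(abs(z) <= k for z in z_scores) / REPS

coverage = {k: share_within(k) for k in (1, 2, 3)}
for k, nominal in zip((1, 2, 3), (0.68, 0.95, 0.997)):
    print(f"within {k} SE: {coverage[k]:.3f}  (normal rule: {nominal})")
```

The shares should come out close to the normal rule even though the underlying population is far from normal. With a smaller n or a more skewed population, the agreement would be looser.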

This is not abstract; it is the foundation under how AI is evaluated.

  • A test-set metric is a sample estimate. A model’s accuracy on a test set is a statistic, an estimate of its true accuracy on all future data (the parameter). Measure it on a different test set and you would get a slightly different number. The accuracy you report has a standard error, and a smaller test set means a shakier number.
  • Why more data tightens an estimate. The square-root law explains why a bigger test set gives a more trustworthy accuracy and why each further doubling buys a smaller absolute improvement than the one before. It is also why tiny test sets produce accuracy numbers you should not over-trust.
  • It is what lets you compare models. Because the difference between two sample metrics is itself approximately normal (thanks to the CLT), you can ask whether model B is really better than model A or just luckier on this test set, which is the hypothesis test two lessons from now, and the heart of A/B testing.

When you see “95% accurate on the test set,” the honest reading is “our best estimate is 95%, give or take a standard error.” This lesson is why that give-or-take exists and how big it is.
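To put a size on that give-or-take, accuracy can be treated as the mean of a 0/1 "was this prediction correct" indicator. For such an indicator with success rate p, the standard deviation is sqrt(p(1 - p)), so sigma over root n becomes sqrt(p(1 - p) / n). The sketch below applies that formula; it assumes the test examples are independent draws and plugs the measured accuracy in for the unknown true p, which is an approximation. The function name `accuracy_standard_error` is introduced here for illustration.

```python
import math

def accuracy_standard_error(accuracy, n):
    """Approximate standard error of an accuracy measured on n test examples.

    Uses the measured accuracy in place of the unknown true accuracy and
    assumes the n examples are independent.
    """
    return math.sqrt(accuracy * (1.0 - accuracy) / n)

measured = 0.95
for n in (100, 1_000, 5_000, 20_000):
    se = accuracy_standard_error(measured, n)
    print(f"test set of {n:>6,}: accuracy = {measured:.1%}, give or take {se:.2%}")
```

On a test set of 100 examples the give-or-take is a couple of percentage points, large enough to swamp many model-to-model differences; on 5,000 examples it is a fraction of a point.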

  • Confusing the standard deviation of the data with the standard error of the mean. The data has spread sigma; the sample mean has spread sigma over root n, which is smaller. Averaging cancels noise, so the estimate is tighter than the raw data.
  • Thinking a bigger sample makes the data less spread out. It does not change the spread of the data; it shrinks the spread of the estimate. More data makes you more sure of the average, not the individuals more alike.
  • Treating the sample statistic as the exact truth. It is an estimate with a standard error. “95% accurate” is a point estimate, not the true accuracy to the decimal.
  • Forgetting the CLT needs a large enough sample. For very small samples drawn from very skewed populations, the sample mean may not be close to normal yet. The CLT is a large-sample result.
  • A statistic (what you measure on a sample) estimates a parameter (the true population value you want); the statistic is not the parameter, it is an estimate with uncertainty.
  • A statistic is a random variable: different samples give different values. The distribution of those values is the sampling distribution, centered on the true parameter.
  • The standard error of the mean is sigma over root n: bigger samples give tighter estimates, but only with the square root of n, so halving the error takes four times the data.
  • The central limit theorem: for large samples, the sample mean is approximately normal regardless of the population’s shape. This is why the bell is everywhere and what makes the rest of inference possible.
  • In AI, a test-set metric is a sample estimate with a standard error; more data tightens it (with diminishing returns), and the CLT is what lets you put error bars on metrics and compare models.
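The difference between the spread of the data and the spread of the estimate can be checked directly. The sketch below assumes a normal population with mean 50 and standard deviation 20 (illustrative values) and draws one sample at each of three sizes. The dictionary `rows` is a name introduced for this example.

```python
import random
import statistics

rng = random.Random(4)  # fixed seed so the run is repeatable

MU, SIGMA = 50.0, 20.0  # assumed population mean and standard deviation

rows = {}
for n in (25, 400, 6_400):
    sample = [rng.gauss(MU, SIGMA) for _ in range(n)]
    sd = statistics.stdev(sample)   # spread of the individual data points
    se = sd / n ** 0.5              # estimated spread of the sample mean
    rows[n] = (sd, se)
    print(f"n = {n:>5,}  SD of data = {sd:6.2f}  SE of mean = {se:5.2f}")
```

The standard deviation of the data hovers around 20 at every sample size, while the standard error of the mean keeps shrinking: more data sharpens the estimate without making the individuals more alike.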