Summary: From sample to population: sampling and the central limit theorem

Every number measured on a sample is an estimate that would come out differently on another sample, and the central limit theorem is what lets you reason about that uncertainty. AI learns from a sample, never the whole world, so a model’s measured accuracy is an estimate of its true accuracy, with wobble. This lesson is the bridge into inference: what varies, by how much, and why the normal distribution comes to the rescue. This summary is the scan-in-five-minutes version of the full lesson.

Core ideas

Four words, two pairs. The population is everything you care about; a sample is the subset you measure. A parameter is a true (unknown) population number; a statistic is its sample estimate. Inference uses the statistic to learn about the parameter.
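Statistics vary. A sample statistic is a random variable: another sample gives another value (sampling variability). The distribution of the statistic over all possible samples is its sampling distribution; for an unbiased statistic such as the sample mean, it is centered on the true parameter.
The standard error. The standard deviation of the sampling distribution, which measures how much the statistic varies from sample to sample. For a sample mean it is sigma over root n, where sigma is the population standard deviation and n is the sample size. A small standard error means a precise estimate.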
The square-root law. The standard error shrinks with the square root of n, so halving the error takes four times the data. (sigma = 20: n = 100 gives SE 2; n = 400 gives SE 1.) More data helps, with diminishing returns.
The central limit theorem. For large samples of independent observations, the sample mean is approximately normal regardless of the population’s shape, provided the population has finite variance. This is the deep reason the bell curve shows up so often (many quantities are sums or averages of small independent pieces), and it is what lets z-scores and the empirical rule apply to estimates.
In AI. A test-set metric is a sample estimate with a standard error, and more data tightens it (with diminishing returns). Because the CLT makes sample metrics, and differences between them, approximately normal, you can put error bars on a metric and compare two models.
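The standard error formula, the square-root law, and the central limit theorem can all be checked with a short simulation. The sketch below is illustrative only: the function names `theoretical_se` and `simulate_sample_means` are hypothetical, and the exponential population, the seed, and the trial count are assumptions chosen to give a clearly skewed starting distribution.

```python
import math
import random
import statistics


def theoretical_se(sigma: float, n: int) -> float:
    """Standard error of a sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)


def simulate_sample_means(n: int, trials: int, rng: random.Random) -> list:
    """Return the means of `trials` independent samples of size n.

    Assumed population (illustration only): exponential with rate 1,
    which is strongly skewed and has mean 1 and standard deviation 1.
    """
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(trials)
    ]


if __name__ == "__main__":
    # The summary's own numbers, with sigma = 20.
    print(theoretical_se(20, 100))  # 2.0
    print(theoretical_se(20, 400))  # 1.0

    rng = random.Random(0)  # fixed seed so runs are repeatable
    for n in (25, 100, 400):
        means = simulate_sample_means(n, trials=2000, rng=rng)
        observed = statistics.stdev(means)   # spread of the sample means
        predicted = theoretical_se(1.0, n)   # sigma / sqrt(n) with sigma = 1
        # Share of sample means within one standard error of the true mean.
        within = sum(abs(m - 1.0) <= predicted for m in means) / len(means)
        print(n, round(observed, 4), round(predicted, 4), round(within, 3))
```

If the theory holds, the observed spread should track sigma over root n at each sample size, and roughly two thirds of the sample means should land within one standard error of the true mean, as the empirical rule predicts for a normal distribution, even though the population itself is far from bell-shaped.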

What changes for you

You stop reading a measured number as the exact truth and start reading it as an estimate with a give-or-take. “95% accurate on the test set” becomes “our best estimate is about 95%, with an error bar set by how big the test set was.” You understand why more data helps and why it helps less and less (the square-root law), so you can judge when a bigger sample is worth it. And you have the one theorem that powers the rest of the phase: because sample means go normal, the confidence intervals and hypothesis tests coming next rest on a foundation that holds for almost any underlying distribution, given a large enough sample.
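To make the “95% accurate” reading concrete, here is a minimal sketch of the give-or-take for an accuracy figure. It assumes the test examples are independent draws, plugs the measured accuracy in for the unknown true accuracy, and uses a rough plus or minus two standard errors in the spirit of the empirical rule. The function names and the test-set sizes are hypothetical.

```python
import math


def accuracy_standard_error(accuracy: float, n: int) -> float:
    """Standard error of an accuracy measured on n independent test examples.

    Each example is right (1) or wrong (0), so sigma = sqrt(p * (1 - p))
    and the standard error is sigma / sqrt(n). The measured accuracy is
    used in place of the unknown true p (an approximation).
    """
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


def give_or_take(accuracy: float, n: int, num_se: float = 2.0):
    """Rough range: the estimate plus or minus num_se standard errors."""
    margin = num_se * accuracy_standard_error(accuracy, n)
    return accuracy - margin, accuracy + margin


if __name__ == "__main__":
    for n in (100, 1_000, 10_000):  # hypothetical test-set sizes
        low, high = give_or_take(0.95, n)
        print(f"n={n}: about {low:.3f} to {high:.3f}")
```

Each tenfold increase in test-set size narrows the range by a factor of about 3.2 (the square root of 10), which is the square-root law at work. The normal approximation is rough when the test set is small and the accuracy sits close to 0 or 1, so the n = 100 range should be read loosely.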