Cheatsheet: From sample to population: sampling and the central limit theorem

A statistic measured on a sample estimates a population parameter, but its value varies from sample to sample. For large samples, the central limit theorem makes the sample mean approximately normal, which is what most standard inference relies on.

Population = everything you care about. Sample = the subset you measure.
Parameter = true population number (unknown). Statistic = sample estimate of it (known).
Inference = use the statistic to say something about the parameter.
A statistic is a RANDOM VARIABLE (different sample -> different value = sampling variability).
Sampling distribution = the distribution of the statistic over all possible samples.
center = the true parameter (when the statistic is unbiased, as the sample mean is)
spread = the STANDARD ERROR
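A sampling distribution can be approximated by simulation. The sketch below is illustrative only: the uniform population, the sample size of 50, and the helper name `simulate_sample_means` are all assumptions chosen for the demo, not part of the cheatsheet.

```python
import random
import statistics

def simulate_sample_means(population, n, num_samples, seed=0):
    """Draw many samples of size n (with replacement) and return each sample's mean."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.choices(population, k=n)) for _ in range(num_samples)]

# Hypothetical population: 10,000 values drawn uniformly from [0, 100].
pop_rng = random.Random(42)
population = [pop_rng.uniform(0, 100) for _ in range(10_000)]

# Each entry is one value of the statistic; together they approximate
# the sampling distribution of the mean for n = 50.
means = simulate_sample_means(population, n=50, num_samples=2_000)

true_mean = statistics.fmean(population)  # the parameter (known here only because we built the population)
center = statistics.fmean(means)          # center of the sampling distribution
spread = statistics.stdev(means)          # empirical standard error

print(f"parameter={true_mean:.2f}  center={center:.2f}  spread={spread:.2f}")
```

The center should land close to the parameter, and the spread should be much smaller than the spread of the population itself.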
Standard error of the mean = sigma / sqrt(n)   (sigma = population standard deviation, n = number of independent observations)
sigma = 20, n = 100 -> SE = 20/10 = 2
sigma = 20, n = 400 -> SE = 20/20 = 1 (4x data -> half the error)
SE shrinks like 1/sqrt(n), not 1/n. Halve the error -> QUADRUPLE the sample.
The first data points help a lot; the millionth point barely moves the estimate (diminishing returns).
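The arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the function names `standard_error` and `n_for_target_se` are made up for illustration, and the formula assumes independent observations.

```python
import math

def standard_error(sigma, n):
    """Standard error of the mean: sigma / sqrt(n), for n independent observations."""
    if n <= 0:
        raise ValueError("n must be positive")
    return sigma / math.sqrt(n)

def n_for_target_se(sigma, target_se):
    """Smallest sample size whose standard error is at most target_se."""
    return math.ceil((sigma / target_se) ** 2)

# Quadrupling n halves the standard error each time.
for n in (100, 400, 1_600):
    print(n, standard_error(20, n))

# Going from SE = 2 to SE = 1 takes 4x the data; SE = 0.5 takes 16x.
print(n_for_target_se(20, 2), n_for_target_se(20, 1), n_for_target_se(20, 0.5))
```

Inverting the formula gives n = (sigma / SE)^2, which is why each halving of the error costs four times as much data.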
For a large enough sample of independent observations, the sampling distribution of the MEAN is approximately NORMAL,
whatever the population's shape (even skewed/bimodal), as long as its variance is finite.
=> Why the bell is everywhere; why z-scores and 68-95-99.7 apply to estimates;
the foundation for confidence intervals and hypothesis tests.
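One way to see the central limit theorem at work is to average draws from a strongly skewed population and watch the skew fade as n grows. The sketch below assumes an Exponential(1) population (mean 1, standard deviation 1) purely as an example; the helper names are hypothetical.

```python
import math
import random
import statistics

def sample_means(n, num_samples, rng):
    """Means of num_samples samples, each of size n, from an Exponential(1) population."""
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(num_samples)
    ]

def skewness(values):
    """Standardized third moment; near 0 for a symmetric, bell-shaped distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])

rng = random.Random(0)
results = {}
for n in (1, 10, 100):
    means = sample_means(n, 5_000, rng)
    se = 1.0 / math.sqrt(n)  # sigma / sqrt(n) with sigma = 1
    # Share of sample means within one standard error of the true mean (1.0).
    within_one_se = sum(abs(m - 1.0) <= se for m in means) / len(means)
    results[n] = (skewness(means), within_one_se)
    print(f"n={n:>3}  skewness={results[n][0]:.2f}  within 1 SE={within_one_se:.3f}")
```

As n grows, the skewness should move toward 0 and the share within one standard error should approach the 68% that the normal rule predicts.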
  • A test-set metric (accuracy, etc.) is a sample estimate with a standard error.
  • More test data -> smaller SE -> tighter estimate (diminishing returns).
  • Differences of sample metrics are ~normal -> lets you compare models / run A/B tests.
  • Pitfall: confusing the data’s spread (sigma) with the estimate’s spread (SE = sigma/sqrt(n), which is smaller).
  • Pitfall: thinking more data makes the DATA less spread out (it only tightens the ESTIMATE).
  • Pitfall: treating the statistic as the exact parameter (it is an estimate with error).
  • Pitfall: forgetting the CLT is a large-sample result (with small n and a very skewed population, the mean may not be normal yet).
  • Parameter: a true population value (usually unknown).
  • Statistic: a sample value estimating a parameter.
  • Sampling distribution: the distribution of a statistic across samples.
  • Standard error: the standard deviation of the sampling distribution; sigma/sqrt(n) for the mean.
  • Central limit theorem: sample means are approximately normal for large n, for any population shape with finite variance.
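The points above about test-set metrics and model comparison can be sketched as follows. Accuracy is the mean of 0/1 correctness indicators, so sigma = sqrt(p(1 - p)) and the standard error is sqrt(p(1 - p)/n). The accuracies and test-set sizes are invented for illustration, and the comparison assumes the two models were scored on independent test sets; if both are scored on the same examples, a paired comparison is usually more appropriate.

```python
import math
from statistics import NormalDist

def accuracy_se(accuracy, n):
    """Standard error of an accuracy measured on n independent test examples."""
    return math.sqrt(accuracy * (1 - accuracy) / n)

def compare_accuracies(acc_a, n_a, acc_b, n_b):
    """z-score and two-sided p-value for acc_a - acc_b (independent test sets assumed)."""
    se_diff = math.sqrt(accuracy_se(acc_a, n_a) ** 2 + accuracy_se(acc_b, n_b) ** 2)
    z = (acc_a - acc_b) / se_diff
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical numbers: model A scores 0.90 and model B scores 0.88, each on 1,000 examples.
se_a = accuracy_se(0.90, 1_000)
low, high = 0.90 - 1.96 * se_a, 0.90 + 1.96 * se_a  # approximate 95% interval for A
z, p_value = compare_accuracies(0.90, 1_000, 0.88, 1_000)

print(f"SE(A)={se_a:.4f}  95% interval=({low:.3f}, {high:.3f})")
print(f"z={z:.2f}  p={p_value:.3f}")
```

With these made-up numbers a two-point gap is well within what sampling variability alone could produce, which is the practical reason to report a standard error alongside a test-set metric.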