Cheatsheet: Statistics in machine learning

Every tool in this track has a place in the ML workflow, and the biggest payoff is recognizing that model evaluation is statistical inference. The through-line: statistics is the discipline of not fooling yourself about uncertainty.

Stage | Statistical tool | Lesson
Understand the data | center/spread, shape/skew, correlation, class balance | 2, 3, 4
Read model outputs | conditional probability P(label given inputs); correlation not causation | 6, 4
Train the model | expected value (loss to minimize, reward to maximize); normal noise | 8, 9
Evaluate the model | sampling, standard error, confidence interval, hypothesis test | 11, 12, 13
Read the result honestly | base rates / Bayes, significance vs importance, no causation from correlation | 1, 7, 13, 4
Test set = a SAMPLE. Metric = a STATISTIC estimating the true value (with a standard error).
Report a CONFIDENCE INTERVAL, not a bare number.
"Is B better than A?" = a HYPOTHESIS TEST (and an A/B test is the same machinery).
The train/test split is, at bottom, a sampling problem.
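
To make "metric = statistic with a standard error" concrete, here is a minimal sketch that turns a test-set accuracy into an interval. It assumes the simple normal-approximation (Wald) interval and a 95% level (z = 1.96); the function name accuracy_interval and the test-set sizes are illustrative choices, not something this track prescribes. For small test sets or accuracies near 0 or 1, other intervals (for example Wilson) tend to behave better.

```python
import math

def accuracy_interval(n_correct, n_total, z=1.96):
    """Accuracy, its standard error, and a normal-approximation interval.

    Assumption: test examples are an independent random sample, so the
    number of correct predictions is treated as binomial.
    """
    p_hat = n_correct / n_total                       # the statistic
    se = math.sqrt(p_hat * (1 - p_hat) / n_total)     # its standard error
    low = max(0.0, p_hat - z * se)
    high = min(1.0, p_hat + z * se)
    return p_hat, se, (low, high)

# The same 94% accuracy reported from test sets of different (hypothetical) sizes.
for n in (100, 1000, 10000):
    p_hat, se, (low, high) = accuracy_interval(round(0.94 * n), n)
    print(f"n={n:>5}: accuracy={p_hat:.3f}  SE={se:.4f}  95% CI=({low:.3f}, {high:.3f})")
```

The point estimate is identical in all three cases; only the width of the interval changes, which is why a bare "94%" says little without the test-set size.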
"94% accurate, significantly better than 92%":
1. On how big a test set? What's the confidence interval? (L11, L12)
2. Is the gap significant at that sample size? (L13)
3. Is 94% good given the base rate / class balance? (L1, L7)
4. Is the improvement meaningful (effect size), not just significant? (L13)
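
Questions 1, 2, and 4 can be sketched as a two-proportion z-test. This is one common choice, not the only one: it assumes the two models were scored on independent test samples of the sizes shown, and when both models are scored on the same test set a paired test such as McNemar's is the usual alternative. The function name and the sample sizes below are hypothetical.

```python
import math

def two_proportion_z_test(correct_a, correct_b, n_a, n_b):
    """Two-sided z-test for a difference in accuracy between models A and B.

    Assumption: A and B were evaluated on independent random samples.
    Returns (gap, z, p_value), where gap = accuracy_B - accuracy_A.
    """
    p_a = correct_a / n_a
    p_b = correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)    # accuracy under "no difference"
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))        # two-sided normal tail area
    return p_b - p_a, z, p_value

# Same 92% vs 94% gap at two hypothetical test-set sizes per model.
for n in (500, 5000):
    gap, z, p = two_proportion_z_test(round(0.92 * n), round(0.94 * n), n, n)
    print(f"n={n} per model: gap={gap:.3f}  z={z:.2f}  p={p:.4f}")
```

The gap (the effect size, 2 percentage points) is the same in both runs; only the evidence for it changes with sample size. Whether 2 points matters in practice is a separate judgment the p-value cannot make.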
THIS track (statistical thinking): data summaries, outputs as probabilities,
expected-value objectives, EVALUATION AS INFERENCE (CI, hypothesis test, base rates).
CLASSICAL ML track (model-scoring toolkit): confusion matrix, precision/recall,
ROC/AUC, bias-variance tradeoff. Builds on this track; taught there, not here.
  • Reading a metric as exact (it is an estimate with an interval).
  • Confusing significant with meaningful (check effect size).
  • Ignoring the base rate (high accuracy can be worthless on rare targets).
  • Reading a model’s correlations as causes.
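
The base-rate pitfall is easy to check with arithmetic. The sketch below uses made-up numbers (a 1% base rate, and a model that is right 94% of the time on both positives and negatives); the function names are illustrative. It shows two things: how accurate a model that always predicts the common class would be, and, via Bayes' rule, how often a positive flag is actually correct.

```python
def majority_baseline_accuracy(base_rate):
    """Accuracy of always predicting the more common class."""
    return max(base_rate, 1 - base_rate)

def prob_positive_given_flag(base_rate, sensitivity, specificity):
    """Bayes' rule: P(actually positive | model flags positive)."""
    true_flags = sensitivity * base_rate                # positives correctly flagged
    false_flags = (1 - specificity) * (1 - base_rate)   # negatives wrongly flagged
    return true_flags / (true_flags + false_flags)

base_rate = 0.01  # hypothetical: 1 in 100 cases is a true positive

print(f"Always-negative baseline accuracy: {majority_baseline_accuracy(base_rate):.2%}")
print(f"P(positive | flagged): {prob_positive_given_flag(base_rate, 0.94, 0.94):.2%}")
```

Under these assumptions a model that does nothing beats 94% accuracy, and most positive flags are false alarms, which is what "high accuracy can be worthless on rare targets" means in practice.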
Statistics = the discipline of NOT FOOLING YOURSELF about uncertainty.
AI automates inference at scale -> this discipline is how you tell a system that
works from one that only looks like it does.