Summary: Testing a claim: hypothesis testing and p-values

Hypothesis testing decides whether an observed effect is real or noise, and the p-value is its central, most-abused number. Confidence intervals hinted that a difference might be within the noise; this lesson makes the call formal. The mechanics are simple; the interpretation is where almost everyone goes wrong, so most of the lesson guards against three specific misreadings. This summary is the scan-in-five-minutes version of the full lesson.

  • Two hypotheses. The null (H0) is the skeptical default: no effect, no difference (fair coin, no improvement). The alternative (H1) is the claim of a real effect. The test starts by assuming the null, much as a defendant is presumed innocent until proven guilty.
  • The logic. Assume the null, then ask: how likely is data at least this extreme? Data that would be very unlikely under the null counts as evidence against it. For 100 flips of a fair coin the expected count is 50 heads with a standard deviation of 5, so 63 heads sits 2.6 standard deviations from the expectation, which is surprising for a fair coin.
  • The p-value. P(data at least this extreme, given the null is true). Small p means surprising under the null. Below the threshold alpha (often 0.05) you reject the null and call it statistically significant. The coin’s p is about 0.01, so you reject “fair.”
  • What it is NOT (the three errors). (1) p is not the probability that the null is true; reading it that way flips the conditional (the Bayes-lesson trap). (2) Statistically significant does not mean large or important; check the effect size. (3) Failing to reject is not proof of the null; absence of evidence is not evidence of absence.
  • The multiple-testing trap. Run many tests and some will look significant by chance alone (about 1 in 20 at alpha 0.05, even when no effect exists anywhere). Reporting only the winner of many tries manufactures false positives.
  • In AI. This is A/B testing and benchmark comparison: is the new model’s lift real or noise? Sample size decides (a 65%-vs-60% gap is noise on 200 queries, significant on 2000). Ask: significant at what threshold, how big is the effect, and how many things were tried?
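
The two worked numbers in the list above (the 63-of-100 coin and the 65-vs-60 model gap) can be reproduced with a few lines of standard-library Python. This is a minimal sketch, not the lesson's own code. It assumes the normal approximation to the binomial, a two-sided test, and, for the model comparison, that 65 and 60 are accuracy percentages measured on separate, equally sized sets of independent queries. The function names are illustrative.

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a z-score under the standard normal."""
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)).
    return math.erfc(abs(z) / math.sqrt(2))


def coin_z_test(heads, flips, p_null=0.5):
    """z-score and p-value for a head count, assuming the null p_null."""
    expected = flips * p_null
    sd = math.sqrt(flips * p_null * (1 - p_null))
    z = (heads - expected) / sd
    return z, two_sided_p(z)


def two_proportion_z_test(rate_a, rate_b, n_per_arm):
    """Pooled z-test for two success rates (assumes equal arm sizes)."""
    pooled = (rate_a + rate_b) / 2
    se = math.sqrt(pooled * (1 - pooled) * 2 / n_per_arm)
    z = (rate_a - rate_b) / se
    return z, two_sided_p(z)


# Coin: 63 heads in 100 flips.
z_coin, p_coin = coin_z_test(63, 100)
print(f"coin: z = {z_coin:.2f}, p = {p_coin:.4f}")

# Same 5-point gap, two different sample sizes.
for n in (200, 2000):
    z, p = two_proportion_z_test(0.65, 0.60, n)
    verdict = "reject null" if p < 0.05 else "fail to reject"
    print(f"n = {n}: z = {z:.2f}, p = {p:.4f} -> {verdict}")
```

Under these assumptions the coin's p-value lands just under 0.01, and the identical 5-point gap moves from well above 0.05 to well below it as the query count grows from 200 to 2000. When both models are scored on the same queries, a paired test would be the more appropriate choice; the unpaired version here is only meant to show the effect of sample size.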

You gain the tool for the question that keeps coming up, “is this difference real?”, and, more valuable, the discipline to read “significant” correctly. When you hear “our new model is significantly better,” you now ask three things automatically: significant at what threshold, how large is the actual effect, and how many variants were tried before this one won? You refuse the seductive misreading that a small p-value is the probability the result is a fluke (it is the flipped conditional from the Bayes lessons), and you remember that a significant result can be trivially small and a non-significant one can hide a real effect the test was too small to see. That skepticism is exactly what the opening lesson called the discipline of not fooling yourself.
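
The "how many variants were tried" question can be made concrete with a short calculation. The sketch below assumes the tests are independent and that every null is true (no variant has any real effect); under those assumptions the chance that at least one test comes out significant is 1 - (1 - alpha)^k for k tests. The function name is illustrative.

```python
def chance_of_false_win(num_tests, alpha=0.05):
    """Probability that at least one of num_tests independent tests is
    significant at level alpha when every null hypothesis is true."""
    return 1 - (1 - alpha) ** num_tests


for k in (1, 5, 20):
    print(k, round(chance_of_false_win(k), 3))
# 1 0.05
# 5 0.226
# 20 0.642
```

With 20 variants and no real effect anywhere, a "significant" winner turns up roughly two times in three, which is why the number of tries belongs next to any reported p-value.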