Cheatsheet: Testing a claim: hypothesis testing and p-values

A hypothesis test asks whether an effect is real or noise. The p-value is P(data this extreme | null true), NOT the probability the null is true.

Null (H0): the skeptical default -- no effect, no difference (fair coin, no improvement).
Alternative (H1): there is a real effect.
Logic: assume H0, ask how surprising the data is. Very unlikely under H0 -> evidence against H0.
100 flips, 63 heads. H0: fair (p = 0.5).
expected = 50; sd = sqrt(100 x 0.5 x 0.5) = 5
z = (63 - 50) / 5 = 2.6 standard deviations out
two-sided p (data at least this extreme if fair) is about 0.01 -> below 0.05 -> reject "fair"
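
The coin arithmetic above can be checked with a short script. This is a minimal sketch that assumes the normal approximation to the binomial with no continuity correction and a two-sided tail; the function names are illustrative, and an exact binomial test would give a slightly different p.

```python
import math


def two_sided_p_from_z(z: float) -> float:
    """Two-sided tail probability of a standard normal beyond |z|."""
    return math.erfc(abs(z) / math.sqrt(2))


def coin_z_test(heads: int, flips: int, p_null: float = 0.5):
    """z-score and two-sided p-value for a head count under H0: P(heads) = p_null."""
    expected = flips * p_null                       # 100 x 0.5 = 50
    sd = math.sqrt(flips * p_null * (1 - p_null))   # sqrt(100 x 0.5 x 0.5) = 5
    z = (heads - expected) / sd
    return z, two_sided_p_from_z(z)


z, p = coin_z_test(heads=63, flips=100)
print(f"z = {z:.2f}, p = {p:.4f}")  # z = 2.60, p = 0.0093
```
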
p-value = P(data at least this extreme | null is true).
Pick alpha (often 0.05) in advance. p < alpha -> reject null = "statistically significant".
p >= alpha -> fail to reject (NOT proof the null is true).
1. p is NOT P(null is true). It is P(data | null) -- the flipped conditional (Bayes trap).
2. "Significant" is NOT "large/important". Check the EFFECT SIZE; big samples make tiny effects significant.
3. "Not significant" is NOT "no effect". Could be a real effect the test was too small to detect.
+ Multiple testing: run 20 tests where nothing is going on and, on average, ~1 hits p<0.05 by chance. Be honest about how many you ran.
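
The multiple-testing point can be put in numbers. This sketch assumes every null is true and the tests are independent (both are simplifying assumptions, and the function names are illustrative). It gives the expected number of false positives and the chance of seeing at least one.

```python
def expected_false_positives(n_tests: int, alpha: float = 0.05) -> float:
    """Average number of tests that cross alpha when every null is true."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of n independent true-null tests crosses alpha."""
    return 1 - (1 - alpha) ** n_tests


n = 20
print(f"expected false positives: {expected_false_positives(n):.1f}")      # 1.0
print(f"P(at least one):          {prob_at_least_one_false_positive(n):.2f}")  # 0.64
```

So with 20 tests and no real effects, a "significant" result somewhere is more likely than not.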
  • A/B testing: is the new model/feature’s lift real or noise? (null = “no better”).
  • Benchmark gaps: is B’s higher score significant given the test-set size? (the overlapping-CI check, formalized).
  • Replication: trying many variants and reporting the best is multiple testing -> false positives.
  • Sample size decides: 65% vs 60% is noise on n=200 (z about 1.4), significant on n=2000 (z about 4.6).
  • Reading p as the probability the null is true (flipped conditional).
  • Equating significance with importance (ignoring effect size).
  • Treating “not significant” as “no effect.”
  • Ignoring how many tests were run (multiple comparisons).
  • Null hypothesis (H0): the no-effect default the test assumes.
  • p-value: P(data at least this extreme | null true).
  • Significance level (alpha): the pre-chosen threshold (often 0.05).
  • Effect size: how big the difference is (separate from whether it is significant).
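
The "65% vs 60%" sample-size example can be reproduced as follows. This sketch assumes 60% is a fixed, known baseline (a one-sample proportion test), which is the reading that matches the quoted z values. If both scores were measured on their own samples, each would carry noise and the z values would be smaller. The function name is illustrative.

```python
import math


def one_sample_proportion_z(p_observed: float, p_baseline: float, n: int) -> float:
    """z-score of an observed proportion against a fixed baseline on n items."""
    se = math.sqrt(p_baseline * (1 - p_baseline) / n)  # standard error under H0
    return (p_observed - p_baseline) / se


z_small = one_sample_proportion_z(0.65, 0.60, n=200)
z_large = one_sample_proportion_z(0.65, 0.60, n=2000)
print(f"n=200:  z = {z_small:.2f}")   # 1.44, below the usual 1.96 cutoff
print(f"n=2000: z = {z_large:.2f}")   # 4.56, well above it
```

The effect size (5 points) is identical in both cases; only the sample size, and with it the standard error, changes.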