Cheatsheet: Testing a claim: hypothesis testing and p-values
The one idea
A hypothesis test asks whether an effect is real or noise. The p-value is P(data this extreme | null true), NOT the probability the null is true.
Setup and logic
- Null (H0): the skeptical default -- no effect, no difference (fair coin, no improvement).
- Alternative (H1): there is a real effect.
- Logic: assume H0, ask how surprising the data is. Very unlikely under H0 -> evidence against H0.
Worked example (coin)
100 flips, 63 heads. H0: fair (p = 0.5).
expected = 50; sd = sqrt(100 x 0.5 x 0.5) = 5
z = (63 - 50) / 5 = 2.6 standard deviations out
p (data this extreme if fair) is about 0.01 -> below 0.05 -> reject "fair"
The p-value and the threshold
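The coin example above (100 flips, 63 heads) can be reproduced numerically and its p-value compared with alpha. This is a minimal sketch, assuming a two-sided test and the normal approximation to the binomial with no continuity correction; the helper name `two_sided_p_from_z` is hypothetical and not part of the original cheatsheet.

```python
import math

def two_sided_p_from_z(z: float) -> float:
    """Two-sided p-value for a z-score under the standard normal."""
    # P(|Z| >= |z|) = erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2))

n, heads, p0 = 100, 63, 0.5           # data and the null value (fair coin)
expected = n * p0                     # 50 heads expected under H0
sd = math.sqrt(n * p0 * (1 - p0))     # 5
z = (heads - expected) / sd           # 2.6 standard deviations out
p_value = two_sided_p_from_z(z)       # a little under 0.01

alpha = 0.05                          # threshold chosen in advance
print(f"z = {z:.2f}, p = {p_value:.4f}, reject H0: {p_value < alpha}")
```

An exact binomial test would give a slightly different p-value, but the conclusion at alpha = 0.05 is the same for these numbers.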
- p-value = P(data at least this extreme | null is true).
- Pick alpha (often 0.05) in advance. p < alpha -> reject null = "statistically significant".
- p >= alpha -> fail to reject (NOT proof the null is true).
The three misreadings to REFUSE
1. p is NOT P(null is true). It is P(data | null) -- the flipped conditional (Bayes trap).
2. "Significant" is NOT "large/important". Check the EFFECT SIZE; big samples make tiny effects significant.
3. "Not significant" is NOT "no effect". Could be a real effect the test was too small to detect.
+ Multiple testing: run 20 tests, ~1 hits p<0.05 by chance. Be honest about how many you ran.
In machine learning
- A/B testing: is the new model/feature’s lift real or noise? (null = “no better”).
- Benchmark gaps: is B’s higher score significant given the test-set size? (the overlapping-CI check, formalized).
- Replication: trying many variants and reporting the best is multiple testing -> false positives.
- Sample size decides: 65% vs 60% is noise on n=200 (z ≈ 1.4), significant on n=2000 (z ≈ 4.6).
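The sample-size figures in the last bullet can be checked with a few lines. This sketch assumes the 60% is treated as a fixed, known baseline (a one-sample z-test on a proportion), which is the reading that reproduces z of about 1.4 and 4.6; if both scores were measured on separate samples of size n, the z-scores would be smaller. The function name `one_sample_z` is hypothetical.

```python
import math

def one_sample_z(observed: float, baseline: float, n: int) -> float:
    """z-score of an observed proportion against a fixed baseline proportion."""
    se = math.sqrt(baseline * (1 - baseline) / n)  # standard error under H0
    return (observed - baseline) / se

for n in (200, 2000):
    z = one_sample_z(0.65, 0.60, n)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    print(f"n={n}: z={z:.2f}, p={p:.3g}")
```

The same 5-point gap moves from "could easily be noise" to "very unlikely under the null" purely because the standard error shrinks with the square root of n.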
Pitfalls to dodge
- Reading p as the probability the null is true (flipped conditional).
- Equating significance with importance (ignoring effect size).
- Treating “not significant” as “no effect.”
- Ignoring how many tests were run (multiple comparisons).
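The multiple-comparisons pitfall can be made concrete with a short calculation. This sketch assumes 20 independent tests in which every null hypothesis is actually true. The Bonferroni threshold at the end is one common correction, added here for illustration; it is not something the cheatsheet itself prescribes.

```python
alpha = 0.05    # per-test significance level
n_tests = 20    # number of tests run (all nulls assumed true, tests independent)

# Expected number of tests that cross the threshold by chance alone.
expected_false_positives = n_tests * alpha

# Probability that at least one test crosses the threshold by chance.
p_at_least_one = 1 - (1 - alpha) ** n_tests

# Bonferroni: divide alpha by the number of tests to cap the family-wise error rate.
bonferroni_alpha = alpha / n_tests

print(f"expected false positives: {expected_false_positives:.2f}")
print(f"P(at least one false positive): {p_at_least_one:.2f}")
print(f"Bonferroni per-test alpha: {bonferroni_alpha:.4f}")
```

Under these assumptions, the chance of at least one spurious "significant" result is roughly 64%, which is why the number of tests run needs to be reported alongside the best one.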
Words to use precisely
- Null hypothesis (H0): the no-effect default the test assumes.
- p-value: P(data at least this extreme | null true).
- Significance level (alpha): the pre-chosen threshold (often 0.05).
- Effect size: how big the difference is (separate from whether it is significant).
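To see why effect size and significance are separate quantities, the sketch below uses hypothetical numbers (50.5% observed against a 50% baseline on one million samples), chosen only for illustration. It assumes a one-sample z-test on a proportion with a two-sided normal p-value; the function name `z_and_effect` is hypothetical.

```python
import math

def z_and_effect(observed: float, baseline: float, n: int):
    """Return (z-score, absolute effect size) for a proportion vs a fixed baseline."""
    se = math.sqrt(baseline * (1 - baseline) / n)  # standard error under H0
    effect = observed - baseline                   # effect size, in proportion units
    return effect / se, effect

# Hypothetical numbers: a tiny effect measured on a very large sample.
z, effect = z_and_effect(observed=0.505, baseline=0.50, n=1_000_000)
p = math.erfc(abs(z) / math.sqrt(2))               # two-sided normal p-value

print(f"effect = {effect:.3f}, z = {z:.1f}, p = {p:.1e}")
```

Here the difference is only half a percentage point, yet z is about 10 and the p-value is far below 0.05: highly significant, but whether it matters is a question the p-value does not answer.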