Testing a claim: hypothesis testing and p-values

The previous lesson showed that two results with overlapping confidence intervals cannot be told apart. But often you have to make the call anyway: is the new model really better than the old one, or did it just get lucky on this test set? Is this coin biased, or did a run of heads just happen? Hypothesis testing is the formal machinery for deciding whether an observed effect is real or noise, and the p-value is its central output, a number as widely used as it is widely misunderstood. Reading it correctly is one of the most valuable skills in this whole track.

The setup: two hypotheses

A hypothesis test pits two claims against each other.

The null hypothesis (written H0) is the boring default: nothing is going on, there is no effect, no difference. The coin is fair. The new model is no better than the old. The drug does nothing.
The alternative hypothesis (H1) is what you suspect instead: there is an effect, a difference, a bias. The coin is not fair. The new model is genuinely better.

The test is built to be skeptical. It assumes the null is true and asks whether the data is surprising enough to abandon that assumption, the statistical version of “innocent until proven guilty.” You never start by assuming your exciting claim; you start by assuming the boring one and see if the data forces you off it.

The logic: how surprising is the data?

The whole test rests on one question: if the null hypothesis were true, how likely would we be to see data at least as extreme as what we actually saw? If the answer is “very likely,” the data is consistent with the null and you have learned nothing surprising. If the answer is “very unlikely,” then either something rare happened or the null is wrong, and the rarer the data, the stronger the case against the null.

Make it concrete. You flip a coin 100 times and get 63 heads. Is it fair? Assume the null: the coin is fair, so heads has probability 0.5.

Under "fair coin": expected heads = 100 x 0.5 = 50
Spread (binomial standard deviation) = sqrt(100 x 0.5 x 0.5) = sqrt(25) = 5
How far out is 63?  z = (63 - 50) / 5 = 2.6 standard deviations above expected

(The standard deviation here is the binomial spread from the counts lesson, and the central limit theorem is what lets us treat the count as normal.) A result 2.6 standard deviations out is rare: the probability of landing at least that far from 50 in either direction on a fair coin is only about 0.01. That is the p-value.
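The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration under the same normal approximation, not a reference implementation; the function names z_score and two_sided_p are made up for this example, and the tail area uses the standard-library math.erfc.

```python
import math

def z_score(heads, flips, p_null=0.5):
    """How many standard deviations the observed count is from the null's expectation."""
    expected = flips * p_null
    spread = math.sqrt(flips * p_null * (1 - p_null))  # binomial standard deviation
    return (heads - expected) / spread

def two_sided_p(z):
    """P(|Z| >= |z|) for a standard normal Z, via the complementary error function."""
    return math.erfc(abs(z) / math.sqrt(2))

z = z_score(63, 100)
p = two_sided_p(z)
print(f"z = {z:.2f}, two-sided p = {p:.4f}")  # z = 2.60, two-sided p = 0.0093
```

The result, roughly 0.009, is the "about 0.01" quoted in the text.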

The p-value, carefully

The p-value is:

the probability of observing data at least as extreme as what you saw, assuming the null hypothesis is true.

In the coin example, p is about 0.01: if the coin really were fair, you would see a result this lopsided only about 1% of the time. That is surprising enough to doubt the null. To make the decision a rule, you pick a threshold in advance called the significance level (alpha), conventionally 0.05. If p is below alpha, you reject the null and call the result statistically significant; if not, you fail to reject it. Here p is about 0.01, below 0.05, so you reject “fair coin” and conclude the coin is biased.
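The decision rule can also be checked without the normal approximation by summing the fair-coin binomial probabilities directly. The sketch below is one way to do that, assuming "at least as extreme" means "at least as far from 50 as the observed count, in either direction"; the names exact_two_sided_p and ALPHA are hypothetical.

```python
from math import comb

ALPHA = 0.05  # significance level, chosen before looking at the data

def exact_two_sided_p(heads, flips):
    """Exact two-sided p-value for a fair coin: the share of all 2**flips
    equally likely sequences whose head count is at least as far from
    flips / 2 as the observed count."""
    distance = abs(heads - flips / 2)
    extreme = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - flips / 2) >= distance
    )
    return extreme / 2 ** flips

p = exact_two_sided_p(63, 100)
decision = "reject the null" if p < ALPHA else "fail to reject the null"
print(f"exact p = {p:.4f} -> {decision}")
```

The exact value is close to the normal-approximation figure, though not identical, and leads to the same decision at alpha = 0.05.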

Geometrically, the p-value is the area in the tails of the null distribution beyond the observed test statistic. With z = 2.6, the right tail alone covers about 0.5 percent of the area under the null; adding the matching left tail (results at least as far below 50) gives the two-sided p of about 0.01 used above. That total is the probability of seeing a result this extreme or more if the null hypothesis were true; a small number means the data are unlikely under the null.

That is the mechanics. The hard part, and the reason this lesson exists, is what the p-value does and does not mean.

What the p-value is NOT

Three misreadings cause enormous damage. Learn to refuse all three.

1. The p-value is NOT the probability that the null hypothesis is true. This is the big one, and it is the flipped conditional from the Bayes lessons in disguise. The p-value is the probability of data this extreme given the null is true. It is not the probability that the null is true given the data. Those are different directions, exactly the trap from the conditional-probability lesson. A p of 0.01 does not mean “1% chance the coin is fair”; it means “if the coin were fair, data this extreme would happen 1% of the time.”

2. Statistically significant is NOT the same as large or important. Significance only says an effect is detectable, distinguishable from zero given the data. With a huge sample, a tiny, meaningless difference (a 0.01% accuracy gain) can be statistically significant and still not worth anything. Always ask about the effect size (how big is the difference?) alongside significance (is it real?). A significant result can be trivial.

3. Failing to reject the null does NOT prove the null is true. Absence of evidence is not evidence of absence. A non-significant result might mean there is no effect, or it might mean your sample was too small to detect a real one. You can fail to find a difference without having shown there is none.
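Misreadings 2 and 3 both come down to sample size, which a short sketch can make visible. The numbers below are invented for illustration, the function name proportion_z_test is hypothetical, and the normal approximation is rough for the ten-flip case, so treat the output as indicative only.

```python
import math

def proportion_z_test(successes, n, p_null):
    """Two-sided z-test of an observed proportion against p_null.
    Returns (effect size, p-value) under the normal approximation."""
    observed = successes / n
    se = math.sqrt(p_null * (1 - p_null) / n)
    z = (observed - p_null) / se
    p = math.erfc(abs(z) / math.sqrt(2))
    return observed - p_null, p

# Hypothetical case A: a tiny bias (50.1% heads) measured over 10 million flips.
tiny_effect, tiny_p = proportion_z_test(5_010_000, 10_000_000, 0.5)

# Hypothetical case B: a large apparent bias (70% heads) measured over 10 flips.
big_effect, big_p = proportion_z_test(7, 10, 0.5)

print(f"A: effect = {tiny_effect:+.3f}, p = {tiny_p:.2g}")
print(f"B: effect = {big_effect:+.3f}, p = {big_p:.2g}")
```

Case A is highly significant despite an effect of a tenth of a percentage point; case B is not significant despite a twenty-point gap, because ten flips cannot separate that gap from noise.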

A fourth danger, the multiple-testing trap, follows from the probability lessons: if you run many tests, some will look significant by pure chance (at alpha = 0.05, about 1 in 20 even when nothing is going on). Testing twenty model variants and reporting the one that “won” is how false discoveries are manufactured. The fix is to be honest about how many things you tested.
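To see how quickly the trap bites: with twenty independent tests at alpha = 0.05 and no real effect anywhere, the chance that at least one comes up significant is 1 - 0.95^20, about 0.64. The sketch below checks that figure by simulation. It assumes each test statistic is a standard normal draw under the null; the names and the fixed seed are choices made for this example.

```python
import math
import random

ALPHA = 0.05
N_VARIANTS = 20  # hypothetical number of model variants tested

def null_p_value(rng):
    """Two-sided p-value for one test in which the null is true by construction."""
    z = rng.gauss(0.0, 1.0)
    return math.erfc(abs(z) / math.sqrt(2))

def chance_of_false_win(trials=2000, seed=0):
    """Share of simulated studies in which at least one of N_VARIANTS
    null-only tests falls below ALPHA."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if any(null_p_value(rng) < ALPHA for _ in range(N_VARIANTS)):
            hits += 1
    return hits / trials

analytic = 1 - (1 - ALPHA) ** N_VARIANTS
simulated = chance_of_false_win()
print(f"analytic = {analytic:.3f}, simulated = {simulated:.3f}")
```

Both numbers land near two in three: reporting only the best of twenty null variants yields a "significant" winner more often than not.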

Why this matters when you use AI

Hypothesis testing is the daily bread of evaluating AI, and its misuse is everywhere.

A/B testing. Shipping a new model or feature to half your users and comparing a metric is a hypothesis test: the null is “the new version is no better,” and the p-value tells you whether the observed lift is more than noise. Calling a winner before the test is significant is a classic, costly error.
Benchmark comparisons. When model B beats model A by half a point, the right question is whether that difference is significant given the test-set size, the formal version of the overlapping-intervals check from the previous lesson. Many published leaderboard gaps would not survive it.
The replication trap. AI research that tries many architectures, hyperparameters, and seeds and reports the best is multiple testing. Without accounting for how many things were tried, a “significant” improvement can be a lucky draw, which is part of why some results fail to replicate. The discipline is to pre-register what you are testing and to be honest about the search.
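The benchmark question can be sketched as a two-proportion z-test. The scores below are invented, and the function name two_proportion_z_test is hypothetical. The sketch also treats the two models' results as independent samples, which is a simplifying assumption: when both models are scored on the same examples, a paired test is usually the better fit.

```python
import math

def two_proportion_z_test(correct_a, correct_b, n):
    """Two-sided z-test for an accuracy gap between two models, each scored
    on n examples, treating the two sets of results as independent."""
    acc_a, acc_b = correct_a / n, correct_b / n
    pooled = (correct_a + correct_b) / (2 * n)  # shared accuracy under the null
    se = math.sqrt(pooled * (1 - pooled) * (2 / n))
    z = (acc_b - acc_a) / se
    p = math.erfc(abs(z) / math.sqrt(2))
    return acc_b - acc_a, z, p

# Hypothetical: model B beats model A by half a point on 1,000 examples.
gap_small, z_small, p_small = two_proportion_z_test(800, 805, 1_000)

# The same half-point gap on 100,000 examples.
gap_large, z_large, p_large = two_proportion_z_test(80_000, 80_500, 100_000)

print(f"n = 1,000:   gap = {gap_small:.3f}, p = {p_small:.2f}")
print(f"n = 100,000: gap = {gap_large:.3f}, p = {p_large:.4f}")
```

The same half-point gap is indistinguishable from noise on the small test set and clearly significant on the large one, which is why the test-set size has to accompany any leaderboard claim.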

When you read “our new model is significantly better,” the trained response is to ask three things: significant at what threshold, how big is the effect, and how many things did you try before this one worked?

Common pitfalls

Reading the p-value as the probability the null is true. It is the probability of data this extreme given that the null is true, not the probability that the null is true given the data; confusing the two is the flipped conditional. A small p is evidence against the null, not its probability.
Confusing significant with important. Significance means detectable, not large. Check the effect size; a trivial difference can be significant with enough data.
Treating “not significant” as “no effect.” Failing to reject the null is not proof of it; your test may simply have lacked the power to detect a real effect.
Ignoring multiple testing. Run enough tests and something looks significant by chance. Reporting only the winner of many tries manufactures false positives.

What you should remember

A hypothesis test pits the null (no effect, the skeptical default) against the alternative (a real effect), assumes the null, and asks how surprising the data is.
The p-value is the probability of data at least this extreme given the null is true. A small p means the data is surprising under the null, which is evidence against it; below the threshold alpha (often 0.05) you call it statistically significant.
The p-value is not the probability the null is true (that is the flipped conditional from the Bayes lessons), statistically significant is not the same as important (check the effect size), and failing to reject is not proof of the null.
Beware the multiple-testing trap: enough tests will turn up “significance” by chance, so be honest about how many you ran.
In AI this is A/B testing and benchmark comparison; ask significant-at-what-threshold, how-big-is-the-effect, and how-many-things-did-you-try before trusting a “significant” win.