Practice: Testing a claim: hypothesis testing and p-values

Two skills: running the logic of a test (assume the null, measure how surprising the data is) and interpreting a p-value without the three classic errors (flipping the conditional, confusing significance with effect size, and ignoring multiple testing). The interpretation drill is the one to nail. Keep a scratchpad.

Self-check

Six short questions. Answer each in your head before opening the collapsible.

1. What are the null and alternative hypotheses?

Show answer

The null (H0) is the skeptical default: no effect, no difference (the coin is fair, the new model is no better). The alternative (H1) is the claim of a real effect. A test assumes the null and asks whether the data is surprising enough to reject it.

2. What is the logic of a hypothesis test in one sentence?

Show answer

Assume the null is true, then ask how likely you would be to see data at least as extreme as what you observed; if that is very unlikely, the data is evidence against the null.

3. Define the p-value precisely.

Show answer

The probability of observing data at least as extreme as what you saw, assuming the null hypothesis is true. A small p means the data would be surprising under the null, which counts as evidence against it. Below the threshold alpha (often 0.05) the result is called statistically significant.
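To make the definition concrete, here is a minimal sketch that computes a one-sided p-value exactly for the fair-coin null. The numbers (60 heads in 100 flips) and the helper name binomial_tail are illustrative assumptions, not part of the lesson.

```python
from math import comb

def binomial_tail(n, k, p0):
    """P(X >= k) for X ~ Binomial(n, p0).

    With p0 as the null value, this is the one-sided p-value:
    the chance of data at least as extreme as k successes.
    """
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))

# Hypothetical data: 60 heads in 100 flips, null = fair coin (p0 = 0.5).
p_value = binomial_tail(100, 60, 0.5)
print(f"one-sided p-value: {p_value:.4f}")
print("significant at 0.05" if p_value < 0.05 else "not significant at 0.05")
```

The sum runs over every outcome at least as extreme as the one observed, which is the "at least as extreme" clause of the definition written out term by term.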

4. Why is “the p-value is the probability the null is true” wrong?

Show answer

Because it flips the conditional. The p-value is P(data this extreme | null true), not P(null true | data). This is exactly the direction-swap error from the conditional-probability and Bayes lessons. A p of 0.01 does not mean a 1% chance the null is true.
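A small Bayes calculation shows how far apart the two directions can be. This is a sketch under loudly assumed inputs: the prior (90% of tested ideas have no real effect) and the power (80%) are made-up values for illustration, and the function name is hypothetical.

```python
def prob_null_given_significant(prior_null, alpha, power):
    """P(null true | result significant), by Bayes' rule.

    prior_null: assumed share of tested hypotheses where the null is true
    alpha:      P(significant | null true)
    power:      assumed P(significant | real effect)
    """
    null_and_significant = prior_null * alpha
    effect_and_significant = (1 - prior_null) * power
    return null_and_significant / (null_and_significant + effect_and_significant)

# Assumed inputs, not from the lesson: 90% nulls, alpha = 0.05, 80% power.
posterior = prob_null_given_significant(prior_null=0.9, alpha=0.05, power=0.8)
print(round(posterior, 2))  # 0.36
```

Under these assumptions, a significant result still leaves roughly a 36% chance that the null is true, nowhere near 5%. Getting P(null | data) requires a prior, which the p-value alone never supplies.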

5. Does “statistically significant” mean the effect is large or important?

Show answer

No. Significance means the effect is detectable (distinguishable from zero given the data). With a large enough sample, a tiny, unimportant difference can be significant. Always check the effect size alongside significance.
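As an illustration, the sketch below holds a small effect fixed and varies only the sample size. The 60% baseline and the half-point gain to 60.5% are assumed numbers chosen for the example; the normal approximation is the same one used in the exercise further down.

```python
from math import sqrt

P0 = 0.60      # null accuracy (assumed baseline)
P_HAT = 0.605  # observed accuracy: a half-point gain, arguably unimportant

def z_score(p_hat, p0, n):
    """Standard deviations between the observed rate and the null rate."""
    return (p_hat - p0) * sqrt(n) / sqrt(p0 * (1 - p0))

# Same effect size every time; only n changes.
for n in (1_000, 100_000, 1_000_000):
    print(f"n={n:>9,}: z={z_score(P_HAT, P0, n):.2f}")
```

The effect size never moves from half a percentage point, yet z grows with sqrt(n): far below the one-sided 0.05 cutoff (z of about 1.645) at n = 1,000 and well past it at n = 1,000,000.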

6. What is the multiple-testing trap?

Show answer

If you run many tests, some will look significant by chance: at alpha = 0.05, about 1 in 20 tests comes out significant even when nothing is going on. Testing many variants and reporting only the winner manufactures false positives. Be honest about how many things you tested.
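The arithmetic behind the trap can be sketched in a few lines. The function name is hypothetical, and the calculation assumes the tests are independent and that every null is true.

```python
ALPHA = 0.05

def chance_of_any_false_positive(num_tests, alpha=ALPHA):
    """P(at least one p < alpha) across independent tests when every null is true."""
    return 1 - (1 - alpha) ** num_tests

for m in (1, 5, 20):
    expected_hits = m * ALPHA  # expected number of chance "winners"
    print(f"{m:>2} tests: expect {expected_hits:.2f} false positives, "
          f"P(at least one) = {chance_of_any_false_positive(m):.3f}")
```

With 20 tests, the expected number of chance hits is 1, and the probability of seeing at least one is about 64%, so a lone "winner" out of 20 is close to what pure noise would produce.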

Try it yourself: does sample size change the verdict?

A new model answers a stream of yes/no queries. The old model’s accuracy is known to be 60%. You want to know if the new model is better. In both scenarios the new model scores 65%; only the sample size differs. Use the null “true accuracy = 60%,” for which the expected count is n x 0.6 and the standard deviation is sqrt(n x 0.6 x 0.4).

Scenario 1: 130 correct out of 200 (65%).
Scenario 2: 1300 correct out of 2000 (65%).
For each, compute how many standard deviations above expected the count is.

Show answer

Scenario 1 (n = 200):
  expected = 200 x 0.6 = 120
  sd = sqrt(200 x 0.6 x 0.4) = sqrt(48) = 6.93
  z = (130 - 120) / 6.93 = 10 / 6.93 = 1.44
  -> p (one-sided) is about 0.075, ABOVE 0.05: NOT statistically significant.

Scenario 2 (n = 2000):
  expected = 2000 x 0.6 = 1200
  sd = sqrt(2000 x 0.6 x 0.4) = sqrt(480) = 21.9
  z = (1300 - 1200) / 21.9 = 100 / 21.9 = 4.57
  -> p is tiny (far below 0.05): clearly statistically significant.

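Same 65% measured rate, opposite verdicts. On 200 queries, the 65-vs-60 gap sits about 1.4 standard deviations above expected, close enough that noise alone could plausibly produce it; on 2000, the same gap sits about 4.6 standard deviations out, far beyond what noise explains. Sample size decides whether a difference is detectable: this is the square-root law from the sampling lesson showing up directly in a test.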
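The two scenarios can be checked with a short script. This is a sketch using the same normal approximation as the worked answer (no continuity correction); the helper name z_and_p is an assumption, and the upper-tail area comes from the standard library's erfc.

```python
from math import erfc, sqrt

def z_and_p(correct, n, p0=0.6):
    """z-score and one-sided p-value under the null 'true accuracy = p0'."""
    expected = n * p0
    sd = sqrt(n * p0 * (1 - p0))
    z = (correct - expected) / sd
    p_one_sided = 0.5 * erfc(z / sqrt(2))  # upper-tail area of the standard normal
    return z, p_one_sided

for correct, n in ((130, 200), (1300, 2000)):
    z, p = z_and_p(correct, n)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"n={n}: z={z:.2f}, one-sided p={p:.2g} -> {verdict} at 0.05")
```

Multiplying n by 10 multiplies the gap in counts by 10 but the standard deviation by only sqrt(10), so z grows by a factor of about 3.16, which is what carries the second scenario across the threshold.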

Try it yourself: true or false (the p-value drill)

A test of a new feature returns p = 0.03 (with a significance threshold of 0.05). Mark each statement true or false.

A. "There is a 3% probability that the null hypothesis is true."
B. "If the null were true, we would see data at least this extreme about 3%
   of the time."
C. "The result is statistically significant at the 0.05 level."
D. "Because it is significant, the effect must be large and important."
E. "We tested 20 different features and this is the only one with p < 0.05,
   so we will report just this one as a real effect."

Show answer

A: false. The flipped conditional. p is P(data this extreme | null true), not the probability the null is true.
B: true. This is the correct definition of the p-value.
C: true. 0.03 is below the 0.05 threshold, so it is significant at that level.
D: false. Significance is not size. The effect could be tiny; you must look at the effect size to judge importance.
E: false (multiple testing). With 20 features tested, about 1 is expected to hit p < 0.05 by chance even if none has a real effect. Reporting only the winner without accounting for the 20 tests manufactures a false positive.

Flashcards

Eight cards.

Q. What are the null and alternative hypotheses?

The null (H0) is the skeptical default: no effect/no difference. The alternative (H1) is the claim of a real effect. A test assumes the null and checks whether the data is surprising enough to reject it.

Q. What is the logic of a hypothesis test?

Assume the null is true, then ask how likely data at least as extreme as observed would be. Very unlikely under the null = evidence against the null.

Q. Define the p-value.

The probability of data at least as extreme as observed, assuming the null is true. Small p = surprising under the null = evidence against it. Below alpha (often 0.05) = statistically significant.

Q. Why is 'p is the probability the null is true' wrong?

It flips the conditional. p = P(data this extreme | null true), not P(null true | data), the same direction-swap error as in the Bayes lessons. p = 0.01 is not a 1% chance the null is true.

Q. Does statistically significant mean large or important?

No. Significant means detectable (distinguishable from zero). With enough data a trivial effect can be significant. Always check the effect size for importance.

Q. Does failing to reject the null prove the null is true?

No. Absence of evidence is not evidence of absence; the test may have lacked the power to detect a real effect. You can fail to find a difference without showing there is none.

Q. What is the multiple-testing trap?

Running many tests turns up ‘significance’ by chance (about 1 in 20 tests at alpha = 0.05, even when nothing is going on). Reporting only the winner of many tries manufactures false positives. Be honest about how many tests you ran.

Q. How does sample size affect a significance test?

Bigger samples shrink the standard error (sqrt law), so the same observed difference becomes more clearly significant. A gap that is noise on a small sample can be significant on a large one.