Testing a claim: hypothesis testing and p-values

This is lesson 13 of Track 9 (Statistics & Probability for AI) and the third lesson of Phase 4 (From sample to truth). The previous lesson showed that two results with overlapping confidence intervals cannot be told apart; this lesson supplies the formal machinery for deciding whether an observed effect is real or noise. You will learn to set up a hypothesis test, define the p-value precisely, and, most importantly, reject the three misreadings that make the p-value the most abused number in science. The source curriculum is Khan Academy’s Statistics & Probability course, by Sal Khan and the Khan Academy team, which is freely available and cited as further study.

The lesson sets up the null and alternative hypotheses, explains the assume-the-null-and-measure-surprise logic, works a coin-bias example (63 heads in 100 flips, p about 0.01), and defines the p-value as the probability of data at least this extreme given the null. It then takes care to spell out what the p-value is not: not the probability the null is true (the flipped conditional from the Bayes lessons), not a measure of importance (significant is not large), and not, when non-significant, proof of no effect. It closes on AI: A/B testing, benchmark comparison, and the multiple-testing trap.
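The coin-bias figure can be checked in a few lines. The sketch below is only an illustration: it assumes a fair-coin null (heads probability 0.5), a two-sided test, and the normal approximation to the binomial, none of which the overview states explicitly. The function name `coin_z_test` is made up for this example.

```python
import math

def coin_z_test(heads, flips, null_p=0.5):
    """Return (z, p) for a two-sided normal-approximation test.

    Assumptions (not stated in the lesson overview): fair-coin null,
    two-sided alternative, normal approximation to the binomial.
    """
    expected = flips * null_p
    std_error = math.sqrt(flips * null_p * (1 - null_p))
    z = (heads - expected) / std_error
    # Two-sided tail area of a standard normal: P(|Z| >= |z|)
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

z, p = coin_z_test(63, 100)
print(f"z = {z:.2f}, p = {p:.4f}")  # z = 2.60, p = 0.0093
```

Under these assumptions, 63 heads sits 2.6 standard errors above the 50 a fair coin predicts, which matches the "p about 0.01" quoted above.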

Lesson 13 of 14 is the last technical lesson of Phase 4 before the capstone. It builds on the standard error and central limit theorem from the sampling lesson and is the formal version of the overlapping-intervals check from the confidence-interval lesson. Its central misreading is a direct callback to the conditional-probability and Bayes lessons: a p-value is P(data this extreme given null), not P(null given data). The capstone that follows ties all of this to evaluating AI.
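To see why the two conditionals differ, Bayes' rule can be applied to a test's outcome. This is a rough sketch with invented numbers: the prior share of true nulls (90%), the significance threshold (0.05), and the test's power (0.8) are all hypothetical, and the function name is made up here. It conditions on "the test came out significant" rather than on the exact data, which is a simplification.

```python
def prob_null_given_significant(prior_null, alpha, power):
    """P(null is true | test was significant), via Bayes' rule."""
    false_alarms = prior_null * alpha      # null true, yet significant
    true_hits = (1 - prior_null) * power   # effect real, and detected
    return false_alarms / (false_alarms + true_hits)

# Hypothetical inputs: 90% of tested ideas have no real effect,
# threshold alpha = 0.05, power = 0.8.
posterior = prob_null_given_significant(prior_null=0.9, alpha=0.05, power=0.8)
print(f"{posterior:.2f}")  # 0.36
```

With these made-up inputs, a significant result still leaves about a 36% chance that the null is true, far from the 5% a flipped reading of the threshold would suggest.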

Prerequisites: Confidence intervals (lesson 12) and, behind it, the sampling and standard-error lesson. The conditional-probability and Bayes lessons are worth recalling, since the main p-value error is the same flipped conditional. The arithmetic is light (a z-score-style comparison); the difficulty is conceptual.

The calculation in the lesson is a single standardized comparison (how many standard errors the observed result lies from what the null predicts), with the p-value read off from there. The practice adds one more such computation to show how sample size changes the verdict. There are no heavy formulas; the real work is interpreting the result correctly, which the practice drills directly.
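The effect of sample size on the verdict can be sketched as follows. The numbers are hypothetical and are not taken from the practice set: the same 56% heads rate is tested at 100 flips and at 1000 flips, against a fair-coin null, using a two-sided normal approximation and a 0.05 threshold (all assumptions of this example).

```python
import math

def two_sided_p(successes, n, null_p=0.5):
    """Two-sided p-value from a normal approximation (an assumption here)."""
    std_error = math.sqrt(n * null_p * (1 - null_p))
    z = (successes - n * null_p) / std_error
    return math.erfc(abs(z) / math.sqrt(2))

ALPHA = 0.05  # conventional threshold, assumed for illustration

# Hypothetical data: the same 56% observed rate at two sample sizes.
p_small = two_sided_p(56, 100)
p_large = two_sided_p(560, 1000)

for label, p in [("n = 100", p_small), ("n = 1000", p_large)]:
    verdict = "significant" if p < ALPHA else "not significant"
    print(f"{label}: p = {p:.4f} ({verdict})")
```

The observed rate is identical in both cases; only the standard error shrinks. At 100 flips the p-value is roughly 0.23, while at 1000 flips it falls well below 0.001, so the same effect size receives opposite verdicts.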

  • Set up a null and an alternative hypothesis for a claim
  • Explain the logic of a hypothesis test (assume the null, measure how surprising the data is)
  • Define the p-value correctly as the probability of data at least this extreme, given the null
  • Reject the common misreadings (p is not the probability the null is true; significant is not important)
  • Recognize hypothesis testing in AI (A/B tests, benchmark comparisons) and the multiple-testing trap
  • Read time: about 13 minutes
  • Practice time: about 16 minutes (a self-check, a worked significance test showing how sample size decides the call, a true-or-false p-value interpretation drill, and flashcards)
  • Difficulty: standard (light arithmetic; the challenge is interpreting the p-value, which the lesson and practice target hard)
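The multiple-testing trap named in the objectives can be made concrete with one formula. As a minimal sketch, assume m independent tests, each run at threshold alpha, with every null actually true; the chance of at least one false alarm is then 1 - (1 - alpha)^m. The independence assumption, the choice of 20 tests, and the Bonferroni correction shown as one possible remedy are all illustrative choices, not taken from the lesson.

```python
def family_wise_error_rate(alpha, num_tests):
    """Chance of at least one false positive across independent tests
    when every null hypothesis is true (independence is assumed)."""
    return 1 - (1 - alpha) ** num_tests

alpha, num_tests = 0.05, 20  # hypothetical: 20 benchmark comparisons

uncorrected = family_wise_error_rate(alpha, num_tests)
print(f"{uncorrected:.2f}")  # 0.64

# Bonferroni correction: test each comparison at alpha / num_tests.
corrected = family_wise_error_rate(alpha / num_tests, num_tests)
print(corrected < alpha)  # True
```

Twenty comparisons at the usual 0.05 threshold give about a 64% chance of at least one spurious "win" even when nothing real is going on, which is why a single significant result out of many needs a corrected threshold.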