Testing a claim: hypothesis testing and p-values
What you’ll learn
This is lesson 13 of Track 9 (Statistics & Probability for AI) and the third lesson of Phase 4 (From sample to truth). The previous lesson showed that two results with overlapping confidence intervals cannot be told apart; this lesson is the formal machinery for deciding whether an observed effect is real or noise. You will learn to set up a hypothesis test, define the p-value precisely, and, most importantly, refuse the three misreadings that make the p-value the most abused number in science. The source curriculum is Khan Academy’s Statistics & Probability course, by Sal Khan and the Khan Academy team, freely available and cited as further study.
The lesson sets up the null and alternative hypotheses, explains the assume-the-null-and-measure-surprise logic, works a coin-bias example (63 of 100 heads, p about 0.01), and defines the p-value as the probability of data this extreme given the null. It then takes real care over what the p-value is not: it is not the probability that the null is true (the flipped conditional from the Bayes lessons), it is not a measure of importance (significant does not mean large), and, when non-significant, it is not proof of no effect. It closes with applications in AI: A/B testing, benchmark comparison, and the multiple-testing trap.
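The coin-bias figure can be checked with a few lines of Python. This is a minimal sketch, assuming a two-sided test and the normal approximation to the binomial; the lesson itself may compute the number differently, and the function name `coin_z_test` is hypothetical.

```python
import math

def coin_z_test(heads: int, flips: int, p_null: float = 0.5):
    """Two-sided z-test for a coin's heads probability (normal approximation)."""
    expected = flips * p_null                           # heads the null predicts
    std_dev = math.sqrt(flips * p_null * (1 - p_null))  # spread under the null
    z = (heads - expected) / std_dev                    # standardized surprise
    # For a standard normal Z, P(|Z| >= |z|) equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

z, p = coin_z_test(63, 100)
print(f"z = {z:.2f}, two-sided p = {p:.4f}")  # z = 2.60, two-sided p = 0.0093
```

Under these assumptions the result rounds to the "about 0.01" quoted above: if the coin were fair, a count at least this far from 50 would turn up in roughly 1 of every 100 runs of 100 flips.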
Where this fits
This is lesson 13 of 14, the last technical lesson of Phase 4 before the capstone. It builds on the standard error and central limit theorem from the sampling lesson and is the formal version of the overlapping-intervals check from the confidence-interval lesson. Its central misreading is a direct callback to the conditional-probability and Bayes lessons: a p-value is P(data given null), not P(null given data). The capstone next ties all of this to evaluating AI.
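To see how far apart the two conditionals can be, here is a small sketch using Bayes' rule. Every number in it is a hypothetical assumption chosen for illustration (a 90% prior that the null is true, a 5% significance threshold, 80% power); none comes from the lesson. It treats "the test came out significant" as the data.

```python
def prob_null_given_significant(prior_null: float, alpha: float, power: float) -> float:
    """P(null | significant) via Bayes' rule.

    alpha = P(significant | null), power = P(significant | alternative).
    """
    sig_and_null = alpha * prior_null        # false alarms
    sig_and_alt = power * (1 - prior_null)   # true detections
    return sig_and_null / (sig_and_null + sig_and_alt)

# Hypothetical inputs: 90% of tested claims are truly null.
posterior = prob_null_given_significant(prior_null=0.9, alpha=0.05, power=0.8)
print(f"P(significant | null) = 0.05, P(null | significant) = {posterior:.2f}")
```

With these made-up inputs, a significant result still leaves a 36% chance that the null is true, even though the chance of a significant result under the null is only 5%. The two numbers answer different questions.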
Before you start
Prerequisites: Confidence intervals (lesson 12) and, behind it, the sampling and standard-error lesson. The conditional-probability and Bayes lessons are worth recalling, since the main p-value error is the same flipped conditional. The arithmetic is light (a z-score-style comparison); the difficulty is conceptual.
About the math
The calculation in the lesson is a single standardized comparison (how many standard deviations the data is from what the null predicts), with the p-value read off from there. The practice adds one more such computation to show how sample size changes the verdict. No heavy formulas; the real work is interpreting the result correctly, which the practice drills directly.
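As a sketch of that standardized comparison, assuming the usual normal-approximation test for a proportion (the lesson's own notation may differ): the observed proportion is written \hat{p}, the proportion the null predicts is p_0, and the sample size is n.

```latex
z = \frac{\hat{p} - p_0}{\sqrt{p_0 (1 - p_0) / n}}
```

For the coin example (63 heads in 100 flips, p_0 = 0.5) the denominator is sqrt(0.25 / 100) = 0.05, so z = 0.13 / 0.05 = 2.6. Because n sits in the denominator, the same 0.13 gap in a hypothetical sample of only 25 flips would give z = 0.13 / 0.1 = 1.3 and a two-sided p-value of roughly 0.19, which is not enough to reject the null. These second numbers are illustrative only and are not the lesson's practice problem.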
By the end, you’ll be able to
- Set up a null and an alternative hypothesis for a claim
- Explain the logic of a hypothesis test (assume the null, measure how surprising the data is)
- Define the p-value correctly as the probability of data this extreme given the null
- Reject the common misreadings (p is not the probability the null is true; significant is not important)
- Recognize hypothesis testing in AI (A/B tests, benchmark comparisons) and the multiple-testing trap
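The multiple-testing trap in the last objective can be made concrete with a short calculation. This is a sketch under two stated assumptions: every null is actually true, and the tests are independent. The function name `familywise_error_rate` is hypothetical.

```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent tests
    when every null is true and each test uses threshold alpha."""
    return 1 - (1 - alpha) ** num_tests

for k in (1, 5, 20, 100):
    print(f"{k:>3} tests: {familywise_error_rate(k):.3f}")
# Prints 0.050, 0.226, 0.642, 0.994 for 1, 5, 20, 100 tests.
```

Under these assumptions, comparing a model on 20 benchmarks at the 0.05 level gives roughly a 64% chance of at least one "significant" win from noise alone.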
Time and difficulty
- Read time: about 13 minutes
- Practice time: about 16 minutes (a self-check, a worked significance test showing how sample size decides the call, a true-or-false p-value interpretation drill, and flashcards)
- Difficulty: standard (light arithmetic; the challenge is interpreting the p-value, which the lesson and practice target hard)