
How sure are we? Confidence intervals

The previous lesson established that any number measured on a sample is an estimate with a standard error, a built-in wobble. So reporting just the number, “the model is 90% accurate,” is dishonest by omission: it hides how much that 90% might move on a different test set. A confidence interval fixes that. It reports the estimate together with a range of plausible values for the truth, turning “90% accurate” into “90% accurate, give or take 4 points.” This lesson builds the interval, and then spends real care on the one thing about it almost everyone gets wrong.

A single measured number is a point estimate. It is your best single guess at the parameter, but on its own it pretends to a precision it does not have. A confidence interval wraps the estimate in a range:

confidence interval = point estimate +/- margin of error

The margin of error is how far the truth might plausibly sit from your estimate, and it is built straight from the standard error of the previous lesson:

margin of error = (a multiplier) x (standard error)

For the common 95% confidence level, the multiplier is about 2. The exact value is 1.96, which comes directly from the normal distribution; the 68-95-99.7 rule rounds it to 2, since about 95% of a normal distribution lies within two standard deviations of its mean. So the everyday rule is:

95% confidence interval is about estimate +/- 2 x standard error

Work it on a model. You measure 90% accuracy on a test set, and the standard error of that accuracy is 2 percentage points. Then:

95% CI = 90% +/- 2 x 2% = 90% +/- 4% = [86%, 94%]

You report not “90%” but “90%, with a 95% confidence interval of 86% to 94%.” That range is the honest statement of what your test set actually tells you.
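The same arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function name `confidence_interval` is hypothetical, and it assumes the normal approximation holds. It uses the exact multiplier 1.96 instead of the rounded 2, so the endpoints come out slightly inside [86%, 94%].

```python
from statistics import NormalDist

def confidence_interval(estimate, standard_error, level=0.95):
    """Return (low, high) for a normal-approximation confidence interval."""
    # Two-sided multiplier from the standard normal; about 1.96 for level=0.95.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    margin = z * standard_error
    return estimate - margin, estimate + margin

# The worked example: 90% accuracy with a standard error of 2 points.
low, high = confidence_interval(90.0, 2.0)
print(f"95% CI: [{low:.1f}%, {high:.1f}%]")  # prints: 95% CI: [86.1%, 93.9%]
```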

Two dials control how wide the interval is, and they pull in opposite directions.

  • Sample size. The margin uses the standard error, which is sigma over root n, so more data shrinks the interval. If a bigger test set cut the standard error from 2 to 1 percentage point, the 95% interval would tighten from [86%, 94%] to [88%, 92%]. This is the square-root law from the previous lesson, now visible as interval width.
  • Confidence level. Wanting to be more confident means casting a wider net. A 99% interval uses a bigger multiplier (about 2.6 instead of 2), so the same data gives roughly 90% +/- 5.2% = [84.8%, 95.2%], wider than the 95% one. You cannot have both maximal confidence and a tight range from fixed data; you trade them off.

The takeaway: a narrow interval at high confidence comes from one place, more data. Confidence level alone just trades width for certainty.
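To see both dials at once, the sketch below tabulates interval width for a few test-set sizes and confidence levels. It rests on assumptions not stated in the lesson: each test example is treated as an independent right/wrong outcome, so sigma is sqrt(p(1 - p)), and the test-set sizes (225, 900, 3600) are illustrative choices. With 90% accuracy, n = 225 happens to give the 2-point standard error used in the worked example. The helper name `interval_width` is hypothetical.

```python
import math
from statistics import NormalDist

def interval_width(accuracy, n, level):
    """Full width (high minus low) of a normal-approximation interval."""
    # Assumption: independent right/wrong outcomes, so sigma = sqrt(p * (1 - p))
    # and the standard error is sigma / sqrt(n).
    standard_error = math.sqrt(accuracy * (1 - accuracy) / n)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return 2 * z * standard_error

# Each step quadruples n, which should halve the width (square-root law).
for n in (225, 900, 3600):
    for level in (0.95, 0.99):
        width = interval_width(0.90, n, level)
        print(f"n={n:5d}  level={level:.0%}  width={100 * width:.1f} points")
```

Reading down the output, quadrupling the test set halves the width at a fixed confidence level, while moving from 95% to 99% widens it at a fixed size.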

The interpretation almost everyone gets wrong


Here is the subtle part, and it matters. A 95% confidence interval does not mean “there is a 95% probability that the true value lies in this particular interval.” That reading is natural and wrong.

Why is it wrong? Because the true parameter is a fixed number, not a random one. Either it is in your interval or it is not; there is no probability attached to this specific interval, because the truth does not roll dice. What is random is the interval itself, which depends on which sample you happened to draw. The correct interpretation is about the procedure:

If you repeated the whole sampling-and-interval-building process many times, about 95% of the intervals you construct would contain the true parameter.

So “95% confidence” is a statement about the long-run reliability of your method, not about this one interval. In everyday use people do speak loosely of “being 95% confident the truth is in here,” and that shorthand is tolerable, but the precise meaning is the procedure’s hit rate. (It also does not mean 95% of the data falls in the interval; the interval is about the parameter, not the spread of individual values.)

[Figure: twenty 95% confidence intervals drawn as horizontal bars against a fixed true value μ = 5 (dashed vertical line). Nineteen intervals contain μ; one lies entirely to its right and misses.]
Twenty repetitions of the same study, each producing a 95 percent confidence interval. The truth (purple line) is fixed; the intervals shift around it because the sample means shift. About one in twenty intervals misses by chance. The 95 percent claim is about the long-run behavior of the procedure, not about any single interval being right 95 percent of the time.
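The procedural reading can be checked by simulation. The sketch below repeats a made-up study many times and counts how often the interval captures the truth. Everything specific here is an assumption for illustration: the data are drawn from a normal distribution with true mean 5 (as in the figure), the spread, sample size, number of repetitions, and seed are arbitrary, and the name `coverage` is hypothetical.

```python
import random
from statistics import NormalDist, mean, stdev

def coverage(true_mean=5.0, sigma=2.0, n=50, trials=2000, level=0.95, seed=0):
    """Fraction of simulated intervals that contain the true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    hits = 0
    for _ in range(trials):
        # One "study": draw a fresh sample and build its interval.
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        m = mean(sample)
        se = stdev(sample) / n ** 0.5
        # The truth is fixed; only the interval moves from study to study.
        if m - z * se <= true_mean <= m + z * se:
            hits += 1
    return hits / trials

rate = coverage()
print(f"Share of intervals containing the truth: {rate:.3f}")
```

The printed share should land close to 0.95, not exactly on it: the run is finite, and using the normal multiplier with an estimated standard deviation gives coverage slightly below the nominal level at this sample size.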

Confidence intervals are how you read AI results without fooling yourself.

  • Report metrics with intervals, not bare numbers. “90% accuracy” invites false precision; “90%, 95% CI [86%, 94%]” tells the reader how much to trust it. On a small test set the interval is wide, which is itself the honest message: you do not actually know the accuracy to the percentage point.
  • Overlapping intervals mean “cannot tell apart.” If model A scores 90% with interval [86%, 94%] and model B scores 91% with interval [87%, 95%], the intervals overlap heavily, and you cannot conclude B is better from this data. The point estimates differ; the evidence does not support a real difference. This is the seed of the next lesson, hypothesis testing.
  • Leaderboard caution. A benchmark difference of a few tenths of a percent on a small test set often sits well inside the confidence intervals, meaning the “winner” may be a sampling fluke. Asking “what is the interval” is how you avoid being fooled by a tiny lead.

When you see a single metric reported with no interval, the right instinct now is to ask how big the test set was and how wide the interval would be, because that range is the difference between a result and a guess.
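As a rough sketch of the model A versus model B comparison, the code below builds an interval for each model from raw counts and checks whether the two ranges overlap. The counts (180 and 182 correct out of 200) are hypothetical, chosen to give 90% and 91%, and the function names are illustrative. The overlap check is only a quick screen under the normal approximation, not a formal test; a proper comparison belongs to hypothesis testing.

```python
import math
from statistics import NormalDist

def accuracy_interval(correct, total, level=0.95):
    """Normal-approximation interval for an accuracy measured on `total` examples."""
    p = correct / total
    # Assumption: independent right/wrong outcomes on each test example.
    se = math.sqrt(p * (1 - p) / total)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return p - z * se, p + z * se

def intervals_overlap(a, b):
    """True if two (low, high) ranges share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

model_a = accuracy_interval(180, 200)  # 90% on a hypothetical 200-example test set
model_b = accuracy_interval(182, 200)  # 91% on the same hypothetical test set
print(intervals_overlap(model_a, model_b))  # True: this data cannot separate them
```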

  • The probability misreading. A 95% interval is not a 95% chance the truth is in this specific range. The parameter is fixed; the interval is what varies, and 95% is the long-run rate at which such intervals capture the truth.
  • Thinking the interval covers 95% of the data. It is a range of plausible values for the parameter (like the mean or accuracy), not a range that holds 95% of individual data points.
  • Forgetting that higher confidence means a wider interval. A 99% interval is wider than a 95% one on the same data; you buy confidence with width.
  • Comparing point estimates while ignoring the intervals. Two numbers that look different can have heavily overlapping intervals, in which case the difference may be noise. Always check whether the ranges overlap.
  • A point estimate hides uncertainty; a confidence interval (estimate plus or minus a margin of error) shows it. The margin is a multiplier times the standard error, and for 95% confidence the multiplier is about 2.
  • A 95% interval is roughly estimate +/- 2 standard errors. Example: 90% accuracy with a 2-point standard error gives [86%, 94%].
  • More data narrows the interval (the square-root law); higher confidence widens it. A tight, high-confidence interval comes from more data, not from turning a dial.
  • The correct interpretation is about the procedure: about 95% of intervals built this way would contain the truth. It is not a 95% probability for this one interval, and not a range covering 95% of the data.
  • In AI, report metrics with intervals; heavily overlapping intervals mean the data cannot tell two results apart; and a small test set gives a wide interval that honestly signals how little you know.