Practice: Reading the results: the confusion matrix, precision, recall, and ROC

Self-check

Seven short questions. Try to answer each one before opening the collapsible.

1. Why does accuracy lie on imbalanced data?

Show answer

The majority class dominates the count, so a model that always predicts the majority can score very high without doing the job. If 99% of transactions are legitimate, a fraud detector that never flags anything is 99% accurate while catching zero fraud.
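A minimal sketch of that effect, using made-up data (1,000 transactions, 10 of them fraud) and a stand-in "model" that never flags anything. The numbers are illustrative assumptions, not results from a real detector.

```python
# Hypothetical data: 1,000 transactions, 10 of them fraud (label 1).
y_true = [1] * 10 + [0] * 990

# A stand-in "model" that always predicts the majority class (not fraud).
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Fraud cases actually caught (true positives).
caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

print(f"accuracy = {accuracy:.1%}")
print(f"fraud caught = {caught} of {sum(y_true)}")
```

The accuracy comes out at 99.0% even though none of the 10 fraud cases is caught.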

2. What are TP, TN, FP, and FN?

Show answer

TP: predicted positive, actually positive (correct catch). TN: predicted negative, actually negative (correct reject). FP: predicted positive, actually negative (false alarm). FN: predicted negative, actually positive (miss).

3. State the formulas for precision and recall, and the question each answers.

Show answer

Precision = TP / (TP + FP): of everything I flagged as positive, what fraction was actually positive? Recall = TP / (TP + FN): of all actual positives, what fraction did I catch?
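The two formulas can be sketched directly from label lists. The helper names (`confusion_counts`, `precision`, `recall`) and the labels are hypothetical, and returning 0.0 when a denominator is zero is one common convention, not the only one.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    return tp, fp, fn, tn


def precision(tp, fp):
    # Of everything flagged positive, what fraction was actually positive?
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    # Of all actual positives, what fraction was caught?
    return tp / (tp + fn) if (tp + fn) else 0.0


# Made-up labels for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 0, 0, 1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)
print(precision(tp, fp), recall(tp, fn))
```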

4. Why do precision and recall usually trade off?

Show answer

Because the decision threshold is shared. Lowering the threshold (predict positive more easily) catches more real positives (recall up) at the cost of more false alarms (precision down). Raising it does the opposite.
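A small sketch of the trade-off: the same scores are cut at three thresholds. The scores and labels are invented for illustration, and `precision_recall_at` is a hypothetical helper.

```python
# Hypothetical model scores (higher = more likely positive) and true labels.
scores = [0.95, 0.85, 0.70, 0.60, 0.45, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]


def precision_recall_at(threshold):
    """Flag every example whose score is >= threshold, then score the flags."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Walk the threshold downward: recall climbs while precision falls.
for threshold in (0.8, 0.5, 0.25):
    p, r = precision_recall_at(threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, lowering the threshold from 0.8 to 0.25 moves recall from 0.50 to 1.00 while precision drops from 1.00 to about 0.67.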

5. When does high recall matter more than precision, and when does high precision matter more?

Show answer

High recall matters when missing a positive is much worse than a false alarm (medical screens, fraud detection, safety alerts). High precision matters when a false alarm is much worse than missing a positive (spam filter that wrongly catches real email, search-engine top results, recommendations).

6. What do the two axes of a ROC curve show, and what corner is “perfect”?

Show answer

X axis: false positive rate = FP / (FP + TN). Y axis: true positive rate (recall) = TP / (TP + FN). Each point is a different threshold. The top-left corner (true positive rate 1, false positive rate 0) is perfect; the diagonal from (0, 0) to (1, 1) is what random guessing produces.
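One way to trace the curve by hand is to sweep the threshold over every distinct score and record (false positive rate, true positive rate) at each step. This is a sketch on invented scores; `roc_points` is a hypothetical helper and assumes both classes are present.

```python
def roc_points(labels, scores):
    """Return ROC points as (fpr, tpr), one per threshold, highest threshold first."""
    pos = sum(labels)            # number of actual positives
    neg = len(labels) - pos      # number of actual negatives
    points = [(0.0, 0.0)]        # threshold above every score: nothing is flagged
    for thr in sorted(set(scores), reverse=True):
        tp = sum(y == 1 and s >= thr for y, s in zip(labels, scores))
        fp = sum(y == 0 and s >= thr for y, s in zip(labels, scores))
        points.append((fp / neg, tp / pos))
    return points


# Hypothetical scores and labels.
scores = [0.95, 0.85, 0.70, 0.60, 0.45, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

for fpr, tpr in roc_points(labels, scores):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

The curve starts at (0, 0), where nothing is flagged, and ends at (1, 1), where everything is flagged.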

7. What does AUC summarize, and one caveat?

Show answer

The area under the ROC curve, a single number for how well the model separates the classes independent of threshold (0.5 = random, 1.0 = perfect). Caveat: on very imbalanced data, AUC can look optimistic; the precision-recall curve is often more informative.
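AUC can also be read as the probability that a randomly chosen positive receives a higher score than a randomly chosen negative (ties counted as half). The sketch below computes it that way on invented scores; `auc_by_pairs` is a hypothetical helper, and the quadratic loop is for clarity rather than speed.

```python
def auc_by_pairs(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical scores and labels.
scores = [0.95, 0.85, 0.70, 0.60, 0.45, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

print(auc_by_pairs(labels, scores))
```

On this toy data, 13 of the 16 positive-negative pairs are ordered correctly, giving an AUC of 0.8125.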

Try it yourself: compute the metrics

A medical screen is evaluated on 2,000 patients, of whom 100 actually have the condition. The model produces this confusion matrix:

                            Actual: Diseased    Actual: Healthy
Predicted: Diseased            TP = 70             FP = 200
Predicted: Healthy             FN = 30             TN = 1700

Compute: accuracy, precision, recall, and specificity, each as a percentage. Then say in one sentence whether this is a good model for screening.

Show answer

accuracy    = (TP + TN) / total  = (70 + 1700) / 2000  = 1770 / 2000 = 88.5%
precision   = TP / (TP + FP)     = 70 / (70 + 200)     = 70 / 270    ≈ 25.9%
recall      = TP / (TP + FN)     = 70 / (70 + 30)      = 70 / 100    = 70.0%
specificity = TN / (TN + FP)     = 1700 / (1700 + 200) = 1700 / 1900 ≈ 89.5%

Is it a good screen? Mixed. Recall is 70% (we catch 70 of the 100 sick patients and miss 30), which is mediocre for a medical screen: missing 30% of real cases is dangerous. Precision is only about 26% (roughly three of every four alarms are false), so many healthy patients get further testing for no reason. The 88.5% accuracy headline hides both problems. For a screen where missing cases is costly, you would lower the decision threshold to push recall up (at the cost of even more false alarms), then add a confirmatory test for everyone the screen flags.
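To check the arithmetic, the four metrics can be recomputed from the counts in the confusion matrix. This is a plain-Python sketch; the variable names are illustrative.

```python
# Counts from the medical-screen confusion matrix in this exercise.
tp, fp, fn, tn = 70, 200, 30, 1700
total = tp + fp + fn + tn

accuracy = (tp + tn) / total       # all correct calls over all patients
precision = tp / (tp + fp)         # flagged patients who are actually diseased
recall = tp / (tp + fn)            # diseased patients who were flagged
specificity = tn / (tn + fp)       # healthy patients who were correctly cleared

for name, value in [
    ("accuracy", accuracy),
    ("precision", precision),
    ("recall", recall),
    ("specificity", specificity),
]:
    print(f"{name:<12}= {value:.1%}")
```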

Try it yourself: pick the metric

For each scenario, name the metric that matters most and explain in one sentence why.

A. A spam filter that drops messages into a hidden spam folder.
B. A cancer screening test that flags patients for follow-up diagnostics.
C. A search engine where users only look at the top 5 results.

Show answer

A: Precision. A false alarm (real email marked as spam and hidden) is much worse than a missed catch (a spam message reaching the inbox, which the user can delete). You want every “spam” verdict to be trustworthy.
B: Recall. Missing a real cancer case (false negative) is far worse than a false alarm, which only triggers an additional test. The screen’s job is to catch every real case it can, even at the cost of investigating some healthy patients.
C: Precision (at top-k). Only the first few results are seen, so each one must be trustworthy. Missing relevant results that would have appeared on page 5 is much less costly than putting an irrelevant result in the top slot.

The pattern: name which kind of error costs more, and the metric to optimize falls out.

Flashcards

Ten cards.

Q. Why does accuracy lie on imbalanced data?

The majority class dominates the count; a model that always predicts the majority can score very high without doing the job (e.g., 99% accurate by ignoring the 1% positive class).

Q. What is the confusion matrix?

For a binary classifier at a fixed threshold, a 2x2 table of TP / FP / FN / TN counts. Accuracy, precision, recall, specificity, and F1 are all derived from these four numbers.

Q. Define precision.

TP / (TP + FP): of everything I flagged as positive, what fraction was actually positive? High precision = few false alarms.

Q. Define recall (sensitivity, true positive rate).

TP / (TP + FN): of all actual positives, what fraction did I catch? High recall = few misses.

Q. Why do precision and recall trade off?

The decision threshold controls both. Lowering the threshold raises recall and lowers precision; raising it does the opposite. You move along the curve; you do not get both for free.

Q. What is the F1 score?

The harmonic mean of precision and recall: F1 = 2 * P * R / (P + R). A single number that drops sharply when either precision or recall is bad.
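A short sketch of why the harmonic mean is used: with one strong and one weak component, F1 sits far below the simple average. The function name `f1_score` and the input values are illustrative assumptions, and returning 0.0 when both inputs are zero is a convention.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# One strong metric cannot rescue a weak one.
lopsided = f1_score(0.9, 0.1)
balanced = f1_score(0.5, 0.5)

print(f"P=0.9, R=0.1 -> F1={lopsided:.2f} (arithmetic mean would be 0.50)")
print(f"P=0.5, R=0.5 -> F1={balanced:.2f}")
```

Both pairs have the same arithmetic mean (0.50), but the lopsided pair scores an F1 of only 0.18.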

Q. When is high recall more important than high precision?

When missing a positive is much worse than a false alarm: medical screens, fraud detection, safety alerts. Better to investigate too many than miss a real case.

Q. When is high precision more important than high recall?

When a false alarm is much worse than a miss: spam filters (real email lost), search-engine top results, recommendations. Every “yes” needs to be trustworthy.

Q. What do the ROC curve's axes show?

X: false positive rate = FP/(FP+TN). Y: true positive rate = recall = TP/(TP+FN). Each point is a different threshold; the top-left corner is perfect.

Q. What is AUC and one caveat?

The area under the ROC curve, a threshold-independent measure of class separation (0.5 random, 1.0 perfect). Caveat: on very imbalanced data, AUC can look optimistic; the precision-recall curve is often more honest.