Cheatsheet: Train, test, and cross-validation
Why honest evaluation matters
| Claim | Verdict |
|---|---|
| “99% training accuracy” | proves nothing; could be memorization |
| “evaluated on held-out test data” | meaningful |
| “cross-validated” | meaningful and stable |
The split recipes
| Setup | Train | Validation | Test | Use |
|---|---|---|---|---|
| Simple | ~80% | — | ~20% | no hyperparameter tuning |
| Three-way | 60-70% | 10-20% | ~20% | tune on validation; one-shot test |
The test set is one-shot: untouched during training and tuning.
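
The three-way recipe above could be sketched in plain Python as follows. The function name `three_way_split` and its parameters are hypothetical, and the sketch assumes the rows can be shuffled freely (so it does not apply to time-ordered data).

```python
import random

def three_way_split(items, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once, then cut into train / validation / test lists."""
    if not 0 <= val_frac + test_frac < 1:
        raise ValueError("val_frac + test_frac must be in [0, 1)")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]                 # set aside; used once at the very end
    val = shuffled[n_test:n_test + n_val]    # used for hyperparameter tuning
    train = shuffled[n_test + n_val:]        # everything else
    return train, val, test

train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```

Passing `val_frac=0` gives the simple 80/20 split from the first row of the table.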
k-fold cross-validation
| Step | Action |
|---|---|
| 1 | split data into k roughly equal folds (commonly k = 5 or 10) |
| 2 | for each fold i: train on the other k-1; test on i; record score |
| 3 | average the k scores |
| Output | a stable cross-validated estimate of generalization |
Worked example: the five fold scores 0.81, 0.79, 0.83, 0.82, 0.80 sum to 4.05, so the cross-validated score is 4.05 / 5 = 0.81.
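
The three steps can be sketched without any library. The names `kfold_indices`, `cross_validate`, and the `fit_and_score` callback are hypothetical. The folds here are contiguous index ranges, so the sketch assumes the data has already been shuffled.

```python
from statistics import mean

def kfold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds over range(n)."""
    # Fold sizes differ by at most one when n is not divisible by k.
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size

def cross_validate(n, k, fit_and_score):
    """Average the k scores returned by fit_and_score(train_idx, test_idx)."""
    return mean(fit_and_score(tr, te) for tr, te in kfold_indices(n, k))

folds = list(kfold_indices(10, 5))
print(folds[0])  # ([2, 3, 4, 5, 6, 7, 8, 9], [0, 1])

# Step 3 applied to the worked example: average the five fold scores.
fold_scores = [0.81, 0.79, 0.83, 0.82, 0.80]
print(round(mean(fold_scores), 2))  # 0.81
```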
Useful variants
| Variant | Use when |
|---|---|
| Stratified k-fold | classification with imbalanced classes (preserves proportions) |
| Leave-one-out (LOOCV) | very small datasets; slow (one fit per sample), high-variance estimate |
| Time-series CV | time-ordered data; split chronologically, never random |
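
Two of these variants can be illustrated with a minimal sketch. Both helper names are hypothetical. The stratified version simply deals each class out round-robin, and the time-series version uses an expanding training window, which is one common scheme among several.

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Assign each index to one of k folds, dealing each class out round-robin."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for j, i in enumerate(indices):
            folds[j % k].append(i)  # spreads every class evenly across folds
    return folds

def expanding_window_splits(n, n_splits, test_size):
    """Yield (train_idx, test_idx); every test index comes after all train indices."""
    if n_splits * test_size >= n:
        raise ValueError("not enough samples for the requested splits")
    for s in range(n_splits):
        test_start = n - (n_splits - s) * test_size
        yield list(range(test_start)), list(range(test_start, test_start + test_size))

labels = [0] * 8 + [1] * 2  # imbalanced: 80% class 0, 20% class 1
for fold in stratified_folds(labels, 2):
    print(sorted(labels[i] for i in fold))  # [0, 0, 0, 0, 1] for both folds

for train_idx, test_idx in expanding_window_splits(n=6, n_splits=2, test_size=2):
    print(train_idx, test_idx)
# [0, 1] [2, 3]
# [0, 1, 2, 3] [4, 5]
```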
Data leakage (the four common traps)
| Trap | Effect | Fix |
|---|---|---|
| Tuning on the test set | optimistic, contaminated | use validation or CV |
| Preprocess before splitting | test stats leak into train | split first; fit scaler on train only |
| Random folds on time-series | future used to predict past | chronological splits |
| Duplicates across train/test | model “memorizes”; inflated | de-duplicate first |
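
Two of the fixes (de-duplicate first, then fit preprocessing on the training set only) can be sketched as follows. The data and the helper names `fit_standardizer` and `standardize` are made up for illustration, and the last value simply stands in for held-out test data.

```python
from statistics import mean, pstdev

def fit_standardizer(values):
    """Learn mean and standard deviation from the given (training) values."""
    mu, sigma = mean(values), pstdev(values)
    return mu, (sigma if sigma > 0 else 1.0)  # avoid dividing by zero

def standardize(values, mu, sigma):
    """Apply previously fitted parameters; nothing is re-estimated here."""
    return [(v - mu) / sigma for v in values]

raw = [1.0, 2.0, 2.0, 3.0, 4.0, 5.0, 100.0]

# De-duplicate first so the same record cannot land on both sides of the split.
unique = list(dict.fromkeys(raw))  # [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]

# Split next.
train, test = unique[:5], unique[5:]

# Fit on train only, then reuse the same parameters for both sets.
mu, sigma = fit_standardizer(train)
train_scaled = standardize(train, mu, sigma)
test_scaled = standardize(test, mu, sigma)
print(mu)  # 3.0

# The leaky alternative: statistics computed before splitting absorb the test value.
leaky_mu, _ = fit_standardizer(unique)
print(round(leaky_mu, 2))  # 19.17
```

The gap between 3.0 and 19.17 shows how a test-set value shifts the scaling parameters when preprocessing runs before the split.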