Summary: Train, test, and cross-validation

An honest evaluation needs data the model never trained on, and on small data it needs more than one held-out split: k-fold cross-validation rotates through k held-out folds and averages the scores, giving a more stable test-error estimate than any single split can. This summary is the quick-scan version of the full lesson.

Core ideas

Training accuracy is fit, not generalization. The minimum honest evaluation is on data the model did not see during training.
The simple split: train on roughly 80 percent, evaluate once on the held-out 20 percent. The test set must stay untouched during training and tuning, otherwise the score is contaminated.
The three-way split: train fits the model, validation is used to tune hyperparameters, test is the one-shot final evaluation. Tuning on the test set is the single most common mistake.
Why one split is not enough. A single split can be lucky or unlucky depending on which points land in the test set; on small data the variance of the score is large.
k-fold cross-validation. Split the data into k folds (commonly 5 or 10); for each fold, train on the other k-1 folds and test on the held-out one; average the k scores. Every point is tested exactly once and trained on k-1 times, and the average is far more stable than a single split.
Useful variants. Stratified k-fold (preserves class proportions, default for imbalanced classification), leave-one-out (k = n, slow), and time-series CV (chronological splits; never random for time-ordered data).
Data leakage silently inflates scores: tuning on the test set, fitting preprocessing (such as scaling) on the full data before splitting, using future data to predict the past, and duplicates shared across train and test. Each looks like a better model and is not.
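The k-fold procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the helper names (k_fold_indices, nearest_neighbor_predict, cross_val_accuracy), the toy 1-D data, and the 1-nearest-neighbor model are all assumptions made for the example, and any model could stand in for the toy one.

```python
import random
from statistics import mean

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def nearest_neighbor_predict(train_X, train_y, x):
    """Toy 1-nearest-neighbor model on 1-D inputs (a stand-in for any model)."""
    best = min(range(len(train_X)), key=lambda i: abs(train_X[i] - x))
    return train_y[best]

def cross_val_accuracy(X, y, k=5, seed=0):
    """Return the k per-fold accuracies and their average."""
    folds = k_fold_indices(len(X), k, seed)
    scores = []
    for held_out in folds:
        held = set(held_out)
        # Train on every point that is NOT in the held-out fold.
        train_X = [X[i] for i in range(len(X)) if i not in held]
        train_y = [y[i] for i in range(len(X)) if i not in held]
        # Test only on the held-out fold.
        correct = sum(
            nearest_neighbor_predict(train_X, train_y, X[i]) == y[i]
            for i in held_out
        )
        scores.append(correct / len(held_out))
    return scores, mean(scores)

# Hypothetical toy data: label is 1 when x > 5, else 0.
rng = random.Random(42)
X = [rng.uniform(0, 10) for _ in range(50)]
y = [int(x > 5) for x in X]

folds = k_fold_indices(len(X), 5)
scores, avg = cross_val_accuracy(X, y, k=5)
print("per-fold accuracy:", scores)
print("mean accuracy:", avg)
```

The spread of the per-fold scores is worth looking at alongside the mean: it shows how much a single split could have misled you. For classification with imbalanced classes, or for time-ordered data, the random shuffle used here would need to be replaced by a stratified or chronological split.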

What changes for you

The right reflex when reading any model claim is now in place: ask on what data, evaluated how, with what tuning done where. The number alone is meaningless; the procedure that produced it is what counts. When you build your own models, the standard hygiene is: split first, fit preprocessing on training data only, use cross-validation for model selection, keep a held-out test set for the final number, and never let the test set influence any decision along the way. Avoiding the four leakage traps listed above is what keeps the reported score trustworthy. The next and final lesson of the track turns to what to measure with: when accuracy lies (and it often does on imbalanced problems), the confusion matrix and its derived metrics (precision, recall, ROC) tell the truth.
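The "split first, fit preprocessing on training only" rule can be illustrated with a small standard-library sketch. The function names (train_test_split_indices, fit_scaler, apply_scaler), the 80/20 fraction, and the synthetic data are assumptions chosen for the example; the point is only the order of operations.

```python
import random
from statistics import mean, pstdev

def train_test_split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle indices and return (train_indices, test_indices)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = max(1, round(n * test_fraction))
    return idx[n_test:], idx[:n_test]

def fit_scaler(values):
    """Learn the mean and standard deviation from the given values only."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # guard against zero spread
    return mu, sigma

def apply_scaler(values, scaler):
    """Standardize values with statistics learned elsewhere."""
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]

# Hypothetical synthetic feature values.
rng = random.Random(7)
data = [rng.gauss(50, 10) for _ in range(100)]

# 1. Split first.
train_idx, test_idx = train_test_split_indices(len(data))
train = [data[i] for i in train_idx]
test = [data[i] for i in test_idx]

# 2. Fit preprocessing on the training portion only.
scaler = fit_scaler(train)

# 3. Apply the same training statistics to both portions.
train_scaled = apply_scaler(train, scaler)
test_scaled = apply_scaler(test, scaler)
```

Calling fit_scaler on the full data before splitting would let test-set statistics shape the training inputs, which is the "preprocessing before splitting" leak. Inside cross-validation the same rule applies per fold: the scaler is refit on each fold's training portion.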