Practice: Train, test, and cross-validation

Self-check

Seven short questions. Try to answer each one before revealing the answer.

1. Why is reporting accuracy on the training data dishonest?

Show answer

Because the model was fit to that data, training accuracy measures fit, not generalization. A model can score perfectly on training by memorizing every example and still fail on anything new. The minimum honest evaluation is on data the model did not see during training.
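The memorization point can be made concrete with a toy sketch. Everything here is an assumption for illustration: the data is synthetic, the labels are coin flips (so there is no pattern to learn), and the "model" is a plain lookup table.

```python
import random

random.seed(0)

# Hypothetical toy data: labels are coin flips, so no real pattern exists.
def make_rows(n):
    return [(random.random(), random.randint(0, 1)) for _ in range(n)]

train = make_rows(200)
test = make_rows(200)

# A "model" that only memorizes: a lookup table from feature value to label.
memory = {x: y for x, y in train}

def predict(x):
    # Return the memorized label, or fall back to 0 for anything unseen.
    return memory.get(x, 0)

def accuracy(rows):
    return sum(predict(x) == y for x, y in rows) / len(rows)

train_accuracy = accuracy(train)
test_accuracy = accuracy(test)
print(f"train accuracy: {train_accuracy:.2f}")
print(f"test accuracy:  {test_accuracy:.2f}")
```

Training accuracy is perfect by construction, while test accuracy should land near chance, which is the gap the answer describes.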

2. What is the rule about the test set?

Show answer

It must be untouched during training and tuning. The moment you use it to compare models or adjust hyperparameters, you have contaminated it and its score is no longer a fair estimate of unseen performance. The test set is a one-shot evaluation.

3. Why introduce a validation set?

Show answer

To tune hyperparameters without touching the test set. Train fits the model, validation is used to compare hyperparameter settings, and test is the final one-shot evaluation of the chosen configuration.
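A minimal sketch of a three-way split, assuming a 60/20/20 division and a hypothetical helper name `three_way_split` (neither comes from the lesson itself):

```python
import random

def three_way_split(rows, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and cut it into train, validation, and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```

Hyperparameter settings would be compared on `val`; `test` is set aside and scored once, after the configuration is chosen.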

4. Describe k-fold cross-validation in one paragraph.

Show answer

Split the data into k folds of roughly equal size. For each fold, train on the other k - 1 folds and test on the held-out one, recording the score. Average the k scores. The result is a more stable estimate of generalization performance than a single split gives, and every data point is used for both training and testing (just never in the same round).
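The procedure can be sketched in a few lines of plain Python. The names `k_fold_indices`, `cross_validate`, and the `fit_and_score` callback are hypothetical, and the sketch assumes the rows have already been shuffled (contiguous folds on ordered data would not be representative).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Fold sizes differ by at most one when n_samples is not divisible by k.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

def cross_validate(n_samples, k, fit_and_score):
    """Average the score returned by fit_and_score(train_idx, test_idx) over k folds."""
    scores = [fit_and_score(tr, te) for tr, te in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores), scores

folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print("test:", test_idx)
# test: [0, 1, 2, 3]
# test: [4, 5, 6]
# test: [7, 8, 9]
```

Each index appears in exactly one test fold and in the training set of every other round.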

5. When should you use stratified k-fold?

Show answer

For classification, especially when classes are imbalanced. It keeps each fold’s class proportions close to the overall dataset’s, preventing folds that accidentally contain almost none of the rare class.
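One simple way to get this behavior, sketched under the assumption that dealing each class's samples round-robin across folds is good enough (library implementations are more careful, and would usually shuffle within each class first). The function name `stratified_fold_ids` and the 95/5 label mix are made up for illustration.

```python
from collections import Counter, defaultdict

def stratified_fold_ids(labels, k):
    """Return a fold id (0..k-1) per sample, dealing each class round-robin across folds."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    fold_of = [None] * len(labels)
    for indices in by_class.values():
        for position, index in enumerate(indices):
            fold_of[index] = position % k
    return fold_of

# Hypothetical imbalanced labels: 95 legitimate (0) and 5 fraud (1).
labels = [0] * 95 + [1] * 5
fold_of = stratified_fold_ids(labels, k=5)

rare_per_fold = Counter(f for f, y in zip(fold_of, labels) if y == 1)
print(sorted(rare_per_fold.items()))  # [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)]
```

With a purely random assignment, some folds could easily end up with zero fraud examples; here every fold gets one.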

6. Why is random k-fold the wrong tool for time-series data?

Show answer

Because random folds let the model train on points that come after the points it is predicting, which it would never have access to in deployment. Time-series cross-validation splits chronologically: train on the past, test on the next chunk.
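A chronological split can be sketched as an expanding window. This assumes the rows are already sorted oldest first; the helper name `expanding_window_splits` is hypothetical, and any rows left over after dividing into equal blocks are simply ignored in this sketch.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) where the test block always follows the training data."""
    # Assumes rows are sorted by time, oldest first.
    block = n_samples // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train_idx = list(range(0, i * block))
        test_idx = list(range(i * block, (i + 1) * block))
        yield train_idx, test_idx

splits = list(expanding_window_splits(n_samples=12, n_splits=3))
for train_idx, test_idx in splits:
    print("train up to", train_idx[-1], "-> test", test_idx)
# train up to 2 -> test [3, 4, 5]
# train up to 5 -> test [6, 7, 8]
# train up to 8 -> test [9, 10, 11]
```

In every round the latest training index is earlier than the earliest test index, so the model never sees the future.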

7. Name two ways data leakage commonly happens.

Show answer

Any two of: tuning hyperparameters on the test set, preprocessing (scaling/imputing) before splitting so test statistics inform training, using future data to predict the past in time-series, or duplicates that appear in both train and test.
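The duplicate case is the easiest of these to check mechanically. A minimal sketch, assuming each record can be represented as a hashable tuple (the rows below are invented):

```python
# Hypothetical rows: (customer_id, amount). Any hashable record representation works.
train_rows = [("c1", 12.50), ("c2", 80.00), ("c3", 7.25), ("c2", 80.00)]
test_rows = [("c4", 19.00), ("c2", 80.00), ("c5", 3.10)]

# Records that appear on both sides of the split.
overlap = set(train_rows) & set(test_rows)
print("rows in both train and test:", sorted(overlap))

# One possible fix: drop the overlapping rows from the training set.
clean_train = [row for row in train_rows if row not in overlap]
```

Deduplicating before the split is usually the cleaner option; this check is a safety net for data that has already been split.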

Try it yourself: average the CV scores

A 5-fold cross-validation produced these per-fold accuracies:

fold 1: 0.81
fold 2: 0.79
fold 3: 0.83
fold 4: 0.82
fold 5: 0.80

Compute the cross-validated estimate, and say in one sentence why it is more trustworthy than any single fold’s score.

Show answer

sum = 0.81 + 0.79 + 0.83 + 0.82 + 0.80 = 4.05
average = 4.05 / 5 = 0.81

Cross-validated accuracy = 0.81.

It is more trustworthy than any single fold’s score because it averages over 5 different held-out sets. The spread you see across folds (0.79 to 0.83) is exactly the wobble that one random split would have hidden inside a single number. Averaging cancels much of that randomness, so the 0.81 is a more stable estimate of how the model performs on unseen data.
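The same arithmetic in code, with the fold-to-fold standard deviation added as one common way to report the spread (the exercise itself only asks for the mean):

```python
from statistics import mean, stdev

fold_scores = [0.81, 0.79, 0.83, 0.82, 0.80]

cv_mean = mean(fold_scores)
cv_std = stdev(fold_scores)  # sample standard deviation across the 5 folds

print(f"CV accuracy: {cv_mean:.2f} +/- {cv_std:.3f}")  # CV accuracy: 0.81 +/- 0.016
```

Reporting the spread next to the mean shows how much a single split could have moved the number.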

Try it yourself: spot the leakage

A team builds a fraud classifier on 1,000 transactions. They follow this procedure:

1. Standardize all 1,000 records (subtract mean, divide by std).
2. Split the standardized data into 800 train / 200 test.
3. Train the model on the 800 training records.
4. Evaluate on the 200 test records. Report 95% accuracy.

What is wrong with this procedure, and what is the fix?

Show answer

The preprocessing happens before the split, so the mean and standard deviation used to standardize were computed over all 1,000 records, including the 200 that became the test set. Information from the test set has therefore leaked into the scaling applied to the training data, and the test records were scaled with statistics they helped compute. The 95% figure is likely optimistic and is not a fair estimate of performance on truly unseen data.

The fix: split first, then fit the scaler on the 800 training records only (compute its mean and std), and apply that fitted scaler to the test records (without recomputing). The test data is now scaled with statistics it never contributed to, and the reported accuracy is honest. The same rule applies to every preprocessing step (imputation, encoding, feature selection): fit on training, apply to test.
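The corrected order of operations, sketched with the standard library only. The single `amounts` feature and its synthetic values are assumptions for illustration; a real pipeline would do the same thing per feature.

```python
import random
from statistics import mean, pstdev

random.seed(0)
# Hypothetical single feature (transaction amount) for 1,000 records.
amounts = [random.lognormvariate(3, 1) for _ in range(1000)]

# 1. Split first.
random.shuffle(amounts)
train, test = amounts[:800], amounts[800:]

# 2. Fit the scaler on the 800 training records only.
train_mean = mean(train)
train_std = pstdev(train)

def standardize(values):
    # Apply the training statistics; nothing is recomputed from `values`.
    return [(v - train_mean) / train_std for v in values]

# 3. Apply the same fitted scaler to both sets.
train_scaled = standardize(train)
test_scaled = standardize(test)

print(f"train mean after scaling: {mean(train_scaled):.3f}")
print(f"test mean after scaling:  {mean(test_scaled):.3f}")
```

The scaled test set will generally not have mean exactly 0 or standard deviation exactly 1. That is expected: it was scaled with statistics it did not contribute to.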

Flashcards

Ten cards.

Q. Why is training accuracy not a measure of generalization?

The model was fit to that data, so high training accuracy can come from memorization rather than learning the pattern. Generalization is judged on data the model did not see.

Q. What is the rule about the test set?

It must be untouched during training and tuning, and used only for a one-shot final evaluation. Any use during model selection contaminates it.

Q. What is a validation set for?

Tuning hyperparameters without touching the test set: train fits the model, validation compares settings, test is the final unbiased check.

Q. Describe k-fold cross-validation.

Split data into k folds, train on k-1 and test on the held-out one for each round, then average the k test scores. Stable estimate; every point used for both training and testing.

Q. What problem does cross-validation solve over a single train/test split?

A single split can be lucky or unlucky; its score has high variance, especially on small data. CV averages over many splits to give a more stable estimate.

Q. When should you use stratified k-fold?

For classification with imbalanced classes; it keeps class proportions consistent across folds so each fold is representative.

Q. Why is random k-fold wrong for time-series?

It lets the model train on future data to predict the past, which is unavailable in deployment. Use chronological splits: train on the past, test on the next chunk.

Q. Give one example of data leakage.

Any of: tuning on the test set; preprocessing before splitting; using future data in time-series; duplicate samples appearing in train and test.

Q. What is the rule for preprocessing when the data is split?

Fit on training data only; apply (do not refit) to validation and test. Otherwise test statistics leak into training and inflate scores.

Q. What question should you ask of any reported model accuracy?

On what data, evaluated how, and with what tuning done where? Accuracy figures reported without that context often shrink once those questions are answered.