Practice: Overfitting and the bias-variance tradeoff
Self-check
Seven short questions. Try to answer each one before opening the collapsible.
1. What is bias, in one sentence?
Show answer
How badly the model misses the true pattern even given infinite training data. High bias means the model is too simple to capture what is happening; it underfits.
2. What is variance, in one sentence?
Show answer
How much the model’s predictions change if you train it on a different sample of data. High variance means the model is too sensitive to its specific training set; it overfits, chasing noise.
3. Why is it called a “tradeoff”?
Show answer
Because making a model more flexible lowers bias but raises variance, and making it simpler does the reverse. With model complexity as the only knob, you cannot push both down at once; the goal is to pick the complexity at which total error is smallest.
4. What does the typical total-error-vs-complexity curve look like?
Show answer
A U-shape. Error is high on the left (too simple, underfitting), high on the right (too complex, overfitting), and lowest somewhere in the middle (the sweet spot).
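The U-shape can be reproduced in a few lines. The sketch below is only an illustration under assumed conditions that are not in the original text: a sine-wave target, Gaussian noise, and k-nearest-neighbour regression, where a small `k` plays the role of a flexible model and a large `k` a rigid one. All function names are hypothetical.

```python
import math
import random

def true_f(x):
    # Assumed "true pattern" for the demo.
    return math.sin(2 * math.pi * x)

def sample(rng, n, noise=0.3):
    """Draw n (x, y) pairs with y = true_f(x) + Gaussian noise."""
    return [(x, true_f(x) + rng.gauss(0.0, noise))
            for x in (rng.random() for _ in range(n))]

def knn_predict(train, x0, k):
    """Average the targets of the k training points nearest to x0."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x0))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, data, k):
    """Mean squared error of the k-NN model (fit on `train`) over `data`."""
    return sum((y - knn_predict(train, x, k)) ** 2 for x, y in data) / len(data)

rng = random.Random(0)
train = sample(rng, 100)
test = sample(rng, 500)

# Large k = rigid model (left of the U); small k = flexible model (right of the U).
errors = {k: (mse(train, train, k), mse(train, test, k)) for k in (100, 30, 10, 3, 1)}
for k, (train_err, test_err) in errors.items():
    print(f"k={k:3d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```

Reading the printed rows from `k=100` down to `k=1` moves from left to right along the curve: training error keeps falling (to zero at `k=1`, where every point is its own nearest neighbour), while test error should dip at a moderate `k` and rise again.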
5. State the train-vs-test diagnostic for high bias, high variance, and a good fit.
Show answer
High bias (underfitting): training error high, test error high (similar). High variance (overfitting): training error low, test error high (big gap). Good fit: training error low, test error low (small gap).
6. What does regularization do, and name two examples for linear models.
Show answer
It penalizes model complexity, accepting a small increase in bias in exchange for lower variance. Ridge (L2) adds a penalty proportional to the sum of squared coefficients; lasso (L1) adds a penalty proportional to the sum of absolute coefficients (and can drive some to exactly zero, performing feature selection).
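To see the difference between the two penalties, here is a minimal sketch for the simplest possible case: one feature, no intercept. The objective functions written in the docstrings are an assumed scaling (conventions differ between libraries), and the function names are hypothetical.

```python
def _sums(xs, ys):
    # Shared sufficient statistics for the one-feature, no-intercept case.
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy, sxx

def ols_1d(xs, ys):
    """Minimizer of 0.5 * sum((y - w*x)^2): no penalty."""
    sxy, sxx = _sums(xs, ys)
    return sxy / sxx

def ridge_1d(xs, ys, lam):
    """Minimizer of 0.5 * sum((y - w*x)^2) + 0.5 * lam * w^2."""
    sxy, sxx = _sums(xs, ys)
    return sxy / (sxx + lam)  # shrinks toward zero, never exactly zero

def lasso_1d(xs, ys, lam):
    """Minimizer of 0.5 * sum((y - w*x)^2) + lam * |w| (soft-thresholding)."""
    sxy, sxx = _sums(xs, ys)
    if sxy > lam:
        return (sxy - lam) / sxx
    if sxy < -lam:
        return (sxy + lam) / sxx
    return 0.0  # penalty dominates: coefficient is exactly zero

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
for lam in (0.0, 14.0, 40.0):
    print(lam, ridge_1d(xs, ys, lam), lasso_1d(xs, ys, lam))
```

As `lam` grows, the ridge coefficient shrinks smoothly but stays non-zero, while the lasso coefficient hits exactly zero once the penalty outweighs the feature's correlation with the target. That is the feature-selection behaviour mentioned in the answer.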
7. Why is “irreducible noise” important to remember?
Show answer
Because no model can do better than the inherent randomness in the data. The lowest possible expected test error is bounded below by it; trying to push test error under that floor is chasing ghosts.
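The three quantities in these answers fit together in one standard identity for squared-error loss. As a sketch, assume the data follow y = f(x) + ε with zero-mean noise of variance σ², and let f̂ be the model fitted to a randomly drawn training set; these symbols are introduced here for illustration and do not appear in the original text. The expectation is over training sets and noise:

```latex
\mathbb{E}\big[(y - \hat f(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat f(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat f(x) - \mathbb{E}[\hat f(x)])^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Model choices only affect the first two terms; the third stays the same whatever you fit, which is why it acts as a floor.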
Try it yourself: diagnose from the numbers
For each scenario, name whether the model is suffering from high bias, high variance, or is at a good fit, and say one thing you would try next.
A. training error = 30%, test error = 32% (similar, both high)
B. training error = 2%, test error = 25% (big gap)
C. training error = 5%, test error = 7% (small gap)
Show answer
- A: HIGH BIAS, underfitting. The model cannot even learn the training data well. Next: use a more flexible model (deeper tree, more features, polynomial terms, a less-aggressive regularizer).
- B: HIGH VARIANCE, overfitting. The model memorized the training data (2%) but fails on new data (25%). Next: simplify the model (shallower tree, fewer features, stronger regularization), collect more training data, or average several models (e.g., a random forest).
- C: GOOD FIT. Both errors are low and close together. Next: ship it (or, if your problem demands better, accept that the irreducible noise floor may be limiting how much further you can push).
That three-way reading of training error against test error is one of the highest-leverage diagnostics in applied machine learning.
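The same reading can be written as a tiny rule of thumb. This is only a sketch: the cutoffs `high` and `gap` are hypothetical values chosen to match the three scenarios above, and sensible thresholds depend on the problem and its noise floor.

```python
def diagnose(train_err, test_err, high=0.15, gap=0.10):
    """Label a (training error, test error) pair.

    `high` and `gap` are illustrative, hypothetical cutoffs,
    not standard values.
    """
    if train_err >= high:
        # The model cannot even fit its own training data.
        return "high bias"
    if test_err - train_err >= gap:
        # Fits the training data well but does not generalize.
        return "high variance"
    return "good fit"

for name, train_err, test_err in [("A", 0.30, 0.32), ("B", 0.02, 0.25), ("C", 0.05, 0.07)]:
    print(name, diagnose(train_err, test_err))
```

The check for high training error comes first on purpose: when the model underfits, the size of the gap tells you little.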
Try it yourself: place the method on the spectrum
For each model, say where it tends to sit on the bias-variance spectrum by default, and which problem it is more likely to suffer from.
A. A simple linear regression with 3 features
B. An unpruned decision tree of depth 30
C. A random forest of 500 deep trees
D. AdaBoost run for 10,000 rounds with no early stopping
Show answer
- A: Low variance, high bias. Linear regression is a simple model; it is stable across samples (low variance) but cannot capture complex patterns (high bias). More likely to underfit on non-linear problems.
- B: Low bias, high variance. A deep, unpruned tree can fit almost anything (low bias) but is famously unstable (high variance). More likely to overfit.
- C: Low bias, low variance. A forest averages many high-variance trees, sharply lowering variance while keeping bias low. The textbook variance-reducer; usually generalizes well.
- D: Variance creeping back up. Boosting reduces bias step by step, but pushed too far the ensemble starts fitting noise. More likely to overfit with too many rounds; needs early stopping or a smaller learning rate.
Looking at the toolbox this way, most classical ML choices are moves along the same bias-variance spectrum.
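The variance-reducing effect of averaging (answer C) can be checked with a small simulation. To stay dependency-free, the sketch below swaps deep trees for a 1-nearest-neighbour predictor, another low-bias, high-variance model, and bags it over bootstrap resamples. The data-generating function, the sample sizes, and all names are assumptions made for the demo.

```python
import random
import statistics

def true_f(x):
    # Assumed "true pattern" for the demo.
    return 2.0 * x

def make_training_set(rng, n=30, noise=1.0):
    """Draw a fresh noisy training set of n (x, y) pairs."""
    return [(x, true_f(x) + rng.gauss(0.0, noise))
            for x in (rng.uniform(0.0, 1.0) for _ in range(n))]

def predict_1nn(train, x0):
    """High-variance base model: copy the target of the nearest point."""
    return min(train, key=lambda p: abs(p[0] - x0))[1]

def predict_bagged_1nn(train, x0, rng, n_models=25):
    """Bagging: average 1-NN predictions over bootstrap resamples."""
    preds = []
    for _ in range(n_models):
        boot = [rng.choice(train) for _ in train]  # sample with replacement
        preds.append(predict_1nn(boot, x0))
    return sum(preds) / n_models

def prediction_variance(n_trials=300, x0=0.5, seed=0):
    """Variance of each model's prediction at x0 across fresh training sets."""
    rng = random.Random(seed)
    single, bagged = [], []
    for _ in range(n_trials):
        train = make_training_set(rng)
        single.append(predict_1nn(train, x0))
        bagged.append(predict_bagged_1nn(train, x0, rng))
    return statistics.pvariance(single), statistics.pvariance(bagged)

var_single, var_bagged = prediction_variance()
print(f"single 1-NN variance: {var_single:.3f}")
print(f"bagged 1-NN variance: {var_bagged:.3f}")
```

The bagged predictor's variance should come out clearly lower than the single model's. A real random forest also decorrelates its trees by subsampling features at each split, which this sketch does not attempt.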
Flashcards
Ten cards, each a question followed by its answer.
Q. What is bias?
How badly the model misses the true pattern even given infinite data: too-simple-to-capture-the-pattern. High bias = underfitting.
Q. What is variance?
How much the model’s predictions change when trained on a different sample: too-sensitive-to-the-specific-data. High variance = overfitting.
Q. Why is there a tradeoff?
Increasing model flexibility lowers bias but raises variance, and vice versa. You cannot lower both at once; you balance them.
Q. What does total generalization error vs complexity look like?
A U-shape: high at both ends (underfit on the left, overfit on the right), with a minimum somewhere in the middle.
Q. Train/test diagnostic for high bias?
Training error high, test error high (similar). The model cannot even learn the training data; underfitting.
Q. Train/test diagnostic for high variance?
Training error low, test error high (big gap). The model memorized the training set; overfitting.
Q. Train/test diagnostic for a good fit?
Training error low, test error low (small gap). Sweet spot.
Q. What is regularization, and name two types?
Penalizing model complexity to lower variance. Ridge (L2: sum of squared coefficients) and lasso (L1: sum of absolute coefficients; can zero some out).
Q. Why is irreducible noise important?
It is the inherent randomness in the data; no model can do better. The lowest possible test error is bounded by it.
Q. In bias-variance terms, what do random forests and boosting each reduce?
Random forests (bagging) primarily reduce variance by averaging many high-variance trees. Boosting primarily reduces bias by chaining weak (high-bias) learners.