Lesson: Overfitting and the bias-variance tradeoff

Lesson 1 planted a rule we have returned to in nearly every lesson since: a model is only as good as its performance on data it has never seen. We have built a substantial toolbox around it (regression, classification, ensembles, clustering, compression), all the while waving at words like “overfitting” without ever defining them precisely. This phase pays that debt. This lesson formalizes what can go wrong, and the next two show how to measure it honestly.

There are two distinct ways a model can fail to generalize. They are different, they pull in opposite directions, and the single most useful skill in applied machine learning is recognizing which one you are facing right now. They have technical names worth learning: bias and variance.

Bias is how badly the model misses the true pattern even when it has all the data it could ever want. A model with high bias is too simple to capture what is going on. A straight line fit to clearly curved data has high bias: no matter how many points you give it, a line is the wrong shape, and predictions will be systematically off. High bias is the technical name for underfitting.

Variance is how much the model’s predictions change if you train it on a different sample of data. A model with high variance is too sensitive to the specific examples it saw. A wildly twisty curve fit through every training point will look completely different on a fresh sample, because it is chasing the noise. High variance is the technical name for overfitting.

The expected squared error a model makes on new data breaks into three pieces:

total error = bias^2 + variance + irreducible noise

The third piece (irreducible noise) is the noise in the world itself, the inherent randomness no model can predict away. You cannot do anything about it. The first two are what you control, and they are where the work happens.
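For squared-error loss, the decomposition above can be written out term by term. The notation is an assumption added here for illustration: the data follow y = f(x) + ε with zero-mean noise of variance σ², and \hat f_D is the model obtained by training on a random sample D.

```latex
\mathbb{E}_{D,\varepsilon}\!\left[\bigl(y - \hat f_D(x)\bigr)^2\right]
  = \underbrace{\bigl(\mathbb{E}_D[\hat f_D(x)] - f(x)\bigr)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_D\!\left[\bigl(\hat f_D(x) - \mathbb{E}_D[\hat f_D(x)]\bigr)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Bias compares the average prediction with the truth, which is why it enters squared. Variance measures how far individual predictions scatter around that average.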

Here is the friction the lesson is named for. Making a model more flexible (deeper trees, more parameters, higher-degree polynomials) lowers bias (it can capture more complex patterns) and raises variance (it gets more sensitive to the specific training sample). Making a model simpler does the opposite: lower variance, higher bias. You cannot just turn both knobs down at once; reducing one tends to raise the other. The art is finding the spot where their sum is lowest.
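One way to see the two knobs move is to train the same kind of model on many fresh samples and measure both quantities directly. The sketch below is illustrative only. The ground-truth function, the noise level, the ten evenly spaced inputs, and all function names are assumptions chosen for the demo. It compares a straight line with a polynomial forced through every training point.

```python
import math
import random

def true_f(x):
    """Assumed ground truth: half a period of a sine wave on [-1, 1]."""
    return math.sin(math.pi * (x + 1) / 2)

def fit_line(xs, ys):
    """Closed-form least-squares straight line; returns a predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return lambda x: my + slope * (x - mx)

def fit_interpolant(xs, ys):
    """Lagrange polynomial through every training point (very flexible)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            weight = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    weight *= (x - xj) / (xi - xj)
            total += yi * weight
        return total
    return predict

def bias_variance(fit, n_points=10, noise_sd=0.2, n_trials=300, seed=0):
    """Estimate average bias^2 and variance over a grid of query points."""
    rng = random.Random(seed)
    xs = [-1 + 2 * i / (n_points - 1) for i in range(n_points)]
    queries = [-0.95 + 0.1 * i for i in range(20)]
    preds = [[] for _ in queries]
    for _ in range(n_trials):
        # A "different sample" here means fresh noise on the same inputs.
        ys = [true_f(x) + rng.gauss(0, noise_sd) for x in xs]
        model = fit(xs, ys)
        for k, q in enumerate(queries):
            preds[k].append(model(q))
    bias_sq = var = 0.0
    for q, p in zip(queries, preds):
        mean_p = sum(p) / len(p)
        bias_sq += (mean_p - true_f(q)) ** 2
        var += sum((v - mean_p) ** 2 for v in p) / len(p)
    return bias_sq / len(queries), var / len(queries)

line_bias_sq, line_var = bias_variance(fit_line)
interp_bias_sq, interp_var = bias_variance(fit_interpolant)
print(f"straight line: bias^2={line_bias_sq:.4f}  variance={line_var:.4f}")
print(f"interpolant  : bias^2={interp_bias_sq:.4f}  variance={interp_var:.4f}")
```

Under these assumptions the line should show the larger bias term and the interpolant the larger variance term. The exact figures depend on the seed and the noise level.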

Picture fitting a curve to ten noisy points that came from an underlying sine wave:

straight line        -> high bias (misses the curve), low variance (stable)      -> UNDERFITS
degree-2 polynomial  -> lower bias (captures curvature), low variance            -> GOOD FIT
degree-10 polynomial -> low bias (passes through every point), HUGE variance     -> OVERFITS the noise

Plot total error against model complexity and you get a U-shape: high at both ends (underfit on the left, overfit on the right), with a minimum in the middle. The sweet spot is what you are trying to find for any given problem.
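The U-shape can be reproduced with a small simulation. This is a hedged sketch, not a reference implementation. The half-period sine target, the noise level, and the helper names (`solve`, `polyfit`, `mse`, `average_errors`) are all assumptions. Degree 9 stands in for the most flexible curve because it is the highest degree that ten points determine uniquely. The fit uses plain normal equations, which is adequate at this scale but not how a numerical library would do it.

```python
import math
import random

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients, lowest power first."""
    k = degree + 1
    a = [[sum(x ** (i + j) for x in xs) for j in range(k)] for i in range(k)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(k)]
    return solve(a, b)

def mse(coeffs, xs, ys):
    """Mean squared error of the polynomial on the given points."""
    return sum((sum(c * x ** p for p, c in enumerate(coeffs)) - y) ** 2
               for x, y in zip(xs, ys)) / len(xs)

def true_f(x):
    """Assumed ground truth: half a period of a sine wave on [-1, 1]."""
    return math.sin(math.pi * (x + 1) / 2)

def average_errors(degree, n_train=10, n_test=50, noise_sd=0.2,
                   n_trials=200, seed=0):
    """Average training and test MSE over many freshly drawn noisy samples."""
    rng = random.Random(seed)
    train_x = [-1 + 2 * i / (n_train - 1) for i in range(n_train)]
    train_total = test_total = 0.0
    for _ in range(n_trials):
        train_y = [true_f(x) + rng.gauss(0, noise_sd) for x in train_x]
        test_x = [rng.uniform(-1, 1) for _ in range(n_test)]
        test_y = [true_f(x) + rng.gauss(0, noise_sd) for x in test_x]
        coeffs = polyfit(train_x, train_y, degree)
        train_total += mse(coeffs, train_x, train_y)
        test_total += mse(coeffs, test_x, test_y)
    return train_total / n_trials, test_total / n_trials

errors = {degree: average_errors(degree) for degree in (1, 2, 9)}
for degree, (train_err, test_err) in errors.items():
    print(f"degree {degree}: train MSE={train_err:.4f}  test MSE={test_err:.4f}")
```

Training error can only fall as the degree rises, because each higher-degree family contains the lower ones. Test error should be lowest at the middle degree under these assumptions.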

The diagnostic: training error vs test error

Here is the practical payoff, and the core skill of this lesson. You cannot read bias and variance off the model directly. But you can diagnose which one is hurting you by comparing the model’s error on the training data (data it learned from) against its error on a held-out test set (data it never saw). The pattern tells you which side of the U you are on.

training error HIGH, test error HIGH (similar) -> HIGH BIAS, underfitting
training error LOW, test error HIGH (big gap) -> HIGH VARIANCE, overfitting
training error LOW, test error LOW (small gap) -> GOOD FIT

That three-row pattern is one of the most useful in machine learning. If training error is high, the model cannot even fit the data it has: you are underfitting and need a more flexible model. If training error is great but test error is terrible, the model has memorized the training set without learning the underlying pattern: you are overfitting and need a simpler model, more data, or stronger regularization. If both are low and close together, you are in the sweet spot.
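The three-row table can be written as a tiny helper. The function name `diagnose` and both thresholds are hypothetical. What counts as "high" error or a "big" gap depends on the problem and the metric, so treat the numbers in the usage lines as placeholders.

```python
def diagnose(train_error, test_error, acceptable_error, max_gap):
    """Map a (training error, test error) pair onto the diagnostic table.

    acceptable_error and max_gap are problem-specific assumptions:
    there is no universal value for "high" or for "big gap".
    """
    if train_error > acceptable_error:
        # The model cannot even fit the data it was trained on.
        return "high bias (underfitting): try a more flexible model"
    if test_error - train_error > max_gap:
        # Fits the training data well but does not carry over to new data.
        return "high variance (overfitting): simplify, regularize, or add data"
    return "good fit"

print(diagnose(0.30, 0.32, acceptable_error=0.10, max_gap=0.05))
print(diagnose(0.01, 0.25, acceptable_error=0.10, max_gap=0.05))
print(diagnose(0.04, 0.06, acceptable_error=0.10, max_gap=0.05))
```

The training-error check comes first on purpose. When a model already fails on its own training data, the size of the gap says little until that is fixed.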

How each method we have seen sits on the spectrum

The bias-variance tradeoff is also the right lens for the toolbox this track has built:

  • Linear and logistic regression are simple, low-variance models that can underfit complex problems. The classic high-bias end.
  • Deep, unpruned decision trees are highly flexible, low-bias, and famously high-variance. The classic overfit-by-default end.
  • Random forests are the textbook variance-reducer: take many high-variance trees and average them, and the variance falls sharply while bias stays low. Bagging is a variance-reduction move.
  • Boosting is the textbook bias-reducer: chain weak (high-bias) learners so each one fixes the residuals, lowering bias step by step. The flip side is that boosting can overfit if pushed too far: the variance starts to creep back up.
  • Support vector machines with their soft-margin parameter C give you a direct dial: smaller C means wider margin, lower variance, higher bias; larger C means tighter margin, higher variance, lower bias.

Looked at this way, much of classical machine learning is the same diagram: a family of models with a complexity knob, and a search for the spot on the U-curve that minimizes generalization error.
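The averaging argument behind bagging can be checked numerically. The sketch below is not a random forest. As a labeled substitution, a 1-nearest-neighbor lookup stands in for the deep tree as the high-variance base learner, so the example needs no tree code. The target function, the query point 0.25, the sample sizes, and the function names are all assumptions.

```python
import math
import random
import statistics

def nearest_neighbor_predict(train, query):
    """1-nearest-neighbor regression: return the label of the closest x."""
    return min(train, key=lambda pair: abs(pair[0] - query))[1]

def bagged_predict(train, query, rng, n_models=25):
    """Average 1-NN predictions over bootstrap resamples of the training set."""
    preds = []
    for _ in range(n_models):
        resample = [rng.choice(train) for _ in train]  # sample with replacement
        preds.append(nearest_neighbor_predict(resample, query))
    return sum(preds) / n_models

def prediction_variance(predict, n_train=30, noise_sd=0.3, n_trials=300, seed=0):
    """Variance of the prediction at one query point across fresh training sets."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(n_trials):
        train = []
        for _ in range(n_train):
            x = rng.uniform(-1, 1)
            train.append((x, math.sin(math.pi * x) + rng.gauss(0, noise_sd)))
        outputs.append(predict(train, 0.25, rng))
    return statistics.pvariance(outputs)

single_var = prediction_variance(
    lambda train, query, rng: nearest_neighbor_predict(train, query))
bagged_var = prediction_variance(bagged_predict)
print(f"single 1-NN variance: {single_var:.4f}")
print(f"bagged 1-NN variance: {bagged_var:.4f}")
```

The bagged predictor should vary noticeably less from one training set to the next. Each bootstrap model latches onto a slightly different neighbor, and averaging smooths out their individual noise.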

The dedicated tool for nudging a model away from the high-variance end is regularization: techniques that penalize the model for being too complex. The two best-known cases live with linear and logistic regression, and they are simple enough to name:

  • Ridge (L2) regression adds a penalty proportional to the sum of squared coefficients. It shrinks all coefficients toward zero without making any of them exactly zero, lowering variance at the cost of a little bias.
  • Lasso (L1) regression adds a penalty proportional to the sum of the absolute values of the coefficients. It also shrinks coefficients but can drive some all the way to zero, effectively performing feature selection.

The intuition is the same in both: pay a price for complexity, and the model leans simpler. Regularization is the standard dial in linear models for moving leftward on the U-curve when you suspect overfitting.
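The difference between the two penalties is easiest to see in a simplified setting. The sketch below assumes an orthonormal design, where each coefficient can be penalized independently of the others. Real regression problems rarely satisfy this, so the closed forms here are for intuition only. The function names, the sample coefficients, and the penalty value are made up for the example.

```python
def ridge_shrink(ols_coef, penalty):
    """Ridge solution for one coefficient under an orthonormal design.

    Minimizes 0.5 * (b - ols_coef)**2 + 0.5 * penalty * b**2.
    """
    return ols_coef / (1.0 + penalty)

def lasso_shrink(ols_coef, penalty):
    """Lasso solution (soft-thresholding) under the same assumption.

    Minimizes 0.5 * (b - ols_coef)**2 + penalty * abs(b).
    """
    magnitude = abs(ols_coef) - penalty
    if magnitude <= 0:
        return 0.0  # small coefficients are removed entirely
    return magnitude if ols_coef > 0 else -magnitude

ols = [3.0, -1.5, 0.4, -0.2]   # hypothetical unpenalized coefficients
penalty = 0.5                   # hypothetical penalty strength

ridge = [ridge_shrink(b, penalty) for b in ols]
lasso = [lasso_shrink(b, penalty) for b in ols]
print("ridge:", ridge)  # every coefficient scaled by 1 / 1.5, none exactly zero
print("lasso:", lasso)  # [2.5, -1.0, 0.0, 0.0]
```

Ridge rescales every coefficient by the same factor. Lasso subtracts a fixed amount from each, which is why the small ones land exactly on zero.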

A historical note worth flagging, because it crosses your path everywhere these days. The classic bias-variance U-shape was established on models like the ones in this track, and it holds for them. Very large neural networks behave more strangely. Past a certain size, test error can fall a second time even though the model is already large enough to fit the training data almost perfectly, a phenomenon called double descent. The classical framework still organizes everyone’s thinking, but it is not the whole story at deep-learning scale. For the models in this track, the U-curve is right.

When you read that someone “tuned a model,” what they almost always did was move it along this curve, finding the complexity that gave the best generalization. When you tune one yourself, you are doing the same. The training-vs-test-error diagnostic is the most actionable number-reading skill in applied machine learning: it tells you what to do next. High training error means add complexity; low training but high test error means take it away or get more data. That ability to look at two numbers and know which lever to pull is the difference between fumbling and engineering.

Common pitfalls

  • Reading training accuracy as model quality. It is fit, not generalization. A model can perfectly predict its training data and be useless on anything new.
  • Fixing only one side. More data fights variance but does little for bias; a more complex model fights bias but raises variance. Diagnose first.
  • Forgetting the irreducible-noise floor. No model can do better than the inherent noise in the data. Chasing error below that floor is chasing ghosts.
  • Trusting a single train/test split. One split can be lucky or unlucky, which is exactly the problem the next lesson on cross-validation solves.

Key takeaways

  • Bias and variance are two different ways a model fails to generalize. Bias is underfitting (too simple); variance is overfitting (too sensitive to the sample).
  • They trade off. Lowering one usually raises the other; total generalization error is a U-shape in model complexity.
  • Diagnose from training and test error. High training + high test = high bias; low training + high test (big gap) = high variance; low training + low test = good fit.
  • Regularization (ridge, lasso) penalizes complexity. It is the standard dial for moving toward the low-variance side.

We can name what we want (the sweet spot in the U-curve) and read the diagnostic (training versus test error). But for that diagnostic to work, the test error has to be honest: an estimate of how the model performs on truly unseen data. A single train/test split is one estimate, and one estimate can be lucky or unlucky. The next lesson is about doing the holdout right, with train/test splits, validation sets, and cross-validation.