Lesson: Wisdom of crowds: random forests

The last lesson ended on a problem. A single decision tree is wonderfully interpretable but unstable: change a little data and the whole tree reshuffles, and left unchecked it overfits. You could spend a lot of effort taming one tree. The random forest takes a different and almost suspiciously simple route, built on an idea much older than machine learning: the wisdom of crowds.

Ask one expert and you get one opinion, which might be sharp or might be idiosyncratic. Ask hundreds of reasonably good, reasonably independent experts and average their answers, and the individual quirks tend to cancel out while the real signal, the thing they mostly agree on, comes through. A random forest is exactly that crowd, except each expert is a decision tree.

A random forest is an ensemble: a model made by combining many smaller models. Here the smaller models are decision trees, hundreds of them. To make a prediction, you run the example through every tree and combine their answers:

  • For classification, the forest takes the majority vote of its trees.
  • For regression, it takes the average of their predictions.
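
The two combining rules are small enough to sketch directly. This is a minimal illustration, not a library implementation: the function names `forest_classify` and `forest_regress` are made up for this example, and it assumes each tree's answer has already been collected into a list.

```python
from collections import Counter
from statistics import mean

def forest_classify(tree_votes):
    """Return the label chosen by the most trees (majority vote).

    Ties go to whichever tied label appeared first in the list,
    an arbitrary choice made for this sketch.
    """
    return Counter(tree_votes).most_common(1)[0][0]

def forest_regress(tree_predictions):
    """Return the plain average of the trees' numeric predictions."""
    return mean(tree_predictions)

# Hypothetical outputs from three trees for one example.
print(forest_classify(["cat", "dog", "cat"]))
print(forest_regress([2.0, 3.0, 4.0]))
```

Everything else about a forest is about how the individual trees get built; the combining step itself stays this simple.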

That is the whole structure. The interesting part is how you get hundreds of different trees, because a crowd of identical trees is no wiser than one tree. The forest manufactures diversity on purpose, in two ways.

Bagging (bootstrap aggregating). Instead of training every tree on the full dataset, each tree gets its own bootstrap sample: a random sample of the training data, drawn with replacement, the same size as the original. Drawing with replacement means some rows appear more than once and others not at all, so every tree sees a slightly different dataset. That alone makes the trees disagree in useful ways.

To make this concrete: if the training set has rows numbered 1 to 5, one tree’s bootstrap sample might come out as the multiset 1, 1, 3, 4, 4. Rows 1 and 4 got drawn twice; rows 2 and 5 were never drawn at all. The next tree gets a different draw, and so on. Those left-out rows (2 and 5 here) are not wasted: they become this tree’s out-of-bag examples, which we will use in a moment for a free accuracy check.
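
A bootstrap draw like this one takes only a few lines to simulate. The sketch below uses Python's standard `random` module; the helper name `bootstrap_sample` is invented for illustration, and the fixed seed is there only to make the run repeatable.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement.

    Returns the sample (duplicates allowed) and the out-of-bag rows,
    meaning the rows that were never drawn.
    """
    sample = [rng.choice(rows) for _ in rows]
    drawn = set(sample)
    out_of_bag = [row for row in rows if row not in drawn]
    return sample, out_of_bag

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
rows = [1, 2, 3, 4, 5]
for tree_index in range(3):
    sample, out_of_bag = bootstrap_sample(rows, rng)
    print(f"tree {tree_index}: sample={sorted(sample)} out-of-bag={out_of_bag}")
```

Each pass through the loop stands in for one tree: the sample is what that tree would train on, and the out-of-bag list is what it never sees.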

Random feature subsets. There is a second twist. At each split, a tree is allowed to consider only a random subset of the features, not all of them. Without this, if one feature were strongly predictive, nearly every tree would split on it first and the trees would end up nearly identical. Forcing each split to choose from a random handful of features de-correlates the trees, so they explore different patterns.
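
As a rough sketch of that step, here is how a tree might pick the features one split is allowed to look at. The function name and the feature names are hypothetical, and the default subset size (about the square root of the number of features) is an assumption: it is a common choice for classification, not something this lesson prescribes.

```python
import math
import random

def candidate_features(feature_names, rng, k=None):
    """Pick the random subset of features that one split may consider."""
    if k is None:
        # Assumed default: roughly sqrt(number of features).
        k = max(1, round(math.sqrt(len(feature_names))))
    # rng.sample draws without replacement, so no feature repeats.
    return rng.sample(feature_names, k)

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
features = ["sender", "subject_length", "num_links", "has_attachment",
            "num_exclamations", "hour_sent", "reply_chain", "num_images",
            "contains_offer"]  # hypothetical spam features
for split_index in range(3):
    print(f"split {split_index}: {candidate_features(features, rng)}")
```

A fresh subset is drawn at every split, so even a dominant feature is unavailable at many splits and other features get a chance to shape the tree.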

Together, bagging and random feature subsets give you a forest of trees that are individually decent and, crucially, make their mistakes in different places.

Here is the part worth slowing down on, because it is the whole point. Each individual tree in the forest still overfits: it still chases the noise in its own bootstrap sample. So how can averaging a pile of overfit trees be better than any one of them?

Because their errors are largely independent. Where the real pattern lives, the trees agree, because that signal is in everyone’s data. Where the noise lives, each tree is wrong in its own random way: one tree’s flukes are not another’s. When you average (or vote), the agreement on the signal reinforces, and the scattered, independent errors largely cancel out. The forest keeps the signal and discards much of the noise.

In the language we will make precise in Phase 4: a single deep tree has low bias but high variance (it is unstable). Averaging many such trees keeps the low bias but slashes the variance, and lower variance means better generalization to new data. That is the trade the wisdom of crowds buys you.
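
One standard way to write this down, under idealized assumptions, is the variance of an average of correlated predictions. The symbols are introduced here for illustration and are not part of the lesson: suppose the forest has B trees, each tree's prediction T_b(x) has the same variance σ², and any two trees' predictions have correlation ρ. Then:

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

The second term shrinks toward zero as B grows, which is the benefit of adding trees. The first term does not shrink at all, which is why the forest works so hard to lower ρ: bagging and random feature subsets exist to make the trees less correlated.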

Take a forest of five trees deciding whether one email is spam. They were each trained on different bootstrap samples, so they do not all agree:

Tree 1: SPAM
Tree 2: SPAM
Tree 3: NOT SPAM
Tree 4: SPAM
Tree 5: NOT SPAM
Majority vote: 3 SPAM vs 2 NOT SPAM -> forest predicts SPAM

Two of the five trees got it wrong, and the forest still gets it right, because the majority carried it. No single tree has to be reliable for the crowd to be reliable, as long as the trees are better than guessing and their errors are not all the same. Real forests use hundreds of trees, which makes the vote far more stable than this five-tree sketch.
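
The stabilizing effect of a larger crowd can be put in numbers with a simplified calculation. The sketch below assumes every tree is right with the same probability and that the trees err completely independently. Real trees are correlated, so real gains are smaller; the 0.6 accuracy is a made-up figure.

```python
from math import comb

def majority_correct_probability(n_trees, p_correct):
    """Probability that more than half of n_trees are right.

    Assumes each tree is right independently with probability p_correct,
    an idealization that real forests only approximate.
    """
    needed = n_trees // 2 + 1  # smallest number of votes that is a majority
    return sum(
        comb(n_trees, k) * p_correct**k * (1 - p_correct) ** (n_trees - k)
        for k in range(needed, n_trees + 1)
    )

# Odd tree counts avoid tied votes.
for n_trees in (1, 5, 25, 101):
    print(n_trees, round(majority_correct_probability(n_trees, 0.6), 3))
```

With five trees that are each right 60% of the time, the majority is right about 68% of the time, and the figure keeps climbing as trees are added.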

Bagging hands you a small bonus. Because each tree was trained on a bootstrap sample, on average about a third of the data (roughly 37%) was left out of any given tree’s training. For each training example, you can collect predictions from only the trees that never saw it, combine them the way the forest normally would, and compare the result with the true answer. Doing this across all examples gives an out-of-bag error estimate, a built-in measure of how well the forest generalizes, without having to set aside a separate validation set. It is a neat side effect of the way the forest is built.
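
The "about a third" figure follows from a short calculation. A given row is missed by one draw with probability 1 - 1/n, so it is missed by all n draws with probability (1 - 1/n)^n, which approaches 1/e as n grows. The function name below is made up for this sketch.

```python
import math

def expected_oob_fraction(n_rows):
    """Chance that one given row is never drawn in a bootstrap sample
    of n_rows draws taken with replacement from n_rows rows."""
    return (1 - 1 / n_rows) ** n_rows

for n_rows in (5, 100, 10_000):
    print(n_rows, round(expected_oob_fraction(n_rows), 4))
print("limit 1/e:", round(math.exp(-1), 4))
```

Even for the five-row example earlier the fraction is already close to a third (0.8 to the fifth power is about 0.33), and for realistic dataset sizes it is essentially 1/e, about 0.37.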

Random forests are not free wins. You give up the one thing that made a single tree special: interpretability. You can read one tree as a flowchart and explain its decision; you cannot meaningfully read five hundred trees voting. A forest is much closer to a black box. It is also larger and slower than a single tree, both to train and to run. The deal is a clear one: you trade the readable decision path for a big gain in accuracy and stability. Forests do still offer feature-importance estimates (which features mattered most across the crowd), which recovers a little of the lost insight, but not the step-by-step explanation a single tree gives.

The random forest is one of the most useful models in practice, and often the first thing a data scientist reaches for on tabular, spreadsheet-shaped data. It is accurate out of the box, needs little tuning, and resists overfitting, which makes it a strong baseline that more complex models have to beat to justify themselves. The deeper idea, that combining many diverse models that are wrong in independent ways beats any single one, recurs all over machine learning, from model ensembles in competitions to averaging several runs of a large model to get a steadier answer. “Ask a diverse crowd and combine the answers” is a pattern worth carrying everywhere.

Common mistakes to avoid:

  • Thinking more trees cause overfitting. They do not, in the way deepening one tree does. Adding trees mostly improves the estimate and then plateaus; you cannot overfit a forest just by adding more trees.
  • Expecting to read a forest. The interpretability of a single tree is gone. If you need an explainable decision path, a forest is the wrong tool.
  • Forgetting that the randomness is the point. Remove the bagging and the random feature subsets and the trees become near-identical, and the forest collapses back to roughly one tree. Diversity is what makes the crowd wise.

Key takeaways:

  • A random forest is a crowd of decision trees that vote (classification) or average (regression).
  • Diversity comes from two sources: bagging (each tree on a bootstrap sample) and random feature subsets at each split.
  • Averaging cancels the trees’ independent errors, keeping the shared signal and slashing variance, which is why the crowd generalizes better than any one overfit tree.
  • You trade interpretability for accuracy and stability, and get a free out-of-bag error estimate along the way.

A random forest builds its trees independently and in parallel, then averages them, a strategy called bagging. The next lesson combines trees in a completely different way: instead of growing them independently, it grows them in sequence, each new tree focused on fixing the mistakes the previous ones made. That is boosting, and on many problems it squeezes out even more accuracy.