
Lesson: Asking the right questions: decision trees

Logistic regression decides by drawing a single straight boundary and asking which side of it you fall on. That is elegant, but it is not how a person actually decides. A doctor working through a diagnosis does not compute a weighted sum. They ask a sequence of questions: Is there a fever? If so, is the cough dry? Is it worse at night? Each answer narrows things down until a conclusion is reached.

A decision tree learns exactly that kind of reasoning: a flowchart of yes/no questions that funnels an example down to a prediction. It is one of the most intuitive models in all of machine learning, because the thing it learns is something you can read, follow, and explain out loud.

A decision tree is a flowchart with three kinds of parts:

  • The root at the top: the first question asked of every example.
  • Internal nodes: follow-up questions, each one reached only if the answers above led there.
  • Leaves at the bottom: the predictions. For classification, a leaf is a class label; for a regression tree, it is a number.

To make a prediction, you start at the root, answer its question, follow the matching branch to the next question, and keep going until you land on a leaf. That leaf is the answer. No arithmetic, no boundary, just a path of questions.

Here is a small tree that decides whether to approve a loan:

                 [ Income > 50k? ]
                 /               \
               no                 yes
               /                   \
    [ Has collateral? ]     [ Credit score > 650? ]
        /        \              /        \
      no         yes          no         yes
      /            \          /            \
    DENY        APPROVE     DENY        APPROVE

Now run two applicants through it:

Applicant A: income 60k, credit score 700
    Income > 50k? yes -> Credit score > 650? yes -> APPROVE

Applicant B: income 40k, no collateral
    Income > 50k? no -> Has collateral? no -> DENY

You just executed the model by hand, and you can explain each decision in plain words: A was approved because they earn over 50k and have strong credit; B was denied because they earn under 50k and have no collateral. That readability is the decision tree’s signature strength.
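The same walk can be written as nested conditionals, which shows that prediction really is just a path of questions. This is a minimal sketch: the function name `approve_loan` and its parameter names are hypothetical, and the thresholds are the ones from the diagram above.

```python
def approve_loan(income, credit_score=None, has_collateral=None):
    """Walk the example loan tree and return 'APPROVE' or 'DENY'.

    Hypothetical helper for illustration; thresholds come from the diagram.
    Only the fields on the path actually taken need to be supplied.
    """
    if income > 50_000:                      # root: Income > 50k?
        # "yes" branch: the next question is about the credit score
        return "APPROVE" if credit_score > 650 else "DENY"
    # "no" branch: the next question is about collateral
    return "APPROVE" if has_collateral else "DENY"


print(approve_loan(income=60_000, credit_score=700))       # Applicant A -> APPROVE
print(approve_loan(income=40_000, has_collateral=False))   # Applicant B -> DENY
```

Each applicant only ever touches the questions on their own path, which is why Applicant A's collateral and Applicant B's credit score never enter the decision.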

How the tree gets built: choosing the best question

The tree above did not come from nowhere. The learning algorithm builds it by repeatedly answering one question: of all the questions I could ask here, which one best separates the classes?

“Best separates” has a precise meaning. A group of examples is pure if it is all one class (all approvals, or all denials) and impure if it is mixed. A perfect split sends all the approvals down one branch and all the denials down the other, producing two pure groups. A useless split leaves both branches just as mixed as before. The algorithm scores every candidate split by how much it reduces impurity and picks the split that reduces it the most. Impurity is measured with a number such as Gini impurity or entropy; for either measure, 0 means perfectly pure, and with two classes the maximum is reached at a 50/50 mix.
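To make the scoring concrete, here is a small sketch of both impurity measures and of the "impurity reduction" used to compare splits. The function names and the toy labels are assumptions made for illustration; the formulas are the standard definitions of Gini impurity and Shannon entropy, and the reduction is computed as parent impurity minus the size-weighted impurity of the two branches.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions (0 = pure)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits (0 = pure, 1 = a 50/50 two-class mix)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def impurity_reduction(parent, left, right, measure=gini):
    """Parent impurity minus the size-weighted impurity of the two branches."""
    n = len(parent)
    weighted = (len(left) / n) * measure(left) + (len(right) / n) * measure(right)
    return measure(parent) - weighted


parent = ["approve"] * 4 + ["deny"] * 4          # a 50/50 mix

# A perfect split: each branch is pure.
perfect = impurity_reduction(parent, ["approve"] * 4, ["deny"] * 4)

# A useless split: each branch is exactly as mixed as the parent.
mixed = ["approve", "approve", "deny", "deny"]
useless = impurity_reduction(parent, mixed, mixed)

print(gini(parent), entropy(parent))   # 0.5 1.0
print(perfect, useless)                # 0.5 0.0
```

The perfect split removes all of the parent's impurity and the useless split removes none, so an algorithm comparing the two would pick the first.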

Then it does the same thing again on each branch, and again on their branches, growing the tree one question at a time. Each node is just “find the single question that most purifies the groups below it.”

Left unchecked, this process has an obvious endpoint: keep splitting until every leaf holds a single training example. Such a tree would be perfectly correct on the training data and nearly useless on anything new, because it would have memorized every quirk and fluke rather than learning the general pattern. This is overfitting, the failure from lesson 1, in its most vivid form.

So real trees are reined in. You cap the depth, or require a minimum number of examples in a leaf, or grow the tree fully and then prune branches that do not earn their keep. The goal is a tree deep enough to capture the real structure but shallow enough to generalize. Where exactly that balance sits is the bias-variance question we take up in Phase 4.
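The sketch below grows a tree greedily on a made-up one-feature dataset, once without limits and once with a depth cap, to show what "reining in" changes. Everything here is an assumption for illustration: the data (income in thousands, with two deliberately noisy labels at 35 and 65), the names `build`, `best_split`, `max_depth`, and `min_leaf`, and the choice of Gini impurity. Pruning is not shown.

```python
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels, min_leaf):
    """Return the (feature, threshold) with the lowest weighted Gini, or None."""
    best, best_score = None, gini(labels)        # a split must beat the unsplit node
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue                         # would create a too-small leaf
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def build(rows, labels, max_depth=None, min_leaf=1, depth=0):
    """Grow a tree greedily. A leaf is a label; a node is (feature, threshold, left, right)."""
    split = None
    if max_depth is None or depth < max_depth:
        split = best_split(rows, labels, min_leaf)
    if split is None:
        return Counter(labels).most_common(1)[0][0]   # majority-class leaf
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            build(*zip(*left), max_depth, min_leaf, depth + 1),
            build(*zip(*right), max_depth, min_leaf, depth + 1))


def predict(tree, row):
    while isinstance(tree, tuple):               # descend until we hit a leaf
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


def n_leaves(tree):
    if not isinstance(tree, tuple):
        return 1
    return n_leaves(tree[2]) + n_leaves(tree[3])


# Hypothetical data: low incomes are mostly denied, high incomes mostly approved.
incomes = [20, 25, 30, 35, 40, 45, 55, 60, 65, 70, 75, 80]
rows = [(x,) for x in incomes]
labels = ["deny", "deny", "deny", "approve", "deny", "deny",
          "approve", "approve", "deny", "approve", "approve", "approve"]

full = build(rows, labels)                       # no limits
shallow = build(rows, labels, max_depth=1)       # a single question

for name, tree in [("full", full), ("shallow", shallow)]:
    correct = sum(predict(tree, r) == y for r, y in zip(rows, labels))
    print(name, n_leaves(tree), "leaves,", correct, "of", len(rows), "training rows correct")

print(predict(full, (34,)), predict(shallow, (34,)))
```

On this toy data the unrestricted tree gets every training row right, but only by carving out tiny regions around the two noisy points, so a new applicant with income 34 is approved purely because of the fluke at 35. The depth-limited tree misses the two noisy rows and keeps the general pattern.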

Everything so far predicted a class, but the same flowchart idea predicts numbers too, and then it is called a regression tree. The structure is identical, a tree of yes/no questions, but two things change. Each leaf outputs a number, the average of the training values that land in it, instead of a class label. And the split criterion changes: instead of reducing class impurity, the tree chooses questions that reduce the variance of the values in each group, so each leaf ends up holding examples with similar target numbers. Predicting a house price, a regression tree might ask about size and neighborhood and then output the average price of the training houses that funnel into that leaf. Same machine, numeric output. It is worth knowing because the boosted-tree models later in this phase are very often built from regression trees, even when the final task is classification.
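As a rough sketch of the regression case, the code below picks the single threshold that most reduces the size-weighted variance of the target and then uses the mean of each side as the leaf prediction. The house sizes and prices are invented toy data, and `best_threshold` is a hypothetical helper name; population variance from the standard library stands in for whatever criterion a real implementation uses.

```python
from statistics import mean, pvariance


def best_threshold(xs, ys):
    """Return the threshold on xs that minimizes the size-weighted variance of ys."""
    best_t, best_score = None, pvariance(ys)     # a split must beat the unsplit group
    for t in sorted(set(xs))[:-1]:               # the largest value would leave the right side empty
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * pvariance(left) + len(right) * pvariance(right)) / len(ys)
        if score < best_score:
            best_t, best_score = t, score
    return best_t


# Hypothetical toy data: house size in square metres, price in thousands.
sizes = [50, 60, 70, 120, 130, 140]
prices = [100, 110, 120, 300, 310, 320]

t = best_threshold(sizes, prices)

# Each leaf predicts the average target value of the training rows that reach it.
left_leaf = mean(p for s, p in zip(sizes, prices) if s <= t)
right_leaf = mean(p for s, p in zip(sizes, prices) if s > t)
print(t, left_leaf, right_leaf)
```

Here the chosen question is "size <= 70?", which separates the small, cheap houses from the large, expensive ones; the two leaves then predict the averages 110 and 310.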

Decision trees have real strengths. They are interpretable, they capture non-linear patterns that a single straight boundary cannot, they handle numeric and categorical features together, and they need no rescaling of the inputs.

But a single decision tree has a serious weakness: it is unstable. Change a handful of training examples and you can get a completely different tree, because one altered split near the top reshuffles everything below it. A model whose structure swings wildly with small data changes is high-variance, and it tends to overfit. This instability is not a footnote; it is the exact problem the next lesson solves. The fix turns out to be wonderfully simple: instead of trusting one tree, grow many of them and let them vote.

Decision trees are not just a teaching example; they are the building block of the models that quietly win on real-world tabular data. Random forests and gradient-boosted trees, both built from many decision trees, are the workhorses for structured data in industry and competitions, often beating neural networks on spreadsheet-shaped problems. And trees offer something large models cannot: a decision you can fully audit. When a tree denies a loan, you can trace the exact path of questions that led there, which matters enormously anywhere a decision has to be explained or justified.

  • Letting the tree grow without limit. An unrestricted tree memorizes the training data. Depth limits, minimum leaf sizes, or pruning are not optional niceties.
  • Trusting a single tree as stable truth. One tree is high-variance; a slightly different dataset can produce a very different tree. Treat a lone tree’s exact structure with suspicion.
  • Assuming deeper is better. Past a point, extra depth fits noise, not signal. Beyond that point, deeper trees usually generalize worse, not better.

  • A decision tree is a flowchart of yes/no questions that funnels an example to a leaf, which holds the prediction.
  • It is built greedily, at each node choosing the question that most reduces impurity (best separates the classes).
  • An unrestrained tree overfits, so trees are limited by depth, leaf size, or pruning.
  • A single tree is interpretable but unstable (high-variance), which is exactly the weakness the next lesson fixes.

A lone decision tree gives you a readable model that captures non-linear patterns, but at the cost of instability and a strong tendency to overfit. The next lesson takes that flaw and turns it into a strength with a simple, powerful idea: grow a whole forest of trees, each on a slightly different view of the data, and average their votes. That is the random forest.