Skip to content

Lesson: Building a hierarchy: hierarchical clustering

K-means had one awkward demand: you had to name the number of clusters before you started. But often you do not know how many groups there are, and worse, the data may have structure at several scales at once: a few big groups, each with smaller sub-groups inside. A single cluster count k cannot capture that.

Hierarchical clustering takes a different stance. Instead of committing to a number of clusters, it builds the entire family tree of the data, from every point being its own tiny cluster all the way up to one cluster holding everything. The output is not a flat set of groups but a tree you can read at any level, and you decide how many clusters you want afterward, by choosing where to cut that tree.

The standard approach is agglomerative, meaning bottom-up, and it is simple:

1. Start with every point as its own cluster.
2. Find the two closest clusters and merge them into one.
3. Repeat step 2 until everything is in a single cluster.

Each merge is recorded: which two clusters joined, and how far apart they were when they did. That record of merges, in order, is the hierarchy. (A top-down “divisive” variant exists, splitting one big cluster down, but agglomerative is the common one.)

The merge history is drawn as a dendrogram, a tree diagram that is the whole point of the method. Individual points are the leaves along the bottom. Each merge is a connector joining two clusters, drawn at a height equal to the distance between them when they merged. Points that are close together join low on the diagram; groups that are far apart join high.

That height is the information. A merge low down means “these were very similar.” A merge high up means “these two groups were only joined because everything has to merge eventually, but they are quite different.” Reading the heights tells you the structure.

Here is a dendrogram for five points, A through E:

height
6 | ___________________
| | |
2 | __|__ |
| | | |
1 | __|__ | __|__
| | | | | |
0 | A B C D E

Read it from the bottom. A and B merged first, at height 1 (they were closest). D and E also merged at height 1. Then C joined the A-B group at height 2. Finally, the whole A-B-C group merged with the D-E group, but only way up at height 6, which tells you those two groups are very different from each other.

To turn the tree into actual clusters, you cut it with a horizontal line. The number of vertical branches the line crosses is the number of clusters, and each branch below a crossing is one cluster:

  • Cut at height 4 (anywhere between 2 and 6): the line crosses 2 branches, giving two clusters, {A, B, C} and {D, E}.
  • Cut at height 1.5 (between 1 and 2): the line crosses 3 branches, giving {A, B}, {C}, and {D, E}.

Cut low and you get many small, tight clusters; cut high and you get a few broad ones. The natural place to cut is across the tallest vertical gap, a long stretch where no merges happen, because that gap is the data telling you the groups on either side are genuinely far apart. Here the big gap is between height 2 and height 6, so cutting there (two clusters) is the most natural reading.

Measuring distance between clusters: linkage

Section titled “Measuring distance between clusters: linkage”

Step 2 says “merge the two closest clusters,” but once clusters hold several points, what does “closest” mean? That choice is called the linkage, and the common options are:

  • Single linkage: the distance between the two nearest points across the clusters. Can produce long, chained clusters.
  • Complete linkage: the distance between the two farthest points. Favors compact, tight clusters.
  • Average linkage: the average distance over all pairs.
  • Ward’s linkage: merge the pair that increases the total within-cluster spread the least.

The linkage you pick changes the shape of the tree, so it is a real choice, not a detail. Ward’s and average linkage are common defaults.

Underneath linkage sits a more basic choice: how you measure the distance between two individual points in the first place, usually straight-line (Euclidean) distance. And because the whole method is built on distances, it has the same sensitivity to scale as k-means and support vector machines: a feature measured in large numbers will dominate the distances and warp the tree, so you rescale features to comparable ranges before clustering.

The two clustering methods trade off cleanly:

K-means Hierarchical
choose k up front yes no (cut the tree later)
output flat set of k groups a full tree (dendrogram)
multi-scale view no yes, structure at every level
speed / scale fast, scales well slower; struggles on large data
merges n/a greedy and irreversible

Reach for hierarchical clustering on smaller datasets, when you do not know how many clusters to expect, or when the multi-scale structure itself is what you want to see, as in biology, where dendrograms of genes or samples are everyday tools.

Dendrograms are a staple of exploratory data analysis. In bioinformatics they sit alongside heatmaps to group genes and samples; in document analysis they organize texts into nested topics; anywhere you want to see the structure of unlabeled data rather than just partition it, the tree is more revealing than a flat list of groups. The ability to read structure at multiple scales, and to decide the number of clusters after looking rather than before, is exactly what flat methods like k-means cannot give you.

  • Reading membership without choosing a cut. A dendrogram alone is not a set of clusters. You have to pick a cut height to get groups.
  • Misreading left-to-right position. Only the merge heights carry meaning. Two leaves sitting side by side at the bottom are not necessarily similar; the branches can be rotated freely. Look at the height where they join, not their horizontal nearness.
  • Ignoring the linkage choice. Different linkages can produce quite different trees. The default is a decision, so make it deliberately.
  • Using it on huge datasets. Hierarchical clustering is computationally heavy and does not scale to very large data the way k-means does.
  • Hierarchical clustering builds a tree of nested groups, merging the two closest clusters repeatedly from singletons up to one big cluster, with no cluster count k chosen in advance.
  • The dendrogram records the merges, drawing each at a height equal to how far apart the merged clusters were. Height is the meaning.
  • You get clusters by cutting the tree with a horizontal line; cut across the tallest gap for the most natural grouping.
  • Linkage defines cluster-to-cluster distance (single, complete, average, Ward’s), and the choice shapes the tree.

Clustering, in both flavors, was about grouping similar points together. The next lesson turns to a different unsupervised goal: not grouping the data, but compressing it. When each point has dozens or hundreds of features, how do you boil them down to a handful that still capture most of what matters? That is dimensionality reduction, and it begins with the most important technique in the family: principal component analysis.