Summary: Building a hierarchy: hierarchical clustering

Hierarchical clustering builds a tree of nested groups instead of one flat partition: merge the two closest clusters over and over, from singletons up to one big cluster, then cut the tree at whatever height you like to get your clusters. No k is chosen in advance, and the dendrogram shows structure at every scale. This summary is the scan version of the full lesson.
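The merge loop is short enough to sketch from scratch. The version below is a minimal illustration, not a production implementation: the function names `linkage_distance` and `agglomerate` and the five sample points are made up for this example, the rule for cluster-to-cluster distance (the "linkage") is passed in as a string, and Ward's linkage is left out to keep the code small. It recomputes every distance on every pass, so it only suits tiny inputs.

```python
from itertools import combinations
from math import dist


def linkage_distance(a, b, points, method):
    """Distance between clusters a and b, given as lists of point indices."""
    pair_dists = [dist(points[i], points[j]) for i in a for j in b]
    if method == "single":      # nearest pair of points
        return min(pair_dists)
    if method == "complete":    # farthest pair of points
        return max(pair_dists)
    if method == "average":     # mean over all cross-cluster pairs
        return sum(pair_dists) / len(pair_dists)
    raise ValueError(f"unknown linkage: {method}")


def agglomerate(points, method="single"):
    """Return the merge history as (members_a, members_b, height) tuples."""
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    history = []
    while len(clusters) > 1:
        # Find the closest pair of clusters under the chosen linkage.
        height, i, j = min(
            (linkage_distance(clusters[i], clusters[j], points, method), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        )
        history.append((clusters[i], clusters[j], height))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return history


# Hypothetical sample data: two tight pairs and one far-away point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0), (10.0, 10.0)]
for method in ("single", "complete", "average"):
    heights = [round(h, 2) for _, _, h in agglomerate(points, method)]
    print(method, heights)
```

Each tuple in the returned history is one merge, and its height is what a dendrogram would draw on the vertical axis. Running the loop with different `method` values on the same points gives different later merge heights, which is why the choice of linkage changes the tree.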

  • Agglomerative, bottom-up. Start with every point as its own cluster, repeatedly merge the two closest clusters, continue until one cluster remains. Each merge and its distance is recorded.
  • The dendrogram draws that merge history as a tree. Leaves are points; each merge is drawn at a height equal to the distance between the clusters joined. Low merge = similar; high merge = different.
  • Cutting the tree turns it into clusters: a horizontal line crosses some branches, and each branch below a crossing is one cluster. Cut low for many small clusters, high for few large ones.
  • Cutting across the tallest vertical gap usually gives the most natural grouping: a long stretch of height with no merges means the clusters below it are far apart compared with the distances inside them.
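  • Linkage defines cluster-to-cluster distance: single (the nearest pair of points), complete (the farthest pair), average (the mean over all cross-cluster pairs), or Ward’s (merge the pair that increases within-cluster variance the least). The choice changes the tree.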
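  • Versus k-means: no k up front and a full multi-scale tree, but slower and impractical on very large data, because the standard algorithm works from all pairwise distances and its memory use grows with the square of the number of points. k-means is fast and flat but makes you pick k.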
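Cutting at the tallest gap can also be sketched in a few lines. The example below is an illustration under stated assumptions: the function name `cut_at_tallest_gap` is hypothetical, the merge table is hand-written with made-up heights, and merge heights are assumed to be non-decreasing (true for single, complete, average and Ward's linkage). Cluster ids follow the common convention in which points are numbered 0 to n-1 and the k-th merge creates cluster n+k.

```python
def cut_at_tallest_gap(n_points, merges):
    """Cut a merge history in the middle of its largest height gap.

    merges: list of (id_a, id_b, height) in merge order; the k-th merge
    creates a new cluster with id n_points + k.
    Returns (threshold, labels) with one integer label per point.
    """
    if len(merges) < 2:
        raise ValueError("need at least two merges to find a gap")
    heights = [h for _, _, h in merges]
    # Gap between each merge height and the next one up the tree.
    gap, k = max((heights[k + 1] - heights[k], k) for k in range(len(heights) - 1))
    threshold = (heights[k] + heights[k + 1]) / 2

    # Apply only the merges below the cut, tracking membership with union-find.
    parent = list(range(n_points + len(merges)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for idx, (a, b, h) in enumerate(merges):
        if h < threshold:
            new_id = n_points + idx
            parent[find(a)] = new_id
            parent[find(b)] = new_id

    # Relabel roots as 0, 1, 2, ... in order of first appearance.
    compact = {}
    labels = [compact.setdefault(find(p), len(compact)) for p in range(n_points)]
    return threshold, labels


# Hypothetical merge table for five points (heights are made up).
merges = [(0, 1, 1.0), (2, 3, 1.0), (5, 6, 5.0), (4, 7, 10.3)]
threshold, labels = cut_at_tallest_gap(5, merges)
print(threshold, labels)
```

In this table the largest gap lies between the merges at heights 5.0 and 10.3, so the cut lands between them and leaves two clusters: the first four points together and the fifth on its own. Lowering the threshold by hand would split the group of four into its two pairs, which is the "cut low for many small clusters" case.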

The dendrogram is one of the most useful pictures in unsupervised learning, and now you can read one: the heights tell you how different groups are, and the cut you choose decides how many clusters you get, after looking rather than before. That often makes hierarchical clustering the better tool when you do not know how many groups to expect, or when the nested structure itself is the insight (a staple in biology, where gene and sample dendrograms accompany heatmaps). The one misreading to avoid is treating the left-to-right order of leaves as meaningful; only merge height matters, since the two branches at any merge can be swapped without changing the tree. The next lesson leaves clustering behind for the other great unsupervised goal, compressing many features down to a few that still capture the signal: principal component analysis.