Summary: Grouping without labels: k-means clustering

K-means finds groups in unlabeled data by repeating two steps: assign each point to its nearest centroid, then move each centroid to the mean of its points, until nothing changes. It opens the unsupervised phase, where there are no labels and the goal is to discover structure rather than predict a known answer. This summary is the scan version of the full lesson.

  • Unsupervised, no answer key. Clustering finds the natural groups in unlabeled data. K-means is the workhorse.
  • Centroid = cluster center, the average position of the points in a cluster.
  • The loop: choose k, place k centroids, then repeat ASSIGN (each point to its nearest centroid) and UPDATE (each centroid to the mean of its points) until assignments stop changing.
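  • It usually converges within a few iterations, even from a poor start, as the centroids drift toward the middle of the natural groups. What it converges to is a stable grouping, not necessarily the best one.
  • You must choose k. The elbow method helps: run several values of k, plot cluster tightness (the total squared distance from each point to its centroid), and pick the k where improvement flattens.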
  • The crucial caution: k-means always returns k clusters, real groups or not. It cannot tell you the clusters mean anything; that judgment is yours.
  • Limitations: sensitive to initialization (run it several times), assumes round and similar-sized clusters, and is distance-based so it needs feature scaling.
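
The assign-update loop and the elbow check from the bullets above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names (`kmeans`, `tightness`), the toy points, and the choice to start from k randomly picked data points are all assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, seed=0, max_iter=100):
    """Plain k-means; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Assumption: start from k randomly chosen data points.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # ASSIGN: each point goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing changed, so the loop has converged
        assignments = new_assignments
        # UPDATE: each centroid moves to the mean of its points.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignments


def tightness(points, centroids, assignments):
    """Total squared distance from each point to its assigned centroid."""
    return sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignments))


# Toy data (made up for illustration): two well-separated squares of four points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]

centroids, assignments = kmeans(points, k=2)
print(sorted(centroids))

# Elbow check: tightness for several k; look for where improvement flattens.
tightness_by_k = {}
for k in (1, 2, 3, 4):
    c, a = kmeans(points, k)
    tightness_by_k[k] = tightness(points, c, a)
    print(k, tightness_by_k[k])
```

On this toy data the tightness collapses between k=1 and k=2 and has little room left to improve afterward, which is the elbow. Real data is rarely this clean, so in practice you would also rerun with several seeds and scale the features first.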

Clustering is the standard first move on unlabeled data, which is most data: customer segments, grouping similar items, exploring a fresh dataset, organizing an archive nobody labeled. Knowing the assign-update loop demystifies “the algorithm found these segments”: it is just nearest-centroid grouping repeated to convergence. The most valuable thing to carry, though, is the skepticism: because k-means always returns exactly the number of clusters you ask for, the existence of clusters is never proof that the groups are real. That habit of asking “are these groups meaningful or just imposed?” protects you from a lot of confident nonsense. It also connects to modern AI, where the embeddings that language models produce are clustered to group things by meaning. The next lesson, hierarchical clustering, clusters without committing to a number of groups at all and builds a whole hierarchy instead.
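
The point about imposed clusters is easy to see on data that has no groups at all. The sketch below is a hypothetical one-dimensional k-means (`kmeans_1d`, with a deliberately naive start from the first k values) run on evenly spaced numbers; the names and data are assumptions for illustration only.

```python
def kmeans_1d(values, k, max_iter=100):
    """One-dimensional k-means; returns (centroids, labels)."""
    # Assumption: a naive start, using the first k values as centroids.
    centroids = list(values[:k])
    labels = None
    for _ in range(max_iter):
        # ASSIGN: each value goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: abs(v - centroids[j]))
            for v in values
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # UPDATE: each centroid moves to the mean of its values.
        for j in range(k):
            members = [v for v, a in zip(values, labels) if a == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return centroids, labels


# Evenly spaced values: there is no natural grouping here.
values = [float(i) for i in range(12)]

clusters_found = {}
for k in (2, 3, 4):
    centroids_1d, labels = kmeans_1d(values, k)
    clusters_found[k] = len(set(labels))
    print(k, clusters_found[k])
```

Whatever k you ask for, you get that many clusters back, even though the data is a featureless line. The algorithm reports a partition, not evidence that the partition means anything.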