Grouping without labels: k-means clustering

This is lesson 9 of Track 10, the opener of Phase 3 (Finding structure without labels). By the end you will be able to walk through the k-means clustering loop by hand, from a starting guess to convergence, and judge when clustering is the right tool and when it will mislead you. The one capability to walk away with: run the assign-and-update loop yourself, and recognize that k-means always returns the number of clusters you ask for, real or not.
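For readers who like to check a by-hand walkthrough against code, here is a minimal sketch of the assign-and-update loop on one-dimensional data. The function name `kmeans_1d`, the sample points, and the starting centroids are illustrative assumptions, not material from the lesson, and keeping an empty cluster's old centroid is just one simple convention among several.

```python
def kmeans_1d(points, centroids, max_iters=100):
    """Plain k-means on 1-D data: repeat assign and update until nothing moves."""
    centroids = list(centroids)
    for _ in range(max_iters):
        # Assign step: each point joins the cluster of its nearest centroid.
        # Ties go to the lower-numbered centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its assigned points.
        # An empty cluster keeps its old centroid (an assumed convention).
        new_centroids = [
            sum(c) / len(c) if c else old
            for c, old in zip(clusters, centroids)
        ]
        if new_centroids == centroids:  # converged: no centroid moved
            break
        centroids = new_centroids
    return centroids, clusters


# Two obvious groups, starting from a deliberately poor guess.
grouped = [1, 2, 3, 10, 11, 12]
centroids, clusters = kmeans_1d(grouped, centroids=[1, 2])
print(centroids, clusters)  # -> [2.0, 11.0] [[1, 2, 3], [10, 11, 12]]

# Evenly spaced points with no real groups: asking for 3 still yields 3.
even = [1, 2, 3, 4, 5, 6, 7, 8, 9]
centroids_even, clusters_even = kmeans_1d(even, centroids=[1, 5, 9])
print(len(clusters_even))  # -> 3
```

The second run is the cautionary one: the points are evenly spaced, yet the loop still reports three non-empty clusters because three were requested.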

The track structurally mirrors StatQuest’s intuition-first machine learning videos, with Microsoft’s “ML For Beginners” as the hands-on companion for readers who want to build the models in code. Full attribution is in this lesson’s references.

This lesson opens the unsupervised phase. Phases 1 and 2 were entirely supervised: every model learned from labeled answers. Here the labels are gone, and the goal shifts from predicting a known answer to discovering structure in raw data. K-means is the natural starting point: it is one of the simplest and most widely used clustering methods. The next lesson, hierarchical clustering, tackles the same job without making you choose the number of clusters in advance.

Prerequisite: Lesson 1, What machine learning actually is. You need the distinction between supervised and unsupervised learning, because this lesson is the first to work with unlabeled data, exactly the unsupervised case lesson 1 described. No math beyond computing averages and comparing distances.

  • Explain how clustering differs from supervised learning
  • Walk through the k-means assign-and-update loop to convergence
  • Use the elbow method to choose k
  • Explain why k-means always returns k clusters and why that demands judgment
  • Name the main limitations of k-means

  • Read time: about 12 minutes
  • Practice time: about 15 minutes (a by-hand iteration exercise, a judgment question, and flashcards)
  • Difficulty: standard