Practice: Grouping without labels: k-means clustering

Self-check

Seven short questions. Try to answer each one before opening the collapsible.

1. How is clustering different from everything in the first two phases?

Show answer

It is unsupervised: there are no labels, no answer key. Instead of predicting a known answer, clustering finds the natural groups hiding in unlabeled data.

2. What is a centroid?

Show answer

The center of a cluster: the average position of all the points assigned to it. K-means summarizes each cluster by its centroid.

3. State the two repeating steps of k-means.

Show answer

Assign: put each point in the cluster of its nearest centroid. Update: move each centroid to the average position of its assigned points. Repeat until assignments stop changing.
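The two steps can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function names, the sample points, and the starting centroids are all made up for the example, and leaving an empty cluster's centroid where it was is an assumption of this sketch.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    return dists.index(min(dists))

def kmeans(points, centroids, max_rounds=100):
    """Alternate assign and update until no point switches clusters."""
    centroids = [tuple(c) for c in centroids]
    labels = None
    for _ in range(max_rounds):
        # Assign: each point joins the cluster of its nearest centroid.
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:
            break  # nothing changed, so the centroids have settled
        labels = new_labels
        # Update: move each centroid to the mean of its assigned points.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return labels, centroids

# Hypothetical data: two loose groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)     # [0, 0, 0, 1, 1, 1]
print(centroids)
```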

4. How do you know when to stop?

Show answer

When an assignment round changes nothing: no point switches clusters. The centroids have settled.

5. What is the elbow method for?

Show answer

Choosing k. Run k-means for several values of k, measure how tight the clusters are for each (for example, by the total squared distance from each point to its centroid), and pick the k at the “elbow” where adding more clusters stops meaningfully improving tightness.
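A rough sketch of the elbow method on one-dimensional data follows. Tightness is measured here as the total squared distance from each point to its centroid (often called inertia). The data, the helper names, and the evenly spaced starting centroids are assumptions chosen to keep the example deterministic.

```python
def kmeans_1d(data, centroids, max_rounds=100):
    """Assign/update loop for points on a line; returns (labels, centroids)."""
    centroids = list(centroids)
    labels = None
    for _ in range(max_rounds):
        new_labels = [
            min(range(len(centroids)), key=lambda j: abs(x - centroids[j]))
            for x in data
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(len(centroids)):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return labels, centroids

def inertia(data, labels, centroids):
    """Total squared distance from each point to its own centroid."""
    return sum((x - centroids[lab]) ** 2 for x, lab in zip(data, labels))

# Hypothetical data with three obvious groups.
data = [1, 2, 3, 10, 11, 12, 20, 21, 22]
inertias = {}
for k in range(1, 6):
    # Assumption: start from k points spread evenly through the sorted data.
    start = [data[i * len(data) // k] for i in range(k)]
    labels, centroids = kmeans_1d(data, start)
    inertias[k] = inertia(data, labels, centroids)
    print(k, round(inertias[k], 2))
```

On this data the inertia falls steeply up to k = 3 and barely moves afterwards, which is the elbow. On real data the bend is usually much less sharp.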

6. What is the most important caution about k-means clusters?

Show answer

K-means always returns k clusters whether or not real groups exist. Hand it noise and ask for three clusters, and it produces three. Judging whether the clusters are meaningful is your job, not the algorithm’s.
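To see this caution in action, the sketch below feeds uniformly random 2-D points (no real groups by construction) to a bare-bones k-means and asks for three clusters. The helper, the seed, and the point count are arbitrary choices for illustration.

```python
import math
import random

def kmeans(points, centroids, max_rounds=100):
    """Plain assign/update loop; returns (labels, centroids)."""
    centroids = list(centroids)
    labels = None
    for _ in range(max_rounds):
        new_labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return labels, centroids

rng = random.Random(0)
# Structureless data: points scattered uniformly over the unit square.
noise = [(rng.random(), rng.random()) for _ in range(300)]
labels, centroids = kmeans(noise, rng.sample(noise, 3))

print(centroids)                             # three centroids come back regardless
print([labels.count(j) for j in range(3)])   # how many points each one claimed
```

The algorithm reports three centroids and a label for every point, yet nothing in that output says the split reflects anything real.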

7. Name two limitations of k-means besides “you must choose k.”

Show answer

Any two: sensitive to the starting centroids (a bad init gives a poor result), assumes roughly round and similar-sized clusters, and is distance-based so it needs feature scaling.

Try it yourself: run one round

Points on a line, with k = 2 and centroids starting at 4 and 21:

points: 1, 3, 5, 20, 22, 24

Do one full round: assign each point to its nearest centroid, then update each centroid to the mean of its assigned points. Then say whether the next round would change anything.

Show answer

ASSIGN (to nearest of 4, 21):
  1 -> A (dist 3 vs 20)
  3 -> A (dist 1 vs 18)
  5 -> A (dist 1 vs 16)
  20 -> B (dist 16 vs 1)
  22 -> B (dist 18 vs 1)
  24 -> B (dist 20 vs 3)
  clusters: A = {1, 3, 5}, B = {20, 22, 24}

UPDATE:
  A -> mean(1, 3, 5)  = 3
  B -> mean(20, 22, 24) = 22

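The centroids move from 4 and 21 to 3 and 22. Would the next round change anything? No: with centroids at 3 and 22, the dividing line between the clusters sits at their midpoint, 12.5. Every point in A lies below it and every point in B above it, so each point still picks the same cluster. The assignments are stable and k-means has converged.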
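The round above can be checked with a short script. The `assign` and `update` helpers are names invented for this sketch, and `update` assumes every cluster has at least one point.

```python
points = [1, 3, 5, 20, 22, 24]
centroids = [4, 21]

def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: abs(x - centroids[j]))
        for x in points
    ]

def update(points, labels, k):
    """Mean of the points in each cluster (assumes no cluster is empty)."""
    return [
        sum(x for x, lab in zip(points, labels) if lab == j) / labels.count(j)
        for j in range(k)
    ]

labels = assign(points, centroids)
new_centroids = update(points, labels, k=2)

print(labels)                                   # [0, 0, 0, 1, 1, 1]
print(new_centroids)                            # [3.0, 22.0]
print(assign(points, new_centroids) == labels)  # True: the next round changes nothing
```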

Try it yourself: are these segments real?

You run k-means with k = 4 on customer data and get four clusters, each with a centroid. A colleague says, “Great, our customers fall into four segments.” What is the catch, and what should you check before agreeing?

Show answer

The catch: k-means always returns four clusters when you ask for four. Getting four clusters out is not evidence that four real segments exist; the algorithm would have split even structureless data into four. Before agreeing, check that the grouping is actually meaningful: use the elbow method to see whether four is a justified choice (or whether two or six fit far better), and inspect whether the clusters genuinely differ in ways that matter (different behavior, different value) rather than being arbitrary slices. The number of clusters was your assumption, not the data’s verdict.

Flashcards

Ten cards.

Q. What is clustering?

An unsupervised task: finding the natural groups in unlabeled data, with no answer key. K-means is the workhorse algorithm.

Q. What is a centroid?

The center of a cluster: the average position of all points assigned to it.

Q. What are the two repeating steps of k-means?

Assign each point to its nearest centroid, then update each centroid to the mean of its assigned points. Repeat until assignments stop changing.

Q. When does k-means stop?

When a round of assignment changes nothing: no point switches clusters and the centroids have settled.

Q. What is the elbow method?

A way to choose k: run k-means for several k, measure cluster tightness, and pick the k where improvement flattens out (the elbow).

Q. What is the key caution about k-means clusters?

It always returns k clusters whether real groups exist or not. It cannot tell you the clusters are meaningful; that judgment is yours.

Q. Why is k-means sensitive to initialization?

A poor set of starting centroids can settle into a bad grouping. The usual remedy is to run k-means several times from different starts and keep the best result; k-means++ also helps by spreading the starting centroids apart.
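The k-means++ seeding mentioned in this card can be sketched as follows: the first centroid is a random data point, and each later one is drawn with probability proportional to its squared distance from the nearest centroid already chosen. The function name and the 1-D data are illustrative, and the sketch assumes k is no larger than the number of distinct points.

```python
import random

def kmeans_pp_init(points, k, rng):
    """Pick k starting centroids from 1-D points, k-means++ style."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Squared distance from each point to its nearest chosen centroid.
        weights = [min((x - c) ** 2 for c in centroids) for x in points]
        # Far-away points are proportionally more likely to be picked next;
        # already-chosen points have weight 0 and cannot be picked again.
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids

points = [1, 3, 5, 20, 22, 24]
print(kmeans_pp_init(points, k=2, rng=random.Random(0)))
```

Because distant points carry most of the weight, the two seeds tend to land in different groups, though this is a matter of probability rather than a guarantee.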

Q. What cluster shape does k-means assume?

Roughly round, similar-sized clusters, because it judges everything by distance to a center. Elongated or very unequal groups trip it up.

Q. Why must you scale features for k-means?

It is distance-based, so an unscaled large-range feature dominates the distances and distorts the clusters. Rescale first.
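A small illustration of why scaling matters, using made-up customers described by age and income. Standardizing to z-scores is one common way to rescale and is an assumption of this sketch; the helper names are hypothetical.

```python
import math
import statistics

# Hypothetical customers: (age in years, annual income in dollars).
customers = [(25, 50_000), (65, 50_500), (26, 53_000), (64, 90_000)]

def standardize(rows):
    """Rescale each column to mean 0 and standard deviation 1 (z-scores)."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]

def nearest_to_first(rows):
    """Index of the row closest to rows[0] (Euclidean distance)."""
    return min(range(1, len(rows)), key=lambda j: math.dist(rows[0], rows[j]))

# Unscaled: income differences swamp age, so the 65-year-old looks closest.
print(nearest_to_first(customers))               # 1
# Scaled: age counts too, so the 26-year-old with similar income is closest.
print(nearest_to_first(standardize(customers)))  # 2
```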

Q. How does clustering connect to modern AI?

The vector embeddings that language models produce can be clustered to group items by meaning. The same nearest-by-distance idea is behind “find documents similar to this one.”