Cheatsheet: Grouping without labels: k-means clustering
The setup
| Term | Meaning |
|---|---|
| Clustering | unsupervised: find natural groups in unlabeled data |
| k | the number of clusters (you choose it) |
| Centroid | a cluster’s center: the mean position of its points |
The loop
| Step | Action |
|---|---|
| 0. Initialize | choose k, then place k starting centroids (often at random) |
| 1. Assign | each point joins its nearest centroid’s cluster |
| 2. Update | move each centroid to the mean of its assigned points |
| Repeat | steps 1-2 until assignments stop changing |
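
The loop above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `kmeans`, its parameters, and the choice to leave an empty cluster's centroid where it was are all assumptions made for the example.

```python
import random
from math import dist  # Euclidean distance, Python 3.8+


def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Plain k-means on a list of equal-length tuples.

    Returns (centroids, labels), where labels[i] is the cluster index of points[i].
    """
    rng = random.Random(seed)
    # Step 0: use the given starting centroids, or pick k distinct points at random.
    centroids = list(init) if init is not None else rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 1 (assign): each point joins its nearest centroid's cluster.
        new_labels = [min(range(k), key=lambda j: dist(p, centroids[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so the loop has converged
        labels = new_labels
        # Step 2 (update): move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels


# Hypothetical 2-D data with two visibly separate groups.
pts = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(pts, k=2, init=[(1.0, 1.0), (8.0, 8.0)])
print(labels)
print(centroids)
```

Passing `init` makes the run repeatable. Leaving it out falls back to random starting points, which is where the init-sensitivity of k-means comes from.
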
Worked trace (points 1,2,3,10,11,12; k=2; init centroids 2,3)
| Iter | Assign | New centroids |
|---|---|---|
| 1 | A={1,2}, B={3,10,11,12} | 1.5, 9 |
| 2 | A={1,2,3}, B={10,11,12} | 2, 11 |
| 3 | no change | converged |
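
To check the trace by machine, the short script below replays it on the same six points with the same starting centroids. It is a sketch for this 1-D example only: it uses absolute difference as the distance and assumes no cluster ever becomes empty, which holds here.

```python
points = [1, 2, 3, 10, 11, 12]
centroids = [2.0, 3.0]  # initial centroids from the trace

history = []       # one (clusters, centroids) entry per iteration that changed something
assignment = None
while True:
    # Assign: index of the nearest centroid for each point (1-D distance).
    new_assignment = [
        min(range(len(centroids)), key=lambda j: abs(p - centroids[j])) for p in points
    ]
    if new_assignment == assignment:
        break  # iteration 3 in the table: no change, so stop
    assignment = new_assignment
    # Update: each centroid becomes the mean of its assigned points.
    clusters = [
        [p for p, a in zip(points, assignment) if a == j] for j in range(len(centroids))
    ]
    centroids = [sum(c) / len(c) for c in clusters]
    history.append((clusters, centroids))

for i, (clusters, cents) in enumerate(history, start=1):
    print(i, clusters, cents)
# 1 [[1, 2], [3, 10, 11, 12]] [1.5, 9.0]
# 2 [[1, 2, 3], [10, 11, 12]] [2.0, 11.0]
```

The point 3 switches clusters in iteration 2 because the first update pulls centroid B from 3 up to 9, leaving 3 much closer to centroid A at 1.5.
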
Choosing k and the big caution
| Idea | Note |
|---|---|
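| Elbow method | run k-means for several values of k, plot cluster tightness (within-cluster sum of squared distances) against k, pick the bend where the improvement levels off |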
| Always returns k clusters | even on noise; finding k groups is NOT proof they are real |
| Your job | judge whether the clusters are meaningful |
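
The elbow method can be illustrated on the six points from the worked trace. The sketch below measures tightness as the within-cluster sum of squared distances (often called inertia). The helper names `kmeans_1d` and `inertia` are made up for this example, and the evenly spaced starting centroids are an assumption chosen only to make the output repeatable.

```python
def kmeans_1d(points, k, max_iter=100):
    """Tiny 1-D k-means. Assumption: centroids start at evenly spaced sorted points."""
    pts = sorted(points)
    centroids = [float(pts[i * (len(pts) - 1) // max(k - 1, 1)]) for i in range(k)]
    labels = None
    for _ in range(max_iter):
        # Assign each point to its nearest centroid.
        new_labels = [min(range(k), key=lambda j: abs(p - centroids[j])) for p in pts]
        if new_labels == labels:
            break
        labels = new_labels
        # Move each centroid to the mean of its members (empty clusters stay put).
        for j in range(k):
            members = [p for p, lab in zip(pts, labels) if lab == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return pts, centroids, labels


def inertia(points, k):
    """Within-cluster sum of squared distances: lower means tighter clusters."""
    pts, centroids, labels = kmeans_1d(points, k)
    return sum((p - centroids[lab]) ** 2 for p, lab in zip(pts, labels))


data = [1, 2, 3, 10, 11, 12]
curve = {k: inertia(data, k) for k in range(1, 5)}
for k, value in curve.items():
    print(k, value)
# 1 125.5
# 2 4.0
# 3 2.5
# 4 1.0
```

Tightness improves sharply from k=1 to k=2 and only slightly after that, so the bend sits at k=2. The curve keeps falling as k grows no matter what the data looks like, which is why the bend, not the minimum, is what to look for.
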
Limitations
| Limitation | Note |
|---|---|
| Must choose k | strongly shapes the result |
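| Init-sensitive | different starting centroids can give different results; run several times and keep the tightest result; k-means++ seeding spreads the starting centroids apart |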
| Assumes round clusters | struggles with elongated groups or groups of very different size or density |
| Distance-based | scale features first |
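
A small sketch of why scaling matters for a distance-based method. The data is hypothetical (age in years, income in dollars), and `standardize` is an illustrative helper that z-scores each column; other scalings, such as min-max, are equally reasonable choices.

```python
from math import dist
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each column: subtract its mean, divide by its population std."""
    columns = list(zip(*rows))
    mus = [mean(c) for c in columns]
    # Assumption: a constant column (std 0) is divided by 1 and so maps to all zeros.
    sigmas = [pstdev(c) or 1.0 for c in columns]
    return [tuple((x - m) / s for x, m, s in zip(row, mus, sigmas)) for row in rows]


# Hypothetical rows: (age in years, income in dollars)
rows = [(25, 40_000), (30, 42_000), (60, 41_000), (65, 43_000)]
scaled = standardize(rows)

# Raw distances are dominated by income, the column with the larger numbers.
raw_to_30yo = dist(rows[0], rows[1])
raw_to_60yo = dist(rows[0], rows[2])
print(raw_to_30yo > raw_to_60yo)

# After scaling, both columns contribute on a comparable footing.
scaled_to_30yo = dist(scaled[0], scaled[1])
scaled_to_60yo = dist(scaled[0], scaled[2])
print(scaled_to_30yo < scaled_to_60yo)
```

On the raw numbers, the 25-year-old sits closer to the 60-year-old than to the 30-year-old, purely because their incomes differ by less. After standardizing, the ordering flips, and any k-means assignment built on those distances can flip with it.
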