Cheatsheet: Seeing high-dimensional data: t-SNE
What t-SNE is
| Item | Detail |
|---|---|
| Purpose | visualization only (not preprocessing) |
| Output | a 2D layout where similar high-D points appear near each other |
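| How it works | iteratively moves the 2D points until their pairwise similarities match the high-D pairwise similarities (gradient descent on a KL-divergence loss) |
| Strong example | MNIST: a 2D PCA projection smears the digit classes together; t-SNE typically produces about 10 well-separated blobs |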
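
A minimal sketch of the comparison above, assuming scikit-learn is installed. It uses the small 8x8 `load_digits` set as a stand-in for MNIST and an arbitrary 500-point subsample to keep the run short; both choices are illustrative assumptions, not part of the original example.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Small 8x8 digits set bundled with scikit-learn (stand-in for MNIST).
X, y = load_digits(return_X_y=True)
X, y = X[:500], y[:500]  # subsample so the example runs quickly

# Linear projection to 2D, for comparison.
pca_2d = PCA(n_components=2).fit_transform(X)

# Nonlinear t-SNE layout of the same points.
tsne_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# Both results are (n_samples, 2) arrays, ready for a scatter plot colored by y.
print(pca_2d.shape, tsne_2d.shape)
```

Plotting each array as a scatter colored by `y` is the usual way to see the difference between the two layouts.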
What it preserves (and does not)
| Aspect | Preserved? |
|---|---|
| Local structure (who is near whom inside a cluster) | yes |
| Global structure (distance/arrangement between clusters) | no |
| Cluster size on the page | no (artifact of layout) |
| Layout across runs | no (varies with seed) |
Three misreadings to avoid
| Misreading | What’s actually true |
|---|---|
| “Clusters drawn close together are more related” | between-cluster distance is meaningless |
| “Big cluster = more variation” | cluster size on the page is meaningless |
| “Same data always gives the same picture” | different seeds give different layouts |
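
One way to check the seed dependence is to embed the same data twice and compare neighborhoods rather than coordinates. The sketch below assumes scikit-learn; the helper names `embed` and `neighbor_sets`, the 300-point subsample, and `k=10` are illustrative choices. `init="random"` is set explicitly so that the layout depends directly on the seed.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.neighbors import NearestNeighbors

X, _ = load_digits(return_X_y=True)
X = X[:300]  # small subsample so both runs finish quickly

def embed(seed):
    # Hypothetical helper: one t-SNE run with a random start for this seed.
    tsne = TSNE(n_components=2, perplexity=30, init="random", random_state=seed)
    return tsne.fit_transform(X)

def neighbor_sets(points, k=10):
    # Hypothetical helper: k nearest neighbors of each point in the 2D layout.
    # Calling kneighbors() without arguments excludes each point itself.
    nn = NearestNeighbors(n_neighbors=k).fit(points)
    idx = nn.kneighbors(return_distance=False)
    return [set(row) for row in idx]

layout_a, layout_b = embed(0), embed(1)

# Raw coordinates are not comparable across seeds.
same_picture = np.allclose(layout_a, layout_b)

# Local structure is: measure how many 10-NN neighbors the two runs share.
sets_a, sets_b = neighbor_sets(layout_a), neighbor_sets(layout_b)
overlap = np.mean([len(a & b) / 10 for a, b in zip(sets_a, sets_b)])

print("identical coordinates:", same_picture)
print("mean 10-NN overlap between runs:", round(float(overlap), 2))
```

The overlap score is a rough stability check, not a standard metric; a high value suggests that who-is-near-whom survives a change of seed even when the picture looks different.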
Perplexity
| Setting | Effect |
|---|---|
| Too low (e.g., 5 on a non-tiny dataset) | many tiny fragmented clusters |
| Too high (e.g., 100+) | everything blurs into one blob |
| Typical range | 5 to 50 |
| Best practice | try several; trust clusters that appear stably |
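
The "try several" advice can be sketched as a small sweep, assuming scikit-learn. The values 5, 30, and 50 and the 300-point subsample are arbitrary picks for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)
X, y = X[:300], y[:300]  # subsample to keep three runs short

# Each perplexity must be smaller than the number of samples.
perplexities = [5, 30, 50]

layouts = {}
for p in perplexities:
    # Same seed for every run, so perplexity is the only thing that changes.
    layouts[p] = TSNE(n_components=2, perplexity=p, random_state=0).fit_transform(X)

for p, emb in layouts.items():
    print(f"perplexity={p}: layout shape {emb.shape}")
```

Plot the layouts side by side, colored by `y`, and trust only the groupings that show up across most of them.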
t-SNE vs PCA
| Aspect | PCA | t-SNE |
|---|---|---|
| Goal | compression (keep variance) | visualization (preserve local neighborhoods) |
| Linearity | linear | nonlinear |
| Use for modeling? | yes | no |
| Preserves global structure | yes | no |
| Preserves local clusters | weakly | strongly |
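
The "use for modeling?" row shows up directly in the scikit-learn API: a fitted `PCA` can project unseen data with `transform`, while `TSNE` only offers `fit_transform` on the data it was given. A short sketch, where the 10-component setting and the 75/25 split are arbitrary assumptions:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# PCA learns a linear map on the training data and reuses it on new data.
pca = PCA(n_components=10).fit(X_train)
X_test_reduced = pca.transform(X_test)

# Fraction of the training variance kept by the 10 components.
kept = pca.explained_variance_ratio_.sum()

# t-SNE has no transform method, so it cannot embed unseen points.
tsne_has_transform = hasattr(TSNE(), "transform")

print("reduced test shape:", X_test_reduced.shape)
print("variance kept by PCA:", round(float(kept), 3))
print("TSNE has transform():", tsne_has_transform)
```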
Cousin: UMAP
| Item | Detail |
|---|---|
| Compared with t-SNE | often faster; preserves more global structure |
| When you have a choice | try both |