Cheatsheet: Seeing high-dimensional data: t-SNE

| Item | Detail |
|---|---|
| Purpose | visualization only (not preprocessing) |
| Output | a 2D layout where similar high-D points appear near each other |
| How it works | match 2D pairwise similarities to high-D pairwise similarities, iteratively |
| Strong example | MNIST: PCA smears digits, t-SNE produces 10 separated blobs |

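
To make the "How it works" row concrete, here is a minimal sketch of the quantity t-SNE tries to shrink: the mismatch (KL divergence) between high-D similarities and 2D similarities. It is a simplification, not a full implementation. It assumes a single global `sigma` instead of the per-point bandwidths that real t-SNE calibrates from perplexity, and it only scores a given layout rather than optimizing one. All function names and the toy data are illustrative.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def high_d_similarities(points, sigma=1.0):
    """Gaussian pairwise similarities, normalized to sum to 1 over all pairs.
    Assumption: one global sigma (real t-SNE picks one per point)."""
    n = len(points)
    w = [[0.0 if i == j else
          math.exp(-sq_dist(points[i], points[j]) / (2 * sigma ** 2))
          for j in range(n)] for i in range(n)]
    total = sum(map(sum, w))
    return [[v / total for v in row] for row in w]

def low_d_similarities(layout):
    """Heavy-tailed (Student-t) pairwise similarities in the 2D layout."""
    n = len(layout)
    w = [[0.0 if i == j else 1.0 / (1.0 + sq_dist(layout[i], layout[j]))
          for j in range(n)] for i in range(n)]
    total = sum(map(sum, w))
    return [[v / total for v in row] for row in w]

def kl_divergence(p, q, eps=1e-12):
    """Mismatch between the two similarity tables; lower is a better layout."""
    return sum(pij * math.log((pij + eps) / (qij + eps))
               for prow, qrow in zip(p, q)
               for pij, qij in zip(prow, qrow) if pij > 0)

# Toy data: two tight groups of three points in 3D.
points = [(0, 0, 0), (0.1, 0, 0), (0, 0.1, 0),
          (5, 5, 5), (5.1, 5, 5), (5, 5.1, 5)]
# A layout that keeps each group together, and one that interleaves them.
good = [(0, 0), (0.1, 0), (0, 0.1), (3, 3), (3.1, 3), (3, 3.1)]
bad = [(0, 0), (3, 3), (0.1, 0), (3.1, 3), (0, 0.1), (3, 3.1)]

p = high_d_similarities(points)
kl_good = kl_divergence(p, low_d_similarities(good))
kl_bad = kl_divergence(p, low_d_similarities(bad))
print("KL for grouped layout:    ", kl_good)
print("KL for interleaved layout:", kl_bad)
```

The layout that keeps high-D neighbors together scores a lower mismatch. The iterative part of t-SNE is gradient descent on this kind of cost, starting from a random or PCA-based layout.
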
| Aspect | Preserved? |
|---|---|
| Local structure (who is near whom inside a cluster) | yes |
| Global structure (distance/arrangement between clusters) | no |
| Cluster size on the page | no (artifact of layout) |
| Layout across runs | no (varies with seed) |

| Misreading | What’s actually true |
|---|---|
| "Clusters drawn close together are more related" | between-cluster distance is meaningless |
| "Big cluster = more variation" | cluster size on the page is meaningless |
| "Same data always gives the same picture" | different seeds give different layouts |

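
Because coordinates change from seed to seed, comparing two runs by raw positions is not informative. One possible way to compare them is to check whether each point keeps the same nearest neighbors in both layouts. The sketch below is an illustrative helper, not part of any t-SNE library. The function names, the choice of `k`, and the hand-made layouts (standing in for real t-SNE output) are all assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two 2D points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn(layout, i, k):
    """Indices of the k nearest neighbors of point i in a layout."""
    ranked = sorted((sq_dist(layout[i], layout[j]), j)
                    for j in range(len(layout)) if j != i)
    return {j for _, j in ranked[:k]}

def neighbor_overlap(layout_a, layout_b, k):
    """Mean fraction of k-nearest neighbors shared by the two layouts."""
    n = len(layout_a)
    shared = [len(knn(layout_a, i, k) & knn(layout_b, i, k)) / k
              for i in range(n)]
    return sum(shared) / n

# Hand-made stand-ins for two runs: run_b is run_a rotated 90 degrees and
# shifted, so every coordinate differs but every neighborhood is the same.
run_a = [(0, 0), (1, 0), (0, 2), (10, 10), (11, 10), (10, 12)]
run_b = [(-y + 100, x - 50) for x, y in run_a]

# run_c swaps the positions of points 1 and 4, which breaks neighborhoods.
run_c = list(run_a)
run_c[1], run_c[4] = run_c[4], run_c[1]

print("a vs b:", neighbor_overlap(run_a, run_b, k=2))
print("a vs c:", neighbor_overlap(run_a, run_c, k=2))
```

A score near 1 suggests the two pictures agree on local structure even if they look different on the page. A low score suggests the neighborhoods themselves are unstable.
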
| Perplexity setting | Effect |
|---|---|
| Too low (e.g., 5 on a non-tiny dataset) | many tiny fragmented clusters |
| Too high (e.g., 100+) | everything blurs into one blob |
| Typical range | 5 to 50 |
| Best practice | try several; trust clusters that appear stably |

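
The reason low values fragment and high values blur is easier to see once perplexity is read as an "effective number of neighbors". For one point, t-SNE turns its distances to all other points into a probability distribution and tunes a bandwidth until that distribution's perplexity (2 to the power of its entropy) hits the chosen value. The sketch below illustrates this for a single point. The function names, the toy distances, and the bisection bounds are assumptions for illustration.

```python
import math

def conditional_probs(sq_dists, sigma):
    """Turn one point's squared distances to the others into probabilities.
    The smallest distance is subtracted first for numerical stability."""
    d_min = min(sq_dists)
    w = [math.exp(-(d - d_min) / (2 * sigma ** 2)) for d in sq_dists]
    total = sum(w)
    return [v / total for v in w]

def perplexity(probs):
    """2 ** entropy: roughly how many neighbors carry real weight."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2 ** entropy

def sigma_for_perplexity(sq_dists, target, lo=1e-3, hi=1e3, steps=60):
    """Bisection search for the bandwidth whose perplexity matches target."""
    for _ in range(steps):
        mid = (lo + hi) / 2
        if perplexity(conditional_probs(sq_dists, mid)) < target:
            lo = mid  # too few effective neighbors: widen the bandwidth
        else:
            hi = mid  # too many effective neighbors: narrow it
    return (lo + hi) / 2

# Toy squared distances from one point: four close neighbors, four far ones.
sq_dists = [1, 2, 3, 4, 50, 60, 70, 80]

for target in (2.0, 4.0, 6.0):
    sigma = sigma_for_perplexity(sq_dists, target)
    probs = conditional_probs(sq_dists, sigma)
    print(f"target={target}  sigma={sigma:.3f}  "
          f"perplexity={perplexity(probs):.3f}")
```

A small target forces a narrow bandwidth, so each point attends to only a couple of neighbors and clusters can splinter. A large target forces a wide bandwidth that reaches into other groups, which pulls everything together.
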
| | PCA | t-SNE |
|---|---|---|
| Goal | compression (keep variance) | visualization (preserve local neighborhoods) |
| Linearity | linear | nonlinear |
| Use for modeling? | yes | no |
| Preserves global structure | yes | no |
| Preserves local clusters | weakly | strongly |

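
One commonly cited reason behind the "Use for modeling?" row is that PCA learns a fixed linear map, so the same projection can be applied to points it has never seen, whereas standard t-SNE only produces coordinates for the points it was run on. The sketch below shows the PCA side with a bare-bones power iteration for the first principal component. It is an illustration under simplifying assumptions (tiny noiseless data, one component, fixed iteration count), and the function names are hypothetical.

```python
import math

def mean_center(rows):
    """Return the column means and the mean-centered rows."""
    n, d = len(rows), len(rows[0])
    mu = [sum(r[k] for r in rows) / n for k in range(d)]
    return mu, [[r[k] - mu[k] for k in range(d)] for r in rows]

def first_component(centered, iters=200):
    """Leading eigenvector of the covariance matrix via power iteration."""
    n, d = len(centered), len(centered[0])
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1)
            for b in range(d)] for a in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        v = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v

def project(point, mu, v):
    """Apply the learned linear map to any point, seen or unseen."""
    return sum((p - m) * c for p, m, c in zip(point, mu, v))

# Training points lying on the line y = 2x.
train = [(0, 0), (1, 2), (2, 4), (3, 6)]
mu, centered = mean_center(train)
v = first_component(centered)

# The same mean and direction work for a point not in the training set.
new_point = (4, 8)
print("direction:", v)
print("score of unseen point:", project(new_point, mu, v))
```
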
| | Note |
|---|---|
| Often faster than t-SNE | preserves more global structure than t-SNE |
| When you have a choice | try both |