Summary: Seeing high-dimensional data: t-SNE
t-SNE produces a 2D picture in which similar high-dimensional points end up near each other, revealing clusters that PCA’s straight axes can hide. The picture is deceptive in specific ways, though, and the whole skill is reading it without over-reading it. It is the source of most of the colorful blob-of-clusters figures you see in modern AI. This summary is the scan version of the full lesson, which closes the unsupervised phase.
Core ideas
- Visualization-only. t-SNE is for plotting, not preprocessing. Use PCA or similar to reduce data for a model; use t-SNE to see it.
- How it works: measure pairwise similarities in high-D, then iteratively adjust the 2D positions until the 2D similarities match the high-D ones as closely as possible. Similar points are pulled together; dissimilar ones are pushed apart.
- The gift: clusters that PCA flattens often appear as clearly separated blobs in t-SNE (MNIST digits are the classic example: roughly ten blobs, one per digit).
- The catch: t-SNE preserves local structure (neighbors stay near) and does not preserve global structure. Three misreadings to avoid:
  - Cluster-to-cluster distance on a t-SNE plot is meaningless.
  - Cluster size on a t-SNE plot is meaningless.
  - Different runs produce different layouts; trust what is stable across runs.
- Perplexity is the main knob: roughly how many neighbors each point attends to. Common values run from 5 to 50; too low and real clusters fracture into fragments, too high and distinct clusters blur together.
- UMAP is a faster sibling that often preserves more global structure; worth trying alongside t-SNE.
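The ideas above can be tried out in a few lines. What follows is a minimal sketch, assuming scikit-learn and NumPy are installed; the digits dataset, the 500-point subset, the seeds, and the perplexity values are illustrative choices and are not taken from the lesson. It computes a PCA baseline, runs t-SNE twice with different seeds to expose run-to-run variation, and then varies perplexity.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Small handwritten-digit dataset bundled with scikit-learn (64 features per image).
X, y = load_digits(return_X_y=True)
X, y = X[:500], y[:500]  # arbitrary subset, only to keep the sketch fast

# Linear baseline for comparison: the first two principal components.
pca_2d = PCA(n_components=2).fit_transform(X)


def embed(perplexity, seed):
    """Run t-SNE once and return an (n_samples, 2) array of plot coordinates."""
    # init="random" is chosen here so that different seeds give visibly
    # different layouts; it is an assumption of this sketch, not a recommendation.
    tsne = TSNE(n_components=2, perplexity=perplexity,
                init="random", random_state=seed)
    return tsne.fit_transform(X)


# Same perplexity, different seeds: the layouts differ, so only features
# that survive across runs deserve trust.
runs = {seed: embed(perplexity=30, seed=seed) for seed in (0, 1)}

# Same seed, different perplexity: the main knob, at both ends of the common range.
sweep = {p: embed(perplexity=p, seed=0) for p in (5, 50)}
```

Each result is an array of 2D coordinates that can be passed to any scatter-plot routine and colored by `y`. Comparing `runs[0]` with `runs[1]` side by side is a quick way to see which groupings are stable and which aspects of the layout are arbitrary.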
What changes for you
You have almost certainly seen t-SNE plots before: word embeddings clustered by topic, image features grouped by category, cells separating into types by gene expression. The skill this lesson gives you is reading them honestly: trust the clusters (points that group together usually are genuinely similar), but resist the urge to read meaning into how those clusters are arranged on the page or how big they appear. That small literacy skill protects against a lot of confidently wrong AI claims. With this, Phase 3 closes, and we have the full unsupervised toolkit: clustering in two flavors, and dimensionality reduction, both linear and nonlinear. The final phase of the track turns to the question hovering over everything we built: how do you know if any of these models is actually any good?