Practice: Seeing high-dimensional data: t-SNE

Self-check

Seven short questions. Try to answer each one before reading the answer that follows it.

1. What is t-SNE for, and what is it NOT for?

It is for visualization: producing a 2D picture in which similar high-dimensional points appear near each other. It is not for preprocessing: do not feed t-SNE output into a downstream model.
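One practical symptom of this split can be checked directly. The sketch below assumes scikit-learn is installed; it only inspects which methods the two estimators expose and does not fit anything.

```python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# PCA learns a reusable projection: fit once, then transform new rows.
pca_can_transform = hasattr(PCA(n_components=2), "transform")

# scikit-learn's TSNE offers fit_transform only: the layout exists solely
# for the points it was fitted on, so unseen rows cannot be placed into
# the same picture afterwards.
tsne_can_transform = hasattr(TSNE(n_components=2), "transform")

print("PCA has transform: ", pca_can_transform)
print("TSNE has transform:", tsne_can_transform)
```

A model trained on t-SNE coordinates would therefore have no consistent way to receive new data, which is one more reason to keep t-SNE on the visualization side.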

2. How does t-SNE work, at the intuition level?

It measures pairwise similarities in the high-dimensional data, then iteratively shuffles 2D positions until the 2D similarities match the high-D ones as closely as possible. Similar points are pulled together; dissimilar points are pushed apart.
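A minimal sketch of what that looks like in practice, assuming scikit-learn and a synthetic toy dataset (the blob data and parameter values are illustrative choices, not recommendations). After fitting, `kl_divergence_` reports how far the 2D similarities still are from the high-dimensional ones; lower means a closer match.

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Toy data (an assumption for illustration): 150 points in 10 dimensions,
# drawn around 3 centers.
X, y = make_blobs(n_samples=150, n_features=10, centers=3, random_state=0)

# The iterative adjustment of 2D positions happens inside fit_transform.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
embedding = tsne.fit_transform(X)

print(embedding.shape)        # one 2D position per input point
print(tsne.kl_divergence_)    # remaining mismatch after optimization
```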

3. What does t-SNE preserve, and what does it not preserve?

It preserves local structure: who is near whom inside a cluster. It does not preserve global structure: the distances or arrangement between clusters on the page are largely artifacts of the optimization, not measurements.

4. Why is the distance between two clusters on a t-SNE plot meaningless?

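t-SNE's objective is dominated by each point's near neighbors. Pairs that are far apart in the original data contribute almost nothing to it, so how far apart they land in 2D is barely constrained. Two clusters drawn close together on the page are not necessarily more related than two drawn far apart; the layout is incidental.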

5. What does cluster size on a t-SNE plot tell you?

Almost nothing. t-SNE adapts its notion of distance to the local density of the data, so it tends to expand tight clusters and shrink loose ones. A bigger blob is not necessarily looser or more spread out in the original data.

6. What does the perplexity parameter control, and what range is typical?

Roughly, how many neighbors each point pays attention to. Too low fragments the picture into many tiny clusters; too high blurs everything together. Common values are between 5 and 50.
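The "how many neighbors" reading comes from the definition: perplexity is 2 raised to the entropy (in bits) of a point's neighbor distribution. The sketch below uses only the standard library; the two example distributions are made up for illustration.

```python
import math


def perplexity(probabilities):
    """Return 2**H(P), where H is the Shannon entropy of P in bits."""
    entropy_bits = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return 2 ** entropy_bits


# Attention spread evenly over 8 neighbors: perplexity is exactly 8.
uniform_8 = [1 / 8] * 8

# Attention concentrated on one neighbor: the effective count drops,
# even though 8 neighbors still have nonzero weight.
peaked = [0.9] + [0.1 / 7] * 7

print(perplexity(uniform_8))  # 8.0
print(perplexity(peaked))
```

Setting perplexity to 30 therefore asks t-SNE to size each point's neighborhood so that it behaves like roughly 30 evenly weighted neighbors.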

7. Why should you run t-SNE more than once with different seeds and perplexities?

Because a single run is one of many possible layouts. The clusters that appear consistently across runs are usually real; clusters that show up at only one setting should be treated with suspicion.
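A rough sketch of such a check, assuming scikit-learn and toy data whose true groups are known (real data rarely comes with that luxury). The helper `neighbor_purity` is a hypothetical name introduced here: it measures how often a point's nearest neighbors on the page belong to the same original group.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from sklearn.neighbors import NearestNeighbors


def neighbor_purity(embedding, labels, k=10):
    """Mean fraction of each point's k nearest 2D neighbors sharing its label."""
    nn = NearestNeighbors(n_neighbors=k).fit(embedding)
    # With no query argument, a point is not counted as its own neighbor.
    idx = nn.kneighbors(return_distance=False)
    same = labels[idx] == labels[:, None]
    return float(same.mean())


# Toy data with known groups: 4 well-separated blobs in 10 dimensions.
X, labels = make_blobs(n_samples=200, n_features=10, centers=4, random_state=0)

purities = {}
for seed in (0, 1, 2):
    emb = TSNE(perplexity=30, init="random", random_state=seed).fit_transform(X)
    purities[seed] = neighbor_purity(emb, labels)
    # Where each blob lands on the page; these positions depend on the seed.
    centroids = np.array([emb[labels == c].mean(axis=0) for c in range(4)])
    print(f"seed={seed} purity={purities[seed]:.2f}")
    print(np.round(centroids, 1))
```

On data like this, the purity score is the part worth comparing across seeds; the printed centroid positions are the part that should not be read as meaningful.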

Try it yourself: valid or misreading?

A t-SNE plot shows four well-separated blobs of points (call them blobs 1, 2, 3, 4). Label each statement VALID or MISREADING.

A. Points inside blob 1 are similar to each other in the original data.
B. Blob 1 is drawn close to blob 2 and far from blob 4, so blob 1 is more
   similar to blob 2 than to blob 4.
C. Blob 3 looks twice as large as blob 1, so blob 3 has more variation in
   the original data.
D. If we re-run t-SNE with a new random seed, we will see the same four
   blobs but possibly arranged differently on the page.

A: VALID. Points clustering together in a t-SNE plot are similar in the high-dimensional data; that is what the method is built to show.
B: MISREADING. Between-cluster distance on a t-SNE plot is largely arbitrary. You cannot read closeness of blobs as similarity of groups.
C: MISREADING. Cluster size on a t-SNE plot is not a measure of variation. t-SNE expands and shrinks clusters to lay them out.
D: VALID. The same clusters typically reappear (because they are real structure), but the arrangement of clusters across the page can change a lot with a new seed.

The single rule to carry: trust the clustering; do not trust the layout.

Try it yourself: tune the perplexity

You run t-SNE on a dataset:

With perplexity = 5, you see 50 tiny scattered clusters.
With perplexity = 100, you see one big blurry blob.

What is happening, and what would you try next?

Perplexity 5 is too low: each point pays attention to only its handful of closest neighbors, so the picture fractures into many tiny clusters that may not reflect real groups. Perplexity 100 is too high for the dataset: each point attends to so many neighbors that distinctions wash out and everything blurs together. The structure you want sits somewhere in between.

Try several values in the standard range, say 30, 40, and 50, and look for cluster structure that is stable across multiple settings. Stable clusters are likely real; clusters that appear at only one perplexity should be treated as suspicious. Re-running with a few different random seeds at each perplexity is also worth the cost.
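One way to organize that sweep is sketched below, assuming scikit-learn and its bundled digits dataset as a stand-in for your own data. The `trustworthiness` score is used only as a rough numeric check that local neighborhoods survive in each layout; it does not replace plotting the embeddings and comparing them by eye.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE, trustworthiness

# Small real dataset bundled with scikit-learn; 300 rows keeps the sweep fast.
X = load_digits().data[:300]

embeddings = {}  # (perplexity, seed) -> 2D layout, kept for plotting later
scores = {}      # (perplexity, seed) -> trustworthiness of that layout

for perplexity in (30, 40, 50):
    for seed in (0, 1):
        # Random initialization so that the seed actually changes the layout.
        tsne = TSNE(perplexity=perplexity, init="random", random_state=seed)
        emb = tsne.fit_transform(X)
        embeddings[(perplexity, seed)] = emb
        scores[(perplexity, seed)] = trustworthiness(X, emb, n_neighbors=5)

for key, score in sorted(scores.items()):
    print(key, round(score, 3))
```

Keeping every layout in `embeddings` makes it easy to draw them side by side afterwards and see which groups recur in all six pictures.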

Flashcards

Ten cards.

Q. What is t-SNE for?

Visualization only: producing a 2D picture in which similar high-dimensional points end up near each other so clusters jump out.

Q. Why is t-SNE not a preprocessor?

Its 2D output is shaped to look good on a page, not to be a good representation for downstream modeling. Use PCA or similar to reduce; use t-SNE to see.

Q. How does t-SNE work, intuitively?

It measures pairwise similarities in high-D, then iteratively shuffles 2D positions so the 2D similarities match the high-D ones: similar points pull together, dissimilar push apart.

Q. What does t-SNE preserve?

Local structure: who is near whom inside a cluster. Nearest neighbors in high-D stay nearby in 2D.

Q. What does t-SNE NOT preserve?

Global structure: the distances between clusters and their arrangement on the page are largely arbitrary, not measurements.

Q. Why is cluster-to-cluster distance meaningless on a t-SNE plot?

t-SNE optimizes for local neighbors and largely ignores global distances. Two clusters drawn close together are not necessarily more related than two far apart.

Q. What does cluster size on a t-SNE plot tell you?

Almost nothing. t-SNE expands tight clusters and shrinks loose ones to lay them out. Do not read size as variation or density.

Q. What is perplexity?

The main tuning knob: roughly, how many neighbors each point pays attention to. Common values 5 to 50. Too low fractures; too high blurs.

Q. Why run t-SNE multiple times?

A single run is one of many possible layouts. Clusters appearing stably across different seeds and perplexities are usually real; one-off clusters are suspicious.

Q. What is UMAP, in one line?

A faster sibling of t-SNE that often preserves more global structure. A common alternative; try both when you can.