Squeezing dimensions: PCA

The first two lessons of this phase grouped unlabeled points: k-means into flat clusters, hierarchical clustering into a tree. The other half of unsupervised learning is the opposite move. Instead of partitioning your data, you compress it: take points that each have many features and find a smaller set of new features that still capture most of what matters.

This is the right reach whenever you have too many features. A photo has tens of thousands of pixels; a survey has hundreds of questions; a gene-expression dataset has thousands of measurements per sample. You cannot plot 10,000-dimensional points, models slow down, many features are redundant or noisy, and the so-called curse of dimensionality makes patterns harder to find. The classic first tool for shrinking a dataset’s dimensions while keeping the signal is principal component analysis, almost always called PCA.

What PCA is looking for

PCA’s central question is simple to state: out of every possible direction through the data, which direction does the data vary along the most?

That direction is the first principal component (PC1): the single line through the data along which the points are most spread out. The second principal component, PC2, is the direction perpendicular to PC1 that captures the next-most variation. PC3 is perpendicular to both and captures the next-most after that, and so on, one new axis per original feature.

The trick is that PCs are listed in order of how much variation they capture. PC1 explains the most; PC2 less; PC3 less still. Often the first two or three PCs together already account for most of the spread in the data, so you can throw away the later ones and lose surprisingly little.

Why “most variation” equals “most information”

Watch how it works in two dimensions. Picture a cloud of points that forms a long, narrow oval, slanted across the page. The long axis of the oval is the direction of greatest variation: that is PC1. The short axis is PC2.

If the oval is very stretched, almost all the difference between the points lies along PC1; their positions along PC2 are nearly the same. So you can describe each point with just one number (where it sits along PC1) and lose only the small spread along PC2. You have gone from two features to one with almost no loss. Now scale that up: take 100 features, and you may find that 3 well-chosen directions hold 95 percent of the variation. You can plot or model those three instead of the original hundred.

How much you are keeping: variance explained

Each PC captures a measurable fraction of the total variance (spread). Plot those fractions for PC1, PC2, PC3, and so on, and you get what is called a scree plot. It never rises from left to right (each later PC captures no more than the one before), and it often shows a clear elbow where adding more PCs stops mattering. A common rule of thumb is to keep enough PCs to cover 90 to 95 percent of the variance, then drop the rest. That percentage tells you how much of the original spread you kept; the remainder is what you traded away.
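
The rule of thumb can be sketched in a few lines. This is only an illustration: the per-PC variances are made-up numbers, and components_needed is a hypothetical helper name, not a library function.

```python
from itertools import accumulate

def components_needed(variances, target=0.90):
    """Return (k, fractions): the smallest k whose first k PCs reach
    `target` of the total variance, plus each PC's share of the total."""
    total = sum(variances)
    fractions = [v / total for v in variances]
    for k, cumulative in enumerate(accumulate(fractions), start=1):
        if cumulative >= target:
            return k, fractions
    return len(variances), fractions

# Hypothetical per-PC variances, already sorted from PC1 downward.
variances = [6.0, 2.5, 1.0, 0.3, 0.2]
k, fractions = components_needed(variances, target=0.90)

print(fractions)  # the heights of the bars in a scree plot
print(k)          # 3: PC1 and PC2 reach only 85 percent, adding PC3 reaches about 95
```

Plotting fractions against the PC number gives the scree plot; the helper simply walks along it until the running total crosses the target.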

What a principal component actually is

A principal component is not a new measurement you go take in the world; it is a recipe. Each PC is a weighted combination of the original features, like “PC1 = 0.6 * income + 0.4 * spending - 0.1 * age.” The weights, called loadings, tell you which original features contribute most to that PC, which gives a thin slice of interpretability: if PC1’s biggest loadings are on income and spending, you can read PC1 as “something like an overall economic-activity score.”

The math under the hood finds these directions from the data’s covariance structure, but the intuition is the right thing to carry: PCs are directions in feature space, ordered by how much of the data’s variation they capture.
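
For readers who want to see the covariance route once, here is a minimal sketch using NumPy. It assumes the data sits in an array with one row per sample and one column per feature; the function name pca and the random test data are illustrative choices, not part of any particular library.

```python
import numpy as np

def pca(X):
    """Sketch of PCA via the covariance matrix.

    Returns (components, variances, scores), sorted so PC1 comes first.
    Row i of `components` holds the loadings of PC(i+1).
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)              # PCA works on mean-centered data
    cov = np.cov(centered, rowvar=False)       # feature-by-feature covariance
    variances, vectors = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(variances)[::-1]        # re-sort so the largest comes first
    variances = variances[order]
    components = vectors[:, order].T           # each row is one PC's recipe
    scores = centered @ components.T           # each sample's position along each PC
    return components, variances, scores

# Made-up data: three features with very different amounts of spread.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * [5.0, 1.0, 0.2]
components, variances, scores = pca(X)

print(components[0])  # loadings of PC1: the weights in its recipe
print(variances)      # variance captured by PC1, PC2, PC3, in falling order
```

Each row of components is a recipe of the kind described above, and the spread of scores along each PC matches the corresponding entry of variances.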

Worked picture: from two dimensions to one

Take a small dataset of 2D points that lie roughly along a diagonal line:

points (x, y): (1,1), (2,2), (3,3), (4,4), (2,3), (3,2)
                                           ^^^^^  ^^^^^  small jitter off the line

The points stretch from lower-left to upper-right; they barely vary in the perpendicular direction. PCA on this would discover:

PC1 points along the diagonal (the direction of most spread).
PC2 points perpendicular to it (the small jitter direction).

If you project every point onto PC1, you get a single number per point (its position along the diagonal), and the original 2D arrangement is largely preserved: for these six points, PC1 alone accounts for about 91 percent of the total variance. You have gone from two features to one with very little loss. That is the move PCA makes, but in real datasets, the original number of features may be a hundred or a million.
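
The six points above are small enough to check directly. The following is a sketch using NumPy's covariance and eigendecomposition routines; the variable names are illustrative.

```python
import numpy as np

points = np.array([(1, 1), (2, 2), (3, 3), (4, 4), (2, 3), (3, 2)], dtype=float)

centered = points - points.mean(axis=0)
cov = np.cov(centered, rowvar=False)
variances, vectors = np.linalg.eigh(cov)   # ascending order: last column is PC1

pc1 = vectors[:, -1]
if pc1[0] < 0:        # an eigenvector's sign is arbitrary; point it up and to the right
    pc1 = -pc1

positions = centered @ pc1                 # one number per point
share = variances[-1] / variances.sum()    # fraction of variance kept by PC1

print(pc1)        # approximately [0.707, 0.707]: the diagonal
print(positions)  # approximately [-2.12, -0.71, 0.71, 2.12, 0, 0]
print(share)      # approximately 0.909
```

Note what was lost: the two jittered points, (2,3) and (3,2), land on the same position along PC1. Their difference lived entirely on PC2, which is the part that was dropped.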

When to reduce dimensions, and why

PCA is the standard reach for several jobs:

Visualization. You cannot plot 100-dimensional data, but you can plot it on PC1 and PC2 and often see the clusters, gradients, or outliers that were hiding in the full feature set.
Speed. Fewer features means faster training for whatever model you fit next.
De-noising. Low-variance PCs often capture noise rather than signal. Dropping them can actually improve a downstream model.
Preprocessing. Running PCA before clustering, classification, or plotting is routine.
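
The speed and de-noising points can be illustrated together. The sketch below uses entirely made-up data (two hidden factors spread across ten features, plus random noise) and finds the PCs with NumPy's SVD of the centered data, which yields the same directions as the covariance route. It keeps two PCs, then maps the compressed data back to ten features to compare against the noise-free version.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 300 samples whose 10 features are driven by 2 hidden factors.
hidden = rng.normal(size=(300, 2)) * [5.0, 3.0]
mixing = rng.normal(size=(2, 10))
clean = hidden @ mixing
noisy = clean + rng.normal(scale=0.5, size=clean.shape)

mean = noisy.mean(axis=0)
centered = noisy - mean
# Rows of vt are the principal components, already ordered PC1 first.
_, _, vt = np.linalg.svd(centered, full_matrices=False)

k = 2
top = vt[:k]                        # keep PC1 and PC2 only
reduced = centered @ top.T          # 300 x 2: the compressed data a later model would see
restored = reduced @ top + mean     # back to 10 features, without the dropped PCs

error_before = np.mean((noisy - clean) ** 2)
error_after = np.mean((restored - clean) ** 2)
print(reduced.shape, error_before, error_after)
```

In this setup error_after comes out smaller than error_before, because most of the noise lived on the eight discarded PCs. Real data rarely separates this cleanly, so treat the result as an illustration of the idea rather than a guarantee.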

The honest cautions matter too. PCs are mixtures of the original features, so they are less interpretable than the originals you started with. PCA assumes the interesting structure lives in the high-variance directions, which is usually a fair assumption but not always (sometimes a small but meaningful pattern hides on PC7). And PCA is linear: it finds straight axes, so it cannot capture curved or clustered structure cleanly. That last limit is exactly what t-SNE, the subject of the next lesson, was built to work around.

A scaling gotcha

One practical point bites almost every newcomer. PCA chases variance, so a feature measured in large numbers (income in dollars) will appear to vary far more than one measured in small numbers (height in meters), even if both carry similar information. The remedy is to standardize the features before running PCA: rescale them so each has a mean of zero and a variance of one. Skipping that step lets the large-scale features hijack the principal components.
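
A small sketch makes the hijacking visible. The income and height values here are randomly generated stand-ins, and pc1_direction is a hypothetical helper written for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: two independent features on very different scales.
income = rng.normal(loc=50_000, scale=15_000, size=500)   # dollars
height = rng.normal(loc=1.7, scale=0.1, size=500)         # meters
X = np.column_stack([income, height])

def pc1_direction(X):
    """Loadings of PC1: the top eigenvector of the covariance matrix."""
    cov = np.cov(X - X.mean(axis=0), rowvar=False)
    _, vectors = np.linalg.eigh(cov)     # ascending order, so PC1 is the last column
    return vectors[:, -1]

raw_pc1 = pc1_direction(X)

# Standardize: subtract each feature's mean, divide by its standard deviation.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_pc1 = pc1_direction(Z)

print(np.abs(raw_pc1))     # close to [1, 0]: income alone defines PC1
print(np.abs(scaled_pc1))  # about [0.707, 0.707]: both features weigh equally
```

On the raw data, PC1 is essentially the income axis, simply because dollars produce bigger numbers than meters. After standardizing, neither feature dominates by virtue of its units.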

Why this matters when you use AI

PCA is one of the most quietly used techniques in the field. It is the default first move for exploring an unfamiliar high-dimensional dataset, the standard way to visualize the high-dimensional vector embeddings that language models produce (project them to 2D with PCA, then look), and a common preprocessing step before another model. When someone shows you a 2D plot of a million-dimensional dataset with clear clusters or gradients, PCA is often what made the picture possible.

Common pitfalls

Forgetting to standardize the features. PCA is variance-driven and scale-sensitive. Unstandardized features distort the components.
Treating a PC as if it were an original feature. Each PC is a mixture; reading too much into “PC3 went up” needs the loadings to back it.
Assuming high variance equals signal. Usually true, sometimes not. The interesting pattern may live on a smaller PC.
Using PCA when structure is non-linear. PCA flattens curved or clustered shapes that lie on a non-flat surface. For visualization of those, reach for a nonlinear method like the one in the next lesson.

What you should remember

PCA finds directions of maximum variance in the data, ordered: PC1 captures the most, PC2 the next-most, and so on.
Each PC is a weighted combination of the original features; the loadings tell you which features matter for it.
Keeping the first few PCs compresses many features into a few while preserving most of the variation, sometimes 90 to 95 percent with just a handful of components.
It is linear and scale-sensitive: standardize the features first, and reach for a nonlinear method if the structure curves.

PCA gives you compression and a great preprocessing tool, but it is a linear method, and “the directions with the most variance” is not the same as “the layout that best shows the clusters.” When what you really want is a 2D picture that makes the natural groups jump out, you want a method built specifically for visualization. The next and final lesson of this phase is exactly that method: t-SNE.