Practice: Squeezing dimensions: PCA

Self-check

Seven short questions. Try to answer each one before opening the collapsible.

1. In one sentence, what is the first principal component?

Show answer

The single direction through the data along which the points vary the most: PC1 is the “main axis” of the data cloud.

2. How are the later principal components defined?

Show answer

PC2 is the direction perpendicular to PC1 that captures the most remaining variation; PC3 is perpendicular to both and captures the next-most; and so on. They are ordered by how much variance each one explains.
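To see the ordering and the perpendicularity concretely, here is a minimal NumPy sketch. It assumes PCA is computed as an eigendecomposition of the covariance matrix (one common way to do it); the toy data and all variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy data: 200 points, 3 correlated features.
mix = np.array([[3.0, 1.0, 0.5],
                [0.0, 1.0, 0.3],
                [0.0, 0.0, 0.2]])
X = rng.normal(size=(200, 3)) @ mix
Xc = X - X.mean(axis=0)                      # center each feature

# Eigenvectors of the covariance matrix are the principal directions;
# each eigenvalue is the variance along its direction.
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns ascending order
order = np.argsort(eigvals)[::-1]            # re-sort: largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print("variance explained:", eigvals / eigvals.sum())
# Off-diagonal entries near 0 mean the PCs are mutually perpendicular.
print(np.round(eigvecs.T @ eigvecs, 6))
```

The columns of `eigvecs` are PC1, PC2, PC3 in that order, and the product `eigvecs.T @ eigvecs` comes out as the identity matrix because the directions are perpendicular unit vectors.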

3. Why does “most variation” generally mean “most information”?

Show answer

Because the variation between points is what distinguishes them. A direction with little variation has nearly identical values for every point and carries almost no information; the direction of greatest spread captures most of what makes the points different.

4. What is a principal component, mechanically?

Show answer

A weighted combination of the original features. The weights, called loadings, tell you which original features contribute most to that PC, which gives a thin layer of interpretability.
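The "weighted combination" can be written out by hand. The sketch below uses hypothetical two-feature data and checks that a point's PC1 score is just loading times feature, summed over the features.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical toy data: two correlated features.
x = rng.normal(size=300)
X = np.column_stack([x, 0.5 * x + 0.1 * rng.normal(size=300)])
Xc = X - X.mean(axis=0)

eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
w = eigvecs[:, -1]            # PC1 loadings (eigh sorts ascending, so take the last)

# PC1 score of every point, as a matrix product ...
scores = Xc @ w
# ... and the same thing spelled out as a weighted sum of the features.
manual = w[0] * Xc[:, 0] + w[1] * Xc[:, 1]

print("loadings:", w)
print("same scores:", np.allclose(scores, manual))
```

The sign of `w` is arbitrary (flipping every loading describes the same axis), so only the relative sizes and signs of the loadings are meaningful.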

5. Why must you standardize features before running PCA?

Show answer

Because PCA chases variance, and a feature on a large numeric scale would appear to vary far more than a small-scale one (even if both carry similar information), hijacking the principal components. Standardizing each feature to mean 0 and variance 1 removes that bias.
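A small experiment makes the hijacking visible. The sketch assumes two made-up, correlated features, income in dollars and age in years, and compares the PC1 direction before and after standardizing; the helper `pc1` is hypothetical, not a library function.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
z1, z2 = rng.normal(size=n), rng.normal(size=n)
# Hypothetical features on very different numeric scales.
income = 50_000 + 15_000 * z1                 # dollars
age = 40 + 12 * (0.6 * z1 + 0.8 * z2)         # years, correlated with income
X = np.column_stack([income, age])

def pc1(data):
    """Return the unit-length PC1 direction of `data` (rows are points)."""
    centered = data - data.mean(axis=0)
    _, vecs = np.linalg.eigh(np.cov(centered, rowvar=False))
    return vecs[:, -1]                        # largest-variance direction

raw_pc1 = pc1(X)
Z = (X - X.mean(axis=0)) / X.std(axis=0)      # mean 0, variance 1 per feature
std_pc1 = pc1(Z)

print("raw PC1 loadings (abs):         ", np.abs(raw_pc1))
print("standardized PC1 loadings (abs):", np.abs(std_pc1))
```

On the raw data PC1 points almost entirely along income, simply because dollars produce bigger numbers than years. After standardizing, both features get equal weight in PC1.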

6. Name two jobs PCA is the right tool for.

Show answer

Any two of: visualizing high-dimensional data (plot PC1 vs PC2), speeding up downstream models by reducing features, de-noising by dropping low-variance PCs, or preprocessing before clustering or classification.

7. What kind of structure can PCA NOT capture, and why?

Show answer

Non-linear (curved or clustered) structure. PCA finds straight axes of variation, so data that lies on a curved surface gets flattened. Visualizing such structure needs a nonlinear method like t-SNE (the next lesson).
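A circle is a handy illustration of this limit, sketched below with made-up data. The points are fully described by one number (the angle), yet no straight axis captures that, so PCA splits the variance evenly between two components.

```python
import numpy as np

# Points on a unit circle: a one-dimensional curve embedded in 2-D.
theta = np.linspace(0.0, 2.0 * np.pi, 400, endpoint=False)
X = np.column_stack([np.cos(theta), np.sin(theta)])
Xc = X - X.mean(axis=0)

# Variance along each principal direction, largest first.
eigvals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]
explained = eigvals / eigvals.sum()
print("variance explained:", explained)
```

Both components explain about half of the variance, so dropping either one collapses the circle onto a line segment and loses the structure.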

Try it yourself: read the scree plot

A PCA on a six-feature dataset gives this variance explained per PC:

PC1: 60%
PC2: 25%
PC3:  8%
PC4:  4%
PC5:  2%
PC6:  1%

How many PCs do you keep to capture at least 90% of the total variance? Where is the “elbow” in the scree plot?

Show answer

Adding from the top: PC1+PC2 = 85% (not enough), PC1+PC2+PC3 = 93% (over 90%). So keep the first 3 PCs, reducing six features to three while retaining 93% of the variance.

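The elbow sits around PC3: the per-PC contribution falls steeply from 60% to 25% to 8%, then flattens out (4%, 2%, 1%), so the later PCs each add little. Both the 90% threshold and the elbow point to keeping about 3 PCs: the data effectively lives in about 3 dimensions, even though it was recorded in 6.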
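The running total is easy to check in code. This sketch uses the six percentages from the exercise; the variable names and the 90% `target` are just for illustration.

```python
import numpy as np

# Variance explained per PC, from the exercise above.
explained = np.array([0.60, 0.25, 0.08, 0.04, 0.02, 0.01])
cumulative = np.cumsum(explained)

target = 0.90
# First position where the running total reaches the target, as a 1-based count.
k = int(np.argmax(cumulative >= target)) + 1

for i, (e, c) in enumerate(zip(explained, cumulative), start=1):
    print(f"PC{i}: {e:.0%} (cumulative {c:.0%})")
print("PCs to keep:", k)
```

Changing `target` to 0.95 shows how the threshold rule trades extra components for extra retained variance.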

Try it yourself: read the loadings

PCA on a customer dataset gives this loadings table for PC1:

income    :  0.70
spending  :  0.60
age       : -0.30
height    :  0.05

What can you read about PC1 from these loadings?

Show answer

PC1 is driven mainly by income and spending (both with large positive loadings, around 0.6 to 0.7). Age has a smaller negative loading (about -0.3), meaning PC1 moves down slightly as age rises. Height (0.05) is essentially absent from PC1: it contributes almost nothing to this direction.

Read together, PC1 looks like an “economic-activity” axis: high PC1 means high income and high spending, with a slight tendency toward younger ages. That is the kind of interpretive sentence loadings let you write, with appropriate hedging since a PC is still a mixture, not an original feature.
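To make the reading concrete, the sketch below ranks the features by absolute loading and scores two hypothetical customers. The customer values are invented and assumed to be standardized (z-scores); only the loadings come from the table above.

```python
import numpy as np

features = ["income", "spending", "age", "height"]
loadings = np.array([0.70, 0.60, -0.30, 0.05])    # PC1 loadings from the table

# Rank features by how strongly they drive PC1, ignoring sign.
ranked = [features[i] for i in np.argsort(-np.abs(loadings))]
print("most to least influential:", ranked)

# Two hypothetical customers, each feature expressed as a z-score.
customer_a = np.array([1.5, 1.2, -0.5, 0.0])      # high earner and spender, a bit younger
customer_b = np.array([-1.0, -0.8, 1.0, 2.0])     # low earner and spender, older, tall

# PC1 score = sum of loading * standardized feature value.
score_a = float(loadings @ customer_a)
score_b = float(loadings @ customer_b)
print("PC1 score A:", round(score_a, 2))
print("PC1 score B:", round(score_b, 2))
```

Customer A lands high on PC1 and customer B lands low, and B's large height value barely moves the score, which matches the reading of the loadings.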

Flashcards

Ten cards.

Q. What is the first principal component (PC1)?

The single direction through the data along which points vary the most: the data cloud’s main axis.

Q. How are PC2, PC3, ... defined?

Each is perpendicular to all previous PCs and captures the most remaining variation. PCs are ordered by variance explained, largest first.

Q. Why does maximizing variance generally mean preserving information?

Variation is what distinguishes points. A near-constant direction carries almost nothing; the high-variance directions hold most of what makes the points different.

Q. What is a principal component, mechanically?

A weighted combination of the original features. The weights (loadings) tell you which features contribute most to that PC.

Q. What is variance explained?

The fraction of the data’s total spread captured by a given PC. Plotted across PCs it forms a scree plot, used to decide how many PCs to keep.

Q. A rule of thumb for how many PCs to keep?

Enough to cover about 90 to 95 percent of total variance, or up to a clear elbow in the scree plot.

Q. Why must you standardize features before PCA?

PCA is variance-driven and scale-sensitive. Without standardization, large-scale features hijack the components. Rescale each feature to mean 0 and variance 1.

Q. Name two uses for PCA.

Any two: visualization (plot PC1 vs PC2), speeding up downstream models, de-noising by dropping low-variance PCs, or preprocessing before clustering or classification.

Q. What can PCA NOT capture?

Non-linear, curved, or cluster-shaped structure. PCA finds straight axes only; nonlinear methods like t-SNE are needed for those.

Q. Why are PCs less interpretable than original features?

Each PC is a mixture (weighted combination) of the original features, so it does not have a single direct meaning the way “income” or “age” does.