
Squeezing dimensions: PCA

This is lesson 11 of Track 10, in Phase 3 (Finding structure without labels). By the end you will be able to explain what principal components (PCs) are and why you would reduce the number of dimensions in a dataset before modeling or visualizing it. The one capability to walk away with: given a dataset with many features, say what principal component analysis (PCA) would do with it, decide how many components to keep using a scree plot, and read a small loadings table to interpret what a PC stands for.

The track structurally mirrors StatQuest’s intuition-first machine learning videos, with Microsoft’s “ML For Beginners” as the hands-on companion for readers who want to build the models in code. Full attribution is in this lesson’s references.

This is the third lesson in the unsupervised phase. The first two (k-means and hierarchical clustering) grouped unlabeled points; this one switches to the other major unsupervised goal, compression: reducing many features to a few. The next and final lesson of the phase, t-SNE, takes on dimensionality reduction for the specific purpose of visualizing clusters, picking up where PCA’s linear axes fall short.

Prerequisite: Lesson 1, What machine learning actually is. You need the idea of features (each data point described by several numbers) and unsupervised learning (no labels, find structure), because PCA reduces the number of features in unlabeled data. No prior knowledge of linear algebra or eigenvectors is required; the lesson works at the level of “directions of maximum variance” without derivations.

  • Explain what a principal component is and how PCs are ordered
  • Use variance explained and a scree plot to choose how many PCs to keep
  • Read a loadings table to interpret an individual PC
  • Name when PCA is the right tool, and when it is not
  • Recognize PCA’s limits (it is linear and scale-sensitive) and standardize features before applying it

  • Read time: about 12 minutes
  • Practice time: about 15 minutes (a scree-plot reading exercise, a loadings interpretation, and flashcards)
  • Difficulty: standard
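
For readers who want a concrete preview of the objectives above, here is a minimal sketch in Python using only NumPy. It is an illustration under stated assumptions, not the lesson's own code. The function name `pca_summary` and the toy dataset are hypothetical, and "loadings" is taken here to mean the unit-length weights of each feature in each PC (some texts scale them differently). The sketch standardizes the features, finds the PCs with a singular value decomposition, and reports the share of variance each PC explains, which is the quantity a scree plot draws.

```python
import numpy as np

def pca_summary(X):
    """Standardize X (rows = samples, columns = features) and run PCA via SVD.

    Returns (explained_variance_ratio, loadings), where loadings[i, j] is the
    weight of feature j in principal component i.
    """
    X = np.asarray(X, dtype=float)
    # Standardize first: PCA is scale-sensitive, so give each feature
    # mean 0 and standard deviation 1.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Rows of Vt are the principal directions, ordered from most to least variance.
    _, singular_values, Vt = np.linalg.svd(Z, full_matrices=False)
    variances = singular_values**2 / (len(Z) - 1)
    return variances / variances.sum(), Vt

# Hypothetical toy data: 200 samples, 3 features.
# Features 0 and 1 are nearly copies of each other; feature 2 is independent noise.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
X = np.column_stack([
    base + 0.1 * rng.normal(size=200),
    base + 0.1 * rng.normal(size=200),
    rng.normal(size=200),
])

ratio, loadings = pca_summary(X)
print("variance explained per PC:", np.round(ratio, 3))
print("loadings for PC1:", np.round(loadings[0], 3))
```

With this toy data, PC1 should weight the two correlated features heavily and carry most of the variance, which is the pattern you would look for when deciding how many components to keep and when reading a loadings table.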