Correlation, in brief

What you’ll learn

This is lesson 4 of Track 9 (Statistics & Probability for AI) and the close of Phase 1 (Describing data). The first three lessons looked at one variable at a time; this one is the first look at two variables together. You will learn to read a scatterplot, to interpret the correlation coefficient that measures how tightly two quantities move together, and, just as importantly, to resist the trap the tool sets: reading cause into co-variation. The source curriculum is Khan Academy’s Statistics & Probability course, by Sal Khan and the Khan Academy team, freely available and cited as further study.

The lesson opens with the ice-cream-and-drownings story, builds the scatterplot and the correlation coefficient (sign for direction, magnitude for strength, tied back to the standardized z-scores from earlier), flags that the coefficient sees only straight lines, and then spends real time on correlation-is-not-causation and the confounder. It closes on where correlation shows up in machine learning (redundant features, spurious signals) and draws a clean boundary: measuring a relationship is correlation; predicting from it is regression, which is taught in the Classical Machine Learning track.

Where this fits

This is lesson 4 of 14 and the final lesson of Phase 1. It builds on the standardization and z-score idea from the center-and-spread lesson and on the shape-reading from the histogram lesson. The next lesson, Probability foundations, opens Phase 2 (The laws of chance) and turns the track from describing data to reasoning about uncertainty. The deliberate boundary with regression points outward to a separate track rather than forward within this one.

Before you start

Prerequisites: the previous lesson (The shape of data) for context, and ideally the center-and-spread lesson, since the correlation coefficient is built from the standardization idea introduced there. No heavy computation is required; you will read scatterplots and interpret a coefficient rather than calculate one by hand.

About the math

This lesson stays at the intuition level. You interpret the correlation coefficient rather than compute it: what its sign and size mean, and why it is blind to curves. The one formula idea is given for understanding, not for calculation: the coefficient is the average, across the data points, of the product of each point’s two z-scores. That is exact when the standard deviations divide by n; under the n - 1 convention, the sum of products is divided by n - 1 instead. The scatterplots are drawn in plain text so you can read them anywhere.
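For readers who want to see the z-score idea run once, here is a minimal sketch in Python. It is optional and not part of the lesson's exercises. The function names and the temperature and sales numbers are made up for illustration, and the sketch assumes the divide-by-n (population) standard deviation.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize: subtract the mean, divide by the population standard deviation."""
    m = mean(values)
    s = pstdev(values)  # divides by n, so the final average below is exact
    return [(v - m) / s for v in values]

def correlation(xs, ys):
    """Average of the products of paired z-scores."""
    products = [zx * zy for zx, zy in zip(z_scores(xs), z_scores(ys))]
    return mean(products)

# Hypothetical data: daily temperature and ice-cream sales (illustrative only).
temps = [18, 21, 24, 27, 30]
sales = [20, 26, 29, 37, 41]

r = correlation(temps, sales)
print(round(r, 3))  # positive and close to 1: the two rise together
```

When both z-scores of a point share a sign (both above or both below their means), the product is positive and pushes the coefficient up; when the signs differ, it pushes the coefficient down.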

By the end, you’ll be able to

Read a scatterplot and describe the direction and strength of the relationship it shows
Interpret a correlation coefficient between -1 and +1 (sign as direction, magnitude as strength)
Explain that correlation measures only linear association, so a strong curved relationship can have a coefficient near zero
Explain why correlation does not imply causation, and name the alternatives (reverse cause, a hidden variable, coincidence)
Connect correlation to machine learning (redundant features, spurious signals) and distinguish it from prediction, which belongs to the Classical ML track
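The third outcome above, that the coefficient sees only straight lines, can be checked with a small sketch. This is an illustration under stated assumptions, not the lesson's own material: the helper name pearson_r and the data are hypothetical, and the curved example is a perfectly predictable U-shape (y = x squared) chosen to be symmetric around zero.

```python
from statistics import mean, pstdev

def pearson_r(xs, ys):
    """Correlation as the average product of paired z-scores (population form)."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    return mean([((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)])

xs = [-3, -2, -1, 0, 1, 2, 3]

line = [2 * x + 1 for x in xs]   # a perfect straight-line relationship
curve = [x * x for x in xs]      # a perfect U-shaped relationship

r_line = pearson_r(xs, line)
r_curve = pearson_r(xs, curve)

print(round(r_line, 3))   # at the top of the scale: every point is on the line
print(round(r_curve, 3))  # essentially zero, although y is fully determined by x
```

The left half of the U pulls the coefficient down exactly as much as the right half pulls it up, so the two cancel. A coefficient near zero therefore means "no straight-line pattern", not "no relationship", which is why the scatterplot should be read first.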

Time and difficulty

Read time: about 11 minutes
Practice time: about 13 minutes (a self-check, a match-the-coefficient exercise reading scatterplots, an explain-the-correlation exercise that drills correlation-is-not-causation, and flashcards)
Difficulty: standard (conceptual; reading relationships, not computing them)