Cheatsheet: When two things move together: correlation
The one idea
Correlation measures how tightly two variables move together. It never proves one causes the other. Both halves matter in AI, where models are correlation engines.
Reading the correlation coefficient (r)
r ranges from -1 to +1.
- Sign = direction (+ rise together, - move oppositely).
- Magnitude = strength (near +/-1 = tight straight line, near 0 = no linear drift).

Examples:
- +0.95: strong positive
- -0.9: strong negative
- +0.5: moderate positive
- 0.05: essentially no linear relationship

Built from z-scores: r is roughly the average of the products of each point's two z-scores. Points that are above average on both variables, or below average on both, push r up; points that are above on one and below on the other push it down.

The big caveat: only straight lines
A U-shaped relationship (high at both ends, low in the middle) is STRONG but NONLINEAR, so r is near 0. Near-zero r = no LINEAR relationship, not "no relationship." Always look at the scatterplot.

Four explanations for any correlation between X and Y
| Explanation | Example |
|---|---|
| X causes Y | Studying raises exam scores |
| Y causes X | The arrow runs the other way, e.g. good exam scores motivating more studying |
| A confounder causes both | Hot weather behind ice cream sales and drownings |
| Coincidence | Two unrelated series that happen to track over a span |
Observing the correlation alone cannot tell you which explanation applies. Establishing causation usually requires a controlled experiment.
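
The confounder row can be made concrete with a small simulation. This is a minimal sketch on made-up numbers, not real data: a hypothetical `temperature` variable drives both `ice_cream_sales` and `drownings`, and neither outcome appears in the other's formula. The helper names `pearson_r` and `residuals` are illustrative, and the coefficients are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 2000

# Hypothetical confounder: daily temperature (invented values, not measurements).
temperature = rng.normal(loc=25.0, scale=5.0, size=n)

# Both outcomes depend on temperature plus independent noise.
# Neither appears in the other's formula, so neither causes the other.
ice_cream_sales = 3.0 * temperature + rng.normal(scale=5.0, size=n)
drownings = 0.5 * temperature + rng.normal(scale=1.0, size=n)


def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


def residuals(y, x):
    """What is left of y after removing its straight-line dependence on x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (slope * x + intercept)


# Strong positive correlation, even though there is no causal link.
r_raw = pearson_r(ice_cream_sales, drownings)

# After removing the temperature effect from both, the correlation collapses.
r_adjusted = pearson_r(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)

print(f"raw r = {r_raw:.2f}, after adjusting for temperature = {r_adjusted:.2f}")
```

The adjustment works here only because the confounder is known and measured. In real data the confounder is often hidden, which is why a controlled experiment is usually needed.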
In machine learning
| Use | What it means |
|---|---|
| Spotting redundant features | Two highly correlated inputs carry the same information; one may be dropped |
| Spotting spurious signals | A model chases correlation and can latch onto a confounded signal that holds in the training data but fails in the real world |
| Boundary | Correlation DESCRIBES the relationship; REGRESSION predicts from it (Classical ML track, not here) |
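
One way the redundant-features row might look in practice is sketched below. The feature names, the synthetic data, and the 0.95 cutoff are all assumptions chosen for illustration; a real project would pick its own threshold and check the scatterplots before dropping anything.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Hypothetical features: names and numbers are invented for illustration.
height_cm = rng.normal(170.0, 10.0, n)
height_in = height_cm / 2.54 + rng.normal(0.0, 0.1, n)  # same information, other unit
age_years = rng.normal(40.0, 12.0, n)  # generated independently of height

features = {"height_cm": height_cm, "height_in": height_in, "age_years": age_years}
names = list(features)

# np.corrcoef treats each row as one variable and returns the full r matrix.
corr = np.corrcoef(np.vstack([features[name] for name in names]))

THRESHOLD = 0.95  # assumed cutoff for "highly correlated"

# Collect every pair of distinct features whose |r| exceeds the cutoff.
redundant_pairs = [
    (names[i], names[j], float(corr[i, j]))
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > THRESHOLD
]

for first, second, r in redundant_pairs:
    print(f"{first} and {second} look redundant (r = {r:.3f})")
```

Only the two height columns are flagged, since they encode the same measurement. Which one to drop is a judgment call the correlation itself does not make.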
Pitfalls to dodge
- Reading causation into a correlation (run through the four explanations first).
- Treating r near 0 as “no relationship” (it means no linear one).
- Forgetting a single outlier can swing r (the scatterplot shows it).
- Extrapolating a relationship far past the data range.
- Confusing measuring a relationship (correlation) with predicting from it (regression).
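
Two of the pitfalls above, the near-zero r and the single outlier, are easy to reproduce. The sketch below uses synthetic data only; the variable names and the outlier at (20, 20) are arbitrary choices for illustration.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


# Pitfall: r near 0 is not "no relationship".
x = np.linspace(-3.0, 3.0, 61)
y_u = x ** 2  # y is fully determined by x, but the shape is a U
r_u = pearson_r(x, y_u)  # essentially 0 despite the perfect dependence

# Pitfall: a single outlier can swing r.
rng = np.random.default_rng(2)
a = rng.normal(size=30)
b = rng.normal(size=30)  # generated independently of a
r_without = pearson_r(a, b)

# Append one extreme point far from the cloud and recompute.
r_with = pearson_r(np.append(a, 20.0), np.append(b, 20.0))

print(f"U-shape: r = {r_u:.3f}")
print(f"30 unrelated points: r = {r_without:.2f}; plus one outlier: r = {r_with:.2f}")
```

In both cases the scatterplot would reveal at a glance what the single number hides.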
Words to use precisely
- Scatterplot: one dot per observation, placed by its two values.
- Correlation coefficient (r): a number in [-1, +1]; sign is direction, magnitude is strength of the linear relationship.
- Confounder: a hidden variable causing both correlated variables.
- Correlation vs causation: moving together is not the same as one driving the other.
- Regression: fitting a line/curve to predict one variable from another (Classical ML track).
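
To tie the definition of r back to z-scores, here is a minimal sketch that computes r as the average product of z-scores and compares it with NumPy's built-in result. The function name `r_from_z_scores` and the study-hours data are made up for illustration. The sketch assumes the population standard deviation (dividing by n); with that choice the average of the products equals r exactly, while the sample convention (dividing by n - 1) is where the "roughly" comes from.

```python
import numpy as np


def r_from_z_scores(x, y):
    """Correlation as the mean product of z-scores (population std, ddof=0)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    zx = (x - x.mean()) / x.std()  # how many std devs each x is from its mean
    zy = (y - y.mean()) / y.std()
    # Same-sign z-scores give positive products and push r up;
    # opposite-sign z-scores give negative products and push it down.
    return float(np.mean(zx * zy))


# Made-up observations: hours studied and exam score for six students.
hours_studied = [1, 2, 3, 4, 5, 6]
exam_score = [52, 55, 61, 64, 70, 71]

r_manual = r_from_z_scores(hours_studied, exam_score)
r_numpy = float(np.corrcoef(hours_studied, exam_score)[0, 1])

print(f"from z-scores: {r_manual:.4f}, from np.corrcoef: {r_numpy:.4f}")
```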