When two things move together: correlation
Across a summer, ice cream sales go up. So do drownings. Plot them month by month and they track each other closely. Does eating ice cream cause people to drown? Obviously not. A third thing, hot weather, drives both: heat sells ice cream and sends people into the water. The two move together, but neither causes the other.
That story holds both halves of this lesson. The first half is the useful tool: correlation, a precise way to measure how tightly two quantities move together. The second half is the warning that has to ride along with it, because the tool is so easy to misuse: correlation is not causation. Both halves matter for AI, where models are built almost entirely out of correlations and a confused practitioner can read causes into them that are not there.
Seeing a relationship: the scatterplot
So far this track has looked at one variable at a time. Correlation is about two variables measured on the same things, like hours studied and exam score for each student in a class. The natural picture is the scatterplot: one dot per student, placed by their hours (across) and their score (up).
[Figure: scatterplot of score (vertical axis, 20 to 100) against hours studied (horizontal axis, 1 to 5), with one point at each score level.]

The dots climb from lower-left to upper-right: more hours, higher score. That upward drift is a positive relationship. If the dots fell from upper-left to lower-right (more of one, less of the other, like price and units sold), that would be negative. If they scattered in a shapeless cloud with no drift, there would be little or no relationship at all. The scatterplot is where you should always start, because it shows the relationship before any number summarizes it.
Measuring it: the correlation coefficient
The scatterplot shows the relationship; the correlation coefficient puts a single number on it. Conventionally written as the single letter r, the correlation coefficient always lands between -1 and +1, and it carries two pieces of information at once.
- The sign is the direction. A positive coefficient means the two rise together; a negative one means one rises as the other falls.
- The magnitude is the strength. A value near positive 1 or negative 1 means the points sit almost perfectly on a straight line. A value near zero means no straight-line relationship: the cloud has no consistent tilt.
So a coefficient of 0.95 is a strong positive relationship (tight upward line), a coefficient of -0.9 is a strong negative one, and a coefficient of 0.1 is a weak relationship that is barely there.
Where does this number come from? It connects directly to the standardizing you met earlier in the track. Recall the z-score: how many standard deviations a value sits above or below its mean. The correlation coefficient is, in essence, the average of the products of the two z-scores, one for each variable, across all the points. When a point is above average on both variables (two positive z-scores) or below average on both (two negative z-scores), the product is positive and pushes the coefficient up. When a point is high on one and low on the other, the product is negative and pulls it down. Add those products up and average them, and you get a number that is positive when the variables move together and negative when they move oppositely. You will not compute this value by hand here; the point is that it is built from the same standardized distances you already know.
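The z-score description above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper names `z_scores` and `correlation_from_z_scores` are hypothetical, the data is made up, and the sketch assumes the population standard deviation (dividing by n) so that a plain average of the products gives the coefficient.

```python
from statistics import fmean, pstdev


def z_scores(values):
    """Return each value's distance from the mean, in standard deviations."""
    mean = fmean(values)
    spread = pstdev(values)  # population standard deviation (divides by n)
    return [(v - mean) / spread for v in values]


def correlation_from_z_scores(xs, ys):
    """Average the products of paired z-scores (hypothetical helper name)."""
    products = [zx * zy for zx, zy in zip(z_scores(xs), z_scores(ys))]
    return fmean(products)


# Made-up illustration data: five students, not real measurements.
hours = [1, 2, 3, 4, 5]
scores = [35, 50, 45, 70, 80]

r = correlation_from_z_scores(hours, scores)
print(f"r = {r:.2f}")  # strong and positive for this made-up data
```

Four of the five students are on the same side of average for both variables (the student at the mean number of hours contributes a product of zero), so the products are positive or zero and the coefficient comes out close to +1.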
The catch inside the number: only straight lines
There is one technical limit worth burning in: the correlation coefficient measures only linear (straight-line) association. A relationship can be strong, obvious, and important, yet have a coefficient near zero, simply because it is not a straight line.
[Figure: scatterplot of y against x, with eight points in four rows of two.]

These points trace a clear U-shape: y is high at both extremes of x and low in the middle. There is a strong relationship, but it is not a straight line, so the correlation coefficient comes out near zero. The lesson: a correlation near zero means no linear relationship, not no relationship. Always look at the scatterplot, because the coefficient alone can hide a curve.
The warning that has to ride along: correlation is not causation
This is one of the most misused ideas in data analysis, so it gets its own section. When two things are correlated, there are several possible explanations, and “the first causes the second” is only one of them:
- The first really does cause the second (studying raises scores).
- The second causes the first (the arrow points the other way).
- A hidden third variable causes both (hot weather drives ice cream and drownings). This hidden cause is called a confounder, and it is the usual culprit behind a surprising correlation.
- Coincidence. With enough variables, some will correlate by pure chance with no relationship at all.
A correlation, on its own, cannot tell you which of these is true. Establishing causation takes more than observing that two things move together; it usually takes a controlled experiment or careful reasoning that rules the alternatives out. The discipline to build is a reflex: every time you hear “X is linked to Y,” ask what else could explain it before you believe X causes Y.
Why this matters when you use AI
Machine-learning models are, at their core, enormous correlation-finding engines. They learn which patterns in the inputs go with which outputs. That makes both halves of this lesson directly relevant.
- Redundant features. Two input features that are highly correlated carry largely the same information (height in centimeters and height in inches, or two sensors measuring nearly the same thing). Correlation is how you spot that redundancy, which echoes lesson 1’s point about two features that are really the same thing in disguise.
- Spurious signals. Because a model chases correlation, it will happily latch onto a correlation that is produced by a confounder rather than by a cause. The classic cautionary tale is a model that learns to flag illness not from anything medical, but from a marking that appears only on scans from the hospital that treated the sick patients. The correlation was real in the training data and useless (or harmful) in the world. Knowing that correlation is not causation is what makes a practitioner suspicious of a too-convenient signal.
- Where prediction proper lives. Correlation measures whether and how tightly two variables move together. Actually fitting a line or curve to predict one variable from another, by minimizing prediction error, is regression, and it is taught in the Classical Machine Learning track, not here. This track stays on the descriptive side: measuring and reading the relationship, not building a predictor from it.
Common pitfalls
- Reading causation into a correlation. The headline error. A correlation is consistent with reverse causation, a hidden confounder, or coincidence; it never proves cause by itself.
- Treating a coefficient near zero as “no relationship.” It means no linear relationship. A strong curve (the U-shape) can sit at a coefficient near zero, which is why you always look at the scatterplot.
- Forgetting that outliers move the coefficient. A single extreme point can inflate or deflate the coefficient dramatically; the scatterplot shows it, the number alone does not.
- Extrapolating past the data. A relationship measured over a range of values says nothing reliable about what happens far outside that range.
- Confusing measuring a relationship with predicting from it. Correlation describes; regression (the Classical ML track) predicts. They are related but not the same job.
What you should remember
- A scatterplot shows the relationship between two variables (one dot per observation); always start there.
- The correlation coefficient lands between -1 and +1: the sign is the direction (together or opposite), the magnitude is the strength (near the ends means a tight straight line, near zero means no linear drift). It is built from the same standardized z-scores you met earlier.
- The coefficient measures linear association only, so a strong curved relationship can have a value near zero. Look at the picture, not just the number.
- Correlation is not causation. A correlation can come from a real cause, a reversed cause, a hidden confounder, or coincidence, and observing it alone cannot tell you which.
- In machine learning, correlation reveals redundant features and warns you about spurious signals a model might exploit; actually predicting one variable from another is regression, which belongs to the Classical Machine Learning track.