When two things move together: correlation

Across a summer, ice cream sales go up. So do drownings. Plot them month by month and they track each other closely. Does eating ice cream cause people to drown? Obviously not. A third thing, hot weather, drives both: heat sells ice cream and sends people into the water. The two move together, but neither causes the other.

That story holds both halves of this lesson. The first half is the useful tool: correlation, a precise way to measure how tightly two quantities move together. The second half is the warning that has to ride along with it, because the tool is so easy to misuse: correlation is not causation. Both halves matter for AI, where models are built almost entirely out of correlations and a confused practitioner can read causes into them that are not there.

So far this track has looked at one variable at a time. Correlation is about two variables measured on the same things, like hours studied and exam score for each student in a class. The natural picture is the scatterplot: one dot per student, placed by their hours (across) and their score (up).

score
100 |                  *
 80 |              *
 60 |          *
 40 |      *
 20 |  *
    +--------------------
       1   2   3   4   5   hours studied

The dots climb from lower-left to upper-right: more hours, higher score. That upward drift is a positive relationship. If the dots fell from upper-left to lower-right (more of one, less of the other, like price and units sold), that would be negative. If they scattered in a shapeless cloud with no drift, there would be little or no relationship at all. The scatterplot is where you should always start, because it shows the relationship before any number summarizes it.

The scatterplot shows the relationship; the correlation coefficient puts a single number on it. Written as the single letter r (Pearson's r), the correlation coefficient always lands between -1 and +1, and it carries two pieces of information at once.

  • The sign is the direction. A positive coefficient means the two rise together; a negative one means one rises as the other falls.
  • The magnitude is the strength. A value near positive 1 or negative 1 means the points sit almost perfectly on a straight line. A value near zero means no straight-line relationship: the cloud has no consistent tilt.

So a coefficient of 0.95 is a strong positive relationship (tight upward line), a coefficient of -0.9 is a strong negative one, and a coefficient of 0.1 is a weak relationship that is barely there.

[Figure: four scatterplots side by side, each generated to a target correlation: r ≈ +0.95 (empirical 0.98), r ≈ -0.90 (empirical -0.95), r ≈ 0 (empirical 0.17), and r ≈ +0.50 (empirical 0.66).]
Four scatterplots calibrate the eye. A strong positive r (~+0.95) is a tight upward diagonal; a strong negative r (~-0.90) is a tight downward diagonal. A correlation near zero looks like an undirected cloud. A moderate +0.50 is a visible upward trend with real scatter. Eyeball calibration is the first skill; the formula comes second.

Where does this number come from? It connects directly to the standardizing you met earlier in the track. Recall the z-score: how many standard deviations a value sits above or below its mean. The correlation coefficient is, in essence, the average of the products of the two z-scores, one for each variable, across all the points. When a point is above average on both variables (two positive z-scores) or below average on both (two negative z-scores), the product is positive and pushes the coefficient up. When a point is high on one and low on the other, the product is negative and pulls it down. Add those products up and average them, and you get a number that is positive when the variables move together and negative when they move oppositely. You will not compute this value by hand here; the point is that it is built from the same standardized distances you already know.
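
As a minimal sketch of that construction, the Python below standardizes each variable and then averages the products of the paired z-scores. The helper names (`z_scores`, `correlation`) and the five-student data set are made up for illustration. The sketch also assumes the population standard deviation (dividing by n), which is what lets a plain average of the products equal the coefficient.

```python
import math

def z_scores(values):
    """Return each value's distance from the mean, in standard deviations."""
    n = len(values)
    mean = sum(values) / n
    # Population standard deviation (divide by n): an assumption of this sketch.
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

def correlation(xs, ys):
    """Average of the products of the paired z-scores."""
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)

# Illustrative, made-up data: hours studied and exam score for five students.
hours = [1, 2, 3, 4, 5]
scores = [35, 30, 60, 70, 95]

# Mostly "above average on both" or "below average on both": positive products.
r_up = correlation(hours, scores)

# Reversing the scores pairs high hours with low scores: negative products.
r_down = correlation(hours, scores[::-1])

print(round(r_up, 3), round(r_down, 3))
```

With these numbers the first coefficient is strongly positive and the second is the same size with the opposite sign, because the reversed scores pair each above-average z-score with a below-average one.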

The catch inside the number: only straight lines

There is one technical limit worth burning in: the correlation coefficient measures only linear (straight-line) association. A relationship can be strong, obvious, and important, yet have a coefficient near zero, simply because it is not a straight line.

y
|  *                 *
|    *             *
|      *         *
|        *     *
+----------------------- x

These points trace a clear U-shape: y is high at both extremes of x and low in the middle. There is a strong relationship, but it is not a straight line, so the correlation coefficient comes out near zero. The lesson: a correlation near zero means no linear relationship, not no relationship. Always look at the scatterplot, because the coefficient alone can hide a curve.
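
A small sketch can make the U-shape concrete. The Python below uses a hypothetical helper, `pearson_r`, and a made-up noise-free parabola, so y is completely determined by x and yet the coefficient over the full range is zero. Restricting the same curve to its right half is an added illustration, not part of the lesson's figure: it shows that each arm has a strong straight-line trend and the two arms cancel.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# A noise-free U-shape: y is high at both extremes of x, low in the middle.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]
r_full = pearson_r(xs, ys)

# The right-hand arm of the same curve, taken on its own.
right_xs = [x for x in xs if x >= 0]
right_ys = [x ** 2 for x in right_xs]
r_right = pearson_r(right_xs, right_ys)

print(r_full, round(r_right, 3))
```

Over the full range the upward arm and the downward arm pull the coefficient in opposite directions and cancel. The scatterplot shows the curve immediately, while the single number does not.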

[Figure: scatterplot of about 36 points tracing a strong U-shape on x and y axes; empirical r = -0.10.]
A clear U-shape: as x moves away from zero in either direction, y rises. The relationship between x and y is strong, just not straight. Pearson's r, which only measures straight-line co-movement, comes out near zero. A correlation of zero is not the same as "no relationship"; it only means "no linear relationship".

The warning that has to ride along: correlation is not causation

This is the most misused idea in all of data analysis, so it gets its own section. When two things are correlated, there are several possible explanations, and “the first causes the second” is only one of them:

  1. The first really does cause the second (studying raises scores).
  2. The second causes the first (the arrow points the other way).
  3. A hidden third variable causes both (hot weather drives ice cream and drownings). This hidden cause is called a confounder, and it is the usual culprit behind a surprising correlation.
  4. Coincidence. With enough variables, some will correlate by pure chance with no relationship at all.

A correlation, on its own, cannot tell you which of these is true. Establishing causation takes more than observing that two things move together; it usually takes a controlled experiment or careful reasoning that rules the alternatives out. The discipline is a reflex: every time you hear “X is linked to Y,” ask what else could explain it before you believe X causes Y.
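
The confounder case can be sketched as a simulation. Everything in the Python below is an assumption made for illustration: the helper `pearson_r`, the temperature range, the slopes, and the noise levels are invented, and the "sales" and "drownings" values are in arbitrary units rather than real data. By construction, temperature drives both quantities and neither affects the other.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the run is repeatable

temps, ice_cream, drownings = [], [], []
for _ in range(20000):
    t = rng.uniform(10, 35)                        # the hidden third variable
    temps.append(t)
    ice_cream.append(2.0 * t + rng.gauss(0, 5))    # depends on temperature only
    drownings.append(0.5 * t + rng.gauss(0, 1.5))  # depends on temperature only

# Across all days the two series move together.
r_all = pearson_r(ice_cream, drownings)

# Hold the weather roughly fixed: keep only days between 20 and 22 degrees.
band = [(i, d) for t, i, d in zip(temps, ice_cream, drownings) if 20 <= t <= 22]
r_band = pearson_r([i for i, _ in band], [d for _, d in band])

print(round(r_all, 2), round(r_band, 2), len(band))
```

In this simulation the coefficient over all days is strongly positive, while within the narrow temperature band it falls close to zero. Comparing like with like on the suspected third variable is one simple way to ask what else could explain a link, though it only works for confounders you have thought to measure.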

Machine-learning models are, at their core, enormous correlation-finding engines. They learn which patterns in the inputs go with which outputs. That makes both halves of this lesson directly relevant.

  • Redundant features. Two input features that are highly correlated carry largely the same information (height in centimeters and height in inches, or two sensors measuring nearly the same thing). Correlation is how you spot that redundancy, which echoes lesson 1’s point about two features that are really the same thing in disguise.
  • Spurious signals. Because a model chases correlation, it will happily latch onto a correlation that is produced by a confounder rather than by a real cause. The classic cautionary tale is a model that learns to flag illness from a marking that appears only on scans from the hospital that treated the sick patients, rather than from anything medical in the image. The correlation was real in the training data and useless (or harmful) in the world. Knowing that correlation is not causation is what makes a practitioner suspicious of a too-convenient signal.
  • Where prediction proper lives. Correlation measures whether and how tightly two variables move together. Actually fitting a line or curve to predict one variable from another, by minimizing prediction error, is regression, and it is taught in the Classical Machine Learning track, not here. This track stays on the descriptive side: measuring and reading the relationship, not building a predictor from it.
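
The redundant-features point above can be sketched as a pairwise check. The feature names, the values, and the 0.95 cutoff in the Python below are all hypothetical, chosen only to illustrate the idea; a real project would pick its own threshold.

```python
import math
from itertools import combinations

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up feature columns for six people.
heights_cm = [150.0, 160.0, 165.0, 172.0, 180.0, 191.0]
features = {
    "height_cm": heights_cm,
    "height_in": [h / 2.54 for h in heights_cm],  # same information, new unit
    "age_years": [34, 21, 45, 29, 52, 38],
}

THRESHOLD = 0.95  # arbitrary cutoff, an assumption of this sketch

# Flag every pair of features whose coefficient is close to +1 or -1.
redundant = [
    (a, b)
    for a, b in combinations(features, 2)
    if abs(pearson_r(features[a], features[b])) > THRESHOLD
]

print(redundant)
```

Only the two height columns are flagged here, since one is just the other divided by a constant. A flagged pair is a prompt to look at the scatterplot, not an automatic reason to drop a feature.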

Common mistakes

  • Reading causation into a correlation. The headline error. A correlation is consistent with reverse causation, a hidden confounder, or coincidence; it never proves cause by itself.
  • Treating a coefficient near zero as “no relationship.” It means no linear relationship. A strong curve (the U-shape) can sit at a coefficient near zero, which is why you always look at the scatterplot.
  • Forgetting that outliers move the coefficient. A single extreme point can inflate or deflate the coefficient dramatically; the scatterplot shows it, the number alone does not.
  • Extrapolating past the data. A relationship measured over a range of values says nothing reliable about what happens far outside that range.
  • Confusing measuring a relationship with predicting from it. Correlation describes; regression (the Classical ML track) predicts. They are related but not the same job.
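
The outlier point above is easy to check numerically. In the Python sketch below, the helper `pearson_r` and all the data values are made up for illustration: five points with almost no straight-line trend, then the same five plus one extreme point.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Five points with little consistent tilt.
base_x = [1, 2, 3, 4, 5]
base_y = [3, 1, 4, 2, 3]
r_without = pearson_r(base_x, base_y)

# The same points plus one extreme observation far from the rest.
r_with = pearson_r(base_x + [20], base_y + [20])

print(round(r_without, 2), round(r_with, 2))
```

One added point moves the coefficient from weak to very strong in this example. On a scatterplot the lone point would stand out at once, which is the reason to look at the picture before trusting the number.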

Key takeaways

  • A scatterplot shows the relationship between two variables (one dot per observation); always start there.
  • The correlation coefficient lands between -1 and +1: the sign is the direction (together or opposite), the magnitude is the strength (near the ends means a tight straight line, near zero means no linear drift). It is built from the same standardized z-scores you met earlier.
  • The coefficient measures linear association only, so a strong curved relationship can have a value near zero. Look at the picture, not just the number.
  • Correlation is not causation. A correlation can come from a real cause, a reversed cause, a hidden confounder, or coincidence, and observing it alone cannot tell you which.
  • In machine learning, correlation reveals redundant features and warns you about spurious signals a model might exploit; actually predicting one variable from another is regression, which belongs to the Classical Machine Learning track.