Practice: When two things move together: correlation

Two skills to build here. The first is reading a relationship off a scatterplot and a coefficient. The second is the reflex that keeps you out of trouble: refusing to read cause into a correlation. The second exercise is the one that matters most.

Self-check

Six short questions. Answer each in your head before revealing the answer.

1. What does the sign of a correlation coefficient tell you, and what does its magnitude tell you?

Show answer

The sign is the direction: positive means the two variables rise together, negative means one rises as the other falls. The magnitude (how close the absolute value is to 1) is the strength: near +1 or -1 the points sit almost on a straight line; near 0 there is little or no straight-line relationship.

2. A correlation coefficient is exactly 0.02. Does that prove the two variables are unrelated?

Show answer

No. It only says there is almost no linear relationship in this data. The variables could still have a strong curved relationship (like a U-shape, high at both extremes and low in the middle) that the coefficient cannot see. This is why you always look at the scatterplot, not just the number.

3. Ice cream sales and drownings are strongly correlated across the year. Give the real explanation, and name what that kind of variable is called.

Show answer

Hot weather drives both: heat sells ice cream and sends people swimming. Neither causes the other; a hidden third variable causes both. That hidden variable is called a confounder, and it is the usual culprit behind a surprising correlation.

4. List the possible explanations for an observed correlation between X and Y.

Show answer

X causes Y; Y causes X (the arrow runs the other way); a hidden third variable (a confounder) causes both; or it is coincidence (with enough variables, some correlate by chance). A correlation on its own cannot tell you which is true.

5. How does correlation help spot a redundant feature in a machine-learning dataset?

Show answer

Two features that are very highly correlated carry nearly the same information (height in centimeters and in inches, or two sensors reading almost the same thing). A high correlation between two inputs flags that one may be redundant, which is the lesson-1 point about two features that are really the same thing in disguise.

6. What is the difference between correlation and regression?

Show answer

Correlation measures whether and how tightly two variables move together (a number between -1 and +1). Regression predicts one variable from another by fitting a line or curve that minimizes prediction error. Correlation describes the relationship; regression builds a predictor from it. Regression is taught in the Classical Machine Learning track, not here.

Try it yourself: match the scatterplot to its coefficient

For each scatterplot, pick the closest correlation coefficient from this list: +0.95, +0.5, 0.0 (with a curve), -0.9. Then check.

Plot 1                  Plot 2
y|        *             y| *
 |      *                |    *
 |    *                  |       *
 |  *                    |          *
 +----------- x          +----------- x

Plot 3                  Plot 4
y| *         *          y|  *   *
 |  *       *            |    *
 |    *   *              | *      *
 |      *                |   *  *   *  (loose cloud, slight upward drift)
 +----------- x          +----------- x

Show answer

Plot 1: +0.95. A tight upward line; strong positive.
Plot 2: -0.9. A tight downward line; strong negative.
Plot 3: 0.0 (with a curve). The points trace a U (high at the ends, low in the middle); a real relationship, but not linear, so the coefficient is near zero. This is the reminder to always look at the picture.
Plot 4: +0.5. A loose cloud with a mild upward tilt; a moderate positive relationship, not a tight line.

The skill: separate direction (which way it tilts) from strength (how tight), and never let a near-zero coefficient hide a curve.
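Reading direction and strength separately can be sketched as a small helper. The function name `describe` and the cutoffs 0.2 and 0.7 are assumptions chosen for illustration; there is no single agreed set of thresholds.

```python
def describe(r, weak=0.2, strong=0.7):
    """Turn a correlation coefficient into a direction-and-strength label.

    The cutoffs are illustrative assumptions, not a standard.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, +1]")
    magnitude = abs(r)  # strength ignores the sign
    if magnitude < weak:
        return "little or no linear relationship (check the plot for a curve)"
    strength = "strong" if magnitude >= strong else "moderate"
    direction = "positive" if r > 0 else "negative"  # direction is the sign
    return f"{strength} {direction}"


# The four coefficients from the exercise above.
labels = {r: describe(r) for r in (0.95, 0.5, 0.0, -0.9)}
for r, label in labels.items():
    print(f"{r:+.2f}: {label}")
```

The near-zero branch deliberately points back to the plot, because the number alone cannot rule out a curve.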

Try it yourself: explain the correlation (do not assume cause)

Each pair of variables is genuinely correlated. For each, give a more likely explanation than “the first causes the second,” and name which kind it is (reverse cause, confounder, or coincidence).

A. Fires attended by more firefighters tend to cause more damage.
B. Children with bigger shoe sizes tend to read better.
C. People who use the company's premium feature tend to stay subscribed
   longer.
D. The number of films a certain actor released each year correlates with
   the yearly number of a specific kind of accident.

Show answer

A: confounder. The size of the fire drives both. Bigger fires summon more firefighters and cause more damage; the firefighters are not the cause. (A textbook confounding case.)
B: confounder. Age drives both. Older children have bigger feet and read better; shoe size does not cause reading skill.
C: plausibly reverse cause or confounder, not simple cause. It is tempting to say the feature causes loyalty, but more engaged users may both adopt premium features and stay longer (a confounder), or loyalty may drive feature use. You would need an experiment to claim the feature causes retention.
D: coincidence. Two unrelated counts that happen to track each other over a stretch of years; with enough series compared, some will line up by pure chance.

The reflex to build: when you see a correlation, run through reverse cause, confounder, and coincidence before believing the obvious causal story.

Flashcards

Eight cards.

Q. What does a scatterplot show, and why start there?

One dot per observation, placed by its two values. It shows the direction and strength of a relationship visually, before any single number summarizes (and possibly hides) it.

Q. What range does the correlation coefficient take, and what do its sign and magnitude mean?

It runs from -1 to +1. The sign is direction (positive = rise together, negative = one rises as the other falls); the magnitude is strength (near the ends = tight straight line, near 0 = little or no linear relationship).

Q. A correlation coefficient near 0 means what, exactly?

No LINEAR relationship. A strong curved relationship (like a U-shape) can have a coefficient near zero, so a near-zero value does not mean the variables are unrelated. Always check the scatterplot.

Q. What are the possible explanations for an observed correlation?

X causes Y; Y causes X; a hidden third variable (confounder) causes both; or coincidence. A correlation alone cannot tell you which.

Q. What is a confounder?

A hidden third variable that causes both correlated variables, creating a correlation between them with no direct cause (hot weather behind ice cream sales and drownings). The usual culprit behind a surprising correlation.

Q. How is the correlation coefficient related to z-scores?

It is essentially the average of the products of the two variables’ z-scores. Points above (or below) average on both push it positive; points high on one and low on the other pull it negative.

Q. How does correlation help with feature selection in ML?

Two highly correlated input features carry nearly the same information, so one may be redundant. Correlation flags that overlap (the lesson-1 ‘two features that are the same thing in disguise’ point).

Q. Correlation vs regression: which describes and which predicts?

Correlation measures how tightly two variables move together (a number from -1 to +1). Regression fits a line/curve to predict one from another. Regression (prediction) is taught in the Classical ML track, not here.