Cheatsheet: Summarizing data: center and spread
The one idea
Every summary answers two questions: where is the center of the data, and how spread out is it? Center without spread is half a description, and the wrong center (mean on skewed data) is a lie.
The three centers
| Center | How | Uses every value? | Outlier-sensitive? | Best for |
|---|---|---|---|---|
| Mean | Add all, divide by count | Yes | Yes (dragged by outliers) | Symmetric, outlier-free data |
| Median | Middle value when sorted | No | No (robust) | Skewed data, data with outliers |
| Mode | Most frequent value | No | No | Categorical data |
When the mean and median differ noticeably, the data is probably skewed; report the median as the typical value.
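The comparison in the table can be sketched with Python's standard `statistics` module. The salary figures and color labels below are invented for illustration and are not from the lesson; the point is only to show one outlier dragging the mean while the median stays put, and the mode working on categories.

```python
from statistics import mean, median, mode

# Hypothetical salaries (in thousands); the last value is a deliberate outlier.
salaries = [30, 32, 35, 36, 40, 250]

salary_mean = mean(salaries)      # uses every value, so the 250 pulls it upward
salary_median = median(salaries)  # middle of the sorted values, barely affected

print("mean:  ", salary_mean)
print("median:", salary_median)

# The mode also works on categories, where a mean or median has no meaning.
colors = ["red", "blue", "red", "green"]
most_common_color = mode(colors)
print("mode:  ", most_common_color)
```

Here the mean lands far above five of the six salaries, which is the gap between mean and median that signals skew.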
The spread measures
| Measure | How | Note |
|---|---|---|
| Range | Max minus min | Simple but fragile; only the two extremes |
| Variance | Average of squared distances from the mean | In squared units; hard to read directly |
| Standard deviation | Square root of the variance | Typical distance from the mean, in original units |
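A minimal sketch of the three spread measures, again using the standard `statistics` module. The heights are made-up values. One assumption to flag: the table describes variance as a plain average of squared distances, which is the population form, so the sketch uses `pvariance` and `pstdev` (divide by n) rather than `variance` and `stdev` (divide by n - 1).

```python
from statistics import pvariance, pstdev

# Hypothetical measurements in centimetres.
heights = [150, 160, 170, 180, 190]

data_range = max(heights) - min(heights)  # depends only on the two extremes
variance = pvariance(heights)             # average squared distance, in cm^2
std_dev = pstdev(heights)                 # square root of the variance, in cm

print("range:             ", data_range)
print("variance:          ", variance)
print("standard deviation:", round(std_dev, 2))
```

The variance comes out in squared centimetres, which is why the standard deviation is usually the number to report.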
Worked example (eight scores)
Data: 2, 4, 4, 4, 5, 5, 7, 9
Mean = 40 / 8 = 5
Median = (4 + 5) / 2 = 4.5
Mode = 4
Squared distances from 5: 9, 1, 1, 1, 0, 0, 4, 16 (sum = 32)
Variance = 32 / 8 = 4
Standard deviation = sqrt(4) = 2
Standardizing a feature (machine learning)
standardized value = (value - mean) / standard deviation
This recenters the feature at 0 and rescales its spread to a standard deviation of 1, so features are comparable and a large-unit feature (income) cannot numerically dominate a small-unit one (age). The standardized value is the z-score (a later lesson).
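The formula can be tried on the eight scores from the worked example. This is a sketch, not a reference implementation: the function name `standardize` is hypothetical, and it assumes the population standard deviation (divide by n), matching the worked example's variance of 32 / 8.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return (value - mean) / standard deviation for each value."""
    mu = mean(values)
    sigma = pstdev(values)  # population SD, as in the worked example
    if sigma == 0:
        # A constant feature has no spread, so the division is undefined.
        raise ValueError("cannot standardize a constant feature")
    return [(v - mu) / sigma for v in values]


scores = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, standard deviation 2
z_scores = standardize(scores)
print(z_scores)
```

A score of 9 sits two standard deviations above the mean and a score of 2 sits one and a half below it; the standardized list itself has mean 0 and standard deviation 1.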
Pitfalls to dodge
- Reporting the mean for skewed data (use the median).
- Reporting a center with no spread.
- Confusing variance (squared units) with standard deviation (original units).
- Using the range as your spread (one outlier blows it up).
- Forgetting that the mode is the only center that works for categorical data.
Words to use precisely
- Mean: the arithmetic average; uses every value; outlier-sensitive.
- Median: the middle value when sorted; robust to outliers.
- Mode: the most frequent value; the center for categorical data.
- Variance: average squared distance from the mean.
- Standard deviation: square root of variance; typical distance from the mean, in original units.
- Skew: a few extreme values stretching one tail, pushing the mean away from the median.