Summarizing data: center and spread
A startup has ten people. Nine of them earn $50,000 a year. The founder pays herself $2 million. A recruiter posts that the “average salary” is $245,000, and it is technically true: add up all ten salaries and divide by ten. It is also a lie that would make any of those nine employees laugh, because not one of them earns anything close to it. The typical person at this company makes $50,000, and a different summary, the median, says exactly that.
This is the whole lesson in one story. Before a machine-learning model touches data, someone summarizes it, and a careless summary can mislead a human or a model just as badly as it misleads that recruiter. Every summary answers two questions: where is the center of the data, and how spread out is it? Get comfortable with those two questions and you can read a dataset honestly, which is step zero of everything that follows.
Where is the center?
There are three common answers, and the differences between them matter.
The mean is the everyday average: add up the values and divide by how many there are. The mean uses every single data point, which is its strength and its weakness. Because every value pulls on it, a single extreme value (an outlier) drags the mean toward itself. The $2 million salary is why the mean salary was $245,000.
The median is the middle value when you line the data up in order (with an even count, it is the average of the two middle values). Half the data sits below it, half above. The median does not care how extreme the outliers are, only how many values fall on each side, so it is robust: the founder could pay herself $2 million or $200 million and the median salary stays $50,000.
The mode is the value that appears most often. It is the natural summary for categories (the most common product category, the most frequent error code) where “average” makes no sense.
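As a minimal sketch of finding a mode for categorical data, the snippet below counts made-up error codes with Python's standard `collections.Counter`. The list of codes is hypothetical and exists only for illustration.

```python
from collections import Counter

# Hypothetical error codes from a log; these values are invented for illustration.
error_codes = ["E404", "E500", "E404", "E403", "E404", "E500"]

counts = Counter(error_codes)

# most_common(1) returns a one-item list holding the (value, count) pair
# that occurs most often.
mode_code, mode_count = counts.most_common(1)[0]

print(mode_code, mode_count)  # E404 3
```

If two categories tie for most frequent, this sketch reports only one of them, so it is worth checking the full counts when the top values are close.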
For the salary data, lined up in order, nine values of $50,000 and one of $2,000,000:
Mean = (9 x 50,000 + 2,000,000) / 10 = 2,450,000 / 10 = 245,000
Median = average of the 5th and 6th values = (50,000 + 50,000) / 2 = 50,000
Mode = 50,000 (the most common salary)
When the mean and median disagree this much, the data is skewed: a few extreme values are stretching one tail. The rule of thumb: for skewed data (incomes, house prices, response times, anything with a long tail), the median is the more honest summary of the typical case. For roughly symmetric data, the mean and median nearly agree and the mean is fine. The mistake is reporting a mean for skewed data and calling it “typical.”
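The salary arithmetic can be checked with Python's standard `statistics` module. This is a sketch using the ten salaries from the story; the second list, where the founder's pay is raised to $200 million, is added only to show how the mean moves while the median does not.

```python
import statistics

# Nine employees at $50,000 and one founder at $2,000,000.
salaries = [50_000] * 9 + [2_000_000]

mean_salary = statistics.mean(salaries)
median_salary = statistics.median(salaries)
mode_salary = statistics.mode(salaries)

print(mean_salary, median_salary, mode_salary)

# Same company, but the founder now pays herself $200,000,000.
extreme_salaries = [50_000] * 9 + [200_000_000]

extreme_mean = statistics.mean(extreme_salaries)      # dragged far upward
extreme_median = statistics.median(extreme_salaries)  # unchanged

print(extreme_mean, extreme_median)
```

The mean jumps from 245,000 to 20,045,000 while the median stays at 50,000, which is the robustness described above.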
How spread out is it?
Center alone is not enough. Consider two teams whose test scores both average 75. Team A scored 74, 75, 76; team B scored 50, 75, 100. Same center, completely different stories: team A is consistent, team B is wildly variable. Spread is the second question, and it is often the more interesting one.
The range is the simplest measure: the largest value minus the smallest. It is easy but fragile, because it depends only on the two most extreme points and ignores everything in between.
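A short sketch of the range, applied to the two teams' scores. The helper name `value_range` is an assumption chosen for this example, not a standard function.

```python
# Scores for the two teams described in the text.
team_a = [74, 75, 76]
team_b = [50, 75, 100]


def value_range(values):
    """Return the largest value minus the smallest (hypothetical helper)."""
    return max(values) - min(values)


mean_a = sum(team_a) / len(team_a)
mean_b = sum(team_b) / len(team_b)

range_a = value_range(team_a)
range_b = value_range(team_b)

print(mean_a, range_a)  # 75.0 2
print(mean_b, range_b)  # 75.0 50
```

Both teams share a mean of 75, yet the ranges are 2 and 50. Note that only the minimum and maximum enter the calculation, which is why the range is fragile.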
Variance and standard deviation are the workhorses. The idea is to measure how far the values sit from the mean, on average. You take each value’s distance from the mean, square it (so that distances above and below do not cancel out, and larger distances count for more), and average those squared distances. That average is the variance. Because squaring changes the units (squared dollars, squared scores), you take the square root at the end to get back to the original units, and that is the standard deviation: roughly, the typical distance of a value from the mean.
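The recipe in the paragraph above can be sketched step by step in plain Python. The function names `population_variance` and `population_std` are assumptions made for this example, and the version shown divides by the full count, which describes the data in hand rather than estimating a larger population.

```python
import math


def population_variance(values):
    """Average squared distance from the mean (divides by the count)."""
    mean = sum(values) / len(values)
    # Square each distance so values above and below the mean do not cancel.
    squared_distances = [(v - mean) ** 2 for v in values]
    return sum(squared_distances) / len(values)


def population_std(values):
    """Square root of the variance, back in the original units."""
    return math.sqrt(population_variance(values))


team_a = [74, 75, 76]
team_b = [50, 75, 100]

std_a = population_std(team_a)  # under one point
std_b = population_std(team_b)  # roughly twenty points

print(round(std_a, 2), round(std_b, 2))
```

Both teams have the same mean, but the standard deviations differ by a factor of about 25, which puts a number on "consistent" versus "wildly variable."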
Work it on a small dataset of eight scores: 2, 4, 4, 4, 5, 5, 7, 9.
Mean = (2 + 4 + 4 + 4 + 5 + 5 + 7 + 9) / 8 = 40 / 8 = 5
Distance from the mean: -3, -1, -1, -1, 0, 0, 2, 4
Squared distances: 9, 1, 1, 1, 0, 0, 4, 16 (sum = 32)
Variance = 32 / 8 = 4
Standard deviation = square root of 4 = 2
So the typical score sits about 2 points from the mean of 5. A standard deviation of 2 on data centered at 5 is a different world from a standard deviation of 20: same center, very different spread.
(One technical note for later. The calculation here divides by the count, which is the right thing when you are simply describing the data in front of you. When instead you have a sample and want to estimate the spread of the larger population it came from, you divide by one less than the count. The reason is subtle but real: a sample’s own mean sits slightly closer to its own points than the true population mean does, so dividing by the full count would systematically understate the spread, and dropping one from the divisor corrects for it. The intuition, typical distance from the mean, is identical; only the divisor shifts.)
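Python's standard `statistics` module offers both divisors discussed in the note: `pvariance` and `pstdev` divide by the count, while `variance` and `stdev` divide by one less than the count. The sketch below applies all four to the eight scores from the worked example.

```python
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]

# Describing the data in hand: divide by the count (8).
pop_var = statistics.pvariance(scores)
pop_sd = statistics.pstdev(scores)

# Estimating a larger population from a sample: divide by count - 1 (7).
sample_var = statistics.variance(scores)
sample_sd = statistics.stdev(scores)

print(pop_var, pop_sd)
print(round(sample_var, 3), round(sample_sd, 3))
```

The population figures match the hand calculation (variance 4, standard deviation 2). The sample figures come out slightly larger, 32 / 7 for the variance, because the same sum of squared distances is divided by 7 instead of 8.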
Why this matters when you use AI
Summarizing is not a warm-up that real machine learning skips. It is woven into the pipeline.
- Standardizing features. A model that takes in both “age in years” (roughly 0 to 100) and “income in dollars” (0 to millions) sees income as numerically gigantic next to age, and many algorithms will overweight it for no good reason. The standard fix is standardization: subtract each feature’s mean and divide by its standard deviation, so every feature is recentered at 0 and rescaled to a comparable spread. That is exactly the mean and standard deviation from this lesson, applied per feature. The resulting standardized value, how many standard deviations a point sits from the mean, is the z-score of a later lesson.
- Outliers distort training. Because the mean is outlier-sensitive, a few corrupted or extreme rows can skew a feature’s summary and, through it, the model. Spotting that the mean and median disagree is often the first sign that outliers need attention.
- Spread is a signal. A feature that barely varies (tiny standard deviation) carries little information for a model to learn from; a feature with huge spread may dominate unless it is rescaled. Knowing the spread helps you judge which features are worth keeping and which need rescaling.
The summary you compute before modeling shapes what the model can learn. Reading it honestly is not optional.
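Standardization as described in the list above can be sketched in a few lines of plain Python. The function name `standardize` and the age and income values are assumptions invented for this example. The sketch uses the population standard deviation (dividing by the count); some tools use the sample version instead, which changes the results only slightly.

```python
import statistics


def standardize(values):
    """Recenter a feature at 0 and rescale it to a standard deviation of 1."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        # A feature with no spread cannot be rescaled, and it carries
        # no information for a model anyway.
        raise ValueError("feature has zero spread; cannot standardize")
    return [(v - mean) / sd for v in values]


# Hypothetical features on very different scales.
ages = [25, 32, 47, 51, 38]
incomes = [40_000, 52_000, 150_000, 61_000, 48_000]

z_ages = standardize(ages)
z_incomes = standardize(incomes)

# Each standardized value says how many standard deviations
# the original point sits from its feature's mean.
print([round(z, 2) for z in z_ages])
print([round(z, 2) for z in z_incomes])
```

After this step both features have mean 0 and standard deviation 1, so neither dominates simply because of its units. In the income column, the 150,000 entry still stands out with a large positive value, which is one way an outlier shows up after standardization.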
Common pitfalls
- Reporting the mean for skewed data. When a few extreme values stretch a tail, the mean stops describing the typical case. Incomes, prices, and wait times are almost always skewed; reach for the median.
- Reporting center without spread. “The average is 75” is half a description. Two datasets with the same mean can be worlds apart; always pair center with a measure of spread.
- Confusing variance with standard deviation. Variance is in squared units (squared dollars), which are hard to interpret. The standard deviation is its square root, back in the original units, which is why it is the one usually reported.
- Forgetting the mode for categories. The mean needs numbers, and the median needs values that can at least be put in order. For unordered categorical data (favorite color, product type), the mode is the only center that makes sense.
- Trusting the range as your spread. The range depends only on the two most extreme points, so a single outlier can blow it up while the bulk of the data is tightly packed. Standard deviation uses every value, so it reflects the whole set rather than just its two endpoints.
What you should remember
- Every summary answers two questions: where is the center (mean, median, mode) and how spread out is the data (range, variance, standard deviation). Center alone is half the picture.
- The mean uses every value and is dragged by outliers; the median is the robust middle. When they disagree, the data is skewed, and the median is usually the more honest “typical” value.
- Variance is the average squared distance from the mean; standard deviation is its square root, the typical distance from the mean, reported in the original units.
- In machine learning, standardizing features by subtracting the mean and dividing by the standard deviation is one of the most common first steps, and it is exactly the tools from this lesson applied per feature.
- Reading a summary honestly, noticing skew, checking spread, and watching for outliers, is step zero of working with data, for you and for any model you build.