Summary: Summarizing data: center and spread

Before any model learns, someone summarizes the data, and a careless summary misleads as surely as a careful one informs. Every summary answers two questions: where is the center of the data, and how spread out is it? A company where nine people earn $50,000 and the founder earns $2 million has a mean salary of $245,000 and a median of $50,000, and only the median describes the typical employee. This lesson covers the two questions, the tools for answering each, and why they are step zero of machine learning. This summary is the scan-in-five-minutes version of the full lesson.
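
The salary arithmetic can be checked in a few lines. This is a minimal sketch using Python's standard statistics module; the variable names are illustrative, and the figures are the ones from the example above.

```python
from statistics import mean, median

# Nine employees at $50,000 plus one founder at $2,000,000.
salaries = [50_000] * 9 + [2_000_000]

# Mean: add every value and divide by the count, so the founder's pay counts fully.
mean_salary = mean(salaries)

# Median: the middle of the sorted values (here the average of the 5th and 6th).
median_salary = median(salaries)

print(f"mean:   ${mean_salary:,.0f}")    # mean:   $245,000
print(f"median: ${median_salary:,.0f}")  # median: $50,000
```

Raising the founder's salary further moves the mean but leaves the median where it is, which is the sense in which the median resists outliers.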

Core ideas

Center, question one. The mean (add the values and divide by the count) uses every value, so outliers drag it. The median (the middle value of the sorted data) resists outliers; it depends on which value sits in the middle, not on how far away the extremes are. The mode (most common value) is the center for categorical data.
Skew is when the mean and the median disagree. When a few extreme values stretch one tail, the mean is pulled toward the extremes while the median holds at the center. For skewed data (incomes, prices, wait times), the median is the more honest “typical” value. The salary example: mean $245,000, median $50,000.
Spread, question two. Center alone is half a description: two datasets with the same mean (49, 50, 51 versus 0, 50, 100) can look nothing alike. The range (max minus min) is simple but fragile. Variance is the average squared distance from the mean; standard deviation is its square root, the typical distance from the mean, in the original units.
Why square the distances. Signed distances from the mean always average to zero, because the positive and negative gaps cancel. So they are squared (making them non-negative and weighting larger gaps more), averaged, then square-rooted at the end to return to the original units.
This is real ML, not a warm-up. Standardizing a feature (subtracting its mean and dividing by its standard deviation) recenters it at 0 and rescales its standard deviation to 1, so a large-unit feature (income) does not numerically swamp a small-unit one (age). That standardized value is the z-score of a later lesson.
Spread is a signal too. A feature that barely varies carries little information for a model to learn from; a feature with a very large spread can dominate the others. The summary you compute shapes what the model can learn.
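
The spread ideas above can be traced on the two small datasets from the text. This is a minimal sketch, not a definitive implementation: the function name summarize_spread is made up for illustration, and it uses the population form of variance (divide by the count), which matches the "average squared distance" wording here. Many libraries also offer a sample form that divides by the count minus one.

```python
from math import sqrt
from statistics import mean


def summarize_spread(values):
    """Return range, mean signed deviation, variance, and standard deviation."""
    center = mean(values)
    deviations = [v - center for v in values]      # signed distances from the mean
    squared = [d ** 2 for d in deviations]         # squaring removes the sign
    variance = mean(squared)                       # average squared distance
    return {
        "mean": center,
        "range": max(values) - min(values),
        "mean_signed_deviation": mean(deviations),  # cancels to zero
        "variance": variance,
        "std": sqrt(variance),                      # back in the original units
    }


tight = summarize_spread([49, 50, 51])
wide = summarize_spread([0, 50, 100])

for name, summary in (("tight", tight), ("wide", wide)):
    print(name, {key: round(value, 2) for key, value in summary.items()})
```

Both datasets have a mean of 50 and a mean signed deviation of 0, yet the standard deviation is under 1 for the first and about 41 for the second.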

What changes for you

You now read a reported “average” with a question attached: average of what, and is the data skewed? When a headline says the average of something is X, you ask whether a few extreme values are inflating it and whether the median would tell a different story. You also stop accepting a center without a spread: “the average is 75” is half an answer until you know whether the data huddles tight or sprawls wide. And when you meet the machine-learning step called standardization (sometimes loosely called normalization), it is no longer jargon: it is subtracting a mean and dividing by a standard deviation, the two summaries from this lesson, applied to every feature before a model sees it.
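
Standardization itself is only a few lines. This is a minimal sketch under stated assumptions: the function name standardize and the income and age values are made up for illustration, and the population standard deviation is used. In practice a library routine would usually do this step.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores: subtract the mean, then divide by the standard deviation."""
    center = mean(values)
    spread = pstdev(values)  # population standard deviation
    if spread == 0:
        # A feature that never varies has no spread to divide by.
        raise ValueError("cannot standardize a constant feature")
    return [(v - center) / spread for v in values]


# Hypothetical features on very different scales.
incomes = [30_000, 45_000, 60_000, 75_000, 90_000]
ages = [25, 35, 45, 55, 65]

z_incomes = standardize(incomes)
z_ages = standardize(ages)

print([round(z, 2) for z in z_incomes])
print([round(z, 2) for z in z_ages])
```

After standardizing, each feature has mean 0 and standard deviation 1, so income no longer outweighs age simply because its raw numbers are larger.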