Summarizing data: center and spread

This is lesson 2 of Track 9 (Statistics & Probability for AI) and the first concrete tool in the track. After the opening lesson made the case that AI reasons under uncertainty, this one starts where every real machine-learning project starts: describing the data before any model touches it. You will learn the two questions every summary answers (where is the center, and how spread out is the data), the tools for answering each, and why a careless summary can mislead a human or a model. The source curriculum is Khan Academy’s Statistics & Probability course, by Sal Khan and the Khan Academy team, which is freely available and cited as further study.

The lesson opens with a salary example where the “average” is technically true and completely misleading. It uses that example to separate the mean (uses every value, so it is dragged by outliers) from the median (the robust middle) and the mode (the center for categories). It then turns to spread: the range, and the workhorse pair of variance and standard deviation, worked by hand on a small dataset. It closes by connecting both center and spread to machine learning, where standardizing a feature by its mean and standard deviation is one of the most common first steps in the pipeline.
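To make the mean-versus-median contrast concrete, here is a minimal Python sketch. The salary figures and the color list are hypothetical, chosen only for illustration; they are not the lesson's own numbers. It uses Python's standard `statistics` module.

```python
from statistics import mean, median, mode

# Hypothetical salaries in thousands; seven ordinary values and one outlier.
salaries = [40, 42, 45, 48, 50, 52, 55, 400]

mean_salary = mean(salaries)      # uses every value, so the 400 drags it up
median_salary = median(salaries)  # middle of the sorted values, barely moved

# The mode is the center for categories: the most frequent label.
colors = ["red", "blue", "red", "green"]  # hypothetical categorical data
most_common = mode(colors)

print(mean_salary)    # 91.5
print(median_salary)  # 49.0
print(most_common)    # red
```

Here the mean (91.5) is larger than seven of the eight salaries, while the median (49.0) sits in the middle of the typical values. That gap is the sense in which an "average" can be true and still misleading.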

This is lesson 2 of 14, the second lesson of Phase 1 (Describing data). The previous lesson, Why AI runs on statistics, gave the map of the whole track; this one delivers the first technique on it. The next lesson, The shape of data: distributions and histograms, turns these numeric summaries into pictures and shows how skew, which here is a disagreement between the mean and the median, jumps straight out of a histogram.

Prerequisites: the previous lesson (Why AI runs on statistics) for context, and comfort with basic arithmetic. You will add, divide, square a few small numbers, and take a square root; that is the whole toolkit. No algebra beyond that is required.

This lesson has real arithmetic, but all of it is small and hand-sized. You compute a mean by adding and dividing, find a median by sorting, and work a standard deviation on a dataset of eight numbers by squaring distances and taking a square root. Every formula is anchored to a worked example you can follow on paper. The goal is to make the summaries feel concrete, not to memorize notation.
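The by-hand steps described above (add and divide, then square distances, then take a square root) can be sketched in a few lines of Python. The eight numbers below are a hypothetical dataset, not necessarily the one the lesson uses. The sketch also assumes the population form of variance (divide by n); the sample form divides by n - 1 instead, and the lesson may use either.

```python
import math

# Hypothetical dataset of eight numbers, chosen so the arithmetic stays small.
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)

# Center: add everything up and divide by the count.
mean = sum(data) / n  # 40 / 8

# Spread: square each value's distance from the mean...
squared_distances = [(x - mean) ** 2 for x in data]  # 9, 1, 1, 1, 0, 0, 4, 16

# ...average those squares (population variance, an assumption of this sketch)...
variance = sum(squared_distances) / n  # 32 / 8

# ...and take the square root to get back to the data's original units.
std_dev = math.sqrt(variance)

# The range is the simplest spread measure: largest minus smallest.
data_range = max(data) - min(data)

print(mean, variance, std_dev, data_range)  # 5.0 4.0 2.0 7
```

The standard library's `statistics.pvariance` and `statistics.pstdev` compute the same population quantities, which is a handy way to check hand calculations.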

  • Compute and interpret the mean, median, and mode of a dataset
  • Choose between the mean and the median based on skew and outliers, and say what each one hides
  • Explain what variance and standard deviation measure and compute them on a small dataset
  • Explain why center and spread together describe a dataset that either alone cannot
  • Connect summarizing to machine learning, where features are standardized by their mean and standard deviation
  • Read time: about 11 minutes
  • Practice time: about 14 minutes (a self-check, a mean-versus-median judgment exercise, a full by-hand computation of all five summaries, and flashcards)
  • Difficulty: standard (light, hand-sized arithmetic; no algebra beyond squaring and square roots)
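
As a preview of the machine-learning connection named in the objectives, the following is a minimal sketch of standardizing a feature: subtract the mean, then divide by the standard deviation. The feature values are hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is an assumption of this sketch rather than something the lesson specifies.

```python
from statistics import mean, pstdev

# Hypothetical feature column (for example, sizes in some unit).
feature = [50, 60, 60, 60, 65, 65, 75, 85]

mu = mean(feature)       # center of the feature
sigma = pstdev(feature)  # population standard deviation (assumed here)

# Standardize: each value becomes "how many standard deviations from the mean".
standardized = [(x - mu) / sigma for x in feature]

print(mu, sigma)     # 65 10.0
print(standardized)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

After this step the feature has mean 0 and standard deviation 1, so features measured on very different scales become comparable.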