Practice: The shape of data: distributions and histograms

The skill this lesson builds is reading a distribution at a glance: naming its shape, predicting how its mean and median relate, and noticing when the shape is warning you about the data. The exercises below are all about reading pictures.

Self-check

Six short questions. Answer each in your head before revealing the answer.

1. What two steps turn a list of numbers into a histogram?

Show answer

First, chop the range of values into equal intervals called bins. Second, count how many values fall into each bin and draw a bar that tall. The result shows the shape of the data, which a list of numbers hides.
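
The two steps can be sketched in a few lines of Python. This is a minimal illustration, not a library routine: the function name `histogram_counts` and the ten data values are invented for the example, and the sketch assumes every value lies between `low` and `high`.

```python
def histogram_counts(values, low, high, bin_width):
    """Count how many values fall into each equal-width bin [start, start + width)."""
    n_bins = (high - low) // bin_width
    counts = [0] * n_bins
    for v in values:
        # Step 1: find which bin the value belongs to.
        index = (v - low) // bin_width
        # A value exactly equal to `high` goes into the last bin.
        index = min(index, n_bins - 1)
        # Step 2: add one to that bin's count.
        counts[index] += 1
    return counts


# Hypothetical data, invented for illustration.
values = [3, 7, 8, 12, 15, 18, 19, 21, 24, 36]
counts = histogram_counts(values, low=0, high=40, bin_width=10)

# Draw one text bar per bin, as tall (here: as long) as the count.
for i, count in enumerate(counts):
    start = i * 10
    print(f"{start:>2} to {start + 10:<3}| {'#' * count}  ({count})")
```

Reading the raw list tells you little; the four bars show at once where the values pile up.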

2. Why does bin width matter?

Show answer

Too-wide bins collapse the data into a few fat bars that hide real structure; too-narrow bins turn the picture into spiky noise with one or two values per bin. The same data can look smooth or jagged depending on the bins, so it is always worth asking how they were chosen.
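
To see the effect, the sketch below draws the same values with three different bin widths. The 20 data values and the helper `bin_counts` are hypothetical, chosen so that the data has two clusters; the sketch assumes integer values inside the stated range.

```python
def bin_counts(values, low, high, bin_width):
    """Count values per equal-width bin; assumes low <= value <= high."""
    n_bins = (high - low) // bin_width
    counts = [0] * n_bins
    for v in values:
        counts[min((v - low) // bin_width, n_bins - 1)] += 1
    return counts


# Hypothetical data with two clusters, one in the 10s and one in the 40s.
values = [8, 11, 12, 13, 14, 15, 15, 16, 17, 19,
          22, 38, 41, 43, 44, 45, 46, 47, 49, 53]

for width in (30, 10, 2):
    print(f"bin width {width}:")
    for i, count in enumerate(bin_counts(values, 0, 60, width)):
        print(f"  {i * width:>2} to {(i + 1) * width:<2} | {'#' * count}")
```

With width 30 the two clusters merge into two nearly equal fat bars, with width 10 both peaks are visible, and with width 2 most bars hold zero to two values and the picture turns spiky.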

3. A histogram has a long tail stretching to the right. What is this shape called, and how do the mean and median compare?

Show answer

It is right-skewed (positive skew). The long right tail drags the mean above the median. The shape is named for where the tail points (right), even though most of the data sits on the left.

4. You see a histogram with two distinct peaks. What does that often mean?

Show answer

That two different populations are likely mixed into one column (two server regions, two classes, a mixed group). A single mean can land in the empty valley between the peaks, describing nobody. It is a signal to separate the groups.
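
A small numeric sketch of the "mean in the valley" problem, using invented response times for two hypothetical server regions (the variable names and numbers are assumptions, not data from the lesson):

```python
from statistics import mean

# Hypothetical response times in milliseconds from two server regions.
region_a = [48, 50, 51, 52, 49, 50]
region_b = [198, 200, 202, 201, 199, 200]
combined = region_a + region_b

overall = mean(combined)
# Distance from the overall mean to the closest actual observation.
nearest_gap = min(abs(v - overall) for v in combined)

print(f"mean of region A: {mean(region_a)}")
print(f"mean of region B: {mean(region_b)}")
print(f"mean of the mixed column: {overall}")
print(f"closest real value is {nearest_gap} ms away from that mean")
```

The mixed mean falls between the two clusters, far from every observed value, while the per-region means each describe their own group well.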

5. How does a histogram reveal class imbalance, and why does it matter for AI?

Show answer

Plot the histogram of the labels (the target classes). If one class towers over the others (say 99% versus 1%), that is class imbalance, the base-rate situation from lesson 1 made visible. A model can score high by always predicting the majority class while detecting nothing, so you need to see the imbalance before trusting an accuracy number.
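
The sketch below makes the 99% versus 1% case concrete. The label names and the 1,000-row column are hypothetical; the "model" is the do-nothing baseline that always predicts the most frequent class.

```python
from collections import Counter

# Hypothetical label column: 990 majority-class rows and 10 rare-class rows.
labels = ["common"] * 990 + ["rare"] * 10

# The histogram of a label column is just a count per class.
counts = Counter(labels)
for name, count in counts.most_common():
    print(f"{name:<7}| {'#' * (count // 10)}  ({count})")

# Baseline: always predict the most frequent class.
majority_class = counts.most_common(1)[0][0]
predictions = [majority_class] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
# Of the truly rare rows, how many did the baseline flag as rare?
rare_caught = sum(p == "rare" for p, y in zip(predictions, labels) if y == "rare")
rare_recall = rare_caught / counts["rare"]

print(f"accuracy of always guessing {majority_class!r}: {accuracy:.0%}")
print(f"share of rare cases caught: {rare_recall:.0%}")
```

The accuracy looks excellent and the share of rare cases caught is zero, which is why the label histogram should be checked before any accuracy number is trusted.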

6. Why look at a feature’s distribution before modeling instead of just its mean and standard deviation?

Show answer

Because shape carries information the summary numbers cannot. The histogram surfaces skew (suggesting a transform), outliers and data errors (a lone far-out bar), and hidden subpopulations (a second peak), none of which a mean and standard deviation reveal on their own.
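
As an illustration, the two invented datasets below share the same mean and the same (population) standard deviation, yet one has a single peak and the other has two. The data values are hypothetical and were chosen by hand to make the summary numbers match.

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical datasets, hand-built so their summary numbers agree.
single_peak = [1, 3, 5, 5, 5, 5, 5, 5, 7, 9]
two_peaks = [3, 3, 3, 3, 3, 7, 7, 7, 7, 7]

for name, data in (("single_peak", single_peak), ("two_peaks", two_peaks)):
    # pstdev is the population standard deviation.
    print(f"{name}: mean={mean(data)}, std={pstdev(data)}")
    counts = Counter(data)
    # Every value here is odd, so one bar per odd number covers the data.
    for value in range(1, 10, 2):
        print(f"  {value} | {'#' * counts[value]}")
```

The summary lines are identical; only the bars reveal that the second dataset has nothing at its own mean.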

Try it yourself: name the shape, predict mean vs median

For each histogram, name the shape and say whether the mean is above, below, or about equal to the median. Then check.

Histogram A
 0 to 10  | ##########  (10)
10 to 20  | #####       (5)
20 to 30  | ##          (2)
30 to 40  | #           (1)

Histogram B
group 1 | #######  (7)
group 2 | ##       (2)
group 3 | #        (1)
group 4 | ##       (2)
group 5 | #######  (7)

Histogram C
 50 to 60  | #            (1)
 60 to 70  | ##           (2)
 70 to 80  | ####         (4)
 80 to 90  | #######      (7)
 90 to 100 | ##########   (10)

Histogram D
| #
| ####
| ########
| ####
| #

Show answer

A: right-skewed. Bulk on the left, long tail to the right. The tail pulls the mean above the median.
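B: bimodal. Two peaks (groups 1 and 5) with a valley between. Likely two mixed populations; a single mean lands in the low-count middle and describes neither group. Because the two peaks mirror each other here, the mean and median are about equal, and both sit in that valley.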
C: left-skewed. Bulk on the right (high scores), long tail to the left. The tail pulls the mean below the median. (The classic easy-exam shape.)
D: symmetric, bell-shaped. Mirror-image halves, single peak. The mean and median are about equal, near the center.

The rule to lock in: the mean chases the tail. Right tail, mean above median; left tail, mean below; symmetric, they meet.
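
The rule can be checked against Histograms A, C, and D with a rough calculation. The sketch below makes two assumptions: every value is treated as sitting at its bin's midpoint (so the results are approximations), and since Histogram D has no bin labels, its bins are given hypothetical positions 1 through 5.

```python
from statistics import mean, median


def expand(midpoints, counts):
    """Rebuild an approximate dataset: repeat each bin midpoint `count` times."""
    return [m for m, c in zip(midpoints, counts) for _ in range(c)]


# Midpoints of the bins printed above, paired with the printed counts.
hist_a = expand([5, 15, 25, 35], [10, 5, 2, 1])
hist_c = expand([55, 65, 75, 85, 95], [1, 2, 4, 7, 10])
# Histogram D shows no bin labels; positions 1 to 5 are an assumption.
hist_d = expand([1, 2, 3, 4, 5], [1, 4, 8, 4, 1])

for name, data in (("A", hist_a), ("C", hist_c), ("D", hist_d)):
    print(f"Histogram {name}: mean={mean(data):.2f}, median={median(data):.2f}")
```

Under these assumptions the mean lands above the median for A, below it for C, and exactly on it for D.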

Try it yourself: what is the shape warning you about?

Each item describes a histogram of a feature or label in a dataset headed for a model. Say in one line what the shape is warning you about and what you would do.

1. The histogram of "account age in days" has a tall spike at exactly 0 and
   a smooth spread everywhere else.
2. The histogram of the target label shows 9,800 "not fraud" and 200 "fraud".
3. The histogram of "session length" is heavily right-skewed, crammed against
   the left with a long thin tail of very long sessions.
4. The histogram of "user height in cm" has two clear humps, one around 165
   and one around 178.

Show answer

1: The spike at exactly 0 is suspicious: it is more likely a default or missing value coded as 0 than a crowd of real brand-new accounts. Investigate and clean before modeling; a mean would silently absorb it.
2: Severe class imbalance (98% vs 2%). Accuracy will be misleading; a model that always says “not fraud” scores 98%. Plan to measure with metrics that respect the rare class, such as precision and recall on the fraud class, and possibly rebalance.
3: Right skew. Consider a transform (such as the logarithm) so the bulk is not crammed against the edge, which helps models that expect more balanced spreads.
4: Bimodal, very likely two subpopulations separated by a variable that is not in this column. Consider modeling them separately or adding the grouping variable as a feature; a single mean height describes neither hump.
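
For item 3, a quick sketch of what a log transform does to a right-skewed feature. The ten session lengths are invented for illustration, and the ratio of mean to median is used here only as a rough indicator of how far the tail drags the mean; it is not a formal skewness statistic.

```python
import math
from statistics import mean, median

# Hypothetical session lengths in seconds: mostly short, a few very long.
session_seconds = [10, 20, 30, 40, 60, 90, 150, 400, 1200, 3600]

# log10 needs strictly positive inputs; zero-length sessions would need
# separate handling (for example math.log1p) before a transform like this.
logged = [math.log10(s) for s in session_seconds]

raw_ratio = mean(session_seconds) / median(session_seconds)
log_ratio = mean(logged) / median(logged)

print(f"raw:    mean={mean(session_seconds):.1f}, median={median(session_seconds):.1f}, ratio={raw_ratio:.2f}")
print(f"logged: mean={mean(logged):.3f}, median={median(logged):.3f}, ratio={log_ratio:.2f}")
```

On the raw scale the mean sits several times above the median; after the transform the two are close, which is the sign that the long tail has been pulled in.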

Flashcards

Eight cards.

Q. What is a histogram, in two steps?

Chop the value range into equal bins, then draw a bar for the count of values in each bin. It reveals the shape of the data, which summary numbers hide.

Q. Why does bin width matter when reading a histogram?

Too-wide bins hide real structure in a few fat bars; too-narrow bins manufacture spiky noise. The same data can look smooth or jagged, so ask how the bins were chosen.

Q. Right-skewed: where is the tail, and how do mean and median compare?

The long tail points right (toward high values); most data sits on the left. The tail drags the mean above the median. (Skew is named for the tail’s direction.)

Q. Left-skewed: where is the tail, and how do mean and median compare?

The long tail points left (toward low values); most data sits on the right. The tail drags the mean below the median.

Q. What does a bimodal (two-peak) histogram usually indicate?

That two different populations are likely mixed into one column. A single mean can fall in the valley between peaks and describe neither group; consider separating them.

Q. How does a histogram of the labels reveal class imbalance?

If one class towers over the others (e.g., 98% vs 2%), the bars show it directly. That is the base-rate problem made visible: a majority-class guesser scores high while detecting nothing.

Q. What is the bell shape, and why does it get its own lesson?

A symmetric, single-peaked distribution with smoothly falling tails: the normal distribution. It is so common (measurement error, many natural quantities, model scores) that Track 9 devotes a full lesson to it.

Q. Why inspect a feature's distribution before modeling?

Shape carries information a mean and standard deviation cannot: skew (suggesting a transform), outliers and data errors, and hidden subpopulations all show up in a histogram and not in summary numbers.