Cheatsheet: The shape of data: distributions and histograms
The one idea
A histogram shows the shape of data, which a center and a spread alone cannot. Before trusting a column of data, look at its shape: skew, outliers, hidden populations, and class imbalance all show up in the picture and hide in the summary numbers.
What a histogram is
Step 1: chop the value range into equal bins.
Step 2: draw a bar for the count of values in each bin.
Bin width too wide -> hides structure (a few fat bars).
Bin width too narrow -> spiky noise (one value per bin).
The shapes and what mean vs median does
| Shape | Looks like | Mean vs median | Often means |
|---|---|---|---|
| Symmetric | Mirror-image halves | About equal | Balanced data |
| Right-skewed | Long tail to the right | Mean above median | Incomes, prices, durations, counts |
| Left-skewed | Long tail to the left | Mean below median | Easy-exam scores, ceilings |
| Uniform | All bins about equal | About equal | Fair die, no favored range |
| Bimodal | Two peaks | Mean in the valley (misleading) | Two populations mixed |
| Bell-shaped | Single peak, smooth tails | About equal | The normal distribution (own lesson) |
Skew is named for where the tail points, not where the bulk sits. The mean chases the tail.
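The "Mean vs median" column can be checked on small samples. The sketch below is only an illustration: the three lists and the `mean_minus_median` helper are made-up assumptions, not data from this lesson.

```python
from statistics import mean, median

# Hypothetical right-skewed sample: a tight bulk plus one long-tail value.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 5, 40]

# Mirror it to get a left-skewed sample (tail now points left).
left_skewed = [41 - x for x in right_skewed]

# Hypothetical bimodal sample: two clusters with an empty valley between them.
bimodal = [1, 2, 2, 3, 17, 18, 18, 19]


def mean_minus_median(values):
    """Positive when the mean sits above the median, negative when below."""
    return mean(values) - median(values)


print(mean_minus_median(right_skewed))  # positive: the mean chases the right tail
print(mean_minus_median(left_skewed))   # negative: the mean chases the left tail
print(mean(bimodal), median(bimodal))   # both land in the valley, far from any value
```

In the bimodal sample neither the mean nor the median is close to an actual data point, which is why a single center is misleading there.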
What each shape warns you about (machine learning)
| You see | Likely meaning | Action |
|---|---|---|
| Strong skew | Bulk crammed against one edge | Consider a transform (e.g. logarithm) |
| Lone far-out bar / impossible spike | Outlier or data error | Investigate and clean |
| Two peaks | Two subpopulations in one column | Separate them or add the grouping feature |
| One label towering | Class imbalance (base-rate trap) | Use imbalance-aware metrics; maybe rebalance |
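Two of the actions in the table can be sketched with the standard library. The `prices` and `labels` lists are hypothetical, and the base-10 logarithm is just one possible transform; it assumes strictly positive values.

```python
import math
from collections import Counter
from statistics import mean, median

# Hypothetical right-skewed, strictly positive values (for example prices).
prices = [1, 3, 10, 30, 100, 300, 1000]

# A log transform is only defined for positive values; it compresses the long tail.
log_prices = [math.log10(p) for p in prices]

raw_gap = mean(prices) - median(prices)          # large: the mean chases the tail
log_gap = mean(log_prices) - median(log_prices)  # small: roughly symmetric now

# Hypothetical labels with one dominant class.
labels = ["ok"] * 95 + ["fraud"] * 5
label_counts = Counter(labels)

# Share of the most common class = accuracy of always guessing that class.
majority_label, majority_count = label_counts.most_common(1)[0]
baseline_accuracy = majority_count / len(labels)

# A crude text histogram of the labels, one '#' per five rows.
for label, count in label_counts.most_common():
    print(f"{label:>6} {'#' * (count // 5)} {count}")
print(f"always guessing {majority_label!r} scores {baseline_accuracy:.0%}")
```

The baseline accuracy is the base-rate trap in numbers: a model that never predicts the rare class still looks accurate, which is why plain accuracy is a poor metric on imbalanced labels.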
Pitfalls to dodge
- Trusting summary numbers without seeing the shape.
- Letting bin width fool you (too wide hides, too narrow invents).
- Ignoring a second peak (two populations, neither of them described by the overall mean).
- Reading skew backward (it is named for the tail, not the bulk).
- Forgetting the labels have a distribution too (imbalance lives there).
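The bin-width pitfall is easy to reproduce. Below is a minimal pure-Python sketch of equal-width binning; the `histogram` helper and the sample `values` are illustrative assumptions, and plotting libraries may handle bin edges differently.

```python
def histogram(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    if n_bins < 1 or hi == lo:
        raise ValueError("need at least one bin and a non-zero value range")
    width = (hi - lo) / n_bins  # Step 1: chop the range into equal bins
    counts = [0] * n_bins
    for v in values:
        # The maximum lands on the right edge, so clamp it into the last bin.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1  # Step 2: each bar is the count of values in its bin
    return counts


# Hypothetical sample with two clusters and an empty gap between them.
values = [1, 2, 2, 3, 3, 4, 11, 12, 12, 13, 13, 14]

for n_bins in (2, 5, 26):
    counts = histogram(values, n_bins)
    print(f"{n_bins:>2} bins:", " ".join(str(c) for c in counts))
```

With 2 bins the sample looks like two equal fat bars and the gap disappears; with 5 bins the two peaks and the empty middle are visible; with 26 bins every bar holds one or two values and the picture turns into spiky noise.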
Words to use precisely
- Histogram: bars showing the count of values in each bin of the range.
- Bin: one equal interval of the value range.
- Skew: a stretched tail on one side; right-skew pulls the mean above the median, left-skew pulls it below.
- Bimodal: two peaks; usually two mixed populations.
- Class imbalance: one target class far more common than others; a histogram of labels reveals it.