Cheatsheet: The shape of data: distributions and histograms

A histogram shows the shape of data, which summary numbers for center and spread cannot. Before trusting a column of data, look at its shape: skew, outliers, hidden populations, and class imbalance all show up in the picture and hide in the numbers.

Step 1: chop the value range into equal bins.
Step 2: draw a bar for the count of values in each bin.
Bin width too wide -> hides structure (a few fat bars).
Bin width too narrow -> spiky noise (one value per bin).
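
The two steps and the bin-width trade-off above can be sketched with NumPy's `np.histogram`. This is a minimal illustration, not a recommended recipe: the data (two groups mixed in one column), the bin counts tried, and the helper name `bin_counts` are all assumptions made up for the example.

```python
import numpy as np

# Hypothetical data: two groups mixed in one column (assumption for illustration).
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(20, 2, 500), rng.normal(35, 2, 500)])

def bin_counts(values, bins):
    """Step 1: split the range into `bins` equal intervals. Step 2: count values per bin."""
    counts, edges = np.histogram(values, bins=bins)
    return counts, edges

for bins in (2, 20, 500):
    counts, edges = bin_counts(values, bins)
    width = edges[1] - edges[0]          # all bins share this width
    empty = int(np.sum(counts == 0))     # many empty bins suggests bins are too narrow
    print(f"bins={bins:>3}  width={width:6.2f}  tallest bar={counts.max():>3}  empty bins={empty}")
```

With only 2 bins the two groups collapse into two fat bars, and with 500 bins most bars hold a handful of values or none. A middle setting is what lets the two peaks show.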

| Shape | Looks like | Mean vs median | Often means |
| --- | --- | --- | --- |
| Symmetric | Mirror-image halves | About equal | Balanced data |
| Right-skewed | Long tail to the right | Mean above median | Incomes, prices, durations, counts |
| Left-skewed | Long tail to the left | Mean below median | Easy-exam scores, ceilings |
| Uniform | All bins about equal | About equal | Fair die, no favored range |
| Bimodal | Two peaks | Mean in the valley (misleading) | Two populations mixed |
| Bell-shaped | Single peak, smooth tails | About equal | The normal distribution (own lesson) |

Skew is named for where the tail points, not where the bulk sits. The mean chases the tail.
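
The mean-versus-median column can be checked numerically. The sketch below uses synthetic samples chosen for illustration (an exponential sample for right skew, its mirror image for left skew, a normal sample for symmetry, and a two-group mix for bimodality). The helper `mean_minus_median` and all parameters are assumptions, not part of the cheatsheet.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic samples (assumptions for illustration only).
right_skewed = rng.exponential(scale=10.0, size=10_000)   # long tail to the right
left_skewed = 100.0 - right_skewed                        # mirrored: long tail to the left
symmetric = rng.normal(50.0, 10.0, 10_000)
bimodal = np.concatenate([rng.normal(20, 2, 5_000), rng.normal(35, 2, 5_000)])

def mean_minus_median(x):
    """Positive when the mean sits above the median, negative when below."""
    return float(np.mean(x) - np.median(x))

for name, x in [("right-skewed", right_skewed),
                ("left-skewed", left_skewed),
                ("symmetric", symmetric)]:
    print(f"{name:>12}: mean - median = {mean_minus_median(x):+.2f}")

# For the bimodal mix, the mean lands in the valley where few values actually lie.
near_mean = np.abs(bimodal - bimodal.mean()) < 2.0
print(f"bimodal: share of values within 2 units of the mean = {near_mean.mean():.3f}")
```

The sign of `mean - median` follows the tail, and in the bimodal case the mean describes almost none of the actual values.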

What each shape warns you about (machine learning)

| You see | Likely meaning | Action |
| --- | --- | --- |
| Strong skew | Bulk crammed against one edge | Consider a transform (e.g. logarithm) |
| Lone far-out bar / impossible spike | Outlier or data error | Investigate and clean |
| Two peaks | Two subpopulations in one column | Separate them or add the grouping feature |
| One label towering | Class imbalance (base-rate trap) | Use imbalance-aware metrics; maybe rebalance |
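
As one illustration of the "consider a transform" action, the sketch below measures skewness before and after a logarithm on a synthetic right-skewed column. The lognormal data, the `skewness` helper (the third standardized moment, computed by hand), and the choice of `np.log` are assumptions for the example. A plain logarithm only works for strictly positive values, and whether a transform helps depends on the data and the model.

```python
import numpy as np

def skewness(x):
    """Third standardized moment: positive for a right tail, negative for a left tail."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return float(np.mean(centered**3) / np.std(x)**3)

# Hypothetical strongly right-skewed, strictly positive column (e.g. prices).
rng = np.random.default_rng(2)
prices = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

log_prices = np.log(prices)   # valid only because every value is > 0

print(f"skewness before log: {skewness(prices):.2f}")
print(f"skewness after  log: {skewness(log_prices):.2f}")
```

Plotting a histogram of the transformed column is still the real check. The skewness number only summarizes what the picture shows.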

Common mistakes

  • Trusting summary numbers without seeing the shape.
  • Letting bin width fool you (too wide hides, too narrow invents).
  • Ignoring a second peak (two populations, and the overall mean describes neither).
  • Reading skew backward (it is named for the tail, not the bulk).
  • Forgetting the labels have a distribution too (imbalance lives there).

Key terms

  • Histogram: bars showing the count of values in each bin of the range.
  • Bin: one equal interval of the value range.
  • Skew: a stretched tail on one side; right-skew pulls the mean up, left-skew pulls it down.
  • Bimodal: two peaks; usually two mixed populations.
  • Class imbalance: one target class far more common than others; a histogram of labels reveals it.
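
A histogram of labels is just a count per class. The sketch below builds one with `np.unique` and shows the base-rate trap: a model that always predicts the majority class already reaches the majority share as its accuracy. The labels `"ok"` and `"fraud"` and their 950/50 split are made up for illustration.

```python
import numpy as np

# Hypothetical target column: 950 "ok" rows and 50 "fraud" rows (assumption).
labels = np.array(["ok"] * 950 + ["fraud"] * 50)

# Count of each class: the label histogram.
classes, counts = np.unique(labels, return_counts=True)
for cls, n in zip(classes, counts):
    print(f"{cls:>6}: {n:>4}  {'#' * (n // 25)}")

# Base rate: accuracy of always predicting the most common class.
majority_class = classes[np.argmax(counts)]
majority_share = counts.max() / counts.sum()
print(f"always predicting '{majority_class}' scores {majority_share:.0%} accuracy")
```

Any accuracy figure for a model on this column should be read against that baseline, which is why imbalance-aware metrics are suggested in the table above.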