Cheatsheet: Why seeing is hard for machines

| Concept | One line |
| --- | --- |
| Image (to a computer) | A grid of numbers: 3 values (R, G, B), each 0 to 255, per pixel |
| Semantic gap | The distance from raw pixel numbers to the meaning a human reads instantly |
| Core task | Image classification: numbers in, a label out |
| What fails | Hand-written rules (every rule meets an image that breaks it) |
| What works | Data-driven approach: learn patterns from labeled examples |
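
To make "a grid of numbers" concrete, here is a minimal sketch that assumes NumPy as the array library (the cheatsheet itself names no library). The `looks_like_sky` rule is a hypothetical hand-written rule, invented here only to show how such a rule meets an image that breaks it.

```python
import numpy as np

# A tiny 2 x 2 color image: shape is (height, width, channels), values 0..255.
image = np.array(
    [[[255, 0, 0], [0, 255, 0]],
     [[0, 0, 255], [255, 255, 255]]],
    dtype=np.uint8,
)

print(image.shape)  # (2, 2, 3)
print(image[0, 0])  # the top-left pixel: R=255, G=0, B=0


def looks_like_sky(img: np.ndarray) -> bool:
    """Hypothetical hand-written rule: call it 'sky' if blue dominates on average."""
    r = img[..., 0].mean()
    g = img[..., 1].mean()
    b = img[..., 2].mean()
    return bool(b > r and b > g)


# Synthetic stand-ins for two photos of the sky (made-up values, not real data).
blue_sky = np.zeros((4, 4, 3), dtype=np.uint8)
blue_sky[..., 2] = 200          # strong blue channel

sunset_sky = np.zeros((4, 4, 3), dtype=np.uint8)
sunset_sky[..., 0] = 220        # red dominates at sunset
sunset_sky[..., 1] = 120

print(looks_like_sky(blue_sky))    # True
print(looks_like_sky(sunset_sky))  # False: still sky, but the rule breaks
```

Patching the rule for sunsets would only move the failure to the next case (night sky, overcast sky), which is the "what fails" row in miniature.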

The recognition challenges (why one label spans many images)

| Challenge | What changes the pixels |
| --- | --- |
| Viewpoint | Different camera angles share almost no pixel values |
| Scale | Object near vs far: different count and position of pixels |
| Deformation | Non-rigid objects take endless shapes |
| Occlusion | Most of the object may be hidden |
| Illumination | Lighting swings every value up or down |
| Background clutter | Object blends into a busy scene |
| Intra-class variation | One category (e.g. “cat”) spans many looks |

Constant through all of them: the label. Variable through all of them: the numbers.
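
A rough way to see "same label, different numbers" is to perturb an image and count how many pixel values change. The sketch below assumes NumPy and uses a random synthetic grid as a stand-in for a photo. Both are simplifications: a real photo has smooth regions, so a small shift would change fewer pixels than it does here. The helper `fraction_changed` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a grayscale photo: random values, shape (32, 32).
original = rng.integers(0, 256, size=(32, 32), dtype=np.uint8)

# Illumination: brighten every pixel by 40, clipped to the valid 0..255 range.
brighter = np.clip(original.astype(np.int16) + 40, 0, 255).astype(np.uint8)

# A crude viewpoint change: slide the content 3 pixels to the right.
# np.roll wraps pixels around the edge, which a real camera would not do.
shifted = np.roll(original, shift=3, axis=1)


def fraction_changed(a: np.ndarray, b: np.ndarray) -> float:
    """Share of pixel positions whose value differs between two images."""
    return float(np.mean(a != b))


print(f"brighter: {fraction_changed(original, brighter):.1%} of pixels changed")
print(f"shifted:  {fraction_changed(original, shifted):.1%} of pixels changed")
```

In both cases nearly every number changes, while whatever label the image carried stays the same.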

| Step | Action |
| --- | --- |
| 1. Collect | Gather a large dataset of labeled images |
| 2. Train | Adjust the model until predictions match the labels |
| 3. Predict and evaluate | Run the model on held-out images it never saw; measure accuracy on those |

Standard that matters: accuracy on images the model never saw (generalization), not accuracy on the training set (which can be memorization).
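
The three steps, and the gap between training accuracy and accuracy on unseen images, can be sketched with a 1-nearest-neighbor classifier. This is a stand-in chosen for brevity, not a method the cheatsheet prescribes: it "trains" by memorizing the data, which makes the memorization point easy to see. The dataset is synthetic, and `make_dataset` and `predict_1nn` are hypothetical helper names; NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_dataset(n_per_class: int) -> tuple[np.ndarray, np.ndarray]:
    """Synthetic flattened 8 x 8 grayscale 'images'.

    Class 0 is slightly darker on average, class 1 slightly brighter,
    with heavy noise so the two classes overlap.
    """
    dark = rng.normal(loc=110, scale=40, size=(n_per_class, 64))
    bright = rng.normal(loc=130, scale=40, size=(n_per_class, 64))
    x = np.clip(np.vstack([dark, bright]), 0, 255)
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return x, y


def predict_1nn(x_train: np.ndarray, y_train: np.ndarray,
                x_query: np.ndarray) -> np.ndarray:
    """Label each query image with the label of its closest training image."""
    # Squared Euclidean distance from every query to every training image.
    diffs = x_query[:, None, :] - x_train[None, :, :]
    dists = (diffs ** 2).sum(axis=2)
    return y_train[dists.argmin(axis=1)]


# 1. Collect: a labeled training set and a separate held-out test set.
x_train, y_train = make_dataset(100)
x_test, y_test = make_dataset(100)

# 2. Train: for 1-NN, "training" is just storing x_train and y_train.

# 3. Predict and evaluate.
train_acc = float(np.mean(predict_1nn(x_train, y_train, x_train) == y_train))
test_acc = float(np.mean(predict_1nn(x_train, y_train, x_test) == y_test))

print(f"training accuracy: {train_acc:.1%}")
print(f"held-out accuracy: {test_acc:.1%}")
```

Training accuracy is perfect here because every training image is its own nearest neighbor. The held-out number is typically lower, and it is the one that says anything about generalization.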

Image-size arithmetic (values = width x height x channels)

| Image | Values |
| --- | --- |
| 28 x 28 grayscale (a digit) | 784 |
| 224 x 224 color | 150,528 |
| 1000 x 1000 color | 3,000,000 |

Channels: grayscale = 1, color = 3.
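
The arithmetic can be checked with a few lines of plain Python. `value_count` is a hypothetical helper name used only for this illustration.

```python
def value_count(width: int, height: int, color: bool) -> int:
    """Number of values in an image: width x height x channels."""
    channels = 3 if color else 1  # color = 3 (R, G, B), grayscale = 1
    return width * height * channels


print(value_count(28, 28, color=False))     # 784
print(value_count(224, 224, color=True))    # 150528
print(value_count(1000, 1000, color=True))  # 3000000
```

Since each value fits in the range 0 to 255, one byte per value is enough, so the same numbers also give the uncompressed size in bytes.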

| Pitfall | Reality |
| --- | --- |
| The camera is the hard part | Capturing pixels is solved; the semantic gap is the hard part |
| More rules would work | Images vary without bound; a finite rule list never closes the gap |
| The model “sees” like you | It starts with numbers and computes a pattern match, not understanding |
| High accuracy = understanding | It found patterns that separate labels; that is useful, not comprehension |

A machine starts with a grid of numbers, not a cat, and earns its way to the label by learning from examples rather than by following a hand-written rule.