Practice: Why seeing is hard for machines

Self-check

Seven short questions. Answer each in your head (or on paper) before opening the collapsible. Active retrieval is where the learning sticks; rereading feels productive but does much less.

1. When you hand a computer a color photo, what does it actually receive?

Show answer

A grid of numbers. Each pixel is three values (red, green, blue), each typically an integer from 0 to 255 in a standard 8-bit image. No object, edge, or meaning is stored anywhere in the array, only brightness per pixel per channel.
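As a minimal sketch of what that grid looks like, the snippet below builds a hypothetical 2-by-2 color image as nested Python lists (rows, then pixels, then three channel values). The pixel values are invented for illustration, and real images are usually held in an array library rather than in lists.

```python
# A hypothetical 2x2 color image: rows -> pixels -> [red, green, blue].
# The values are invented for illustration; each is an integer in 0..255.
image = [
    [[255, 0, 0], [0, 255, 0]],      # top row: a red pixel, a green pixel
    [[0, 0, 255], [128, 128, 128]],  # bottom row: a blue pixel, a mid-gray pixel
]

height = len(image)
width = len(image[0])
channels = len(image[0][0])

# Flatten the grid to see everything the computer was handed: plain numbers.
values = [v for row in image for pixel in row for v in pixel]

print(height, width, channels)  # 2 2 3
print(len(values))              # 12
print(values)
```

Nothing in `values` says "red square" or "gray square"; those names exist only in the comments.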

2. How many numbers make up a 224-by-224 color image?

Show answer

224 times 224 times 3 = 150,528. (224 times 224 = 50,176 pixels, times 3 color channels.)

3. What is the semantic gap?

Show answer

The distance between the low-level pixel numbers a computer is handed and the high-level meaning a human reads off the same image instantly. You see “cat”; the machine holds roughly 150,000 brightness values (for a 224-by-224 color image). Bridging that gap is the whole problem of computer vision.

4. Name four of the recognition challenges, and say why each one moves the pixels.

Show answer

Any four of: viewpoint (the same object seen from different angles shares almost no pixel values), scale (near vs far changes how many pixels the object covers and where they fall), deformation (a non-rigid object takes endless shapes), occlusion (most of the object may be hidden), illumination (lighting swings every value up or down), background clutter (the object blends into a busy scene), intra-class variation (one category spans many looks). In each, the label stays constant while the numbers change a lot.
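To make the illumination case concrete, here is a small sketch under stated assumptions: a hypothetical 2-by-2 grayscale patch with invented values, and lighting modeled crudely as adding a constant to every pixel. Real lighting changes are more complex, but the point carries over: every number moves while the label does not.

```python
# Hypothetical 2x2 grayscale patch; values invented for illustration.
patch = [
    [40, 200],
    [90, 130],
]

def brighten(img, amount):
    """Add `amount` to every pixel, clamped to the valid 0..255 range."""
    return [[max(0, min(255, v + amount)) for v in row] for row in img]

def changed_fraction(a, b):
    """Fraction of pixel positions whose value differs between a and b."""
    pairs = [(x, y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return sum(x != y for x, y in pairs) / len(pairs)

label = "cat"                   # the label we would assign to the scene
brighter = brighten(patch, 50)  # same scene under stronger lighting

print(brighter)                           # [[90, 250], [140, 180]]
print(changed_fraction(patch, brighter))  # 1.0
print(label)                              # still "cat"
```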

5. Why do hand-written rules fail at recognition?

Show answer

Every rule meets an image that breaks it. “A cat has two pointed ears” fails on a rear view, an occluded ear, a folded-ear breed, and so on. Each patch invites a new counterexample, because real images vary without bound while a rule list is finite.

6. What are the three steps of the data-driven approach?

Show answer

(1) Collect a large dataset of labeled images. (2) Train a model on it until its predictions match the labels. (3) Predict and evaluate on separate images the model has never seen, and measure accuracy.
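The three steps can be sketched end to end on toy data. This is an illustration, not how production vision models are trained: the "images" are hypothetical 4-pixel grayscale patches with invented values, and the model is a simple nearest-centroid classifier chosen because it fits in a few lines.

```python
# Step 1: collect labeled examples. Each "image" is a hypothetical 4-pixel
# grayscale patch with invented values: dark patches are labeled "night",
# bright patches "day".
train = [
    ([10, 20, 15, 30], "night"),
    ([25, 5, 40, 20], "night"),
    ([200, 220, 210, 190], "day"),
    ([180, 240, 230, 205], "day"),
]
# Held-out examples the model never sees during training.
test = [
    ([35, 10, 20, 45], "night"),
    ([215, 195, 225, 185], "day"),
]

def fit(examples):
    """Step 2: train. Average the pixel vectors of each label (a centroid)."""
    sums, counts = {}, {}
    for pixels, label in examples:
        acc = sums.setdefault(label, [0.0] * len(pixels))
        for i, v in enumerate(pixels):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(centroids, pixels):
    """Return the label whose centroid is closest in squared distance."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(pixels, c))
    return min(centroids, key=lambda label: dist(centroids[label]))

def accuracy(centroids, examples):
    """Step 3: evaluate. Fraction of examples predicted correctly."""
    hits = sum(predict(centroids, p) == y for p, y in examples)
    return hits / len(examples)

model = fit(train)
print(accuracy(model, test))  # 1.0 on this toy held-out set
```

The key design choice is that `test` is never passed to `fit`, so the score reflects images the model was not trained on.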

7. Why does accuracy on unseen images matter more than accuracy on the training images?

Show answer

Because unseen images are what the system meets in the real world. Scoring well on the exact images it learned from could just be memorization; the real test is whether the learned patterns generalize to images it was never shown.
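A deliberately extreme sketch shows why training accuracy alone proves little. The "model" below is a hypothetical pure memorizer, an exact-match lookup table over invented toy data; it is not a technique anyone would deploy, only the limiting case of memorization.

```python
# Hypothetical labeled data: tuples of pixel values with invented labels.
train = [
    ((10, 20, 15), "night"),
    ((200, 220, 210), "day"),
    ((30, 5, 25), "night"),
    ((190, 240, 230), "day"),
]
test = [
    ((12, 22, 14), "night"),   # close to a training image, but not identical
    ((205, 215, 225), "day"),
]

# A pure memorizer: an exact-match lookup table built from the training set.
memory = dict(train)

def memorizer_predict(pixels):
    """Return the stored label, or None for any image not seen verbatim."""
    return memory.get(pixels)

def accuracy(predict, examples):
    """Fraction of examples for which predict(pixels) equals the label."""
    return sum(predict(p) == y for p, y in examples) / len(examples)

train_acc = accuracy(memorizer_predict, train)
test_acc = accuracy(memorizer_predict, test)
print(train_acc)  # 1.0
print(test_acc)   # 0.0
```

Perfect on what it has seen, useless on anything new: only the second number says whether anything general was learned.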

Try it yourself: count, classify, and break a rule

Three short exercises, paper only, about 12 minutes.

Part A: how many numbers? For each image, compute how many values the computer receives. (Grayscale = 1 number per pixel; color = 3.)

1. A 28-by-28 grayscale handwritten digit.
2. A 224-by-224 color photo.
3. A 1000-by-1000 color photo.

Answers

1. 28 times 28 times 1 = 784.
2. 224 times 224 times 3 = 150,528.
3. 1000 times 1000 times 3 = 3,000,000.

The jump is the point: even a small 224-by-224 color photo is more than a hundred thousand numbers, and a 1000-by-1000 photo is three million. Whatever turns those numbers into a label has a lot of input to make sense of.
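The Part A arithmetic can be checked with a few lines of Python. The helper name `value_count` is made up for this sketch.

```python
def value_count(height, width, channels):
    """Number of values in an image: one per pixel per channel."""
    return height * width * channels

print(value_count(28, 28, 1))      # 784
print(value_count(224, 224, 3))    # 150528
print(value_count(1000, 1000, 3))  # 3000000
```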

Part B: name the challenge. Each scenario below breaks recognition in one main way. Name which recognition challenge it is.

1. The cat is photographed from directly behind.
2. Only the cat’s tail is visible; the rest is behind a chair.
3. A Sphynx (hairless) and a Persian (very fluffy) are both labeled “cat.”
4. The photo was taken in dim, orange lamplight.
5. A tabby lies on a tabby-patterned blanket.

Answers

1. Viewpoint variation. 2. Occlusion. 3. Intra-class variation. 4. Illumination. 5. Background clutter. (Scenarios 3 and 5 can also touch other challenges, but the main driver is the one named.)

Part C: break a rule. Here is a hand-written rule for detecting a dog: “A dog has four visible legs.” Describe two ordinary photos, both clearly of a dog, that this rule would get wrong.

What a good answer looks like

Examples: a dog sitting (two legs folded under, fewer than four visible); a dog lying down or curled up; a close-up of a dog’s face (no legs at all); a dog in tall grass or behind furniture (legs occluded); a dog swimming. Any of these is clearly a dog and breaks the four-legs rule, which is exactly why rule-writing collapses and the data-driven approach wins.
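The same failure can be shown in code. This sketch assumes a hypothetical `visible_legs` count is already available for each photo, which a real system would not be handed; the photo descriptions are invented and only make the rule testable.

```python
# Hypothetical photo descriptions. A real system would not be handed
# `visible_legs`; it is assumed here only to make the rule testable.
photos = [
    {"scene": "dog standing side-on", "visible_legs": 4, "is_dog": True},
    {"scene": "dog sitting, facing camera", "visible_legs": 2, "is_dog": True},
    {"scene": "close-up of a dog's face", "visible_legs": 0, "is_dog": True},
    {"scene": "dog swimming", "visible_legs": 0, "is_dog": True},
]

def four_legs_rule(photo):
    """The hand-written rule from Part C: a dog has four visible legs."""
    return photo["visible_legs"] == 4

# Photos that really show a dog but that the rule rejects.
missed = [p["scene"] for p in photos if p["is_dog"] and not four_legs_rule(p)]
print(missed)
```

Three of the four dog photos are rejected, and patching the rule for each case would only invite the next counterexample.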

Flashcards

Nine cards.

Q. What does a computer actually receive when handed a color image?

A grid of numbers: three values per pixel (red, green, blue), each 0 to 255. No objects or meaning, just brightness per pixel per channel.

Q. How many numbers are in a 224x224 color image?

150,528 (224 times 224 pixels, times 3 color channels).

Q. What is the semantic gap?

The distance between the raw pixel numbers a computer starts with and the high-level meaning a human reads off the same image instantly. Bridging it is the core problem of computer vision.

Q. Name the recognition challenges that make vision hard.

Viewpoint, scale, deformation, occlusion, illumination, background clutter, and intra-class variation. Each swings the pixel values dramatically while the correct label stays the same.

Q. Why do hand-written rules fail at image recognition?

Real images vary without bound, so every finite rule meets an image that breaks it. Each patch (“unless the ear is hidden…”) invites a new counterexample.

Q. What are the three steps of the data-driven approach?

Collect a large labeled dataset, train a model on it until predictions match the labels, then predict and evaluate on unseen images.

Q. Why is accuracy on unseen images the standard that matters?

Because unseen images are what the system faces in the real world. High accuracy only on the training images may be memorization; generalization is the real test.

Q. Why are vision systems uncannily good yet sometimes oddly wrong?

They learned patterns from examples rather than rules. They excel inside the distribution they were trained on and can fail confidently just outside it (odd angles, unusual lighting, tweaked patterns).

Q. What is the one-sentence takeaway of this lesson?

A machine starts with a grid of numbers, not a cat, and has to earn its way to the label by learning from thousands of examples rather than being told by a rule.