Practice: From edges to objects

Self-check

Seven short questions. Try to answer each in your head (or on paper) before reading the answer. Active retrieval is where the learning sticks; rereading is comfortable but does much less.

1. How can a second convolutional layer detect combinations of edges, when a convolution only sees a small patch?

Show answer

Because the first layer’s output is a feature map, just another grid of numbers, the second layer runs its filters over those maps as if they were the image. So a second-layer filter is not looking at pixels; it is looking at arrangements of the edges the first layer found (two edges meeting is a corner, a curved arrangement is an arc).
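To make the stacking concrete, here is a minimal pure-Python sketch. Everything in it is an illustrative assumption, not part of the lesson: the 5x5 image, the hand-picked kernels, and the bias of -4. A trained network would learn its own weights. The sketch shows a second-layer filter reading two first-layer edge maps and firing only where a horizontal edge and a vertical edge meet.

```python
def conv2d(grid, kernel):
    """Slide the kernel over the grid (stride 1, no padding) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(grid) - kh + 1
    out_w = len(grid[0]) - kw + 1
    return [
        [
            sum(grid[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(grid):
    """Keep positive responses, zero out the rest."""
    return [[max(0, v) for v in row] for row in grid]


# Hypothetical image: a bright square whose top-left corner sits at row 2, column 2.
image = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]

# Layer 1: two hand-picked edge filters applied to the pixels.
h_edge = [[-1, -1], [1, 1]]   # dark above, bright below
v_edge = [[-1, 1], [-1, 1]]   # dark on the left, bright on the right
h_map = relu(conv2d(image, h_edge))
v_map = relu(conv2d(image, v_edge))

# Layer 2: one filter with a kernel per input map. It never sees pixels,
# only the edge maps. The bias makes it fire only when BOTH edges are present.
corner_on_h = [[1, 1], [0, 0]]   # wants a horizontal edge running to the right
corner_on_v = [[1, 0], [1, 0]]   # wants a vertical edge running downward
bias = -4
from_h = conv2d(h_map, corner_on_h)
from_v = conv2d(v_map, corner_on_v)
corner_map = relu([
    [a + b + bias for a, b in zip(row_h, row_v)]
    for row_h, row_v in zip(from_h, from_v)
])

for row in corner_map:
    print(row)
```

The only nonzero cell in `corner_map` is the one where the two edges meet. Along the rest of either edge, one map alone is not enough to overcome the bias.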

2. Walk the hierarchy: roughly what does each layer detect, from first to deep?

Show answer

First layer: edges and color patches. Second layer: corners, curves, simple textures (combinations of edges). Third layer: parts (an eye, a wheel, a loop). Deeper layers: whole objects (a face, a digit). Simple at the bottom, complex at the top, each layer building on the one below.

3. Trace the handwritten 8 through the hierarchy.

Show answer

The first layer finds the short curved edges of the 8’s outline. The second assembles those into two arc shapes. The third recognizes two stacked, closed arcs as a pair of loops. The classifier at the top reads “two loops, one above the other” and reports 8. Each step is modest; the recognition is the whole tower working together.

4. The filters stay small at every layer, yet deeper layers “see” bigger parts of the image. How?

Show answer

Receptive field. A second-layer filter reads a small patch of first-layer cells, but each of those cells already summarized a patch of the original image, so the filter indirectly depends on a larger region. Stack more layers and each small filter effectively reaches across more of the picture, which is how tiny filters end up recognizing whole objects.
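The growth can be worked out with simple arithmetic. The sketch below uses the standard recurrence for receptive-field size; the function name and the two example stacks are hypothetical, chosen only for illustration. Each layer is described by its filter size and stride, and `jump` tracks how many original pixels separate two neighboring cells at the current depth.

```python
def receptive_field(layers):
    """Return the receptive-field side length (in input pixels) after each layer.

    layers: list of (kernel_size, stride) pairs, first layer first.
    """
    size, jump = 1, 1          # a single input pixel sees itself
    sizes = []
    for kernel, stride in layers:
        size += (kernel - 1) * jump   # each extra tap reaches `jump` pixels further
        jump *= stride                # striding spreads neighboring cells apart
        sizes.append(size)
    return sizes


# Three 3x3 convolutions, stride 1: the filter never grows, the reach does.
print(receptive_field([(3, 1), (3, 1), (3, 1)]))   # [3, 5, 7]

# Convolution, 2x2 pooling, convolution, 2x2 pooling, convolution.
conv_pool_stack = [(3, 1), (2, 2), (3, 1), (2, 2), (3, 1)]
print(receptive_field(conv_pool_stack))            # [3, 4, 8, 10, 18]
```

In the second stack every convolution is still 3x3, yet the last one depends on an 18x18 region of the input, because pooling doubles the spacing between the cells it reads.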

5. What does pooling do, and does it have learned weights?

Show answer

Pooling shrinks a feature map and makes it a little position-tolerant: take a small region (say 2x2) and keep just its largest value (max-pooling), discarding exactly where in the region it occurred. It trades precise location for general arrangement as you go up the stack. It has no learned weights; it is a fixed shrink-and-summarize step.

6. After the convolution-and-pooling stack, what turns the extracted features into an actual answer?

Show answer

A fully connected classifier (the kind from the neural-network track) maps the final high-level features to class scores, one per label. The convolutional stack does the hard work of seeing; the classifier just names what was seen. The classifier’s job is easy precisely because the features handed to it are already rich.
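As a rough sketch of that last step, a fully connected layer is a weighted sum per label. The three feature names, the labels, and every weight below are made up for illustration; in a real network the weights are learned and there are far more features.

```python
def class_scores(features, weights, biases):
    """One score per label: weighted sum of the features plus a bias."""
    return [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]


# Hypothetical high-level features handed over by the convolutional stack:
# [upper loop, lower loop, straight stroke]
features = [2.0, 2.0, 0.0]

labels = ["1", "6", "8"]
weights = [
    [-1.0, -1.0, 2.0],   # "1": wants a stroke, penalizes loops
    [-1.0, 1.0, 1.0],    # "6": wants a lower loop and a stroke
    [1.0, 1.0, -1.0],    # "8": wants both loops, no stroke
]
biases = [0.0, 0.0, 0.0]

scores = class_scores(features, weights, biases)
best = labels[scores.index(max(scores))]
print(scores, "->", best)   # [-4.0, 0.0, 4.0] -> 8
```

With features this informative, a single weighted sum per label is enough to pick the answer, which is the sense in which the classifier's job is easy.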

7. True or false: the “edge detector, eye detector” labels are literally what each filter learns.

Show answer

Roughly false, with a true core. The increasing abstraction with depth is real and well supported, but the individual filters a network actually learns are usually messier and less nameable than tidy labels suggest. Hold the hierarchy as the real picture; treat the clean labels as a helpful simplification.

Try it yourself: pool by hand, then trace a hierarchy

Three short exercises, paper only, about 10 minutes.

Part A: max-pooling. Here is a 4x4 feature map (the kind of grid a convolution produces). Apply 2x2 max-pooling: split it into four non-overlapping 2x2 blocks and replace each block with its largest value. Predict the 2x2 result.

1  0  2  3
3  2  1  0
0  1  4  1
2  5  0  2

What you’ll get

3  3
5  4

Top-left block {1,0,3,2} keeps 3; top-right {2,3,1,0} keeps 3; bottom-left {0,1,2,5} keeps 5; bottom-right {4,1,0,2} keeps 4. The map shrank from 16 numbers to 4, keeping the strongest response in each region and forgetting exactly where in the block it sat. That is pooling: smaller, and a little more tolerant of small shifts. Notice you used no weights, just “take the max.”
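If you want to check your paper answer, here is a minimal sketch of the same procedure. The grid is built from the four blocks listed above; how the four values sit inside each block is an assumption (it does not change the result, since max-pooling ignores position within a block). The function assumes the grid's height and width divide evenly by the block size.

```python
def max_pool(grid, size=2):
    """Non-overlapping max-pooling: keep the largest value in each size x size block."""
    return [
        [
            max(grid[r + i][c + j] for i in range(size) for j in range(size))
            for c in range(0, len(grid[0]), size)
        ]
        for r in range(0, len(grid), size)
    ]


# Blocks taken from the worked answer; the layout inside each block is assumed.
feature_map = [
    [1, 0, 2, 3],
    [3, 2, 1, 0],
    [0, 1, 4, 1],
    [2, 5, 0, 2],
]

pooled = max_pool(feature_map)
for row in pooled:
    print(row)
# [3, 3]
# [5, 4]
```

There are no parameters to train anywhere in `max_pool`: the only choice is the block size.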

Part B: trace a hierarchy. Pick a simple object (a coffee mug, the letter A, a smiley face). Sketch, in words, what each of four layers would plausibly detect on its way to recognizing it.

What a good answer looks like (smiley face)

Layer 1: short edges and arcs. Layer 2: corners and curve segments (the curve of the mouth, the round of an eye). Layer 3: parts (two eye-blobs, one curved mouth). Layer 4 / classifier: “two eyes above a curved mouth, inside a circle” scores high for a face. The point is not the exact labels (real filters are messier) but the shape of the climb: simple parts combining into bigger ones through depth.

Part C (reasoning). Why does pooling having no learned weights still help the network learn?

What you should notice

Pooling does not learn anything itself, but by shrinking the maps and tolerating small shifts, it lets the layers that do learn (the convolutional filters and the classifier) work on smaller, more position-stable inputs. It is plumbing that makes the learnable parts more effective, not a learner itself.
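The "tolerating small shifts" part can be seen in a small experiment. This is a sketch with made-up feature maps: a single strong response placed at three different positions. Two of the positions fall inside the same 2x2 block, and one falls in the neighboring block.

```python
def max_pool(grid, size=2):
    """Non-overlapping max-pooling: keep the largest value in each size x size block."""
    return [
        [
            max(grid[r + i][c + j] for i in range(size) for j in range(size))
            for c in range(0, len(grid[0]), size)
        ]
        for r in range(0, len(grid), size)
    ]


def single_peak(row, col, value=9, side=4):
    """Hypothetical feature map: zeros everywhere except one strong response."""
    grid = [[0] * side for _ in range(side)]
    grid[row][col] = value
    return grid


original = max_pool(single_peak(1, 1))
shifted_within_block = max_pool(single_peak(0, 0))   # moved, same 2x2 block
shifted_across_blocks = max_pool(single_peak(1, 2))  # moved into the next block

print(original)               # [[9, 0], [0, 0]]
print(shifted_within_block)   # [[9, 0], [0, 0]]
print(shifted_across_blocks)  # [[0, 9], [0, 0]]
```

A shift inside one block leaves the pooled map unchanged, so the layers above see the same input. A shift across a block boundary still shows up, which is why the tolerance is only "a little."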

Flashcards

Ten cards.

Q. How do stacked convolutions build a hierarchy?

Each layer runs filters on the previous layer’s feature maps. So edges (layer 1) combine into corners and textures (layer 2), then parts (layer 3), then whole objects (deeper). Depth composes simple patterns into complex ones.

Q. Trace the handwritten 8 through a CNN.

Edges of the outline (layer 1) -> two arc shapes (layer 2) -> two stacked closed loops (layer 3) -> classifier reads “two loops” and reports 8. Each step modest; the recognition is the whole tower.

Q. What is the receptive field, and why does it grow with depth?

The region of the original image a filter effectively depends on. Filters stay small, but a deeper filter reads cells that each already summarized a patch, so its reach compounds layer by layer until small filters recognize whole objects.

Q. What does max-pooling do?

Takes a small region of a feature map (e.g. 2x2) and keeps only its largest value. It shrinks the map and trades precise position for general arrangement. It has no learned weights.

Q. Does pooling have learned weights?

No. It is a fixed shrink-and-summarize step (take the max in each region). The learning lives in the convolutional filters and the final classifier.

Q. What turns a CNN’s features into an answer?

A fully connected classifier maps the final high-level features to class scores. The conv stack sees; the classifier names. The classifier’s job is easy because the features are already rich.

Q. What is a convolutional network, in one line?

A feature-building hierarchy (stacked convolutions and pooling) with a classifier on top.

Q. Are the “edge detector, eye detector” filter labels literal?

Not quite. The increasing abstraction with depth is real and well supported, but individual learned filters are messier than tidy labels. The hierarchy is true; the clean names are a simplification.

Q. What are CNNs used for?

Image classification (what is it?), object detection (what and where?), segmentation (which pixels?), and recognition tasks (faces, handwriting, defects). Different heads on the same feature-building hierarchy.

Q. What is the one-sentence takeaway of this lesson?

One filter finds an edge; a tower of them, each reading the layer below, climbs from edges to objects. Seeing, for a machine, is depth turning local patterns into meaning.