Practice: How machines see: convolution

Self-check

Seven short questions. Try to answer each in your head (or on paper) before reading the answer. Active retrieval is where the learning sticks; rereading is comfortable but does much less.

1. Why is a fully-connected layer the wrong tool for images?

Show answer

Three reasons: it has too many parameters (784 to 784 is about 614,000 weights on a tiny image, and real photos explode this), it ignores locality (it treats neighboring pixels as unrelated, but shapes are local arrangements of pixels), and it cannot reuse what it learns (a pattern learned in one spot does nothing for another, so it relearns at every position). It is a shape problem, not a size problem.

2. What does a convolution filter do as it slides across an image?

Show answer

At each position it sits over a small patch, multiplies each pixel by its matching filter weight, and sums to a single number. A high response means the filter’s pattern is present at that spot; near zero means it is absent. The filter is a small pattern-detector.

3. Take the vertical-edge filter from the lesson. What does it output on a dark-left/bright-right edge patch, and on a flat all-bright patch?

Show answer

On the edge patch (0 0 1 per row), each row gives 0·(-1) + 0·0 + 1·1 = 1, total 3: the filter lights up. On the flat patch (1 1 1 per row), each row gives 1·(-1) + 1·0 + 1·1 = 0, total 0: it stays quiet. Strong where its pattern is present, near zero where absent.
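The arithmetic above can be checked in a few lines of Python. This is a minimal sketch: the helper name response and the nested-list layout are illustrative choices, not something the lesson defines.

```python
# Vertical-edge filter from the lesson (bright on the right).
VERTICAL_EDGE = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

def response(patch, filt):
    """Multiply each pixel by its matching filter weight and sum all nine products."""
    return sum(
        pixel * weight
        for patch_row, filt_row in zip(patch, filt)
        for pixel, weight in zip(patch_row, filt_row)
    )

edge_patch = [[0, 0, 1], [0, 0, 1], [0, 0, 1]]  # dark left, bright right
flat_patch = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]  # uniformly bright

print(response(edge_patch, VERTICAL_EDGE))  # 3
print(response(flat_patch, VERTICAL_EDGE))  # 0
```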

4. What is weight-sharing, and what two payoffs does it give?

Show answer

The same filter (its same handful of weights) is reused at every position as it slides. Payoffs: far fewer parameters (one 3x3 filter is 9 numbers regardless of image size, versus ~614,000 for the fully-connected layer) and translation invariance (a pattern is found wherever it appears, learned once). It is the same weight-reuse trick that recurrence used for sequences.
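To see weight-sharing at work, the sketch below slides the same nine weights across two small images whose edge sits in different columns. The helper names feature_row and edge_image, and the 3-row by 8-column image size, are assumptions made for illustration.

```python
VERTICAL_EDGE = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

def feature_row(image, filt):
    """Slide a 3x3 filter left to right across a 3-row image, one response per position."""
    width = len(image[0])
    responses = []
    for col in range(width - 2):
        total = 0
        for r in range(3):
            for c in range(3):
                total += image[r][col + c] * filt[r][c]
        responses.append(total)
    return responses

def edge_image(edge_at, width=8):
    """Three identical rows: dark (0) left of column edge_at, bright (1) from there on."""
    return [[0 if c < edge_at else 1 for c in range(width)] for _ in range(3)]

print(feature_row(edge_image(2), VERTICAL_EDGE))  # [3, 3, 0, 0, 0, 0]
print(feature_row(edge_image(5), VERTICAL_EDGE))  # [0, 0, 0, 3, 3, 0]
```

The filter never changes between the two calls. Only the position of the strong responses moves, following the edge.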

5. What is a feature map?

Show answer

The grid of responses you get by sliding one filter across the whole image, one number per position. Its bright spots mark where that filter’s pattern was found. A feature map is itself just numbers in a grid, so the next layer can run its own filters over it.
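A feature map can be sketched as a nested loop over every position where the filter fits. The function name feature_map and the 5x5 test image are illustrative. The sketch also assumes the filter moves one pixel at a time and never hangs over the image border, which the lesson does not specify.

```python
VERTICAL_EDGE = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

def feature_map(image, filt):
    """Return the grid of responses from sliding filt over every position where it fits."""
    fh, fw = len(filt), len(filt[0])
    out_h = len(image) - fh + 1
    out_w = len(image[0]) - fw + 1
    return [
        [
            sum(image[r + i][c + j] * filt[i][j] for i in range(fh) for j in range(fw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

# 5x5 image: dark on the left three columns, bright on the right two.
image = [[0, 0, 0, 1, 1] for _ in range(5)]

for row in feature_map(image, VERTICAL_EDGE):
    print(row)  # each of the three rows prints [0, 3, 3]
```

The output is itself a 3x3 grid of numbers, so another filter could be run over it in exactly the same way.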

6. The lesson used a hand-picked edge filter. In a real network, where do filter weights come from?

Show answer

They are learned, not hand-set. The filters are just more weights, trained exactly the way every weight in the previous track was: gradient descent and backpropagation driving down a cost. The network discovers for itself which patterns are worth detecting; the tidy edge filter was only to make the mechanism visible.
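As a toy illustration of filter weights being learned rather than hand-set, the sketch below starts a filter at all zeros and runs plain gradient descent on a squared-error cost. The setup is an assumption made to keep the example tiny: the target responses come from a hidden "teacher" filter, whereas in a real network the cost is measured on the final task and gradients reach the filter through backpropagation. The learning rate, step count, and patch count are likewise arbitrary choices.

```python
import random

random.seed(0)

# Hidden "teacher" filter (the lesson's vertical-edge filter, flattened to 9 numbers).
# It is used only to label the training patches; the learner never sees it directly.
TEACHER = [-1, 0, 1, -1, 0, 1, -1, 0, 1]

def response(patch, weights):
    """Multiply each pixel by its matching weight and sum."""
    return sum(p * w for p, w in zip(patch, weights))

# Toy training set: 100 random 3x3 patches (flattened) with their desired responses.
patches = [[random.random() for _ in range(9)] for _ in range(100)]
targets = [response(p, TEACHER) for p in patches]

weights = [0.0] * 9   # the filter being learned
learning_rate = 0.2   # assumed value, chosen small enough to stay stable

for step in range(1000):
    # Gradient of the mean squared error with respect to each of the 9 weights.
    grad = [0.0] * 9
    for patch, target in zip(patches, targets):
        error = response(patch, weights) - target
        for k in range(9):
            grad[k] += 2 * error * patch[k] / len(patches)
    weights = [w - learning_rate * g for w, g in zip(weights, grad)]

print([round(w, 2) for w in weights])
```

After training, the learned weights sit very close to the teacher's -1, 0, +1 columns, even though nobody typed those values into the learner.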

7. Fill in the blank. “A convolution slides a small, ______ pattern-detector across the image, so it can find a feature ______ with almost no extra cost.”

Show answer

“Shared” and “anywhere.” Sharing the filter across positions is what makes it both efficient (few parameters) and translation-invariant (a feature is found wherever it appears).

Try it yourself: run the filter by hand on fresh patches

You did the edge patch and the flat patch in the lesson. Now run the same vertical-edge filter on two new patches and predict what each tells you. Paper arithmetic, about 10 minutes.

The filter (vertical edge, bright on the right):

-1  0  +1
-1  0  +1
-1  0  +1

Patch A (a reversed edge: bright on the left, dark on the right):

1  1  0
1  1  0
1  1  0

Compute the response (multiply each cell by the matching filter weight, sum all nine). Predict the sign before you add.

What you’ll get

Each row gives 1·(-1) + 1·0 + 0·1 = -1, and three rows give -3. A strong negative response. The same filter that gave +3 on a dark-to-bright edge gives -3 on a bright-to-dark edge. The sign tells you the direction of the edge: positive for one orientation, negative for the reverse. A filter does not just detect “an edge”; it detects a specific edge.
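Once you have worked it on paper, the sign flip can be confirmed in code. The sketch below also tries a mirrored filter, which is not part of the lesson and is built here only to show that a filter tuned to the opposite direction responds positively to Patch A.

```python
VERTICAL_EDGE = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

# Hypothetical mirror image of the lesson's filter: bright on the left.
MIRRORED_EDGE = [row[::-1] for row in VERTICAL_EDGE]

patch_a = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 1, 0],
]

def response(patch, filt):
    """Multiply each pixel by its matching filter weight and sum all nine products."""
    return sum(
        pixel * weight
        for patch_row, filt_row in zip(patch, filt)
        for pixel, weight in zip(patch_row, filt_row)
    )

print(response(patch_a, VERTICAL_EDGE))  # -3
print(response(patch_a, MIRRORED_EDGE))  # 3
```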

Patch B (a horizontal edge: bright on top, dark on the bottom):

1  1  1
1  1  1
0  0  0

Run the vertical-edge filter on it. Predict the result before computing.

What you’ll get

Every row gives (value)·(-1) + (value)·0 + (value)·1 = 0 (each row is uniform left-to-right), so the total is 0. The vertical-edge filter is blind to a horizontal edge. This is the key point about filters: each one detects its own specific pattern and ignores others. To find horizontal edges too, you need a different filter, which is exactly why a real layer uses many filters side by side.
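The sketch below runs two filters on Patch B side by side. The horizontal-edge filter is a hypothetical one (bright on top), built by turning the lesson's filter on its side. The lesson itself only defines the vertical one.

```python
VERTICAL_EDGE = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

# Hypothetical horizontal-edge filter (bright on top), not taken from the lesson.
HORIZONTAL_EDGE = [
    [1, 1, 1],
    [0, 0, 0],
    [-1, -1, -1],
]

patch_b = [
    [1, 1, 1],
    [1, 1, 1],
    [0, 0, 0],
]

def response(patch, filt):
    """Multiply each pixel by its matching filter weight and sum all nine products."""
    return sum(
        pixel * weight
        for patch_row, filt_row in zip(patch, filt)
        for pixel, weight in zip(patch_row, filt_row)
    )

# One response per filter, as a layer with two filters would produce at this position.
responses = {
    "vertical": response(patch_b, VERTICAL_EDGE),
    "horizontal": response(patch_b, HORIZONTAL_EDGE),
}
print(responses)  # {'vertical': 0, 'horizontal': 3}
```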

Bonus: the parameter count. That 3x3 filter has how many weights, and how does it compare to a fully-connected layer mapping 784 inputs to 784 neurons?

What you’ll get

The filter has 9 weights, no matter how large the image is. The fully-connected layer has 784 x 784 = 614,656 weights. That gap, and the fact that it grows with image size, is why convolution (not full connection) is how networks are wired to see.
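The comparison is quick to reproduce. In this sketch the function names are illustrative, biases are ignored (the counts above are weights only), and the 224x224 grayscale image is an assumed example of a larger input, not a figure from the lesson.

```python
def fully_connected_weights(n_inputs, n_neurons):
    """Every input connects to every neuron: one weight per pair."""
    return n_inputs * n_neurons

def filter_weights(height, width):
    """A filter's weight count depends only on its own size, never on the image."""
    return height * width

small = fully_connected_weights(784, 784)
large = fully_connected_weights(224 * 224, 224 * 224)  # assumed larger image

print(small)                 # 614656
print(large)                 # 2517630976
print(filter_weights(3, 3))  # 9
```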

Flashcards

Ten cards.

Q. Why is a fully-connected layer wrong for images?

Too many parameters (~614,000 for 784 to 784, on a tiny image), it ignores that neighboring pixels form patterns, and it cannot reuse what it learns across positions. A shape problem, not a size problem.

Q. What is a convolution?

Sliding a small filter (a grid of weights) across an image; at each position, multiply the local patch by the filter and sum to one number. High response means the filter’s pattern is present there.

Q. What does the vertical-edge filter output on an edge vs a flat patch?

+3 on a dark-to-bright vertical edge (it lights up) and 0 on a flat patch (it stays quiet). A reversed edge gives -3; the sign encodes edge direction.

Q. What is weight-sharing, and why does it matter?

The same filter weights are reused at every position. It gives far fewer parameters (one 3x3 filter is 9 weights regardless of image size) and translation invariance (a pattern is found wherever it appears).

Q. What is a feature map?

The grid of responses from sliding one filter across the whole image, one number per position. Its bright spots mark where the filter’s pattern was found, and it can feed into the next layer.

Q. Are a network's convolution filters hand-designed?

No. They are learned by gradient descent and backpropagation, the same as any weights. The network discovers which patterns to detect; hand-picked filters (like the edge detector) are only for illustration.

Q. Why does a single filter give 0 on a pattern it is not tuned for?

Each filter detects one specific pattern. A vertical-edge filter, run on a horizontal edge, sums to 0: it is blind to that edge. That is why a layer uses many filters side by side, each for a different pattern.

Q. What does translation invariance mean for a CNN?

Because the same filter scans every location, a feature is detected wherever it appears in the image, learned once and applied everywhere, with no need to relearn it per position.

Q. How many filters does a real convolutional layer use?

Many, side by side, each with its own weights detecting a different pattern (vertical edges, horizontal edges, a curve). Each produces its own feature map; the layer’s output is the stack of maps.

Q. What is the one-sentence takeaway of this lesson?

A convolution slides a small, shared pattern-detector across an image, so it can find a feature anywhere with almost no extra cost.