Cheatsheet: How machines see: convolution
The one idea that matters
Section titled “The one idea that matters”convolution = slide a small filter (grid of weights) across the image at each position: local weighted sum = "is my pattern here?"the same filter is reused everywhere (weight-sharing)Why fully-connected layers are wrong for images
Section titled “Why fully-connected layers are wrong for images”| Problem | Why |
|---|---|
| Too many parameters | 784 → 784 fully connected = ~614,000 weights, on a tiny image; real photos explode this |
| Ignores locality | Treats neighboring pixels as unrelated, but shapes ARE local arrangements |
| No translation invariance | A pattern learned in one spot must be relearned everywhere else |
The convolution operation
Section titled “The convolution operation”A filter (kernel), e.g. 3x3, slides over the image. At each spot: multiply each pixel by the matching weight, sum to one number = the response there. High response = the filter’s pattern is present.
Worked: an edge detector
Section titled “Worked: an edge detector”Filter (vertical edge, bright-right):
-1 0 +1-1 0 +1-1 0 +1| Patch | Each row | Total | Reading |
|---|---|---|---|
edge: 0 0 1 / 0 0 1 / 0 0 1 | 0·-1 + 0·0 + 1·1 = 1 | 3 | edge detected |
flat: 1 1 1 / 1 1 1 / 1 1 1 | 1·-1 + 1·0 + 1·1 = 0 | 0 | no edge |
Lights up on its pattern, quiet otherwise.
Weight-sharing (the key win)
Section titled “Weight-sharing (the key win)”The same filter weights are reused at every position (like recurrence reusing weights at every step). Two payoffs:
- Far fewer parameters: one 3x3 filter = 9 numbers, regardless of image size (vs ~614,000).
- Translation invariance: the pattern is found wherever it appears, learned once.
Feature maps and many filters
Section titled “Feature maps and many filters”- Sliding one filter produces a feature map: a grid marking where its pattern was found.
- A layer uses many filters, each making its own map; the output is the stack of maps.
- Filters are learned, not hand-set, by the same gradient descent + backprop from the neural-network track. (The hand-picked edge filter was just for illustration.)
- Stacking layers (next lesson) builds edges → parts → objects.
Pitfalls to dodge
Section titled “Pitfalls to dodge”- “Convolution sees the whole image at once.” No. It looks at one small patch at a time and slides.
- “Each position has its own weights.” No. One filter’s weights are shared across all positions.
- “A filter is part of the picture.” No. It is a small grid of learned weights, a pattern-detector.
- “One filter is enough.” No. Many filters per layer, many layers stacked.
Words to use precisely
Section titled “Words to use precisely”- Filter / kernel: a small grid of weights that detects one local pattern.
- Convolution: sliding a filter across the image, computing a local weighted sum at each position.
- Feature map: the grid of responses produced by sliding one filter.
- Weight-sharing: reusing one filter’s weights at every position (fewer params + translation invariance).
The one-line version
Section titled “The one-line version”A convolution slides a small, shared pattern-detector across an image, so it can find a feature anywhere with almost no extra cost.