Skip to content

Cheatsheet: How machines see: convolution

convolution = slide a small filter (grid of weights) across the image
at each position: local weighted sum = "is my pattern here?"
the same filter is reused everywhere (weight-sharing)

Why fully-connected layers are wrong for images

Section titled “Why fully-connected layers are wrong for images”
ProblemWhy
Too many parameters784 → 784 fully connected = ~614,000 weights, on a tiny image; real photos explode this
Ignores localityTreats neighboring pixels as unrelated, but shapes ARE local arrangements
No translation invarianceA pattern learned in one spot must be relearned everywhere else

A filter (kernel), e.g. 3x3, slides over the image. At each spot: multiply each pixel by the matching weight, sum to one number = the response there. High response = the filter’s pattern is present.

Filter (vertical edge, bright-right):

-1 0 +1
-1 0 +1
-1 0 +1
PatchEach rowTotalReading
edge: 0 0 1 / 0 0 1 / 0 0 10·-1 + 0·0 + 1·1 = 13edge detected
flat: 1 1 1 / 1 1 1 / 1 1 11·-1 + 1·0 + 1·1 = 00no edge

Lights up on its pattern, quiet otherwise.

The same filter weights are reused at every position (like recurrence reusing weights at every step). Two payoffs:

  • Far fewer parameters: one 3x3 filter = 9 numbers, regardless of image size (vs ~614,000).
  • Translation invariance: the pattern is found wherever it appears, learned once.
  • Sliding one filter produces a feature map: a grid marking where its pattern was found.
  • A layer uses many filters, each making its own map; the output is the stack of maps.
  • Filters are learned, not hand-set, by the same gradient descent + backprop from the neural-network track. (The hand-picked edge filter was just for illustration.)
  • Stacking layers (next lesson) builds edges → parts → objects.
  • “Convolution sees the whole image at once.” No. It looks at one small patch at a time and slides.
  • “Each position has its own weights.” No. One filter’s weights are shared across all positions.
  • “A filter is part of the picture.” No. It is a small grid of learned weights, a pattern-detector.
  • “One filter is enough.” No. Many filters per layer, many layers stacked.
  • Filter / kernel: a small grid of weights that detects one local pattern.
  • Convolution: sliding a filter across the image, computing a local weighted sum at each position.
  • Feature map: the grid of responses produced by sliding one filter.
  • Weight-sharing: reusing one filter’s weights at every position (fewer params + translation invariance).

A convolution slides a small, shared pattern-detector across an image, so it can find a feature anywhere with almost no extra cost.