Skip to content

Summary: How machines see: convolution

A fully-connected layer is the wrong tool for images: too many parameters, and it throws away the two facts that define a picture (nearby pixels form patterns, and a pattern can appear anywhere). The fix is the convolution: a small filter (a grid of weights) that slides across the image, computing a local weighted sum at each position to detect a pattern. Sharing one small filter across every position makes vision networks both efficient and able to find a feature wherever it appears. This is the scan-it-in-five-minutes version; the lesson works an edge detector by hand.

  • Fully-connected layers fail at images for three reasons: parameter explosion (784 to 784 is about 614,000 weights, on a tiny image), ignoring locality (neighboring pixels treated as unrelated), and no reuse (a pattern learned in one spot does nothing elsewhere). It is a shape problem; more neurons do not help.
  • A convolution slides a small filter. A filter (say 3x3) sits over a patch, multiplies each pixel by its matching weight, and sums to one number, the response. High response means the filter’s pattern is here.
  • Worked edge detector. The vertical-edge filter (-1 0 +1 per row) gives +3 on a dark-to-bright edge (it lights up) and 0 on a flat patch (it stays quiet). A reversed edge gives -3; the sign encodes the edge’s direction.
  • A feature map is the grid of responses from sliding one filter, one number per position. It is itself just numbers in a grid, so the next layer can run filters over it (the basis for stacking, next lesson).
  • Weight-sharing is the key win. The same filter is reused at every position, giving far fewer parameters (9 weights for a 3x3 filter, regardless of image size) and translation invariance (a pattern is found wherever it appears). This is the same weight-reuse trick recurrence used for sequences.
  • A layer uses many filters, each detecting a different pattern (one filter is blind to patterns it is not tuned for, a vertical-edge filter sums to 0 on a horizontal edge). And the filters are learned, not hand-set, by the same gradient descent and backpropagation from the previous track. The hand-picked edge filter was only to make the mechanism visible.

Convolution is behind essentially all machine vision: photo search, face detection, medical-imaging tools, the perception in self-driving research. Knowing how it works demystifies the whole category: a vision model is not gazing at a picture the way you do; it is sliding many small pattern-detectors across it and building up from what they find. That also hints at why such systems can be fooled by inputs that scramble local patterns in ways your eye would shrug off, a thread the limitations lesson in Phase 3 picks up. The next lesson stacks convolutions into a hierarchy and shows how edges build up into whole objects.

A fully-connected network stares at every pixel at once and learns nothing about where things are. A convolution slides a small, shared pattern-detector across the image, so it can find a feature anywhere with almost no extra cost.