Cheatsheet: From edges to objects

stack convolutions → a hierarchy of features:
edges → corners & textures → parts → whole objects
each layer runs filters on the previous layer's feature maps

Depth composes simple patterns into complex ones (Lesson 1’s principle, made visual).

Layer  | Runs filters on   | Detects
1      | raw pixels        | edges, color patches
2      | layer-1 edge maps | corners, curves, textures
3      | layer-2 maps      | parts (an eye, a wheel, a loop)
deeper | parts             | whole objects (a face, an 8)

Worked trace (handwritten 8): edges → two arcs → two stacked closed loops → classifier reads “two loops” → 8.
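
A minimal sketch of the stacking idea above, in plain Python with no libraries. The 5×5 toy image, the filter values, and the helper names `conv2d_valid` and `relu` are illustrative assumptions; a trained network learns its own filters rather than using hand-picked ones.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1); return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(fmap):
    """Keep positive responses, zero out the rest."""
    return [[max(0, v) for v in row] for row in fmap]


# Toy 5x5 image: dark (0) on the left, bright (1) on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]

# Layer 1: a hand-picked filter that responds to dark-to-bright vertical edges.
vertical_edge = [[-1, 1],
                 [-1, 1]]
layer1 = relu(conv2d_valid(image, vertical_edge))
# layer1 is 4x4; every row is [0, 2, 0, 0] (strong response only at the edge).

# Layer 2: a filter that reads layer 1's edge map, not the pixels.
# This one simply adds up neighbouring edge responses.
combine = [[1, 1],
           [1, 1]]
layer2 = relu(conv2d_valid(layer1, combine))
# layer2 is 3x3; every row is [4, 4, 0].
```

Each layer-2 cell was computed from a 2×2 patch of layer-1 cells, which together cover a 3×3 patch of the original image, even though both filters are only 2×2.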

Receptive field (why small filters see big things)

Filters stay tiny at every layer. But a layer-2 filter reads layer-1 cells that each already summarized a patch of the image, so it indirectly depends on a larger region. Stack more layers → each small filter effectively reaches across more of the picture.
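
The growth can be worked out with a little arithmetic. The sketch below uses the standard receptive-field recurrence for a chain of layers; the function name `receptive_field` and the specific kernel sizes and strides are assumptions chosen for illustration, not values taken from this lesson.

```python
def receptive_field(layers):
    """Side length (in input pixels) seen by one cell after the given layers.

    layers: list of (kernel_size, stride) pairs, ordered from first layer to last.
    """
    rf = 1      # a single input pixel sees itself
    jump = 1    # distance in input pixels between neighbouring cells
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # each extra kernel tap reaches `jump` pixels further
        jump *= stride              # striding spreads neighbouring cells further apart
    return rf


# Three stacked 3x3 filters, stride 1: the filter never grows, the region does.
print(receptive_field([(3, 1)]))          # 3
print(receptive_field([(3, 1)] * 2))      # 5
print(receptive_field([(3, 1)] * 3))      # 7

# Alternating 3x3 conv with 2x2 pooling (stride 2) grows the region faster.
conv, pool = (3, 1), (2, 2)
print(receptive_field([conv, pool, conv, pool, conv]))  # 18
```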

Max pooling: take a small region of a feature map, keep its strongest value, drop the rest. For a 2×2 region:

1 0
3 2   → max-pool →   3

  • Shrinks the maps (less to handle deeper).
  • Trades exact position for general arrangement (small shifts tolerated).
  • No learned weights of its own.
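
A small sketch of max pooling in plain Python, reproducing the 2×2 example above and the shift tolerance from the list. The function name `max_pool2d` and the non-overlapping window (stride equal to window size) are assumptions; other window and stride choices exist.

```python
def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling: keep the largest value in each size x size block."""
    pooled = []
    for r in range(0, len(fmap) - size + 1, size):
        row = []
        for c in range(0, len(fmap[0]) - size + 1, size):
            row.append(max(fmap[r + i][c + j]
                           for i in range(size) for j in range(size)))
        pooled.append(row)
    return pooled


# The example from the text: the strongest value in the block survives.
print(max_pool2d([[1, 0],
                  [3, 2]]))          # [[3]]

# Small shifts are tolerated: the same strong response, moved by one cell
# inside the same block, gives an identical pooled map.
a = [[9, 0, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]
b = [[0, 0, 0, 0],
     [0, 9, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]
print(max_pool2d(a) == max_pool2d(b))  # True
```

The function contains no parameters to train: its output depends only on its input.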

After conv + pool rounds, a fully-connected classifier (from the neural-network track) maps the high-level features to class scores. The conv stack sees; the classifier names. A CNN = feature-building hierarchy + classifier on top.
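
To make the split concrete, here is a rough sketch of the classifier step alone: one fully-connected layer turning a feature vector into class scores. The three features, the weight values, and the names `linear` and `predict` are all hypothetical and hand-picked for illustration; in a real CNN the features come out of the conv stack and the weights are learned.

```python
def linear(features, weights, biases):
    """One fully-connected layer: a weighted sum of the features per class, plus a bias."""
    return [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]


def predict(features, weights, biases, labels):
    """Return the label with the highest score, along with all scores."""
    scores = linear(features, weights, biases)
    best = max(range(len(scores)), key=scores.__getitem__)
    return labels[best], scores


# Hypothetical high-level features: [top loop, bottom loop, vertical stroke].
labels = ["1", "8", "9"]
weights = [
    [-1, -1,  2],   # "1": a stroke, no loops
    [ 1,  1, -1],   # "8": two loops
    [ 1, -1,  1],   # "9": a top loop plus a stroke
]
biases = [0, 0, 0]

# The conv stack reports two loops and no stroke; the classifier names it.
print(predict([1, 1, 0], weights, biases, labels))  # ('8', [-2, 2, 0])
print(predict([1, 0, 1], weights, biases, labels))  # ('9', [1, 0, 2])
```

With features this rich, a single weighted sum per class is enough to separate the digits.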

  • Classification: what is in the picture?
  • Detection: what, and where (bounding boxes)?
  • Segmentation: which pixels belong to which object?
  • Recognition: faces, handwriting, species, production-line defects.

Same hierarchy underneath; different heads on top.

  • “Deeper layers use bigger filters.” No. Filters stay small; the region they effectively cover grows (receptive field).
  • “Pooling is where learning happens.” No. Pooling has no weights; it shrinks and summarizes. Learning is in filters + classifier.
  • “Eye detector / ear detector are literal.” Not quite. Increasing abstraction is real; individual learned filters are messier than the tidy names.
  • “The classifier is the clever part.” No. The conv stack does the hard seeing; the classifier’s job is easy because the features are already rich.
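
One way to check the pooling point above is to count learnable weights per layer. The layer sizes below are hypothetical (a small network for 28×28 single-channel images), and the helper names are assumptions; the formulas themselves are the usual weight-plus-bias counts.

```python
def conv_params(kernel, in_channels, out_channels):
    """One kernel x kernel filter per (input, output) channel pair, plus one bias per output."""
    return kernel * kernel * in_channels * out_channels + out_channels


def pool_params():
    """Max pooling only selects values; it has nothing to learn."""
    return 0


def dense_params(in_features, out_features):
    """One weight per (input, output) pair, plus one bias per output."""
    return in_features * out_features + out_features


layers = [
    ("conv 3x3, 1 -> 8 channels", conv_params(3, 1, 8)),     # 80
    ("max-pool 2x2", pool_params()),                         # 0
    ("conv 3x3, 8 -> 16 channels", conv_params(3, 8, 16)),   # 1168
    ("max-pool 2x2", pool_params()),                         # 0
    ("dense 400 -> 10 classes", dense_params(400, 10)),      # 4010
]

for name, count in layers:
    print(f"{name:28s} {count:5d}")

total = sum(count for _, count in layers)
print("total learnable weights:", total)  # 5258
```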

One filter finds an edge; a tower of them, each reading the layer below, climbs from edges to objects. Seeing, for a machine, is depth turning local patterns into meaning.