From edges to objects: cheatsheet

The one idea that matters

stack convolutions → a hierarchy of features:
  edges  →  corners & textures  →  parts  →  whole objects
each layer runs filters on the previous layer's feature maps

Depth composes simple patterns into complex ones (Lesson 1’s principle, made visual).
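A minimal sketch of "each layer runs filters on the previous layer's feature maps", using plain NumPy. The toy image, the hand-picked edge filter, and the helper name conv2d_valid are assumptions for illustration; a trained network learns its own filters.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide kernel over image (no padding, stride 1) and return the feature map.

    Like most deep-learning libraries, this does not flip the kernel
    (strictly speaking it is cross-correlation).
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy 5x5 image: dark on the left (0), bright on the right (1).
image = np.array([[0, 0, 0, 1, 1]] * 5, dtype=float)

# Hand-picked vertical-edge filter (hypothetical; real filters are learned).
vertical_edge = np.array([[-1, 0, 1],
                          [-1, 0, 1],
                          [-1, 0, 1]], dtype=float)

# Layer 1 reads raw pixels; ReLU keeps only positive responses.
layer1 = np.maximum(conv2d_valid(image, vertical_edge), 0)
print(layer1)  # each row is [0, 3, 3]: strong response where dark meets bright

# Layer 2 reads layer 1's feature map, not the pixels.
layer2_filter = np.ones((2, 2))  # placeholder filter, also hypothetical
layer2 = np.maximum(conv2d_valid(layer1, layer2_filter), 0)
print(layer2.shape)  # (2, 2)
```

The same function serves both layers; only the input changes, which is the whole point of stacking.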

The hierarchy, layer by layer

Layer	Runs filters on	Detects
1	raw pixels	edges, color patches
2	layer-1 edge maps	corners, curves, textures
3	layer-2 maps	parts (an eye, a wheel, a loop)
deeper	parts	whole objects (a face, an 8)

Worked trace (handwritten 8): edges → two arcs → two stacked closed loops → classifier reads “two loops” → 8.

Receptive field (why small filters see big things)

Filters stay tiny at every layer. But a layer-2 filter reads layer-1 cells that each already summarized a patch of the image, so it indirectly depends on a larger region. For example, with stride 1, two stacked 3×3 filters cover a 5×5 patch of the image, and three cover 7×7. Stack more layers → each small filter effectively reaches across more of the picture.
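The growth can be computed with the standard receptive-field recurrence: each layer adds (kernel size − 1) × (product of all earlier strides). A small sketch; the function name and the example layer lists are assumptions, not part of the lesson.

```python
def receptive_field(layers):
    """Return the receptive-field width (in input pixels) after each layer.

    layers: list of (kernel_size, stride) pairs, first layer first.
    """
    rf = 1      # one input pixel sees itself
    jump = 1    # distance in input pixels between neighbouring cells
    sizes = []
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
        sizes.append(rf)
    return sizes

# Three 3x3 convolutions, stride 1: the filter never grows, the view does.
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # [3, 5, 7]

# 3x3 conv, then 2x2 pooling with stride 2, then another 3x3 conv.
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # [3, 4, 8]
```

Pooling counts as a layer here because its stride makes every later filter step across the image in bigger jumps.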

Pooling (zoom out)

Take a small region of a feature map, keep its strongest value, drop the rest:

1  0   →  max-pool  →  3
3  2

Shrinks the maps, so deeper layers have less to compute.
Trades exact position for general arrangement (small shifts tolerated).
No learned weights of its own.
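A sketch of max-pooling in NumPy, assuming non-overlapping square windows (the function name max_pool2d and the test maps are illustrative). It reproduces the 2×2 example above and shows the position trade-off: a peak that moves within one window gives the same pooled map.

```python
import numpy as np

def max_pool2d(fmap, size=2):
    """Keep the strongest value in each non-overlapping size x size window."""
    h, w = fmap.shape
    h, w = h - h % size, w - w % size          # drop edge rows/cols that don't fill a window
    blocks = fmap[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))             # no weights anywhere: nothing to learn

fmap = np.array([[1, 0],
                 [3, 2]])
print(max_pool2d(fmap))  # [[3]]

# Same peak, shifted by one cell inside the same 2x2 window.
a = np.zeros((4, 4)); a[0, 0] = 5
b = np.zeros((4, 4)); b[1, 1] = 5
print(np.array_equal(max_pool2d(a), max_pool2d(b)))  # True
```

A shift that crosses a window boundary does change the output, so the tolerance only covers small moves.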

Reading off the answer

After conv + pool rounds, a fully-connected classifier (from the neural-network track) maps the high-level features to class scores. The conv stack sees; the classifier names. A CNN = feature-building hierarchy + classifier on top.
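A sketch of the "classifier names" step: flatten the final feature maps, apply one fully-connected layer, and turn the scores into probabilities with softmax. The feature maps and weights below are random stand-ins (assumptions), so the predicted class is meaningless; only the shapes and the flow are the point.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    """Convert raw class scores into probabilities that sum to 1."""
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

num_classes = 10                          # e.g. digits 0-9
features = rng.random((4, 3, 3))          # stand-in for 4 pooled feature maps of 3x3

# Fully-connected layer: one weight per (class, feature) pair. Untrained here.
W = rng.normal(size=(num_classes, features.size)) * 0.1
b = np.zeros(num_classes)

scores = W @ features.ravel() + b         # one score per class
probs = softmax(scores)
predicted = int(np.argmax(probs))         # index of the most likely class
print(probs.shape, predicted)
```

In a trained network, W and b are learned together with the filters.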

What CNNs are used for

Classification: what is in the picture?
Detection: what, and where (bounding boxes)?
Segmentation: which pixels belong to which object?
Recognition: faces, handwriting, species, production-line defects.

Same hierarchy underneath; different heads on top.
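A shape-level sketch of "different heads on top", assuming the conv stack has already produced a block of feature maps. All weights are random and the head designs are simplified illustrations, not any specific published architecture; real segmentation and detection heads add upsampling, box decoding, and more.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W, num_classes = 8, 6, 6, 3
features = rng.random((C, H, W))          # stand-in for the shared conv stack's output

# Classification head: average each map over space, then one linear layer.
W_cls = rng.normal(size=(num_classes, C))
class_scores = W_cls @ features.mean(axis=(1, 2))            # (num_classes,)

# Segmentation head: the same linear map at every position (a 1x1 convolution),
# giving class scores per feature-map cell.
W_seg = rng.normal(size=(num_classes, C))
pixel_scores = np.einsum("kc,chw->khw", W_seg, features)     # (num_classes, H, W)
pixel_labels = pixel_scores.argmax(axis=0)                   # (H, W)

# Detection head: per cell, 4 box numbers plus class scores.
W_det = rng.normal(size=(4 + num_classes, C))
cell_outputs = np.einsum("kc,chw->khw", W_det, features)     # (4 + num_classes, H, W)

print(class_scores.shape, pixel_labels.shape, cell_outputs.shape)
# (3,) (6, 6) (7, 6, 6)
```

All three heads read the same features array; only the last step differs.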

Pitfalls to dodge

“Deeper layers use bigger filters.” No. Filters stay small; the region they effectively cover grows (receptive field).
“Pooling is where learning happens.” No. Pooling has no weights; it shrinks and summarizes. Learning is in filters + classifier.
“Eye detector / ear detector are literal.” Not quite. The increasing abstraction is real, but individual learned filters are messier than the tidy names suggest.
“The classifier is the clever part.” No. The conv stack does the hard seeing; the classifier’s job is easy because the features are already rich.

The one-line version

One filter finds an edge; a tower of them, each reading the layer below, climbs from edges to objects. Seeing, for a machine, is depth turning local patterns into meaning.