Cheatsheet: From edges to objects
The one idea that matters
Stack convolutions → a hierarchy of features: edges → corners & textures → parts → whole objects. Each layer runs filters on the previous layer's feature maps.

Depth means composing simple patterns into complex ones (Lesson 1’s principle, made visual).
The hierarchy, layer by layer
| Layer | Runs filters on | Detects |
|---|---|---|
| 1 | raw pixels | edges, color patches |
| 2 | layer-1 edge maps | corners, curves, textures |
| 3 | layer-2 maps | parts (an eye, a wheel, a loop) |
| deeper | parts | whole objects (a face, an 8) |
Worked trace (handwritten 8): edges → two arcs → two stacked closed loops → classifier reads “two loops” → 8.
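
The first two rows of the table can be sketched in a few lines of NumPy. This is a toy illustration, not a trained network: the filter weights, the bias, and the names `conv2d`, `v_map`, `h_map`, and `corner_map` are hand-picked assumptions, whereas a real CNN learns its filters from data. Layer 1 runs two edge filters on raw pixels; layer 2 runs a filter on the layer-1 edge maps and fires only where a vertical and a horizontal edge meet, which is a corner.

```python
import numpy as np

def conv2d(feature_map, kernel):
    """Slide `kernel` over `feature_map` (stride 1, no padding), summing products."""
    kh, kw = kernel.shape
    out_h = feature_map.shape[0] - kh + 1
    out_w = feature_map.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(feature_map[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    return np.maximum(x, 0.0)

# Toy input: a bright square in the bottom-right of a dark 6x6 image.
image = np.zeros((6, 6))
image[3:, 3:] = 1.0

# Layer 1: hand-picked filters on raw pixels (assumption: real filters are learned).
vertical_edge = np.array([[-1.0, 1.0], [-1.0, 1.0]])    # dark on the left, bright on the right
horizontal_edge = np.array([[-1.0, -1.0], [1.0, 1.0]])  # dark above, bright below
v_map = relu(conv2d(image, vertical_edge))
h_map = relu(conv2d(image, horizontal_edge))

# Layer 2: one filter with a 2x2 kernel per layer-1 map. It looks for a
# vertical edge just below and a horizontal edge just to the right.
# The bias of -2 suppresses locations where only one of the two is present.
pick_v = np.array([[0.0, 0.0], [1.0, 0.0]])
pick_h = np.array([[0.0, 1.0], [0.0, 0.0]])
bias = -2.0
corner_map = relu(conv2d(v_map, pick_v) + conv2d(h_map, pick_h) + bias)

# Location of the strongest corner response, as plain Python ints.
peak = tuple(int(k) for k in np.unravel_index(np.argmax(corner_map), corner_map.shape))
print(peak)  # (2, 2)
```

The single non-zero cell of `corner_map` sits next to the square's top-left corner. The layer-2 filter never looks at pixels; it only reads the edge maps below it.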
Receptive field (why small filters see big things)
Filters stay tiny at every layer. But a layer-2 filter reads layer-1 cells that each already summarized a patch of the image, so it indirectly depends on a larger region. Stack more layers → each small filter effectively reaches across more of the picture.
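
The growth can be worked out with simple arithmetic. The sketch below assumes square filters, no dilation, and a network described as a list of `(kernel_size, stride)` pairs; the function name `receptive_field` is hypothetical. Pooling counts as a layer too, with its window as the kernel size.

```python
def receptive_field(layers):
    """Side length, in input pixels, of the region one output cell depends on.

    `layers` is a list of (kernel_size, stride) pairs, ordered from the
    input toward the output. Assumes square kernels and no dilation.
    """
    rf = 1    # a raw pixel depends only on itself
    jump = 1  # spacing, in input pixels, between neighbouring cells of the current layer
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

print(receptive_field([(3, 1)]))                  # 3: one 3x3 conv
print(receptive_field([(3, 1), (3, 1)]))          # 5: two stacked 3x3 convs
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7: three stacked 3x3 convs
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # 8: conv, 2x2 pool with stride 2, conv
```

Every filter here is 3×3 or smaller, yet the region each output cell depends on keeps widening. A strided pooling layer speeds this up because it doubles the spacing that later filters step across.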
Pooling (zoom out)
Take a small region of a feature map, keep its strongest value, drop the rest:

    1 0  → max-pool →  3
    3 2

- Shrinks the maps (less to handle deeper).
- Trades exact position for general arrangement (small shifts tolerated).
- No learned weights of its own.
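
A minimal NumPy sketch of max-pooling, assuming non-overlapping square windows (window size equal to stride); the name `max_pool2d` is illustrative. It reproduces the 2×2 example above and shows the shift tolerance: moving an activation within one window leaves the pooled map unchanged.

```python
import numpy as np

def max_pool2d(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    h, w = feature_map.shape
    h_out, w_out = h // size, w // size
    # Drop trailing rows/columns that do not fill a whole window.
    trimmed = feature_map[:h_out * size, :w_out * size]
    # Split each axis into (window index, position inside window).
    blocks = trimmed.reshape(h_out, size, w_out, size)
    return blocks.max(axis=(1, 3))

region = np.array([[1, 0],
                   [3, 2]])
print(max_pool2d(region))  # [[3]]

# Small shifts are tolerated: the same activation at (0, 0) or (1, 1)
# falls in the same 2x2 window, so the pooled maps are identical.
a = np.zeros((4, 4)); a[0, 0] = 1.0
b = np.zeros((4, 4)); b[1, 1] = 1.0
print(np.array_equal(max_pool2d(a), max_pool2d(b)))  # True
```

The function contains nothing to train: no weights, only a fixed rule. The tolerance has limits, since a shift that crosses a window boundary does change the output.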
Reading off the answer
After conv + pool rounds, a fully-connected classifier (from the neural-network track) maps the high-level features to class scores. The conv stack sees; the classifier names. A CNN = feature-building hierarchy + classifier on top.
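
The hand-off from feature maps to class scores can be sketched as a flatten step followed by one fully-connected layer. Everything concrete here is an assumption for illustration: the feature maps are random stand-ins for real conv + pool output, the weights are random rather than trained, and the softmax that turns scores into probabilities is a common but optional final step.

```python
import numpy as np

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

def classify(feature_maps, weights, bias):
    """Flatten the feature maps, then apply one fully-connected layer."""
    x = feature_maps.reshape(-1)   # (channels * height * width,)
    scores = weights @ x + bias    # one score per class
    return softmax(scores)

rng = np.random.default_rng(0)
feature_maps = rng.random((8, 4, 4))  # stand-in for the conv stack's output: 8 maps of 4x4
num_classes = 10                      # e.g. the digits 0-9
weights = rng.normal(scale=0.1, size=(num_classes, feature_maps.size))  # untrained
bias = np.zeros(num_classes)

probs = classify(feature_maps, weights, bias)
print(probs.shape)           # (10,)
print(int(np.argmax(probs)))  # index of the highest-scoring class
```

The classifier is a single matrix multiply. It can afford to be this simple because the conv stack has already turned pixels into high-level features.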
What CNNs are used for
- Classification: what is in the picture?
- Detection: what, and where (bounding boxes)?
- Segmentation: which pixels belong to which object?
- Recognition: faces, handwriting, species, production-line defects.
Same hierarchy underneath; different heads on top.
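
One way to picture "different heads" is as small functions that read the same feature maps and differ only in the shape of what they emit. The sketch below is heavily simplified and entirely hypothetical: the features and weights are random stand-ins, the head names are illustrative, and real detection and segmentation heads add more machinery (for example, upsampling back to full image resolution).

```python
import numpy as np

rng = np.random.default_rng(0)
channels, height, width = 16, 8, 8
num_classes = 3

# Stand-in for the shared conv stack's output: (channels, height, width).
features = rng.random((channels, height, width))

def classification_head(features, w):
    """One answer for the whole picture: average each map, then score classes."""
    pooled = features.mean(axis=(1, 2))  # (channels,)
    return w @ pooled                    # (num_classes,)

def segmentation_head(features, w):
    """One answer per cell: the same linear map applied at every location."""
    return np.einsum('kc,chw->khw', w, features)  # (num_classes, height, width)

def detection_head(features, w):
    """Per cell: 4 box numbers plus class scores (a simplified grid-style output)."""
    return np.einsum('kc,chw->khw', w, features)  # (4 + num_classes, height, width)

w_cls = rng.normal(size=(num_classes, channels))
w_seg = rng.normal(size=(num_classes, channels))
w_det = rng.normal(size=(4 + num_classes, channels))

print(classification_head(features, w_cls).shape)  # (3,)
print(segmentation_head(features, w_seg).shape)    # (3, 8, 8)
print(detection_head(features, w_det).shape)       # (7, 8, 8)
```

All three heads consume the same `features` array; only the last step changes with the task.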
Pitfalls to dodge
- “Deeper layers use bigger filters.” No. Filters stay small; the region they effectively cover grows (receptive field).
- “Pooling is where learning happens.” No. Pooling has no weights; it shrinks and summarizes. Learning is in filters + classifier.
- “Eye detector / ear detector are literal.” Not quite. The trend toward increasing abstraction is real, but individual learned filters are messier than the tidy names suggest.
- “The classifier is the clever part.” No. The conv stack does the hard seeing; the classifier’s job is easy because the features are already rich.
The one-line version
One filter finds an edge; a tower of them, each reading the layer below, climbs from edges to objects. Seeing, for a machine, is depth turning local patterns into meaning.