Summary: From edges to objects

One filter finds an edge, but a cat is not an edge. Stacking convolutions builds a hierarchy: early layers find edges, later layers combine them into corners and textures, then parts, then whole objects. Pooling shrinks the maps, trading precise position for general arrangement, and a fully-connected classifier turns the final features into an answer. Together, these steps are how a network climbs from “there is an edge here” to “this is a cat.” This is the scan-it-in-five-minutes version; the lesson traces the climb concretely.

Core ideas

Stacking builds a hierarchy. A feature map is just a grid of numbers, so a second convolutional layer runs filters over the first layer’s maps. The result: edges (layer 1) combine into corners and textures (layer 2), then parts (layer 3), then whole objects (deeper layers). This is depth composing simple patterns into complex ones, the lesson-1 principle made visual.
A concrete climb (the handwritten 8): edges of the outline, then two arc shapes, then two stacked closed loops, then the classifier reads “two loops” and reports 8. Each step is modest; the recognition is the whole tower together.
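To make the stacking mechanics concrete, here is a minimal sketch in plain Python. Everything in it is an illustrative assumption rather than anything from the lesson: the helper names (conv2d, relu), the 6x6 toy image, and the hand-picked filter weights, which a real network would learn. It only shows the plumbing: layer 2 takes layer 1's feature maps as its input, exactly as layer 1 took the image.

```python
def conv2d(maps, kernels):
    """Slide one filter over a stack of maps ("valid" positions only).

    maps:    list of 2D grids (one per input channel), all the same size
    kernels: list of square 2D grids, one per input channel
    """
    k = len(kernels[0])
    rows = len(maps[0]) - k + 1
    cols = len(maps[0][0]) - k + 1
    out = [[0.0] * cols for _ in range(rows)]
    for fmap, kern in zip(maps, kernels):
        for r in range(rows):
            for c in range(cols):
                out[r][c] += sum(
                    fmap[r + i][c + j] * kern[i][j]
                    for i in range(k)
                    for j in range(k)
                )
    return out


def relu(fmap):
    """Keep positive responses, zero out the rest."""
    return [[max(0.0, v) for v in row] for row in fmap]


# Toy 6x6 image (hypothetical): a bright square in the lower-right corner.
image = [[1.0 if r >= 3 and c >= 3 else 0.0 for c in range(6)] for r in range(6)]

# Hand-picked filters for illustration; a trained network learns its own.
VERTICAL = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]      # dark-left to bright-right
HORIZONTAL = [[-1, -1, -1], [0, 0, 0], [1, 1, 1]]    # dark-top to bright-bottom
ONES = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]

# Layer 1: two filters read the raw image and produce two feature maps.
vertical_map = relu(conv2d([image], [VERTICAL]))
horizontal_map = relu(conv2d([image], [HORIZONTAL]))

# Layer 2: one filter reads BOTH layer-1 maps, not the image.
combined_map = relu(conv2d([vertical_map, horizontal_map], [ONES, ONES]))

# Side lengths of the grids at each stage.
print(len(image), len(vertical_map), len(combined_map))  # prints: 6 4 2
```

The layer-2 filter here simply adds up nearby edge responses, so it is not a tuned corner detector. The point is only that its input is a grid of edge responses rather than pixels.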
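Receptive field. A cell’s receptive field is the patch of the original image that can influence it. Filters stay small at every layer, but a deeper filter reads cells that each already summarized a patch, so its effective reach grows with depth, and grows faster once pooling has shrunk the maps. That growing reach is how tiny filters end up recognizing whole objects.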
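The growth in reach can be worked out with simple arithmetic. The sketch below uses the standard receptive-field recurrence for a plain chain of layers: each layer adds (kernel size - 1) times the current spacing between neighbouring cells, and a stride multiplies that spacing. The function name and the example stack are assumptions chosen for illustration.

```python
def receptive_field(layers):
    """Return the receptive field, in input pixels, after each layer.

    layers: list of (kernel_size, stride) pairs, first layer first.
    """
    rf = 1    # a single input pixel "sees" only itself
    jump = 1  # distance, in input pixels, between neighbouring cells
    sizes = []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
        sizes.append(rf)
    return sizes


# Three 3x3 convolutions, no pooling: reach grows by 2 pixels per layer.
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # prints: [3, 5, 7]

# Same three convolutions with 2x2, stride-2 pooling in between.
stack = [(3, 1), (2, 2), (3, 1), (2, 2), (3, 1)]
print(receptive_field(stack))  # prints: [3, 4, 8, 10, 18]
```

With pooling interleaved, the last 3x3 filter depends on an 18-pixel-wide patch of the image instead of 7, even though no filter got bigger.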
Pooling zooms out. Take a small region of a feature map (say 2×2) and keep only its largest value (max-pooling); repeat across the map. This shrinks the map (2×2 regions that do not overlap halve its width and height) and tolerates small shifts, trading exact position for general arrangement. Pooling has no learned weights; it is fixed plumbing.
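A small sketch of max-pooling, assuming non-overlapping 2x2 regions (a common choice, though not the only one). The function name and the two toy feature maps are made up for illustration. In the second map, each strong response has been nudged by one cell without leaving its 2x2 region.

```python
def max_pool_2x2(fmap):
    """Max-pool a 2D grid over non-overlapping 2x2 regions.

    A trailing odd row or column, if any, is dropped.
    """
    rows, cols = len(fmap), len(fmap[0])
    return [
        [
            max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


# Two strong responses, one in the top-left region and one in the bottom-right.
original = [
    [9, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 5],
    [0, 0, 0, 0],
]

# The same responses, each moved by one cell inside its own 2x2 region.
shifted = [
    [0, 0, 0, 0],
    [0, 9, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 5, 0],
]

print(max_pool_2x2(original))  # prints: [[9, 0], [0, 5]]
print(max_pool_2x2(shifted))   # prints: [[9, 0], [0, 5]]
```

Both maps pool to the same 2x2 grid: the exact cell is forgotten, but "strong response top-left, weaker one bottom-right" survives. A shift that crosses a region boundary would change the output, so the tolerance is only for small shifts.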
A classifier reads off the answer. After the convolution-and-pooling stack, a fully-connected layer (from the neural-network track) maps the final high-level features to class scores. The conv stack sees; the classifier names. A CNN is a feature-building hierarchy with a classifier on top.
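The classifier step can be sketched in a few lines: flatten the final feature maps into one list, take a weighted sum per class, and report the class with the highest score. The function names, the two-feature "upper loop / lower loop" input, and the hand-set weights are all hypothetical; in a real network the weights are learned and there are far more features.

```python
def flatten(feature_maps):
    """Unroll a stack of 2D feature maps into one flat list of numbers."""
    return [v for fmap in feature_maps for row in fmap for v in row]


def fully_connected(features, weights, biases):
    """One score per class: weighted sum of all features plus a bias."""
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]


def classify(feature_maps, weights, biases, labels):
    """Return the best-scoring label together with all class scores."""
    scores = fully_connected(flatten(feature_maps), weights, biases)
    best = max(range(len(scores)), key=scores.__getitem__)
    return labels[best], scores


# Hypothetical final features: [loop in upper half, loop in lower half].
LABELS = ["1", "8"]
WEIGHTS = [
    [-1.0, -1.0],  # "1": loops count against it
    [1.0, 1.0],    # "8": loops count in its favour
]
BIASES = [0.5, -0.5]

two_loops = [[[1.0], [1.0]]]  # one 2x1 feature map, both loops detected
no_loops = [[[0.0], [0.0]]]

print(classify(two_loops, WEIGHTS, BIASES, LABELS))  # prints: ('8', [-1.5, 1.5])
print(classify(no_loops, WEIGHTS, BIASES, LABELS))   # prints: ('1', [0.5, -0.5])
```

Note the division of labour: nothing in this layer looks at pixels. It only weighs the evidence the convolution-and-pooling stack has already assembled.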
An honest caveat. The clean “edges, then parts, then objects” story is well supported (researchers can visualize what filters respond to), but individual learned filters are messier than tidy “eye detector” labels. The hierarchy is real; the labels are a simplification.

What changes for you

This architecture (stacked convolutions, pooling, and a classifier) is behind much of machine vision: image classification, object detection (boxes around faces or pedestrians), segmentation (which pixels belong to which object, used in medical imaging and photo tools), and recognition broadly. Different jobs bolt different heads onto the same hierarchy. Knowing that it is a layered build-up, not a machine “seeing” the way you do, explains both its strengths and its odd failures. It recognizes a pile of learned features, which is why an unusual angle or a scrambled texture can fool it (a thread the limitations lesson picks up). The next lesson turns the arrow around, from networks that recognize to networks that generate.

One filter finds an edge; a tower of them, each reading the layer below, climbs from edges to objects. Seeing, for a machine, is depth turning local patterns into meaning.