From edges to objects

Last lesson gave us a single convolution: a small filter that slides across an image and lights up where its pattern appears. That finds an edge. But recognizing a cat, or a handwritten digit, or a tumor on a scan, takes more than finding edges. A cat is edges arranged into fur and ears and eyes, arranged into a face, arranged into an animal. The question this lesson answers is how a network climbs from “there is an edge here” all the way up to “this is a cat.” The answer is the same move that named this whole field: depth.

Stacking convolutions into a hierarchy

Recall a useful fact from last lesson: sliding a filter produces a feature map, and a feature map is just another grid of numbers. That means a second convolutional layer can run its own filters over the first layer’s feature maps, exactly as if those maps were the image. This is the key to the whole thing.
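To make "a feature map is just another grid of numbers" concrete, here is a minimal NumPy sketch. It rests on assumptions not in the lesson: the helper name `convolve2d`, the toy 6-by-6 image, and both sets of filter values are made up for illustration. Real networks learn their filter values, use many filters per layer, and usually apply a nonlinearity between layers, all of which this leaves out.

```python
import numpy as np

def convolve2d(grid, kernel):
    """Slide `kernel` over `grid` (no padding, stride 1) and return the feature map."""
    kh, kw = kernel.shape
    out_h = grid.shape[0] - kh + 1
    out_w = grid.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Multiply the patch by the filter cell by cell, then add up.
            out[i, j] = np.sum(grid[i:i + kh, j:j + kw] * kernel)
    return out

# Toy image (illustrative): dark on the left half, bright on the right half.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Layer 1: a hand-written vertical-edge filter (hypothetical values, not learned).
edge_filter = np.array([[-1.0, 0.0, 1.0],
                        [-1.0, 0.0, 1.0],
                        [-1.0, 0.0, 1.0]])
layer1_map = convolve2d(image, edge_filter)          # shape (4, 4)

# Layer 2: a second filter, run on layer 1's output as if it were the image.
layer2_filter = np.ones((2, 2))                      # also hypothetical
layer2_map = convolve2d(layer1_map, layer2_filter)   # shape (3, 3)

print(layer1_map[0])  # strongest where the dark-to-bright edge sits
print(layer2_map[0])
```

Nothing in `convolve2d` knows whether its input is a picture or an earlier layer's output. That indifference is what allows layers to be stacked.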

Now watch what each layer ends up detecting as you stack them:

The first layer runs filters on the raw pixels and finds the simplest patterns: edges and patches of color, oriented this way or that.
The second layer runs filters on the first layer’s edge-maps. A filter here is not looking at pixels; it is looking at combinations of edges. Two edges meeting makes a corner; a curved arrangement makes an arc. So the second layer finds corners, curves, and simple textures.
The third layer combines those into recognizable parts: an eye is a particular arrangement of curves, a wheel is a circle of a certain texture, a closed loop is the top half of a handwritten 8.
Deeper layers combine parts into whole objects: two eyes, a nose, and fur in the right layout score high for “face.”

That is a hierarchy of features, simple at the bottom, complex at the top, each layer building on the one below. It is exactly the principle from the opening lesson of this track, that depth lets a network compose simple transformations into intricate ones, now made visual. No single layer understands “cat.” The understanding is built up, edge by edge, part by part, through depth.

Trace one concrete climb, the handwritten 8 from the very first track. The first layer finds the short curved edges that make up its outline. The second assembles those edges into two arc shapes. The third recognizes that two arcs stacked and closed form a pair of loops. The classifier at the top sees “two loops, one above the other” and reports: 8. Each step did something modest; the recognition is the whole tower working together.

There is a quiet reason the deeper layers can see bigger things even though their filters stay small. A layer-two filter looks at a small patch of layer-one’s maps, but each of those layer-one cells already summarized a small patch of the original image, so the layer-two filter indirectly depends on a larger region of the actual picture. Stack more layers and each filter, still tiny, effectively reaches across more and more of the image. That growing reach is what lets small filters end up recognizing whole objects.
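The growing reach can be counted. The sketch below assumes the simplest case: every layer uses the same square filter, moves one cell at a time, and no pooling sits in between. The function name `receptive_field` is chosen here for illustration. Under those assumptions, each extra layer widens the patch of the original image that one cell depends on by the filter size minus one.

```python
def receptive_field(num_layers, filter_size=3):
    """Width, in input pixels, that one cell depends on after stacking
    `num_layers` stride-1 convolutions with square filters (no pooling)."""
    field = 1  # before any layer, a cell is a single pixel
    for _ in range(num_layers):
        field += filter_size - 1
    return field

for depth in (1, 2, 3, 5):
    size = receptive_field(depth)
    print(f"{depth} layer(s) of 3x3 filters -> {size}x{size} patch of the image")
```

With 3-by-3 filters the reach goes 3, 5, 7, and so on, even though no filter ever gets bigger. Shrinking steps between layers, such as pooling, make it grow faster than this simple count.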

A note of honesty: this clean “edges, then parts, then objects” story is well supported (researchers can visualize what real filters respond to, and the broad pattern of increasing abstraction with depth genuinely holds), but the individual filters a network actually learns are often messier and less nameable than “eye detector.” Hold the hierarchy as the real and useful picture, while knowing the tidy labels are a simplification of something fuzzier underneath.

Zooming out: pooling

There is one more piece that makes the hierarchy practical. As the network builds toward whole objects, it should stop caring about the exact pixel position of every little edge and start caring about the general arrangement. The step that does this is pooling, and the common version is dead simple: take a small region of a feature map, say a 2-by-2 square, and replace it with just its largest value.

region:  1  0          max-pool  →  3
         3  2

That single number, 3, keeps the strongest response in that neighborhood and discards where within the four cells it occurred. Do this across the whole feature map and it shrinks (fewer numbers to handle deeper in the network) and becomes a little more tolerant of small shifts (a feature that moves by one pixel often still lands in the same pooled cell, unless it crosses into the neighboring region). Pooling is how the network gradually trades precise location for bigger-picture meaning as it goes up the stack.
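Here is one way to sketch 2-by-2 max-pooling in NumPy. The helper name `max_pool_2x2` and the example grids are assumptions made for illustration. It uses non-overlapping regions, which is a common choice but not the only one.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Replace each non-overlapping 2x2 block with its largest value."""
    h, w = feature_map.shape
    h, w = h - h % 2, w - w % 2  # drop a trailing odd row or column, if any
    # View the map as a grid of 2x2 blocks, then take the max inside each block.
    blocks = feature_map[:h, :w].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

# The region from the diagram above.
region = np.array([[1, 0],
                   [3, 2]])
pooled_region = max_pool_2x2(region)   # a 1x1 map holding 3

# A 4x4 map shrinks to 2x2.
feature_map = np.array([[1, 0, 0, 0],
                        [3, 2, 0, 5],
                        [0, 0, 4, 0],
                        [0, 1, 0, 0]])
pooled = max_pool_2x2(feature_map)     # [[3, 5], [1, 4]]

# Shift tolerance: one strong response, moved by one cell at a time.
a = np.zeros((4, 4)); a[0, 0] = 9
b = np.zeros((4, 4)); b[0, 1] = 9      # moved one cell, same 2x2 block
c = np.zeros((4, 4)); c[0, 2] = 9      # moved again, into the next block

print(np.array_equal(max_pool_2x2(a), max_pool_2x2(b)))  # same pooled output
print(np.array_equal(max_pool_2x2(a), max_pool_2x2(c)))  # output changes
```

The function contains no weights, so there is nothing in it to train. The last two lines show the limit of the tolerance: a shift inside a block is invisible after pooling, and a shift across a block boundary is not.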

Reading off the answer

After several rounds of convolution and pooling, the network has turned a grid of pixels into a compact set of high-level features: signals like “lots of fur-texture present,” “two eye-like parts,” “pointed-ear shape up top.” Something still has to turn those features into an actual answer.

That final step is a familiar face: a fully-connected layer, the very kind from the neural-network track, takes the high-level features and maps them to class scores, one per possible label, exactly like the digit network’s output layer. The difference is what it is reading. The digit network read raw pixels and struggled; this classifier reads rich, already-extracted features and has an easy job. The convolutional stack did the hard work of seeing; the classifier just names what was seen. A convolutional network, then, is a feature-building hierarchy with a classifier on top.
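The read-off step is a single weighted sum per label, sketched below. Everything specific is an assumption made for illustration: the four feature names, the three labels, and the hand-written weights. In a trained network the features are not neatly named and the weights are learned.

```python
import numpy as np

# Hypothetical high-level features handed over by the convolutional stack:
# [fur texture, eye-like parts, pointed-ear shape, wheel shape]
features = np.array([0.9, 0.8, 0.7, 0.0])

labels = ["cat", "dog", "car"]

# One row of weights per label (illustrative values, not learned).
weights = np.array([
    [ 1.0,  1.0,  1.5, -1.0],   # cat: fur, eyes, and especially pointed ears
    [ 1.0,  1.0, -0.5, -1.0],   # dog: fur and eyes, pointed ears count against
    [-1.0, -0.5, -0.5,  2.0],   # car: wheels, and no fur
])
biases = np.zeros(len(labels))

# Fully connected layer: every feature contributes to every class score.
scores = weights @ features + biases
predicted = labels[int(np.argmax(scores))]

for label, score in zip(labels, scores):
    print(f"{label}: {score:.2f}")
print("prediction:", predicted)
```

The arithmetic is trivial because the inputs already mean something. Fed raw pixels, the same layer would have far more work to do.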

What convolutional networks are used for

This architecture, stacked convolutions plus pooling plus a classifier, is behind most of the machine vision you encounter:

Image classification: what is in this picture? (the digit and cat examples)
Object detection: what is in this picture, and where? (the boxes a camera draws around faces, or a car’s perception system around pedestrians)
Segmentation: which exact pixels belong to which object? (used in medical imaging to outline a region, and in photo tools to cut out a subject)
Recognition tasks broadly: faces, handwriting, plant species from a phone photo, defects on a production line.

Different jobs bolt different heads onto the same convolutional idea, but the feature-building hierarchy underneath is the shared engine.

Why this matters when you use AI

When a photo app instantly groups every picture of your dog, or a camera boxes faces in real time, you are watching a convolutional hierarchy at work: edges into parts into “dog,” computed in a flash. Knowing it is a layered build-up, rather than the machine “seeing” the way you do, explains both its strengths and its odd failures. It is genuinely good at recognizing patterns it was trained on, and it can be strangely brittle when an image breaks the local patterns it relies on: an unusual angle, an adversarially tweaked texture, or an object in a context it never saw. The system recognizes a pile of learned features, not a thing in the world. That distinction, which the limitations lesson in Phase 3 develops, is worth carrying every time a vision system seems uncannily smart or bafflingly wrong.

Common pitfalls

Thinking deeper layers use bigger filters. The filters stay small. What grows is what they are looking at: later filters run on earlier feature maps, so each one effectively summarizes a larger region of the original image even though the filter itself is still tiny.

Thinking pooling is where the learning happens. Pooling has no learned weights; it just shrinks and summarizes. The learning is in the convolutional filters and the final classifier.

Taking the “eye detector, ear detector” labels literally. The increasing abstraction with depth is real, but individual learned filters are usually messier than the tidy names suggest. The hierarchy is true; the clean labels are a helpful simplification.

Thinking the classifier is the clever part. The convolutional stack does the hard work of turning pixels into meaningful features. The final fully-connected classifier has an easy job precisely because the features handed to it are already rich.

What you should remember

Convolutions stack into a hierarchy: edges (early layers) combine into corners and textures, then parts, then whole objects (deeper layers), because each layer runs filters on the previous layer’s feature maps. This is depth composing simple patterns into complex ones.
Pooling zooms out: it shrinks feature maps (commonly by keeping the max in each small region) and trades exact position for general arrangement, with no learned weights of its own.
A classifier reads off the answer: a fully-connected layer (from the neural-network track) maps the final high-level features to class scores. The conv stack sees; the classifier names.
CNNs power most machine vision: classification, detection, segmentation, and recognition tasks all bolt different heads onto the same feature-building hierarchy.

One filter finds an edge; a tower of them, each reading the layer below, climbs from edges to objects. Seeing, for a machine, is depth turning local patterns into meaning.

Next: we leave recognition behind for something stranger and more creative. So far every network has classified: it has taken something in and labeled it. The next lesson asks whether a network can run the other way and generate, producing a new image or sentence rather than judging an existing one. That is the world of generative models.