From edges to objects, in brief

What you’ll learn

This lesson closes Track 12’s vision pair by answering the question the previous lesson set up: one filter finds an edge, but how does a network climb from edges all the way to “this is a cat”? The answer is depth: stacking convolutions into a hierarchy. The source curriculum is MIT 6.S191, Lecture 3, by Alexander Amini and Ava Amini, freely available at introtodeeplearning.com.

You will see how each layer runs filters on the previous layer’s feature maps, so that edges combine into corners and textures, then parts, then whole objects. You will understand the receptive field: why small filters reach across larger regions of the image as depth grows. You will meet pooling, the zoom-out step, which has no learned weights of its own. Finally, you will see how a fully-connected classifier reads the final features into an answer, and learn what convolutional networks are used for in practice.
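
The receptive-field idea above can be previewed with a few lines of arithmetic. The sketch below is illustrative only: the function name `receptive_field` and the layer sizes (3×3 convolutions, a 2×2 pooling step) are assumptions chosen for the example, not taken from the lecture. It applies the standard rule that each layer widens the region seen by one output value by (kernel size − 1) times the spacing between neighboring outputs, measured in input pixels.

```python
def receptive_field(layers):
    """Return the receptive-field width (in input pixels) after each layer.

    `layers` is a list of (kernel_size, stride) pairs, applied in order.
    Hypothetical helper for illustration; the sizes used below are assumptions.
    """
    rf = 1    # one input pixel sees only itself
    jump = 1  # distance, in input pixels, between neighboring outputs
    sizes = []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the view
        jump *= stride             # striding spreads outputs further apart
        sizes.append(rf)
    return sizes


# Three stacked 3x3 convolutions with stride 1: the view grows 3 -> 5 -> 7.
three_convs = receptive_field([(3, 1), (3, 1), (3, 1)])
print(three_convs)  # [3, 5, 7]

# A 2x2 pooling step (stride 2) between two 3x3 convolutions: 3 -> 4 -> 8.
with_pool = receptive_field([(3, 1), (2, 2), (3, 1)])
print(with_pool)  # [3, 4, 8]
```

Under these assumed sizes, the filters stay 3×3 throughout, yet a value in the third layer already depends on a 7- or 8-pixel-wide patch of the input. That is the sense in which small filters reach across larger regions as depth grows.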

Where this fits

This is lesson 5 of 10, closing Phase 2’s vision pair. It builds directly on the previous lesson’s single convolution, so that lesson is the prerequisite. The next lesson turns the arrow around from recognition to generation, opening the generative half of the phase.

Before you start

Prerequisites: lesson 4 of this track (the single convolution, feature maps, weight-sharing), which this lesson stacks into a hierarchy. The neural-network basics from the previous track are assumed, especially the fully-connected layer, which reappears here as the classifier on top.

About the math

Light and concrete. The only arithmetic is a max-pooling exercise (take the largest value in each small region), plus reasoning about the convolution hierarchy. No calculus or formulas; the practice section has you pool a small feature map by hand and trace a recognition hierarchy.
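
To check the by-hand exercise afterwards, max pooling can be written in a few lines. This is a minimal sketch, assuming a 2×2 window with stride 2 and a feature map whose height and width are even; the function name `max_pool_2x2` and the numbers in `feature_map` are made up for illustration and are not the values used in the practice section.

```python
def max_pool_2x2(feature_map):
    """Max-pool a 2D list of numbers with a 2x2 window and stride 2.

    Assumes the height and width are both even. There are no weights here:
    the output is just the largest value in each 2x2 block.
    """
    pooled = []
    for r in range(0, len(feature_map), 2):
        row = []
        for c in range(0, len(feature_map[0]), 2):
            window = [
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


# Illustrative 4x4 feature map (values are invented for this example).
feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 2],
    [2, 2, 3, 6],
]

pooled = max_pool_2x2(feature_map)
print(pooled)  # [[4, 2], [2, 6]]
```

The 4×4 map shrinks to 2×2, and each output keeps only the strongest response in its block, which is why a feature that moves by a pixel inside a block leaves the pooled value unchanged.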

By the end, you’ll be able to

Explain how stacking convolutions builds a hierarchy from edges to parts to whole objects
Explain the receptive field (why small filters reach across larger regions as depth grows)
Describe what pooling does (shrinks feature maps and tolerates small shifts) and explain that it has no learned weights
Describe how a fully-connected classifier turns the final features into an answer, and name what convolutional networks (CNNs) are used for

Time and difficulty

Read time: about 9 minutes
Practice time: about 10 minutes (a by-hand max-pooling exercise and a hierarchy trace, plus flashcards)
Difficulty: standard (one small by-hand calculation; otherwise conceptual)