How machines see local patterns: convolution
What you’ll learn
This is the Phase 2 opener (How machines see) and the first lesson of Track 16 where the architecture has visual knowledge built in. The one capability it builds: you will be able to compute a conv layer’s output and parameter count by hand, and explain exactly why a CNN is the right architecture for images where a fully-connected network is not. The source curriculum is Stanford CS231n, cs231n.stanford.edu; this lesson maps to Lecture 5 and is grounded in the convolutional-networks course notes.
The lesson names the two structural problems with using fully-connected layers for image input: parameter explosion (150,528 weights per FC neuron on a 224 by 224 by 3 image) and the absence of any spatial prior. It then introduces the convolution operation, walks one filter (a vertical-edge detector) by hand over a 5 by 5 image, names the three hyperparameters (depth K, stride S, padding P), and states the exact output spatial-size formula (W - F + 2P) / S + 1. Finally, it explains weight sharing as the vision-appropriate prior and counts the parameter savings: a conv layer’s parameter count is decoupled from input image size, so AlexNet’s first conv layer has 34,944 parameters for any input dimensions.
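The two parameter counts quoted above can be checked with a few lines of Python. This is a minimal sketch, not the lesson's own code: the function names are hypothetical, and the AlexNet first-layer configuration used here (96 filters of size 11 by 11 over 3 input channels, one bias per filter) is the commonly cited one rather than something this page spells out.

```python
def fc_weights_per_neuron(height, width, channels):
    """Weights one fully-connected neuron needs to see every input value."""
    return height * width * channels


def conv_layer_params(filter_size, in_channels, num_filters):
    """Weights plus biases of a conv layer; note the input image size never appears."""
    weights = filter_size * filter_size * in_channels * num_filters
    biases = num_filters  # one bias per filter
    return weights + biases


# One FC neuron on a 224 by 224 RGB image.
print(fc_weights_per_neuron(224, 224, 3))  # 150528

# Assumed AlexNet first layer: 96 filters, each 11 by 11 by 3.
print(conv_layer_params(11, 3, 96))  # 34944
```

Because `conv_layer_params` takes no image height or width, the same 34,944 figure holds whatever the input dimensions are, which is the decoupling the lesson refers to.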
Where this fits
This is lesson 5 of 16, the first lesson of Phase 2. It depends on lesson 4 (the neural network and the training loop, both of which carry over unchanged on top of conv layers). The next lesson, The architectures that cracked vision: AlexNet to ResNet, stacks the conv layer of this lesson into deep networks and folds in a section on what it takes to train them at scale. Phase 2 closes with sequence-tools for vision, detection and segmentation, and video understanding.
Before you start
Prerequisites: lesson 4 of this track (Neural networks and backprop). You need the lesson 4 training loop in your head; this lesson swaps in a different operation at one layer and leaves the rest of the loop unchanged. Track 11 lessons 1 and 2 (the digit-as-grid-of-numbers picture) and Track 12 lessons 4 and 5 (survey-level “convolution” and “edges to objects”) are useful soft background; this lesson is the more formal treatment.
About the math
Light, with one small worked filter. The body computes a 3 by 3 output feature map from a 5 by 5 grayscale input convolved with a 3 by 3 vertical-edge filter (output: nine integers, all from elementary multiply-and-add). Practice repeats it with a horizontal-edge filter, then runs four output-size-formula problems and a parameter-count comparison. No calculus; multiplication, addition, and integer arithmetic only.
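To preview what that multiply-and-add looks like, here is a minimal sketch of a stride-1, no-padding convolution in plain Python. The image and filter values below are hypothetical stand-ins chosen for illustration, not the ones the lesson body uses, and the function name is an assumption. As in CNN layers, the filter is slid over the image without being flipped.

```python
def convolve2d_valid(image, kernel):
    """Slide a square kernel over a square image (stride 1, no padding).

    Each output value is the sum of elementwise products between the
    kernel and the image patch beneath it.
    """
    n = len(image)
    f = len(kernel)
    out_size = n - f + 1
    output = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            total = 0
            for a in range(f):
                for b in range(f):
                    total += image[i + a][j + b] * kernel[a][b]
            row.append(total)
        output.append(row)
    return output


# Hypothetical 5 by 5 grayscale image: bright on the left, dark on the right.
image = [[10, 10, 10, 0, 0] for _ in range(5)]

# Hypothetical 3 by 3 vertical-edge filter: left column minus right column.
vertical_edge = [[1, 0, -1],
                 [1, 0, -1],
                 [1, 0, -1]]

feature_map = convolve2d_valid(image, vertical_edge)
for row in feature_map:
    print(row)
# [0, 30, 30]
# [0, 30, 30]
# [0, 30, 30]
```

The output is 0 where the patch is uniformly bright and 30 wherever the patch straddles the bright-to-dark boundary, so the filter responds only to the vertical edge.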
By the end, you’ll be able to
- Name two structural problems with FC layers for images and give a numerical example for each
- Describe convolution in one sentence and apply it to a small worked image by hand
- Compute output spatial size with (W - F + 2P) / S + 1 and recognize invalid configurations
- Explain weight sharing as the vision prior, with its two consequences
- Compute and compare conv vs FC parameter counts on the same input
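The output-size objective can be sketched as a small helper. This assumes W is the input width (or height), F the filter size, S the stride, and P the padding, and it treats a configuration as invalid when the filter does not tile the padded input evenly. The function name is hypothetical, and some frameworks round down in that case rather than rejecting the configuration.

```python
def conv_output_size(W, F, S=1, P=0):
    """Return (W - F + 2P) / S + 1, or raise ValueError if it is not a whole number."""
    span = W - F + 2 * P
    if span < 0:
        raise ValueError("filter is larger than the padded input")
    if span % S != 0:
        raise ValueError("stride does not divide (W - F + 2P) evenly")
    return span // S + 1


print(conv_output_size(5, 3))                # 3   (5 by 5 input, 3 by 3 filter)
print(conv_output_size(32, 5, S=1, P=2))     # 32  (padding preserves the size)
print(conv_output_size(227, 11, S=4, P=0))   # 55

try:
    conv_output_size(10, 3, S=2, P=0)        # (10 - 3) / 2 is not an integer
except ValueError as err:
    print("invalid:", err)
```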
Time and difficulty
- Read time: about 14 minutes
- Practice time: about 18 minutes (a fresh horizontal-edge convolution by hand, four output-size problems, a parameter-count comparison at CIFAR-10 and ImageNet scale, plus flashcards)
- Difficulty: standard (the math is multiply-and-add and one integer formula; the conceptual lift is seeing why weight sharing changes the shape of the parameter cost)