Convolution and CNNs: brief

What you’ll learn

This is the Phase 2 opener (How machines see) and the first lesson of Track 16 where the architecture has visual knowledge built in. The one capability it builds: you will be able to compute a conv layer’s output and parameter count by hand, and explain exactly why a CNN is the right architecture for images where a fully-connected network is not. The source curriculum is Stanford CS231n (cs231n.stanford.edu); this lesson maps to Lecture 5 and is grounded in the convolutional-networks course notes.

The lesson names the two structural problems with using fully-connected layers for image input: parameter explosion (150,528 weights per FC neuron on a 224 by 224 by 3 image) and the absence of any spatial prior. It then introduces the convolution operation, walks one filter (a vertical-edge detector) by hand on a 5 by 5 image, names the three hyperparameters (depth K, stride S, padding P), and states the exact output spatial-size formula (W - F + 2P) / S + 1, where W is the input width and F is the filter width. It closes by explaining weight sharing as the vision-appropriate prior and counting the parameter savings: a conv layer’s parameter count is decoupled from the input image’s height and width, so AlexNet’s first conv layer has 34,944 parameters at any input resolution.
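The two figures quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are made up for this example, and the AlexNet first-layer configuration (96 filters of size 11 by 11 over 3 input channels, one bias per filter) is taken from the CS231n course notes rather than stated in this brief.

```python
# Minimal sketch: reproduce the two parameter counts quoted in the brief.
# Function names are hypothetical; the AlexNet conv1 shape (96 filters,
# 11 x 11 x 3) is an assumption drawn from the CS231n course notes.

def fc_weights_per_neuron(height, width, channels):
    """One fully-connected neuron needs one weight per input value."""
    return height * width * channels


def conv_layer_params(num_filters, filter_size, in_channels):
    """Each filter has filter_size * filter_size * in_channels weights plus one bias."""
    weights = num_filters * filter_size * filter_size * in_channels
    biases = num_filters
    return weights + biases


fc_224 = fc_weights_per_neuron(224, 224, 3)
alexnet_conv1 = conv_layer_params(num_filters=96, filter_size=11, in_channels=3)

print(fc_224)         # 150528 weights for a single FC neuron
print(alexnet_conv1)  # 34944 parameters for the whole conv layer
```

Note that `conv_layer_params` takes no height or width argument at all, which is the decoupling the lesson describes: only the filter size, the filter count, and the number of input channels matter.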

Where this fits

This is lesson 5 of 16, the first lesson of Phase 2. It depends on lesson 4 (the neural network and the training loop, both of which carry over unchanged when the layers are convolutional). The next lesson, The architectures that cracked vision: AlexNet to ResNet, stacks this lesson’s conv layer into deep networks and folds in a section on what it takes to train them at scale. Phase 2 closes with sequence-tools for vision, detection and segmentation, and video understanding.

Before you start

Prerequisites: lesson 4 of this track (Neural networks and backprop). You need the L4 training loop in your head; this lesson swaps in a different operation at one layer and leaves the rest of the loop unchanged. Track 11 lessons 1 and 2 (the digit-as-grid-of-numbers picture) and Track 12 lessons 4 and 5 (survey-level “convolution” and “edges to objects”) are useful soft background; this lesson is the more formal treatment.

About the math

Light, with one small worked filter. The body computes a 3 by 3 output feature map from a 5 by 5 grayscale input convolved with a 3 by 3 vertical-edge filter (output: nine integers, all from elementary multiply-and-add). Practice repeats it with a horizontal-edge filter, then runs four output-size-formula problems and a parameter-count comparison. No calculus; multiplication, addition, and integer arithmetic only.
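To give a feel for the multiply-and-add involved, here is a minimal sketch in plain Python. The pixel values and filter weights are assumptions chosen for illustration (a bright left region meeting a dark right region, and a common vertical-edge pattern); they are not necessarily the ones used in the lesson body. As in most CNN libraries, the filter is slid over the image without being flipped.

```python
# Minimal sketch: slide a 3 by 3 filter over a 5 by 5 grayscale image
# with stride 1 and no padding. Image and filter values are illustrative
# assumptions, not the lesson's own example.

def conv2d_valid(image, kernel):
    """Return the feature map for a square image and a square kernel."""
    f = len(kernel)
    out_size = len(image) - f + 1  # (W - F + 2P) / S + 1 with P = 0, S = 1
    output = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            total = 0
            # Multiply each filter weight by the pixel under it, then add.
            for di in range(f):
                for dj in range(f):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


# Bright (10) on the left three columns, dark (0) on the right two.
image = [[10, 10, 10, 0, 0] for _ in range(5)]

# Responds when the left side of the window is brighter than the right.
vertical_edge = [[1, 0, -1] for _ in range(3)]

feature_map = conv2d_valid(image, vertical_edge)
for row in feature_map:
    print(row)
# [0, 30, 30]
# [0, 30, 30]
# [0, 30, 30]
```

The first column of the output is 0 because the window there covers only bright pixels; the other two columns are 30 because the window straddles the bright-to-dark boundary.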

By the end, you’ll be able to

Name two structural problems with FC layers for images and give a numerical example for each
Describe convolution in one sentence and apply it to a small worked image by hand
Compute output spatial size with (W - F + 2P) / S + 1 and recognize invalid configurations
Explain weight sharing as the vision prior, with its two consequences
Compute and compare conv vs FC parameter counts on the same input
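
The output-size objective can be sketched as a small helper. This is an illustration under stated assumptions: the function name and the choice to raise an error are made up for this example, and "invalid" is taken to mean that (W - F + 2P) is not evenly divisible by S, so the filter cannot tile the padded input.

```python
# Minimal sketch of the output-size formula (W - F + 2P) / S + 1.
# conv_output_size is a hypothetical helper, not part of any library.

def conv_output_size(w, f, p, s):
    """Return the output width, or raise ValueError for an invalid configuration."""
    if s <= 0:
        raise ValueError("stride must be a positive integer")
    span = w - f + 2 * p
    if span < 0:
        raise ValueError("filter is larger than the padded input")
    if span % s != 0:
        raise ValueError("filter does not fit evenly: (W - F + 2P) is not divisible by S")
    return span // s + 1


print(conv_output_size(5, 3, 0, 1))   # 3: the 5 by 5 image with a 3 by 3 filter
print(conv_output_size(32, 5, 2, 1))  # 32: padding of 2 preserves the input width

try:
    conv_output_size(10, 3, 0, 2)     # (10 - 3 + 0) / 2 is not an integer
except ValueError as err:
    print("invalid:", err)
```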

Time and difficulty

Read time: about 14 minutes
Practice time: about 18 minutes (a fresh horizontal-edge convolution by hand, four output-size problems, a parameter-count comparison at CIFAR-10 and ImageNet scale, plus flashcards)
Difficulty: standard (the math is multiply-and-add and one integer formula; the conceptual lift is seeing why weight sharing changes the shape of the parameter cost)