Summary: Convolution and CNNs

Phase 1 left us with a working general-purpose classifier; this lesson opens Phase 2 by replacing one layer with something that actually knows it is looking at an image. A fully-connected layer fails for images on two counts (parameter explosion, no spatial prior). The replacement is the convolution: a small learned filter slides across the input, computing a dot product with each local patch, producing a feature map of where its pattern showed up. Many filters per layer give a stack of feature maps; the same filter weights are used at every spatial position (weight sharing), which both slashes the parameter count and gives translation equivariance for free. The training loop on top is unchanged.
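To make the sliding-filter picture concrete, here is a minimal pure-Python sketch, not a framework implementation. It assumes a single-channel image, stride 1, and no padding. The helper names (slide_filter, place_pattern, argmax_2d) and the hand-written 2x2 pattern are illustrative only; a trained network would learn its filters. The sketch pastes the same pattern at two positions and checks that the peak of the feature map moves by the same offset, which is what translation equivariance means in practice.

```python
def slide_filter(image, filt):
    """Slide a square 2D filter over a 2D image (stride 1, no padding).

    Each output entry is the dot product of the filter with one local patch.
    """
    f = len(filt)
    h_out = len(image) - f + 1
    w_out = len(image[0]) - f + 1
    return [
        [
            sum(filt[di][dj] * image[i + di][j + dj]
                for di in range(f) for dj in range(f))
            for j in range(w_out)
        ]
        for i in range(h_out)
    ]


def place_pattern(size, pattern, top, left):
    """Return a size x size image of zeros with `pattern` pasted at (top, left)."""
    image = [[0] * size for _ in range(size)]
    for di, row in enumerate(pattern):
        for dj, value in enumerate(row):
            image[top + di][left + dj] = value
    return image


def argmax_2d(grid):
    """Position (row, col) of the largest entry in a 2D grid."""
    return max(
        ((i, j) for i in range(len(grid)) for j in range(len(grid[0]))),
        key=lambda pos: grid[pos[0]][pos[1]],
    )


# Hand-written 2x2 "diagonal" pattern, used both as the image content and as
# the filter (an assumption for illustration; real filters are learned).
pattern = [[1, 0],
           [0, 1]]

original = place_pattern(6, pattern, top=1, left=1)
shifted = place_pattern(6, pattern, top=3, left=2)  # same pattern, moved by (+2, +1)

peak_original = argmax_2d(slide_filter(original, pattern))
peak_shifted = argmax_2d(slide_filter(shifted, pattern))
print(peak_original, peak_shifted)  # (1, 1) (3, 2)
```

The same four filter weights are reused at every position, so moving the pattern in the input moves the response in the feature map by the same amount; nothing had to be relearned for the new location.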

Core ideas

FC layers are wrong for images. Single FC neuron on 224x224x3 input = 150,528 weights; a hidden layer of just 100 such neurons is already about 15 million. Plus FC treats every pixel as an independent input with no notion of which pixels are neighbors, so it must learn a separate detector for each position from scratch, ignoring the translational structure of images.
Convolution = small filter sliding spatially over the input. Filter is small spatially (typically 3x3, 5x5, or 7x7) but spans the full depth of the input. At each position: dot product of filter with the local patch -> one number. Grid of those numbers is the filter’s feature (activation) map.
Many filters per layer -> a stack of feature maps. With K filters, the layer outputs a 3D volume of depth K. Each slice says where its filter pattern occurred in the input.
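Output spatial-size formula, applied to width and height separately: (W - F + 2P) / S + 1. W = input size, F = filter size, S = stride, P = padding. The result must be a whole number; otherwise the filter does not fit evenly across the padded input at that stride.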
Weight sharing baked in. Same filter weights at every spatial position. Vision-appropriate prior (patterns useful at one position are useful at any), and it gives translation equivariance for free.
Parameter count is decoupled from image size. K * (F * F * D_in) weights + K biases per layer. AlexNet’s first layer: 96 * 11 * 11 * 3 + 96 = 34,944 parameters, the same number for any input image dimensions.
The training loop is unchanged. Forward pass uses the conv operation; backprop carries gradients through it; gradient descent steps on filter weights. Lesson 4’s four-step loop runs over CNNs without modification.
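The arithmetic in the list above can be checked with a few lines of Python. This is a sketch with hypothetical helper names; it only does the bookkeeping (output size, output volume, parameter count) and runs no convolution. The 7x7 and 32x32 inputs are made-up examples; the 224x224x3 and AlexNet figures are the ones quoted above.

```python
def conv_output_size(w, f, stride=1, padding=0):
    """Spatial output size (W - F + 2P) / S + 1 along one dimension.

    Raises ValueError when the result is not a whole number, i.e. the filter
    does not fit evenly across the padded input at this stride.
    """
    span = w - f + 2 * padding
    if span < 0 or span % stride != 0:
        raise ValueError(f"W={w}, F={f}, P={padding}, S={stride} does not fit")
    return span // stride + 1


def conv_output_volume(h, w, k, f, stride=1, padding=0):
    """Shape of the stacked feature maps: one map per filter, so depth is K."""
    return (conv_output_size(h, f, stride, padding),
            conv_output_size(w, f, stride, padding),
            k)


def conv_param_count(k, f, d_in):
    """K filters of size F x F x D_in, plus one bias per filter."""
    return k * f * f * d_in + k


def fc_weights_per_neuron(h, w, d):
    """One fully-connected neuron needs one weight per input value."""
    return h * w * d


print(fc_weights_per_neuron(224, 224, 3))            # 150528
print(conv_param_count(96, 11, 3))                   # 34944
print(conv_output_size(7, 3, stride=2))              # 3
print(conv_output_size(7, 3, stride=1, padding=1))   # 7
print(conv_output_volume(32, 32, k=10, f=5))         # (28, 28, 10)
```

Note that conv_param_count takes no image size at all, while conv_output_size(7, 3, stride=3) raises ValueError because (7 - 3) / 3 is not a whole number.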

What changes for you

This is the first lesson where the architecture has visual knowledge built in. Every modern image classifier, detector, segmenter, and (often) generator uses either convolutions or their successor, the vision transformer, which we will meet later. The “deep” in “deep CNN” means many of these conv layers stacked, each one’s features built from the features below it; early layers settle on edges and color blobs, middle layers on textures and parts, deep layers on whole-object templates. None of that hierarchy is specified anywhere in the code; it emerges from training. Knowing this also explains a piece of the parameter-count headlines you see in AI news. A modern vision model might have hundreds of millions of parameters; almost all of them live in the conv (or transformer) blocks, not in a fully-connected classifier head, because conv blocks reuse the same filter weights to detect a pattern anywhere in the image instead of relearning it position-by-position.

A conv layer is the smallest piece of architecture that says “I know this input is an image”; weight sharing is the prior that says “patterns matter anywhere they appear”; and the training loop on top is the same one we already have.