Cheatsheet: Convolution and CNNs

| Problem | Detail |
| --- | --- |
| Parameter explosion | A single FC neuron on a 224x224x3 input has 150,528 weights; 100 hidden units means 15M+ parameters |
| No spatial prior | Treats every pixel pair as unrelated; can’t reuse a “cat detector” across positions |

| Element | Detail |
| --- | --- |
| Filter / kernel | Small spatially (3x3, 5x5, or 7x7), full depth of input (e.g. 3 for RGB) |
| Operation per position | Dot product of the filter with the local input patch |
| Output per filter | A 2D feature (activation) map showing where the pattern occurred |
| Many filters (K) | Output is a 3D volume of depth K, one feature map per filter |
| Non-linearity (next step) | ReLU applied elementwise after the conv (typical convention) |
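The "dot product per position" row can be made concrete with a minimal pure-Python sketch. The function names `conv2d_single_filter` and `relu`, the nested-list image layout, and the toy edge image are assumptions chosen for illustration, not part of any library; a real framework would do the same arithmetic far more efficiently.

```python
def conv2d_single_filter(image, kernel, bias=0.0, stride=1):
    """Slide one F x F x D filter over an H x W x D image (no padding).

    image and kernel are nested lists indexed [row][col][channel].
    Returns a 2D feature map as a list of rows.
    """
    H, W, D = len(image), len(image[0]), len(image[0][0])
    F = len(kernel)
    out_h = (H - F) // stride + 1
    out_w = (W - F) // stride + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Dot product of the filter with the local patch, over full depth.
            total = bias
            for di in range(F):
                for dj in range(F):
                    for c in range(D):
                        total += image[i * stride + di][j * stride + dj][c] * kernel[di][dj][c]
            row.append(total)
        feature_map.append(row)
    return feature_map


def relu(feature_map):
    """Elementwise max(0, x) on a 2D feature map."""
    return [[max(0.0, v) for v in row] for row in feature_map]


# Toy 5x5x1 image: dark on the left (0), bright on the right (1).
image = [[[0], [0], [1], [1], [1]] for _ in range(5)]
# Hypothetical 3x3x1 vertical-edge filter: responds to dark-to-bright transitions.
kernel = [[[-1], [0], [1]] for _ in range(3)]

feature_map = relu(conv2d_single_filter(image, kernel))
print(feature_map)  # [[3.0, 3.0, 0.0], [3.0, 3.0, 0.0], [3.0, 3.0, 0.0]]
```

The map is large where the patch straddles the edge and zero where the patch is uniform. Running K such filters and stacking their maps would give the depth-K output volume.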

| Symbol | Meaning | Effect |
| --- | --- | --- |
| K (depth) | Number of filters | Sets depth of output volume |
| S (stride) | Pixels per slide | Larger S shrinks output spatially |
| P (padding) | Zero rings around border | Tune to control output spatial size |

output_size = (W - F + 2P) / S + 1

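| W | F | S | P | Output | Note |
| --- | --- | --- | --- | --- | --- |
| 32 | 5 | 1 | 2 | 32 | “same” padding |
| 224 | 3 | 1 | 1 | 224 | “same” padding |
| 5 | 3 | 1 | 0 | 3 | body’s edge-detector case |
| 5 | 3 | 2 | 0 | 2 | stride 2 downsampling |
| 7 | 3 | 1 | 0 | 5 | CS231n verbatim |
| 7 | 3 | 2 | 0 | 3 | CS231n verbatim |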

The result must be a whole number; if it is not, the filter does not fit the input evenly and P or S has to change.
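The output-size formula and its whole-number requirement can be checked with a small helper. This is a sketch; the name `conv_output_size` and the choice to raise `ValueError` on a non-fitting combination are assumptions made for illustration.

```python
def conv_output_size(W, F, P=0, S=1):
    """Spatial output size of a conv layer: (W - F + 2P) / S + 1.

    Raises ValueError when the filter does not tile the padded input evenly.
    """
    span = W - F + 2 * P
    if span < 0 or span % S != 0:
        raise ValueError(f"W={W}, F={F}, P={P}, S={S} does not give a whole-number output")
    return span // S + 1


print(conv_output_size(32, 5, P=2, S=1))   # 32
print(conv_output_size(224, 3, P=1, S=1))  # 224
print(conv_output_size(5, 3, P=0, S=2))    # 2
print(conv_output_size(7, 3, P=0, S=2))    # 3
```

With W=7, F=3, P=0, a stride of 3 gives (7 - 3) / 3 + 1, which is not an integer, so the helper raises instead of silently rounding.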

weights = K * F * F * D_in; biases = K

| Setup | Params |
| --- | --- |
| 100 filters of 3x3, RGB input | 100 * (3 * 3 * 3) + 100 = 2,800 |
| 96 filters of 11x11x3 (AlexNet 1st layer) | 96 * (11 * 11 * 3) + 96 = 34,944 |
| FC layer, 100 units, 32x32x3 input (for comparison) | 100 * 3,072 + 100 = 307,300 |
| FC layer, 100 units, 224x224x3 input (for comparison) | 100 * 150,528 + 100 = 15,052,900 |

Conv parameter count does NOT depend on input image size.
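The parameter counts in the table can be reproduced with two one-line helpers. This is a sketch; `conv_params` and `fc_params` are hypothetical names introduced here for illustration.

```python
def conv_params(K, F, D_in):
    """K filters of size F x F x D_in, plus one bias per filter."""
    return K * F * F * D_in + K


def fc_params(units, H, W, D):
    """Fully connected layer on a flattened H x W x D input, plus biases."""
    return units * H * W * D + units


print(conv_params(100, 3, 3))       # 2800
print(conv_params(96, 11, 3))       # 34944
print(fc_params(100, 32, 32, 3))    # 307300
print(fc_params(100, 224, 224, 3))  # 15052900
```

Note that `conv_params` takes no image height or width at all, whereas `fc_params` grows with every extra input pixel.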

| Property | Why |
| --- | --- |
| Same filter weights at every spatial position | Patterns useful at one position are useful at any (translational structure of images) |
| Fewer parameters | One filter per pattern, not one filter per pattern per position |
| Translation equivariance | Shift the input and the output feature maps shift by the same amount, automatically |
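Translation equivariance can be seen in a few lines with a 1D analogue. This is a simplified sketch: the helpers `conv1d` and `shift_right` and the toy signal are assumptions for illustration, and the equality holds exactly here only because the pattern stays away from the borders.

```python
def conv1d(signal, kernel):
    """Slide a 1D filter over a 1D signal (stride 1, no padding)."""
    F = len(kernel)
    return [
        sum(signal[i + k] * kernel[k] for k in range(F))
        for i in range(len(signal) - F + 1)
    ]


def shift_right(values, n):
    """Shift a list right by n positions, filling with zeros on the left."""
    return [0] * n + values[: len(values) - n]


signal = [0, 0, 1, 2, 1, 0, 0, 0]
kernel = [1, 0, -1]

conv_then_shift = shift_right(conv1d(signal, kernel), 2)
shift_then_conv = conv1d(shift_right(signal, 2), kernel)

print(conv_then_shift)  # [0, 0, -1, -2, 0, 2]
print(shift_then_conv)  # [0, 0, -1, -2, 0, 2]
```

Because the same weights are applied at every position, moving the pattern in the input simply moves its response in the output by the same amount.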

| What | Why |
| --- | --- |
| Loss | SVM or softmax / cross-entropy, on the final classifier’s scores |
| Backprop | Chain rule through every layer (including conv layers) |
| Gradient descent step | W ← W - α * ∇L for every filter weight |
| Four-step training loop | Forward, loss, backward, step (lesson 4) |
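The four-step loop applies to filter weights exactly as it does to any other weight. The sketch below learns a 1D filter of width 3 from data. To keep it short it swaps in a mean-squared-error loss on the filter output rather than an SVM or softmax loss on classifier scores, and the names `conv1d` and `train_filter`, the toy signal, and the learning rate are all assumptions for illustration.

```python
def conv1d(signal, kernel):
    """Slide a 1D filter over a 1D signal (stride 1, no padding)."""
    F = len(kernel)
    return [
        sum(signal[i + k] * kernel[k] for k in range(F))
        for i in range(len(signal) - F + 1)
    ]


def train_filter(signal, target, width=3, lr=0.1, steps=200):
    """Learn filter weights with plain gradient descent on squared error."""
    w = [0.0] * width
    n = len(target)
    loss = None
    for _ in range(steps):
        # 1. Forward: apply the current filter.
        y = conv1d(signal, w)
        # 2. Loss: mean squared error against the target feature map.
        err = [yi - ti for yi, ti in zip(y, target)]
        loss = sum(e * e for e in err) / n
        # 3. Backward: dL/dw[k] sums over every position, since w[k] is shared.
        grad = [
            2.0 / n * sum(err[i] * signal[i + k] for i in range(n))
            for k in range(width)
        ]
        # 4. Step: W <- W - alpha * grad.
        w = [wk - lr * gk for wk, gk in zip(w, grad)]
    return w, loss


signal = [0, 0, 1, 2, 1, 0, 0, 0]
true_kernel = [1, 0, -1]               # the filter we pretend not to know
target = conv1d(signal, true_kernel)   # its feature map serves as training data

learned, final_loss = train_filter(signal, target)
print([round(v, 3) for v in learned])
```

The gradient for each weight accumulates contributions from every spatial position, which is the backprop consequence of weight sharing.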

| Pitfall | Reality |
| --- | --- |
| Convolution = Photoshop filter | Math is similar, but CNN filters are LEARNED by backprop, not hand-designed |
| Filter sees the whole image | Each filter sees a small local patch; deeper layers grow the effective receptive field |
| Output size is a choice | Determined by (W - F + 2P) / S + 1; only certain combinations work |
| Forgetting input depth | A 3x3 filter on RGB has 27 weights (3 * 3 * 3), not 9; small spatially but full depth |

A convolution is a small learned filter that computes the same dot product at every position of the image. Weight sharing is what makes it worthwhile: it keeps the parameter count small and gives translation equivariance. The training loop on top is unchanged.