Summary: CNN architectures and training at scale
Four landmark architectures define the modern computer-vision canon, all winning or contending at the ImageNet challenge between 2012 and 2015. AlexNet (2012, ~60M params) was the inflection point that ended hand-engineered features (16 percent top-5 error vs the runner-up’s 26 percent). VGG (2014, ~140M params, 16 weight layers, all 3 by 3 convs) showed that depth plus uniformity helps, but at a steep parameter cost. GoogLeNet / Inception (2014, ~4M params, Inception modules, 1 by 1 dimensionality reduction, no FC tower) showed that smarter beats bigger. ResNet (2015, 152 layers, residual blocks y = F(x) + x) showed how to make depth actually work: identity shortcuts let gradients flow through very deep networks. Skip connections are now everywhere in deep learning. The folded subsection on training at scale makes one point: the gradient descent algorithm is the same, and the engineering around it has expanded (data parallelism, model parallelism, mixed precision, learning-rate warmup).
Core ideas
- AlexNet (2012). ~60M params, 5 conv + 3 FC layers. Key innovations: ReLU activations, dropout, and training split across 2 GPUs for ~6 days. ImageNet top-5 error was 16 percent vs 26 percent for the runner-up; this is the result that ended hand-engineered features and started the deep-learning era of CV.
- VGG (2014). ~140M params, 16 weight layers. Idea: stack many small 3 by 3 convs uniformly throughout. A stack of small filters covers the same receptive field as one large filter (three 3 by 3 layers see a 7 by 7 window), with more non-linearities and fewer conv parameters for that field. Cost: an enormous FC tower at the top, which holds most of the parameters.
- GoogLeNet / Inception (2014). ~4M params (~35x smaller than VGG), with comparable or better accuracy. Inception modules combine multiple filter sizes in parallel; 1 by 1 convs cheaply reduce input depth before the expensive 3 by 3 / 5 by 5 branches; average pooling at the top replaces the FC tower and removes most of its parameter cost. Smarter beats bigger.
- ResNet (2015). 152 layers (roughly an order of magnitude deeper than VGG). Residual block: y = F(x) + x, where F is the residual function that the conv layers inside the block compute and + x is the identity shortcut. This eases the vanishing-gradient and optimization-difficulty problems that had capped depth, because gradients can flow back through the shortcut. ResNet-style skip connections are now in transformers, language models, and everywhere depth matters.
- Patterns settled by the four: depth helps only when you can train it (ResNet’s contribution); parameter count is not capability (GoogLeNet beat VGG with 1/35 the parameters); small structural ideas (ReLU, dropout, 1 by 1 conv, identity shortcut) have large effects.
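The parameter arguments behind VGG and Inception are simple arithmetic, and a rough sketch makes them concrete. The channel width of 256 and the reduced width of 64 below are hypothetical values chosen for illustration, not the layer sizes of either network, and biases are ignored.

```python
def conv_params(k, c_in, c_out):
    """Weight count of a k x k convolution from c_in to c_out channels (no bias)."""
    return k * k * c_in * c_out


def stacked_receptive_field(k, n):
    """Receptive field of n stacked k x k convolutions with stride 1."""
    return n * (k - 1) + 1


C = 256        # hypothetical channel width
REDUCED = 64   # hypothetical width after the 1x1 reduction

# VGG idea: three stacked 3x3 convs see the same window as one 7x7 conv.
rf_stack = stacked_receptive_field(3, 3)
params_stack = 3 * conv_params(3, C, C)    # 27 * C^2
params_single = conv_params(7, C, C)       # 49 * C^2

# Inception idea: shrink the depth with a 1x1 conv before the costly 5x5 conv.
direct_5x5 = conv_params(5, C, C)
reduced_5x5 = conv_params(1, C, REDUCED) + conv_params(5, REDUCED, C)

print(f"receptive field of 3 stacked 3x3 convs: {rf_stack}")
print(f"3 x (3x3): {params_stack:,} weights vs 1 x (7x7): {params_single:,}")
print(f"direct 5x5: {direct_5x5:,} weights vs 1x1 then 5x5: {reduced_5x5:,}")
```

The same counting explains why the FC tower dominates VGG's size: a fully connected layer has one weight per input-output pair, with no sharing across spatial positions.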
Training at scale (CS231n Lec 11, folded)
These architectures need real engineering to actually train. Data parallelism replicates the same model on every GPU, splits the mini-batch, lets each GPU compute its own gradient, and averages the gradients across GPUs (AllReduce); it scales nearly linearly and is the default for vision. Model parallelism splits the model’s layers across GPUs when the model itself is too big for one GPU’s memory; it is more complex and is reserved for very large models. Production training also uses mixed precision (16-bit floats with 32-bit accumulation), gradient accumulation (to simulate larger effective batches when memory is tight), learning-rate warmup, and the linear scaling rule (scale the LR proportionally to batch size). AlexNet trained on 2 GPUs for ~6 days; foundation-scale vision models today run on hundreds to thousands of accelerators for weeks. The algorithm (W ← W - α * ∇L) is unchanged; the engineering expanded around it.
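A toy sketch of the data-parallel step, assuming a one-parameter model y_hat = w * x with mean-squared-error loss and equal-sized shards. The "workers" here are plain Python loop iterations standing in for GPUs, and the linear warmup shape, base LR of 0.1, and base batch of 256 are illustrative assumptions rather than values from the lecture.

```python
def grad_mse(w, batch):
    """Gradient of mean squared error for the 1-D model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_grad(w, batch, num_workers):
    """Split the batch into equal shards, take per-shard gradients, average them."""
    shard = len(batch) // num_workers
    assert shard * num_workers == len(batch), "shards must be equal-sized"
    grads = [grad_mse(w, batch[i * shard:(i + 1) * shard])
             for i in range(num_workers)]
    return sum(grads) / num_workers  # the averaging that AllReduce performs


def scaled_lr_with_warmup(step, base_lr, base_batch, batch_size, warmup_steps):
    """Linear scaling rule for the peak LR, reached by a linear warmup ramp."""
    peak = base_lr * batch_size / base_batch
    if step < warmup_steps:
        return peak * (step + 1) / warmup_steps
    return peak


batch = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.5)]  # made-up (x, y) pairs
w = 0.5

full_grad = grad_mse(w, batch)                    # one device, whole batch
parallel_grad = data_parallel_grad(w, batch, 2)   # two simulated workers

lr = scaled_lr_with_warmup(step=10, base_lr=0.1, base_batch=256,
                           batch_size=1024, warmup_steps=5)
w_next = w - lr * parallel_grad                   # W <- W - alpha * grad(L)

print(full_grad, parallel_grad, lr, w_next)
```

With equal shards the averaged gradient matches the single-device gradient, which is why the update rule itself does not change as devices are added. Gradient accumulation relies on the same identity, with the shards processed one after another on one device.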
What changes for you
When you read about a vision model in 2026, the residual block from ResNet is almost always somewhere in the forward pass, including inside the transformer blocks of vision transformers and large language models. The “X-billion-parameter” headline numbers are mostly the size of these stacked residual-style blocks; the “trained for Y weeks on Z GPUs” headlines are mostly the cost of the data-parallel loop running at scale. None of this is exotic once you can name the moving pieces. Knowing the four landmarks also gives you the vocabulary to read modern architectures: “ConvNeXt,” “EfficientNet variant,” “ViT-this-size” all position themselves relative to ResNet’s baseline and structural idioms.
Stack conv layers with skip connections and you can build a network 152 layers deep that still trains; that is the unlock of Phase 2.
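Why the shortcut helps can be seen in a toy scalar chain. The sketch below assumes each layer's residual function is F(x) = a * tanh(x) with a fixed a = 0.5, a hypothetical stand-in for a conv block, and tracks dy/dx through 152 layers with the chain rule. It illustrates only the gradient path, not a real ResNet.

```python
import math


def chain_gradient(x, depth, a, residual):
    """Push x through `depth` scalar layers and return d(output)/d(input).

    Each layer computes F(x) = a * tanh(x); with residual=True the layer
    outputs F(x) + x instead of F(x) alone.
    """
    grad = 1.0
    for _ in range(depth):
        t = math.tanh(x)
        local = a * (1.0 - t * t)   # F'(x)
        if residual:
            local += 1.0            # derivative of the identity shortcut
            x = a * t + x
        else:
            x = a * t
        grad *= local               # chain rule across layers
    return grad


DEPTH = 152
plain_grad = chain_gradient(0.5, DEPTH, a=0.5, residual=False)
residual_grad = chain_gradient(0.5, DEPTH, a=0.5, residual=True)

print(f"plain chain gradient:    {plain_grad:.3e}")
print(f"residual chain gradient: {residual_grad:.3e}")
```

In the plain chain every factor is at most 0.5, so the product is bounded by 0.5^152 and vanishes. In the residual chain every factor is 1 + F'(x), which is at least 1 here, so the signal reaching the first layer does not shrink.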