
Cheatsheet: CNN architectures and training at scale

The four landmark architectures (2012-2015)

| Year | Architecture | Headline idea | Layers | Parameters | ImageNet top-5 error |
|---|---|---|---|---|---|
| 2012 | AlexNet | First deep CNN to win ImageNet; ReLU + dropout + 2 GPUs | 8 (5 conv + 3 FC) | ~60M | ~16% (vs ~26% for the runner-up) |
| 2014 | VGG-16 | Depth + uniformity (only 3x3 conv + 2x2 pool) | 16 | ~140M | single-digit |
| 2014 | GoogLeNet (Inception) | Inception modules + 1x1 dim reduction + avg-pool head | 22 | ~4M | single-digit |
| 2015 | ResNet | Skip connections (y = F(x) + x) + heavy batch norm | up to 152 | ~25M (ResNet-50), ~60M (ResNet-152) | low single-digit (below the ~5% human-level error estimate) |

| Idea | What it does | Architecture |
|---|---|---|
| ReLU | Faster training; partly addresses vanishing gradients | AlexNet onward |
| Dropout (FC layers) | Regularization; counters overfitting in dense layers | AlexNet onward |
| 3x3 conv stacks | Same receptive field as one larger filter, with fewer params and more non-linearities | VGG |
| Inception module | Parallel filter sizes (1x1, 3x3, 5x5) + 1x1 dim reduction | GoogLeNet |
| Average-pool head | Replaces giant FC tower, kills millions of params | GoogLeNet |
| Residual block | y = F(x) + x; identity shortcut for gradient flow | ResNet |
| Batch normalization | Stabilizes training; “heavy use” in ResNet | ResNet onward |

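
The parameter savings behind the 3x3 stacks, the 1x1 reduction, and the average-pool head can be checked with plain arithmetic. The sketch below is illustrative only: the channel widths (`C = 256`, `R = 64`), the 7x7x1024 final feature map, and the 1000-class output are assumed values chosen for the example, not figures from any of the four networks, and biases are ignored.

```python
def conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weight count of one kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * c_in * c_out


def stacked_receptive_field(kernel: int, layers: int) -> int:
    """Receptive field of `layers` stacked stride-1 convolutions."""
    return 1 + layers * (kernel - 1)


C = 256  # assumed channel width (same in and out), for illustration only
R = 64   # assumed reduced width after the 1x1 convolution

# VGG idea: three 3x3 layers cover a 7x7 window with fewer weights than one 7x7.
three_3x3 = 3 * conv_params(3, C, C)
one_7x7 = conv_params(7, C, C)

# Inception idea: shrink the channels with a 1x1 conv before the costly 5x5.
direct_5x5 = conv_params(5, C, C)
reduced_5x5 = conv_params(1, C, R) + conv_params(5, R, C)

# Average-pool head: classify from a pooled 1024-vector instead of a
# flattened 7x7x1024 map (assumed shapes, 1000 output classes).
fc_on_flattened = 7 * 7 * 1024 * 1000
fc_on_pooled = 1024 * 1000

print("receptive field of 3 stacked 3x3:", stacked_receptive_field(3, 3))
print("3 x (3x3) vs 1 x (7x7):", three_3x3, "vs", one_7x7)
print("5x5 direct vs 1x1 -> 5x5:", direct_5x5, "vs", reduced_5x5)
print("FC on flattened vs pooled:", fc_on_flattened, "vs", fc_on_pooled)
```

The ratios, not the absolute numbers, are the point: they hold for any channel width as long as the stacked and single-layer variants use the same one.
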
| Scenario | Conventional block | Residual block |
|---|---|---|
| Block should be identity | Has to LEARN H(x) = x (hard) | Trivial: F(x) = 0 |
| Block should do useful work | Learn H(x) directly | Learn F(x) = H(x) - x |
| Backward pass | Gradient must pass through every layer | Also flows back through + x shortcut, unimpeded |

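
A toy scalar version of the comparison above makes both effects concrete. This is a minimal sketch under strong simplifying assumptions: each block is a single multiplication F(x) = w * x with no non-linearity or normalization, and the helper names `forward` and `stack_gradient` are made up for the example.

```python
def forward(x: float, weights: list[float], residual: bool) -> float:
    """Run x through a stack of scalar blocks, one weight per block."""
    for w in weights:
        fx = w * x                        # F(x), the learned part of the block
        x = fx + x if residual else fx    # y = F(x) + x  vs  y = H(x) = F(x)
    return x


def stack_gradient(weights: list[float], residual: bool) -> float:
    """d(output)/d(input) of the whole stack, via the chain rule."""
    grad = 1.0
    for w in weights:
        # Local derivative: d(w*x + x)/dx = w + 1, versus d(w*x)/dx = w.
        grad *= (w + 1.0) if residual else w
    return grad


depth = 20

# Case 1: all-zero weights. The residual stack is already the identity;
# the conventional stack wipes out both the signal and the gradient.
zeros = [0.0] * depth
print(forward(3.0, zeros, residual=True), stack_gradient(zeros, residual=True))
print(forward(3.0, zeros, residual=False), stack_gradient(zeros, residual=False))

# Case 2: small weights. The conventional gradient shrinks geometrically
# with depth, while the shortcut keeps the residual gradient near 1.
small = [0.01] * depth
print(stack_gradient(small, residual=False))
print(stack_gradient(small, residual=True))
```
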
| Technique | What it does | When to use |
|---|---|---|
| Data parallelism | Replicate model on N GPUs, split mini-batch, AllReduce gradients | Default; scales nearly linearly |
| Model parallelism | Split model’s layers across GPUs | Model too big for one GPU’s memory |
| Mixed precision | 16-bit floats with 32-bit accumulation | Almost always; cuts memory + boosts throughput |
| Gradient accumulation | Accumulate gradients over several mini-batches before each optimizer step | When memory caps the batch size that fits in one pass |
| LR warmup | Start LR tiny, ramp up over first epochs | Stable training at scale |
| Linear scaling rule | Scale LR proportionally to batch size | When growing batch via data parallelism |

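
The last two rows combine naturally: scale the learning rate with the batch size, then ramp up to that scaled value. The sketch below assumes a hypothetical recipe tuned at batch size 256 with LR 0.1 and moved to 8 workers of 256 samples each. The function names and the linear ramp shape are illustrative choices, and a "step" here can stand for an iteration or an epoch.

```python
def scaled_lr(base_lr: float, base_batch: int, batch: int) -> float:
    """Linear scaling rule: grow the LR in proportion to the batch size."""
    return base_lr * batch / base_batch


def lr_at_step(step: int, warmup_steps: int, target_lr: float) -> float:
    """Linear warmup: ramp from target_lr / warmup_steps to target_lr, then hold."""
    if step < warmup_steps:
        return target_lr * (step + 1) / warmup_steps
    return target_lr


# Hypothetical setup: base recipe at batch 256, scaled out to 8 x 256 = 2048.
target = scaled_lr(base_lr=0.1, base_batch=256, batch=8 * 256)
schedule = [lr_at_step(s, warmup_steps=5, target_lr=target) for s in range(8)]

print("target LR:", target)
print("first steps:", [round(lr, 3) for lr in schedule])
```

In a real run the warmup would typically be followed by a decay schedule, which is omitted here.
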
| Era | Hardware | Wall-clock |
|---|---|---|
| AlexNet (2012) | 2 GPUs | ~6 days |
| Foundation-scale vision (current) | Hundreds to thousands of accelerators | Weeks |

The gradient descent algorithm is identical; the engineering scaled around it.
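
A small numeric check illustrates why the algorithm is unchanged. The sketch assumes a one-parameter model y_hat = w * x with mean squared error, equal-sized shards, and a plain Python mean standing in for AllReduce. All names and data are made up for the example.

```python
def grad_mse(w: float, batch: list[tuple[float, float]]) -> float:
    """Gradient of the mean squared error of y_hat = w * x over (x, y) pairs."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def split(batch: list, parts: int) -> list[list]:
    """Split into `parts` equal contiguous shards (length must divide evenly)."""
    size = len(batch) // parts
    return [batch[i * size:(i + 1) * size] for i in range(parts)]


def all_reduce_mean(values: list[float]) -> float:
    """Stand-in for AllReduce: every worker ends up holding the mean."""
    return sum(values) / len(values)


batch = [(float(x), 3.0 * x + 1.0) for x in range(8)]  # toy data
w = 0.5

# Single device: one gradient over the full batch.
full = grad_mse(w, batch)

# Data parallelism: 4 workers each take a shard, then average their gradients.
data_parallel = all_reduce_mean([grad_mse(w, shard) for shard in split(batch, 4)])

# Gradient accumulation: one device walks 4 micro-batches before stepping.
accumulated = 0.0
for micro in split(batch, 4):
    accumulated += grad_mse(w, micro) / 4

print(full, data_parallel, accumulated)
```

All three values agree, so the optimizer step that follows is the same in each case. With unequal shard sizes the average would need to be weighted by shard size to keep that equality.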

| Pitfall | Reality |
|---|---|
| Parameter count = capability | GoogLeNet (~4M) beat VGG (~140M) on ImageNet, ~35x smaller |
| “Deeper always better” | Only after residual blocks (ResNet) made depth trainable |
| Skip connections are exotic | They are + x. The shocking thing is how much that one addition unlocks |
| Distributed training = different algorithm | No; same gradient descent, just parallelized engineering |
| Year-by-year ImageNet numbers are precise | Directional trend; small differences depend on training tricks + protocol |

Four architectures define the canon: AlexNet (the inflection point), VGG (depth and uniformity), GoogLeNet (smarter, not bigger), and ResNet (skip connections make depth work). ResNet’s y = F(x) + x is now everywhere, including inside vision transformers and large language models. Training at scale is parallelism engineering on the same algorithm.