| Year | Architecture | Headline idea | Layers | Parameters | ImageNet top-5 error |
|---|---|---|---|---|---|
| 2012 | AlexNet | First deep CNN to win ImageNet; ReLU + dropout + 2 GPUs | 8 (5 conv + 3 FC) | ~60M | 16% (vs 26% runner-up) |
| 2014 | VGG-16 | Depth + uniformity (only 3x3 conv + 2x2 pool) | 16 | ~140M | single-digit |
| 2014 | GoogLeNet (Inception) | Inception modules + 1x1 dim reduction + avg-pool head | 22 | ~4M | single-digit |
| 2015 | ResNet | Skip connections (y = F(x) + x) + heavy batch norm | up to 152 | ~25M (ResNet-50), ~60M (ResNet-152) | low single-digit (below human-level ~5%) |
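
The ~140M figure for VGG-16 can be checked with a little arithmetic. The sketch below assumes the commonly cited VGG-16 layer widths (13 3x3 conv layers, three FC layers, 224x224 input, 1000 classes); those widths are not listed in the table above, so treat them as an assumption.

```python
def conv3x3_params(c_in, c_out):
    """Weights plus biases of one 3x3 convolution layer."""
    return 3 * 3 * c_in * c_out + c_out

def fc_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

# (in_channels, out_channels) for the 13 conv layers (assumed standard VGG-16 widths).
conv_layers = [(3, 64), (64, 64),
               (64, 128), (128, 128),
               (128, 256), (256, 256), (256, 256),
               (256, 512), (512, 512), (512, 512),
               (512, 512), (512, 512), (512, 512)]

# Three FC layers; the first consumes the flattened 7x7x512 feature map.
fc_layers = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]

conv_total = sum(conv3x3_params(i, o) for i, o in conv_layers)
fc_total = sum(fc_params(i, o) for i, o in fc_layers)
total = conv_total + fc_total

print(f"conv: {conv_total:,}  fc: {fc_total:,}  total: {total:,}")
# conv: 14,714,688  fc: 123,642,856  total: 138,357,544
```

Under these assumptions roughly 89% of the parameters sit in the three FC layers, which is why replacing the FC tower pays off so heavily.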

| Idea | What it does | Architecture |
|---|---|---|
| ReLU | Faster training; partly addresses vanishing gradients | AlexNet onward |
| Dropout (FC layers) | Regularization; counter overfitting in dense layers | AlexNet onward |
| 3x3 conv stacks | Same receptive field as larger filter, fewer params, more non-linearities | VGG |
| Inception module | Parallel filter sizes (1x1, 3x3, 5x5) + 1x1 dim reduction | GoogLeNet |
| Average-pool head | Replaces giant FC tower, kills millions of params | GoogLeNet |
| Residual block | y = F(x) + x; identity shortcut for gradient flow | ResNet |
| Batch normalization | Stabilizes training; “heavy use” in ResNet | ResNet onward |
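
The parameter savings claimed for 3x3 stacks and 1x1 reduction can be sketched numerically. The channel widths below (256 and 64) are illustrative assumptions, not values from any specific network, and biases are ignored for simplicity.

```python
def conv_params(kernel, c_in, c_out):
    """Weight count of one kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * c_in * c_out

C = 256        # illustrative channel width (assumption)
REDUCED = 64   # illustrative bottleneck width (assumption)

# Three stacked 3x3 convs cover the same 7x7 receptive field as one 7x7 conv.
stack_3x3 = 3 * conv_params(3, C, C)
single_7x7 = conv_params(7, C, C)

# Inception-style reduction: a 1x1 conv shrinks channels before the costly 5x5.
direct_5x5 = conv_params(5, C, C)
reduced_5x5 = conv_params(1, C, REDUCED) + conv_params(5, REDUCED, C)

print(f"3x3 stack / 7x7:  {stack_3x3 / single_7x7:.2f}")
print(f"reduced / direct: {reduced_5x5 / direct_5x5:.2f}")
```

With these widths the 3x3 stack needs 27/49 of the weights of the single 7x7 filter, and the 1x1 reduction cuts the 5x5 branch to about a quarter of its direct cost. The stack also applies a non-linearity after each of its three layers instead of once.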

| Scenario | Conventional block | Residual block |
|---|---|---|
| Block should be identity | Has to LEARN H(x) = x (hard) | Trivial: F(x) = 0 |
| Block should do useful work | Learn H(x) directly | Learn F(x) = H(x) - x |
| Backward pass | Gradient must pass through every layer | Also flows back through + x shortcut, unimpeded |
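
The identity case in the table can be demonstrated in a few lines. This is a minimal NumPy sketch, assuming a toy two-layer branch for F(x); the function names and sizes are hypothetical and stand in for real conv layers.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def branch(x, w1, w2):
    """F(x): a toy two-layer transform (linear -> ReLU -> linear)."""
    return relu(x @ w1) @ w2

def plain_block(x, w1, w2):
    """Conventional block: the output is whatever the layers compute."""
    return branch(x, w1, w2)

def residual_block(x, w1, w2):
    """Residual block: y = F(x) + x."""
    return branch(x, w1, w2) + x

def numerical_jacobian(f, v, eps=1e-6):
    """Central-difference Jacobian of f at the 1-D point v."""
    jac = np.zeros((v.size, v.size))
    for i in range(v.size):
        step = np.zeros_like(v)
        step[i] = eps
        jac[:, i] = (f(v + step) - f(v - step)) / (2 * eps)
    return jac

rng = np.random.default_rng(0)
dim = 4
x = rng.normal(size=dim)
w1 = rng.normal(size=(dim, dim))
w2_zero = np.zeros((dim, dim))  # F(x) = 0 for every input

y_res = residual_block(x, w1, w2_zero)    # passes x through unchanged
y_plain = plain_block(x, w1, w2_zero)     # collapses to zeros

# With F = 0 the Jacobian dy/dx is the identity: the gradient passes through intact.
jac = numerical_jacobian(lambda v: residual_block(v, w1, w2_zero), x)
```

Setting the last weight matrix to zero makes the residual block an exact identity, while the plain block loses its input entirely. More generally dy/dx = I + dF/dx, so the shortcut contributes an identity term to the gradient no matter what the branch learns.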

| Technique | What it does | When to use |
|---|---|---|
| Data parallelism | Replicate model on N GPUs, split mini-batch, AllReduce gradients | Default; scales nearly linearly |
| Model parallelism | Split model’s layers across GPUs | Model too big for one GPU’s memory |
| Mixed precision | 16-bit floats with 32-bit accumulation | Almost always; cuts memory + boosts throughput |
| Gradient accumulation | Accumulate gradients over several mini-batches before each optimizer step | When memory limits the per-step batch size |
| LR warmup | Start LR tiny, ramp up over first epochs | Stable training at scale |
| Linear scaling rule | Scale LR proportionally to batch size | When growing batch via data parallelism |
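
The last two rows combine naturally into one learning-rate schedule. The sketch below is illustrative only: the function names, the base LR of 0.1 at batch size 256, and the 5-step warmup are hypothetical values, and a real schedule would usually decay after warmup instead of holding the peak.

```python
def scaled_lr(base_lr, base_batch, batch):
    """Linear scaling rule: grow the learning rate in proportion to batch size."""
    return base_lr * batch / base_batch

def warmup_lr(step, warmup_steps, peak_lr):
    """Linear warmup: ramp from a small value up to peak_lr, then hold it."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr

# Hypothetical setup: tuned at batch 256, now training at batch 2048 (8x larger).
peak = scaled_lr(base_lr=0.1, base_batch=256, batch=2048)
schedule = [warmup_lr(s, warmup_steps=5, peak_lr=peak) for s in range(8)]

for step, lr in enumerate(schedule):
    print(f"step {step}: lr = {lr:.2f}")
```

Here the 8x larger batch raises the peak learning rate 8x, and the warmup keeps the first few updates small while the weights are still far from any reasonable region.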

| Era | Hardware | Wall-clock training time |
|---|---|---|
| AlexNet (2012) | 2 GPUs | ~6 days |
| Foundation-scale vision (current) | Hundreds to thousands of accelerators | Weeks |

The gradient descent algorithm is the same in both eras; what scaled is the engineering around it.
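
A small simulation makes the "same algorithm" point concrete. This is a sketch under simplifying assumptions: a linear model with mean squared error, equal-sized shards, and a plain average standing in for AllReduce; all names are hypothetical.

```python
import numpy as np

def mse_gradient(w, x, y):
    """Gradient of mean squared error for a linear model y_hat = x @ w."""
    residual = x @ w - y
    return 2.0 * x.T @ residual / len(y)

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 3))
y = rng.normal(size=64)
w = rng.normal(size=3)

# Single device: one gradient over the full mini-batch.
full_grad = mse_gradient(w, x, y)

# Data parallelism: each worker sees an equal shard, then gradients are averaged.
num_workers = 4
shard_grads = [mse_gradient(w, xs, ys)
               for xs, ys in zip(np.split(x, num_workers), np.split(y, num_workers))]
averaged_grad = np.mean(shard_grads, axis=0)

print(np.allclose(full_grad, averaged_grad))
```

With equal shards the averaged gradient matches the full-batch gradient up to floating-point rounding, so the update rule is unchanged. Gradient accumulation is the same computation with the shards processed one after another on a single device.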

| Pitfall | Reality |
|---|---|
| Parameter count = capability | GoogLeNet (~4M) beat VGG (~140M) on ImageNet despite being ~35x smaller |
| “Deeper is always better” | Only once residual blocks (ResNet) made very deep networks trainable |
| Skip connections are exotic | They are just + x. The surprise is how much that one addition unlocks |
| Distributed training = different algorithm | No; same gradient descent, just parallelized engineering |
| Year-by-year ImageNet numbers are precise | Directional trend; small differences depend on training tricks + protocol |

Four architectures define the canon: AlexNet (the inflection point), VGG (depth and uniformity), GoogLeNet (smarter, not bigger), and ResNet (skip connections make depth work). ResNet’s y = F(x) + x is now everywhere, including inside vision transformers and large language models. Training at scale is parallelism engineering on the same algorithm.