The architectures that cracked vision, AlexNet to ResNet
What you’ll learn
This is lesson 6 of the track, in Phase 2 (How machines see). The one capability it builds: you will be able to walk the four landmark CNN architectures from 2012 to 2015 in order, name what each contributed, and explain why ResNet’s residual block was the unlock that finally let depth pay off. By the end, you will have the vocabulary to read any modern vision-architecture paper or product announcement. The source curriculum is Stanford CS231n (cs231n.stanford.edu); this lesson maps to Lecture 6 (CNN Architectures) and folds in Lecture 11 (Large Scale Distributed Training) as a training-at-scale subsection, per the Track 16 Phase 0 arc.
The lesson walks the four landmarks in order and names the pattern each one established:
- AlexNet (2012): the inflection point, ~60M params, about 16 percent top-5 error against 26 percent for the runner-up
- VGG (2014): ~140M params, 16 weight layers, all 3 by 3 convs
- GoogLeNet / Inception (2014): ~4M params, Inception modules, 1 by 1 convs for dimensionality reduction, and an average-pool head
- ResNet (2015): 152 layers, built from the residual block y = F(x) + x

It then folds in training at scale: data parallelism, model parallelism, mixed precision, learning-rate warmup, and the linear scaling rule, anchored by the jump in scale from AlexNet (2 GPUs, ~6 days) to modern training runs (hundreds to thousands of accelerators, weeks).
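The residual block formula y = F(x) + x mentioned above can be sketched in a few lines of plain Python. This is a minimal illustration, not ResNet's actual layer stack: the real F is a small stack of conv layers, and the names `residual_block`, `zero_branch`, and `small_branch` are hypothetical helpers invented here. It shows only the skip connection: when the learned branch F outputs zero, the block passes its input through unchanged.

```python
def residual_block(x, f):
    """Return y = F(x) + x, elementwise, for a list of floats x."""
    fx = f(x)
    if len(fx) != len(x):
        raise ValueError("F(x) must match the shape of x for the skip connection")
    return [a + b for a, b in zip(fx, x)]


def zero_branch(x):
    # Hypothetical branch whose weights are all zero, so F(x) = 0.
    return [0.0 for _ in x]


def small_branch(x):
    # Hypothetical branch that has learned a small correction to x.
    return [0.1 * v for v in x]


x = [1.0, -2.0, 3.0]
identity_out = residual_block(x, zero_branch)    # the block acts as the identity
corrected_out = residual_block(x, small_branch)  # identity plus a learned residual
```

Because the `+ x` path carries the input forward untouched, an extra block can always fall back to doing nothing, which is the intuition behind stacking many such blocks.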
Where this fits
This is lesson 6 of 16, the second lesson of Phase 2. It depends on lesson 5 (the conv layer; this lesson stacks them into deep architectures). The next lesson, Sequence tools for vision: recurrence and attention, covers RNNs and attention applied to vision tasks (image captioning, video, vision transformers), with the deep architecture mechanics covered in sister tracks.
Before you start
Prerequisites: lesson 5 of this track (the conv layer). The four architectures are stacks of conv layers (plus a few other pieces), so you need the lesson 5 picture in your head. Track 12 lessons 4 and 5 are a useful gentle warm-up for the historical context.
About the math
Light. The body cites parameter counts and top-5 error figures from CS231n’s case-studies section and shows the residual block formula y = F(x) + x. The practice section asks for two short calculations: a parameter-savings ratio (140M / 4M ≈ 35x) and a relative-error reduction (26 − 16 = 10 points, and 10 / 26 ≈ 38%). No new math operations beyond subtraction, multiplication, and division.
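The two calculations can be checked with a short script. This sketch uses only the rounded figures quoted above; the variable names are illustrative.

```python
# Rounded parameter counts quoted in this lesson.
vgg_params = 140_000_000
googlenet_params = 4_000_000
param_ratio = vgg_params / googlenet_params

# Top-5 error in percent: the 26 percent figure versus AlexNet's 16 percent.
baseline_top5 = 26.0
alexnet_top5 = 16.0
relative_reduction = (baseline_top5 - alexnet_top5) / baseline_top5

print(f"VGG has about {param_ratio:.0f}x the parameters of GoogLeNet")
print(f"AlexNet cut top-5 error by about {relative_reduction:.0%} in relative terms")
```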
By the end, you’ll be able to
- Name the four landmarks in order and their key structural ideas
- Cite the canonical CS231n numbers for parameter counts and AlexNet’s inflection
- Write and explain the residual block formula
- Distinguish data parallelism from model parallelism
- See why parameter count is not the same as capability
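To make the data-parallelism objective concrete, here is a minimal sketch under stated assumptions: a toy one-weight model y_hat = w * x with mean squared error, a batch that splits evenly, and plain Python lists standing in for accelerators. The functions `grad_mse` and `data_parallel_grad` are hypothetical names for this illustration, not any library's API. Every simulated worker holds the full model and one shard of the batch, and the per-worker gradients are averaged.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error w.r.t. w for the model y_hat = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def data_parallel_grad(w, xs, ys, num_workers):
    """Simulate data parallelism: same w everywhere, one batch shard per worker."""
    if len(xs) % num_workers != 0:
        raise ValueError("this sketch assumes the batch splits evenly")
    shard = len(xs) // num_workers
    local_grads = [
        grad_mse(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
        for i in range(num_workers)
    ]
    # Stand-in for an all-reduce: average the per-worker gradients.
    return sum(local_grads) / num_workers


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5
full_batch = grad_mse(w, xs, ys)
parallel = data_parallel_grad(w, xs, ys, num_workers=2)
```

With equal-sized shards the averaged gradient matches the single-device full-batch gradient, so data parallelism changes where the work happens, not what is computed. Model parallelism, by contrast, splits the model itself across devices, which this sketch does not attempt.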
Time and difficulty
- Read time: about 14 minutes
- Practice time: about 15 minutes (a match-the-architecture exercise, a parameter-savings + error-reduction calculation, a residual-block reasoning question, plus flashcards)
- Difficulty: standard (the math is multiplication and division; the conceptual lift is connecting four architectures to the patterns they established and seeing what ResNet’s + x unlocked)