CNN architectures, in brief

What you’ll learn

This is lesson 6, in Phase 2 (How machines see). The one capability it builds: you will be able to walk the four landmark CNN architectures from 2012 to 2015 in order, name what each contributed, and explain why ResNet’s residual block was the unlock that finally let depth pay off. By the end you will have the vocabulary to read modern vision-architecture papers and product announcements. The source curriculum is Stanford CS231n (cs231n.stanford.edu); this lesson maps to Lecture 6 (CNN Architectures) and folds in Lecture 11 (Large Scale Distributed Training) as a training-at-scale subsection, per the Track 16 Phase 0 arc.

The lesson walks four architectures in order:

AlexNet (2012): the inflection point, ~60M params, about 16 percent top-5 error against the runner-up’s 26 percent
VGG (2014): ~140M params, 16 weight layers, all 3 by 3 convs
GoogLeNet / Inception (2014): ~4M params, Inception modules, 1 by 1 dimensionality reduction, and an average-pool head
ResNet (2015): 152 layers, residual block y = F(x) + x

It names the patterns each established, then folds in training at scale: data parallelism, model parallelism, mixed precision, learning-rate warmup, and the linear scaling rule. The scale anchor runs from AlexNet (2 GPUs, about 6 days) to modern training runs (hundreds to thousands of accelerators, weeks).
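
Two of the training-at-scale ideas named above, the linear scaling rule and learning-rate warmup, come down to a few lines of arithmetic. The sketch below is illustrative only: the function names, the base learning rate of 0.1 at batch size 256, and the five-step linear ramp are assumptions chosen for the example, not values taken from this lesson or from CS231n.

```python
def scaled_lr(base_lr, base_batch, batch):
    """Linear scaling rule: grow the learning rate in proportion to batch size."""
    return base_lr * batch / base_batch


def warmup_lr(step, warmup_steps, target_lr):
    """Linear warmup: ramp up to target_lr over warmup_steps, then hold it."""
    if step >= warmup_steps:
        return target_lr
    return target_lr * (step + 1) / warmup_steps


# Assumed example values: lr 0.1 was tuned at batch 256, and data parallelism
# across many accelerators raises the global batch to 8192 (32x larger).
target = scaled_lr(base_lr=0.1, base_batch=256, batch=8192)

# Learning rate for the first seven steps with a five-step warmup.
schedule = [warmup_lr(s, warmup_steps=5, target_lr=target) for s in range(7)]

for step, lr in enumerate(schedule):
    print(f"step {step}: lr = {lr:.2f}")
```

The schedule climbs in equal increments for the first five steps and then stays flat at the scaled rate. Warmup exists because jumping straight to a learning rate 32 times larger tends to destabilize the earliest updates.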

Where this fits

This is lesson 6 of 16, the second lesson of Phase 2. It depends on lesson 5 (the conv layer; this lesson stacks conv layers into deep architectures). The next lesson, Sequence tools for vision: recurrence and attention, covers RNNs and attention applied to vision tasks (image captioning, video, vision transformers), with the deep architecture mechanics covered in sister tracks.

Before you start

Prerequisites: lesson 5 of this track (the conv layer). The four architectures are stacks of conv layers (plus a few other pieces), so you need the lesson 5 picture in your head. Track 12 lessons 4 and 5 are a useful gentle warm-up for the historical context.

About the math

Light. The body cites parameter counts and top-5 error figures from CS231n’s case-studies section and shows the residual block formula y = F(x) + x. The practice section asks for two short calculations: a parameter-savings ratio (140M / 4M = 35x) and a relative-error reduction ((26 - 16) / 26 = 10 / 26 ≈ 38%). No new math operations beyond subtraction, multiplication, and division.
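
The two practice calculations can be checked in a few lines. This is a minimal sketch using the approximate figures quoted in this lesson; the helper names are made up for illustration.

```python
def savings_ratio(larger, smaller):
    """How many times bigger the larger parameter count is."""
    return larger / smaller


def relative_reduction(before, after):
    """Fraction of the original error that was removed."""
    return (before - after) / before


# VGG (~140M params) against GoogLeNet (~4M params).
param_ratio = savings_ratio(140e6, 4e6)

# Top-5 error falling from 26 percent to 16 percent.
error_cut = relative_reduction(26.0, 16.0)

print(f"VGG has about {param_ratio:.0f}x the parameters of GoogLeNet")
print(f"Top-5 error fell by about {error_cut:.0%} in relative terms")
```

The second figure is a relative reduction: the absolute drop is 10 percentage points, which is about 38 percent of the original 26.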

By the end, you’ll be able to

Name the four landmarks in order and their key structural ideas
Cite the canonical CS231n numbers for parameter counts and AlexNet’s inflection
Write and explain the residual block formula
Distinguish data parallelism from model parallelism
See why parameter count is not the same as capability
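
For the residual block objective, the formula y = F(x) + x can be written out directly. The sketch below is a simplification under stated assumptions: F is two small matrix multiplies with a ReLU between them, standing in for the conv layers of a real block, and the batch normalization and final ReLU used in actual ResNet blocks are left out. All names are illustrative.

```python
def relu(v):
    """Elementwise max(0, a)."""
    return [max(0.0, a) for a in v]


def linear(weights, v):
    """Matrix-vector product; weights is a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]


def residual_block(x, w1, w2):
    """y = F(x) + x, with F(x) = w2 * relu(w1 * x) as a stand-in for conv layers."""
    fx = linear(w2, relu(linear(w1, x)))
    # The skip connection: add the input back onto the learned residual.
    return [f + a for f, a in zip(fx, x)]


x = [1.0, -2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]

# With all-zero weights F(x) is zero, so the block passes x through unchanged.
y_zero = residual_block(x, zeros, zeros)

# With identity weights F(x) = relu(x), so y = relu(x) + x.
y_identity = residual_block(x, identity, identity)

print(y_zero)
print(y_identity)
```

The all-zero case is the point of the design: a block that has learned nothing still behaves as the identity, so stacking more blocks does not have to make the network worse.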

Time and difficulty

Read time: about 14 minutes
Practice time: about 15 minutes (a match-the-architecture exercise, a parameter-savings + error-reduction calculation, a residual-block reasoning question, plus flashcards)
Difficulty: standard (the math is multiplication and division; the conceptual lift is connecting four architectures to the patterns they established and seeing what ResNet’s + x unlocked)