Summary: Neural networks and backpropagation

Lesson 2 capped us at one template per class; lesson 3 gave us a training loop. This lesson closes Phase 1 by lifting the cap. We add a non-linearity (commonly ReLU) between two linear layers; the result is a neural network whose hidden layer produces learned features of the image, broadly the move that ended the hand-engineered-features era of computer vision. The catch is computing gradients through all those layers efficiently, which is where backpropagation comes in: the chain rule applied recursively through the network’s computational graph, giving gradients for every weight in one forward plus one backward pass. With this in hand, we have the full general-purpose image classifier the rest of the track refines.

Core ideas

Stacking linear layers alone adds no capacity. W2 · (W1 · x + b1) + b2 = (W2 · W1) · x + (W2 · b1 + b2) = W' · x + b', which is just another single linear layer. To gain capacity, you need a non-linearity between layers.
Two-layer NN: h = ReLU(W1 · x + b1), s = W2 · h + b2. The hidden vector h (size H, a design choice) is a set of learned features; the final linear layer maps those features into K class scores. Multi-modal classes can now spread across many hidden units instead of being squashed into one template.
Learned features ended the feature-engineering era. Pre-2012 CV largely built features by hand (SIFT, HOG, color histograms). AlexNet’s 2012 ImageNet win made learned features dominant; the field’s labor moved from feature design to architecture, data, and training.
Backpropagation is the chain rule applied to a computational graph. On the forward pass, every node (“gate”) computes its output and can also compute its local gradient, the derivative of that output with respect to each of its inputs. On the backward pass, it receives an upstream gradient and multiplies it by the local gradient to produce gradients on its inputs. Each node operates completely locally; the chain rule glues them together.
One forward pass + one backward pass = gradients for every weight, at once. That efficiency is why backprop replaced numerical (finite-difference) gradients, which cost at least one extra forward pass per weight (hopeless for billions of weights).
The full training loop is now complete: forward → loss → backward (backprop) → step W ← W - α * ∇L. Repeat on the next mini-batch. Every classifier in this track, including the giant ones, runs this same cycle.
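The ideas above can be sketched in NumPy. This is a minimal illustration, not the lesson's reference implementation. The toy sizes, random data, softmax cross-entropy loss, learning rate, and all function names are assumptions chosen for the example. The batched code stores weights transposed relative to the W · x notation above, so that X @ W1 handles a whole mini-batch at once.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: D input dims, H hidden units, K classes, N examples.
D, H, K, N = 8, 16, 3, 32
X = rng.normal(size=(N, D))        # random stand-in for flattened images
y = rng.integers(0, K, size=N)     # random stand-in for labels


def linear_collapse_gap():
    """Largest difference between two stacked linear layers and the single merged one."""
    W1, b1 = rng.normal(size=(H, D)), rng.normal(size=H)
    W2, b2 = rng.normal(size=(K, H)), rng.normal(size=K)
    x = rng.normal(size=D)
    stacked = W2 @ (W1 @ x + b1) + b2
    merged = (W2 @ W1) @ x + (W2 @ b1 + b2)
    return np.max(np.abs(stacked - merged))


def init_params():
    return {
        "W1": 0.1 * rng.normal(size=(D, H)), "b1": np.zeros(H),
        "W2": 0.1 * rng.normal(size=(H, K)), "b2": np.zeros(K),
    }


def loss_and_grads(p, X, y):
    """One forward pass plus one backward pass: loss and a gradient for every weight."""
    n = len(y)
    # Forward pass.
    z = X @ p["W1"] + p["b1"]                  # (n, H) pre-activations
    h = np.maximum(z, 0.0)                     # ReLU: the learned features
    s = h @ p["W2"] + p["b2"]                  # (n, K) class scores
    s = s - s.max(axis=1, keepdims=True)       # shift for numerical stability
    probs = np.exp(s) / np.exp(s).sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(n), y]).mean()   # assumed loss: softmax cross-entropy

    # Backward pass: each step multiplies the upstream gradient by a local gradient.
    ds = probs.copy()
    ds[np.arange(n), y] -= 1.0
    ds /= n                                    # dL/ds
    grads = {"W2": h.T @ ds, "b2": ds.sum(axis=0)}
    dh = ds @ p["W2"].T                        # through the second linear layer
    dz = dh * (z > 0)                          # through ReLU: gradient passes where z > 0
    grads["W1"] = X.T @ dz
    grads["b1"] = dz.sum(axis=0)
    return loss, grads


def numerical_grad(p, name, idx, X, y, eps=1e-5):
    """Centered finite difference for ONE weight: two extra forward passes."""
    old = p[name][idx]
    p[name][idx] = old + eps
    loss_plus, _ = loss_and_grads(p, X, y)
    p[name][idx] = old - eps
    loss_minus, _ = loss_and_grads(p, X, y)
    p[name][idx] = old                         # restore the weight
    return (loss_plus - loss_minus) / (2 * eps)


def train(p, steps=300, lr=0.1, batch=16):
    """forward -> loss -> backward -> step, repeated on random mini-batches."""
    for _ in range(steps):
        idx = rng.choice(N, size=batch, replace=False)
        _, grads = loss_and_grads(p, X[idx], y[idx])
        for name in p:
            p[name] -= lr * grads[name]        # W <- W - alpha * grad
    return p


if __name__ == "__main__":
    p = init_params()
    print("stacked vs merged linear gap:", linear_collapse_gap())
    loss, grads = loss_and_grads(p, X, y)
    print("backprop vs numerical, W1[0, 0]:",
          grads["W1"][0, 0], numerical_grad(p, "W1", (0, 0), X, y))
    train(p)
    print("loss before / after:", loss, loss_and_grads(p, X, y)[0])
```

Note that numerical_grad has to be called once per weight, while loss_and_grads returns all gradients from a single call. In practice the finite-difference version is typically kept only as a spot check on a handful of weights.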

What changes for you

When you read about a model with “150 layers” or “8 billion parameters trained for three weeks,” you can now picture exactly what happened. There is no exotic mechanism inside; there is the same training loop running enormously many times, with backprop carrying gradients through every layer on every step. “Deep” means many of these stacked layers, each adding more learned-feature capacity. The fact that features are learned, not designed, also explains why modern computer-vision work spends almost no effort hand-crafting features and almost all of it on architecture, data, and training procedure. It also explains the field’s odd failure modes: the features a model relies on were never specified by anyone, so when a model fails strangely, it is often because the learned features matched something they should not have. If you came from Neural Network Intuition, lessons 6-9 there cover the same cost landscape, gradient descent, and backpropagation in a generic setting; this lesson applies them to a vision classifier.

A non-linearity between linear layers lifts the one-template-per-class limit; the chain rule delivers the gradient for every weight in one backward pass. With those two moves Phase 1 is closed and a general-purpose image classifier exists, ready for the special-purpose architecture of Phase 2.