Practice: Neural networks and backpropagation
Self-check
Section titled “Self-check”Seven short questions. Answer each before opening the collapsible.
1. Why does stacking two linear layers (without anything between them) gain no capacity?
Show answer
W2 · (W1 · x + b1) + b2 = (W2 · W1) · x + (W2 · b1 + b2) = W' · x + b'. The composition is just another single linear layer with different weights and bias. No new representational power.
2. Write the two-layer neural network’s forward equations for image classification.
Show answer
h = ReLU(W1 · x + b1) (the hidden layer of H learned-feature numbers), and s = W2 · h + b2 (the final linear classifier producing K class scores). ReLU(z) = max(0, z) is the non-linearity that makes the extra layer actually do something.
3. In one sentence, what does the hidden layer actually produce?
Show answer
A vector of H learned features of the image. Each hidden unit looks at all input pixels through its own row of W1, computes a weighted sum, and ReLU keeps it if positive (zeroes it otherwise). The final classifier sees those H feature numbers instead of the raw pixels.
4. What changed in computer vision around 2012 that made deep networks dominate?
Show answer
Pre-2012, “doing CV” largely meant hand-engineering features (SIFT, HOG, color histograms) and feeding them to a (often linear) classifier. AlexNet’s 2012 ImageNet win showed that a network learning its own features could outperform hand-designed ones by a wide margin. The feature-engineering era effectively ended; learned features became the default.
5. State backpropagation in one sentence.
Show answer
Backpropagation is the chain rule of calculus applied recursively through the computational graph of the network: one forward pass computes outputs (and stores local gradients), then one backward pass sweeps backward from the loss, multiplying upstream gradients by local gradients at each node, producing gradients for every weight at once.
6. At a single computational node (“gate”), what two things does it do on the backward pass?
Show answer
It receives an upstream gradient (the loss’s sensitivity to its own output) and multiplies that by its local gradient (its output’s sensitivity to each input), producing the gradient to send back to each of its inputs. It does this completely locally; it does not need to know anything about the rest of the network.
7. Why is backprop the way to compute gradients (rather than the numerical / finite-difference approach)?
Show answer
The numerical approach (nudge a weight, re-run the forward pass, see how loss changed) costs one forward pass per weight, which is hopeless for billions of weights. Backprop computes the gradient for every weight in one forward pass plus one backward pass total, regardless of how many weights the network has. Efficiency is the whole reason it exists.
Try it yourself: collapse two linears, then run a fresh chain-rule circuit
Section titled “Try it yourself: collapse two linears, then run a fresh chain-rule circuit”Three short exercises, about 15 minutes.
Part A: linears collapse without a non-linearity. A toy 2-input, 2-hidden, 2-output network has:
W1 = | 1 2 | b1 = | 0 | | 0 -1 | | 1 |
W2 = | 2 1 | b2 = | 0 | | -1 0 | | -1 |(1) Without any activation in between, compute the equivalent single-layer weights W' = W2 · W1 and bias b' = W2 · b1 + b2. (2) On the input x = [1, 2], compute the output using the two-layer form and using the collapsed single-layer form; verify they match. (3) Now insert a ReLU between the layers and compute the output again; observe that it differs.
Worked answer
(1) Collapse.
W' = W2 · W1 = | 2·1 + 1·0 2·2 + 1·(-1) | = | 2 3 | | -1·1 + 0·0 -1·2 + 0·(-1) | | -1 -2 |
b' = W2 · b1 + b2 = | 2·0 + 1·1 | + | 0 | = | 1 | | -1·0 + 0·1 | | -1 | | -1 |So the equivalent single layer is W' = [[2, 3], [-1, -2]], b' = [1, -1].
(2) Two paths, same answer (no non-linearity). Two-layer form:
W1 · x + b1 = [1·1 + 2·2, 0·1 + (-1)·2] + [0, 1] = [5, -2] + [0, 1] = [5, -1]W2 · (...) = [2·5 + 1·(-1), -1·5 + 0·(-1)] = [9, -5]+ b2 = [9, -5] + [0, -1] = [9, -6]Collapsed form: W' · x + b' = [2·1 + 3·2, -1·1 + (-2)·2] + [1, -1] = [8, -5] + [1, -1] = [9, -6]. Match. Without the non-linearity, the two-layer model is exactly the one-layer model.
(3) With ReLU between. W1 · x + b1 = [5, -1]; ReLU([5, -1]) = [5, 0]. Then W2 · [5, 0] + b2 = [2·5 + 1·0, -1·5 + 0·0] + [0, -1] = [10, -5] + [0, -1] = [10, -6].
The output [10, -6] differs from the no-non-linearity result [9, -6]. ReLU’s kink at zero changed what the network can express. That kink is what gives the two-layer NN real capacity that one linear layer cannot match.
Part B: a fresh chain-rule circuit. Take CS231n’s circuit f(x, y, z) = (x + y) * z again, but now with x = 3, y = -1, z = 2. Run the forward pass and then the backward pass to find df/dx, df/dy, df/dz.
Worked answer
Forward. q = x + y = 3 + (-1) = 2. f = q * z = 2 * 2 = 4.
Backward. Start with df/df = 1.
Last gate is the multiplier f = q * z. Local gradients are ∂f/∂q = z = 2 and ∂f/∂z = q = 2. Multiply by upstream (1):
df/dq = 1 * 2 = 2 (sent backward to q)df/dz = 1 * 2 = 2 (sent backward to z; done)Previous gate is the adder q = x + y. Local gradients are both 1. Multiply by upstream (df/dq = 2):
df/dx = 2 * 1 = 2df/dy = 2 * 1 = 2Final: [df/dx, df/dy, df/dz] = [2, 2, 2].
Sanity-check by finite difference: f(3.01, -1, 2) = (3.01 - 1) * 2 = 4.02; change is 0.02 = 2 * 0.01, matching df/dx = 2. Same backward recipe, scaled to thousands of gates, is how a deep network gets all its gradients.
Part C: reasoning. Argue in one or two sentences why inserting a ReLU between two linear layers gives genuinely new capacity, in a way that increasing the size of a single linear layer never could.
What a good answer looks like
A single linear map (no matter how wide) corresponds to a single flat boundary per class in input space; multi-modal classes still get one merged template (the lesson 2 limit). Inserting a ReLU creates a kink: different regions of input space can now produce different linear behaviors downstream, so the network can express different combinations of features for different parts of the input. Stacking more such (linear + ReLU) blocks compounds this, which is why depth + non-linearity buys capacity that width alone never does.
Flashcards
Section titled “Flashcards”Nine cards. Click any card to reveal the answer. Use the Print flashcards button to lay out the full set as one card per page, ready to print or save as a PDF for offline review.
Q. Why does stacking two linear layers with nothing between them not add capacity?
W2 · (W1 · x + b1) + b2 = (W2 · W1) · x + (W2 · b1 + b2), which is just another single linear layer with different weights and bias. No representational gain.
Q. Two-layer NN forward equations?
h = ReLU(W1 · x + b1) (hidden, H learned features) and s = W2 · h + b2 (final linear, K class scores). ReLU(z) = max(0, z) is the non-linearity.
Q. What does the hidden layer produce?
A vector of H learned features of the image. The final classifier sees those H feature numbers instead of raw pixels, which breaks the one-template-per-class limit.
Q. What changed in CV in 2012?
AlexNet showed networks learning their own features outperformed hand-engineered ones (SIFT, HOG, color histograms). The feature-engineering era effectively ended; learned features became the default.
Q. Backpropagation, in one sentence?
The chain rule of calculus applied recursively through the network’s computational graph: one forward pass + one backward pass produces gradients for every weight at once.
Q. What does a single node do on the backward pass?
Receives an upstream gradient (loss’s sensitivity to its output), multiplies it by its local gradient (output’s sensitivity to each input), and sends those gradients backward to its inputs. Completely local; node knows nothing of the rest of the network.
Q. Why is backprop efficient compared to numerical (finite-difference) gradients?
Numerical: O(N) forward passes for N weights (hopeless for billions). Backprop: one forward + one backward pass total, regardless of how many weights.
Q. What is the four-step training loop for a neural network?
(1) Forward pass to get scores; (2) compute loss; (3) backward pass (backprop) to get gradients for every weight; (4) step W ← W - α * ∇L. Repeat on the next mini-batch.
Q. What does 'deep' mean in deep learning?
Many stacked layers (linear, then non-linearity, then linear, …). Each layer’s features are built from the features below it. Depth = composition; the mechanism is plain, the power is in stacking.