Learning features instead of coding them: neural networks and backprop

This is the Phase 1 capstone (lesson 4 of Foundations for vision). It builds one capability: you will be able to explain why a neural network with one hidden layer can do what a linear classifier cannot, and to run a small backpropagation by hand to see how every weight in a network gets its gradient in one efficient pass. With this lesson, the full general-purpose image classifier exists; the rest of the track refines it. The source curriculum is Stanford CS231n (cs231n.stanford.edu); this lesson maps to Lecture 4 and is grounded in the neural-networks-1 and optimization-2 course notes.

The lesson first shows why stacking linear layers alone gains nothing (the composition flattens to one layer), motivates inserting a ReLU between them, defines the two-layer NN forward pass, and explains that the hidden layer’s outputs are learned features of the image. It places that move historically (AlexNet 2012, the end of the hand-engineered-features era). It then explains backpropagation as the chain rule applied recursively through a computational graph, walks the canonical f(x,y,z) = (x+y) · z circuit forward and backward by hand, and closes by assembling the complete training loop: forward, loss, backward, step.
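The collapse claim can be checked numerically. The sketch below is illustrative only: the 2-by-2 weights W1 and W2, the input x, and the helper names matvec, matmul and relu are made up for this example, and biases are left out to keep it short. It shows that two stacked linear layers give the same output as the single matrix W2 · W1, and that a ReLU between them breaks that equivalence.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def relu(v):
    """Element-wise max(0, v)."""
    return [max(0.0, v_i) for v_i in v]


# Hypothetical weights and input, chosen only so the arithmetic is easy to follow.
W1 = [[1.0, -2.0], [3.0, 1.0]]
W2 = [[2.0, 1.0], [-1.0, 1.0]]
x = [1.0, 2.0]

# Two linear layers applied one after the other.
stacked = matvec(W2, matvec(W1, x))       # [-1.0, 8.0]

# One linear layer whose weight matrix is the product W2 · W1.
collapsed = matvec(matmul(W2, W1), x)     # [-1.0, 8.0], identical to stacked

# The same two layers with a ReLU in between: no single matrix reproduces this for all x.
with_relu = matvec(W2, relu(matvec(W1, x)))  # [5.0, 5.0]
```

Here the hidden pre-activation is [-3, 5]; the ReLU zeroes the negative entry, which is exactly the step a single matrix multiplication cannot imitate.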

This is lesson 4 of 16, the final lesson of Phase 1. It depends on lessons 2 (the linear classifier we will extend) and 3 (the loss and gradient descent we will now drive through deeper weights). It closes the foundations, after which Phase 2 (How machines see) opens with convolutional neural networks, the architecture built specifically for images. The training loop you assemble here will not change in Phase 2; the forward and backward passes at one specific layer will.
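As a rough picture of that four-step loop, here is a minimal sketch on a deliberately tiny network: one input, one hidden ReLU unit, one output, so each "layer" is a single weight. Everything specific here is an assumption made for illustration, not the lesson's setup: the function name train_tiny_network, the single training pair (x = 1, y = 3), the starting weights, the learning rate, and the squared-error loss (chosen because its gradient is one subtraction).

```python
def train_tiny_network(x=1.0, y=3.0, w1=0.5, w2=1.5, lr=0.1, steps=200):
    """Train s = w2 * max(0, w1 * x) toward the target y with plain gradient descent."""
    losses = []
    for _ in range(steps):
        # 1. Forward: linear, ReLU, linear.
        z = w1 * x
        h = max(0.0, z)
        s = w2 * h

        # 2. Loss: squared error (an assumption for this sketch).
        loss = 0.5 * (s - y) ** 2
        losses.append(loss)

        # 3. Backward: chain rule from the loss back to each weight.
        ds = s - y                    # dL/ds
        dw2 = ds * h                  # s = w2 * h
        dh = ds * w2
        dz = dh if z > 0 else 0.0     # ReLU passes the gradient only where z > 0
        dw1 = dz * x                  # z = w1 * x

        # 4. Step: W <- W - alpha * gradient.
        w1 -= lr * dw1
        w2 -= lr * dw2
    return w1, w2, losses


w1, w2, losses = train_tiny_network()
# After training, w1 * w2 is very close to 3, so the network maps x = 1 to about 3.
```

The loop body has the same four stages regardless of what the layers are; swapping in a different layer changes only the lines under "Forward" and "Backward".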

Prerequisites: lessons 2 and 3 of this track. You need lesson 2’s s = W · x + b in your head and lesson 3’s gradient descent step (W ← W - α * ∇L). Neural Network Intuition (Track 11) is helpful soft background: lessons 8 (“What backpropagation is really doing”) and 9 (“Backpropagation and the chain rule”) cover the same backprop story in a generic NN setting with extra step-by-step intuition.

Math: light, with one small chain-rule pass. The body’s only arithmetic is the worked CS231n circuit f(x,y,z) = (x+y) · z with x = -2, y = 5, z = -4 (forward yields q = 3, f = -12; backward yields [df/dx, df/dy, df/dz] = [-4, -4, 3]). Practice repeats the circuit with fresh numbers and adds a small matrix exercise to verify that stacking two linear layers collapses to one. No calculus beyond multiplying a local derivative by an upstream number is required.
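The worked circuit fits in a few lines of code. The sketch below follows the numbers quoted above; the function name circuit_forward_backward and the variable names are illustrative choices, not taken from the course notes.

```python
def circuit_forward_backward(x, y, z):
    """Forward and backward pass through f(x, y, z) = (x + y) * z."""
    # Forward pass: one add gate, then one multiply gate.
    q = x + y
    f = q * z

    # Backward pass, starting from df/df = 1 at the output.
    # Multiply gate: the local gradient for each input is the other input.
    df_dq = z * 1.0
    df_dz = q * 1.0
    # Add gate: local gradient is 1 for both inputs, so the upstream value passes through.
    df_dx = 1.0 * df_dq
    df_dy = 1.0 * df_dq
    return q, f, (df_dx, df_dy, df_dz)


q, f, grads = circuit_forward_backward(-2.0, 5.0, -4.0)
# q = 3.0, f = -12.0, grads = (-4.0, -4.0, 3.0)
```

Each backward line is a local derivative multiplied by the number arriving from the gate above it, which is the only calculus the lesson asks for.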

  • Explain why stacking linear layers alone is a no-op
  • Write the two-layer NN forward equations and describe what the hidden layer learns
  • Place the shift from hand-engineered to learned features in its historical context
  • State backpropagation in one sentence and explain the local + upstream gradient pattern at every gate
  • Run a small chain-rule circuit forward and backward by hand
  • Read time: about 13 minutes
  • Practice time: about 15 minutes (a 2-by-2 worked exercise showing that two linear layers collapse into one, a fresh chain-rule circuit by hand, a short reasoning question about ReLU’s role, plus flashcards)
  • Difficulty: standard (the math is multiplications, additions, and one max; the conceptual lift is seeing why the non-linearity matters and how backprop’s locality lets it scale)
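The "local + upstream" pattern and the single max mentioned above can be sketched as three small backward rules, one per gate type. The function names and the one-neuron example f = max(0, x · w + b) are assumptions made for illustration; the point is only that each gate needs its own inputs and the number handed down from above.

```python
def add_backward(upstream):
    """Add gate: local gradient is 1 for both inputs."""
    return upstream, upstream


def mul_backward(a, b, upstream):
    """Multiply gate: each input's local gradient is the other input."""
    return b * upstream, a * upstream


def relu_backward(s, upstream):
    """max(0, s) gate: passes the upstream gradient only where s > 0."""
    return upstream if s > 0 else 0.0


def neuron_grads(x, w, b):
    """Forward and backward pass through f = max(0, x * w + b)."""
    m = x * w
    s = m + b
    f = max(0.0, s)

    ds = relu_backward(s, 1.0)        # start from df/df = 1
    dm, db = add_backward(ds)
    dx, dw = mul_backward(x, w, dm)
    return f, (dx, dw, db)


active = neuron_grads(2.0, 3.0, -4.0)   # s = 2 > 0: f = 2.0, gradients (3.0, 2.0, 1.0)
blocked = neuron_grads(2.0, -3.0, 4.0)  # s = -2 < 0: f = 0.0, every gradient is zero
```

No gate looks at the rest of the network, which is why the same rules keep working as the graph grows.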