
Cheatsheet: Neural networks and backpropagation

| Without non-linearity | With ReLU between |
| --- | --- |
| W2 · (W1·x + b1) + b2 = W' · x + b' | W2 · ReLU(W1·x + b1) + b2 does NOT collapse |
| Equivalent to a single linear layer | Genuinely new capacity per added layer |

A non-linearity (ReLU, sigmoid, tanh, …) between linear layers is what makes a network a network.
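
The collapse can be checked numerically. The sketch below uses plain Python lists and made-up toy weights (the values of `W1`, `b1`, `W2`, `b2` are illustrative assumptions, not taken from any trained model). It folds the two linear layers into W' = W2·W1 and b' = W2·b1 + b2, then shows that the ReLU version fails a simple test that every affine map passes: f(x) + f(-x) = 2·f(0).

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrices A and B, both stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def relu(z):
    return [max(0.0, v) for v in z]

# Hypothetical toy sizes: D=2 inputs, H=2 hidden units, K=1 score.
W1 = [[1.0, -1.0], [-1.0, 1.0]]
b1 = [0.5, -0.5]
W2 = [[1.0, 2.0]]
b2 = [1.0]

def two_linear(x):
    # Two stacked linear layers with nothing in between.
    return add(matvec(W2, add(matvec(W1, x), b1)), b2)

def with_relu(x):
    # Same layers with a ReLU between them.
    return add(matvec(W2, relu(add(matvec(W1, x), b1))), b2)

# Fold the two linear layers into one: W' = W2·W1, b' = W2·b1 + b2.
W_prime = matmul(W2, W1)
b_prime = add(matvec(W2, b1), b2)

def collapsed(x):
    return add(matvec(W_prime, x), b_prime)

x = [3.0, 1.0]
neg_x = [-3.0, -1.0]
zero = [0.0, 0.0]

print(two_linear(x), collapsed(x))                    # [-1.5] [-1.5]
print(with_relu(x), with_relu(neg_x), with_relu(zero))  # [3.5] [4.0] [1.5]
```

For the ReLU network, f(x) + f(-x) = 7.5 while 2·f(0) = 3.0, so no single pair (W', b') can reproduce it.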

| Symbol | Meaning |
| --- | --- |
| x | Image as a column vector (e.g. 3072 numbers for CIFAR-10) |
| W1 [H × D] | First layer weights (D pixels, H hidden units) |
| b1 [H] | First layer biases |
| ReLU(z) | max(0, z), elementwise |
| h = ReLU(W1·x + b1) | Hidden layer: H learned features |
| W2 [K × H] | Final classifier weights |
| b2 [K] | Final classifier biases |
| s = W2·h + b2 | K class scores |

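
To make the shapes concrete, here is a minimal forward pass in plain Python. D = 3072 follows the CIFAR-10 example and K = 10 is the number of CIFAR-10 classes; the hidden size H = 100, the random input, and the small random weights are assumptions chosen only for illustration. A practical implementation would use an array library rather than nested lists.

```python
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def forward(x, W1, b1, W2, b2):
    """Return hidden features h and class scores s for one input x."""
    z = [zi + bi for zi, bi in zip(matvec(W1, x), b1)]   # [H]
    h = [max(0.0, zi) for zi in z]                       # ReLU, elementwise, [H]
    s = [si + bi for si, bi in zip(matvec(W2, h), b2)]   # [K]
    return h, s

# D and K match CIFAR-10; H = 100 is a hypothetical hidden size.
D, H, K = 3072, 100, 10
rng = random.Random(0)

def rand_matrix(rows, cols, scale=0.01):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

x = [rng.random() for _ in range(D)]          # stand-in for a flattened image
W1, b1 = rand_matrix(H, D), [0.0] * H         # W1 is [H x D]
W2, b2 = rand_matrix(K, H), [0.0] * K         # W2 is [K x H]

h, s = forward(x, W1, b1, W2, b2)
print(len(h), len(s))  # 100 10
```
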
| In a linear classifier | In a 2-layer NN |
| --- | --- |
| One template per class | Many learned features per class (lifts the one-template limit, so multi-modal classes can be captured) |
| Single hyperplane per class in pixel space | Many hyperplanes in hidden-feature space, combined non-linearly |

| Era | What got designed | What got learned |
| --- | --- | --- |
| Pre-2012 | Features (SIFT, HOG, color histograms) by hand | Classifier weights only |
| Post-2012 (AlexNet on) | Architecture, data, training procedure | Features + classifier weights, end-to-end |

| Pass | What happens at every node (“gate”) |
| --- | --- |
| Forward | Compute output and remember local gradient(s) ∂output/∂input |
| Backward | Receive upstream gradient; multiply by local; send result to each input |
| Net effect | One forward + one backward pass yields gradients for EVERY weight |

Each node is completely local: it does not know the rest of the network.
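
One way to picture this locality is a gate object with a `forward` and a `backward` method. The class names and method signatures below are hypothetical, written for this sketch only. Each gate caches what it needs in the forward pass and, in the backward pass, multiplies the upstream gradient by its own local gradient without looking at anything else in the graph.

```python
class MultiplyGate:
    def forward(self, a, b):
        # Cache the inputs: they are exactly the local gradients needed later.
        self.a, self.b = a, b
        return a * b

    def backward(self, upstream):
        # d(a*b)/da = b and d(a*b)/db = a, each scaled by the upstream gradient.
        return upstream * self.b, upstream * self.a


class AddGate:
    def forward(self, a, b):
        return a + b

    def backward(self, upstream):
        # d(a+b)/da = d(a+b)/db = 1, so the upstream gradient is passed through.
        return upstream, upstream


class ReLUGate:
    def forward(self, a):
        self.a = a
        return max(0.0, a)

    def backward(self, upstream):
        # Gradient flows only where the input was positive.
        return upstream if self.a > 0 else 0.0


mul = MultiplyGate()
print(mul.forward(3.0, -4.0))   # -12.0
print(mul.backward(1.0))        # (-4.0, 3.0)

relu = ReLUGate()
print(relu.forward(-2.0), relu.backward(5.0))  # 0.0 0.0
```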

Worked chain-rule circuit (f(x,y,z) = (x+y)·z)

| Inputs | Forward (q = x + y, f = q·z) | Backward |
| --- | --- | --- |
| x=-2, y=5, z=-4 | q=3, f=-12 | df/dx=-4, df/dy=-4, df/dz=3 |
| x=3, y=-1, z=2 | q=2, f=4 | df/dx=2, df/dy=2, df/dz=2 |

The same recipe, scaled to thousands of gates, is what backprop does in a network.
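
The circuit is small enough to write out by hand. The function below is an illustrative sketch (the name `circuit` and the intermediate variable names are made up here): it runs the forward pass, then walks the chain rule backward one gate at a time, and reproduces both example input sets.

```python
def circuit(x, y, z):
    """Forward and backward pass for f(x, y, z) = (x + y) * z."""
    # Forward pass: compute and keep the intermediate q.
    q = x + y
    f = q * z

    # Backward pass: start from df/df = 1 and apply the chain rule in reverse.
    df_df = 1.0
    df_dq = z * df_df        # multiply gate: local gradient w.r.t. q is z
    df_dz = q * df_df        # multiply gate: local gradient w.r.t. z is q
    df_dx = 1.0 * df_dq      # add gate: local gradient is 1
    df_dy = 1.0 * df_dq      # add gate: local gradient is 1
    return f, (df_dx, df_dy, df_dz)


print(circuit(-2.0, 5.0, -4.0))  # (-12.0, (-4.0, -4.0, 3.0))
print(circuit(3.0, -1.0, 2.0))   # (4.0, (2.0, 2.0, 2.0))
```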

| Method | Cost for N weights | Used for |
| --- | --- | --- |
| Numerical (finite differences) | O(N) forward passes | Gradient check only (sanity-check the analytic gradient) |
| Backprop (analytic, chain rule) | 1 forward + 1 backward | Real training |

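
A gradient check compares the two methods on the same point. The sketch below reuses f(x, y, z) = (x + y)·z as the function under test and assumes centered differences with a step of 1e-5; the helper names and the relative-error formula are illustrative choices, not a fixed standard. Note the loop: every parameter needs its own pair of forward evaluations, which is where the O(N) cost comes from.

```python
def f(w):
    x, y, z = w
    return (x + y) * z

def analytic_grad(w):
    """Gradient from the chain rule: df/dx = z, df/dy = z, df/dz = x + y."""
    x, y, z = w
    return [z, z, x + y]

def numerical_grad(func, w, eps=1e-5):
    """Centered finite differences: two forward evaluations per parameter."""
    grad = []
    for i in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[i] += eps
        w_minus[i] -= eps
        grad.append((func(w_plus) - func(w_minus)) / (2 * eps))
    return grad

def max_relative_error(a, b):
    return max(abs(ai - bi) / max(1e-8, abs(ai) + abs(bi)) for ai, bi in zip(a, b))

w = [-2.0, 5.0, -4.0]
error = max_relative_error(analytic_grad(w), numerical_grad(f, w))
print("gradient check passed" if error < 1e-6 else "gradient check FAILED")
```
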
| Step | What |
| --- | --- |
| 1. Forward pass | h = ReLU(W1·x + b1); s = W2·h + b2 |
| 2. Loss | SVM or softmax/cross-entropy + regularization |
| 3. Backward pass | Backprop the chain rule: gradients for W1, b1, W2, b2 |
| 4. Step | W ← W - α * ∇L for every weight |

Repeat on the next mini-batch. Every classifier in this track runs this cycle.
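
The four steps can be sketched end to end in plain Python. Everything specific here is an assumption made for illustration: the XOR toy data, the sizes (D=2, H=8, K=2), the softmax/cross-entropy loss with L2 regularization on the weights, the learning rate, and the use of the whole dataset as a single batch instead of mini-batches. The function and variable names are hypothetical, and nested lists stand in for what would normally be arrays.

```python
import math
import random

def forward(x, p):
    """Step 1: h = ReLU(W1·x + b1); s = W2·h + b2."""
    z = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(p["W1"], p["b1"])]
    h = [max(0.0, zi) for zi in z]
    s = [sum(w * hi for w, hi in zip(row, h)) + b for row, b in zip(p["W2"], p["b2"])]
    return h, s

def softmax(s):
    m = max(s)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in s]
    total = sum(e)
    return [v / total for v in e]

def loss_and_grads(p, X, Y, reg):
    H, D, K = len(p["W1"]), len(p["W1"][0]), len(p["W2"])
    g = {"W1": [[0.0] * D for _ in range(H)], "b1": [0.0] * H,
         "W2": [[0.0] * H for _ in range(K)], "b2": [0.0] * K}
    loss, n = 0.0, len(X)
    for x, y in zip(X, Y):
        h, s = forward(x, p)                       # forward pass, h is the cache
        probs = softmax(s)
        loss += -math.log(probs[y]) / n            # Step 2: cross-entropy loss
        # Step 3: backward pass, starting from dL/ds = (probs - one_hot(y)) / n.
        ds = [(pk - (1.0 if k == y else 0.0)) / n for k, pk in enumerate(probs)]
        for k in range(K):
            g["b2"][k] += ds[k]
            for j in range(H):
                g["W2"][k][j] += ds[k] * h[j]
        for j in range(H):
            dh = sum(ds[k] * p["W2"][k][j] for k in range(K))
            dz = dh if h[j] > 0 else 0.0           # ReLU passes gradient only where active
            g["b1"][j] += dz
            for i in range(D):
                g["W1"][j][i] += dz * x[i]
    for name in ("W1", "W2"):                      # L2 regularization on weights only
        for r, row in enumerate(p[name]):
            for c, w in enumerate(row):
                loss += 0.5 * reg * w * w
                g[name][r][c] += reg * w
    return loss, g

def sgd_step(p, g, alpha):
    """Step 4: W <- W - alpha * grad for every weight and bias."""
    for name in ("W1", "W2"):
        for r, row in enumerate(p[name]):
            for c in range(len(row)):
                row[c] -= alpha * g[name][r][c]
    for name in ("b1", "b2"):
        for i in range(len(p[name])):
            p[name][i] -= alpha * g[name][i]

def init_params(D, H, K, rng):
    return {"W1": [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(H)],
            "b1": [0.0] * H,
            "W2": [[rng.uniform(-1, 1) for _ in range(H)] for _ in range(K)],
            "b2": [0.0] * K}

# Toy XOR data: not separable by any single linear layer.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
Y = [0, 1, 1, 0]

params = init_params(D=2, H=8, K=2, rng=random.Random(0))
losses = []
for step in range(300):
    loss, grads = loss_and_grads(params, X, Y, reg=1e-4)
    losses.append(loss)
    sgd_step(params, grads, alpha=0.2)

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The backward pass reuses `h` from the forward pass, which is why the forward pass has to cache it.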

| Pitfall | Reality |
| --- | --- |
| Deep means “smarter per layer” | Each layer is linear + non-linearity; depth = composition |
| Skip the non-linearity | Two linears without it = one linear; the extra layer is a no-op |
| Backprop is magic | It is the chain rule applied locally at every gate, in reverse order |
| Mixing up forward/backward | Forward: compute outputs + cache. Backward: compute gradients using the cache |

Add a non-linearity between linear layers so the hidden layer can learn features, then use backprop (the chain rule applied through the graph in one backward pass) to get the gradient for every weight at once.