| Without non-linearity | With ReLU between |
|---|---|
| W2 · (W1·x + b1) + b2 = W' · x + b', with W' = W2·W1 and b' = W2·b1 + b2 | W2 · ReLU(W1·x + b1) + b2 does NOT collapse |
| Equivalent to a single linear layer | Genuinely new capacity per added layer |

A non-linearity (ReLU, sigmoid, tanh, …) between linear layers is what keeps a stack of layers from collapsing into a single linear map.

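
The collapse can be checked numerically. The sketch below uses plain Python lists and made-up weights (the helper names `matvec`, `matmul`, `add`, and `relu` are illustrative, not from any library) to compare two stacked linear layers against the single layer W' = W2·W1, b' = W2·b1 + b2, and then against the same weights with ReLU in between.

```python
# Two stacked linear layers vs. the single layer they collapse into.
# All numbers are made-up illustration values.

def matvec(W, x):
    """Matrix-vector product W·x for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix product A·B for list-of-rows matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    """Elementwise vector sum."""
    return [a + b for a, b in zip(u, v)]

def relu(v):
    """Elementwise max(0, v)."""
    return [max(0.0, a) for a in v]

W1 = [[1.0, -2.0], [0.5, 1.0], [-1.0, 3.0]]   # shape [H=3, D=2]
b1 = [0.0, 1.0, -1.0]
W2 = [[2.0, 0.0, 1.0], [-1.0, 1.0, 0.5]]      # shape [K=2, H=3]
b2 = [0.5, -0.5]
x = [1.0, 2.0]

# Without a non-linearity: W2·(W1·x + b1) + b2
stacked = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# The equivalent single layer: W' = W2·W1, b' = W2·b1 + b2
W_prime = matmul(W2, W1)
b_prime = add(matvec(W2, b1), b2)
collapsed = add(matvec(W_prime, x), b_prime)

# With ReLU in between, the same weights give a different result.
with_relu = add(matvec(W2, relu(add(matvec(W1, x), b1))), b2)

print(stacked)    # [-1.5, 8.0]
print(collapsed)  # [-1.5, 8.0]
print(with_relu)  # [4.5, 5.0]
```

The first two outputs agree for any input `x`, because the composition of two affine maps is itself affine; the ReLU version differs as soon as at least one hidden pre-activation is negative.
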
| Symbol | Meaning |
|---|---|
| x | Image as a column vector (e.g. 3072 numbers for CIFAR-10) |
| W1 [H × D] | First layer weights (D input values, H hidden units) |
| b1 [H] | First layer biases |
| ReLU(z) | max(0, z), elementwise |
| h = ReLU(W1·x + b1) | Hidden layer: H learned features |
| W2 [K × H] | Final classifier weights |
| b2 [K] | Final classifier biases |
| s = W2·h + b2 | K class scores |

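
As a minimal sketch of how these symbols fit together, the forward pass can be written in a few lines of plain Python. The sizes (D=4, H=3, K=2) and all weight values are hypothetical toy numbers chosen so the result is easy to check by hand; `affine` and `two_layer_forward` are illustrative names.

```python
def relu(z):
    """ReLU(z) = max(0, z), applied elementwise."""
    return [max(0.0, v) for v in z]

def affine(W, x, b):
    """W·x + b for a list-of-rows matrix W and vectors x, b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def two_layer_forward(x, W1, b1, W2, b2):
    h = relu(affine(W1, x, b1))   # hidden layer: H learned features
    s = affine(W2, h, b2)         # K class scores
    return h, s

# Toy sizes (hypothetical): D=4 inputs, H=3 hidden units, K=2 classes.
x = [1.0, 0.0, -1.0, 2.0]
W1 = [[0.5, 0.0, 0.5, 0.0],
      [-1.0, 1.0, 0.0, 0.0],
      [0.0, 0.0, 1.0, 1.0]]       # shape [H, D]
b1 = [0.0, 0.5, -0.5]             # shape [H]
W2 = [[1.0, 2.0, 3.0],
      [-1.0, 0.0, 1.0]]           # shape [K, H]
b2 = [0.5, 0.0]                   # shape [K]

h, s = two_layer_forward(x, W1, b1, W2, b2)
print(h)  # [0.0, 0.0, 0.5]
print(s)  # [2.0, 0.5]
```

Two of the three hidden units are switched off by ReLU for this input, so only the third feature contributes to the class scores.
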
| In a linear classifier | In a 2-layer NN |
|---|---|
| One template per class | Many learned features per class (lifts the one-template limit on multi-modal classes) |
| Single hyperplane per class in pixel space | Many hyperplanes in hidden-feature space, combined non-linearly |

| Era | What got designed | What got learned |
|---|---|---|
| Pre-2012 | Features (SIFT, HOG, color histograms) by hand | Classifier weights only |
| Post-2012 (AlexNet on) | Architecture, data, training procedure | Features + classifier weights, end-to-end |

| Pass | What happens at every node (“gate”) |
|---|---|
| Forward | Compute output and remember local gradient(s) ∂output/∂input |
| Backward | Receive the upstream gradient; multiply it by the local gradient; send the result to each input |
| Net effect | One forward + one backward pass yields gradients for EVERY weight |

Each node is completely local: it needs only its own inputs, its output, and the upstream gradient, and knows nothing about the rest of the network.

| Inputs | Forward (q = x + y, f = q·z) | Backward |
|---|---|---|
| x=-2, y=5, z=-4 | q=3, f=-12 | df/dx=-4, df/dy=-4, df/dz=3 |
| x=3, y=-1, z=2 | q=2, f=4 | df/dx=2, df/dy=2, df/dz=2 |

The same recipe, scaled to thousands of gates, is what backprop does in a network.

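
The two rows of the worked example can be reproduced with a short sketch. It assumes the graph is f = (x + y)·z with the intermediate q = x + y, which is what the table values imply; the function name `forward_backward` is illustrative.

```python
def forward_backward(x, y, z):
    """Forward and backward pass through f = (x + y) * z."""
    # Forward pass: compute outputs and keep what the backward pass needs.
    q = x + y            # add gate
    f = q * z            # multiply gate

    # Backward pass: start from df/df = 1 and walk the graph in reverse.
    df_df = 1.0
    df_dq = z * df_df    # multiply gate: local gradient w.r.t. q is z
    df_dz = q * df_df    # multiply gate: local gradient w.r.t. z is q
    df_dx = 1.0 * df_dq  # add gate: local gradient is 1, so it passes
    df_dy = 1.0 * df_dq  # the upstream gradient through unchanged
    return q, f, (df_dx, df_dy, df_dz)

print(forward_backward(-2, 5, -4))  # (3, -12, (-4.0, -4.0, 3.0))
print(forward_backward(3, -1, 2))   # (2, 4, (2.0, 2.0, 2.0))
```

Each gate uses only its own inputs and the gradient arriving from downstream, which is the locality property described above.
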
| Method | Cost for N weights | Used for |
|---|---|---|
| Numerical (finite differences) | O(N) forward passes | Gradient check only (sanity-check the analytic gradient) |
| Backprop (analytic, chain rule) | 1 forward + 1 backward | Real training |

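
A gradient check compares the two methods on the same parameters. The sketch below does this for the small function f = (x + y)·z (an assumed example, chosen because its analytic gradient is easy to write down); it uses centered differences, which cost two function evaluations per parameter, so the cost still grows linearly with N. The names `numerical_grad` and `analytic_grad` are illustrative.

```python
def f(params):
    """Toy function f = (x + y) * z."""
    x, y, z = params
    return (x + y) * z

def analytic_grad(params):
    """Chain-rule result for f = (x + y) * z: [df/dx, df/dy, df/dz]."""
    x, y, z = params
    q = x + y
    return [z, z, q]

def numerical_grad(func, params, eps=1e-5):
    """Centered finite differences: two evaluations of func per parameter."""
    grad = []
    for i in range(len(params)):
        plus = list(params)
        plus[i] += eps
        minus = list(params)
        minus[i] -= eps
        grad.append((func(plus) - func(minus)) / (2 * eps))
    return grad

params = [-2.0, 5.0, -4.0]
num = numerical_grad(f, params)
ana = analytic_grad(params)
max_err = max(abs(n - a) for n, a in zip(num, ana))
print(ana)             # [-4.0, -4.0, 3.0]
print(max_err < 1e-6)  # True
```

For a network with millions of weights the numerical loop is far too slow for training, which is why it is used only to verify a backprop implementation on a handful of parameters.
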
| Step | What happens |
|---|---|
| 1. Forward pass | h = ReLU(W1·x + b1); s = W2·h + b2 |
| 2. Loss | SVM or softmax/cross-entropy + regularization |
| 3. Backward pass | Backprop the chain rule: gradients for W1, b1, W2, b2 |
| 4. Step | W ← W - α * ∇L for every weight (α is the learning rate) |

Repeat on the next mini-batch. Every classifier in this track runs this cycle.

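
The four steps can be sketched end to end in plain Python. Everything here is an illustrative assumption: the toy XOR-style batch, the starting weights, the learning rate, and the function name `train_step`. It uses softmax cross-entropy, leaves out regularization for brevity, and reuses one batch on every step instead of drawing a new mini-batch.

```python
import math

def affine(W, x, b):
    """W·x + b for a list-of-rows matrix W and vectors x, b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def softmax(s):
    """Numerically stable softmax over a list of scores."""
    m = max(s)
    e = [math.exp(v - m) for v in s]
    total = sum(e)
    return [v / total for v in e]

def train_step(params, batch, lr):
    """One cycle: forward, loss, backward, step. Returns the mean batch loss."""
    W1, b1, W2, b2 = params
    H, D, K = len(W1), len(W1[0]), len(W2)
    # Gradient accumulators with the same shapes as the parameters.
    gW1 = [[0.0] * D for _ in range(H)]
    gb1 = [0.0] * H
    gW2 = [[0.0] * H for _ in range(K)]
    gb2 = [0.0] * K
    loss = 0.0
    for x, label in batch:
        # 1. Forward pass (z and h are cached for the backward pass).
        z = affine(W1, x, b1)
        h = [max(0.0, v) for v in z]
        s = affine(W2, h, b2)
        # 2. Loss: softmax cross-entropy.
        p = softmax(s)
        loss += -math.log(p[label])
        # 3. Backward pass: chain rule, gate by gate, in reverse order.
        ds = [p[k] - (1.0 if k == label else 0.0) for k in range(K)]
        dh = [sum(W2[k][j] * ds[k] for k in range(K)) for j in range(H)]
        dz = [dh[j] if z[j] > 0 else 0.0 for j in range(H)]  # ReLU gate
        for k in range(K):
            gb2[k] += ds[k]
            for j in range(H):
                gW2[k][j] += ds[k] * h[j]
        for j in range(H):
            gb1[j] += dz[j]
            for i in range(D):
                gW1[j][i] += dz[j] * x[i]
    # 4. Step: W <- W - lr * grad, using the batch-averaged gradient.
    n = len(batch)
    for j in range(H):
        b1[j] -= lr * gb1[j] / n
        for i in range(D):
            W1[j][i] -= lr * gW1[j][i] / n
    for k in range(K):
        b2[k] -= lr * gb2[k] / n
        for j in range(H):
            W2[k][j] -= lr * gW2[k][j] / n
    return loss / n

# Toy batch (hypothetical): XOR-style labels on 2-D inputs.
batch = [([0.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 0.0], 1), ([1.0, 1.0], 0)]
params = (
    [[0.5, -0.4], [-0.3, 0.6], [0.4, 0.3]],   # W1 [H x D]
    [0.1, 0.1, 0.1],                          # b1 [H]
    [[0.3, -0.2, 0.1], [-0.1, 0.4, -0.3]],    # W2 [K x H]
    [0.0, 0.0],                               # b2 [K]
)
losses = [train_step(params, batch, lr=0.1) for _ in range(200)]
print(losses[0], losses[-1])  # the loss should go down over the 200 steps
```

In practice the inner loops would be replaced by vectorized array operations, but the order of the four steps stays the same.
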
| Pitfall | Reality |
|---|---|
| Deep means “smarter per layer” | Each layer is linear + non-linearity; depth = composition |
| Skip the non-linearity | Two linears without it = one linear; the extra layer is a no-op |
| Backprop is magic | It is the chain rule applied locally at every gate, in reverse order |
| Mixing up forward/backward | Forward: compute outputs + cache. Backward: compute gradients using the cache |

Add a non-linearity between linear layers so the hidden layer can learn features, then use backprop (the chain rule applied through the graph in one backward pass) to get the gradient for every weight at once.