| Step | What happens |
|---|---|
| 1. Forward pass | Compute scores s = W · x + b for a mini-batch of images |
| 2. Loss | Compute L_i per image; average over the batch; add regularization |
| 3. Gradient | Compute ∇L with respect to every weight (analytic, via backprop) |
| 4. Step | W ← W - α * ∇L; move to the next mini-batch |
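
The four steps above can be sketched as one NumPy function. This is a minimal sketch, assuming a softmax loss with L2 regularization on toy random data; the name `train_step`, the array shapes, and the values of `alpha` and `lam` are illustrative assumptions, not taken from these notes.

```python
import numpy as np

def train_step(W, b, X, y, alpha=0.1, lam=1e-3):
    """One mini-batch update for a linear softmax classifier (illustrative).

    W: (C, D) weights, b: (C,) biases, X: (N, D) images, y: (N,) integer labels.
    """
    N = X.shape[0]
    # 1. Forward pass: scores s = W · x + b for every image in the batch.
    scores = X @ W.T + b                                    # shape (N, C)
    # 2. Loss: softmax cross-entropy averaged over the batch, plus L2 penalty.
    shifted = scores - scores.max(axis=1, keepdims=True)    # numerical stability
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(N), y]).mean() + lam * np.sum(W ** 2)
    # 3. Gradient: analytic gradient of the loss with respect to W and b.
    dscores = probs.copy()
    dscores[np.arange(N), y] -= 1.0
    dscores /= N
    dW = dscores.T @ X + 2.0 * lam * W
    db = dscores.sum(axis=0)
    # 4. Step: move against the gradient.
    return W - alpha * dW, b - alpha * db, loss

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 20))       # toy "images": 64 samples, 20 features
y = rng.integers(0, 3, size=64)     # 3 classes
W, b = np.zeros((3, 20)), np.zeros(3)
losses = []
for _ in range(50):                 # toy loop: reuses one batch for brevity
    W, b, loss = train_step(W, b, X, y)
    losses.append(loss)
```

A real training loop would draw a different mini-batch on each iteration; the single reused batch here only keeps the sketch short.
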

| Loss | Formula (one image) | Intuition |
|---|---|---|
| Multiclass SVM (hinge) | L_i = Σ_{j ≠ y} max(0, s_j - s_y + 1) | Correct class must beat every other by at least 1 |
| Softmax / cross-entropy | p_j = exp(s_j) / Σ exp(s_k); L_i = -log(p_y) | Confidence on the correct class (probability interpretation) |

Both losses train the same linear score function s = W · x + b; they differ only in how the scores are turned into a penalty. Modern image classifiers usually use softmax.

| Loss | Computation | Value |
|---|---|---|
| SVM | max(0, 0.3-0.8+1) + max(0, -0.4-0.8+1) = 0.5 + 0 | 0.5 |
| Softmax | exp ≈ [2.226, 1.350, 0.670], sum ≈ 4.246, p_X ≈ 0.524, -ln(0.524) | ≈ 0.646 |

Same prediction, same qualitative direction (push the correct class's score up relative to the others); different loss values.
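
The arithmetic in the table can be reproduced in a few lines. This sketch assumes the scores are [0.8, 0.3, -0.4] with the correct class at index 0, which is inferred from the numbers in the table rather than stated in the notes; the function names are illustrative.

```python
import numpy as np

def svm_loss(scores, y, margin=1.0):
    """Multiclass hinge loss for one image: sum over j != y of max(0, s_j - s_y + margin)."""
    margins = np.maximum(0.0, scores - scores[y] + margin)
    margins[y] = 0.0                      # the correct class is excluded from the sum
    return margins.sum()

def softmax_loss(scores, y):
    """Cross-entropy loss for one image: -log(p_y), with p = softmax(scores)."""
    shifted = scores - scores.max()       # subtracting the max leaves p unchanged
    probs = np.exp(shifted) / np.exp(shifted).sum()
    return -np.log(probs[y])

scores = np.array([0.8, 0.3, -0.4])       # assumed scores; class 0 is the correct one
print(round(svm_loss(scores, 0), 3))      # 0.5
print(round(softmax_loss(scores, 0), 3))  # 0.646
```
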

| Term | Form | Knob |
|---|---|---|
| L2 (most common) | λ * Σ W² | λ controls weight-shrink strength |
| Total loss | L = L_data + λ * Σ W² | What the optimizer drives down |

Larger λ → smaller W → simpler model → better generalization, within reason: a λ that is too large makes the model underfit.
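
As a small illustration of the L2 term, the sketch below computes the total loss and shows why λ is described as a weight-shrink knob: a gradient step on the penalty alone scales every weight by (1 - 2αλ). The function names and the example values of `W`, `lam`, and `alpha` are assumptions chosen for illustration.

```python
import numpy as np

def total_loss(data_loss, W, lam):
    """L = L_data + lam * sum(W**2)."""
    return data_loss + lam * np.sum(W ** 2)

def l2_grad(W, lam):
    """Gradient of the penalty term lam * sum(W**2) with respect to W."""
    return 2.0 * lam * W

W = np.array([[1.0, -2.0], [0.5, 0.0]])   # sum(W**2) = 5.25
lam, alpha = 0.1, 0.1

print(total_loss(0.5, W, lam))            # 0.5 + 0.1 * 5.25, about 1.025

# A gradient step on the penalty alone multiplies W by (1 - 2 * alpha * lam).
W_shrunk = W - alpha * l2_grad(W, lam)
```
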

| Concept | One line |
|---|---|
| Gradient ∇L | Direction of steepest ascent of the loss in W-space |
| Update rule | W ← W - α * ∇L (step in the negative gradient direction) |
| Learning rate α | Step size; one of the most consequential hyperparameters |
| Too small α | Loss crawls down |
| Too large α | Overshoot; loss can climb (CIFAR-10: 2.20 → 1.65 vs > 2500) |
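
The effect of the learning rate is easy to see on a toy problem. The sketch below runs the update rule on the one-dimensional loss L(w) = w², which stands in for the classifier loss; it is not the CIFAR-10 experiment quoted in the table, and the three α values are arbitrary choices for illustration.

```python
def gradient_descent(alpha, w0=1.0, steps=20):
    """Minimize the toy loss L(w) = w**2 (gradient 2*w) with a fixed step size."""
    w = w0
    history = [w ** 2]
    for _ in range(steps):
        grad = 2.0 * w            # dL/dw
        w = w - alpha * grad      # step against the gradient
        history.append(w ** 2)
    return history

too_small = gradient_descent(alpha=0.001)   # loss barely moves
reasonable = gradient_descent(alpha=0.3)    # loss drops quickly
too_large = gradient_descent(alpha=1.1)     # every step overshoots; loss grows

print(too_small[-1], reasonable[-1], too_large[-1])
```

On this loss each update multiplies w by (1 - 2α), so any α above 1 makes |w| grow at every step.
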

| Strategy | Accuracy |
|---|---|
| Random search | ~15.5% |
| Random local search | ~21.4% |
| Gradient descent | Dramatically better |

| Method | Pros | Cons | Use |
|---|---|---|---|
| Analytic (calculus) | Fast, exact | Error-prone to derive | Real training |
| Numerical (finite difference: [L(W+h) - L(W)] / h) | Simple | Approximate, very slow | Gradient check (sanity-check the analytic) |
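
A gradient check compares the two methods on the same weights. The sketch below uses a toy loss with a known derivative as a stand-in for a classifier loss; `loss`, `analytic_grad`, `numerical_grad`, and the step `h = 1e-5` are illustrative assumptions.

```python
import numpy as np

def loss(W):
    """Toy loss with a known analytic gradient (stand-in for a real loss)."""
    return np.sum(W ** 2) + np.sum(np.sin(W))

def analytic_grad(W):
    """Derivative of the toy loss, worked out by hand."""
    return 2.0 * W + np.cos(W)

def numerical_grad(f, W, h=1e-5):
    """Forward difference [f(W + h*e_i) - f(W)] / h, one weight at a time."""
    grad = np.zeros_like(W)
    base = f(W)
    for idx in np.ndindex(*W.shape):
        W_plus = W.copy()
        W_plus[idx] += h              # perturb a single weight
        grad[idx] = (f(W_plus) - base) / h
    return grad

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
diff = np.max(np.abs(numerical_grad(loss, W) - analytic_grad(W)))
print(diff)   # small, on the order of h
```

The loop needs one extra loss evaluation per weight, which is why the numerical method is reserved for checking rather than training.
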

| Item | Detail |
|---|---|
| Batch size | Commonly 32, 64, 128, 256 (powers of 2 for GPU efficiency) |
| Why it works | Mini-batch gradient is a noisy but unbiased estimate of the full gradient |
| Strict SGD | Batch size 1; “SGD” in current usage covers all mini-batch sizes |
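
The "noisy but unbiased" claim can be checked numerically. The sketch below uses a least-squares loss as a simple stand-in for the classifier loss (an assumption made for brevity) and compares mini-batch gradients with the full-data gradient; the data sizes and names are illustrative.

```python
import numpy as np

def grad(w, X, y):
    """Gradient of the mean squared error mean((X @ w - y)**2) with respect to w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 10))
y = rng.normal(size=512)
w = rng.normal(size=10)

full = grad(w, X, y)                          # gradient over all 512 examples

batch_size = 64
perm = rng.permutation(len(y))                # shuffle, then cut into 8 batches
batch_grads = [
    grad(w, X[perm[i:i + batch_size]], y[perm[i:i + batch_size]])
    for i in range(0, len(y), batch_size)
]

# Any single batch gradient deviates from the full gradient (noise) ...
noise = np.linalg.norm(batch_grads[0] - full)
# ... but the average over all equally sized batches recovers it.
avg_gap = np.linalg.norm(np.mean(batch_grads, axis=0) - full)
print(noise, avg_gap)
```
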

| Pitfall | Reality |
|---|---|
| Gradient = step | Gradient is the direction; step length is the learning rate (and we go the OTHER way) |
| Small training loss = good model | What counts is unseen-image accuracy; regularization exists for this gap |
| Bigger learning rate = faster | Past a threshold, it overshoots and loss climbs |
| SVM and softmax are the same | Different formulas, different numbers; softmax is the modern default |

A classifier learns by repeating one cycle on each mini-batch of images: turn the scores into a loss number, compute the gradient (the steepest-uphill direction in weight space), and take one small step against it.