Summary: Loss and optimization

Lesson 2’s classifier has knobs (W and b) and no way to set them. Training turns those knobs by minimizing a single number, the loss, that measures how wrong the current predictions are. Two standard losses dominate: multiclass SVM (the correct class’s score must beat every other class’s score by at least 1) and softmax / cross-entropy (the negative log of the probability the model assigned to the correct class). Add a regularization penalty (λ * Σ W²) to prefer simpler weights that generalize, and you have a loss to drive down. The procedure that drives it down is gradient descent: compute the steepest-uphill direction (the gradient) and step the opposite way. In practice we use mini-batch stochastic gradient descent (SGD), sampling 32 to 256 images per step. That is the training loop, and it is essentially how every classifier in this track learns.
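
The two losses and the penalty can be tried on a single image with a few lines of NumPy. This is a minimal sketch, not the lesson's own code: the function names (svm_loss, softmax_loss, l2_penalty), the three example scores, the choice of correct class, and the λ value are all illustrative assumptions.

```python
import numpy as np

def svm_loss(scores, y, margin=1.0):
    """Multiclass SVM loss for one image: sum of margin violations."""
    margins = np.maximum(0.0, scores - scores[y] + margin)
    margins[y] = 0.0  # the correct class is excluded from the sum (j != y)
    return float(margins.sum())

def softmax_loss(scores, y):
    """Cross-entropy loss for one image: -log of the correct class's probability."""
    shifted = scores - scores.max()  # stability shift; the probabilities are unchanged
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[y])

def l2_penalty(W, lam):
    """Regularization term: lam times the sum of squared weights."""
    return float(lam * np.sum(W * W))

scores = np.array([3.2, 5.1, -1.7])  # hypothetical class scores for one image
y = 0                                # index of the correct class (assumed)

print(svm_loss(scores, y))                 # about 2.9: only class 1 violates the margin
print(softmax_loss(scores, y))             # about 2.04
print(l2_penalty(np.ones((3, 4)), 0.1))    # about 1.2: 0.1 * 12 squared weights
```

Both losses are large here because a wrong class (index 1) outscores the correct one. They differ in what happens once the correct class leads by the margin: the SVM loss reaches exactly zero, while the softmax loss keeps shrinking without reaching it.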

Core ideas

Loss = one number to minimize. Each image gets a loss L_i, and the training loss is the average of L_i over the training set. SVM: L_i = Σ_{j ≠ y} max(0, s_j - s_y + 1); demands a margin of 1. Softmax / cross-entropy: L_i = -log(p_y) where p_j = exp(s_j) / Σ_k exp(s_k). Either trains the same linear classifier; modern image classifiers usually use softmax.
Regularization prefers simpler W. Add λ * Σ W² to the data loss, so the optimizer minimizes L = L_data + λ * Σ W². The payoff is better generalization to unseen images, the only kind of accuracy that matters.
Gradient descent is the engine. The gradient ∇L is the steepest-uphill direction in W-space; the update rule is W ← W - α * ∇L. In CS231n’s three-strategy ladder on CIFAR-10, random search gets ~15.5 percent accuracy, random local search ~21.4 percent, and gradient descent is dramatically better.
Learning rate (α) is the step size, and it matters as much as the direction. Too small: progress crawls. Too large: a step overshoots and the loss climbs (in CS231n’s CIFAR-10 example, the loss drops from 2.20 to 1.65 at a good rate and jumps past 2500 at one that is too large).
Mini-batch SGD makes it cheap. Each step uses a small batch (commonly 32 to 256, powers of two for hardware), giving a noisy but unbiased estimate of the full gradient and many more steps per wall-clock minute. The gradient itself comes from calculus: analytic gradients compute the true gradient, while numerical gradients (finite differences) serve only as a sanity check during development.
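
Two of these ideas are easy to check numerically: that an analytic gradient agrees with a finite-difference one, and that the step size decides whether an update helps or hurts. The sketch below does both for a softmax loss with the λ * Σ W² penalty. Everything in it is an assumption made for illustration: the random data stands in for images, the bias b is left out to keep the check short, and the names loss_and_grad and numerical_grad and the three α values are hypothetical.

```python
import numpy as np

def loss_and_grad(W, X, y, lam):
    """Mean softmax loss plus lam * sum(W**2), with its analytic gradient w.r.t. W."""
    n = X.shape[0]
    scores = X @ W
    scores = scores - scores.max(axis=1, keepdims=True)   # stability shift
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    loss = -log_probs[np.arange(n), y].mean() + lam * np.sum(W * W)
    dscores = np.exp(log_probs)            # predicted probabilities
    dscores[np.arange(n), y] -= 1.0        # gradient w.r.t. scores: p - one_hot(y)
    grad = X.T @ dscores / n + 2.0 * lam * W
    return loss, grad

def numerical_grad(W, X, y, lam, h=1e-5):
    """Centered finite differences, one weight at a time (slow; sanity check only)."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        old = W[idx]
        W[idx] = old + h
        plus, _ = loss_and_grad(W, X, y, lam)
        W[idx] = old - h
        minus, _ = loss_and_grad(W, X, y, lam)
        W[idx] = old                        # restore the weight
        grad[idx] = (plus - minus) / (2 * h)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))                # 20 fake "images", 5 features each
y = rng.integers(0, 3, size=20)             # 3 classes
W = 0.01 * rng.normal(size=(5, 3))
lam = 0.1

loss, analytic = loss_and_grad(W, X, y, lam)
numeric = numerical_grad(W, X, y, lam)
max_diff = np.max(np.abs(analytic - numeric))
print("largest analytic-vs-numerical difference:", max_diff)

# One update W <- W - alpha * grad at three step sizes.
new_losses = {}
for alpha in (1e-3, 1e-1, 1e3):
    new_losses[alpha], _ = loss_and_grad(W - alpha * analytic, X, y, lam)
    print(f"alpha={alpha:g}: loss {loss:.4f} -> {new_losses[alpha]:.4f}")
```

On this toy problem the smallest step should barely move the loss, the middle one should lower it noticeably, and the largest should send it far above where it started. The specific values of α that count as "too small" or "too large" depend on the data and the model, so these three are not a recommendation.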

What changes for you

When you read that a model “trained for ten hours” or “converged after 50,000 steps,” what actually happened was the loss-and-optimization loop running tens of thousands of times: forward pass to get scores, loss to score the prediction, gradient of the loss with respect to every weight, one small step against the gradient, repeat on the next mini-batch. Convergence means the loss flattened, not that the model is provably best. Two runs from different random starts can land at different parameter values. Loss curves in papers are this same L_data + λ * Σ W² plotted over training steps, and a model “overfits” when its training loss keeps dropping while its accuracy on unseen images stops improving (the gap regularization exists to limit). If you came from Neural Network Intuition, this is the same cost-landscape and gradient-descent loop, now applied to a vision classifier rather than the generic network there.
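
The loop described above (forward pass, loss, gradient, small step, next mini-batch) fits in a short script. What follows is a sketch under stated assumptions rather than a real vision pipeline: the data is synthetic and labeled by a hidden linear rule, the classifier is the linear softmax model with the λ * Σ W² penalty applied to W only, and the hyperparameters (α = 0.1, λ = 1e-3, batch size 64, 500 steps) and the helper name loss_and_grads are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for an image dataset (assumption): 600 "images" with
# 10 features each, labeled by a hidden linear rule so a linear model can fit them.
n, dim, num_classes = 600, 10, 3
X = rng.normal(size=(n, dim))
y = np.argmax(X @ rng.normal(size=(dim, num_classes)), axis=1)

W = 0.01 * rng.normal(size=(dim, num_classes))   # small random start
b = np.zeros(num_classes)
alpha, lam, batch_size, steps = 0.1, 1e-3, 64, 500   # illustrative hyperparameters

def loss_and_grads(W, b, Xb, yb):
    m = Xb.shape[0]
    scores = Xb @ W + b                                   # 1. forward pass
    scores = scores - scores.max(axis=1, keepdims=True)   # stability shift
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    loss = -log_probs[np.arange(m), yb].mean() + lam * np.sum(W * W)   # 2. loss
    dscores = np.exp(log_probs)                           # 3. gradient
    dscores[np.arange(m), yb] -= 1.0
    dW = Xb.T @ dscores / m + 2.0 * lam * W
    db = dscores.mean(axis=0)
    return loss, dW, db

history = []
for step in range(steps):
    batch = rng.choice(n, size=batch_size, replace=False)  # sample a mini-batch
    loss, dW, db = loss_and_grads(W, b, X[batch], y[batch])
    W -= alpha * dW                                        # 4. step against the gradient
    b -= alpha * db
    history.append(loss)

train_accuracy = np.mean(np.argmax(X @ W + b, axis=1) == y)
print(f"first loss {history[0]:.3f}, mean of last 50 {np.mean(history[-50:]):.3f}, "
      f"train accuracy {train_accuracy:.3f}")
```

The recorded losses are per-batch values, so the curve is noisy from step to step even while its trend falls. The accuracy printed at the end is measured on the training data; judging generalization would need a held-out set, which this sketch does not create.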

Loss measures wrongness; gradient descent walks downhill on it; SGD does the walk cheaply. Run as a four-step cycle (forward pass, loss, gradient, update), that is how learning actually happens.