Cheatsheet: Loss and optimization

| Step | What happens |
| --- | --- |
| 1. Forward pass | Compute scores s = W · x + b for a mini-batch of images |
| 2. Loss | Compute L_i per image; average over the batch; add regularization |
| 3. Gradient | Compute ∇L with respect to every weight (analytic, via backprop) |
| 4. Step | W ← W - α * ∇L; move to the next mini-batch |
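
The four steps above can be sketched as a short NumPy loop. This is a minimal illustration, not a reference implementation: the data is three synthetic 2-D Gaussian blobs standing in for images, the loss is softmax with an L2 term, and the hyperparameter values (`alpha`, `lam`, `batch_size`) are assumptions chosen for the toy problem. Examples are stored as rows, so the scores are written `X @ W + b` rather than W · x + b.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for image data: 3 well-separated Gaussian blobs in 2-D.
num_classes, per_class, dim = 3, 100, 2
centers = np.array([[0.0, 4.0], [-4.0, -3.0], [4.0, -3.0]])
X = np.concatenate([c + rng.normal(size=(per_class, dim)) for c in centers])
y = np.repeat(np.arange(num_classes), per_class)

W = np.zeros((dim, num_classes))
b = np.zeros(num_classes)
alpha, lam, batch_size = 0.1, 1e-3, 32  # illustrative hyperparameters


def loss_and_grads(Xb, yb, W, b):
    """Softmax loss (batch average + L2 term) and its analytic gradients."""
    n = Xb.shape[0]
    scores = Xb @ W + b                                   # 1. forward pass
    scores = scores - scores.max(axis=1, keepdims=True)   # shift for stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)
    data_loss = -np.log(probs[np.arange(n), yb]).mean()   # 2. loss
    loss = data_loss + lam * np.sum(W * W)
    dscores = probs.copy()                                # 3. gradient (backprop)
    dscores[np.arange(n), yb] -= 1.0
    dscores /= n
    dW = Xb.T @ dscores + 2.0 * lam * W
    db = dscores.sum(axis=0)
    return loss, dW, db


initial_loss, _, _ = loss_and_grads(X, y, W, b)
for step in range(300):
    idx = rng.choice(len(X), size=batch_size, replace=False)  # next mini-batch
    _, dW, db = loss_and_grads(X[idx], y[idx], W, b)
    W -= alpha * dW                                       # 4. step against the gradient
    b -= alpha * db

final_loss, _, _ = loss_and_grads(X, y, W, b)
accuracy = np.mean(np.argmax(X @ W + b, axis=1) == y)
print(f"loss {initial_loss:.3f} -> {final_loss:.3f}, train accuracy {accuracy:.2f}")
```

With all-zero weights every class gets probability 1/3, so the starting loss is ln 3; the loop should bring it well below that on this easy data.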

| Loss | Formula (one image) | Intuition |
| --- | --- | --- |
| Multiclass SVM (hinge) | L_i = Σ_{j ≠ y} max(0, s_j - s_y + 1) | Correct class must beat every other by at least 1 |
| Softmax / cross-entropy | p_j = exp(s_j) / Σ_k exp(s_k); L_i = -log(p_y) | Confidence on the correct class (probability interpretation) |

Both losses train the same score function s = W · x + b; they differ only in how they penalize the scores. Modern image classifiers usually use softmax.
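
Both formulas vectorize naturally over a mini-batch. The sketch below assumes scores arrive as an (N, C) array with one row per image; the function names and the sample scores are illustrative. The softmax version subtracts each row's maximum before exponentiating, a standard trick that leaves p_j unchanged but avoids overflow on large scores.

```python
import numpy as np


def svm_loss(scores, y, margin=1.0):
    """Mean multiclass hinge loss. scores: (N, C) array, y: (N,) correct-class indices."""
    n = scores.shape[0]
    correct = scores[np.arange(n), y][:, None]         # s_y for each row
    margins = np.maximum(0.0, scores - correct + margin)
    margins[np.arange(n), y] = 0.0                     # the sum skips j == y
    return margins.sum(axis=1).mean()


def softmax_loss(scores, y):
    """Mean cross-entropy loss, computed in log space for numerical stability."""
    n = scores.shape[0]
    shifted = scores - scores.max(axis=1, keepdims=True)   # does not change p_j
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(n), y].mean()


scores = np.array([[3.0, 1.0, -2.0],        # correct class wins by a wide margin
                   [1000.0, 999.5, 0.0]])   # huge scores would overflow a naive exp
y = np.array([0, 0])
svm_value = svm_loss(scores, y)
softmax_value = softmax_loss(scores, y)
print(f"SVM {svm_value:.3f}, softmax {softmax_value:.3f}")
```

The first row already satisfies every margin, so the SVM loss ignores it entirely, while softmax still reports a small nonzero loss for it.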

Worked numbers (scores [0.8, 0.3, -0.4], correct = X)

| Loss | Computation | Value |
| --- | --- | --- |
| SVM | max(0, 0.3 - 0.8 + 1) + max(0, -0.4 - 0.8 + 1) = 0.5 + 0 | 0.5 |
| Softmax | exp ≈ [2.226, 1.350, 0.670], sum ≈ 4.246, p_X ≈ 0.524, -ln(0.524) | ≈ 0.646 |

Both losses see the same prediction and push the scores in the same direction; only the loss values differ.
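
The arithmetic can be checked with a few lines of standard-library Python. The only assumption is that class X is the first entry (score 0.8), which is what the SVM row's arithmetic implies.

```python
import math

scores = [0.8, 0.3, -0.4]
correct = 0  # assumption: class X holds the first score, 0.8

# Multiclass SVM: sum the margin violations over the two wrong classes.
svm = sum(max(0.0, s - scores[correct] + 1.0)
          for j, s in enumerate(scores) if j != correct)

# Softmax: exponentiate, normalize, take -ln of the correct-class probability.
exps = [math.exp(s) for s in scores]
p_correct = exps[correct] / sum(exps)
softmax = -math.log(p_correct)

print(f"SVM loss     = {svm:.3f}")
print(f"p_X          = {p_correct:.3f}")
print(f"Softmax loss = {softmax:.3f}")
```

The three printed values are 0.500, 0.524, and 0.646.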

| Term | Form | Knob |
| --- | --- | --- |
| L2 (most common) | λ * Σ W² | λ controls weight-shrink strength |
| Total loss | L = L_data + λ * Σ W² | What the optimizer drives down |

Larger λ → smaller W → simpler model → better generalization (within reason: a λ that is too large underfits).
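
A small sketch of how the penalty enters the total loss and why it shrinks weights. The weight matrix, the `data_loss` value, and the step size are placeholders for illustration; the one derived fact used is that the gradient of λ * Σ W² is 2 * λ * W, so a step against the penalty's gradient scales every weight by (1 - 2 * α * λ).

```python
import numpy as np


def total_loss(data_loss, W, lam):
    """L = L_data + lam * sum(W**2)."""
    return data_loss + lam * np.sum(W ** 2)


def l2_grad(W, lam):
    """Gradient of the penalty term alone: d/dW [lam * sum(W**2)] = 2 * lam * W."""
    return 2.0 * lam * W


W = np.array([[1.0, -2.0], [0.5, 3.0]])   # placeholder weights
data_loss = 0.646                          # placeholder data loss
alpha = 0.1                                # placeholder learning rate

for lam in (0.0, 0.01, 0.1):
    W_after = W - alpha * l2_grad(W, lam)  # one step on the penalty only
    print(f"lam={lam}: total loss {total_loss(data_loss, W, lam):.3f}, "
          f"|W| {np.linalg.norm(W):.3f} -> {np.linalg.norm(W_after):.3f}")
```

With λ = 0 the weights are untouched; as λ grows, the same step pulls them further toward zero.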

| Concept | One line |
| --- | --- |
| Gradient ∇L | Direction of steepest ascent of the loss in W-space |
| Update rule | W ← W - α * ∇L (step in the negative gradient direction) |
| Learning rate α | Step size; one of the most consequential hyperparameters |
| Too small α | Loss crawls down |
| Too large α | Overshoot; loss can climb (CIFAR-10 example: 2.20 → 1.65 at a workable α vs. > 2500 when α is too large) |
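
The three learning-rate regimes show up even on a one-dimensional toy loss. The sketch below uses L(w) = w², an assumption made purely for illustration (it is not the CIFAR-10 experiment); its gradient is 2w, so each update multiplies w by (1 - 2α).

```python
def descend(alpha, w0=1.0, steps=20):
    """Gradient descent on the toy loss L(w) = w**2, whose gradient is 2 * w."""
    w = w0
    losses = [w * w]
    for _ in range(steps):
        w = w - alpha * 2.0 * w   # W <- W - alpha * grad
        losses.append(w * w)
    return losses


for alpha in (0.001, 0.1, 1.1):
    losses = descend(alpha)
    print(f"alpha={alpha}: loss {losses[0]:.3f} -> {losses[-1]:.3g}")
```

At α = 0.001 the loss barely moves in 20 steps, at α = 0.1 it drops by several orders of magnitude, and at α = 1.1 every step overshoots the minimum by more than it started with, so the loss ends above 1000.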

| Strategy | Accuracy |
| --- | --- |
| Random search | ~15.5% |
| Random local search | ~21.4% |
| Gradient descent | Dramatically better |
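
The gap between the three strategies can be reproduced in miniature. This sketch does not reproduce the accuracies above; it assumes a toy bowl-shaped loss in 10 dimensions and gives each strategy the same budget of 200 tries. The step sizes and the budget are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.ones(10)


def loss(w):
    """Toy bowl-shaped loss with its minimum at `target`."""
    return float(np.sum((w - target) ** 2))


def grad(w):
    """Analytic gradient of the toy loss."""
    return 2.0 * (w - target)


budget = 200  # number of candidate weight vectors each strategy may try

# Random search: draw fresh weights every time, keep the best loss seen.
random_best = min(loss(rng.normal(size=10)) for _ in range(budget))

# Random local search: perturb the current weights, keep only improvements.
w = np.zeros(10)
for _ in range(budget):
    candidate = w + 0.1 * rng.normal(size=10)
    if loss(candidate) < loss(w):
        w = candidate
local_best = loss(w)

# Gradient descent: step against the gradient every time.
w = np.zeros(10)
for _ in range(budget):
    w = w - 0.1 * grad(w)
gd_best = loss(w)

print(f"random {random_best:.3g}, local {local_best:.3g}, gradient descent {gd_best:.3g}")
```

Random search wastes every try on an unrelated guess and local search keeps only lucky perturbations, whereas gradient descent uses the slope to pick a useful direction on every step.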

| Method | Pros | Cons | Use |
| --- | --- | --- | --- |
| Analytic (calculus) | Fast, exact | Error-prone to derive | Real training |
| Numerical (finite diff [L(W+h) - L(W)] / h) | Simple | Approximate, very slow | Gradient check (sanity-check the analytic) |
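
A gradient check compares the two methods on a tiny problem. The sketch below assumes a single example, a softmax loss, and no bias term; the shapes, the random inputs, and h = 1e-5 are illustrative. It uses the forward difference from the table, perturbing one weight at a time, which is exactly why the numerical method is too slow for real training.

```python
import numpy as np


def softmax_loss(W, x, y):
    """Softmax loss for one example with scores s = W @ x (bias omitted for brevity)."""
    s = W @ x
    s = s - s.max()                      # shift for numerical stability
    return -s[y] + np.log(np.exp(s).sum())


def analytic_grad(W, x, y):
    """dL/dW = outer(p - onehot(y), x)."""
    s = W @ x
    p = np.exp(s - s.max())
    p /= p.sum()
    p[y] -= 1.0
    return np.outer(p, x)


def numerical_grad(f, W, h=1e-5):
    """Forward difference [L(W + h) - L(W)] / h, one weight at a time."""
    grad = np.zeros_like(W)
    base = f(W)
    for idx in np.ndindex(W.shape):
        W_step = W.copy()
        W_step[idx] += h
        grad[idx] = (f(W_step) - base) / h
    return grad


rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
x = rng.normal(size=4)
y = 1

num = numerical_grad(lambda W_: softmax_loss(W_, x, y), W)
ana = analytic_grad(W, x, y)
max_diff = np.abs(num - ana).max()
print(f"max |numerical - analytic| = {max_diff:.2e}")
```

A small difference suggests the analytic derivation is right; a large one usually points to a bug in the analytic gradient.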

| Item | Detail |
| --- | --- |
| Batch size | Commonly 32, 64, 128, 256 (powers of 2 for GPU efficiency) |
| Why it works | Mini-batch gradient is a noisy but unbiased estimate of the full gradient |
| Strict SGD | Batch size 1; “SGD” in current usage covers all mini-batch sizes |
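
The "noisy but unbiased" claim can be seen numerically. This sketch assumes a least-squares loss on random data as a stand-in for a classifier loss; the sizes are illustrative. Any single mini-batch gradient differs from the full gradient, but averaging the gradients of all equal-size batches in one shuffled epoch recovers it exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, batch_size = 256, 5, 32
X = rng.normal(size=(N, D))
y = rng.normal(size=N)
w = rng.normal(size=D)


def grad(Xb, yb, w):
    """Gradient of 0.5 * mean((Xb @ w - yb)**2) with respect to w."""
    return Xb.T @ (Xb @ w - yb) / len(yb)


full = grad(X, y, w)

# Shuffle once, then split into equal-size mini-batches (one epoch).
perm = rng.permutation(N)
batch_grads = np.array([grad(X[idx], y[idx], w)
                        for idx in perm.reshape(-1, batch_size)])

single_error = np.linalg.norm(batch_grads[0] - full)
epoch_error = np.linalg.norm(batch_grads.mean(axis=0) - full)
print(f"one batch vs full gradient : {single_error:.3g}")
print(f"epoch mean vs full gradient: {epoch_error:.3g}")
```

The first number is the noise in one mini-batch estimate; the second is zero up to floating-point rounding.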

| Pitfall | Reality |
| --- | --- |
| Gradient = step | Gradient is the direction; step length is the learning rate (and we go the OTHER way) |
| Small training loss = good model | What counts is unseen-image accuracy; regularization exists for this gap |
| Bigger learning rate = faster | Past a threshold, it overshoots and loss climbs |
| SVM and softmax are the same | Different formulas, different numbers; softmax is the modern default |

A classifier learns by repeatedly turning its scores on a mini-batch of images into a loss number, computing the gradient (the steepest-uphill direction in weight space), and taking one small step against it.