How a classifier learns: loss and optimization

This is lesson 3 of Phase 1 (Foundations for vision). The one capability it builds: you will be able to explain and run the training loop, defining the loss that measures wrongness and the optimization step that drives it down. That loop is what every classifier in this track is actually doing under the word “learning.” The source curriculum is Stanford CS231n (cs231n.stanford.edu); this lesson maps to Lecture 3 and is grounded in the linear-classify and optimization-1 course notes.

The lesson defines two standard losses (multiclass SVM and softmax / cross-entropy), works each on the same scores from lesson 2, adds the regularization term that gives the optimizer a reason to prefer simpler W, and then walks the strategy ladder (random search, random local search, gradient descent) before settling on the gradient descent update rule and how it is realized in practice (analytic gradients via backprop, mini-batch SGD).
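For orientation, the pieces named above can be written compactly. This is a sketch in the notation commonly used in the CS231n notes; the margin Δ (usually set to 1), the example index i, the correct label y_i, the dataset size N, and the choice of an L2 penalty for R(W) are assumptions here, and the lesson body may use slightly different symbols.

```latex
% Scores for example i, from lesson 2's score function
s = W x_i + b

% Multiclass SVM (hinge) loss for example i, margin \Delta
L_i = \sum_{j \neq y_i} \max\left(0,\; s_j - s_{y_i} + \Delta\right)

% Softmax / cross-entropy loss for example i
L_i = -\log\left( \frac{e^{s_{y_i}}}{\sum_{j} e^{s_j}} \right)

% Full loss: average data loss plus regularization weighted by \lambda
L = \frac{1}{N} \sum_{i=1}^{N} L_i + \lambda R(W),
\qquad R(W) = \sum_{k} \sum_{l} W_{k,l}^{2}

% Gradient descent update with learning rate \alpha
W \leftarrow W - \alpha \, \nabla_W L
```

A larger λ puts more weight on keeping W small relative to fitting the training data; a larger α takes a bigger step along the negative gradient.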

Of the track’s 16 lessons, this one is the bridge from a classifier we can compute (lesson 2) to a classifier that can actually learn its own weights. It depends on lesson 2’s s = W · x + b. The next lesson, Learning features instead of coding them: neural networks and backprop, generalizes the linear classifier into a multi-layer network and introduces the backpropagation algorithm that makes the analytic gradient feasible for deep models. Phase 1 closes with that lesson, after which Phase 2 introduces convolutional networks built specifically for images.

Prerequisites: lessons 1 and 2 of this track. You need lesson 2’s score function in your head; the loss is a function of those scores, and the gradient is taken with respect to the same W. Neural Network Intuition (Track 11) is helpful soft background: lessons 5-7 there cover the cost landscape and gradient descent in a generic setting, and this lesson applies the same loop to a vision classifier.

A notch more arithmetic than lesson 2, still no calculus required. The body computes one SVM loss by hand (a few max and additions) and one softmax probability + cross-entropy by hand (three exps, a division, a log). The practice section repeats both with fresh numbers and adds a one-step gradient descent update. The gradient itself is introduced as a direction in weight space; how it is actually computed efficiently (the chain rule, backprop) is the next lesson.
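The hand arithmetic described above can be checked with a few lines of Python. This is a minimal sketch, not the lesson's own worked case: the three scores and the correct-class index are hypothetical stand-ins (they are not the numbers from lesson 2), and the margin of 1 is the conventional default.

```python
import math

# Hypothetical scores for one image over three classes. These are NOT the
# numbers from lesson 2, just stand-ins to show the arithmetic.
scores = [3.2, 5.1, -1.7]
correct = 0  # index of the correct class (assumed for illustration)


def svm_loss(scores, correct, margin=1.0):
    """Multiclass SVM (hinge) loss for one example."""
    s_correct = scores[correct]
    return sum(
        max(0.0, s - s_correct + margin)
        for j, s in enumerate(scores)
        if j != correct
    )


def softmax_probs(scores):
    """Softmax probabilities; subtracting the max keeps exp() from overflowing."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_loss(scores, correct):
    """Negative log of the probability assigned to the correct class."""
    return -math.log(softmax_probs(scores)[correct])


hinge = svm_loss(scores, correct)  # max(0, 2.9) + max(0, -3.9)
probs = softmax_probs(scores)
xent = cross_entropy_loss(scores, correct)

print(f"SVM loss: {hinge:.2f}")
print(f"softmax probabilities: {[round(p, 3) for p in probs]}")
print(f"cross-entropy loss: {xent:.2f}")
```

On these scores the SVM loss only counts the one class that beats the correct class by more than the margin allows, while the cross-entropy loss depends on all three scores through the normalizing sum.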

  • Write both losses (SVM and softmax / cross-entropy) and compute each on a small case
  • Explain regularization and what λ controls
  • State the gradient descent update rule and what the learning rate α controls
  • Distinguish analytic from numerical gradients
  • Explain mini-batch / SGD and why it is the practical default
  • Read time: about 13 minutes
  • Practice time: about 15 minutes (one fresh SVM-zero case, one softmax + cross-entropy walk-through with two correct-label cases, one gradient descent step at two learning rates, plus flashcards)
  • Difficulty: standard (the math is arithmetic + a couple of exp/log; the conceptual lift is the loss-and-gradient loop)
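The optimization objectives in the list above (the update rule, analytic versus numerical gradients, and mini-batch SGD) can be sketched on a toy problem. Everything here is an assumption made for illustration: a squared-error loss on made-up data stands in for the classifier loss so the gradient can be derived by hand, and the learning rates, batch size, and step count are arbitrary choices. The loop structure is what carries over to a real classifier.

```python
import random

# Toy data (hypothetical): points on the line y = 2x + 1. A squared-error
# loss stands in for the classifier loss so the gradient is easy to derive.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
data = [(x, 2.0 * x + 1.0) for x in xs]


def loss(w, batch):
    """Mean squared error of the line w[0]*x + w[1] over a batch."""
    return sum((w[0] * x + w[1] - y) ** 2 for x, y in batch) / len(batch)


def analytic_grad(w, batch):
    """Gradient of loss() derived by hand: exact and cheap."""
    g0 = g1 = 0.0
    for x, y in batch:
        r = w[0] * x + w[1] - y  # residual for this example
        g0 += 2.0 * r * x
        g1 += 2.0 * r
    n = len(batch)
    return [g0 / n, g1 / n]


def numerical_grad(w, batch, h=1e-5):
    """Centered finite differences: approximate and slow, useful as a check."""
    grad = []
    for k in range(len(w)):
        up = list(w)
        up[k] += h
        down = list(w)
        down[k] -= h
        grad.append((loss(up, batch) - loss(down, batch)) / (2.0 * h))
    return grad


def step(w, grad, alpha):
    """Gradient descent update: w <- w - alpha * grad."""
    return [wk - alpha * gk for wk, gk in zip(w, grad)]


w_start = [0.0, 0.0]
g_exact = analytic_grad(w_start, data)
g_approx = numerical_grad(w_start, data)

# One full-batch step at two learning rates (values chosen for illustration).
loss_start = loss(w_start, data)
loss_small = loss(step(w_start, g_exact, 0.01), data)  # modest step
loss_large = loss(step(w_start, g_exact, 1.0), data)   # overshoots here

# Mini-batch SGD: estimate the gradient from a few sampled examples per step.
rng = random.Random(0)
w = list(w_start)
for _ in range(2000):
    batch = rng.sample(data, 2)
    w = step(w, analytic_grad(w, batch), 0.01)
loss_sgd = loss(w, data)

print("analytic gradient: ", g_exact)
print("numerical gradient:", g_approx)
print("loss at start, after small step, after large step:",
      loss_start, loss_small, loss_large)
print("loss after mini-batch SGD:", loss_sgd)
```

The numerical gradient needs two loss evaluations per weight, which is why it serves as a check rather than the training method. On this toy problem the larger learning rate makes the loss go up after one step, and the mini-batch loop never touches the full dataset in a single update.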