Telling pictures apart with one score: linear classifiers
What you’ll learn
This is lesson 2 of Phase 1 (Foundations for vision). Lesson 1 made the case that vision must be data-driven; this lesson presents the simplest concrete machine that carries that out. It builds one capability: you will be able to compute a linear-classifier prediction by hand and explain exactly what each piece of s = W · x + b is doing. That equation is the seed every later vision model grows from, and the final layer of nearly every modern vision model still has this same form. The source curriculum is Stanford CS231n, cs231n.stanford.edu.
The lesson defines the score function and grounds it in CIFAR-10’s shapes: an image as 3072 pixel numbers, ten classes, and W as a 10-by-3072 matrix. It walks through one small prediction step by step, then shows that each row of W is a learned template for one class (visualizing CIFAR-10’s actual learned templates produces the famously ghostly two-headed horse). It explains the geometric hyperplane view and ends on the structural limit, one template per class, that motivates everything that follows.
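As a rough illustration of the score function described above, here is a minimal sketch in plain Python. The helper name `linear_scores` and all of the numbers are made up for illustration (3 classes and 4 "pixels" instead of CIFAR-10's 10 and 3072); they are not the lesson's own worked example.

```python
def linear_scores(W, x, b):
    """Return one score per class: s = W . x + b."""
    # Shape bookkeeping: W is K-by-D, x has D entries, b has K entries.
    assert all(len(row) == len(x) for row in W), "each row of W needs D entries"
    assert len(W) == len(b), "one bias per class"
    # Each score is the dot product of one row (one class template) with x, plus that class's bias.
    return [sum(w * p for w, p in zip(row, x)) + bias for row, bias in zip(W, b)]


# Toy values, assumed for illustration only: K = 3 classes, D = 4 pixels.
W = [
    [1, 0, -1, 2],   # template for class 0
    [0, 2, 1, -1],   # template for class 1
    [-1, 1, 0, 1],   # template for class 2
]
x = [2, 1, 0, 3]     # the flattened "image"
b = [1, 0, -2]       # one bias per class

scores = linear_scores(W, x, b)
predicted_class = scores.index(max(scores))  # the highest score wins

print(scores)           # [9, -1, 0]
print(predicted_class)  # 0
```

With CIFAR-10 shapes the same function would take a W with 10 rows of 3072 entries each, an x with 3072 entries, and a b with 10 entries; only the sizes change, not the computation.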
Where this fits
This is lesson 2 of 16, and the second lesson of Phase 1. It depends directly on lesson 1’s “data-driven approach” framing: the linear classifier is the simplest learner that approach produces. The next lesson, How a classifier learns: loss and optimization, defines exactly what “predictions match labels” means as a single number (the loss) and shows how to nudge W and b to make it smaller. Phase 1 closes with neural networks and backpropagation, after which Phase 2 introduces the convolutional networks that finally break the multi-modal limit named here.
Before you start
Prerequisites: lesson 1 of this track (Why seeing is hard for machines), which sets up the data-driven approach this lesson realizes. Neural Network Intuition (Track 11) is helpful soft background: the per-neuron w · x + b from its lesson 3 is the same computation, done here once per class.
About the math
Light, but more arithmetic than lesson 1. The only operations are multiplying pairs of numbers and adding them up (a dot product), plus checking matrix shapes (K-by-D times D-by-1 = K-by-1). The body works one tiny prediction by hand, the practice section walks you through another with different numbers, and a parameter-counting exercise multiplies pixel-count times class-count. Nothing beyond arithmetic and shape-bookkeeping is required.
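The shape check and the parameter count mentioned above can be sketched in a few lines of plain Python. The function names `matvec_shape` and `parameter_count` are hypothetical helpers introduced here for illustration, and the sketch assumes the bias entries are counted along with W; the lesson's own exercise may count them differently.

```python
def matvec_shape(w_shape, x_shape):
    """Shape of (K-by-D) times (D-by-1); raises if the inner sizes disagree."""
    (k, d), (d2, cols) = w_shape, x_shape
    if d != d2:
        raise ValueError(f"inner dimensions differ: {d} vs {d2}")
    return (k, cols)


def parameter_count(num_classes, num_pixels):
    """Entries in W (K * D) plus entries in b (K)."""
    return num_classes * num_pixels + num_classes


D = 32 * 32 * 3   # a CIFAR-10 image: 32 x 32 pixels, 3 color channels = 3072 numbers
K = 10            # ten classes

score_shape = matvec_shape((K, D), (D, 1))
total = parameter_count(K, D)

print(D)            # 3072
print(score_shape)  # (10, 1)
print(total)        # 30730
```

The weight matrix alone holds 10 × 3072 = 30720 numbers; the ten biases bring the total to 30730.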
By the end, you’ll be able to
- Write the score function s = W · x + b and name what each symbol is, with CIFAR-10 shapes
- Compute a small linear-classifier prediction by hand
- Explain why each row of W is a learned per-class template
- Describe the geometric (hyperplane) view and the bias’s role in it
- Identify the one-template-per-class limit and explain why it motivates the lessons ahead
Time and difficulty
- Read time: about 13 minutes
- Practice time: about 15 minutes (a fresh worked dot-product prediction, parameter-counting arithmetic at three scales, a multi-modal reasoning question, plus flashcards)
- Difficulty: standard (the math is multiplication and addition; the conceptual jump is the template interpretation and seeing the limit clearly)