Summary: Linear classifiers
The simplest data-driven vision system in one equation: flatten the image into a column of numbers x, multiply by a learned weight matrix W, add a learned bias vector b, and read off one score per class. The highest score wins. Each row of W is a learned template for one class; the score is a dot product measuring how well the image matches that template. It is template-matching where the templates were learned, not drawn. It works, with limits, and the rest of the track is about getting past those limits.
Core ideas
Section titled “Core ideas”- The score function:
s = W · x + b. For CIFAR-10: x is [3072 by 1] (32 × 32 × 3), W is [10 by 3072], b is [10 by 1], s is [10 by 1]. Pick the largest score. - W rows are learned templates. A row, reshaped back into image dimensions, is the brightness pattern the class likes. Scoring is a dot product; this is template-matching with learned templates, the precise mechanical form of the “learn from examples” shift from lesson 1.
- The bias is a per-class default lean. A small offset added to every score regardless of the input; it lets a class start a bit high or low before any pixels are consulted. The bias trick rolls it into W (append a 1 to x; one extra column on W) so the equation becomes just
s = W · x. - Geometrically, one row of W plus its bias = one hyperplane in pixel space. Prediction asks which side of each of the K hyperplanes the image is on. The bias is what lets the hyperplane sit somewhere other than through the origin.
- The structural limit: one template per class. When a class is multi-modal (left-facing horses vs right-facing horses; red cars vs blue cars), the single learned template becomes a ghostly compromise that matches neither mode crisply. On CIFAR-10 this caps a linear classifier near ~40 percent accuracy. The rest of the track is about getting past this.
What changes for you
Section titled “What changes for you”This equation is the smallest concrete thing computer vision does, and you can see its shape in everything bigger. The final layer of nearly every modern vision model is still s = W · x + b. What changes is the x it sees: in this lesson, x is raw pixels and the classifier is doing all the work, badly; in a convolutional network (Phase 2), x is a much richer set of learned features and the same linear classifier sits on top of them with an easy job. The equation never goes away; it is the unchanging tip of every modern vision stack, and the lessons ahead add the body underneath. Knowing it is template matching rather than understanding also helps make sense of confident misclassifications, an image gets matched against templates whether or not the templates were built for an image like it.
One equation, one score per class, one template per class, all learned from data. That is the linear classifier, and it is the seed every later model grows from.