Practice: Linear classifiers

Seven short questions. Answer each in your head (or on paper) before opening the collapsible. Active retrieval beats re-reading.

1. Write the linear classifier’s score function from memory, and name every symbol.

Show answer

s = W · x + b. x is the image flattened into a column of D pixel values; W is the K-by-D weight matrix (K classes, D pixel values); b is the K-vector of biases (one per class); s is the K-vector of scores (one per class). The prediction is the class with the largest entry of s.
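As a minimal sketch, the score function and the pick-the-max rule fit in a few lines of plain Python. The helper names `scores` and `predict` and the toy numbers (K = 2, D = 3) are assumptions made for illustration, not part of the lesson.

```python
def scores(W, x, b):
    """Return s = W·x + b as a list of K scores.

    W is a list of K rows with D weights each, x has D values, b has K biases.
    """
    return [sum(w * xi for w, xi in zip(row, x)) + bias
            for row, bias in zip(W, b)]


def predict(W, x, b):
    """Return the index of the largest score, i.e. the predicted class."""
    s = scores(W, x, b)
    return max(range(len(s)), key=lambda k: s[k])


# Hypothetical toy numbers: K = 2 classes, D = 3 pixel values.
W = [[1.0, 0.0, -1.0],
     [0.0, 2.0, 0.0]]
b = [0.5, -0.5]
x = [1.0, 1.0, 1.0]

s = scores(W, x, b)      # one score per class
pred = predict(W, x, b)  # index of the winning class
```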

2. For CIFAR-10, what are the shapes of x, W, b, and s?

Show answer

CIFAR-10 images are 32 by 32 pixels with 3 color channels, so each image flattens to 32 × 32 × 3 = 3072 numbers, and there are 10 classes. So x is [3072 by 1], W is [10 by 3072], b is [10 by 1], s is [10 by 1].

3. In plain words, what is one row of W?

Show answer

A learned template for one class. Reshaped back into image dimensions, it shows the brightness pattern that scores high for that class. The score is a dot product (template-matching) between this learned template and the input image.

4. What is the bias’s job, and what is the bias trick?

Show answer

The bias b is a per-class default offset added to every score, independent of the input; it can lean a class up or down regardless of pixels. The bias trick rolls it into W: append a constant 1 to x (so it has D + 1 entries) and append b to W as one extra column (so W is K by D + 1). The new column multiplies the constant 1, so s = W·x (no separate + b) computes the same scores. Same math, less bookkeeping.
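A quick way to convince yourself that the bias trick changes nothing is to compute the scores both ways. This is a sketch with made-up numbers; `dot_rows`, `W_aug`, and `x_aug` are names chosen here for illustration.

```python
def dot_rows(W, x):
    """Multiply each row of W with x (a plain matrix-vector product)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


# Hypothetical toy numbers: K = 2 classes, D = 2 pixel values.
W = [[1.0, 2.0],
     [-1.0, 0.5]]
b = [0.25, -0.75]
x = [2.0, 4.0]

# Usual form: s = W·x + b
separate = [s + bias for s, bias in zip(dot_rows(W, x), b)]

# Bias trick: b becomes the last column of W, and x gets a trailing 1.
W_aug = [row + [bias] for row, bias in zip(W, b)]
x_aug = x + [1.0]
folded = dot_rows(W_aug, x_aug)
```

Both `separate` and `folded` hold the same two scores, because each appended bias is multiplied by the appended 1.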

5. What is the geometric picture of one class’s row of W plus its bias?

Show answer

A flat boundary (a hyperplane) in the high-dimensional pixel space: the set of images whose score for that class is exactly zero. Images on the positive side of that hyperplane score above zero for the class, and the score grows with distance from the boundary; images on the negative side score below zero. The bias is what lets the hyperplane sit somewhere other than through the origin; without it every boundary would have to pass through the all-zero image.
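The picture is easiest to check in two dimensions, where the hyperplane is just a line. The sketch below assumes a hypothetical 2-pixel "image" and one class with weights [1, 1] and bias -1, so the boundary is the line x1 + x2 = 1.

```python
def score(w, x, b):
    """Score of one class: dot product of its row of W with x, plus its bias."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


w = [1.0, 1.0]  # hypothetical row of W for one class
b = -1.0        # hypothetical bias for that class

positive_side = score(w, [1.0, 1.0], b)  # above the line x1 + x2 = 1
on_boundary = score(w, [0.5, 0.5], b)    # exactly on the line
negative_side = score(w, [0.0, 0.0], b)  # below the line

# With the bias removed, the all-zero image scores exactly zero,
# so the boundary is forced through the origin.
origin_without_bias = score(w, [0.0, 0.0], 0.0)
```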

6. Why does a linear classifier struggle with a multi-modal class like “horse”?

Show answer

Because it has exactly one template (one hyperplane) per class. If “horse” really has two distinct looks (left-facing and right-facing), the single learned template becomes a blurred compromise (the famously ghostly two-headed horse template in CIFAR-10), and it matches neither mode crisply. Real classes are often multi-modal; this is the structural limit.

7. Are the scores s = Wx + b probabilities?

Show answer

No. They are unbounded real numbers; they can be negative, zero, or arbitrarily large. They rank the classes, and that ranking is enough to predict. Turning them into probabilities is a separate step (softmax), covered in a later lesson.
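A small check makes the point concrete. The three scores below are made up for illustration; nothing about them comes from a trained model.

```python
s = [2.5, -1.0, 0.5]  # hypothetical class scores

has_negative = any(v < 0 for v in s)  # probabilities are never negative
total = sum(s)                        # probabilities would sum to 1
predicted = max(range(len(s)), key=lambda k: s[k])

# Multiplying every score by a positive constant makes the numbers
# larger but leaves the ranking, and so the prediction, unchanged.
scaled = [10 * v for v in s]
predicted_scaled = max(range(len(scaled)), key=lambda k: scaled[k])
```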

Try it yourself: compute a prediction and count the knobs

Three short exercises, paper only, about 15 minutes.

Part A: predict by hand. Use the score function. The “image” is 2 by 2 grayscale, so D = 4. There are 3 classes (X, Y, Z). Flattened, the image is:

x = [0.4, 0.8, 0.2, 0.6]

The learned weights and biases are:

W = |  1.0   0.0  -1.0   1.0 |      b = |  0.0 |
    |  0.0   1.0   0.0  -1.0 |          |  0.1 |
    |  1.0  -1.0   1.0   0.0 |          | -0.2 |

The rows of W (and the entries of b) are in class order X, Y, Z.

Compute s_X, s_Y, s_Z. Which class is predicted?

Worked answer
s_X = (1.0)(0.4) + (0.0)(0.8) + (-1.0)(0.2) + (1.0)(0.6) + 0.0
    = 0.4 + 0 - 0.2 + 0.6 + 0 = 0.8
s_Y = (0.0)(0.4) + (1.0)(0.8) + (0.0)(0.2) + (-1.0)(0.6) + 0.1
    = 0 + 0.8 + 0 - 0.6 + 0.1 = 0.3
s_Z = (1.0)(0.4) + (-1.0)(0.8) + (1.0)(0.2) + (0.0)(0.6) + (-0.2)
    = 0.4 - 0.8 + 0.2 + 0 - 0.2 = -0.4

Scores [0.8, 0.3, -0.4]. The largest is X, so the prediction is X. That is the whole classifier: one dot product per class, plus a bias, then pick the max.
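If you want to check the hand computation afterwards, a few lines of plain Python reproduce it. The numbers are the ones given in Part A; the variable names are chosen here for illustration.

```python
x = [0.4, 0.8, 0.2, 0.6]
W = [[1.0, 0.0, -1.0, 1.0],   # class X
     [0.0, 1.0, 0.0, -1.0],   # class Y
     [1.0, -1.0, 1.0, 0.0]]   # class Z
b = [0.0, 0.1, -0.2]
classes = ["X", "Y", "Z"]

# One dot product per class, plus that class's bias.
s = [sum(w * xi for w, xi in zip(row, x)) + bias
     for row, bias in zip(W, b)]

# Round away floating-point noise before displaying.
rounded = [round(v, 10) for v in s]
prediction = classes[s.index(max(s))]
print(rounded, prediction)
```

The scores come out at approximately 0.8, 0.3, and -0.4 (up to floating-point rounding), and the predicted class is X.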

Part B: count the knobs. For each image size and class count, compute the total number of learned numbers (weights in W plus biases in b).

  1. 8 by 8 grayscale image, 3 classes.
  2. 32 by 32 color (CIFAR-10), 10 classes.
  3. 224 by 224 color, 1000 classes (an ImageNet-scale classifier).

Answers

  1. D = 8 × 8 × 1 = 64 pixels. W is 3 × 64 = 192 weights; b is 3. Total: 195 numbers.
  2. D = 32 × 32 × 3 = 3072. W is 10 × 3072 = 30,720 weights; b is 10. Total: 30,730 numbers.
  3. D = 224 × 224 × 3 = 150,528. W is 1000 × 150,528 = 150,528,000 weights; b is 1000. Total: 150,529,000 numbers.

The jump from a toy 8 × 8 classifier (195 numbers) to an ImageNet-scale one (about 150 million) is nearly six orders of magnitude. Every one of those numbers has to be learned from data, which is one reason large vision datasets matter so much.
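The same count can be written as a one-line formula, K × D + K with D = height × width × channels. The function name `count_parameters` below is an assumption made for this sketch.

```python
def count_parameters(height, width, channels, num_classes):
    """Learned numbers in a linear classifier: K*D weights plus K biases."""
    d = height * width * channels
    return num_classes * d + num_classes


toy = count_parameters(8, 8, 1, 3)               # 8 by 8 grayscale, 3 classes
cifar10 = count_parameters(32, 32, 3, 10)        # 32 by 32 color, 10 classes
imagenet = count_parameters(224, 224, 3, 1000)   # 224 by 224 color, 1000 classes

print(toy, cifar10, imagenet)
```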

Part C: reasoning. Suppose the training images for class “horse” split roughly half-and-half into horses facing left and horses facing right (in different parts of the image). Argue, in one or two sentences, why a single linear template cannot capture both modes well at once. What would the single learned template tend to look like?

What a good answer looks like

A single row of W can place positive weights where horses typically have bright pixels and negative weights where they typically have dark pixels, but a left-facing horse and a right-facing horse put bright pixels in different places. To score both modes high, the learned template ends up positive in both regions at once, which is a blurred superposition that matches neither cleanly. Visualized, it looks like a ghostly two-headed horse, exactly the compromise that motivates moving past a single template per class (next, loss and optimization; later, neural networks that can hold many templates).
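A toy calculation shows the cost of the compromise. This is only an illustration under strong assumptions: the "images" have 4 pixels, each mode lights up a single pixel, and the blurred template is modeled as the average of the two modes rather than actually learned from data. All templates are scaled to unit length so the comparison is fair.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def unit(v):
    """Scale v to length 1 so templates are compared on equal footing."""
    norm = math.sqrt(dot(v, v))
    return [vi / norm for vi in v]


# Hypothetical 4-pixel "images": each mode is bright in a different place.
left_horse = [1.0, 0.0, 0.0, 0.0]
right_horse = [0.0, 0.0, 0.0, 1.0]

# A template dedicated to one mode versus one template shared by both.
left_template = unit(left_horse)
blurred = unit([(l + r) / 2 for l, r in zip(left_horse, right_horse)])

dedicated_score = dot(left_template, left_horse)  # best possible match
blurred_left = dot(blurred, left_horse)           # weaker match
blurred_right = dot(blurred, right_horse)         # equally weak match
```

The shared template responds to both modes, but each response is about 0.71 of what a dedicated template would give. That is the "matches neither cleanly" effect in miniature.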

Nine flashcards for review.

Q. What is the linear classifier's score function?
A.

s = W · x + b. The image x (D pixel numbers in a column) is multiplied by the K-by-D weight matrix W and a K-vector of biases b is added, giving K scores. The largest score is the prediction.

Q. What are the shapes for CIFAR-10?
A.

x is [3072 by 1] (32 × 32 pixels × 3 color channels), W is [10 by 3072], b is [10 by 1], s is [10 by 1] for the 10 classes.

Q. What is one row of W?
A.

A learned template for one class. Reshaped back to image dimensions it is the brightness pattern the class scores high on; the score is a dot product between this template and the image.

Q. What does the bias b do?
A.

Adds a per-class offset to every score, independent of the input. It lets the classifier lean toward or away from a class as a default before any pixels are considered.

Q. What is the bias trick?
A.

Append a constant 1 to x and append b to W as one extra column. Each bias is then multiplied by the constant 1, so s = W · x (no separate + b) gives the same scores. Same math, less bookkeeping.

Q. What is the geometric interpretation of a row of W plus its bias?
A.

A flat boundary (hyperplane) in pixel space. Images on its positive side score high for that class; the bias is what lets the boundary sit anywhere other than through the origin.

Q. Why does a linear classifier struggle with multi-modal classes?
A.

It has one template (one hyperplane) per class. Distinct modes within a class (left-facing vs right-facing horses, multiple colors of cars) get merged into a single blurred compromise template that matches none well.

Q. Are the scores produced by s = Wx + b probabilities?
A.

No. They are unbounded real numbers and only rank the classes. Turning them into probabilities is a separate step (softmax), in a later lesson.

Q. Why does the linear classifier still matter when modern CNNs exist?
A.

It is the final layer of nearly every modern vision model: a CNN extracts rich features, then a linear classifier s = Wx + b maps those features to class scores. The equation never goes away; what changes is what feeds it.