Practice: Drawing the widest margin: support vector machines

Self-check

Seven short questions. Try to answer each one before opening the collapsible.

1. When many lines separate two classes, which one does a support vector machine pick?

Show answer

The one with the widest margin: the boundary that runs down the middle of the widest possible street between the two classes, as far as possible from the nearest points on each side.

2. What is the margin, and what are support vectors?

Show answer

The margin is the width of the street between the classes. The boundary runs down its middle, so the nearest points on each side sit half a street-width away from it. The support vectors are those nearest points, sitting on the edges of the street. They alone determine the boundary.
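To make the street concrete, here is a minimal sketch that measures it for a given boundary. The six points and the boundary x + y = 4 are made-up toy values chosen for illustration, and the boundary is simply assumed rather than trained.

```python
import math

# Hypothetical toy data: (point, label) pairs, labels are -1 or +1.
data = [((0.0, 0.0), -1), ((1.0, 1.0), -1), ((2.0, 0.0), -1),
        ((3.0, 3.0), +1), ((4.0, 2.0), +1), ((5.0, 5.0), +1)]

# Assumed boundary: w.x + b = 0, i.e. the line x + y = 4.
w = (1.0, 1.0)
b = -4.0

def distance_to_boundary(point, label):
    """Signed distance; positive when the point is on its own class's side."""
    score = w[0] * point[0] + w[1] * point[1] + b
    return label * score / math.hypot(*w)

distances = [distance_to_boundary(p, y) for p, y in data]
half_width = min(distances)        # boundary to the nearest point
street_width = 2 * half_width      # full width of the street

# The support vectors are the points that sit exactly on the street's edges.
support_vectors = [p for (p, _), d in zip(data, distances)
                   if math.isclose(d, half_width)]

print(street_width)
print(support_vectors)
```

Four of the six points land on the edges of the street; the other two are farther away and play no role in its width.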

3. What happens to the boundary if you move a point far from it? A support vector?

Show answer

Moving a far-away point changes nothing, as long as it stays outside the street; the street stays put. Moving a support vector shifts the whole boundary. The model is defined entirely by its closest, hardest examples.
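A one-dimensional sketch shows this behavior. It assumes the simplest possible case: points on a line, perfectly separable, where the widest-street boundary is just the midpoint between the two closest opposite-class points. The function name and the data are hypothetical.

```python
def max_margin_threshold(neg, pos):
    """1D hard-margin case, assuming every neg value lies below every pos value.

    Returns (boundary, half_width).
    """
    left, right = max(neg), min(pos)   # the two support vectors
    if left >= right:
        raise ValueError("classes overlap; this sketch needs separable data")
    return (left + right) / 2, (right - left) / 2

pos = [7.0, 8.0, 9.0]

base = max_margin_threshold([1.0, 2.0, 3.0], pos)
far_moved = max_margin_threshold([0.0, 2.0, 3.0], pos)   # moved 1.0 -> 0.0
sv_moved = max_margin_threshold([1.0, 2.0, 4.0], pos)    # moved 3.0 -> 4.0

print(base)       # (5.0, 2.0)
print(far_moved)  # (5.0, 2.0)  unchanged
print(sv_moved)   # (5.5, 1.5)  boundary shifted, street narrower
```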

4. What is a soft margin and what does the tradeoff dial (C) control?

Show answer

A soft margin allows some points inside the street or on the wrong side, at a penalty, because real data overlaps. The dial trades margin width against training errors: a small C gives a wider, more forgiving street (often better generalization), while a large C gives a narrow, tightly fit one (risk of overfitting).
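The tradeoff can be sketched with the common soft-margin objective, 0.5 * w^2 + C * (sum of hinge penalties), on a line. This is an illustration under assumptions, not a trained model: the data and the two candidate boundaries are hand-picked, and the code only compares their scores at a small and a large C.

```python
# Hypothetical 1D data; the negative point at 3.5 sits close to the positives.
data = [(0.0, -1), (1.0, -1), (3.5, -1), (4.0, +1), (5.0, +1), (6.0, +1)]

def objective(w, b, C):
    """0.5 * w^2 plus C times the total hinge penalty max(0, 1 - y * (w*x + b))."""
    penalty = sum(max(0.0, 1.0 - y * (w * x + b)) for x, y in data)
    return 0.5 * w * w + C * penalty

# Two hand-picked candidates (street width is 2 / w):
wide = (1.0, -3.0)     # boundary at 3.0, width 2.0, the point at 3.5 is misclassified
narrow = (4.0, -15.0)  # boundary at 3.75, width 0.5, every point classified correctly

for C in (0.1, 100.0):
    scores = {"wide": objective(*wide, C), "narrow": objective(*narrow, C)}
    print(C, min(scores, key=scores.get), scores)
```

With C = 0.1 the wide street scores lower despite its one error; with C = 100 the error becomes too expensive and the narrow street wins.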

5. What problem does the kernel trick solve?

Show answer

A straight boundary cannot separate classes that are not linearly separable (like a ring around a center). The kernel trick lets the method draw curved boundaries.

6. How does the kernel trick work, in words?

Show answer

It lifts the data into a higher dimension where a flat boundary can separate the classes, then lets that boundary fold back into a curved one in the original space. A kernel function computes the needed relationships directly, without ever building the high-dimensional coordinates.
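The "without ever building the coordinates" part can be checked numerically. The sketch below uses one specific kernel as an example, the degree-2 polynomial kernel k(x, z) = (x . z)^2 on 2D points, and compares it with the dot product of an explicit three-dimensional lift. The function names and sample points are illustrative choices.

```python
import math

def lift(p):
    """Explicit lift of a 2D point into 3D: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = p
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def poly2_kernel(p, q):
    """Degree-2 polynomial kernel, computed in the original 2D space only."""
    return dot(p, q) ** 2

x = (1.0, 2.0)
z = (3.0, 4.0)

via_lift = dot(lift(x), lift(z))   # build the 3D coordinates, then take the dot product
via_kernel = poly2_kernel(x, z)    # never leaves 2D

print(via_lift, via_kernel)        # both are 121, up to floating-point rounding
```

The saving is small here, but it grows quickly: for higher degrees or the RBF kernel, the implied lifted space is very large or infinite-dimensional, while the kernel stays a single cheap formula.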

7. Why must you scale features before training a support vector machine?

Show answer

Because an SVM is distance-based: a feature on a large numeric scale would swamp one on a small scale and distort the margin. Rescale features to comparable ranges before training (decision trees, by contrast, are insensitive to feature scale).
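A small sketch of the swamping effect, using made-up rows of (income in dollars, age in years). It measures how much of the squared distance between two rows comes from income, before and after standardizing each feature to zero mean and unit standard deviation. Standardization is one common choice of rescaling, assumed here for illustration.

```python
import statistics

# Hypothetical rows: (income in dollars, age in years).
rows = [(30000.0, 25.0), (31000.0, 70.0), (50000.0, 30.0), (70000.0, 65.0)]

def standardize(rows):
    """Rescale each column to zero mean and unit (population) standard deviation."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]

def income_share(a, b):
    """Fraction of the squared Euclidean distance contributed by the income feature."""
    d_income = (a[0] - b[0]) ** 2
    d_age = (a[1] - b[1]) ** 2
    return d_income / (d_income + d_age)

scaled = standardize(rows)
share_raw = income_share(rows[0], rows[1])
share_scaled = income_share(scaled[0], scaled[1])

print(round(share_raw, 4), round(share_scaled, 4))
```

The first two rows differ by a 45-year age gap and only a small income gap, yet on the raw numbers almost all of the distance comes from income. After standardizing, age dominates, as it should.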

Try it yourself: lift it with a kernel

These points sit on a single line and cannot be separated by one threshold, because class IN is in the middle and class OUT is on both ends:

IN  (class 1):  at  -2, -1, 0, 1, 2
OUT (class 2):  at  -4, -3, 3, 4

Add a second coordinate equal to each point’s value squared. Write the squared value for each class, and find a horizontal threshold on the squared values that separates IN from OUT.

Show answer

IN  squared:  4, 1, 0, 1, 4   -> range 0 to 4
OUT squared:  16, 9, 9, 16    -> range 9 to 16

Any threshold between 4 and 9 works; about 6.5 cleanly separates them: IN has squared values at most 4 (below 6.5), OUT has at least 9 (above 6.5). A straight cut in the lifted (squared) space is a curved rule back on the original line: “IN if close to zero, OUT if far.” That is the lifting idea behind the kernel trick in miniature; a kernel gets the same effect without building the squared coordinate explicitly.
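The same exercise as a short script, for checking your arithmetic. The midpoint threshold and the predict helper are illustrative choices; any threshold strictly between 4 and 9 would do.

```python
inside = [-2, -1, 0, 1, 2]    # class IN
outside = [-4, -3, 3, 4]      # class OUT

# Lift: add a second coordinate equal to the value squared.
in_sq = [x * x for x in inside]      # [4, 1, 0, 1, 4]
out_sq = [x * x for x in outside]    # [16, 9, 9, 16]

# Put the cut halfway between the largest IN value and the smallest OUT value.
threshold = (max(in_sq) + min(out_sq)) / 2   # 6.5

def predict(x):
    """Flat cut in the lifted space; on the original line it means |x| < sqrt(6.5)."""
    return "IN" if x * x < threshold else "OUT"

print(threshold)
print([predict(x) for x in inside + outside])
```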

Try it yourself: which points matter?

You train a support vector machine on 1,000 labeled points. After training, its boundary turns out to be determined by just 8 of them. What are those 8 points called, and what happens to the boundary if you delete the other 992 and retrain on only those 8?

Show answer

The 8 points are the support vectors, the ones on the edges of the margin. If you delete the other 992 (all of which sat farther from the boundary) and retrain on just the 8 with the same settings, you get the same boundary. Only the support vectors shape it; the rest of the data is, for the boundary’s purposes, irrelevant. This is why a support vector machine is memory-efficient at prediction time: it effectively only needs to remember its support vectors.
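A scaled-down version of this experiment can be run on a line. The sketch assumes the simplest setting: 1,000 random, perfectly separable 1D points, where the widest-street boundary is the midpoint between the two closest opposite-class points. In that setting there are only 2 support vectors rather than 8, but the point is the same. The fit_1d helper is hypothetical.

```python
import random

def fit_1d(neg, pos):
    """Hard-margin fit on a line, assuming all neg values lie below all pos values.

    Returns (boundary, support_vectors).
    """
    left, right = max(neg), min(pos)
    return (left + right) / 2, [left, right]

random.seed(0)
neg = [random.uniform(0.0, 4.0) for _ in range(500)]
pos = [random.uniform(6.0, 10.0) for _ in range(500)]

# Train on all 1,000 points, then retrain on the support vectors alone.
boundary_all, support = fit_1d(neg, pos)
boundary_sv_only, _ = fit_1d([support[0]], [support[1]])

print(boundary_all == boundary_sv_only)   # True
```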

Flashcards

Ten cards.

Q. Which boundary does a support vector machine choose?

The maximum-margin boundary: the one running down the middle of the widest possible street between the two classes.

Q. What is the margin?

The width of the street between the classes, with the boundary in the middle and the nearest data points of each class on its edges. The SVM maximizes it.

Q. What are support vectors?

The data points sitting on the edges of the margin, closest to the boundary. They alone determine the boundary; the rest of the data does not matter.

Q. What happens if you move a support vector?

The whole boundary shifts. Moving a point far from the boundary, by contrast, changes nothing as long as it stays outside the margin.

Q. What is the soft margin?

Allowing some points inside the street or on the wrong side, at a penalty, since real data overlaps. A parameter (C) trades margin width against training errors.

Q. What does the kernel trick do?

It lifts data into a higher dimension where a flat boundary separates the classes. That flat boundary becomes a curved one back in the original space, letting an SVM separate classes that are not linearly separable.

Q. Why is it called a 'trick'?

Because a kernel function computes the needed relationships directly, without ever building the high-dimensional coordinates, so it stays fast even when the implied space is huge.

Q. Name two common kernels.

The polynomial kernel and the radial basis function (RBF) kernel. RBF is a common default for curved boundaries.

Q. Why must you scale features for an SVM?

It is distance-based, so a large-scale feature would swamp a small-scale one and distort the margin. Rescale first (decision trees, by contrast, do not need this).

Q. Name one strength and one weakness of SVMs.

Strength (any): effective in high dimensions, memory-efficient, handles non-linear boundaries via kernels. Weakness (any): slow on huge datasets, sensitive to kernel choice, no native probabilities, less interpretable.