Maximum margin classifiers. Kernel trick for nonlinearity.
Support Vector Machines (SVMs) are one of the cleanest examples of how geometry, convex optimization, and linear algebra combine into a powerful ML algorithm: pick the separating hyperplane that leaves the widest safety buffer (margin) between classes—and if the data isn’t linearly separable, quietly switch to a richer feature space using a kernel, without ever computing those features explicitly.
An SVM chooses a decision boundary that maximizes the margin (roughly, the distance to the closest training points). Only the closest points—support vectors—determine the solution, producing a sparse model in the dual form. With the kernel trick, inner products are replaced by a positive-definite kernel $K(\mathbf{x},\mathbf{x}')$, enabling nonlinear decision boundaries while solving a convex optimization problem.
In binary classification you often want a rule that separates two classes as reliably as possible. If the data are roughly linearly separable, many hyperplanes can separate them—but some are fragile: a tiny perturbation in the data could flip predictions.
SVMs add a strong geometric preference: among all separating hyperplanes, choose the one that maximizes the distance (margin) to the closest training points.
That “margin” is a built-in robustness buffer. Intuitively, if the boundary is far from the training points, small noise in the inputs is less likely to cross the boundary.
A linear classifier uses a hyperplane:

$$\mathbf{w}\cdot\mathbf{x}+b=0.$$

For a point $\mathbf{x}$, its signed distance to the hyperplane is:

$$d(\mathbf{x})=\frac{\mathbf{w}\cdot\mathbf{x}+b}{\|\mathbf{w}\|}.$$

So the distance scales like $1/\|\mathbf{w}\|$. To talk about margins in a consistent way, SVMs use canonical scaling: choose the scale of $\mathbf{w}$ and $b$ so that the closest points satisfy

$$y_i(\mathbf{w}\cdot\mathbf{x}_i+b)=1.$$

Under that choice, the distance from the boundary to the closest points is $1/\|\mathbf{w}\|$, and the full band between the two margin lines has width $2/\|\mathbf{w}\|$.

Many texts call $1/\|\mathbf{w}\|$ the margin (distance from boundary to the closest points), while others emphasize the full band width $2/\|\mathbf{w}\|$. Either way, maximizing the margin is equivalent to minimizing $\|\mathbf{w}\|$.

If the data are perfectly separable, we want:

$$y_i(\mathbf{w}\cdot\mathbf{x}_i+b)\ge 1 \quad \text{for all } i.$$

This gives the classic hard-margin SVM primal problem:

$$\min_{\mathbf{w},b}\ \tfrac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad y_i(\mathbf{w}\cdot\mathbf{x}_i+b)\ge 1.$$

Why $\tfrac{1}{2}\|\mathbf{w}\|^2$ rather than $\|\mathbf{w}\|$? It’s convex and differentiable, and the factor $\tfrac{1}{2}$ cancels nicely in derivatives.
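These formulas are easy to check numerically. Below is a minimal sketch using hypothetical values $\mathbf{w}=(3,4)$, $b=-3$ (not from any fit), showing the signed distance and the canonical-scaling margin quantities:

```python
import numpy as np

# Hypothetical classifier parameters (illustration only, not fitted).
w = np.array([3.0, 4.0])
b = -3.0

def signed_distance(x, w, b):
    """Signed Euclidean distance from point x to the hyperplane w.x + b = 0."""
    return (w @ x + b) / np.linalg.norm(w)

# Under canonical scaling the closest points have |w.x + b| = 1,
# so the margin is 1/||w|| and the full band width is 2/||w||.
margin = 1.0 / np.linalg.norm(w)       # 1/5 = 0.2
band_width = 2.0 / np.linalg.norm(w)   # 2/5 = 0.4
```

For example, the point $(1,1)$ has decision value $3+4-3=4$ and signed distance $4/5=0.8$.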
Interactive canvas idea (1): show a 2D dataset with two classes. Provide sliders to rotate/translate a candidate hyperplane and display the current margin. Let the user move the boundary and watch the margin shrink/grow, with a live readout of $2/\|\mathbf{w}\|$.
Static diagram (for non-canvas readers):
<svg xmlns="http://www.w3.org/2000/svg" width="720" height="260" viewBox="0 0 720 260" role="img" aria-label="SVM margin: decision boundary and two margin lines with support vectors">
<rect x="0" y="0" width="720" height="260" fill="#ffffff"/>
<!-- axes -->
<line x1="60" y1="220" x2="680" y2="220" stroke="#333" stroke-width="2"/>
<line x1="60" y1="220" x2="60" y2="30" stroke="#333" stroke-width="2"/>
<!-- decision boundary and margins -->
<line x1="170" y1="220" x2="510" y2="30" stroke="#1f77b4" stroke-width="3"/>
<line x1="140" y1="220" x2="480" y2="30" stroke="#1f77b4" stroke-width="2" stroke-dasharray="8,6"/>
<line x1="200" y1="220" x2="540" y2="30" stroke="#1f77b4" stroke-width="2" stroke-dasharray="8,6"/>
<text x="520" y="60" font-family="sans-serif" font-size="14" fill="#1f77b4">w·x+b=0</text>
<text x="545" y="80" font-family="sans-serif" font-size="12" fill="#1f77b4">w·x+b=+1</text>
<text x="455" y="90" font-family="sans-serif" font-size="12" fill="#1f77b4">w·x+b=-1</text>
<!-- points (class +1) -->
<circle cx="520" cy="70" r="7" fill="#2ca02c"/>
<circle cx="600" cy="110" r="7" fill="#2ca02c"/>
<circle cx="610" cy="60" r="7" fill="#2ca02c"/>
<!-- points (class -1) -->
<circle cx="150" cy="185" r="7" fill="#d62728"/>
<circle cx="210" cy="170" r="7" fill="#d62728"/>
<circle cx="250" cy="205" r="7" fill="#d62728"/>
<!-- support vectors (highlighted) -->
<circle cx="520" cy="70" r="12" fill="none" stroke="#000" stroke-width="2"/>
<circle cx="210" cy="170" r="12" fill="none" stroke="#000" stroke-width="2"/>
<text x="90" y="45" font-family="sans-serif" font-size="14" fill="#000">Support vectors lie on the dashed margin lines</text>
</svg>This diagram emphasizes a key idea you’ll return to: the boundary is “pinned” by the closest points.
Perfect separability is rare. Noise, overlap, and mislabeled points are common.
If you insist on $y_i(\mathbf{w}\cdot\mathbf{x}_i+b)\ge 1$ for every point, you may get an infeasible problem (no hyperplane satisfies every constraint) or a boundary dominated by a few outliers.

SVMs handle this with slack variables $\xi_i \ge 0$ that allow violations:

$$y_i(\mathbf{w}\cdot\mathbf{x}_i+b)\ge 1-\xi_i,\qquad \xi_i\ge 0.$$

Interpretation: $\xi_i = 0$ means the point satisfies the margin; $0 < \xi_i < 1$ means it lies inside the margin band but is still correctly classified; $\xi_i > 1$ means it is misclassified.

We now trade off large margin vs. violations:

$$\min_{\mathbf{w},b,\boldsymbol{\xi}}\ \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i.$$

$C$ controls the trade-off: large $C$ penalizes violations heavily (narrower margin, closer fit to the data); small $C$ tolerates violations (wider margin, smoother boundary).

A helpful way to remember this: $C$ is the price of each unit of slack—raise the price, and the optimizer works harder to avoid violations.
You can rewrite the constrained soft-margin problem into an unconstrained form using the hinge loss

$$\ell_{\text{hinge}}(z) = \max(0,\,1-z),$$

where $z = y_i(\mathbf{w}\cdot\mathbf{x}_i+b)$.

At optimum, $\xi_i$ becomes exactly the hinge loss:

$$\xi_i = \max(0,\,1-y_i(\mathbf{w}\cdot\mathbf{x}_i+b)).$$

So the primal is equivalent to:

$$\min_{\mathbf{w},b}\ \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \max(0,\,1-y_i(\mathbf{w}\cdot\mathbf{x}_i+b)).$$

This is a useful lens because it makes SVMs feel like “regularized empirical risk minimization,” just with hinge loss instead of logistic loss.
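The unconstrained hinge-loss objective above is a one-liner to evaluate. A minimal sketch, with two assumed toy points:

```python
import numpy as np

def hinge_objective(w, b, X, y, C):
    """Soft-margin primal in its unconstrained hinge-loss form:
    0.5*||w||^2 + C * sum_i max(0, 1 - y_i*(w.x_i + b))."""
    margins = y * (X @ w + b)
    return 0.5 * (w @ w) + C * np.sum(np.maximum(0.0, 1.0 - margins))

# Two hypothetical points, one per class.
X = np.array([[2.0, 0.0], [-2.0, 0.0]])
y = np.array([1.0, -1.0])
w = np.array([1.0, 0.0])  # classifies both with functional margin 2 >= 1
obj = hinge_objective(w, 0.0, X, y, C=1.0)  # hinge terms vanish -> 0.5*||w||^2
```

Because both points satisfy the margin, only the regularization term $\tfrac{1}{2}\|\mathbf{w}\|^2 = 0.5$ remains.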
Interactive canvas idea: a slider for $C$ that recomputes the separating line (in the 2D linear case) and updates the margin width and the number of violations. Learners should see that increasing $C$ often pulls the boundary toward outliers to fix them, shrinking the margin.
Because SVMs depend on inner products and distances, feature scaling matters.
If one feature has values in the thousands and another in tenths, the large-scale feature dominates inner products and distances. Standard practice: standardize each feature to zero mean and unit variance, using statistics computed on the training set only.
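Standardization is a few lines with numpy. A minimal sketch on assumed toy data with wildly different feature scales:

```python
import numpy as np

# Toy data (assumed): feature 1 in the thousands, feature 2 in tenths.
X = np.array([[1000.0, 0.1],
              [2000.0, 0.2],
              [3000.0, 0.3]])

mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_std = (X - mu) / sigma  # reuse mu/sigma to transform test data identically
```

After this, each column has zero mean and unit variance, so neither feature dominates a dot product or a kernel distance.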
Both hard- and soft-margin SVMs are convex problems: a convex quadratic objective with affine (linear) inequality constraints.
That convexity is why SVMs historically earned a reputation for reliability: there is a single global optimum (up to degeneracies).
But the most elegant part is what happens when we transform the problem into its dual: it will reveal support vectors and the kernel trick naturally.
You already know Lagrange multipliers for equality constraints. For SVMs we have inequality constraints, so we use the Karush–Kuhn–Tucker (KKT) framework.
The payoff for deriving the dual is big: the solution becomes sparse (only support vectors appear), and the data enter only through inner products—exactly what the kernel trick needs.
We’ll derive the soft-margin dual (hard-margin is a special case).
Primal (soft-margin) again:

$$\min_{\mathbf{w},b,\boldsymbol{\xi}}\ \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i$$

subject to

$$y_i(\mathbf{w}\cdot\mathbf{x}_i+b)\ge 1-\xi_i,\qquad \xi_i\ge 0.$$

Introduce Lagrange multipliers: $\alpha_i \ge 0$ for the margin constraints and $\mu_i \ge 0$ for $\xi_i \ge 0$.

The Lagrangian is:

$$\mathcal{L} = \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i - \sum_i \alpha_i\left[y_i(\mathbf{w}\cdot\mathbf{x}_i+b) - 1 + \xi_i\right] - \sum_i \mu_i \xi_i.$$

Take partial derivatives and set to zero.

With respect to $\mathbf{w}$:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = \mathbf{w} - \sum_i \alpha_i y_i \mathbf{x}_i = 0 \quad\Longrightarrow\quad \mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i.$$

This is the first major result: $\mathbf{w}$ is a linear combination of training points.

With respect to $b$:

$$\sum_i \alpha_i y_i = 0.$$

With respect to $\xi_i$:

$$C - \alpha_i - \mu_i = 0.$$

Since $\mu_i \ge 0$, we get the box constraint:

$$0 \le \alpha_i \le C.$$

Substitute into the Lagrangian and eliminate $\mathbf{w}$ using stationarity. After simplification, the dual becomes:

$$\max_{\boldsymbol{\alpha}}\ \sum_i \alpha_i - \tfrac{1}{2}\sum_i\sum_j \alpha_i\alpha_j y_i y_j\, \mathbf{x}_i\cdot\mathbf{x}_j$$

subject to

$$\sum_i \alpha_i y_i = 0,\qquad 0 \le \alpha_i \le C.$$

This is a convex quadratic program in $\boldsymbol{\alpha}$ (maximize a concave quadratic).

Once you solve for $\boldsymbol{\alpha}$, the decision function is:

$$f(\mathbf{x}) = \sum_i \alpha_i y_i\, \mathbf{x}_i\cdot\mathbf{x} + b.$$

Only points with $\alpha_i > 0$ contribute. These are the support vectors.
From KKT complementary slackness:

$$\alpha_i\left[y_i(\mathbf{w}\cdot\mathbf{x}_i+b) - 1 + \xi_i\right] = 0,\qquad \mu_i \xi_i = 0.$$

So: if $\alpha_i = 0$, the point satisfies the margin strictly; if $0 < \alpha_i < C$, then $\mu_i > 0$ forces $\xi_i = 0$ and the point lies exactly on the margin; if $\alpha_i = C$, the point may have $\xi_i > 0$ (inside the margin or misclassified).
This yields a practical taxonomy:
| Point location | Condition | Typical $\alpha_i$ | Role |
|---|---|---|---|
| Outside margin | $y_i f(\mathbf{x}_i) > 1$ | $\alpha_i = 0$ | irrelevant to boundary |
| On margin | $y_i f(\mathbf{x}_i) = 1$ | $\alpha_i \in (0, C)$ | “geometric” support vector |
| Inside margin / misclassified | $y_i f(\mathbf{x}_i) < 1$ | $\alpha_i = C$ | “error” support vector |
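The stationarity and feasibility conditions are easy to verify on a tiny hand-solvable example. A minimal sketch, assuming a two-point hard-margin problem symmetric about the origin, whose known solution is $\mathbf{w}=(1,0)$, $b=0$ with dual coefficients $\alpha = (0.5, 0.5)$:

```python
import numpy as np

# Assumed toy problem: one point per class, mirror images of each other.
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])  # known dual solution for this configuration

# Stationarity: w is a linear combination of the training points.
w = (alpha * y) @ X
# Dual feasibility: sum_i alpha_i y_i = 0.
balance = alpha @ y
# Both points attain y_i (w.x_i + b) = 1 (with b = 0): both are support vectors.
functional_margins = y * (X @ w + 0.0)
```

Here $\mathbf{w} = 0.5(+1)(1,0) + 0.5(-1)(-1,0) = (1,0)$, and both points sit exactly on the margin, matching the "on margin, $\alpha_i \in (0,C)$" row of the taxonomy.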
Interactive canvas idea (2): show a solved linear SVM in 2D with support vectors highlighted. Allow the user to drag individual points and recompute the SVM.
Expected visual lesson: moving a non-support vector (without crossing the margin) leaves the boundary unchanged, while moving a support vector shifts it.
To make this explicit, display each point’s $\alpha_i$ value.
This turns “only support vectors matter” from a slogan into an observed fact.
Once you have $\boldsymbol{\alpha}$, you can compute $b$ using any support vector with $0 < \alpha_i < C$ (i.e., on-margin, not at the box constraint). For such a point, $\xi_i = 0$ and

$$y_i(\mathbf{w}\cdot\mathbf{x}_i+b) = 1.$$

So (using $y_i^2 = 1$)

$$b = y_i - \mathbf{w}\cdot\mathbf{x}_i = y_i - \sum_j \alpha_j y_j\, \mathbf{x}_j\cdot\mathbf{x}_i.$$

In practice you average over all margin support vectors to reduce numerical noise.
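The averaging step looks like this in code—a minimal sketch, reusing the assumed two-point symmetric example (for which $b = 0$):

```python
import numpy as np

def bias_from_margin_svs(alpha, y, X, C, tol=1e-8):
    """Recover b by averaging over on-margin support vectors (0 < alpha < C).
    For such points y_i (w.x_i + b) = 1, hence
    b = y_i - sum_j alpha_j y_j K(x_j, x_i) with a linear kernel here."""
    K = X @ X.T                  # linear kernel Gram matrix
    scores = (alpha * y) @ K     # sum_j alpha_j y_j x_j.x_i, for each i
    on_margin = (alpha > tol) & (alpha < C - tol)
    return np.mean(y[on_margin] - scores[on_margin])

# Assumed toy solution: alpha = (0.5, 0.5), C = 1, symmetric points.
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
b = bias_from_margin_svs(np.array([0.5, 0.5]), y, X, C=1.0)
```

Averaging instead of picking a single support vector matters once $\boldsymbol{\alpha}$ comes from an iterative solver with finite tolerance.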
Prediction evaluates

$$f(\mathbf{x}) = \sum_{i \in \text{SV}} \alpha_i y_i\, \mathbf{x}_i\cdot\mathbf{x} + b.$$

If the number of support vectors is small, this is fast. This sparsity is a real advantage over methods that require all points at prediction time.
But note the caveat: with some kernels and some hyperparameter choices, the number of support vectors can become large (even close to $n$), making prediction slower.
A linear separator in the original input space might be impossible even if the data are “simple” in a different representation.
Classic example: concentric circles. No line separates inner vs outer ring.
Idea: map inputs through a feature map $\phi$ so that classes become linearly separable in feature space:

$$\mathbf{x} \mapsto \phi(\mathbf{x}).$$

But explicitly constructing $\phi$ could be expensive or infinite-dimensional.

In the dual, data only appear inside dot products:

$$\mathbf{x}_i\cdot\mathbf{x}_j.$$

If we instead operate in feature space, we would need:

$$\phi(\mathbf{x}_i)\cdot\phi(\mathbf{x}_j).$$

A kernel is a function

$$K(\mathbf{x},\mathbf{x}') = \phi(\mathbf{x})\cdot\phi(\mathbf{x}')$$

for some (possibly implicit) feature map $\phi$, provided $K$ is positive-definite (Mercer condition).

Then the dual becomes:

$$\max_{\boldsymbol{\alpha}}\ \sum_i \alpha_i - \tfrac{1}{2}\sum_i\sum_j \alpha_i\alpha_j y_i y_j\, K(\mathbf{x}_i,\mathbf{x}_j)$$

and prediction becomes:

$$f(\mathbf{x}) = \sum_{i \in \text{SV}} \alpha_i y_i\, K(\mathbf{x}_i,\mathbf{x}) + b.$$

That’s the kernel trick: you get nonlinear decision boundaries in input space while solving a convex problem that only needs kernel evaluations.
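The Mercer condition has a concrete numerical face: the Gram matrix $K_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$ must be symmetric positive semi-definite. A minimal sketch with an RBF kernel on random assumed data:

```python
import numpy as np

def rbf_kernel(X1, X2, gamma):
    """K(x, x') = exp(-gamma * ||x - x'||^2), computed for all pairs."""
    sq = (np.sum(X1**2, axis=1)[:, None]
          + np.sum(X2**2, axis=1)[None, :]
          - 2.0 * X1 @ X2.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))  # clamp tiny negatives

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))      # 20 random points in 3-D (assumed data)
K = rbf_kernel(X, X, gamma=0.5)

# Mercer in practice: symmetric Gram matrix with nonnegative eigenvalues.
eigvals = np.linalg.eigvalsh(K)
```

The diagonal is exactly 1 (each point is maximally similar to itself), and the smallest eigenvalue is nonnegative up to floating-point noise.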
| Kernel | Formula | Main parameter(s) | What it implies |
|---|---|---|---|
| Linear | $K=\mathbf{x}\cdot\mathbf{x}'$ | none | linear boundary in input space |
| Polynomial | $K=(\mathbf{x}\cdot\mathbf{x}'+c)^d$ | degree $d$, offset $c$ | interactions up to degree $d$ |
| RBF / Gaussian | $K=\exp(-\gamma\,\Vert\mathbf{x}-\mathbf{x}'\Vert^2)$ | $\gamma$ | local similarity; flexible smooth boundaries |
| Sigmoid (less common) | $K=\tanh(\kappa\,\mathbf{x}\cdot\mathbf{x}'+c)$ | $\kappa$, $c$ | related to neural nets; not always PD |
A practical intuition for RBF: $\gamma$ sets how quickly similarity decays with distance—small $\gamma$ gives wide, overlapping bumps (smooth boundaries), large $\gamma$ gives narrow bumps (very local, wiggly boundaries).
And remember: $C$ and kernel parameters interact. High $C$ + high $\gamma$ can easily overfit.
Interactive canvas idea (3): present a concentric-circles dataset.
Panel A (input space): show points and the learned nonlinear boundary for an RBF SVM.
Panel B (a chosen feature space view): show an illustrative mapping where the same points become linearly separable.
For circles, an instructive explicit mapping is to use a radial feature like $r^2 = x_1^2 + x_2^2$. In a 1D feature space of $r^2$, the classes may separate by a threshold (a “linear” separator in that 1D feature). More generally, you can visualize a 3D feature map:

$$\phi(x_1, x_2) = (x_1,\ x_2,\ x_1^2 + x_2^2).$$

Then a plane in this 3D space can correspond to a circle-like boundary when projected back to 2D.
Important honesty: the RBF kernel’s actual feature space is infinite-dimensional, so Panel B is an illustration of the concept, not literally the RBF feature map.
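The radial-feature idea can be checked directly. A minimal sketch on assumed synthetic concentric circles (inner radius 1, outer radius 3): no line separates them in $(x_1, x_2)$, but a single threshold on $r^2$ does.

```python
import numpy as np

# Assumed synthetic data: points on two concentric circles.
rng = np.random.default_rng(1)
theta = rng.uniform(0.0, 2 * np.pi, size=100)
inner = 1.0 * np.column_stack([np.cos(theta[:50]), np.sin(theta[:50])])
outer = 3.0 * np.column_stack([np.cos(theta[50:]), np.sin(theta[50:])])

# The explicit radial feature r^2 = x1^2 + x2^2.
r2_inner = np.sum(inner**2, axis=1)  # all equal to 1
r2_outer = np.sum(outer**2, axis=1)  # all equal to 9

# Any threshold strictly between the two radii-squared separates the classes.
threshold = 4.0
separable = (r2_inner < threshold).all() and (r2_outer > threshold).all()
```

In this 1D feature space the "decision boundary" is just the point $r^2 = 4$; mapped back to input space it is a circle.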
Static diagram (for non-canvas readers):
<svg xmlns="http://www.w3.org/2000/svg" width="720" height="300" viewBox="0 0 720 300" role="img" aria-label="Kernel idea: circles not separable in 2D become separable after mapping using radius-squared feature">
<rect x="0" y="0" width="720" height="300" fill="#fff"/>
<text x="60" y="30" font-family="sans-serif" font-size="16" fill="#000">Input space (x₁,x₂): concentric circles</text>
<text x="420" y="30" font-family="sans-serif" font-size="16" fill="#000">Feature (r²): linear threshold</text>
<!-- left axes -->
<line x1="60" y1="250" x2="330" y2="250" stroke="#333" stroke-width="2"/>
<line x1="60" y1="250" x2="60" y2="60" stroke="#333" stroke-width="2"/>
<!-- circles of points -->
<circle cx="195" cy="155" r="35" fill="none" stroke="#d62728" stroke-width="2"/>
<circle cx="195" cy="155" r="80" fill="none" stroke="#2ca02c" stroke-width="2"/>
<!-- sample points -->
<circle cx="195" cy="120" r="5" fill="#d62728"/>
<circle cx="230" cy="155" r="5" fill="#d62728"/>
<circle cx="195" cy="190" r="5" fill="#d62728"/>
<circle cx="165" cy="155" r="5" fill="#d62728"/>
<circle cx="195" cy="75" r="5" fill="#2ca02c"/>
<circle cx="275" cy="155" r="5" fill="#2ca02c"/>
<circle cx="195" cy="235" r="5" fill="#2ca02c"/>
<circle cx="115" cy="155" r="5" fill="#2ca02c"/>
<!-- right axes for r^2 -->
<line x1="420" y1="250" x2="690" y2="250" stroke="#333" stroke-width="2"/>
<line x1="420" y1="250" x2="420" y2="60" stroke="#333" stroke-width="2"/>
<text x="560" y="275" font-family="sans-serif" font-size="12">r²</text>
<!-- threshold -->
<line x1="560" y1="250" x2="560" y2="60" stroke="#1f77b4" stroke-width="2" stroke-dasharray="6,5"/>
<text x="565" y="80" font-family="sans-serif" font-size="12" fill="#1f77b4">threshold</text>
<!-- points on line (r^2) -->
<circle cx="500" cy="170" r="6" fill="#d62728"/>
<circle cx="510" cy="140" r="6" fill="#d62728"/>
<circle cx="520" cy="200" r="6" fill="#d62728"/>
<circle cx="620" cy="120" r="6" fill="#2ca02c"/>
<circle cx="640" cy="180" r="6" fill="#2ca02c"/>
<circle cx="650" cy="150" r="6" fill="#2ca02c"/>
<text x="430" y="55" font-family="sans-serif" font-size="12" fill="#000">(conceptual) mapping: (x₁,x₂) → r²=x₁²+x₂²</text>
</svg>In practice, an SVM’s performance depends heavily on hyperparameters: the trade-off $C$, the kernel choice, and kernel parameters such as the RBF width $\gamma$ or polynomial degree $d$.
A typical approach: grid search over $C$ and $\gamma$ on logarithmic scales, scored by cross-validation on standardized features.
Strengths: maximum-margin solutions are robust; training is convex with a global optimum; prediction is sparse via support vectors; kernels give nonlinear boundaries without explicit feature construction; effective on small-to-medium datasets.
Limitations: kernel SVMs scale poorly to very large $n$ (the kernel matrix is $n \times n$); performance is sensitive to $C$ and kernel parameters; outputs are not naturally probabilistic; feature scaling is essential.
Everything about “kernel methods” generalizes beyond SVMs: ridge regression, PCA variants, Gaussian processes, etc. The SVM is your first major encounter with the idea that inner products are the computational interface to a possibly huge feature space.
You’re now ready for: Kernel Methods.
Suppose a linear classifier is given by $\mathbf{w} = (3, 4)$ and $b = -3$ (illustrative values). Consider the point $\mathbf{x} = (1, 1)$. (1) Compute its signed distance to the decision boundary. (2) Compute the geometric margin band width $2/\|\mathbf{w}\|$. (3) Explain why you can’t read off the SVM margin unless the classifier is in canonical scaling.

Compute the norm of $\mathbf{w}$:

$$\|\mathbf{w}\| = \sqrt{3^2 + 4^2} = 5.$$

Compute the signed value of the decision function:

$$\mathbf{w}\cdot\mathbf{x} + b = 3 + 4 - 3 = 4.$$

Convert to signed distance:

$$d = \frac{\mathbf{w}\cdot\mathbf{x}+b}{\|\mathbf{w}\|} = \frac{4}{5} = 0.8.$$

So the point lies on the positive side of the hyperplane, 0.8 units away (in Euclidean distance).

Compute the margin band width (distance between $\mathbf{w}\cdot\mathbf{x}+b=+1$ and $\mathbf{w}\cdot\mathbf{x}+b=-1$):

$$\frac{2}{\|\mathbf{w}\|} = \frac{2}{5} = 0.4.$$

Explain scaling:

If we scale $(\mathbf{w}, b)$ by any constant $c > 0$, the decision boundary is unchanged because

$$c\,\mathbf{w}\cdot\mathbf{x} + c\,b = 0$$

is equivalent.

But $\|c\,\mathbf{w}\| = c\,\|\mathbf{w}\|$, so $2/\|\mathbf{w}\|$ would change even though the boundary is the same.

SVM’s canonical scaling fixes this ambiguity by enforcing that the closest points satisfy $y_i(\mathbf{w}\cdot\mathbf{x}_i+b) = 1$, making the margin a well-defined geometric quantity.

Insight: The “margin” isn’t just about the boundary line; it’s about the boundary plus a chosen scale. SVM selects the scale by pinning the closest points to functional value ±1, turning $2/\|\mathbf{w}\|$ into a true geometric distance.
You trained a (linear) soft-margin SVM and obtained nonzero dual coefficients for only two training points (illustrative values):

$$\alpha_1 = 0.5,\ y_1 = +1,\ \mathbf{x}_1 = (2, 0); \qquad \alpha_2 = 0.5,\ y_2 = -1,\ \mathbf{x}_2 = (0, 2).$$

Assume $b = 0$. (1) Compute $\mathbf{w}$. (2) Write the decision function $f(\mathbf{x})$. (3) Classify $\mathbf{x} = (2, 1)$.

Compute $\mathbf{w}$ using $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$:

$$\mathbf{w} = 0.5(+1)(2, 0) + 0.5(-1)(0, 2) = (1, -1).$$

Write the decision function:

$$f(\mathbf{x}) = \mathbf{w}\cdot\mathbf{x} + b = x_1 - x_2.$$

Evaluate at $\mathbf{x} = (2, 1)$:

$$f(2, 1) = 2 - 1 = 1.$$

Classify using sign:

$$\hat{y} = \operatorname{sign}(1) = +1.$$

Insight: Even if you had 10,000 training points, if only two have nonzero α, prediction depends only on those two points. That’s the operational meaning of “support vectors determine the classifier.”
You have a kernel SVM with two support vectors (illustrative values):

$$\alpha_1 = 0.8,\ y_1 = +1,\ \mathbf{x}_1 = (0, 0); \qquad \alpha_2 = 0.8,\ y_2 = -1,\ \mathbf{x}_2 = (2, 2).$$

Bias $b = 0.2$. Use an RBF kernel with $\gamma = 0.5$.

Compute $f(\mathbf{z})$ and the predicted label for $\mathbf{z} = (0.5, 0.5)$.

Write the kernel decision function:

$$f(\mathbf{z}) = \sum_i \alpha_i y_i\, K(\mathbf{x}_i, \mathbf{z}) + b, \qquad K(\mathbf{x}, \mathbf{x}') = \exp(-\gamma\,\|\mathbf{x} - \mathbf{x}'\|^2).$$

Compute $K(\mathbf{x}_1, \mathbf{z})$:

$$\|\mathbf{x}_1 - \mathbf{z}\|^2 = 0.5^2 + 0.5^2 = 0.5,$$

so

$$K(\mathbf{x}_1, \mathbf{z}) = \exp(-0.5 \cdot 0.5) = \exp(-0.25) \approx 0.7788.$$

Compute $K(\mathbf{x}_2, \mathbf{z})$:

$$\|\mathbf{x}_2 - \mathbf{z}\|^2 = 1.5^2 + 1.5^2 = 4.5,$$

so

$$K(\mathbf{x}_2, \mathbf{z}) = \exp(-0.5 \cdot 4.5) = \exp(-2.25) \approx 0.1054.$$

Assemble the score:

$$f(\mathbf{z}) = 0.8(+1)(0.7788) + 0.8(-1)(0.1054) + 0.2$$

$$\approx 0.6230 - 0.0843 + 0.2.$$

Numerical approximation: $f(\mathbf{z}) \approx 0.739$.

Predict label:

$$\hat{y} = \operatorname{sign}(0.739) = +1.$$

Insight: Kernel SVM prediction looks like a weighted vote of similarities to support vectors. With RBF, each support vector contributes most to nearby points and fades with distance.
SVMs choose the separating hyperplane that maximizes the margin; under canonical scaling, the margin is $1/\|\mathbf{w}\|$ (band width $2/\|\mathbf{w}\|$).
Hard-margin SVM requires perfect separability; soft-margin SVM introduces slack variables $\xi_i$ and trade-off parameter $C$.
The soft-margin objective corresponds to minimizing $\tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \max(0,\,1-y_i(\mathbf{w}\cdot\mathbf{x}_i+b))$ (hinge loss + L2 regularization).
In the dual, the solution is expressed by coefficients $\alpha_i$ with constraints $0 \le \alpha_i \le C$ and $\sum_i \alpha_i y_i = 0$.
Only points with $\alpha_i > 0$ matter at prediction time; these are the support vectors, which “pin” the optimal boundary.
The kernel trick replaces dot products with a positive-definite kernel $K(\mathbf{x},\mathbf{x}')$, enabling nonlinear decision boundaries while keeping convex optimization.
Hyperparameters ($C$, kernel parameters like RBF $\gamma$) strongly affect bias/variance and the number of support vectors; scaling features is essential.
SVMs are powerful for small-to-medium datasets and can be very robust, but kernel SVMs can be costly for very large $n$ due to the $n \times n$ kernel matrix.
Confusing the decision boundary scale ambiguity: scaling $(\mathbf{w}, b)$ changes $\|\mathbf{w}\|$ (and hence the apparent margin $2/\|\mathbf{w}\|$) but not the boundary; the SVM margin definition relies on canonical scaling.
Forgetting to standardize features before training, causing one feature to dominate inner products and distorting margins and kernel behavior.
Assuming all training points influence the solution equally; in SVMs, non-support vectors often have α = 0 and do not affect the classifier.
Overfitting with RBF kernels by choosing both large $\gamma$ and large $C$, producing very wiggly boundaries and many support vectors.
You are given a hyperplane $\mathbf{w} = (3, 4)$, $b = -5$ (illustrative values). (a) Compute the distance from $\mathbf{x} = (2, 1)$ to the hyperplane. (b) What is the margin band width $2/\|\mathbf{w}\|$? (c) If you scale $(\mathbf{w}, b)$ by 10, do (a) and (b) change?

Hint: Use $d = (\mathbf{w}\cdot\mathbf{x}+b)/\|\mathbf{w}\|$ and remember $\|\mathbf{w}\| = \sqrt{w_1^2 + w_2^2}$.

(a) $\|\mathbf{w}\| = \sqrt{9 + 16} = 5$. Compute $\mathbf{w}\cdot\mathbf{x} + b = 6 + 4 - 5 = 5$. Distance $= 5/5 = 1$.

(b) Band width $= 2/\|\mathbf{w}\| = 2/5 = 0.4$.

(c) Scaling $(\mathbf{w}, b)$ by 10 leaves the boundary unchanged. The signed distance to the boundary is unchanged because numerator and denominator both scale by 10. But the expression $2/\|\mathbf{w}\|$ computed from the scaled $\mathbf{w}$ becomes $2/50 = 0.04$, which shows why margin is only meaningful once canonical scaling is fixed.
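The scaling behavior is worth seeing numerically. A minimal sketch, with hypothetical values $\mathbf{w}=(3,4)$, $b=-5$, $\mathbf{x}=(2,1)$: multiplying $(\mathbf{w}, b)$ by 10 leaves every signed distance unchanged, while $2/\|\mathbf{w}\|$ shrinks by the same factor of 10.

```python
import numpy as np

# Hypothetical hyperplane and test point (illustration only).
w = np.array([3.0, 4.0])
b = -5.0
x = np.array([2.0, 1.0])

def dist(w, b, x):
    """Signed distance from x to the hyperplane w.x + b = 0."""
    return (w @ x + b) / np.linalg.norm(w)

d1 = dist(w, b, x)
d2 = dist(10 * w, 10 * b, x)         # same boundary -> same distance
band1 = 2.0 / np.linalg.norm(w)
band2 = 2.0 / np.linalg.norm(10 * w) # "margin" readout changes: 10x smaller
```

This is exactly the ambiguity that canonical scaling removes: the geometry is invariant, but the naive $2/\|\mathbf{w}\|$ readout is not.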
A soft-margin SVM solution (with $C = 1$) has three points with coefficients: $\alpha_1 = 0$, $\alpha_2 = 0.5$, $\alpha_3 = 1 = C$ (illustrative values). (a) Which points are support vectors? (b) Which point is likely violating the margin? (c) What condition must hold among labels and coefficients $\alpha_i$?

Hint: Support vectors have $\alpha_i > 0$. Points with $\alpha_i = C$ are often inside the margin or misclassified. The dual equality constraint is $\sum_i \alpha_i y_i = 0$.

(a) Points 2 and 3 are support vectors because they have $\alpha_i > 0$.

(b) Point 3 (with $\alpha_3 = C$) is likely a margin violator (inside the margin and/or misclassified).

(c) The coefficients must satisfy the dual constraint $\sum_i \alpha_i y_i = 0$, taken over all training points.
Consider an RBF kernel $K(\mathbf{x}, \mathbf{x}') = \exp(-\gamma\,\|\mathbf{x} - \mathbf{x}'\|^2)$. (a) What happens to $K$ as $\gamma \to 0$? (b) What happens as $\gamma \to \infty$ for $\mathbf{x} \ne \mathbf{x}'$? (c) How would these extremes affect the flexibility of an SVM decision boundary?

Hint: Use limits of $\exp(-\gamma\,\|\mathbf{x} - \mathbf{x}'\|^2)$ as $\gamma$ changes; interpret kernel value as similarity/influence.

(a) As $\gamma \to 0$, $K(\mathbf{x}, \mathbf{x}') \to 1$ for all pairs. The kernel matrix becomes approximately all ones (very low effective complexity).

(b) As $\gamma \to \infty$ with $\mathbf{x} \ne \mathbf{x}'$, we have $\|\mathbf{x} - \mathbf{x}'\|^2 > 0$, so $-\gamma\,\|\mathbf{x} - \mathbf{x}'\|^2 \to -\infty$ and $K \to 0$. Also, $K(\mathbf{x}, \mathbf{x}) = 1$ always. The kernel matrix approaches the identity.

(c) Small $\gamma$ makes all points look similar, encouraging very smooth/simple boundaries (can underfit). Very large $\gamma$ makes similarity extremely local, allowing highly flexible boundaries that can interpolate noise (risk of overfitting), especially with large $C$.
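Both extremes show up directly in the Gram matrix. A minimal sketch with three assumed toy points, using a tiny and a huge $\gamma$ as stand-ins for the two limits:

```python
import numpy as np

def rbf_gram(X, gamma):
    """Pairwise RBF Gram matrix K_ij = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
    return np.exp(-gamma * sq)

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])  # assumed toy points

K_small = rbf_gram(X, gamma=1e-6)  # gamma -> 0: everything looks similar
K_large = rbf_gram(X, gamma=1e6)   # gamma -> inf: only self-similarity survives
```

With tiny $\gamma$ the Gram matrix is nearly all ones; with huge $\gamma$ it is nearly the identity—matching answers (a) and (b) above.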
Next: Kernel Methods
Related nodes you may also connect in your mental map: kernel ridge regression, kernel PCA, and Gaussian processes.