Support Vector Machines

Machine Learning · Difficulty: ████ · Depth: 9 · Unlocks: 1

Maximum margin classifiers. Kernel trick for nonlinearity.


Core Concepts

  • Maximum-margin separating hyperplane (decision boundary w·x+b=0 with canonical scaling so margin = 1/||w||)
  • Support vectors and sparsity (only training points with active constraints determine the classifier)
  • Kernel trick (use of a positive-definite kernel as an implicit inner product to enable nonlinear separation)

Key Symbols & Notation

  • $\alpha_i$ — dual coefficient (Lagrange multiplier) for training example $i$
  • $K(\mathbf{x},\mathbf{x}')$ — positive-definite kernel function; an inner product in feature space

Essential Relationships

  • Primal-dual and kernelized decision function: w = sum_i alpha_i y_i x_i, so f(x)=sign(sum_i alpha_i y_i K(x_i,x)+b); only alpha_i>0 (support vectors) contribute and margin = 1/||w|| under canonical scaling
All Concepts (26)

  • Linear separating hyperplane as classifier: decision boundary defined by w·x + b = 0
  • Geometric margin: signed distance from a point to the separating hyperplane
  • Maximum-margin principle: choose hyperplane that maximizes the minimum (geometric) margin
  • Support vectors: training examples that lie on or inside the margin and determine the solution
  • Hard-margin SVM: maximum-margin classifier when data are linearly separable (no errors allowed)
  • Soft-margin SVM: margin maximization allowing classification errors via slack variables
  • Slack variables (ξ_i): nonnegative variables measuring margin violations for each training point
  • Regularization parameter C: trade-off parameter between margin size and slack (misclassification) penalty
  • Hinge loss: loss function max(0, 1 - y f(x)) that underlies soft-margin SVM
  • Primal SVM optimization problem: objective and constraints in w, b, and ξ (quadratic objective plus linear constraints)
  • Dual SVM optimization problem: quadratic program in dual variables (α_i) involving only inner products of training points
  • Representation theorem for SVMs: primal weight vector w expressed as a linear combination of training points weighted by α_i y_i
  • Kernel function K(x, x'): a function that computes an inner product in some (possibly high- or infinite-dimensional) feature space
  • Kernel trick: replace inner products ⟨φ(x), φ(x')⟩ with K(x,x') to train non-linear SVMs without explicit feature mapping
  • Kernel (Gram) matrix: matrix of pairwise kernel evaluations K_ij = K(x_i, x_j)
  • Positive semi-definiteness / Mercer condition for kernels: condition that a function be a valid inner-product kernel
  • Common kernel families and their qualitative effects (linear, polynomial, Gaussian/RBF, sigmoid)
  • Dual-to-primal link: bias term b and decision function recovered from dual solution (α_i)
  • Sparsity of dual solution: only support vectors have nonzero α_i, leading to sparse decision function
  • KKT conditions specialized to SVM: complementary slackness implications relating α_i, ξ_i, and classification margin
  • Bounds on dual variables in soft-margin: 0 ≤ α_i ≤ C and equality constraint sum_i α_i y_i = 0
  • Decision function in kernelized form: f(x) = sign( sum_i α_i y_i K(x_i, x) + b )
  • Interpretation of C as inverse regularization strength and its effect on margin/generalization
  • Equivalence/relationship between minimizing ||w|| (or ||w||^2) and maximizing margin
  • Influence of kernel choice on implicit feature space dimensionality (e.g., RBF → infinite-dimensional)
  • Computational implications: training scales with number of training examples (quadratic/greater for dense kernels); prediction cost scales with number of support vectors

Teaching Strategy

Multi-session curriculum: this node assumes substantial prior knowledge and covers complex material. Use mastery gates and deliberate practice.

Support Vector Machines (SVMs) are one of the cleanest examples of how geometry, convex optimization, and linear algebra combine into a powerful ML algorithm: pick the separating hyperplane that leaves the widest safety buffer (margin) between classes—and if the data isn’t linearly separable, quietly switch to a richer feature space using a kernel, without ever computing those features explicitly.

TL;DR:

An SVM chooses a decision boundary $\mathbf{w}\cdot\mathbf{x}+b=0$ that maximizes the margin (roughly, the distance to the closest training points). Only the closest points—support vectors—determine the solution, producing a sparse model in the dual form. With the kernel trick, inner products $\phi(\mathbf{x})\cdot\phi(\mathbf{x}')$ are replaced by a positive-definite kernel $K(\mathbf{x},\mathbf{x}')$, enabling nonlinear decision boundaries while solving a convex optimization problem.

What Is a Support Vector Machine?

The problem SVMs are trying to solve (why before how)

In binary classification you often want a rule that separates two classes as reliably as possible. If the data are roughly linearly separable, many hyperplanes can separate them—but some are fragile: a tiny perturbation in the data could flip predictions.

SVMs add a strong geometric preference:

  • Don’t just separate the classes.
  • Separate them with the largest possible margin.

That “margin” is a built-in robustness buffer. Intuitively, if the boundary is far from the training points, small noise in the inputs is less likely to cross the boundary.

The decision boundary

A linear classifier uses a hyperplane:

$\mathbf{w}\cdot\mathbf{x}+b=0$

  • $\mathbf{w}$ is the normal vector (perpendicular to the boundary)
  • $b$ is the offset (bias)
  • the prediction is typically $\hat{y}=\operatorname{sign}(\mathbf{w}\cdot\mathbf{x}+b)$ with labels $y\in\{+1,-1\}$

What “margin” means (geometrically)

For a point $\mathbf{x}$, its signed distance to the hyperplane is:

$\text{dist}(\mathbf{x}, \mathbf{w}, b)=\frac{\mathbf{w}\cdot\mathbf{x}+b}{\|\mathbf{w}\|}$

So the distance scales like $1/\|\mathbf{w}\|$. To talk about margins in a consistent way, SVMs use canonical scaling: choose the scale of $\mathbf{w}$ and $b$ so that the closest points satisfy

$y_i(\mathbf{w}\cdot\mathbf{x}_i+b)=1$

Under that choice:

  • the two "margin" hyperplanes are $\mathbf{w}\cdot\mathbf{x}+b=+1$ and $\mathbf{w}\cdot\mathbf{x}+b=-1$
  • the distance between them is $\frac{2}{\|\mathbf{w}\|}$

Many texts call the margin $\frac{1}{\|\mathbf{w}\|}$ (the distance from the boundary to the closest points), while others emphasize the full band width $\frac{2}{\|\mathbf{w}\|}$. Either way, maximizing the margin is equivalent to minimizing $\|\mathbf{w}\|$.
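To make these quantities concrete, here is a small pure-Python sketch (function names and data are illustrative) that computes the signed distance to a hyperplane and the margin band width $2/\|\mathbf{w}\|$:

```python
import math

def signed_distance(w, b, x):
    """Signed distance from 2D point x to the hyperplane w . x + b = 0."""
    return (w[0] * x[0] + w[1] * x[1] + b) / math.hypot(w[0], w[1])

def margin_band_width(w):
    """Distance between the two margin hyperplanes w . x + b = +1 and -1."""
    return 2.0 / math.hypot(w[0], w[1])

w, b = (2.0, 0.0), -2.0                     # boundary: x1 = 1
print(signed_distance(w, b, (3.0, 0.0)))    # 2.0 units on the positive side
print(margin_band_width(w))                 # 1.0

# Rescaling (w, b) leaves the distance unchanged but shrinks 2/||w||:
print(signed_distance((20.0, 0.0), -20.0, (3.0, 0.0)))  # still 2.0
print(margin_band_width((20.0, 0.0)))                    # 0.1
```

The last two lines preview exactly why canonical scaling is needed before $2/\|\mathbf{w}\|$ can be read as "the margin."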

The “maximum margin” optimization (hard-margin)

If the data are perfectly separable, we want:

  • correct classification with margin constraints: $y_i(\mathbf{w}\cdot\mathbf{x}_i+b)\ge 1$ for all $i$
  • maximum margin, i.e. minimize $\|\mathbf{w}\|$.

This gives the classic hard-margin SVM primal problem:

$\min_{\mathbf{w},b}\; \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{s.t.}\quad y_i(\mathbf{w}\cdot\mathbf{x}_i+b)\ge 1 \;\; \forall i$

Why $\frac{1}{2}\|\mathbf{w}\|^2$? It's convex and differentiable, and the factor $1/2$ cancels nicely in derivatives.

Visualization focus: margin vs decision boundary

Interactive canvas idea (1): show a 2D dataset with two classes. Provide sliders to rotate/translate a candidate hyperplane and display:

  • the decision boundary $\mathbf{w}\cdot\mathbf{x}+b=0$
  • the two margin lines $\mathbf{w}\cdot\mathbf{x}+b=\pm 1$
  • the current margin width $2/\|\mathbf{w}\|$

Let the user move the boundary and watch the margin shrink/grow, with a live readout of $\|\mathbf{w}\|$.

Static diagram (for non-canvas readers):

<svg xmlns="http://www.w3.org/2000/svg" width="720" height="260" viewBox="0 0 720 260" role="img" aria-label="SVM margin: decision boundary and two margin lines with support vectors">
  <rect x="0" y="0" width="720" height="260" fill="#ffffff"/>
  <!-- axes -->
  <line x1="60" y1="220" x2="680" y2="220" stroke="#333" stroke-width="2"/>
  <line x1="60" y1="220" x2="60" y2="30" stroke="#333" stroke-width="2"/>
  <!-- decision boundary and margins -->
  <line x1="170" y1="220" x2="510" y2="30" stroke="#1f77b4" stroke-width="3"/>
  <line x1="140" y1="220" x2="480" y2="30" stroke="#1f77b4" stroke-width="2" stroke-dasharray="8,6"/>
  <line x1="200" y1="220" x2="540" y2="30" stroke="#1f77b4" stroke-width="2" stroke-dasharray="8,6"/>
  <text x="520" y="60" font-family="sans-serif" font-size="14" fill="#1f77b4">w·x+b=0</text>
  <text x="545" y="80" font-family="sans-serif" font-size="12" fill="#1f77b4">w·x+b=+1</text>
  <text x="455" y="90" font-family="sans-serif" font-size="12" fill="#1f77b4">w·x+b=-1</text>
  <!-- points (class +1) -->
  <circle cx="520" cy="70" r="7" fill="#2ca02c"/>
  <circle cx="600" cy="110" r="7" fill="#2ca02c"/>
  <circle cx="610" cy="60" r="7" fill="#2ca02c"/>
  <!-- points (class -1) -->
  <circle cx="150" cy="185" r="7" fill="#d62728"/>
  <circle cx="210" cy="170" r="7" fill="#d62728"/>
  <circle cx="250" cy="205" r="7" fill="#d62728"/>
  <!-- support vectors (highlighted) -->
  <circle cx="520" cy="70" r="12" fill="none" stroke="#000" stroke-width="2"/>
  <circle cx="210" cy="170" r="12" fill="none" stroke="#000" stroke-width="2"/>
  <text x="90" y="45" font-family="sans-serif" font-size="14" fill="#000">Support vectors lie on the dashed margin lines</text>
</svg>

This diagram emphasizes a key idea you’ll return to: the boundary is “pinned” by the closest points.

Core Mechanic 1: Maximum-Margin Optimization and the Soft Margin

Why the hard-margin version is not enough

Perfect separability is rare. Noise, overlap, and mislabeled points are common.

If you insist on $y_i(\mathbf{w}\cdot\mathbf{x}_i+b)\ge 1$ for every point, you may get:

  • infeasibility (no solution)
  • or an overly complex boundary in feature space (when kernels are used)

SVMs handle this with slack variables $\xi_i\ge 0$ that allow violations:

$y_i(\mathbf{w}\cdot\mathbf{x}_i+b)\ge 1-\xi_i$

Interpretation:

  • $\xi_i=0$: the point is correctly classified and on or outside the margin.
  • $0<\xi_i<1$: correctly classified but inside the margin.
  • $\xi_i\ge 1$: misclassified.
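The three regimes are easy to check numerically. A minimal sketch (the scores are made up) that computes $\xi_i = \max(0,\, 1 - y_i f(\mathbf{x}_i))$ and names the regime each point falls into:

```python
def slack(y, score):
    """Slack xi = max(0, 1 - y * f(x)) for label y in {+1, -1} and score f(x)."""
    return max(0.0, 1.0 - y * score)

# (label, f(x)) pairs chosen to cover the three regimes
cases = [(+1, 1.5),   # on/outside margin: xi = 0
         (+1, 0.4),   # inside margin:     0 < xi < 1
         (-1, 0.2)]   # misclassified:     xi >= 1

for y, score in cases:
    xi = slack(y, score)
    regime = ("on/outside margin" if xi == 0.0
              else "inside margin" if xi < 1.0
              else "misclassified")
    print(f"y={y:+d}, f(x)={score:+.1f} -> xi={xi:.1f} ({regime})")
```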

The soft-margin primal objective

We now trade off large margin vs. violations:

$\min_{\mathbf{w},b,\boldsymbol{\xi}}\; \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^n \xi_i \quad\text{s.t.}\quad y_i(\mathbf{w}\cdot\mathbf{x}_i+b)\ge 1-\xi_i,\;\; \xi_i\ge 0$

$C>0$ controls the trade-off:

  • large $C$: violations are expensive → narrower margin, fewer training errors
  • small $C$: violations are tolerated → wider margin, possibly more training errors

A helpful way to remember this:

  • $\frac{1}{2}\|\mathbf{w}\|^2$ is a capacity/complexity penalty
  • $\sum_i \xi_i$ is a training loss (a linear penalty on margin violations)

Connecting to hinge loss

You can rewrite the constrained soft-margin problem into an unconstrained form using the hinge loss

$\ell_{\text{hinge}}(y, f)=\max(0,\, 1-yf)$

where $f(\mathbf{x})=\mathbf{w}\cdot\mathbf{x}+b$.

At the optimum, $\xi_i$ becomes exactly the hinge loss:

$\xi_i = \max\big(0,\, 1-y_i(\mathbf{w}\cdot\mathbf{x}_i+b)\big)$

So the primal is equivalent to:

$\min_{\mathbf{w},b}\; \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^n \max\big(0,\, 1-y_i(\mathbf{w}\cdot\mathbf{x}_i+b)\big)$

This is a useful lens because it makes SVMs feel like “regularized empirical risk minimization,” just with hinge loss instead of logistic loss.
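Seen through that lens, the soft-margin objective can be evaluated directly for any candidate $(\mathbf{w}, b)$. A minimal sketch with made-up 2D data:

```python
def hinge(y, score):
    """Hinge loss max(0, 1 - y * f(x)) for label y in {+1, -1}."""
    return max(0.0, 1.0 - y * score)

def primal_objective(w, b, C, data):
    """(1/2)||w||^2 + C * sum of hinge losses over (x, y) pairs (2D inputs)."""
    reg = 0.5 * (w[0] ** 2 + w[1] ** 2)
    loss = sum(hinge(y, w[0] * x[0] + w[1] * x[1] + b) for x, y in data)
    return reg + C * loss

data = [((2.0, 0.0), +1), ((-2.0, 0.0), -1), ((0.5, 0.0), +1)]
# With w = (1, 0), b = 0 the third point sits inside the margin and pays 0.5.
obj = primal_objective((1.0, 0.0), 0.0, 1.0, data)
print(obj)  # 0.5 (regularizer) + 1.0 * 0.5 (hinge) = 1.0
```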

Visualization focus: how C changes the solution

Interactive canvas idea: a slider for $C$ that recomputes the separating line (in the 2D linear case) and updates:

  • margin width $2/\|\mathbf{w}\|$
  • the number of margin violations
  • which points are support vectors

Learners should see that increasing $C$ often pulls the boundary toward outliers to fix them, shrinking the margin.

A careful note about scaling

Because SVMs depend on inner products and distances, feature scaling matters.

If one feature has values in the thousands and another in the tenths, the large-scale feature dominates $\mathbf{w}\cdot\mathbf{x}$ and $\|\mathbf{w}\|$. Standard practice:

  • standardize features (zero mean, unit variance) or apply a similar normalization
  • tune $C$ (and kernel parameters) after scaling
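A minimal standardization sketch (pure Python, per-feature z-scores), the kind of preprocessing you would apply before tuning $C$:

```python
import math

def standardize(columns):
    """Per-feature z-scores: subtract the mean, divide by the population std."""
    out = []
    for col in columns:
        mean = sum(col) / len(col)
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
        out.append([(v - mean) / std for v in col])
    return out

# One feature in the thousands, one in the tenths: wildly different scales.
raw = [[1000.0, 2000.0, 3000.0], [0.1, 0.2, 0.3]]
scaled = standardize(raw)
# After scaling, both features live on the same scale.
print(scaled[0])
print(scaled[1])
```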

Where convex optimization shows up

Both hard- and soft-margin SVMs are convex problems:

  • quadratic objective in $\mathbf{w}$
  • linear constraints in $(\mathbf{w}, b, \boldsymbol{\xi})$

That convexity is why SVMs historically earned a reputation for reliability: there is a single global optimum (up to degeneracies).

But the most elegant part is what happens when we transform the problem into its dual: it will reveal support vectors and the kernel trick naturally.

Core Mechanic 2: Support Vectors, the Dual Problem, and Sparsity

Why we go to the dual

You already know Lagrange multipliers for equality constraints. For SVMs we have inequality constraints, so we use the Karush–Kuhn–Tucker (KKT) framework.

The payoff for deriving the dual is big:

  1. The classifier can be written entirely in terms of dot products $\mathbf{x}_i\cdot\mathbf{x}_j$.
  2. The solution becomes sparse: only some points have nonzero coefficients.
  3. That dot-product-only form is exactly what kernels replace.

We’ll derive the soft-margin dual (hard-margin is a special case).

Step 1: Set up constraints and multipliers

Primal (soft-margin) again:

$\min_{\mathbf{w},b,\boldsymbol{\xi}}\; \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i$

subject to

$y_i(\mathbf{w}\cdot\mathbf{x}_i+b)\ge 1-\xi_i, \quad \xi_i\ge 0$

Introduce Lagrange multipliers:

  • $\alpha_i\ge 0$ for the margin constraints
  • $\mu_i\ge 0$ for the slack nonnegativity constraints $\xi_i\ge 0$

The Lagrangian is:

$\mathcal{L}(\mathbf{w},b,\boldsymbol{\xi},\boldsymbol{\alpha},\boldsymbol{\mu}) = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i - \sum_i \alpha_i\big(y_i(\mathbf{w}\cdot\mathbf{x}_i+b)-1+\xi_i\big) - \sum_i \mu_i\xi_i$

Step 2: Stationarity conditions (minimize over primal variables)

Take partial derivatives and set to zero.

With respect to $\mathbf{w}$:

$\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = \mathbf{w} - \sum_i \alpha_i y_i \mathbf{x}_i = 0 \;\Rightarrow\; \mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$

This is the first major result: $\mathbf{w}$ is a linear combination of the training points.

With respect to $b$:

$\frac{\partial \mathcal{L}}{\partial b} = -\sum_i \alpha_i y_i = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0$

With respect to $\xi_i$:

$\frac{\partial \mathcal{L}}{\partial \xi_i} = C - \alpha_i - \mu_i = 0 \;\Rightarrow\; \alpha_i + \mu_i = C$

Since $\mu_i\ge 0$, we get the box constraint:

$0 \le \alpha_i \le C$

Step 3: Plug back in → the dual objective

Substitute $\mathbf{w}=\sum_i \alpha_i y_i \mathbf{x}_i$ into the Lagrangian and eliminate $\boldsymbol{\xi}$ using stationarity. After simplification, the dual becomes:

$\max_{\boldsymbol{\alpha}}\; \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j y_i y_j\, (\mathbf{x}_i\cdot\mathbf{x}_j)$

subject to

$0\le \alpha_i\le C, \quad \sum_i \alpha_i y_i=0$

This is a convex quadratic program in $\boldsymbol{\alpha}$ (maximize a concave quadratic).

Step 4: The classifier in terms of α

Once you solve for the $\alpha_i$, the decision function is:

$f(\mathbf{x})=\mathbf{w}\cdot\mathbf{x}+b =\Big(\sum_i \alpha_i y_i \mathbf{x}_i\Big)\cdot\mathbf{x}+b =\sum_i \alpha_i y_i\, (\mathbf{x}_i\cdot\mathbf{x}) + b$

Only points with $\alpha_i\ne 0$ contribute. These are the support vectors.
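A minimal pure-Python sketch of this dual-form classifier (the $\alpha$ values are illustrative, not from a real solver); note that the prediction uses only dot products with the training points:

```python
def recover_w(alphas, ys, xs):
    """Primal weights w = sum_i alpha_i * y_i * x_i for 2D points."""
    w = [0.0, 0.0]
    for a, y, x in zip(alphas, ys, xs):
        w[0] += a * y * x[0]
        w[1] += a * y * x[1]
    return w

def dual_decision(alphas, ys, xs, b, x):
    """f(x) = sum_i alpha_i * y_i * (x_i . x) + b -- dot products only."""
    return sum(a * y * (xi[0] * x[0] + xi[1] * x[1])
               for a, y, xi in zip(alphas, ys, xs)) + b

alphas = [0.0, 0.5, 0.5]      # first point: alpha = 0, not a support vector
ys = [+1, +1, -1]
xs = [(3.0, 3.0), (1.0, 0.0), (0.0, 1.0)]

w_rec = recover_w(alphas, ys, xs)
score = dual_decision(alphas, ys, xs, 0.0, (2.0, 1.0))
print(w_rec)   # [0.5, -0.5]
print(score)   # 0.5 -- identical to w . x + b
```

The point with $\alpha=0$ contributes nothing: delete it and both outputs stay the same.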

What exactly are “support vectors”?

From KKT complementary slackness:

$\alpha_i\big(y_i(\mathbf{w}\cdot\mathbf{x}_i+b)-1+\xi_i\big)=0$

So:

  • If a point is comfortably outside the margin: $y_i(\mathbf{w}\cdot\mathbf{x}_i+b)>1$ and $\xi_i=0$ → typically $\alpha_i=0$.
  • If a point lies exactly on the margin: $y_i(\mathbf{w}\cdot\mathbf{x}_i+b)=1$ and $\xi_i=0$ → $0<\alpha_i<C$.
  • If a point violates the margin: $y_i(\mathbf{w}\cdot\mathbf{x}_i+b)<1$ → usually $\alpha_i=C$ (at the upper bound) when it is a "hard" violator.

This yields a practical taxonomy:

| Point location | Condition | Typical α | Role |
|---|---|---|---|
| Outside margin | $y f(\mathbf{x})>1$ | $0$ | irrelevant to the boundary |
| On margin | $y f(\mathbf{x})=1$ | $(0, C)$ | "geometric" support vector |
| Inside margin / misclassified | $y f(\mathbf{x})<1$ | $C$ | "error" support vector |

Visualization focus: support vectors control the boundary

Interactive canvas idea (2): show a solved linear SVM in 2D with support vectors highlighted. Allow the user to:

  1. drag any non-support point a moderate amount
  2. drag a support vector a moderate amount

and recompute the SVM.

Expected visual lesson:

  • moving non-support points often does not change the boundary (or changes it very little)
  • moving a support vector noticeably moves/rotates the boundary

To make this explicit, display:

  • list/count of support vectors
  • α values next to points (e.g., tiny labels)
  • “boundary change” metric (angle shift, bias shift)

This turns “only support vectors matter” from a slogan into an observed fact.

How do we find b?

Once you have $\boldsymbol{\alpha}$, you can compute $b$ using any support vector with $0<\alpha_i<C$ (i.e., on the margin, not at the box constraint). For such a point, $\xi_i=0$ and

$y_i(\mathbf{w}\cdot\mathbf{x}_i+b)=1$

So

$b = y_i - \mathbf{w}\cdot\mathbf{x}_i$

In practice you average $b$ over all margin support vectors to reduce numerical noise.
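A minimal sketch of that recovery step (the $\alpha$ values and points are illustrative), averaging $b = y_i - \mathbf{w}\cdot\mathbf{x}_i$ over the on-margin support vectors only:

```python
def bias_from_margin_svs(w, alphas, ys, xs, C, eps=1e-8):
    """Average b = y_i - w . x_i over support vectors with 0 < alpha_i < C."""
    estimates = [y - (w[0] * x[0] + w[1] * x[1])
                 for a, y, x in zip(alphas, ys, xs)
                 if eps < a < C - eps]
    return sum(estimates) / len(estimates)

# Toy solution with boundary x1 = -1 (i.e., w = (1, 0), b = 1).
w = (1.0, 0.0)
alphas = [0.4, 0.4, 1.0]     # the third alpha sits at the box bound C
ys = [+1, -1, -1]
xs = [(0.0, 0.0), (-2.0, 1.0), (-4.0, 0.0)]

b = bias_from_margin_svs(w, alphas, ys, xs, C=1.0)
print(b)  # 1.0; the point at the box bound is excluded from the average
```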

Sparsity and prediction cost

Prediction evaluates

$f(\mathbf{x})=\sum_{i\in SV} \alpha_i y_i\, (\mathbf{x}_i\cdot\mathbf{x}) + b$

If the number of support vectors is small, this is fast. This sparsity is a real advantage over methods that require all training points at prediction time.

But note the caveat: with some kernels and some choices of $C$, the number of support vectors can become large (even close to $n$), making prediction slower.

Application/Connection: The Kernel Trick (Nonlinear SVMs)

Why kernels

A linear separator in the original input space might be impossible even if the data are “simple” in a different representation.

Classic example: concentric circles. No line separates inner vs outer ring.

Idea: map inputs through a feature map $\phi(\mathbf{x})$ so that the classes become linearly separable in feature space:

$f(\mathbf{x}) = \mathbf{w}\cdot\phi(\mathbf{x}) + b$

But explicitly constructing $\phi(\mathbf{x})$ could be expensive or infinite-dimensional.

The key observation

In the dual, the data only appear inside dot products:

$\mathbf{x}_i\cdot\mathbf{x}_j$

If we instead operate in feature space, we would need:

$\phi(\mathbf{x}_i)\cdot\phi(\mathbf{x}_j)$

A kernel is a function

$K(\mathbf{x},\mathbf{x}') = \phi(\mathbf{x})\cdot\phi(\mathbf{x}')$

for some (possibly implicit) feature map $\phi$, provided $K$ is positive-definite (the Mercer condition).

Then the dual becomes:

$\max_{\boldsymbol{\alpha}}\; \sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j \alpha_i\alpha_j y_i y_j\, K(\mathbf{x}_i,\mathbf{x}_j)$

and prediction becomes:

$f(\mathbf{x})=\sum_{i\in SV} \alpha_i y_i\, K(\mathbf{x}_i,\mathbf{x}) + b$

That's the kernel trick: you get nonlinear decision boundaries in input space while solving a convex problem that only needs kernel evaluations.
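As a minimal sketch (the support vectors and coefficients are made up), the only change from the linear dual-form predictor is swapping the dot product for a kernel call:

```python
import math

def rbf(x, z, gamma):
    """RBF kernel K(x, z) = exp(-gamma * ||x - z||^2) for 2D points."""
    sq = (x[0] - z[0]) ** 2 + (x[1] - z[1]) ** 2
    return math.exp(-gamma * sq)

def kernel_decision(alphas, ys, svs, b, gamma, x):
    """f(x) = sum over support vectors of alpha_i * y_i * K(x_i, x) + b."""
    return sum(a * y * rbf(sv, x, gamma)
               for a, y, sv in zip(alphas, ys, svs)) + b

alphas = [0.7, 0.3]
ys = [+1, -1]
svs = [(0.0, 0.0), (2.0, 0.0)]

score = kernel_decision(alphas, ys, svs, b=0.0, gamma=1.0, x=(0.0, 0.0))
label = +1 if score > 0 else -1
print(score, label)  # the nearby positive support vector dominates the vote
```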

Common kernels and what they “feel like”

| Kernel | Formula | Main parameter(s) | What it implies |
|---|---|---|---|
| Linear | $K(\mathbf{x},\mathbf{x}')=\mathbf{x}\cdot\mathbf{x}'$ | none | linear boundary in input space |
| Polynomial | $K=(\gamma\,\mathbf{x}\cdot\mathbf{x}'+r)^d$ | $d,\gamma,r$ | interactions up to degree $d$ |
| RBF / Gaussian | $K=\exp(-\gamma\|\mathbf{x}-\mathbf{x}'\|^2)$ | $\gamma$ | local similarity; flexible smooth boundaries |
| Sigmoid (less common) | $K=\tanh(\gamma\,\mathbf{x}\cdot\mathbf{x}'+r)$ | $\gamma,r$ | related to neural nets; not always PD |

A practical intuition for RBF:

  • small $\gamma$ → wide Gaussian → smoother, more global influence → simpler boundary
  • large $\gamma$ → narrow Gaussian → very local influence → potentially wiggly boundary

And remember: $C$ and the kernel parameters interact. High $C$ + high $\gamma$ can easily overfit.
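The $\gamma$ intuition is easy to see numerically: for two points at a fixed distance, the kernel value slides from near 1 (everything looks similar) to near 0 (strictly local) as $\gamma$ grows. A tiny sketch:

```python
import math

def rbf_value(d_squared, gamma):
    """RBF kernel value for a pair of points at squared distance d_squared."""
    return math.exp(-gamma * d_squared)

d2 = 2.0  # fixed squared distance between two points
for gamma in (0.01, 0.1, 1.0, 10.0, 100.0):
    print(f"gamma={gamma:6.2f} -> K = {rbf_value(d2, gamma):.6f}")
# small gamma: K near 1 -> smooth, global influence
# large gamma: K near 0 -> extremely local influence
```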

Visualization focus: concentric circles and feature space linearity

Interactive canvas idea (3): present a concentric-circles dataset.

Panel A (input space): show points and the learned nonlinear boundary for an RBF SVM.

Panel B (a chosen feature space view): show an illustrative mapping where the same points become linearly separable.

For circles, an instructive explicit mapping uses a radial feature like $r^2=x_1^2+x_2^2$. In the 1D feature space of $r^2$, the classes may separate by a threshold (a "linear" separator in that 1D feature). More generally, you can visualize a 3D feature map:

$\phi(x_1,x_2) = (x_1,\, x_2,\, x_1^2+x_2^2)$

Then a plane in this 3D space can correspond to a circle-like boundary when projected back to 2D.

Important honesty: the RBF kernel's actual feature space is infinite-dimensional, so Panel B is an illustration of the concept, not literally the RBF feature map.
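You can verify the radial-feature idea directly: synthetic rings that no straight line separates in $(x_1, x_2)$ are split by a single threshold on $r^2$. A sketch with illustrative radii:

```python
import math

def radial_feature(x):
    """phi(x) = r^2 = x1^2 + x2^2, an explicit 1D feature for ring data."""
    return x[0] ** 2 + x[1] ** 2

# Inner ring (radius 1, label -1) and outer ring (radius 3, label +1).
points = []
for k in range(8):
    t = 2.0 * math.pi * k / 8.0
    points.append(((math.cos(t), math.sin(t)), -1))
    points.append(((3.0 * math.cos(t), 3.0 * math.sin(t)), +1))

threshold = 4.0  # any value strictly between r^2 = 1 and r^2 = 9 works
results = [(+1 if radial_feature(x) > threshold else -1, y) for x, y in points]
print(all(pred == y for pred, y in results))  # True: one threshold separates the rings
```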

Static diagram (for non-canvas readers):

<svg xmlns="http://www.w3.org/2000/svg" width="720" height="300" viewBox="0 0 720 300" role="img" aria-label="Kernel idea: circles not separable in 2D become separable after mapping using radius-squared feature">
  <rect x="0" y="0" width="720" height="300" fill="#fff"/>
  <text x="60" y="30" font-family="sans-serif" font-size="16" fill="#000">Input space (x₁,x₂): concentric circles</text>
  <text x="420" y="30" font-family="sans-serif" font-size="16" fill="#000">Feature (r²): linear threshold</text>
  <!-- left axes -->
  <line x1="60" y1="250" x2="330" y2="250" stroke="#333" stroke-width="2"/>
  <line x1="60" y1="250" x2="60" y2="60" stroke="#333" stroke-width="2"/>
  <!-- circles of points -->
  <circle cx="195" cy="155" r="35" fill="none" stroke="#d62728" stroke-width="2"/>
  <circle cx="195" cy="155" r="80" fill="none" stroke="#2ca02c" stroke-width="2"/>
  <!-- sample points -->
  <circle cx="195" cy="120" r="5" fill="#d62728"/>
  <circle cx="230" cy="155" r="5" fill="#d62728"/>
  <circle cx="195" cy="190" r="5" fill="#d62728"/>
  <circle cx="165" cy="155" r="5" fill="#d62728"/>
  <circle cx="195" cy="75" r="5" fill="#2ca02c"/>
  <circle cx="275" cy="155" r="5" fill="#2ca02c"/>
  <circle cx="195" cy="235" r="5" fill="#2ca02c"/>
  <circle cx="115" cy="155" r="5" fill="#2ca02c"/>
  <!-- right axes for r^2 -->
  <line x1="420" y1="250" x2="690" y2="250" stroke="#333" stroke-width="2"/>
  <line x1="420" y1="250" x2="420" y2="60" stroke="#333" stroke-width="2"/>
  <text x="560" y="275" font-family="sans-serif" font-size="12">r²</text>
  <!-- threshold -->
  <line x1="560" y1="250" x2="560" y2="60" stroke="#1f77b4" stroke-width="2" stroke-dasharray="6,5"/>
  <text x="565" y="80" font-family="sans-serif" font-size="12" fill="#1f77b4">threshold</text>
  <!-- points on line (r^2) -->
  <circle cx="500" cy="170" r="6" fill="#d62728"/>
  <circle cx="510" cy="140" r="6" fill="#d62728"/>
  <circle cx="520" cy="200" r="6" fill="#d62728"/>
  <circle cx="620" cy="120" r="6" fill="#2ca02c"/>
  <circle cx="640" cy="180" r="6" fill="#2ca02c"/>
  <circle cx="650" cy="150" r="6" fill="#2ca02c"/>
  <text x="430" y="55" font-family="sans-serif" font-size="12" fill="#000">(conceptual) mapping: (x₁,x₂) → r²=x₁²+x₂²</text>
</svg>

Model selection: tuning C and kernel parameters

In practice, an SVM’s performance depends heavily on hyperparameters:

  • $C$ (regularization vs. margin violations)
  • kernel parameters (e.g., $\gamma$ for RBF, degree $d$ for polynomial)

A typical approach:

  • standardize features
  • perform cross-validation over a grid (often log-spaced)
  • choose parameters that optimize validation performance

Practical pros/cons (when to use SVMs)

Strengths

  • Convex optimization → no bad local minima
  • Effective in moderate dimensions
  • Kernel trick enables flexible nonlinear boundaries
  • Often strong performance on small-to-medium datasets

Limitations

  • Training can be expensive for very large $n$ (the kernel matrix is $n\times n$)
  • Prediction can be expensive if many support vectors
  • Less natural probabilistic outputs (though you can calibrate)
  • Kernel and hyperparameter choices matter a lot

Bridge to what you’ll unlock next

Everything about “kernel methods” generalizes beyond SVMs: ridge regression, PCA variants, Gaussian processes, etc. The SVM is your first major encounter with the idea that inner products are the computational interface to a possibly huge feature space.

You’re now ready for: Kernel Methods.

Worked Examples (3)

Example 1: Compute the margin from a given hyperplane (and see why scaling matters)

Suppose a linear classifier is given by $\mathbf{w}=(3,4)$ and $b=-10$. Consider the point $\mathbf{x}=(2,2)$. (1) Compute its signed distance to the decision boundary. (2) Compute the geometric margin band width $2/\|\mathbf{w}\|$. (3) Explain why you can't read off the SVM margin unless the classifier is in canonical scaling.

  1. Compute the norm of $\mathbf{w}$:

     $\|\mathbf{w}\|=\sqrt{3^2+4^2}=\sqrt{9+16}=5$

  2. Compute the value of the decision function:

     $\mathbf{w}\cdot\mathbf{x}+b = (3,4)\cdot(2,2) - 10 = (6+8)-10 = 4$

  3. Convert to signed distance:

     $\text{dist}(\mathbf{x},\mathbf{w},b)=\frac{\mathbf{w}\cdot\mathbf{x}+b}{\|\mathbf{w}\|}=\frac{4}{5}=0.8$

     So the point lies on the positive side of the hyperplane, 0.8 units away (in Euclidean distance).

  4. Compute the margin band width (the distance between $\mathbf{w}\cdot\mathbf{x}+b=+1$ and $=-1$):

     $\frac{2}{\|\mathbf{w}\|} = \frac{2}{5}=0.4$

  5. Explain scaling:

     If we scale $(\mathbf{w},b)$ by any constant $k>0$, the decision boundary $\mathbf{w}\cdot\mathbf{x}+b=0$ is unchanged, because $k\mathbf{w}\cdot\mathbf{x}+kb=0$ is an equivalent equation. But $\|k\mathbf{w}\|=k\|\mathbf{w}\|$, so $2/\|\mathbf{w}\|$ would change even though the boundary is the same. SVM's canonical scaling fixes this ambiguity by enforcing that the closest points satisfy $y_i(\mathbf{w}\cdot\mathbf{x}_i+b)=1$, making the margin a well-defined geometric quantity.

Insight: The "margin" isn't just about the boundary line; it's about the boundary plus a chosen scale. SVM selects the scale by pinning the closest points to functional value ±1, turning $1/\|\mathbf{w}\|$ into a true geometric distance.
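The arithmetic above takes a few lines to double-check in Python:

```python
import math

w, b = (3.0, 4.0), -10.0
x = (2.0, 2.0)

norm_w = math.hypot(w[0], w[1])        # ||w|| = 5
score = w[0] * x[0] + w[1] * x[1] + b  # w . x + b = 4
dist = score / norm_w                  # signed distance = 0.8
band = 2.0 / norm_w                    # margin band width = 0.4

# Scaling (w, b) by k > 0 keeps the boundary and the distance, not 2/||w||:
k = 10.0
scaled_dist = (k * score) / (k * norm_w)
print(norm_w, score, dist, band, scaled_dist)
```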

Example 2: From dual coefficients to a classifier (support vectors only)

You trained a (linear) soft-margin SVM and obtained nonzero dual coefficients for only two training points:

  • $\mathbf{x}_1=(1,0)$ with $y_1=+1$, $\alpha_1=0.5$
  • $\mathbf{x}_2=(0,1)$ with $y_2=-1$, $\alpha_2=0.5$

Assume $b=0$. (1) Compute $\mathbf{w}$. (2) Write the decision function $f(\mathbf{x})$. (3) Classify $\mathbf{x}=(2,1)$.

  1. Compute $\mathbf{w}$ using $\mathbf{w}=\sum_i \alpha_i y_i \mathbf{x}_i$:

     $\mathbf{w}=0.5\cdot(+1)\cdot(1,0) + 0.5\cdot(-1)\cdot(0,1) = (0.5,0) + (0,-0.5) = (0.5,-0.5)$

  2. Write the decision function:

     $f(\mathbf{x})=\mathbf{w}\cdot\mathbf{x}+b = 0.5x_1 - 0.5x_2$

  3. Evaluate at $\mathbf{x}=(2,1)$:

     $f(2,1)=0.5\cdot 2 - 0.5\cdot 1 = 1 - 0.5 = 0.5$

  4. Classify using the sign:

     $\hat{y}=\operatorname{sign}(0.5)=+1$

Insight: Even if you had 10,000 training points, if only two have nonzero α, prediction depends only on those two points. That’s the operational meaning of “support vectors determine the classifier.”

Example 3: Kernelized prediction with an RBF kernel (showing the mechanics)

You have a kernel SVM with two support vectors:

  • $\mathbf{x}_1=(0,0)$ with $y_1=+1$, $\alpha_1=0.8$
  • $\mathbf{x}_2=(1,0)$ with $y_2=-1$, $\alpha_2=0.6$

Bias $b=0.1$. Use an RBF kernel $K(\mathbf{x},\mathbf{x}')=\exp(-\gamma\|\mathbf{x}-\mathbf{x}'\|^2)$ with $\gamma=1$.

Compute $f(\mathbf{x})$ and the predicted label for $\mathbf{x}=(0.5,0)$.

  1. Write the kernel decision function:

     $f(\mathbf{x})=\sum_{i\in SV} \alpha_i y_i K(\mathbf{x}_i,\mathbf{x}) + b$

  2. Compute $K(\mathbf{x}_1,\mathbf{x})$: $\|\mathbf{x}_1-\mathbf{x}\|^2 = \|(0,0)-(0.5,0)\|^2 = 0.5^2+0^2=0.25$, so $K(\mathbf{x}_1,\mathbf{x})=\exp(-1\cdot 0.25)=e^{-0.25}$.

  3. Compute $K(\mathbf{x}_2,\mathbf{x})$: $\|\mathbf{x}_2-\mathbf{x}\|^2 = \|(1,0)-(0.5,0)\|^2 = 0.5^2=0.25$, so $K(\mathbf{x}_2,\mathbf{x})=e^{-0.25}$ as well.

  4. Assemble the score:

     $f(\mathbf{x})=0.8\cdot(+1)\cdot e^{-0.25} + 0.6\cdot(-1)\cdot e^{-0.25} + 0.1 = (0.8-0.6)\,e^{-0.25}+0.1 = 0.2\,e^{-0.25}+0.1$

  5. Numerical approximation: $e^{-0.25}\approx 0.7788$, so $f(\mathbf{x})\approx 0.2\cdot 0.7788 + 0.1 = 0.1558 + 0.1 = 0.2558$.

  6. Predict the label: $\hat{y}=\operatorname{sign}(0.2558)=+1$.

Insight: Kernel SVM prediction looks like a weighted vote of similarities to support vectors. With RBF, each support vector contributes most to nearby points and fades with distance.

Key Takeaways

  • SVMs choose the separating hyperplane that maximizes the margin; under canonical scaling, the margin is proportional to $1/\|\mathbf{w}\|$ (band width $2/\|\mathbf{w}\|$).

  • Hard-margin SVM requires perfect separability; soft-margin SVM introduces slack variables $\xi_i$ and a trade-off parameter $C$.

  • The soft-margin objective corresponds to minimizing $\frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \max(0,\, 1-y_i f(\mathbf{x}_i))$ (hinge loss + L2 regularization).

  • In the dual, the solution is expressed by coefficients $\alpha_i$ with constraints $0\le \alpha_i\le C$ and $\sum_i \alpha_i y_i=0$.

  • Only points with $\alpha_i>0$ matter at prediction time; these are the support vectors, which “pin” the optimal boundary.

  • The kernel trick replaces dot products with a positive-definite kernel $K(\mathbf{x},\mathbf{x}')$, enabling nonlinear decision boundaries while keeping the optimization convex.

  • Hyperparameters ($C$, and kernel parameters like the RBF $\gamma$) strongly affect bias/variance and the number of support vectors; scaling features is essential.

  • SVMs are powerful for small-to-medium datasets and can be very robust, but kernel SVMs can be costly for very large $n$ due to the $n\times n$ kernel matrix.

Common Mistakes

  • Confusing the scale ambiguity of the decision boundary: scaling $(\mathbf{w}, b)$ changes $\|\mathbf{w}\|$ but not the boundary; the SVM margin definition relies on canonical scaling.

  • Forgetting to standardize features before training, causing one feature to dominate inner products and distorting margins and kernel behavior.

  • Assuming all training points influence the solution equally; in SVMs, non-support vectors often have α = 0 and do not affect the classifier.

  • Overfitting with RBF kernels by choosing both a large $C$ and a large $\gamma$, producing very wiggly boundaries and many support vectors.

Practice

easy

You are given a hyperplane $\mathbf{w}=(1,2)$, $b=-3$. (a) Compute the distance from $\mathbf{x}=(3,1)$ to the hyperplane. (b) What is the margin band width $2/\|\mathbf{w}\|$? (c) If you scale $(\mathbf{w}, b)$ by 10, do (a) and (b) change?

Hint: Use $\text{dist}=(\mathbf{w}\cdot\mathbf{x}+b)/\|\mathbf{w}\|$ and remember $\|\mathbf{w}\|=\sqrt{w_1^2+w_2^2}$.

Show solution

(a) $\|\mathbf{w}\|=\sqrt{1^2+2^2}=\sqrt{5}$. Compute $\mathbf{w}\cdot\mathbf{x}+b=(1,2)\cdot(3,1)-3=(3+2)-3=2$. Distance $=2/\sqrt{5}$.

(b) Band width $2/\|\mathbf{w}\|=2/\sqrt{5}$.

(c) Scaling $(\mathbf{w}, b)$ by 10 leaves the boundary unchanged. The signed distance to the boundary is unchanged because the numerator and denominator both scale by 10. But the expression $2/\|\mathbf{w}\|$ computed from the scaled $\mathbf{w}$ becomes $2/(10\sqrt{5})$, which shows why the margin is only meaningful once canonical scaling is fixed.

medium

A soft-margin SVM solution has three points with coefficients: $\alpha_1=0$, $\alpha_2=0.3$, $\alpha_3=C$. (a) Which points are support vectors? (b) Which point is likely violating the margin? (c) What condition must hold among the labels $y_i$ and coefficients $\alpha_i$?

Hint: Support vectors have $\alpha_i>0$. Points with $\alpha_i=C$ are often inside the margin or misclassified. The dual equality constraint is $\sum_i \alpha_i y_i=0$.

Show solution

(a) Points 2 and 3 are support vectors because they have $\alpha>0$.

(b) Point 3 (with $\alpha_3=C$) is likely a margin violator (inside the margin and/or misclassified).

(c) The coefficients must satisfy the dual constraint $\sum_i \alpha_i y_i=0$, i.e., $\alpha_1 y_1 + \alpha_2 y_2 + \alpha_3 y_3 = 0$.

hard

Consider an RBF kernel $K(\mathbf{x},\mathbf{x}')=\exp(-\gamma\|\mathbf{x}-\mathbf{x}'\|^2)$. (a) What happens to $K(\mathbf{x},\mathbf{x}')$ as $\gamma\to 0$? (b) What happens as $\gamma\to\infty$ for $\mathbf{x}\ne\mathbf{x}'$? (c) How would these extremes affect the flexibility of an SVM decision boundary?

Hint: Take limits of $\exp(-\gamma d^2)$ as $\gamma$ changes; interpret the kernel value as similarity/influence.

Show solution

(a) As $\gamma\to 0$, $K(\mathbf{x},\mathbf{x}')=\exp(-\gamma\|\mathbf{x}-\mathbf{x}'\|^2)\to \exp(0)=1$ for all pairs. The kernel matrix becomes approximately all ones (very low effective complexity).

(b) As $\gamma\to\infty$ with $\mathbf{x}\ne\mathbf{x}'$, $\|\mathbf{x}-\mathbf{x}'\|^2>0$, so $-\gamma\|\mathbf{x}-\mathbf{x}'\|^2\to -\infty$ and $K\to 0$. Also, $K(\mathbf{x},\mathbf{x})=1$ always, so the kernel matrix approaches the identity.

(c) A small $\gamma$ makes all points look similar, encouraging very smooth/simple boundaries (can underfit). A very large $\gamma$ makes similarity extremely local, allowing highly flexible boundaries that can interpolate noise (risk of overfitting), especially with a large $C$.

Connections

Next: Kernel Methods

Related nodes you may also connect in your mental map:

  • Linear classifiers (hyperplanes, margins)
  • Convex optimization (quadratic programs)
  • Regularization and loss functions (hinge loss vs logistic)
  • Feature scaling and preprocessing