Support Vector Machines

Machine Learning · Difficulty: ████ · Depth: 9 · Unlocks: 1

Maximum margin classifiers. Kernel trick for nonlinearity.


Core Concepts

  • Maximum-margin separating hyperplane (decision boundary w·x+b=0 with canonical scaling so margin = 1/||w||)
  • Support vectors and sparsity (only training points with active constraints determine the classifier)
  • Kernel trick (use of a positive-definite kernel as an implicit inner product to enable nonlinear separation)

Key Symbols & Notation

  • $\alpha_i$ — dual coefficient (Lagrange multiplier) for training example $i$
  • $K(\mathbf{x},\mathbf{x}')$ — positive-definite kernel function; an inner product in feature space

Essential Relationships

  • Primal-dual and kernelized decision function: w = sum_i alpha_i y_i x_i, so f(x)=sign(sum_i alpha_i y_i K(x_i,x)+b); only alpha_i>0 (support vectors) contribute and margin = 1/||w|| under canonical scaling
All Concepts (26)

  • Linear separating hyperplane as classifier: decision boundary defined by w·x + b = 0
  • Geometric margin: signed distance from a point to the separating hyperplane
  • Maximum-margin principle: choose hyperplane that maximizes the minimum (geometric) margin
  • Support vectors: training examples that lie on or inside the margin and determine the solution
  • Hard-margin SVM: maximum-margin classifier when data are linearly separable (no errors allowed)
  • Soft-margin SVM: margin maximization allowing classification errors via slack variables
  • Slack variables (ξ_i): nonnegative variables measuring margin violations for each training point
  • Regularization parameter C: trade-off parameter between margin size and slack (misclassification) penalty
  • Hinge loss: loss function max(0, 1 - y f(x)) that underlies soft-margin SVM
  • Primal SVM optimization problem: objective and constraints in w, b, and ξ (quadratic objective plus linear constraints)
  • Dual SVM optimization problem: quadratic program in dual variables (α_i) involving only inner products of training points
  • Representation theorem for SVMs: primal weight vector w expressed as a linear combination of training points weighted by α_i y_i
  • Kernel function K(x, x'): a function that computes an inner product in some (possibly high- or infinite-dimensional) feature space
  • Kernel trick: replace inner products ⟨φ(x), φ(x')⟩ with K(x,x') to train non-linear SVMs without explicit feature mapping
  • Kernel (Gram) matrix: matrix of pairwise kernel evaluations K_ij = K(x_i, x_j)
  • Positive semi-definiteness / Mercer condition for kernels: condition that a function be a valid inner-product kernel
  • Common kernel families and their qualitative effects (linear, polynomial, Gaussian/RBF, sigmoid)
  • Dual-to-primal link: bias term b and decision function recovered from dual solution (α_i)
  • Sparsity of dual solution: only support vectors have nonzero α_i, leading to sparse decision function
  • KKT conditions specialized to SVM: complementary slackness implications relating α_i, ξ_i, and classification margin
  • Bounds on dual variables in soft-margin: 0 ≤ α_i ≤ C and equality constraint sum_i α_i y_i = 0
  • Decision function in kernelized form: f(x) = sign( sum_i α_i y_i K(x_i, x) + b )
  • Interpretation of C as inverse regularization strength and its effect on margin/generalization
  • Equivalence/relationship between minimizing ||w|| (or ||w||^2) and maximizing margin
  • Influence of kernel choice on implicit feature space dimensionality (e.g., RBF → infinite-dimensional)
  • Computational implications: training scales with number of training examples (quadratic/greater for dense kernels); prediction cost scales with number of support vectors

Teaching Strategy

Multi-session curriculum: this node assumes substantial prior knowledge and covers complex material. Use mastery gates and deliberate practice.

Support Vector Machines (SVMs) are one of the cleanest examples of how geometry, convex optimization, and linear algebra combine into a powerful ML algorithm: pick the separating hyperplane that leaves the widest safety buffer (margin) between classes—and if the data isn’t linearly separable, quietly switch to a richer feature space using a kernel, without ever computing those features explicitly.

TL;DR:

An SVM chooses a decision boundary $\mathbf{w}\cdot\mathbf{x}+b=0$ that maximizes the margin (roughly, the distance to the closest training points). Only the closest points—support vectors—determine the solution, producing a sparse model in the dual form. With the kernel trick, inner products $\phi(\mathbf{x})\cdot\phi(\mathbf{x}')$ are replaced by a positive-definite kernel $K(\mathbf{x},\mathbf{x}')$, enabling nonlinear decision boundaries while solving a convex optimization problem.

What Is a Support Vector Machine?

The problem SVMs are trying to solve (why before how)

In binary classification you often want a rule that separates two classes as reliably as possible. If the data are roughly linearly separable, many hyperplanes can separate them—but some are fragile: a tiny perturbation in the data could flip predictions.

SVMs add a strong geometric preference:

  • Don’t just separate the classes.
  • Separate them with the largest possible margin.

That “margin” is a built-in robustness buffer. Intuitively, if the boundary is far from the training points, small noise in the inputs is less likely to cross the boundary.

The decision boundary

A linear classifier uses a hyperplane:

$\mathbf{w}\cdot\mathbf{x}+b=0$

  • $\mathbf{w}$ is the normal vector (perpendicular to the boundary)
  • $b$ is the offset (bias)
  • the prediction is typically $\hat{y}=\operatorname{sign}(\mathbf{w}\cdot\mathbf{x}+b)$ with labels $y\in\{+1,-1\}$

What “margin” means (geometrically)

For a point $\mathbf{x}$, its signed distance to the hyperplane is:

$\text{dist}(\mathbf{x}, \mathbf{w}, b)=\frac{\mathbf{w}\cdot\mathbf{x}+b}{\|\mathbf{w}\|}$

So the distance scales like $1/\|\mathbf{w}\|$. To talk about margins in a consistent way, SVMs use canonical scaling: choose the scale of $\mathbf{w}$ and $b$ so that the closest points satisfy

$y_i(\mathbf{w}\cdot\mathbf{x}_i+b)=1$

Under that choice:

  • the two "margin" hyperplanes are $\mathbf{w}\cdot\mathbf{x}+b=+1$ and $\mathbf{w}\cdot\mathbf{x}+b=-1$
  • the distance between them is $\frac{2}{\|\mathbf{w}\|}$

Many texts call the margin $\frac{1}{\|\mathbf{w}\|}$ (the distance from the boundary to the closest points), while others emphasize the full band width $\frac{2}{\|\mathbf{w}\|}$. Either way, maximizing the margin is equivalent to minimizing $\|\mathbf{w}\|$.
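To make these quantities concrete, here is a small pure-Python sketch (function names and data are illustrative) that computes the signed distance to a hyperplane and the margin band width $2/\|\mathbf{w}\|$:

```python
import math

def signed_distance(w, b, x):
    """Signed distance from 2D point x to the hyperplane w . x + b = 0."""
    return (w[0] * x[0] + w[1] * x[1] + b) / math.hypot(w[0], w[1])

def margin_band_width(w):
    """Distance between the two margin hyperplanes w . x + b = +1 and -1."""
    return 2.0 / math.hypot(w[0], w[1])

w, b = (2.0, 0.0), -2.0                     # boundary: x1 = 1
print(signed_distance(w, b, (3.0, 0.0)))    # 2.0 units on the positive side
print(margin_band_width(w))                 # 1.0

# Rescaling (w, b) leaves the distance unchanged but shrinks 2/||w||:
print(signed_distance((20.0, 0.0), -20.0, (3.0, 0.0)))  # still 2.0
print(margin_band_width((20.0, 0.0)))                    # 0.1
```

The last two lines preview exactly why canonical scaling is needed before $2/\|\mathbf{w}\|$ can be read as "the margin."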

The “maximum margin” optimization (hard-margin)

If the data are perfectly separable, we want:

  • correct classification with margin constraints: $y_i(\mathbf{w}\cdot\mathbf{x}_i+b)\ge 1$ for all $i$
  • maximum margin, i.e. minimize $\|\mathbf{w}\|$.

This gives the classic hard-margin SVM primal problem:

$\min_{\mathbf{w},b}\; \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{s.t.}\quad y_i(\mathbf{w}\cdot\mathbf{x}_i+b)\ge 1 \;\; \forall i$

Why $\frac{1}{2}\|\mathbf{w}\|^2$? It's convex and differentiable, and the factor $1/2$ cancels nicely in derivatives.

Visualization focus: margin vs decision boundary

Interactive canvas idea (1): show a 2D dataset with two classes. Provide sliders to rotate/translate a candidate hyperplane and display:

  • the decision boundary $\mathbf{w}\cdot\mathbf{x}+b=0$
  • the two margin lines $\mathbf{w}\cdot\mathbf{x}+b=\pm 1$
  • the current margin width $2/\|\mathbf{w}\|$

Let the user move the boundary and watch the margin shrink/grow, with a live readout of $\|\mathbf{w}\|$.

Static diagram (for non-canvas readers):

<svg xmlns="http://www.w3.org/2000/svg" width="720" height="260" viewBox="0 0 720 260" role="img" aria-label="SVM margin: decision boundary and two margin lines with support vectors">
  <rect x="0" y="0" width="720" height="260" fill="#ffffff"/>
  <!-- axes -->
  <line x1="60" y1="220" x2="680" y2="220" stroke="#333" stroke-width="2"/>
  <line x1="60" y1="220" x2="60" y2="30" stroke="#333" stroke-width="2"/>
  <!-- decision boundary and margins -->
  <line x1="170" y1="220" x2="510" y2="30" stroke="#1f77b4" stroke-width="3"/>
  <line x1="140" y1="220" x2="480" y2="30" stroke="#1f77b4" stroke-width="2" stroke-dasharray="8,6"/>
  <line x1="200" y1="220" x2="540" y2="30" stroke="#1f77b4" stroke-width="2" stroke-dasharray="8,6"/>
  <text x="520" y="60" font-family="sans-serif" font-size="14" fill="#1f77b4">w·x+b=0</text>
  <text x="545" y="80" font-family="sans-serif" font-size="12" fill="#1f77b4">w·x+b=+1</text>
  <text x="455" y="90" font-family="sans-serif" font-size="12" fill="#1f77b4">w·x+b=-1</text>
  <!-- points (class +1) -->
  <circle cx="520" cy="70" r="7" fill="#2ca02c"/>
  <circle cx="600" cy="110" r="7" fill="#2ca02c"/>
  <circle cx="610" cy="60" r="7" fill="#2ca02c"/>
  <!-- points (class -1) -->
  <circle cx="150" cy="185" r="7" fill="#d62728"/>
  <circle cx="210" cy="170" r="7" fill="#d62728"/>
  <circle cx="250" cy="205" r="7" fill="#d62728"/>
  <!-- support vectors (highlighted) -->
  <circle cx="520" cy="70" r="12" fill="none" stroke="#000" stroke-width="2"/>
  <circle cx="210" cy="170" r="12" fill="none" stroke="#000" stroke-width="2"/>
  <text x="90" y="45" font-family="sans-serif" font-size="14" fill="#000">Support vectors lie on the dashed margin lines</text>
</svg>

This diagram emphasizes a key idea you’ll return to: the boundary is “pinned” by the closest points.

Core Mechanic 1: Maximum-Margin Optimization and the Soft Margin

Why the hard-margin version is not enough

Perfect separability is rare. Noise, overlap, and mislabeled points are common.

If you insist on $y_i(\mathbf{w}\cdot\mathbf{x}_i+b)\ge 1$ for every point, you may get:

  • infeasibility (no solution)
  • or an overly complex boundary in feature space (when kernels are used)

SVMs handle this with slack variables $\xi_i\ge 0$ that allow violations:

$y_i(\mathbf{w}\cdot\mathbf{x}_i+b)\ge 1-\xi_i$

Interpretation:

  • $\xi_i=0$: the point is correctly classified and on or outside the margin.
  • $0<\xi_i<1$: correctly classified but inside the margin.
  • $\xi_i\ge 1$: misclassified.
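The three regimes are easy to check numerically. A minimal sketch (the scores are made up) that computes $\xi_i = \max(0,\, 1 - y_i f(\mathbf{x}_i))$ and names the regime each point falls into:

```python
def slack(y, score):
    """Slack xi = max(0, 1 - y * f(x)) for label y in {+1, -1} and score f(x)."""
    return max(0.0, 1.0 - y * score)

# (label, f(x)) pairs chosen to cover the three regimes
cases = [(+1, 1.5),   # on/outside margin: xi = 0
         (+1, 0.4),   # inside margin:     0 < xi < 1
         (-1, 0.2)]   # misclassified:     xi >= 1

for y, score in cases:
    xi = slack(y, score)
    regime = ("on/outside margin" if xi == 0.0
              else "inside margin" if xi < 1.0
              else "misclassified")
    print(f"y={y:+d}, f(x)={score:+.1f} -> xi={xi:.1f} ({regime})")
```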

The soft-margin primal objective

We now trade off large margin vs. violations:

$\min_{\mathbf{w},b,\boldsymbol{\xi}}\; \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^n \xi_i \quad\text{s.t.}\quad y_i(\mathbf{w}\cdot\mathbf{x}_i+b)\ge 1-\xi_i,\;\; \xi_i\ge 0$

$C>0$ controls the trade-off:

  • large $C$: violations are expensive → narrower margin, fewer training errors
  • small $C$: violations are tolerated → wider margin, possibly more training errors

A helpful way to remember this:

  • $\frac{1}{2}\|\mathbf{w}\|^2$ is a capacity/complexity penalty
  • $\sum_i \xi_i$ is a training loss (a linear penalty on margin violations)

Connecting to hinge loss

You can rewrite the constrained soft-margin problem into an unconstrained form using the hinge loss

$\ell_{\text{hinge}}(y, f)=\max(0,\, 1-yf)$

where $f(\mathbf{x})=\mathbf{w}\cdot\mathbf{x}+b$.

At the optimum, $\xi_i$ becomes exactly the hinge loss:

$\xi_i = \max\big(0,\, 1-y_i(\mathbf{w}\cdot\mathbf{x}_i+b)\big)$

So the primal is equivalent to:

$\min_{\mathbf{w},b}\; \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^n \max\big(0,\, 1-y_i(\mathbf{w}\cdot\mathbf{x}_i+b)\big)$

This is a useful lens because it makes SVMs feel like “regularized empirical risk minimization,” just with hinge loss instead of logistic loss.
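Seen through that lens, the soft-margin objective can be evaluated directly for any candidate $(\mathbf{w}, b)$. A minimal sketch with made-up 2D data:

```python
def hinge(y, score):
    """Hinge loss max(0, 1 - y * f(x)) for label y in {+1, -1}."""
    return max(0.0, 1.0 - y * score)

def primal_objective(w, b, C, data):
    """(1/2)||w||^2 + C * sum of hinge losses over (x, y) pairs (2D inputs)."""
    reg = 0.5 * (w[0] ** 2 + w[1] ** 2)
    loss = sum(hinge(y, w[0] * x[0] + w[1] * x[1] + b) for x, y in data)
    return reg + C * loss

data = [((2.0, 0.0), +1), ((-2.0, 0.0), -1), ((0.5, 0.0), +1)]
# With w = (1, 0), b = 0 the third point sits inside the margin and pays 0.5.
obj = primal_objective((1.0, 0.0), 0.0, 1.0, data)
print(obj)  # 0.5 (regularizer) + 1.0 * 0.5 (hinge) = 1.0
```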

Visualization focus: how C changes the solution

Interactive canvas idea: a slider for $C$ that recomputes the separating line (in the 2D linear case) and updates:

  • margin width $2/\|\mathbf{w}\|$
  • the number of margin violations
  • which points are support vectors

Learners should see that increasing $C$ often pulls the boundary toward outliers to fix them, shrinking the margin.

A careful note about scaling

Because SVMs depend on inner products and distances, feature scaling matters.

If one feature has values in the thousands and another in the tenths, the large-scale feature dominates $\mathbf{w}\cdot\mathbf{x}$ and $\|\mathbf{w}\|$. Standard practice:

  • standardize features (zero mean, unit variance) or apply a similar normalization
  • tune $C$ (and kernel parameters) after scaling
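A minimal standardization sketch (pure Python, per-feature z-scores), the kind of preprocessing you would apply before tuning $C$:

```python
import math

def standardize(columns):
    """Per-feature z-scores: subtract the mean, divide by the population std."""
    out = []
    for col in columns:
        mean = sum(col) / len(col)
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
        out.append([(v - mean) / std for v in col])
    return out

# One feature in the thousands, one in the tenths: wildly different scales.
raw = [[1000.0, 2000.0, 3000.0], [0.1, 0.2, 0.3]]
scaled = standardize(raw)
# After scaling, both features live on the same scale.
print(scaled[0])
print(scaled[1])
```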

Where convex optimization shows up

Both hard- and soft-margin SVMs are convex problems:

  • quadratic objective in $\mathbf{w}$
  • linear constraints in $(\mathbf{w}, b, \boldsymbol{\xi})$

That convexity is why SVMs historically earned a reputation for reliability: there is a single global optimum (up to degeneracies).

But the most elegant part is what happens when we transform the problem into its dual: it will reveal support vectors and the kernel trick naturally.

Core Mechanic 2: Support Vectors, the Dual Problem, and Sparsity

Why we go to the dual

You already know Lagrange multipliers for equality constraints. For SVMs we have inequality constraints, so we use the Karush–Kuhn–Tucker (KKT) framework.

The payoff for deriving the dual is big:

  1. The classifier can be written entirely in terms of dot products $\mathbf{x}_i\cdot\mathbf{x}_j$.
  2. The solution becomes sparse: only some points have nonzero coefficients.
  3. That dot-product-only form is exactly what kernels replace.

We’ll derive the soft-margin dual (hard-margin is a special case).

Step 1: Set up constraints and multipliers

Primal (soft-margin) again:

$\min_{\mathbf{w},b,\boldsymbol{\xi}}\; \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i$

subject to

$y_i(\mathbf{w}\cdot\mathbf{x}_i+b)\ge 1-\xi_i, \quad \xi_i\ge 0$

Introduce Lagrange multipliers:

  • $\alpha_i\ge 0$ for the margin constraints
  • $\mu_i\ge 0$ for the slack nonnegativity constraints $\xi_i\ge 0$

The Lagrangian is:

$\mathcal{L}(\mathbf{w},b,\boldsymbol{\xi},\boldsymbol{\alpha},\boldsymbol{\mu}) = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i - \sum_i \alpha_i\big(y_i(\mathbf{w}\cdot\mathbf{x}_i+b)-1+\xi_i\big) - \sum_i \mu_i\xi_i$

Step 2: Stationarity conditions (minimize over primal variables)

Take partial derivatives and set to zero.

With respect to $\mathbf{w}$:

$\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = \mathbf{w} - \sum_i \alpha_i y_i \mathbf{x}_i = 0 \;\Rightarrow\; \mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$

This is the first major result: $\mathbf{w}$ is a linear combination of the training points.

With respect to $b$:

$\frac{\partial \mathcal{L}}{\partial b} = -\sum_i \alpha_i y_i = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0$

With respect to $\xi_i$:

$\frac{\partial \mathcal{L}}{\partial \xi_i} = C - \alpha_i - \mu_i = 0 \;\Rightarrow\; \alpha_i + \mu_i = C$

Since $\mu_i\ge 0$, we get the box constraint:

$0 \le \alpha_i \le C$

Step 3: Plug back in → the dual objective

Substitute $\mathbf{w}=\sum_i \alpha_i y_i \mathbf{x}_i$ into the Lagrangian and eliminate $\boldsymbol{\xi}$ using stationarity. After simplification, the dual becomes:

$\max_{\boldsymbol{\alpha}}\; \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j y_i y_j\, (\mathbf{x}_i\cdot\mathbf{x}_j)$

subject to

$0\le \alpha_i\le C, \quad \sum_i \alpha_i y_i=0$

This is a convex quadratic program in $\boldsymbol{\alpha}$ (maximize a concave quadratic).

Step 4: The classifier in terms of α

Once you solve for the $\alpha_i$, the decision function is:

$f(\mathbf{x})=\mathbf{w}\cdot\mathbf{x}+b =\Big(\sum_i \alpha_i y_i \mathbf{x}_i\Big)\cdot\mathbf{x}+b =\sum_i \alpha_i y_i\, (\mathbf{x}_i\cdot\mathbf{x}) + b$

Only points with $\alpha_i\ne 0$ contribute. These are the support vectors.
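A minimal pure-Python sketch of this dual-form classifier (the $\alpha$ values are illustrative, not from a real solver); note that the prediction uses only dot products with the training points:

```python
def recover_w(alphas, ys, xs):
    """Primal weights w = sum_i alpha_i * y_i * x_i for 2D points."""
    w = [0.0, 0.0]
    for a, y, x in zip(alphas, ys, xs):
        w[0] += a * y * x[0]
        w[1] += a * y * x[1]
    return w

def dual_decision(alphas, ys, xs, b, x):
    """f(x) = sum_i alpha_i * y_i * (x_i . x) + b -- dot products only."""
    return sum(a * y * (xi[0] * x[0] + xi[1] * x[1])
               for a, y, xi in zip(alphas, ys, xs)) + b

alphas = [0.0, 0.5, 0.5]      # first point: alpha = 0, not a support vector
ys = [+1, +1, -1]
xs = [(3.0, 3.0), (1.0, 0.0), (0.0, 1.0)]

w_rec = recover_w(alphas, ys, xs)
score = dual_decision(alphas, ys, xs, 0.0, (2.0, 1.0))
print(w_rec)   # [0.5, -0.5]
print(score)   # 0.5 -- identical to w . x + b
```

The point with $\alpha=0$ contributes nothing: delete it and both outputs stay the same.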

What exactly are “support vectors”?

From KKT complementary slackness:

$\alpha_i\big(y_i(\mathbf{w}\cdot\mathbf{x}_i+b)-1+\xi_i\big)=0$

So:

  • If a point is comfortably outside the margin: $y_i(\mathbf{w}\cdot\mathbf{x}_i+b)>1$ and $\xi_i=0$ → typically $\alpha_i=0$.
  • If a point lies exactly on the margin: $y_i(\mathbf{w}\cdot\mathbf{x}_i+b)=1$ and $\xi_i=0$ → $0<\alpha_i<C$.
  • If a point violates the margin: $y_i(\mathbf{w}\cdot\mathbf{x}_i+b)<1$ → usually $\alpha_i=C$ (at the upper bound) when it is a "hard" violator.

This yields a practical taxonomy:

| Point location | Condition | Typical α | Role |
|---|---|---|---|
| Outside margin | $y f(\mathbf{x})>1$ | $0$ | irrelevant to the boundary |
| On margin | $y f(\mathbf{x})=1$ | $(0, C)$ | "geometric" support vector |
| Inside margin / misclassified | $y f(\mathbf{x})<1$ | $C$ | "error" support vector |

Visualization focus: support vectors control the boundary

Interactive canvas idea (2): show a solved linear SVM in 2D with support vectors highlighted. Allow the user to:

  1. drag any non-support point a moderate amount
  2. drag a support vector a moderate amount

and recompute the SVM.

Expected visual lesson:

  • moving non-support points often does not change the boundary (or changes it very little)
  • moving a support vector noticeably moves/rotates the boundary

To make this explicit, display:

  • list/count of support vectors
  • α values next to points (e.g., tiny labels)
  • “boundary change” metric (angle shift, bias shift)

This turns “only support vectors matter” from a slogan into an observed fact.

How do we find b?

Once you have $\boldsymbol{\alpha}$, you can compute $b$ using any support vector with $0<\alpha_i<C$ (i.e., on the margin, not at the box constraint). For such a point, $\xi_i=0$ and

$y_i(\mathbf{w}\cdot\mathbf{x}_i+b)=1$

So

$b = y_i - \mathbf{w}\cdot\mathbf{x}_i$

In practice you average $b$ over all margin support vectors to reduce numerical noise.
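A minimal sketch of that recovery step (the $\alpha$ values and points are illustrative), averaging $b = y_i - \mathbf{w}\cdot\mathbf{x}_i$ over the on-margin support vectors only:

```python
def bias_from_margin_svs(w, alphas, ys, xs, C, eps=1e-8):
    """Average b = y_i - w . x_i over support vectors with 0 < alpha_i < C."""
    estimates = [y - (w[0] * x[0] + w[1] * x[1])
                 for a, y, x in zip(alphas, ys, xs)
                 if eps < a < C - eps]
    return sum(estimates) / len(estimates)

# Toy solution with boundary x1 = -1 (i.e., w = (1, 0), b = 1).
w = (1.0, 0.0)
alphas = [0.4, 0.4, 1.0]     # the third alpha sits at the box bound C
ys = [+1, -1, -1]
xs = [(0.0, 0.0), (-2.0, 1.0), (-4.0, 0.0)]

b = bias_from_margin_svs(w, alphas, ys, xs, C=1.0)
print(b)  # 1.0; the point at the box bound is excluded from the average
```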

Sparsity and prediction cost

Prediction evaluates

$f(\mathbf{x})=\sum_{i\in SV} \alpha_i y_i\, (\mathbf{x}_i\cdot\mathbf{x}) + b$

If the number of support vectors is small, this is fast. This sparsity is a real advantage over methods that require all training points at prediction time.

But note the caveat: with some kernels and some choices of $C$, the number of support vectors can become large (even close to $n$), making prediction slower.

Application/Connection: The Kernel Trick (Nonlinear SVMs)

Why kernels

A linear separator in the original input space might be impossible even if the data are “simple” in a different representation.

Classic example: concentric circles. No line separates inner vs outer ring.

Idea: map inputs through a feature map $\phi(\mathbf{x})$ so that the classes become linearly separable in feature space:

$f(\mathbf{x}) = \mathbf{w}\cdot\phi(\mathbf{x}) + b$

But explicitly constructing $\phi(\mathbf{x})$ could be expensive or infinite-dimensional.

The key observation

In the dual, the data only appear inside dot products:

$\mathbf{x}_i\cdot\mathbf{x}_j$

If we instead operate in feature space, we would need:

$\phi(\mathbf{x}_i)\cdot\phi(\mathbf{x}_j)$

A kernel is a function

$K(\mathbf{x},\mathbf{x}') = \phi(\mathbf{x})\cdot\phi(\mathbf{x}')$

for some (possibly implicit) feature map $\phi$, provided $K$ is positive-definite (the Mercer condition).

Then the dual becomes:

$\max_{\boldsymbol{\alpha}}\; \sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j \alpha_i\alpha_j y_i y_j\, K(\mathbf{x}_i,\mathbf{x}_j)$

and prediction becomes:

$f(\mathbf{x})=\sum_{i\in SV} \alpha_i y_i\, K(\mathbf{x}_i,\mathbf{x}) + b$

That's the kernel trick: you get nonlinear decision boundaries in input space while solving a convex problem that only needs kernel evaluations.
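As a minimal sketch (the support vectors and coefficients are made up), the only change from the linear dual-form predictor is swapping the dot product for a kernel call:

```python
import math

def rbf(x, z, gamma):
    """RBF kernel K(x, z) = exp(-gamma * ||x - z||^2) for 2D points."""
    sq = (x[0] - z[0]) ** 2 + (x[1] - z[1]) ** 2
    return math.exp(-gamma * sq)

def kernel_decision(alphas, ys, svs, b, gamma, x):
    """f(x) = sum over support vectors of alpha_i * y_i * K(x_i, x) + b."""
    return sum(a * y * rbf(sv, x, gamma)
               for a, y, sv in zip(alphas, ys, svs)) + b

alphas = [0.7, 0.3]
ys = [+1, -1]
svs = [(0.0, 0.0), (2.0, 0.0)]

score = kernel_decision(alphas, ys, svs, b=0.0, gamma=1.0, x=(0.0, 0.0))
label = +1 if score > 0 else -1
print(score, label)  # the nearby positive support vector dominates the vote
```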

Common kernels and what they “feel like”

| Kernel | Formula | Main parameter(s) | What it implies |
|---|---|---|---|
| Linear | $K(\mathbf{x},\mathbf{x}')=\mathbf{x}\cdot\mathbf{x}'$ | none | linear boundary in input space |
| Polynomial | $K=(\gamma\,\mathbf{x}\cdot\mathbf{x}'+r)^d$ | $d,\gamma,r$ | interactions up to degree $d$ |
| RBF / Gaussian | $K=\exp(-\gamma\|\mathbf{x}-\mathbf{x}'\|^2)$ | $\gamma$ | local similarity; flexible smooth boundaries |
| Sigmoid (less common) | $K=\tanh(\gamma\,\mathbf{x}\cdot\mathbf{x}'+r)$ | $\gamma,r$ | related to neural nets; not always PD |

A practical intuition for RBF:

  • small $\gamma$ → wide Gaussian → smoother, more global influence → simpler boundary
  • large $\gamma$ → narrow Gaussian → very local influence → potentially wiggly boundary

And remember: $C$ and the kernel parameters interact. High $C$ + high $\gamma$ can easily overfit.
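The $\gamma$ intuition is easy to see numerically: for two points at a fixed distance, the kernel value slides from near 1 (everything looks similar) to near 0 (strictly local) as $\gamma$ grows. A tiny sketch:

```python
import math

def rbf_value(d_squared, gamma):
    """RBF kernel value for a pair of points at squared distance d_squared."""
    return math.exp(-gamma * d_squared)

d2 = 2.0  # fixed squared distance between two points
for gamma in (0.01, 0.1, 1.0, 10.0, 100.0):
    print(f"gamma={gamma:6.2f} -> K = {rbf_value(d2, gamma):.6f}")
# small gamma: K near 1 -> smooth, global influence
# large gamma: K near 0 -> extremely local influence
```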

Visualization focus: concentric circles and feature space linearity

Interactive canvas idea (3): present a concentric-circles dataset.

Panel A (input space): show points and the learned nonlinear boundary for an RBF SVM.

Panel B (a chosen feature space view): show an illustrative mapping where the same points become linearly separable.

For circles, an instructive explicit mapping uses a radial feature like $r^2=x_1^2+x_2^2$. In the 1D feature space of $r^2$, the classes may separate by a threshold (a "linear" separator in that 1D feature). More generally, you can visualize a 3D feature map:

$\phi(x_1,x_2) = (x_1,\, x_2,\, x_1^2+x_2^2)$

Then a plane in this 3D space can correspond to a circle-like boundary when projected back to 2D.

Important honesty: the RBF kernel's actual feature space is infinite-dimensional, so Panel B is an illustration of the concept, not literally the RBF feature map.
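You can verify the radial-feature idea directly: synthetic rings that no straight line separates in $(x_1, x_2)$ are split by a single threshold on $r^2$. A sketch with illustrative radii:

```python
import math

def radial_feature(x):
    """phi(x) = r^2 = x1^2 + x2^2, an explicit 1D feature for ring data."""
    return x[0] ** 2 + x[1] ** 2

# Inner ring (radius 1, label -1) and outer ring (radius 3, label +1).
points = []
for k in range(8):
    t = 2.0 * math.pi * k / 8.0
    points.append(((math.cos(t), math.sin(t)), -1))
    points.append(((3.0 * math.cos(t), 3.0 * math.sin(t)), +1))

threshold = 4.0  # any value strictly between r^2 = 1 and r^2 = 9 works
results = [(+1 if radial_feature(x) > threshold else -1, y) for x, y in points]
print(all(pred == y for pred, y in results))  # True: one threshold separates the rings
```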

Static diagram (for non-canvas readers):

<svg xmlns="http://www.w3.org/2000/svg" width="720" height="300" viewBox="0 0 720 300" role="img" aria-label="Kernel idea: circles not separable in 2D become separable after mapping using radius-squared feature">
  <rect x="0" y="0" width="720" height="300" fill="#fff"/>
  <text x="60" y="30" font-family="sans-serif" font-size="16" fill="#000">Input space (x₁,x₂): concentric circles</text>
  <text x="420" y="30" font-family="sans-serif" font-size="16" fill="#000">Feature (r²): linear threshold</text>
  <!-- left axes -->
  <line x1="60" y1="250" x2="330" y2="250" stroke="#333" stroke-width="2"/>
  <line x1="60" y1="250" x2="60" y2="60" stroke="#333" stroke-width="2"/>
  <!-- circles of points -->
  <circle cx="195" cy="155" r="35" fill="none" stroke="#d62728" stroke-width="2"/>
  <circle cx="195" cy="155" r="80" fill="none" stroke="#2ca02c" stroke-width="2"/>
  <!-- sample points -->
  <circle cx="195" cy="120" r="5" fill="#d62728"/>
  <circle cx="230" cy="155" r="5" fill="#d62728"/>
  <circle cx="195" cy="190" r="5" fill="#d62728"/>
  <circle cx="165" cy="155" r="5" fill="#d62728"/>
  <circle cx="195" cy="75" r="5" fill="#2ca02c"/>
  <circle cx="275" cy="155" r="5" fill="#2ca02c"/>
  <circle cx="195" cy="235" r="5" fill="#2ca02c"/>
  <circle cx="115" cy="155" r="5" fill="#2ca02c"/>
  <!-- right axes for r^2 -->
  <line x1="420" y1="250" x2="690" y2="250" stroke="#333" stroke-width="2"/>
  <line x1="420" y1="250" x2="420" y2="60" stroke="#333" stroke-width="2"/>
  <text x="560" y="275" font-family="sans-serif" font-size="12">r²</text>
  <!-- threshold -->
  <line x1="560" y1="250" x2="560" y2="60" stroke="#1f77b4" stroke-width="2" stroke-dasharray="6,5"/>
  <text x="565" y="80" font-family="sans-serif" font-size="12" fill="#1f77b4">threshold</text>
  <!-- points on line (r^2) -->
  <circle cx="500" cy="170" r="6" fill="#d62728"/>
  <circle cx="510" cy="140" r="6" fill="#d62728"/>
  <circle cx="520" cy="200" r="6" fill="#d62728"/>
  <circle cx="620" cy="120" r="6" fill="#2ca02c"/>
  <circle cx="640" cy="180" r="6" fill="#2ca02c"/>
  <circle cx="650" cy="150" r="6" fill="#2ca02c"/>
  <text x="430" y="55" font-family="sans-serif" font-size="12" fill="#000">(conceptual) mapping: (x₁,x₂) → r²=x₁²+x₂²</text>
</svg>

Model selection: tuning C and kernel parameters

In practice, an SVM’s performance depends heavily on hyperparameters:

  • $C$ (regularization vs. margin violations)
  • kernel parameters (e.g., $\gamma$ for RBF, degree $d$ for polynomial)

A typical approach:

  • standardize features
  • perform cross-validation over a grid (often log-spaced)
  • choose parameters that optimize validation performance

Practical pros/cons (when to use SVMs)

Strengths

  • Convex optimization → no bad local minima
  • Effective in moderate dimensions
  • Kernel trick enables flexible nonlinear boundaries
  • Often strong performance on small-to-medium datasets

Limitations

  • Training can be expensive for very large $n$ (the kernel matrix is $n\times n$)
  • Prediction can be expensive if many support vectors
  • Less natural probabilistic outputs (though you can calibrate)
  • Kernel and hyperparameter choices matter a lot

Bridge to what you’ll unlock next

Everything about “kernel methods” generalizes beyond SVMs: ridge regression, PCA variants, Gaussian processes, etc. The SVM is your first major encounter with the idea that inner products are the computational interface to a possibly huge feature space.

You’re now ready for: Kernel Methods.

Worked Examples (3)

Example 1: Compute the margin from a given hyperplane (and see why scaling matters)

Suppose a linear classifier is given by $\mathbf{w}=(3,4)$ and $b=-10$. Consider the point $\mathbf{x}=(2,2)$. (1) Compute its signed distance to the decision boundary. (2) Compute the geometric margin band width $2/\|\mathbf{w}\|$. (3) Explain why you can't read off the SVM margin unless the classifier is in canonical scaling.

  1. Compute the norm of $\mathbf{w}$:

     $\|\mathbf{w}\|=\sqrt{3^2+4^2}=\sqrt{9+16}=5$

  2. Compute the value of the decision function:

     $\mathbf{w}\cdot\mathbf{x}+b = (3,4)\cdot(2,2) - 10 = (6+8)-10 = 4$

  3. Convert to signed distance:

     $\text{dist}(\mathbf{x},\mathbf{w},b)=\frac{\mathbf{w}\cdot\mathbf{x}+b}{\|\mathbf{w}\|}=\frac{4}{5}=0.8$

     So the point lies on the positive side of the hyperplane, 0.8 units away (in Euclidean distance).

  4. Compute the margin band width (the distance between $\mathbf{w}\cdot\mathbf{x}+b=+1$ and $=-1$):

     $\frac{2}{\|\mathbf{w}\|} = \frac{2}{5}=0.4$

  5. Explain scaling:

     If we scale $(\mathbf{w},b)$ by any constant $k>0$, the decision boundary $\mathbf{w}\cdot\mathbf{x}+b=0$ is unchanged, because $k\mathbf{w}\cdot\mathbf{x}+kb=0$ is an equivalent equation. But $\|k\mathbf{w}\|=k\|\mathbf{w}\|$, so $2/\|\mathbf{w}\|$ would change even though the boundary is the same. SVM's canonical scaling fixes this ambiguity by enforcing that the closest points satisfy $y_i(\mathbf{w}\cdot\mathbf{x}_i+b)=1$, making the margin a well-defined geometric quantity.

Insight: The "margin" isn't just about the boundary line; it's about the boundary plus a chosen scale. SVM selects the scale by pinning the closest points to functional value ±1, turning $1/\|\mathbf{w}\|$ into a true geometric distance.
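The arithmetic above takes a few lines to double-check in Python:

```python
import math

w, b = (3.0, 4.0), -10.0
x = (2.0, 2.0)

norm_w = math.hypot(w[0], w[1])        # ||w|| = 5
score = w[0] * x[0] + w[1] * x[1] + b  # w . x + b = 4
dist = score / norm_w                  # signed distance = 0.8
band = 2.0 / norm_w                    # margin band width = 0.4

# Scaling (w, b) by k > 0 keeps the boundary and the distance, not 2/||w||:
k = 10.0
scaled_dist = (k * score) / (k * norm_w)
print(norm_w, score, dist, band, scaled_dist)
```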

Example 2: From dual coefficients to a classifier (support vectors only)

You trained a (linear) soft-margin SVM and obtained nonzero dual coefficients for only two training points:

  • $\mathbf{x}_1=(1,0)$ with $y_1=+1$, $\alpha_1=0.5$
  • $\mathbf{x}_2=(0,1)$ with $y_2=-1$, $\alpha_2=0.5$

Assume $b=0$. (1) Compute $\mathbf{w}$. (2) Write the decision function $f(\mathbf{x})$. (3) Classify $\mathbf{x}=(2,1)$.

  1. Compute $\mathbf{w}$ using $\mathbf{w}=\sum_i \alpha_i y_i \mathbf{x}_i$:

     $\mathbf{w}=0.5\cdot(+1)\cdot(1,0) + 0.5\cdot(-1)\cdot(0,1) = (0.5,0) + (0,-0.5) = (0.5,-0.5)$

  2. Write the decision function:

     $f(\mathbf{x})=\mathbf{w}\cdot\mathbf{x}+b = 0.5x_1 - 0.5x_2$

  3. Evaluate at $\mathbf{x}=(2,1)$:

     $f(2,1)=0.5\cdot 2 - 0.5\cdot 1 = 1 - 0.5 = 0.5$

  4. Classify using the sign:

     $\hat{y}=\operatorname{sign}(0.5)=+1$

Insight: Even if you had 10,000 training points, if only two have nonzero α, prediction depends only on those two points. That’s the operational meaning of “support vectors determine the classifier.”

Example 3: Kernelized prediction with an RBF kernel (showing the mechanics)

You have a kernel SVM with two support vectors:

  • $\mathbf{x}_1=(0,0)$ with $y_1=+1$, $\alpha_1=0.8$
  • $\mathbf{x}_2=(1,0)$ with $y_2=-1$, $\alpha_2=0.6$

Bias $b=0.1$. Use an RBF kernel $K(\mathbf{x},\mathbf{x}')=\exp(-\gamma\|\mathbf{x}-\mathbf{x}'\|^2)$ with $\gamma=1$.

Compute $f(\mathbf{x})$ and the predicted label for $\mathbf{x}=(0.5,0)$.

  1. Write the kernel decision function:

     $f(\mathbf{x})=\sum_{i\in SV} \alpha_i y_i K(\mathbf{x}_i,\mathbf{x}) + b$

  2. Compute $K(\mathbf{x}_1,\mathbf{x})$: $\|\mathbf{x}_1-\mathbf{x}\|^2 = \|(0,0)-(0.5,0)\|^2 = 0.5^2+0^2=0.25$, so $K(\mathbf{x}_1,\mathbf{x})=\exp(-1\cdot 0.25)=e^{-0.25}$.

  3. Compute $K(\mathbf{x}_2,\mathbf{x})$: $\|\mathbf{x}_2-\mathbf{x}\|^2 = \|(1,0)-(0.5,0)\|^2 = 0.5^2=0.25$, so $K(\mathbf{x}_2,\mathbf{x})=e^{-0.25}$ as well.

  4. Assemble the score:

     $f(\mathbf{x})=0.8\cdot(+1)\cdot e^{-0.25} + 0.6\cdot(-1)\cdot e^{-0.25} + 0.1 = (0.8-0.6)\,e^{-0.25}+0.1 = 0.2\,e^{-0.25}+0.1$

  5. Numerical approximation: $e^{-0.25}\approx 0.7788$, so $f(\mathbf{x})\approx 0.2\cdot 0.7788 + 0.1 = 0.1558 + 0.1 = 0.2558$.

  6. Predict the label: $\hat{y}=\operatorname{sign}(0.2558)=+1$.

Insight: Kernel SVM prediction looks like a weighted vote of similarities to support vectors. With RBF, each support vector contributes most to nearby points and fades with distance.

Key Takeaways

  • SVMs choose the separating hyperplane that maximizes the margin; under canonical scaling, the margin is proportional to $1/\|\mathbf{w}\|$ (band width $2/\|\mathbf{w}\|$).

  • Hard-margin SVM requires perfect separability; soft-margin SVM introduces slack variables $\xi_i$ and a trade-off parameter $C$.

  • The soft-margin objective corresponds to minimizing $\frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \max(0,\, 1-y_i f(\mathbf{x}_i))$ (hinge loss + L2 regularization).

  • In the dual, the solution is expressed by coefficients $\alpha_i$ with constraints $0\le \alpha_i\le C$ and $\sum_i \alpha_i y_i=0$.

  • Only points with $\alpha_i>0$ matter at prediction time; these are the support vectors, which “pin” the optimal boundary.

  • The kernel trick replaces dot products with a positive-definite kernel $K(\mathbf{x},\mathbf{x}')$, enabling nonlinear decision boundaries while keeping the optimization convex.

  • Hyperparameters ($C$, and kernel parameters like the RBF $\gamma$) strongly affect bias/variance and the number of support vectors; scaling features is essential.

  • SVMs are powerful for small-to-medium datasets and can be very robust, but kernel SVMs can be costly for very large $n$ due to the $n\times n$ kernel matrix.

Common Mistakes

  • Confusing the scale ambiguity of the decision boundary: scaling $(\mathbf{w}, b)$ changes $\|\mathbf{w}\|$ but not the boundary; the SVM margin definition relies on canonical scaling.

  • Forgetting to standardize features before training, causing one feature to dominate inner products and distorting margins and kernel behavior.

  • Assuming all training points influence the solution equally; in SVMs, non-support vectors often have α = 0 and do not affect the classifier.

  • Overfitting with RBF kernels by choosing both a large $C$ and a large $\gamma$, producing very wiggly boundaries and many support vectors.

Practice

easy

You are given a hyperplane $\mathbf{w}=(1,2)$, $b=-3$. (a) Compute the distance from $\mathbf{x}=(3,1)$ to the hyperplane. (b) What is the margin band width $2/\|\mathbf{w}\|$? (c) If you scale $(\mathbf{w}, b)$ by 10, do (a) and (b) change?

Hint: Use $\text{dist}=(\mathbf{w}\cdot\mathbf{x}+b)/\|\mathbf{w}\|$ and remember $\|\mathbf{w}\|=\sqrt{w_1^2+w_2^2}$.

Show solution

(a) $\|\mathbf{w}\|=\sqrt{1^2+2^2}=\sqrt{5}$. Compute $\mathbf{w}\cdot\mathbf{x}+b=(1,2)\cdot(3,1)-3=(3+2)-3=2$. Distance $=2/\sqrt{5}$.

(b) Band width $2/\|\mathbf{w}\|=2/\sqrt{5}$.

(c) Scaling $(\mathbf{w}, b)$ by 10 leaves the boundary unchanged. The signed distance to the boundary is unchanged because the numerator and denominator both scale by 10. But the expression $2/\|\mathbf{w}\|$ computed from the scaled $\mathbf{w}$ becomes $2/(10\sqrt{5})$, which shows why the margin is only meaningful once canonical scaling is fixed.

medium

A soft-margin SVM solution has three points with coefficients: $\alpha_1=0$, $\alpha_2=0.3$, $\alpha_3=C$. (a) Which points are support vectors? (b) Which point is likely violating the margin? (c) What condition must hold among the labels $y_i$ and coefficients $\alpha_i$?

Hint: Support vectors have $\alpha_i>0$. Points with $\alpha_i=C$ are often inside the margin or misclassified. The dual equality constraint is $\sum_i \alpha_i y_i=0$.

Show solution

(a) Points 2 and 3 are support vectors because they have $\alpha>0$.

(b) Point 3 (with $\alpha_3=C$) is likely a margin violator (inside the margin and/or misclassified).

(c) The coefficients must satisfy the dual constraint $\sum_i \alpha_i y_i=0$, i.e., $\alpha_1 y_1 + \alpha_2 y_2 + \alpha_3 y_3 = 0$.

hard

Consider an RBF kernel $K(\mathbf{x},\mathbf{x}')=\exp(-\gamma\|\mathbf{x}-\mathbf{x}'\|^2)$. (a) What happens to $K(\mathbf{x},\mathbf{x}')$ as $\gamma\to 0$? (b) What happens as $\gamma\to\infty$ for $\mathbf{x}\ne\mathbf{x}'$? (c) How would these extremes affect the flexibility of an SVM decision boundary?

Hint: Take limits of $\exp(-\gamma d^2)$ as $\gamma$ changes; interpret the kernel value as similarity/influence.

Show solution

(a) As $\gamma\to 0$, $K(\mathbf{x},\mathbf{x}')=\exp(-\gamma\|\mathbf{x}-\mathbf{x}'\|^2)\to \exp(0)=1$ for all pairs. The kernel matrix becomes approximately all ones (very low effective complexity).

(b) As $\gamma\to\infty$ with $\mathbf{x}\ne\mathbf{x}'$, $\|\mathbf{x}-\mathbf{x}'\|^2>0$, so $-\gamma\|\mathbf{x}-\mathbf{x}'\|^2\to -\infty$ and $K\to 0$. Also, $K(\mathbf{x},\mathbf{x})=1$ always, so the kernel matrix approaches the identity.

(c) A small $\gamma$ makes all points look similar, encouraging very smooth/simple boundaries (can underfit). A very large $\gamma$ makes similarity extremely local, allowing highly flexible boundaries that can interpolate noise (risk of overfitting), especially with a large $C$.

Connections

Next: Kernel Methods

Related nodes you may also connect in your mental map:

  • Linear classifiers (hyperplanes, margins)
  • Convex optimization (quadratic programs)
  • Regularization and loss functions (hinge loss vs logistic)
  • Feature scaling and preprocessing