Support vector machines (SVM) explained
A support vector machine (SVM) is a supervised classifier that
finds the decision boundary with the widest margin between classes. Instead of
fitting a probability curve like
logistic regression,
a linear SVM picks the hyperplane that maximizes the distance to the nearest
training points on each side. Those nearest points are the
support vectors — they alone determine the boundary; every other
training example could be removed without changing the model. When classes are not
linearly separable, soft-margin SVM allows controlled misclassification via a
penalty parameter C, and the kernel trick lifts
features into higher dimensions so a linear separator in that space becomes a
curved boundary in the original space. SVMs were dominant on small and
medium tabular datasets before gradient-boosted trees took over, and they remain
strong baselines for text classification, bioinformatics, and high-dimensional
problems with limited samples. This guide walks through the geometry, the hinge
loss, kernels, hyperparameters, multiclass extensions, and practical deployment
alongside other classical methods in
machine learning fundamentals.
Maximum margin: the geometry of a linear SVM
In two dimensions with two classes, many lines can separate red dots from blue dots. Which one should you choose? SVM picks the line whose perpendicular distance to the closest point of each class is as large as possible. That distance is the margin. A wider margin makes the classifier less sensitive to small perturbations in the training data, so margin maximization acts as a built-in regularizer, playing a role similar to an explicit L2 penalty on the weights.
Formally, for feature vector x and label y ∈ {+1, -1},
a linear SVM seeks weights w and bias b such that
y(w·x + b) ≥ 1 for all training points. Points that sit
exactly on the margin (where y(w·x + b) = 1) are support
vectors. The decision function is sign(w·x + b); the
distance from a point to the hyperplane is
|w·x + b| / ||w||. Support vectors satisfy |w·x + b| = 1, so the margin on
each side is 1 / ||w|| and its full width is 2 / ||w||. Maximizing the margin is therefore equivalent to
minimizing ||w||² subject to the constraints above: a convex
quadratic program with a unique global optimum for linearly separable data.
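The quantities above can be checked by hand on a toy problem. The sketch below assumes a made-up hyperplane (w = (1, 1), b = -3) and four made-up points; it is an illustration of the constraint and distance formulas, not a training routine.

```python
import math

def decision_value(w, b, x):
    """Raw score w·x + b; its sign is the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def distance_to_hyperplane(w, b, x):
    """Geometric distance |w·x + b| / ||w|| from x to the hyperplane."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return abs(decision_value(w, b, x)) / norm_w

# Hypothetical hyperplane x1 + x2 - 3 = 0 and hypothetical labelled points.
w, b = [1.0, 1.0], -3.0
points = [([2.0, 2.0], +1), ([1.0, 1.0], -1), ([4.0, 3.0], +1), ([0.0, 1.0], -1)]

for x, y in points:
    functional_margin = y * decision_value(w, b, x)  # must be >= 1
    on_margin = math.isclose(functional_margin, 1.0)  # support vector test
    print(x, y, functional_margin, distance_to_hyperplane(w, b, x), on_margin)

# Full margin width is 2 / ||w||.
margin_width = 2.0 / math.sqrt(sum(wi * wi for wi in w))
print("margin width:", margin_width)
```

In this toy setup the points (2, 2) and (1, 1) sit exactly on the margin, and the gap between them equals the margin width, so no other hyperplane could separate these particular points with a wider margin.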
Why only support vectors matter
At inference time, the model is a weighted sum over support vectors:
f(x) = ∑ αi yi K(xi, x) + b,
where K is the kernel function (dot product for linear SVM). Coefficients
αi are zero for non-support-vector points. That
sparsity makes SVM memory footprint proportional to the number of support vectors,
not the full training set — useful when the boundary is defined by a handful of
ambiguous examples near the class frontier.
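To make the sparsity concrete, the following sketch fits scikit-learn's SVC on a small made-up dataset and rebuilds the decision function from the stored support vectors alone. The data and the large C value (chosen to approximate a hard margin) are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (made up for illustration).
X = np.array([[1.0, 1.0], [0.0, 1.0], [1.0, 0.0], [0.0, 0.0],
              [2.0, 2.0], [3.0, 2.0], [2.0, 3.0], [4.0, 3.0]])
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])

# A large C approximates a hard margin on separable data.
clf = SVC(kernel="linear", C=1e3).fit(X, y)

# dual_coef_ holds alpha_i * y_i, and only for the support vectors.
alpha_y = clf.dual_coef_[0]
sv = clf.support_vectors_

def manual_decision(x):
    """f(x) = sum_i alpha_i y_i K(x_i, x) + b with a linear kernel."""
    return float(alpha_y @ (sv @ x) + clf.intercept_[0])

x_new = np.array([2.5, 1.0])
print(len(sv), "of", len(X), "training points are support vectors")
print(manual_decision(x_new), clf.decision_function([x_new])[0])
```

The manual sum and `decision_function` should agree up to floating-point error, even though the manual version never touches the non-support-vector rows of X.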
Hinge loss and soft-margin SVM
Real data is rarely perfectly separable. Noise, label errors, and overlapping
class distributions mean a hard-margin SVM would fail or overfit to outliers.
Soft-margin SVM relaxes the constraint by introducing slack
variables ξi that allow points to fall inside the
margin or on the wrong side, penalized in the objective:
Minimize ½||w||² + C ∑ ξi subject to
yi(w·xi + b) ≥ 1 - ξi
and ξi ≥ 0. The hyperparameter
C controls the tradeoff: large C punishes misclassification
heavily (narrower margin, risk of overfitting); small C tolerates more errors
(wider margin, risk of underfitting). This is the same role as the inverse
regularization strength in
logistic regression
with L2 penalty, but SVM optimizes margin geometry rather than log-likelihood.
The hinge loss for a single example is
max(0, 1 - y·f(x)). It is zero when the point is correctly
classified on or beyond the margin (y·f(x) ≥ 1) and grows linearly as y·f(x)
falls below 1: it lies between 0 and 1 for points inside the margin but on the
correct side, and exceeds 1 for misclassified points. Unlike squared loss, hinge loss stops penalizing well-classified
points, which is another reason the solution depends only on borderline examples.
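A short sketch of the loss and the soft-margin objective follows. It relies on the fact that at the optimum each slack variable equals the hinge loss of its example, so the constrained objective can be evaluated as ½||w||² + C ∑ max(0, 1 - y·f(x)). The weights, bias, data, and C below are made up for illustration.

```python
def hinge_loss(y, score):
    """max(0, 1 - y*f(x)) for a label y in {+1, -1} and raw score f(x)."""
    return max(0.0, 1.0 - y * score)

def soft_margin_objective(w, b, data, C):
    """0.5 * ||w||^2 + C * (sum of hinge losses over (x, y) pairs)."""
    reg = 0.5 * sum(wi * wi for wi in w)
    total_slack = sum(
        hinge_loss(y, sum(wi * xi for wi, xi in zip(w, x)) + b)
        for x, y in data
    )
    return reg + C * total_slack

print(hinge_loss(+1, 2.0))   # correctly classified, beyond the margin
print(hinge_loss(+1, 0.4))   # correct side, but inside the margin
print(hinge_loss(+1, -0.5))  # misclassified

# Hypothetical model and data: the third point lies on the hyperplane itself.
w, b = [1.0, 1.0], -3.0
data = [([2.0, 2.0], +1), ([1.0, 1.0], -1), ([1.5, 1.5], +1)]
objective = soft_margin_objective(w, b, data, C=10.0)
print(objective)
```

Raising C in this sketch scales only the slack term, which is why a large C pushes the optimizer toward fewer margin violations at the cost of a larger ||w||.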
The kernel trick: nonlinear boundaries without explicit features
Many problems are not linearly separable in the original feature space. The
kernel trick computes dot products in a higher-dimensional
space without ever constructing that space explicitly. Replace
xi·xj with K(xi, xj)
in the dual formulation; the optimizer only needs pairwise kernel evaluations.
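The identity behind the trick can be verified directly for a small case. The sketch below uses the homogeneous degree-2 polynomial kernel K(x, z) = (x·z)² on 2-D inputs, whose explicit feature map is φ(x) = (x1², √2·x1·x2, x2²); the input vectors are arbitrary.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def phi(x):
    """Explicit degree-2 feature map for a 2-D input."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2]

def poly2_kernel(x, z):
    """Same inner product, computed without building phi."""
    return dot(x, z) ** 2

x, z = [1.0, 2.0], [3.0, -1.0]
explicit = dot(phi(x), phi(z))   # dot product in the lifted 3-D space
implicit = poly2_kernel(x, z)    # kernel evaluated in the original 2-D space
print(explicit, implicit)
```

Both numbers agree up to floating-point error. For higher degrees or the RBF kernel the lifted space becomes very large or infinite-dimensional, which is where evaluating K directly pays off.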
Common kernels
- Linear: K(x, x') = x·x'. Fastest; best when dimensionality is high and classes are roughly linearly separable (text with TF-IDF).
- Polynomial: K(x, x') = (γ x·x' + r)^d. Captures interaction terms up to degree d; rarely tuned today except in legacy pipelines.
- RBF (Gaussian): K(x, x') = exp(-γ ||x - x'||²). The default nonlinear choice; local influence decays with distance. Hyperparameter γ sets the effective radius: large γ fits tight, wiggly boundaries (overfit); small γ smooths the boundary (underfit).
- Sigmoid: tanh-based; behaves like a small neural net layer but can fail to be positive semi-definite, so avoid it unless you know why.
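The first three kernels in the list can be written in a few lines of plain Python. This is a minimal sketch with made-up inputs and default parameter values chosen only for illustration; it shows how γ changes the similarity that the RBF kernel assigns to two points one unit apart.

```python
import math

def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, gamma=1.0, r=1.0, d=2):
    """(gamma * x·z + r)^d"""
    return (gamma * linear_kernel(x, z) + r) ** d

def rbf_kernel(x, z, gamma):
    """exp(-gamma * ||x - z||^2)"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

x, z = [0.0, 0.0], [1.0, 0.0]  # two points at distance 1
for gamma in (0.1, 1.0, 10.0):
    # Larger gamma makes the similarity fall off faster with distance.
    print(gamma, rbf_kernel(x, z, gamma))
```

With a large γ, even nearby points look dissimilar, so each training point influences only a tiny neighborhood; that is the mechanism behind the tight, wiggly boundaries described above.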
Kernel SVM training cost scales between O(n²) and
O(n³) in the number of training samples, which is why SVM
fell out of favor for million-row datasets but remains competitive on tens of
thousands of rows with careful subsampling or linear approximations.
Preprocessing, hyperparameters and model selection
SVM is not scale-invariant. Features measured in dollars and features measured in percentages must be normalized before training — typically standardization (zero mean, unit variance) or min-max scaling. Skipping this step lets large-magnitude features dominate the distance calculations in RBF kernels and distorts the margin in linear SVM. Treat scaling as part of the model pipeline with parameters fit only on the training fold, as described in feature engineering and validated with cross-validation.
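One way to keep scaling inside the model, sketched here with scikit-learn, is to put the scaler and the SVM in a single pipeline so that each cross-validation fold fits the scaler on its own training portion. The synthetic dataset and the inflated first feature are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data; the first feature is inflated to mimic a dollar-valued column.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           random_state=0)
X = X.copy()
X[:, 0] *= 10_000

# The pipeline is cloned and refit per fold, so scaler statistics
# never see the held-out fold.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
scores = cross_val_score(model, X, y, cv=5)
print("mean CV accuracy:", scores.mean())
```

Scaling the full dataset before splitting would compute the mean and variance from rows that later serve as validation data, which is the leakage this arrangement avoids.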
Key hyperparameters
| Parameter | Effect when increased | Typical search range |
|---|---|---|
| C | Harder fit to training data; narrower margin; typically fewer support vectors | Log-spaced: 0.01 to 100 |
| gamma (RBF) | Each point influences a smaller neighborhood; more complex boundary | Log-spaced: 1e-4 to 1; or 1 / n_features (scikit-learn's gamma='auto') as a starting point |
| degree (poly) | Higher-order feature interactions | 2 to 5 (rarely beyond 3) |
| class_weight | Up-weights minority class errors in imbalanced problems | balanced or custom dict |
Grid search or Bayesian optimization over C and gamma
on a stratified k-fold is standard. Watch for train-test leakage through
global scaling. For very large datasets, consider
LinearSVC or SGD-based hinge loss instead of full kernel SVM.
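A grid search along these lines might look like the sketch below, which uses scikit-learn's GridSearchCV over a scaler-plus-SVC pipeline. The synthetic dataset, the grid values (taken from the ranges in the table), and the accuracy metric are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           n_redundant=1, random_state=0)

# Scaling lives inside the pipeline, so it is refit on each training fold.
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(kernel="rbf"))])

# Log-spaced grids; the "svc__" prefix routes each value to the SVC step.
param_grid = {
    "svc__C": [0.01, 0.1, 1, 10, 100],
    "svc__gamma": [1e-4, 1e-3, 1e-2, 1e-1, 1],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

If the best value lands on the edge of a grid, widening that range is usually worth a second pass before trusting the result.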
Multiclass classification and probability outputs
Binary SVM extends to multiclass via one-vs-rest (train one classifier per
class vs all others) or one-vs-one (pairwise classifiers; vote or max score).
Scikit-learn's SVC always trains one-vs-one classifiers internally (its default
decision_function_shape='ovr' only reshapes the resulting scores), while LinearSVC
trains one-vs-rest. Neither approach produces calibrated probabilities out of
the box — decision values are distance scores, not log-odds. If you need
well-calibrated probabilities for threshold tuning or business rules, wrap SVM
with Platt scaling (sigmoid fit on cross-validated scores) or use logistic
regression instead.
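Platt scaling can be sketched with scikit-learn's CalibratedClassifierCV, where method="sigmoid" fits the sigmoid on cross-validated decision scores. The synthetic data and the choice of five folds are assumptions for the example.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# The uncalibrated model: exposes decision_function but no probabilities.
base = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# method="sigmoid" is Platt scaling, fit on held-out folds of the training set.
calibrated = CalibratedClassifierCV(base, method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)

proba = calibrated.predict_proba(X_test)
print(proba[:3])
```

Whether the resulting probabilities are good enough for a cutoff rule is an empirical question; a reliability plot or Brier score on held-out data would be the usual check.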
For multi-label problems (each example can belong to several classes), train independent binary SVMs per label — the same pattern as multilabel logistic regression.
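The per-label pattern can be sketched with scikit-learn's OneVsRestClassifier, which fits one binary estimator per column of a label-indicator matrix. The synthetic multi-label dataset and the hyperparameters are assumptions for illustration.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Y is a 0/1 indicator matrix with one column per label.
X, Y = make_multilabel_classification(n_samples=200, n_features=20,
                                      n_classes=4, random_state=0)

# One independent scaler + linear SVM per label.
binary_svm = make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=10_000))
clf = OneVsRestClassifier(binary_svm)
clf.fit(X, Y)

pred = clf.predict(X[:5])  # each row can have several labels set to 1
print(pred)
```

Because the classifiers are independent, each label can get its own threshold or class weighting if the label frequencies differ widely.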
SVM vs other classifiers: when to use it
| Method | Strengths | Weaknesses |
|---|---|---|
| Logistic regression | Calibrated probabilities, fast, interpretable coefficients | Linear by default; needs manual feature crosses for nonlinear data |
| Random forest / XGBoost | State of the art on tabular; handles mixed types; feature importance | More hyperparameters; slower on very high-dimensional sparse text unless linear |
| Naive Bayes | Extremely fast baseline for text spam filters | Strong independence assumption; usually lower accuracy than SVM on text |
| Kernel SVM | Strong on medium-size nonlinear sets; sparse support vectors | Poor scalability; no native probabilities; sensitive to scaling |
| Linear SVM | Fast on high-d sparse data (text, bag-of-words) | Cannot capture complex nonlinear structure without feature engineering |
Reach for kernel SVM when you have up to roughly 50k samples, moderate feature count, clear nonlinear structure, and need a strong baseline before deep learning. Reach for linear SVM when dimensionality is huge (text classification with n-grams) and training speed matters. On modern tabular benchmarks with mixed numeric and categorical columns, gradient boosting usually wins — but SVM training is a useful sanity check: if boosting barely beats a tuned RBF SVM, your signal may be weak or your features may need work.
Production pitfalls
- Training-serving skew: Apply the same scaler at inference; store scaler parameters with the model artifact.
- Latency with many support vectors: Kernel evaluation is O(num_support_vectors × features) per prediction. Prune with nu-SVC or switch to linear SVM if latency budgets are tight.
- Concept drift: Support vectors anchored to old boundary points may go stale; monitor model drift and retrain on schedule.
- Imbalanced classes: Default SVM optimizes overall margin, not minority recall; use class_weight='balanced' or resampling.
- Missing probability outputs: Do not treat raw decision function scores as probabilities without calibration.
- High-dimensional noise: With more features than informative signal, linear SVM with L1 feature selection or PCA preprocessing often beats RBF kernels that overfit noise dimensions.
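Several of these pitfalls can be addressed in one artifact. The sketch below bundles the scaler and a class-weighted SVC in a single pipeline, round-trips it through pickle, and reports the support-vector fraction as a rough latency indicator. The imbalanced synthetic data is made up, and pickle stands in for whatever serialization the deployment actually uses.

```python
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data with roughly a 90/10 class split.
X, y = make_classification(n_samples=300, n_features=6, weights=[0.9, 0.1],
                           random_state=0)

# Scaler and model travel together; class_weight counters the imbalance.
model = make_pipeline(StandardScaler(),
                      SVC(kernel="rbf", class_weight="balanced"))
model.fit(X, y)

# Serialize and restore the whole pipeline, scaler parameters included.
blob = pickle.dumps(model)
restored = pickle.loads(blob)
same_predictions = np.array_equal(model.predict(X), restored.predict(X))

# n_support_ holds the per-class support vector counts.
svc = restored.named_steps["svc"]
sv_fraction = svc.n_support_.sum() / len(X)
print("round-trip ok:", same_predictions, "| SV fraction:", sv_fraction)
```

Prediction cost grows with the support-vector count, so tracking this fraction across retrains gives an early signal before latency budgets are hit.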
Decision guide
| Situation | Recommended approach |
|---|---|
| Text spam / sentiment, sparse bag-of-words, >100k features | Linear SVM or logistic regression with L2 |
| Small tabular set (<10k rows), nonlinear class boundary | RBF SVM with grid-searched C and gamma |
| Need calibrated fraud probability for a cutoff rule | Logistic regression or Platt-scaled SVM |
| Million-row click prediction | Linear SVM, logistic regression, or LightGBM — not kernel SVM |
| Interpretable linear decision with margin robustness | Linear SVM — inspect support vectors near boundary for error analysis |
| Multiclass image classification at scale | Deep CNN — SVM on raw pixels is a historical baseline only |
Practitioner checklist
- Standardize or scale numeric features before fitting; fit scaler on train only.
- Start with linear SVM on high-dimensional sparse data; try RBF only if linear underperforms in CV.
- Grid-search C and gamma on log scale with stratified k-fold.
- Check number of support vectors: a large fraction of training set suggests overfitting or noisy labels.
- Use class_weight or resampling for imbalanced targets; evaluate with precision-recall, not accuracy alone.
- Calibrate scores with Platt scaling if downstream systems need probabilities.
- Compare against logistic regression and gradient boosting on the same CV splits.
- Bundle scaler, SVM model, and label encoder in one serialized pipeline for train-serve parity.
- Profile inference latency when support vector count exceeds a few thousand.
- Document kernel choice and hyperparameters — RBF gamma dominates boundary shape.
Key takeaways
- SVM maximizes margin between classes; only support vectors on the boundary define the decision function.
- Soft margin + C balances margin width against training errors — the main regularization knob.
- Kernel trick enables nonlinear boundaries; RBF is the default nonlinear kernel with gamma controlling locality.
- Scale features and tune with cross-validation; SVM is sensitive to preprocessing mistakes.
- Know the limits: kernel SVM does not scale to huge datasets and does not emit calibrated probabilities without extra steps.
Related reading
- Logistic regression explained — probabilistic linear baseline with interpretable coefficients
- Naive Bayes explained — fast probabilistic classifier for text and sparse features
- Decision trees and random forests explained — nonlinear tree ensembles for tabular data
- Overfitting and cross-validation explained — validation discipline for hyperparameter search