Support vector machines (SVM) explained

A support vector machine (SVM) is a supervised classifier that finds the decision boundary with the widest margin between classes. Instead of fitting a probability curve like logistic regression, a linear SVM picks the hyperplane that maximizes the distance to the nearest training points on each side. Those nearest points are the support vectors. They alone determine the boundary: every other training example could be removed without changing the model. When classes are not linearly separable, soft-margin SVM allows controlled misclassification via a penalty parameter C, and the kernel trick implicitly lifts features into a higher-dimensional space so that a linear separator there becomes a curved boundary in the original space. SVMs were dominant on small and medium tabular datasets before gradient-boosted trees took over, and they remain strong baselines for text classification, bioinformatics, and high-dimensional problems with limited samples. This guide walks through the geometry, the hinge loss, kernels, hyperparameters, multiclass extensions, and practical deployment.

Maximum margin: the geometry of a linear SVM

In two dimensions with two classes, many lines can separate red dots from blue dots. Which one should you choose? SVM picks the line whose perpendicular distance to the closest point of each class is as large as possible. That distance is the margin. A wider margin makes the classifier less sensitive to small perturbations in the training data, which acts as a form of regularization built into the objective itself.

Formally, for feature vector x and label y ∈ {+1, -1}, a linear SVM seeks weights w and bias b such that y(w·x + b) ≥ 1 for all training points. Points that sit exactly on the margin (where y(w·x + b) = 1) are support vectors. The decision function is sign(w·x + b), and the distance from a point to the hyperplane is |w·x + b| / ||w||. Support vectors therefore lie at distance 1 / ||w|| from the boundary, so the gap between the two classes is 2 / ||w||. Maximizing the margin is equivalent to minimizing ||w||² subject to the constraints above, which is a convex quadratic program with a unique global optimum for linearly separable data.
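The constraint and distance formulas can be checked by hand on a tiny example. The sketch below uses hypothetical toy points and a hand-picked hyperplane (w, b); it is not the output of a solver, only an illustration of how the quantities relate.

```python
import math

def decision_value(w, b, x):
    """Raw score w·x + b for a single point."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def distance_to_hyperplane(w, b, x):
    """Perpendicular distance |w·x + b| / ||w||."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return abs(decision_value(w, b, x)) / norm_w

# Hypothetical toy data and a hand-picked separating hyperplane.
points = [(2.0, 2.0), (3.0, 3.0), (1.0, 1.0), (0.0, 0.0)]
labels = [+1, +1, -1, -1]
w, b = (1.0, 1.0), -3.0

# Hard-margin constraint: y * (w·x + b) >= 1 for every training point.
functional_margins = [y * decision_value(w, b, x) for x, y in zip(points, labels)]
constraints_hold = all(m >= 1.0 for m in functional_margins)

# Points whose functional margin is exactly 1 sit on the margin.
on_margin = [x for x, m in zip(points, functional_margins) if math.isclose(m, 1.0)]

# Gap between the two classes: 2 / ||w||.
margin_width = 2.0 / math.sqrt(sum(wi * wi for wi in w))

print(constraints_hold, on_margin, round(margin_width, 4))
```

Here the two points closest to the opposite class, (2, 2) and (1, 1), are the ones that land on the margin, and each sits at half the margin width from the hyperplane.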

Why only support vectors matter

At inference time, the model is a weighted sum over support vectors: f(x) = ∑ αi yi K(xi, x) + b, where K is the kernel function (dot product for linear SVM). Coefficients αi are zero for non-support-vector points. That sparsity makes SVM memory footprint proportional to the number of support vectors, not the full training set — useful when the boundary is defined by a handful of ambiguous examples near the class frontier.
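As a rough check of the weighted-sum form, the sketch below rebuilds the decision scores of a fitted scikit-learn SVC from its support vectors alone. It assumes scikit-learn and NumPy are available and uses synthetic data from make_blobs; in scikit-learn, dual_coef_ stores the products αi yi for the support vectors.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic two-class data (an assumption for illustration only).
X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.5, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Linear kernel: K(x_i, x) is a plain dot product between each
# support vector and each query point. Shape: (n_support_vectors, n_queries).
kernel_values = clf.support_vectors_ @ X.T

# f(x) = sum_i (alpha_i * y_i) * K(x_i, x) + b, summed over support vectors only.
manual_scores = clf.dual_coef_[0] @ kernel_values + clf.intercept_[0]

matches = np.allclose(manual_scores, clf.decision_function(X), atol=1e-6)
print(matches, len(clf.support_), "support vectors out of", len(X))
```

Only the rows in support_vectors_ enter the sum, which is why the stored model grows with the number of support vectors rather than with the training set.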

Hinge loss and soft-margin SVM

Real data is rarely perfectly separable. Noise, label errors, and overlapping class distributions mean a hard-margin SVM would fail or overfit to outliers. Soft-margin SVM relaxes the constraint by introducing slack variables ξi that allow points to fall inside the margin or on the wrong side, penalized in the objective:

Minimize ½||w||² + C ∑ ξi
subject to yi(w·xi + b) ≥ 1 - ξi and ξi ≥ 0.

The hyperparameter C controls the tradeoff: large C punishes misclassification heavily (narrower margin, risk of overfitting); small C tolerates more errors (wider margin, risk of underfitting). This is the same role as the inverse regularization strength in logistic regression with L2 penalty, but SVM optimizes margin geometry rather than log-likelihood.

The hinge loss for a single example is max(0, 1 - y·f(x)). It is zero when the point is correctly classified on or beyond the margin (y·f(x) ≥ 1) and grows linearly as the point moves inside the margin (0 < y·f(x) < 1) or onto the wrong side of the boundary (y·f(x) ≤ 0). Unlike squared loss, hinge loss stops penalizing well-classified points, which is another reason the solution depends only on borderline examples.
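For a fixed w and b, the smallest feasible slack for each point is exactly its hinge loss, so the soft-margin objective can be evaluated directly. The following is a minimal sketch with hypothetical toy points and a hand-picked hyperplane; it evaluates the objective and does not optimize it.

```python
def hinge_loss(y, score):
    """Hinge loss max(0, 1 - y * f(x)) for one example."""
    return max(0.0, 1.0 - y * score)

def soft_margin_objective(w, b, points, labels, C):
    """0.5 * ||w||^2 + C * sum of hinge losses (the slack terms)."""
    regularizer = 0.5 * sum(wi * wi for wi in w)
    total_slack = 0.0
    for x, y in zip(points, labels):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        total_slack += hinge_loss(y, score)
    return regularizer + C * total_slack

# The three regimes of the loss for a positive example (y = +1).
beyond_margin = hinge_loss(+1, 2.0)    # correctly classified, outside the margin
inside_margin = hinge_loss(+1, 0.5)    # correct side, but inside the margin
wrong_side = hinge_loss(+1, -1.0)      # misclassified

# Hypothetical toy data: the third point violates the margin.
points = [(2.0, 2.0), (1.0, 1.0), (1.5, 1.0)]
labels = [+1, -1, +1]
w, b = (1.0, 1.0), -3.0

low_c = soft_margin_objective(w, b, points, labels, C=1.0)
high_c = soft_margin_objective(w, b, points, labels, C=10.0)
print(beyond_margin, inside_margin, wrong_side, low_c, high_c)
```

The same margin violation costs ten times more under C=10 than under C=1, which is what pushes a large-C solver toward a narrower margin that fits the violating point.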

The kernel trick: nonlinear boundaries without explicit features

Many problems are not linearly separable in the original feature space. The kernel trick computes dot products in a higher-dimensional space without ever constructing that space explicitly. Replace xi·xj with K(xi, xj) in the dual formulation; the optimizer only needs pairwise kernel evaluations.
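A small numeric check makes the idea concrete. The sketch below uses the degree-2 homogeneous polynomial kernel K(x, z) = (x·z)² on 2-D inputs, chosen here only because its explicit feature map is short enough to write out, and confirms that the kernel value equals a dot product in the lifted space.

```python
import math

def poly2_kernel(x, z):
    """K(x, z) = (x·z)^2, computed in the original 2-D space."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def explicit_features(x):
    """Explicit 3-D feature map whose dot product reproduces poly2_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

x, z = (1.0, 2.0), (3.0, 4.0)

kernel_value = poly2_kernel(x, z)
explicit_value = sum(a * b for a, b in zip(explicit_features(x), explicit_features(z)))

print(kernel_value, explicit_value)
```

Both routes give the same number (up to floating-point rounding), but the kernel route never builds the lifted vectors. For kernels such as RBF, whose implicit space is infinite-dimensional, that shortcut is the only practical option.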

Common kernels

  • Linear: K(x, x') = x·x'. Fastest; best when dimensionality is high and classes are roughly linearly separable (text with TF-IDF).
  • Polynomial: K(x, x') = (γ x·x' + r)^d. Captures interaction terms up to degree d; rarely tuned today except in legacy pipelines.
  • RBF (Gaussian): K(x, x') = exp(-γ ||x - x'||²). The default nonlinear choice; influence decays with distance. The hyperparameter γ sets the effective radius: large γ fits tight, wiggly boundaries (overfit); small γ smooths the boundary (underfit).
  • Sigmoid: K(x, x') = tanh(γ x·x' + r). Behaves like a small neural net layer but is not positive semi-definite for every parameter choice; avoid unless you know why.
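To see how γ sets the effective radius of the RBF kernel, the sketch below evaluates it for a nearby and a distant point at several γ values. The points and γ values are arbitrary choices for illustration.

```python
import math

def rbf(x, z, gamma):
    """RBF kernel exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

anchor = (0.0, 0.0)
near = (0.5, 0.0)   # squared distance 0.25 from the anchor
far = (3.0, 0.0)    # squared distance 9.0 from the anchor

for gamma in (0.1, 1.0, 10.0):
    print(gamma, round(rbf(anchor, near, gamma), 4), round(rbf(anchor, far, gamma), 6))
```

With small γ the distant point still has noticeable similarity to the anchor, so each training example influences a wide region. With large γ the similarity collapses to nearly zero outside a tight neighborhood, which is what lets the boundary bend around individual points.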

Kernel SVM training cost scales between O(n²) and O(n³) in the number of training samples, which is why SVM fell out of favor for million-row datasets but remains competitive on tens of thousands of rows with careful subsampling or linear approximations.

Preprocessing, hyperparameters and model selection

SVM is not scale-invariant. Features measured in dollars and features measured in percentages must be normalized before training, typically with standardization (zero mean, unit variance) or min-max scaling. Skipping this step lets large-magnitude features dominate the distance calculations in RBF kernels and distorts the margin in linear SVM. Treat scaling as part of the model pipeline: fit the scaler's parameters only on the training fold, and evaluate the whole pipeline with cross-validation.
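One way to keep scaling inside the model is a scikit-learn pipeline, sketched below under the assumption that scikit-learn is installed. The data is synthetic, with one feature deliberately inflated to mimic mixed units.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data; the first feature is inflated to imitate a "dollars" column.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X[:, 0] *= 1000.0

# The scaler is a pipeline step, so each CV split refits it on that
# split's training portion only and never sees the held-out fold.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
scores = cross_val_score(model, X, y, cv=5)

print(scores.mean())
```

Scaling the full matrix once before cross-validation would compute the mean and variance from held-out rows as well, which is the leakage this arrangement avoids.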

Key hyperparameters

| Parameter | Effect when increased | Typical search range |
| --- | --- | --- |
| C | Harder fit to training data; narrower margin; typically fewer support vectors | Log-spaced: 0.01 to 100 |
| gamma (RBF) | Each point influences a smaller neighborhood; more complex boundary | Log-spaced: 1e-4 to 1. In scikit-learn, gamma='auto' is 1 / n_features and the default gamma='scale' is 1 / (n_features × X.var()) |
| degree (poly) | Higher-order feature interactions | 2 to 5 (rarely beyond 3) |
| class_weight | Up-weights minority class errors in imbalanced problems | 'balanced' or custom dict |

Grid search or Bayesian optimization over C and gamma on a stratified k-fold is standard. Watch for train-test leakage through global scaling. For very large datasets, consider LinearSVC or SGD-based hinge loss instead of full kernel SVM.
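A log-spaced grid search over C and gamma might look like the sketch below. It assumes scikit-learn and NumPy, uses synthetic data, and takes its grid bounds from the ranges listed above; the step names "scale" and "svc" are arbitrary labels chosen for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Scaling lives inside the pipeline, so it is refit within every fold.
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(kernel="rbf"))])

# Log-spaced candidates: C in [0.01, 100], gamma in [1e-4, 1].
param_grid = {
    "svc__C": np.logspace(-2, 2, 5),
    "svc__gamma": np.logspace(-4, 0, 5),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

If the best value lands on the edge of the grid, widening the range in that direction is usually worth a second pass.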

Multiclass classification and probability outputs

Binary SVM extends to multiclass via one-vs-rest (train one classifier per class vs all others) or one-vs-one (pairwise classifiers; vote or max score). Scikit-learn's SVC always trains one-vs-one classifiers internally; its decision_function_shape setting, which defaults to 'ovr', only changes the shape of the returned scores. LinearSVC uses one-vs-rest. Neither approach produces calibrated probabilities out of the box: decision values are distance scores, not log-odds. If you need well-calibrated probabilities for threshold tuning or business rules, wrap SVM with Platt scaling (sigmoid fit on cross-validated scores) or use logistic regression instead.
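The sketch below illustrates both points on synthetic four-class data, assuming scikit-learn is available: the shape of the raw decision scores under the two multiclass layouts, and Platt-style calibration through CalibratedClassifierCV with method="sigmoid".

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, n_classes=4, random_state=0
)

# Raw scores: one column per class pair (ovo) or one per class (ovr).
ovo = make_pipeline(StandardScaler(), SVC(decision_function_shape="ovo")).fit(X, y)
ovr = make_pipeline(StandardScaler(), SVC(decision_function_shape="ovr")).fit(X, y)
ovo_scores = ovo.decision_function(X)   # 4 classes -> 6 pairwise columns
ovr_scores = ovr.decision_function(X)   # 4 classes -> 4 columns

# Platt scaling: fit a sigmoid on held-out decision scores from 5-fold CV.
calibrated = CalibratedClassifierCV(
    make_pipeline(StandardScaler(), SVC()), method="sigmoid", cv=5
)
calibrated.fit(X, y)
proba = calibrated.predict_proba(X)

print(ovo_scores.shape, ovr_scores.shape, proba.shape)
```

The calibrated outputs sum to one per row and can be thresholded. Whether they are well calibrated on a given problem still needs checking on held-out data, for example with a reliability plot.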

For multi-label problems (each example can belong to several classes), train independent binary SVMs per label — the same pattern as multilabel logistic regression.
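In scikit-learn, one way to express the per-label pattern is OneVsRestClassifier fed a binary indicator matrix. The example below is a sketch on synthetic multi-label data, not a recommendation of specific settings.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Y is a (n_samples, n_labels) 0/1 indicator matrix; rows may have several 1s.
X, Y = make_multilabel_classification(
    n_samples=200, n_features=20, n_classes=4, random_state=0
)

# One independent binary SVM is trained per label column.
clf = OneVsRestClassifier(SVC(kernel="linear", C=1.0)).fit(X, Y)
Y_pred = clf.predict(X)

print(len(clf.estimators_), Y_pred.shape)
```

Each label gets its own classifier, so class weights or thresholds can be tuned per label when some labels are much rarer than others.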

SVM vs other classifiers: when to use it

| Method | Strengths | Weaknesses |
| --- | --- | --- |
| Logistic regression | Calibrated probabilities, fast, interpretable coefficients | Linear by default; needs manual feature crosses for nonlinear data |
| Random forest / XGBoost | State of the art on tabular data; handles mixed types; feature importance | More hyperparameters; slower than linear models on very high-dimensional sparse text |
| Naive Bayes | Extremely fast baseline for text and spam filters | Strong independence assumption; usually lower accuracy than SVM on text |
| Kernel SVM | Strong on medium-size nonlinear sets; sparse support vectors | Poor scalability; no native probabilities; sensitive to scaling |
| Linear SVM | Fast on high-dimensional sparse data (text, bag-of-words) | Cannot capture complex nonlinear structure without feature engineering |

Reach for kernel SVM when you have up to roughly 50k samples, moderate feature count, clear nonlinear structure, and need a strong baseline before deep learning. Reach for linear SVM when dimensionality is huge (text classification with n-grams) and training speed matters. On modern tabular benchmarks with mixed numeric and categorical columns, gradient boosting usually wins — but SVM training is a useful sanity check: if boosting barely beats a tuned RBF SVM, your signal may be weak or your features may need work.

Production pitfalls

  • Training-serving skew: Apply the same scaler at inference; store scaler parameters with the model artifact.
  • Latency with many support vectors: Kernel evaluation is O(num_support_vectors × features) per prediction. Reduce the support vector count (for example by tuning nu in nu-SVC, which lower-bounds the fraction of support vectors) or switch to linear SVM if latency budgets are tight.
  • Concept drift: Support vectors anchored to old boundary points may go stale; monitor model drift and retrain on a schedule.
  • Imbalanced classes: Default SVM optimizes overall margin, not minority recall — use class_weight='balanced' or resampling.
  • Missing probability outputs: Do not treat raw decision function scores as probabilities without calibration.
  • High-dimensional noise: With more features than informative signal, linear SVM with L1 feature selection or PCA preprocessing often beats RBF kernels that overfit noise dimensions.
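For the imbalanced-class item above, it helps to see what class_weight='balanced' works out to. Scikit-learn documents the weights as n_samples / (n_classes × count of each class); the sketch below reproduces that formula in plain Python on a hypothetical 90/10 label split.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weights n_samples / (n_classes * class_count), one per class."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical imbalanced target: 90 negatives, 10 positives.
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)

print(weights)
```

Each class weight multiplies C for that class's examples, so with this split an error on a minority example costs nine times as much as one on a majority example.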

Decision guide

| Situation | Recommended approach |
| --- | --- |
| Text spam / sentiment, sparse bag-of-words, >100k features | Linear SVM or logistic regression with L2 |
| Small tabular set (<10k rows), nonlinear class boundary | RBF SVM with grid-searched C and gamma |
| Need calibrated fraud probability for a cutoff rule | Logistic regression or Platt-scaled SVM |
| Million-row click prediction | Linear SVM, logistic regression, or LightGBM (not kernel SVM) |
| Interpretable linear decision with margin robustness | Linear SVM; inspect support vectors near the boundary for error analysis |
| Multiclass image classification at scale | Deep CNN; SVM on raw pixels is a historical baseline only |

Practitioner checklist

  • Standardize or scale numeric features before fitting; fit scaler on train only.
  • Start with linear SVM on high-dimensional sparse data; try RBF only if linear underperforms in CV.
  • Grid-search C and gamma on log scale with stratified k-fold.
  • Check the number of support vectors: a large fraction of the training set suggests overfitting, heavily overlapping classes, or noisy labels.
  • Use class_weight or resampling for imbalanced targets; evaluate with precision-recall, not accuracy alone.
  • Calibrate scores with Platt scaling if downstream systems need probabilities.
  • Compare against logistic regression and gradient boosting on the same CV splits.
  • Bundle scaler, SVM model, and label encoder in one serialized pipeline for train-serve parity.
  • Profile inference latency when support vector count exceeds a few thousand.
  • Document kernel choice and hyperparameters — RBF gamma dominates boundary shape.
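Several checklist items (bundling the scaler with the model, checking the support vector fraction, train-serve parity) can be combined in a few lines. The sketch below assumes scikit-learn and uses in-memory pickle on synthetic data; a label encoder is omitted because the synthetic labels are already integers, and the step names are arbitrary.

```python
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Scaler and SVM travel together as one artifact.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("svc", SVC(kernel="rbf", C=1.0, gamma="scale")),
])
pipeline.fit(X_train, y_train)

# Fraction of training points kept as support vectors.
svc = pipeline.named_steps["svc"]
sv_fraction = svc.n_support_.sum() / len(X_train)

# Round-trip through serialization and confirm identical predictions.
blob = pickle.dumps(pipeline)
restored = pickle.loads(blob)
same_predictions = np.array_equal(restored.predict(X_test), pipeline.predict(X_test))

print(round(sv_fraction, 3), same_predictions)
```

Pickled scikit-learn objects are tied to the library version that produced them, so recording that version next to the artifact is a sensible precaution.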

Key takeaways

  • SVM maximizes margin between classes; only support vectors on the boundary define the decision function.
  • Soft margin + C balances margin width against training errors — the main regularization knob.
  • Kernel trick enables nonlinear boundaries; RBF is the default nonlinear kernel with gamma controlling locality.
  • Scale features and tune with cross-validation; SVM is sensitive to preprocessing mistakes.
  • Know the limits: kernel SVM does not scale to huge datasets and does not emit calibrated probabilities without extra steps.
