Guide

Support vector machines (SVM) explained

A fraud analyst draws a line through a scatter plot separating legitimate charges from suspicious ones. Two lines might achieve the same training accuracy — but one hugs the data while another leaves a wide buffer on both sides. Which generalizes better when new transactions arrive? Support vector machines (SVMs), developed by Vapnik and colleagues in the 1990s, formalize that intuition: find the maximum-margin hyperplane that separates classes with the widest possible gap. Only the points nearest the boundary — the support vectors — determine where the line sits. When data is not linearly separable, the kernel trick maps features into higher dimensions without explicitly computing the transformation. SVMs powered early text classifiers, bioinformatics, and image recognition before deep learning dominated unstructured data. They remain strong baselines on small-to-medium tabular datasets with careful feature scaling. This guide covers linear and soft-margin geometry, hinge loss, common kernels, hyperparameter tuning, multiclass extensions, a Harbor Analytics fraud scorer worked example, a model comparison table, pitfalls, and a production checklist — building on machine learning fundamentals alongside logistic regression and tree ensembles.

What an SVM optimizes

Given labeled training points in feature space, a linear SVM seeks a hyperplane w · x + b = 0 that separates positive and negative classes while maximizing the margin, the gap between the hyperplane and the nearest points of each class. Under the canonical scaling in which the nearest points satisfy |w · x + b| = 1, the distance from the boundary to those points is 1/||w|| and the full margin width is 2/||w||. Minimizing ||w|| subject to y_i (w · x_i + b) ≥ 1 for every training point therefore yields the widest gap.
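The relationship between ||w|| and the margin can be checked with a few lines of arithmetic. This is a minimal sketch assuming the canonical scaling in which the nearest points satisfy |w · x + b| = 1; the weight vector, bias, and test point are illustrative values, not taken from any dataset in this guide.

```python
import math

def margin_width(w):
    """Full width of the margin band, 2 / ||w||, under canonical scaling."""
    return 2.0 / math.sqrt(sum(wj * wj for wj in w))

def signed_distance(w, b, x):
    """Signed perpendicular distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wj * wj for wj in w))
    return (sum(wj * xj for wj, xj in zip(w, x)) + b) / norm

w, b = [3.0, 4.0], -5.0                    # illustrative values; ||w|| = 5
width = margin_width(w)                    # 2 / 5
dist = signed_distance(w, b, [3.0, 4.0])   # (9 + 16 - 5) / 5
print(width, dist)
```

Halving every component of w would double the margin width, which is why the optimizer pushes ||w|| down as far as the constraints allow.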

Hard margin vs soft margin

Hard-margin SVM requires every training point to satisfy y_i (w · x_i + b) ≥ 1, that is, to fall on the correct side of the boundary and outside the margin. This works only when data is perfectly separable, and it is sensitive to outliers: one mislabeled point can rotate the entire hyperplane or make the problem infeasible.

Soft-margin SVM introduces slack variables ξ_i that allow some points to violate the margin or even cross to the wrong side. The objective trades margin width against total slack, controlled by the regularization parameter C. Large C penalizes misclassifications heavily (narrow margin, fewer errors on training data); small C tolerates more violations (wider margin, smoother boundary). In practice, always use soft margin unless you have verified separability and pristine labels.
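The effect of C on margin width can be observed directly. The sketch below assumes scikit-learn and NumPy are installed and uses synthetic overlapping blobs (not real data); it fits a linear SVC at a small and a large C and reports the margin width 2/||w|| for each.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian blobs, so no hyperplane separates them perfectly.
X = np.vstack([rng.normal(-1.0, 1.0, size=(100, 2)),
               rng.normal(1.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

margins = {}
for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # For a linear kernel, coef_ holds w, so the margin width is 2 / ||w||.
    margins[C] = 2.0 / np.linalg.norm(clf.coef_[0])
    print(f"C={C:g}: margin={margins[C]:.3f}, support vectors={len(clf.support_)}")
```

The small-C fit should report the wider margin, since the optimizer is allowed to trade more slack for a smaller ||w||.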

Hinge loss intuition

Soft-margin SVM is equivalent to minimizing ½||w||² + C ∑ hinge(y_i, w · x_i + b), where hinge(y, s) = max(0, 1 − y·s) for labels y in {−1, +1}. The loss is zero when a point is correctly classified beyond the margin and grows linearly when it is inside the margin or misclassified. Compare this to log-loss in logistic regression, which assigns every point a nonzero penalty. Hinge loss is sparse: points far from the boundary contribute zero gradient, so only support vectors matter at convergence.
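A small pure-Python sketch makes the hinge behavior concrete. It assumes labels in {−1, +1} and uses the conventional ½ factor on ||w||², which only rescales C; the function names and toy points are illustrative.

```python
def hinge(y, score):
    """Hinge loss for a label y in {-1, +1} and a raw score w.x + b."""
    return max(0.0, 1.0 - y * score)

def soft_margin_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * sum of hinge losses over the training points."""
    reg = 0.5 * sum(wj * wj for wj in w)
    scores = [sum(wj * xj for wj, xj in zip(w, x)) + b for x in X]
    return reg + C * sum(hinge(yi, s) for yi, s in zip(y, scores))

# A positive example scored beyond the margin, inside it, and on the wrong side.
losses = [hinge(+1, 2.0), hinge(+1, 0.5), hinge(+1, -1.0)]

X_toy = [[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]]
objective = soft_margin_objective([1.0, 0.0], 0.0, X_toy, [1, 1, 1], C=1.0)
print(losses, objective)
```

Only the second and third points contribute to the loss term; the first could move further from the boundary without changing the objective.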

Support vectors and the decision function

After training, the classifier depends only on support vectors S:

f(x) = sign(∑_{i∈S} α_i y_i K(x_i, x) + b)

For a linear kernel K(x_i, x) = x_i · x, this reduces to a weighted sum of dot products with support vectors — still a linear boundary, but expressed through dual coefficients α_i. Nonzero α_i identify support vectors. Inspecting them can reveal which training examples anchor the boundary (useful for debugging mislabeled outliers).
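The decision function can be rebuilt by hand from a fitted model's support vectors. This sketch assumes scikit-learn and NumPy and uses synthetic data; in scikit-learn's SVC, dual_coef_ stores the products α_i y_i, support_vectors_ stores the x_i, and intercept_ stores b.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(40, 2)),
               rng.normal(1.0, 1.0, size=(40, 2))])
y = np.array([-1] * 40 + [1] * 40)

gamma = 0.5
clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)

def rbf(a, b, gamma):
    """RBF kernel value exp(-gamma * ||a - b||^2) for two vectors."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

x_new = np.array([0.3, -0.2])
# Sum over support vectors only: (alpha_i * y_i) * K(x_i, x_new), plus the bias.
manual = sum(coef * rbf(sv, x_new, gamma)
             for coef, sv in zip(clf.dual_coef_[0], clf.support_vectors_))
manual += clf.intercept_[0]
library = clf.decision_function(x_new.reshape(1, -1))[0]
print(manual, library, len(clf.support_))
```

The indices in clf.support_ point back into the training set, which is a convenient starting point when auditing the rows that anchor the boundary.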

Unlike logistic regression, standard SVMs do not natively output calibrated probabilities. scikit-learn's SVC offers probability=True, which fits Platt scaling using an internal five-fold cross-validation. This slows training, and the resulting probabilities can disagree with predict on borderline points, so treat them cautiously for cost-sensitive thresholds.
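One way to keep the calibration step explicit is to wrap the SVM in scikit-learn's CalibratedClassifierCV with method="sigmoid", which corresponds to Platt scaling. The sketch below assumes scikit-learn and NumPy and uses a synthetic dataset; the hyperparameters are placeholders, not tuned values.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Each CV fold fits an SVM on part of the training data and a sigmoid
# calibrator on the held-out part; predictions average over the folds.
calibrated = CalibratedClassifierCV(
    SVC(kernel="rbf", C=1.0, gamma="scale"), method="sigmoid", cv=5
)
calibrated.fit(X_train, y_train)
proba = calibrated.predict_proba(X_test)
print(proba[:3])
```

Because the sigmoid is fit on data the underlying SVM did not train on, the resulting probabilities are less optimistic than a calibrator fit on the training scores themselves.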

The kernel trick

Real data is rarely linearly separable in the original feature space. Kernels compute inner products in an implicit high-dimensional space without building the transformation explicitly. The decision function depends on K(x_i, x_j) rather than raw coordinates.
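A degree-2 polynomial kernel on two features is small enough to verify by hand. The sketch below uses the homogeneous kernel (x · z)² and its known explicit feature map (x₁², √2·x₁x₂, x₂²); the input points are arbitrary illustrative values.

```python
import math

def poly2_kernel(x, z):
    """Implicit route: square the ordinary dot product in the 2-D input space."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    """Explicit route: map a 2-D point into the 3-D space of degree-2 monomials."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

x, z = (1.0, 2.0), (3.0, 4.0)
implicit = poly2_kernel(x, z)
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))
print(implicit, explicit)
```

Both routes give the same inner product (up to floating-point rounding), but the kernel never materializes the mapped vectors. For the RBF kernel the implicit space is infinite-dimensional, so the kernel route is the only practical one.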

Common kernels

Kernel	Formula (intuition)	Typical use
Linear	K(x, x') = x · x'	High-dimensional sparse text (bag-of-words), already rich features
Polynomial	(γ x · x' + r)^d	Interaction terms up to degree d; less common today
RBF (Gaussian)	exp(−γ ||x − x'||²)	Default nonlinear choice on low-dimensional tabular data
Sigmoid	tanh(γ x · x' + r)	Historically used; often behaves like a small neural net layer

The RBF kernel is the workhorse for nonlinear classification. Parameter γ controls the influence radius of each training point. Large γ creates tight, wiggly boundaries that can overfit; small γ approaches a nearly linear separator. Pair γ tuning with C via grid search or Bayesian optimization on a validation set.
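The influence-radius interpretation of γ follows directly from the kernel formula. This pure-Python sketch evaluates the RBF kernel for one fixed pair of points at three γ values; the points and γ values are illustrative.

```python
import math

def rbf_kernel(x, z, gamma):
    """exp(-gamma * squared Euclidean distance between x and z)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

x, z = (0.0, 0.0), (1.0, 1.0)   # squared distance = 2
similarities = {g: rbf_kernel(x, z, g) for g in (0.01, 1.0, 100.0)}
for g, k in similarities.items():
    print(f"gamma={g:g}: K={k:.3g}")
```

At small γ the two points still look nearly identical to the kernel, so each training point influences a wide region. At large γ the similarity collapses toward zero, and each support vector shapes only its immediate neighborhood.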

When to stay linear: if you have thousands of features and millions of sparse examples (text classification with TF-IDF), a linear SVM often matches or beats RBF while training faster and scaling better.

Hyperparameters and training

Key knobs for practitioners:

C — regularization strength. Tune on log scale (e.g. 0.01, 0.1, 1, 10, 100). High C fits training data tightly; low C smooths.
γ (RBF/poly) — inverse kernel width: larger γ means a narrower influence radius. A common starting point is 1 / (n_features × Var(X)), which is scikit-learn's gamma='scale' default; refine from there on a log scale.
degree d (polynomial) — rarely tuned above 3; higher degrees explode compute and overfit.
class_weight — set to balanced or custom dict when positives are rare (fraud, disease). Complements class imbalance techniques.
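The γ starting point in the list above, 1 / (n_features × Var(X)), matches what scikit-learn documents for gamma='scale', where the variance is taken over all entries of X. A pure-Python sketch of that arithmetic, using a tiny illustrative matrix:

```python
from statistics import pvariance

def gamma_scale(X):
    """1 / (n_features * variance of all entries of X), using population variance."""
    n_features = len(X[0])
    flat = [value for row in X for value in row]
    return 1.0 / (n_features * pvariance(flat))

# Two standardized features: each column has mean 0 and unit variance.
X_std = [[-1.0, -1.0], [1.0, 1.0]]
gamma0 = gamma_scale(X_std)
print(gamma0)
```

On standardized inputs the overall variance is 1, so the heuristic reduces to 1 / n_features. A log-scale grid centered on that value is a reasonable first search range.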

Training complexity scales between O(n²) and O(n³) in the number of training samples for classic kernel solvers, which becomes painful beyond tens of thousands of rows. For larger datasets, consider a linear SVM with a dedicated linear solver (LinearSVC) or a stochastic one (SGDClassifier with hinge loss), or switch to gradient boosting. Use cross-validation for hyperparameter search; never tune on the test set.

Multiclass extensions

Binary SVMs extend to K classes via one-vs-rest (OvR), which trains K classifiers that each separate one class from all others, or one-vs-one (OvO), which trains K(K−1)/2 pairwise classifiers and takes a majority vote. Library defaults differ: in scikit-learn, LinearSVC uses OvR, while SVC always trains OvO internally and by default only reshapes its decision function to an OvR-style output. OvO trains more models, but each one sees only two classes' worth of data, which can help kernel solvers whose cost grows superlinearly with sample count.
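The difference in model count between the two schemes is easy to enumerate. This sketch uses only the standard library; the class names are hypothetical.

```python
from itertools import combinations

def ovr_tasks(classes):
    """One binary problem per class: that class against everything else."""
    return [(c, "rest") for c in classes]

def ovo_tasks(classes):
    """One binary problem per unordered pair of classes."""
    return list(combinations(classes, 2))

classes = ["legit", "fraud", "dispute", "refund"]   # hypothetical labels
n_ovr = len(ovr_tasks(classes))
n_ovo = len(ovo_tasks(classes))
print(n_ovr, n_ovo)
```

With K = 4 the counts are 4 versus 6; at K = 10 they diverge to 10 versus 45, though each OvO problem trains on only the rows belonging to its two classes.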

Worked example: Harbor Analytics card fraud scorer

Harbor Analytics builds a real-time fraud scorer for a regional card issuer. Training set: 48,000 authorized transactions (0.8% fraud) with 22 numeric features — transaction amount, merchant category, velocity counts in 1h/24h windows, device fingerprint mismatch flags, and geodesic distance from home location. Labels come from chargebacks filed within 60 days.

Pipeline steps:

Drop transactions with missing geolocation; impute velocity zeros for new cards.
Fit StandardScaler on training features only — critical because amount ($5 vs $5,000) and distance (miles) share a feature vector with count features.
Stratified 80/20 split preserving fraud rate; 5-fold CV on training for C and γ grid on RBF SVM.
Best params: C = 10, γ = 0.01. Validation ROC-AUC = 0.91; precision at 0.5 threshold = 0.34 (acceptable given 0.8% base rate with human review queue).
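The scaling, split, and search steps above could be wired together roughly as follows. This is a sketch under stated assumptions, not Harbor's actual code: it assumes scikit-learn, substitutes a small synthetic dataset with a 5% positive rate for the real transactions, and adds class_weight="balanced", which the pipeline description does not specify. The grid values are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in: 22 numeric features with a rare positive class.
X, y = make_classification(n_samples=2000, n_features=22, n_informative=8,
                           weights=[0.95, 0.05], random_state=0)

# Stratified 80/20 split preserves the positive rate in both parts.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Putting the scaler inside the pipeline means it is refit on each CV
# training fold, so validation folds never leak into the scaling statistics.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf", class_weight="balanced"))
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": [0.001, 0.01, 0.1]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=cv)
search.fit(X_train, y_train)

val_auc = roc_auc_score(y_val, search.decision_function(X_val))
print(search.best_params_, round(val_auc, 3))
```

ROC-AUC is computed from decision_function scores rather than hard predictions, so no probability calibration is needed at this stage.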

The model identifies 127 support vectors out of 38,400 training points — mostly borderline fraud cases near legitimate high-velocity travel patterns. Analysts review those vectors and find 9 mislabeled training rows (legitimate vacation spend marked fraud). Relabeling and retraining lifts validation AUC to 0.93 without changing hyperparameters.

Harbor compares against gradient-boosted trees (AUC 0.94, slower inference) and logistic regression (AUC 0.88, faster but misses nonlinear velocity interactions). They ship the RBF SVM for the edge deployment path where latency budget is 8 ms and model size must stay under 2 MB — support vectors serialize compactly. Probabilities for ranking alerts use Platt scaling fit on the validation fold.

Model comparison table

Algorithm	Strengths	Weaknesses vs SVM
Linear SVM	Max-margin, sparse support vectors, fast on sparse high-dim text	—
RBF SVM	Flexible nonlinear boundaries on small tabular sets	—
Logistic regression	Calibrated probabilities, fast training at scale	Linear unless you engineer features; no max-margin inductive bias
Gradient-boosted trees	Often best tabular accuracy; handles mixed types	Slower inference; larger model artifacts; may overfit small sets without care
k-nearest neighbors	Zero training time; intuitive	Slow inference; curse of dimensionality; needs scaling
Neural networks	Dominate vision, language, large unstructured data	Need more data and tuning; opaque on small tabular tasks

Practical rule: on <50k rows and <100 numeric features, compare linear SVM, RBF SVM, logistic regression, and gradient boosting with proper CV. On millions of rows, start with linear models or trees; reserve kernel SVM for subsets or feature-rich problems where margin interpretability matters.

Common pitfalls

Skipping feature scaling — RBF and polynomial kernels are not scale-invariant. Distance-based geometry breaks when one feature is in dollars and another in counts.
Tuning on test data — grid search must stay inside training/validation; leaking test labels inflates reported AUC.
Using RBF on huge datasets — O(n²) memory for kernel matrices stalls pipelines; subsample or use linear approximations.
Ignoring class imbalance — by default every training error costs the same, so the boundary is dominated by the majority class; rare fraud cases get neglected without class weights or resampling.
Trusting default C and γ — library defaults are rarely optimal; always cross-validate.
Expecting native probabilities — decision scores are not log-odds; calibrate before treating scores as risk probabilities in regulated contexts.
Outliers in hard-margin mode — never use hard margin on noisy real-world labels.
High-dimensional sparse text without linear kernel — RBF on 50k-dimensional bag-of-words is computationally punishing with little benefit.
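The scaling pitfall at the top of this list can be quantified with a toy example. The sketch below uses only the standard library and made-up values for a dollar amount and a velocity count. It measures how much of the squared distance between two transactions comes from the dollar feature, before and after standardization.

```python
from statistics import mean, pstdev

amounts = [5.0, 5000.0, 120.0, 60.0]   # dollars (illustrative)
counts = [1.0, 9.0, 2.0, 3.0]          # velocity counts (illustrative)

def standardize(col):
    """Zero-mean, unit-variance version of one feature column."""
    mu, sigma = mean(col), pstdev(col)
    return [(v - mu) / sigma for v in col]

def sq_dist_parts(i, j, cols):
    """Per-feature contributions to the squared distance between rows i and j."""
    return [(col[i] - col[j]) ** 2 for col in cols]

raw = sq_dist_parts(0, 1, [amounts, counts])
scaled = sq_dist_parts(0, 1, [standardize(amounts), standardize(counts)])
raw_share = raw[0] / sum(raw)          # fraction of distance due to dollars
scaled_share = scaled[0] / sum(scaled)
print(f"dollar share of distance: raw={raw_share:.6f}, scaled={scaled_share:.3f}")
```

Before scaling, the count feature is effectively invisible to an RBF kernel; after scaling, both features contribute on comparable terms. In a real pipeline the mean and standard deviation would come from the training split only.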

Production checklist

Audit labels for mislabeling near the decision boundary (support vectors).
Scale or normalize features; persist the fitted scaler with the model artifact.
Stratified train/validation/test split; use CV on training for C, γ, kernel choice.
Set class_weight or sample weights when positives are rare.
Evaluate with ROC-AUC, precision-recall, and cost-weighted metrics — not accuracy alone.
Calibrate probabilities if scores drive business thresholds (Platt or isotonic on validation).
Serialize support vectors, dual coefficients, kernel params, and scaler via joblib (for scikit-learn pipelines) or ONNX conversion.
Benchmark inference latency on production hardware; the number of support vectors directly affects prediction speed.
Monitor feature drift; retrain when velocity distributions shift (new merchant categories).
Document kernel and hyperparameter choices for compliance audits.

Key takeaways

SVMs maximize margin — the boundary is determined by support vectors, not every training point.
Soft margin + C balances fit vs generalization; hinge loss zeroes out easy examples.
Kernels enable nonlinear boundaries without explicit feature expansion; RBF is the default nonlinear choice.
Scale features before RBF/poly kernels; linear SVM often wins on sparse high-dimensional text.
Compare against logistic regression and boosting on tabular tasks — SVMs are strong but not automatic winners.