
Adversarial attacks in machine learning explained

Harbor Analytics shipped a gradient-boosted fraud classifier in Q1 with 97.2% ROC-AUC on a held-out transaction set and sub-50 ms inference latency. Compliance signed off. Six weeks later, an external red team ran an evasion attack: they perturbed 200 features that the model treated as continuous (normalized amounts, velocity z-scores, merchant embedding distances) within bounds that still passed business-rule sanity checks. On 89% of transactions the model had previously flagged as high-risk, the perturbed inputs scored below the block threshold, effectively laundering fraud through mathematically crafted noise humans could not spot in a spreadsheet. The model had excellent average-case accuracy and near-zero adversarial robustness.

Adversarial examples are inputs crafted to fool a model while staying close to legitimate data: often imperceptible in images, subtle in tabular features, or syntactically valid in text. They are not academic curiosities; they threaten spam filters, malware classifiers, content moderation, biometric systems, and autonomous perception.

This guide defines adversarial threat models, walks through classic attacks (FGSM, PGD, C&W), explains why high-capacity models remain vulnerable, surveys defenses from adversarial training to certified bounds, works a Harbor Analytics hardening example, provides a defense decision table, lists pitfalls, and ends with a production checklist. For LLM-specific jailbreaks and prompt abuse, see the LLM red teaming guide; this page focuses on classical ML and deep learning evasion.

What adversarial examples are

Given a trained model f and a legitimate input x with true label y, an adversarial example x' satisfies two conditions:

Small perturbation: x' is close to x under a chosen distance metric (L_∞, L₂, or domain-specific constraints).
Misclassification: f(x') ≠ y (untargeted attack) or f(x') = y_target (targeted attack).

In computer vision, perturbations are often bounded so that each pixel changes by at most ε = 8/255 — invisible to humans but enough to flip a ResNet label from “stop sign” to “speed limit 45.” In tabular fraud detection, perturbations might nudge three normalized features by 0.02 standard deviations while keeping total amount and merchant ID unchanged. The attack is an evasion problem: the adversary does not corrupt training data (that is poisoning); they craft inputs at inference time.
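The two conditions can be checked mechanically. The sketch below assumes an L_∞ budget and a hypothetical toy classifier (`toy_predict`); neither the function names nor the numbers come from a real system.

```python
import numpy as np

def is_adversarial(predict, x, x_adv, y_true, eps, y_target=None):
    """Check both conditions: small L_inf perturbation and misclassification."""
    x = np.asarray(x, dtype=float)
    x_adv = np.asarray(x_adv, dtype=float)
    # Condition 1: no coordinate moved by more than eps (tiny float tolerance).
    within_budget = np.max(np.abs(x_adv - x)) <= eps + 1e-12
    pred = predict(x_adv)
    # Condition 2: untargeted (any wrong label) or targeted (one specific label).
    fooled = (pred != y_true) if y_target is None else (pred == y_target)
    return bool(within_budget and fooled)

# Hypothetical toy classifier: class 1 if the feature sum is positive, else 0.
def toy_predict(x):
    return int(np.sum(x) > 0)

x = np.array([0.03, 0.01])   # legitimate input with true label 1
x_adv = x - 0.03             # every feature shifted by 0.03
print(is_adversarial(toy_predict, x, x_adv, y_true=1, eps=0.03))
```

The same perturbation stops counting as adversarial once the budget is tightened to `eps=0.01`, which is why the distance metric and ε belong in the threat model rather than in the attack code.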

Threat model axes

Before choosing defenses, document who the attacker is and what they can access:

White-box — full model weights and gradients. Strongest attacker; PGD and C&W assume this.
Black-box — only query outputs (labels or scores). Attackers use transfer from surrogate models or finite-difference gradient estimates.
Gray-box — architecture known but weights secret, or API rate limits apply.
Knowledge of training data — does the attacker know the distribution? Transfer attacks often succeed with public ImageNet surrogates.
Constraint set — L_∞ box, L₂ ball, or semantic constraints (e.g. valid JSON, plausible transaction amounts).

Production threat models are usually weaker than research white-box settings — but defenses that only resist black-box queries may still fail against insiders or leaked checkpoints.
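One way to keep these axes explicit is to record them as a small, versioned data structure next to the model. The following is only an illustrative sketch: the class and field names are hypothetical, and the example values are borrowed from the Harbor scenario described later in this guide.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple

class Access(Enum):
    WHITE_BOX = "white-box"
    GRAY_BOX = "gray-box"
    BLACK_BOX = "black-box"

@dataclass(frozen=True)
class ThreatModel:
    access: Access
    knows_training_distribution: bool
    norm: str                                  # "linf", "l2", or "semantic"
    epsilon: float                             # perturbation budget in the chosen norm
    queries_per_minute: Optional[int] = None   # None means no rate limit is assumed
    immutable_features: Tuple[str, ...] = ()

    def summary(self) -> str:
        limit = "unlimited" if self.queries_per_minute is None else f"{self.queries_per_minute}/min"
        return (f"{self.access.value} attacker, {self.norm} budget {self.epsilon}, "
                f"queries {limit}, immutable: {', '.join(self.immutable_features) or 'none'}")

# Hypothetical example values.
fraud_api = ThreatModel(
    access=Access.BLACK_BOX,
    knows_training_distribution=False,
    norm="linf",
    epsilon=0.05,
    queries_per_minute=50,
    immutable_features=("merchant_id", "amount_cents"),
)
print(fraud_api.summary())
```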

Classic attack methods

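Attacks exploit the fact that neural networks expose useful gradients: small steps in input space can increase loss faster than humans expect. Tree ensembles are piecewise constant and have no useful input gradients of their own, but they can still be attacked through differentiable surrogate models or score-based search over smooth features.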

FGSM — Fast Gradient Sign Method

One-step attack (Goodfellow et al., 2015). For loss L and perturbation budget ε under L_∞:

x' = x + ε · sign(∇_x L(θ, x, y))

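Fast and cheap, so it is good for sanity checks and as an inner loop in robust training. As a single-step attack it is weak: a defense that masks or flattens gradients can defeat FGSM without being robust, so passing FGSM alone proves little.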
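As a minimal sketch of the formula, the code below applies FGSM to a logistic-regression model, where the input gradient of the cross-entropy loss has the closed form (p − y)·w. The model, weights, and input are hypothetical stand-ins; a deep network would obtain the same gradient through automatic differentiation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_grad_wrt_input(w, b, x, y):
    """Gradient of binary cross-entropy w.r.t. x for p = sigmoid(w.x + b)."""
    p = sigmoid(x @ w + b)
    return (p - y) * w

def fgsm(w, b, x, y, eps):
    """One signed-gradient step of size eps under an L_inf budget."""
    return x + eps * np.sign(loss_grad_wrt_input(w, b, x, y))

# Hypothetical weights and input; true label y = 1.
w = np.array([2.0, -1.0, 0.5])
b = 0.0
x = np.array([0.2, 0.1, 0.4])
x_adv = fgsm(w, b, x, y=1.0, eps=0.1)

print("clean p(y=1):", sigmoid(x @ w + b))
print("adv   p(y=1):", sigmoid(x_adv @ w + b))
```

Each feature moves by exactly ε in the direction that raises the loss, so the predicted probability of the true class falls even though no coordinate changes by more than 0.1.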

PGD — Projected Gradient Descent

Multi-step FGSM with projection back into the ε-ball each iteration. PGD is the workhorse strong attack in robustness research: if a model survives PGD with 40 steps and random restarts, it is meaningfully harder to fool than one that only resists FGSM. Madry et al. (2018) framed adversarial training as solving a min-max game where the inner maximization is PGD.
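The loop structure (random start, signed-gradient step, projection, restarts) can be sketched as follows, again on a hypothetical logistic-regression model so the gradient is available in closed form. Step size, step count, and restart count are illustrative choices, not recommended settings.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(w, b, x, y):
    p = sigmoid(x @ w + b)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def bce_grad_x(w, b, x, y):
    return (sigmoid(x @ w + b) - y) * w

def pgd_linf(w, b, x, y, eps, step, n_steps, n_restarts=3, seed=0):
    """Multi-step signed-gradient ascent, projected into the L_inf eps-ball."""
    rng = np.random.default_rng(seed)
    best_x, best_loss = x.copy(), bce_loss(w, b, x, y)
    for _restart in range(n_restarts):
        # Random start inside the eps-ball around the clean input.
        x_adv = x + rng.uniform(-eps, eps, size=x.shape)
        for _step in range(n_steps):
            x_adv = x_adv + step * np.sign(bce_grad_x(w, b, x_adv, y))
            x_adv = np.clip(x_adv, x - eps, x + eps)   # projection step
        loss = bce_loss(w, b, x_adv, y)
        if loss > best_loss:                           # keep the strongest restart
            best_x, best_loss = x_adv, loss
    return best_x

w = np.array([2.0, -1.0, 0.5])
b = 0.0
x = np.array([0.2, 0.1, 0.4])
x_adv = pgd_linf(w, b, x, y=1.0, eps=0.1, step=0.025, n_steps=10)
print(x_adv, bce_loss(w, b, x_adv, 1.0))
```

On a linear model PGD ends at the same corner of the ε-ball that FGSM reaches in one step; the extra iterations and restarts matter for non-linear models, where the loss surface changes along the path.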

C&W and optimization-based attacks

Carlini & Wagner attacks minimize perturbation norm subject to misclassification, often achieving smaller L₂ distortions than PGD. They are slower but expose models that appear robust under weak attacks. Use C&W in security audits when PGD success rates look suspiciously low.
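The idea of trading perturbation size against a misclassification penalty can be illustrated on a linear binary classifier. This is a heavily simplified sketch in the spirit of C&W, not the published attack: it uses a fixed trade-off constant `c` instead of a binary search, plain gradient descent instead of Adam, and no change of variables for box constraints. All names and values are hypothetical.

```python
import numpy as np

def cw_style_l2_linear(w, b, x, y, c=10.0, kappa=0.1, lr=0.01, n_steps=500):
    """Minimize ||delta||^2 + c * max(margin + kappa, 0) for labels y in {-1, +1}.

    margin = y * (w.(x + delta) + b); a negative margin means misclassified.
    Returns the smallest misclassifying delta seen, or None if none was found.
    """
    delta = np.zeros_like(x, dtype=float)
    best = None
    for _ in range(n_steps):
        margin = y * ((x + delta) @ w + b)
        grad = 2.0 * delta                  # gradient of the norm term
        if margin + kappa > 0:              # hinge penalty still active
            grad = grad + c * y * w
        delta = delta - lr * grad
        if y * ((x + delta) @ w + b) < 0:   # misclassified after this step
            if best is None or np.linalg.norm(delta) < np.linalg.norm(best):
                best = delta.copy()
    return best

w = np.array([1.0, -2.0])
b = 0.5
x = np.array([2.0, 0.5])      # score 1.5, so predicted class +1
delta = cw_style_l2_linear(w, b, x, y=1)
print("L2 norm of perturbation:", np.linalg.norm(delta))
print("distance to boundary:   ", abs(x @ w + b) / np.linalg.norm(w))
```

For a linear model the smallest possible perturbation is the distance to the decision hyperplane, which gives a reference point for judging how close the optimizer gets.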

Universal perturbations and transferability

A single perturbation vector can fool a model on many inputs. Transfer attacks craft examples on a surrogate (e.g. public ResNet) that fool a private production model — critical for black-box APIs. Do not assume secrecy of weights alone is a defense.
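Transfer is one black-box route; the other is estimating gradients from scores alone. The sketch below uses central finite differences against a hypothetical scoring oracle (`CountingOracle`, a stand-in for a remote API) and counts queries to show why rate limits raise attacker cost without removing the threat. The 200-feature input and 50 queries/minute figure mirror the Harbor scenario; everything else is assumed.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class CountingOracle:
    """Hypothetical score-only API: returns a probability, hides the weights."""
    def __init__(self, w, b):
        self._w, self._b, self.queries = w, b, 0

    def __call__(self, x):
        self.queries += 1
        return float(sigmoid(x @ self._w + self._b))

def estimate_gradient(score_fn, x, h=1e-3):
    """Central finite differences: two queries per input dimension."""
    grad = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        e = np.zeros_like(x, dtype=float)
        e[i] = h
        grad[i] = (score_fn(x + e) - score_fn(x - e)) / (2 * h)
    return grad

w_secret = np.linspace(-1.0, 1.0, 200)   # the attacker never sees this
oracle = CountingOracle(w_secret, b=0.0)
x = np.full(200, 0.1)

g_est = estimate_gradient(oracle, x)
minutes_at_rate_limit = oracle.queries / 50   # assumed limit of 50 queries/minute
print(oracle.queries, "queries,", minutes_at_rate_limit, "minutes per gradient")
```

Under these assumptions one gradient estimate costs 400 queries, so a rate limit slows an iterative attack considerably but does not prevent it.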

Why models are vulnerable

Several mechanisms explain why 97% test accuracy does not imply robustness:

High-dimensional linearity. Locally, ReLU networks behave almost linearly, so many small gradient-aligned changes across input dimensions add up to an outsized logit change (the linearity explanation of Goodfellow et al.).
Non-robust features. Models learn predictive but brittle shortcuts — textures instead of shape, spurious correlations in tabular data. These flip under tiny shifts.
Overconfidence. Poorly calibrated models assign 99% probability to wrong classes on perturbed inputs, so threshold-based pipelines break silently.
Distribution shift blind spots. Training data rarely includes adversarially optimized points; the model has no incentive to be smooth near decision boundaries.
Shared vulnerable directions. Even diverse models in an ensemble often share vulnerable directions that attackers can find via transfer.

Standard data augmentation (random crops, noise) helps generalization but is not a substitute for attacks optimized against the specific model.
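The high-dimensional linearity argument comes down to simple arithmetic: for a linear score w·x, the perturbation ε·sign(w) shifts the score by ε·‖w‖₁, which grows with the number of features even when ε stays fixed. The sketch below assumes hypothetical weights of unit magnitude purely to make the scaling visible.

```python
import numpy as np

def worst_case_logit_shift(w, eps):
    """Largest change in w.x achievable with an L_inf perturbation of size eps."""
    return eps * np.sum(np.abs(w))

rng = np.random.default_rng(0)
eps = 0.01
for d in (10, 1_000, 100_000):
    w = rng.choice([-1.0, 1.0], size=d)      # hypothetical unit-magnitude weights
    delta = eps * np.sign(w)                 # per-feature change of only 0.01
    print(d, worst_case_logit_shift(w, eps), float(w @ delta))
```

A per-feature change of 0.01 is negligible for ten features and decisive for a hundred thousand, which is why high-dimensional inputs such as images are so easy to attack.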

Defense strategies

Adversarial training

The most effective practical defense: augment each training batch with PGD adversarial examples and minimize loss on both clean and perturbed inputs. The trade-offs are real: clean accuracy often drops 2–5 points on ImageNet, training time increases 3–10×, and robustness to unseen attacks is partial, not guaranteed. For tabular fraud models, project perturbations back onto feasible ranges after every PGD step (amounts cannot go negative; categorical fields are immutable) so that each adversarial example remains a valid input.
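A minimal sketch of this training loop, assuming a logistic-regression model trained with full-batch gradient descent on synthetic data: each epoch generates PGD examples against the current weights, projects them into both the ε-ball and a feasible range, and trains on a 50/50 clean/adversarial mix. The function names, hyperparameters, and data are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pgd_batch(w, b, X, y, eps, step, n_steps, lo, hi):
    """PGD on every row of X, projected onto the eps-ball and onto [lo, hi]."""
    X_adv = X.copy()
    for _ in range(n_steps):
        grad = (sigmoid(X_adv @ w + b) - y)[:, None] * w   # d(loss)/dX, row by row
        X_adv = X_adv + step * np.sign(grad)
        X_adv = np.clip(X_adv, X - eps, X + eps)           # stay within the budget
        X_adv = np.clip(X_adv, lo, hi)                     # stay within feasible values
    return X_adv

def adversarial_train(X, y, eps, lo, hi, epochs=50, lr=0.1):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        X_adv = pgd_batch(w, b, X, y, eps, step=eps / 4, n_steps=8, lo=lo, hi=hi)
        X_mix = np.vstack([X, X_adv])                      # clean + adversarial rows
        y_mix = np.concatenate([y, y])
        err = sigmoid(X_mix @ w + b) - y_mix
        w -= lr * X_mix.T @ err / len(y_mix)
        b -= lr * err.mean()
    return w, b

# Synthetic data: features in [0, 1], label depends on the first feature.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 3))
y = (X[:, 0] > 0.5).astype(float)

w, b = adversarial_train(X, y, eps=0.05, lo=0.0, hi=1.0)
X_adv = pgd_batch(w, b, X, y, eps=0.05, step=0.0125, n_steps=8, lo=0.0, hi=1.0)
```

Generating the adversarial batch inside the epoch loop is what makes this a min-max procedure: the attack always targets the current weights rather than a fixed snapshot.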

Input preprocessing and randomization

JPEG compression, bit-depth reduction, random resizing, and feature squeezing can disrupt L_∞ perturbations. These defenses are cheap, but they carry a gradient masking risk: a defense can break gradient computation without improving true robustness. Always evaluate with an adaptive white-box attack that differentiates through or approximates the preprocessing step (for randomized transforms, an expectation-over-transformation attack).
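Bit-depth reduction makes the trade-off concrete. In the sketch below (the function name and values are illustrative), a small perturbation is erased by quantization while a slightly larger one lands in the next bin and survives. Because the quantizer is flat almost everywhere, a naive gradient attack through it sees zero gradient, which is exactly the masking effect that an adaptive evaluation has to work around.

```python
import numpy as np

def reduce_bit_depth(x, bits):
    """Quantize values in [0, 1] to 2**bits evenly spaced levels."""
    levels = 2 ** bits - 1
    return np.round(np.asarray(x, dtype=float) * levels) / levels

clean = np.array([0.40])
small = clean + 0.01     # perturbation below half a quantization step
large = clean + 0.05     # perturbation that crosses into the next bin

print(reduce_bit_depth(clean, 4), reduce_bit_depth(small, 4), reduce_bit_depth(large, 4))
```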

Detection and fallback

Train a secondary detector on input statistics or activation patterns of adversarial points; route suspicious inputs to human review or conservative rules. Pair with anomaly detection on embedding space distance. Detection alone is insufficient for safety-critical paths but reduces automated exploit success rates.

Certified robustness

Randomized smoothing and interval bound propagation (IBP) provide provable guarantees of the form “no L₂ perturbation with norm ≤ r can change the predicted class.” Guarantees are conservative and often require specialized architectures or heavy compute. They are worth pursuing for regulated perception (medical imaging subsets, lane detection) when audits demand proofs, not just empirical PGD numbers.
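A rough sketch of how randomized smoothing turns noisy votes into an L₂ radius: classify many Gaussian-noised copies of the input, lower-bound the top-class probability, and report σ·Φ⁻¹(p_lower) as the radius. Several simplifications are assumed here and should not be read as the reference procedure: the bound is a loose Hoeffding bound rather than an exact binomial interval, the same samples are used to pick the class and to certify it, and the base classifier is a hypothetical toy.

```python
import math
from statistics import NormalDist
import numpy as np

def smoothed_predict(predict, x, sigma, n=1000, alpha=0.001, seed=0):
    """Return (label, certified L2 radius), or (None, 0.0) to abstain."""
    rng = np.random.default_rng(seed)
    noisy = x + rng.normal(0.0, sigma, size=(n,) + x.shape)
    votes = np.array([predict(z) for z in noisy])
    labels, counts = np.unique(votes, return_counts=True)
    top = labels[np.argmax(counts)]
    p_hat = counts.max() / n
    # One-sided Hoeffding lower bound on the top-class probability.
    p_lower = p_hat - math.sqrt(math.log(1.0 / alpha) / (2.0 * n))
    if p_lower <= 0.5:
        return None, 0.0                      # not confident enough to certify
    radius = sigma * NormalDist().inv_cdf(p_lower)
    return int(top), radius

# Hypothetical base classifier: class 1 when the first feature is positive.
def toy_predict(z):
    return int(z[0] > 0)

x = np.array([3.0, 0.0])
label, radius = smoothed_predict(toy_predict, x, sigma=0.25)
print(label, radius)
```

The input sits 3.0 away from the toy decision boundary, yet the certified radius is far smaller, which illustrates how conservative these guarantees are.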

System-level controls

Rate limits, ensemble disagreement triggers, business-rule gates, and logging of near-boundary scores raise attacker cost. Security is a stack: a robust model behind weak API auth still loses.

Harbor Analytics fraud classifier hardening (worked example)

After the red-team audit, Harbor rebuilt the fraud pipeline in four passes:

Threat model. Assume merchants can probe the real-time scoring API (black-box, 50 queries/minute) and know which features are numeric vs categorical. Perturbation budget: ±0.05σ on z-scored velocity features; amounts fixed to original cents; merchant ID immutable.
Attack benchmark. FGSM and 20-step PGD on a 10k holdout set. Baseline model: 34% attack success rate (ASR) at evading the block threshold. C&W-style L₂ on continuous features: 41% ASR.
Adversarial training. Each epoch, generate PGD examples on numeric features with projection onto feasible ranges; 50/50 clean/adv batch mix. Post-hardening: clean AUC 95.8% (−1.4), PGD ASR 11%, C&W ASR 14%.
Monitoring. Log score deltas when clients retry within 30 seconds with near-identical metadata; flag >15% score drops for analyst review. Added ensemble disagreement rule with a linear baseline model.
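The ASR figures in a benchmark like this are assumed here to be computed over the transactions the model blocks on clean inputs: the share whose perturbed score falls below the block threshold. The scores and threshold below are made-up illustration values, not Harbor data.

```python
import numpy as np

def attack_success_rate(clean_scores, adv_scores, threshold):
    """Fraction of originally flagged inputs whose adversarial score evades the threshold."""
    clean_scores = np.asarray(clean_scores, dtype=float)
    adv_scores = np.asarray(adv_scores, dtype=float)
    flagged = clean_scores >= threshold       # blocked before the attack
    if not flagged.any():
        return float("nan")                   # nothing to evade
    evaded = adv_scores[flagged] < threshold  # slipped under after the attack
    return float(evaded.mean())

clean = [0.90, 0.80, 0.70, 0.30, 0.95]
adv = [0.40, 0.75, 0.20, 0.10, 0.90]
asr = attack_success_rate(clean, adv, threshold=0.5)
print(asr)
```

Restricting the denominator to flagged transactions keeps inputs that were never blocked from diluting the number.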

Harbor did not claim “secure” — they documented residual ASR, re-audit quarterly, and blocked raw API access for high-risk integrators in favor of server-side feature computation where perturbation surface shrinks.
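The retry-monitoring rule can be sketched as a pass over scored events. The assumptions are labeled in the code: the 15% threshold is read as a relative drop, "near-identical metadata" is represented by a hypothetical precomputed `fingerprint` field, and all names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class ScoreEvent:
    client_id: str
    fingerprint: str   # hypothetical hash of near-identical transaction metadata
    timestamp: float   # seconds
    score: float

def flag_score_drops(events, window_s=30.0, max_rel_drop=0.15):
    """Flag retries within window_s whose score fell by more than max_rel_drop (relative)."""
    last_seen = {}
    flagged = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        key = (ev.client_id, ev.fingerprint)
        prev = last_seen.get(key)
        if prev is not None and ev.timestamp - prev.timestamp <= window_s and prev.score > 0:
            rel_drop = (prev.score - ev.score) / prev.score
            if rel_drop > max_rel_drop:
                flagged.append((prev, ev, rel_drop))
        last_seen[key] = ev
    return flagged

events = [
    ScoreEvent("A", "tx1", 0.0, 0.90),
    ScoreEvent("A", "tx1", 10.0, 0.70),    # retry after 10 s, about a 22% drop
    ScoreEvent("A", "tx1", 100.0, 0.30),   # large drop, but outside the 30 s window
    ScoreEvent("B", "tx2", 0.0, 0.80),
    ScoreEvent("B", "tx2", 5.0, 0.75),     # about a 6% drop
]
for prev, ev, drop in flag_score_drops(events):
    print(ev.client_id, round(drop, 3))
```

Comparing each event with the most recent one for the same client and fingerprint keeps the check cheap; a production version would also need to bound the memory held in `last_seen`.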

Defense decision table

Scenario	Recommended approach	Avoid
Image classifier in consumer app	Adversarial training (PGD-7/10); input resize + ensemble; monitor ASR monthly	Obfuscated gradients without white-box retest
Tabular fraud / credit scoring	Domain-constrained PGD training; business-rule gates; server-side features	Unbounded FGSM on categorical columns
Black-box API with public surrogate	Assume transfer attacks; adversarial training + query rate limits + logging	Security through model secrecy alone
Safety-critical perception (medical, AV subset)	Certified smoothing or IBP where feasible; independent red team; PGD-50 eval	FGSM-only robustness claims
Low-stakes recommendation	Light augmentation; periodic PGD spot checks; no full adv training unless abuse observed	Over-investing before measuring ASR
LLM content moderation	See LLM red teaming; paraphrase attacks, token homoglyphs, prompt injection	Image PGD tooling on text pipelines

Common pitfalls

Gradient masking. Defenses that randomize or quantize inputs without re-evaluating with differentiable attacks give false confidence.
Weak attack evaluation. FGSM-only benchmarks overstate robustness; use PGD with multiple restarts and step counts.
Ignoring constraint sets. Unconstrained attacks on tabular data produce impossible transactions that will never appear in production, which inflates ASR while leaving the feasible attack surface (the perturbations a real adversary can actually submit) untested.
Clean accuracy myopia. Adversarial training trades clean metrics for robustness; stakeholders must accept the trade or fund ensemble fallbacks.
No monitoring. Attackers adapt; ASR must be tracked in production with canary probes, not measured once at launch.
Confusing poisoning and evasion. Different threat models, different mitigations (data provenance vs input validation).
Transfer blind spot. Models that resist white-box PGD on their own weights may still fail on surrogate-crafted examples.
Overconfidence after one hardening pass. Adaptive attackers iterate; robustness is a process, not a checkbox.

Production checklist

Write an explicit adversarial threat model (access, goals, perturbation budget).
Benchmark FGSM, PGD (multi-step, random restarts), and at least one optimization attack (C&W or AutoAttack subset).
Report attack success rate (ASR) alongside clean accuracy / AUC.
Apply domain constraints to perturbations for structured and tabular data.
If deploying adversarial training, budget extra GPU time and measure how far clean metrics regress.
Test whether preprocessing defenses survive white-box attacks through the pipeline.
Add monitoring for score instability, retry patterns, and ensemble disagreement.
Schedule quarterly red-team re-audits or automated robustness regression in CI.
Document residual risk; do not market models as “adversarial-proof.”
Separate LLM prompt attacks from classical evasion in runbooks and tooling.

Key takeaways

Adversarial examples are legitimate-looking inputs crafted to fool models — a distinct threat from poisoning or ordinary distribution shift.
FGSM is a fast probe; PGD is the standard strong attack for evaluating whether defenses are real or masked.
High test accuracy does not imply robustness — models exploit brittle features and overconfident boundaries.
Adversarial training is the most practical deep defense but costs clean accuracy and compute; combine with system controls.
Measure ASR continuously and align perturbation budgets with what real attackers can actually submit.