Anomaly detection explained

A payment processor flags a $47,000 wire from a dormant account. A factory sensor reports vibration 3x above its Tuesday-morning baseline. API latency spikes for six minutes while error rates stay flat, then both climb together. Each case is an anomaly: an observation or pattern that deviates enough from expected behavior to warrant attention. Anomaly detection is the discipline of finding those deviations automatically (in batch datasets, streaming metrics, and security logs) before they become outages, fraud losses, or safety incidents. Unlike in standard supervised classification, anomalies are often rare, labels are scarce, and the cost of false alarms competes with the cost of missed detections. This guide covers anomaly types, statistical baselines, modern ML methods including Isolation Forest and autoencoders, time-series-specific approaches, evaluation on imbalanced data, and how to deploy detectors without drowning on-call engineers in noise.

What counts as an anomaly

An anomaly (or outlier) is a data point, event, or sequence that is inconsistent with the majority of observed behavior. The definition is contextual: a 200 ms API response is normal at 2 p.m. and alarming at 2 a.m. when traffic is minimal. Researchers usually distinguish three shapes:

Point anomalies — a single value far from the norm (a transaction amount 50 standard deviations above a user's history).
Contextual anomalies — normal in one context, abnormal in another (high CPU at peak shopping hour vs the same CPU on a Sunday night).
Collective anomalies — individual points look fine, but the sequence is wrong (a slow credential-stuffing attack where each login attempt is plausible alone).

Anomaly detection overlaps with but differs from clustering: clustering finds natural groups; anomaly detection finds the points that do not belong to any dense group, or that violate a learned "normal" manifold. It also differs from forecasting: a forecaster predicts the next value; an anomaly detector asks whether the observed value is surprising relative to a model of normality.

Where anomaly detection is used

The pattern appears wherever rare bad events hide inside high-volume normal traffic:

Fraud and abuse — card-not-present fraud, account takeover, bot traffic, insider trading patterns.
IT operations and SRE — latency spikes, error-rate jumps, disk-fill curves, memory leaks visible only as slow drift.
Manufacturing and IoT — bearing vibration, temperature excursions, predictive maintenance before equipment failure.
Security — impossible travel logins, unusual DNS queries, lateral movement in network flows.
Finance and markets — fat-finger trades, wash-trading rings, sudden liquidity gaps.
Healthcare — vital-sign alerts, billing-code outliers, equipment calibration drift.

In each domain the detector feeds an alerting or review queue. Humans or downstream automation decide what to block, page, or ignore. The detector's job is to rank suspicion — not to be right every time.

Statistical baselines — start here

Before training neural networks, try simple statistics: they often catch a large share of the obvious problems and take minutes to implement. They also make excellent fallback baselines when ML models drift or lack training data.

Z-score and modified z-score

For approximately normal features, flag points where |x - mean| / std exceeds a threshold (commonly 3). The modified z-score uses the median and median absolute deviation (MAD), which resists contamination by existing outliers — important when you are estimating "normal" from data that already contains anomalies.
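
The difference between the two scores is easiest to see on contaminated data. The sketch below is illustrative only: the function names and the latency values are made up for this example, and the 3.5 cutoff with the 0.6745 scaling constant is a commonly cited convention for the modified z-score rather than something this guide prescribes.

```python
import numpy as np

def zscore_flags(x, threshold=3.0):
    """Flag points whose classic z-score magnitude exceeds the threshold."""
    x = np.asarray(x, dtype=float)
    std = x.std()
    if std == 0:
        return np.zeros(x.shape, dtype=bool)
    return np.abs(x - x.mean()) / std > threshold

def modified_zscore_flags(x, threshold=3.5):
    """Flag points using the median and MAD instead of the mean and std."""
    x = np.asarray(x, dtype=float)
    median = np.median(x)
    mad = np.median(np.abs(x - median))
    if mad == 0:
        return np.zeros(x.shape, dtype=bool)
    # 0.6745 rescales MAD so the score is comparable to a z-score under normality.
    return 0.6745 * np.abs(x - median) / mad > threshold

# Hypothetical latencies (ms): seven normal readings and two extreme ones.
latencies_ms = [102, 98, 101, 97, 103, 99, 100, 5000, 4800]

plain = zscore_flags(latencies_ms)
robust = modified_zscore_flags(latencies_ms)
```

On this sample the two extreme readings inflate the mean and standard deviation so much that the classic z-score flags nothing, while the median-based score flags both of them.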

Interquartile range (IQR)

Compute Q1, Q3, and IQR = Q3 - Q1. Flag values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR. Non-parametric, works on skewed distributions, and is the box-plot rule many analysts already know.
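
A minimal sketch of the IQR rule, assuming NumPy's default linear-interpolation percentiles; the function name and the sample order values are hypothetical.

```python
import numpy as np

def iqr_flags(x, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

# Hypothetical order values; only the last one is far from the rest.
order_values = [12, 15, 14, 13, 16, 15, 14, 90]
flags = iqr_flags(order_values)
```

Raising `k` (3.0 is a common choice for "extreme" outliers) trades recall for fewer alerts.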

Control charts and SPC

Statistical Process Control tracks a metric over time with upper and lower control limits derived from historical variance. Western Electric rules add pattern detection (e.g. a long run of consecutive points on the same side of the center line; eight or nine, depending on the rule set). SPC is underused in software metrics but powerful for stable batch pipelines.
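
As a rough illustration, control limits and a single run rule fit in a few lines. This is a sketch under assumptions: the helper names, the 3-sigma limits, the default run length of nine, and the sample values are choices made for the example, not a complete rule set.

```python
import numpy as np

def control_limits(baseline, n_sigma=3.0):
    """Return (lower, center, upper) limits estimated from a historical baseline."""
    baseline = np.asarray(baseline, dtype=float)
    center = baseline.mean()
    sigma = baseline.std(ddof=1)
    return center - n_sigma * sigma, center, center + n_sigma * sigma

def run_rule_violations(values, center, run_length=9):
    """Return indices that complete a run of `run_length` points on one side of center."""
    hits, run, last_side = [], 0, 0
    for i, v in enumerate(values):
        side = 1 if v > center else -1 if v < center else 0
        if side != 0 and side == last_side:
            run += 1
        else:
            run = 1 if side != 0 else 0
        last_side = side
        if run >= run_length:
            hits.append(i)
    return hits

# Hypothetical batch durations (minutes): a stable baseline, then a sustained shift.
baseline = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10]
new_points = [10.5] * 9 + [14]

lower, center, upper = control_limits(baseline)
shifted = run_rule_violations(new_points, center)
beyond_limits = [i for i, v in enumerate(new_points) if v < lower or v > upper]
```

The run rule catches the small sustained shift that never crosses the 3-sigma limits, while the limit check catches the final large jump.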

Statistical methods assume stationarity or manually defined seasons. They break when "normal" shifts — Black Friday traffic is not an anomaly — unless you condition on context (day-of-week, campaign flag) or use time-series-specific methods below.

Machine learning approaches

When features are high-dimensional, correlated, or non-Gaussian, ML detectors learn a boundary around "normal" from mostly clean historical data.

Isolation Forest

Isolation Forest randomly splits feature space with axis-aligned cuts. Anomalies are easier to isolate — they tend to be separated in fewer splits than dense normal points. The algorithm scores each point by average path length across an ensemble of random trees. It scales to large tabular datasets, needs no labels, and handles moderate dimensionality well. It is often the first ML baseline for fraud and log analytics.
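
A minimal sketch using scikit-learn's `IsolationForest`, assuming scikit-learn is installed. The two features (amount, rate) and the injected outliers are synthetic and purely illustrative.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Synthetic "normal" rows: two features with different scales.
normal = rng.normal(loc=[50.0, 1.0], scale=[10.0, 0.2], size=(1000, 2))
# Two hand-made outliers, appended as the last two rows.
outliers = np.array([[500.0, 5.0], [450.0, 0.0]])
X = np.vstack([normal, outliers])

forest = IsolationForest(n_estimators=200, contamination="auto", random_state=0)
forest.fit(X)

# score_samples returns lower values for more abnormal points, so negate it
# to get a score where higher means more suspicious.
scores = -forest.score_samples(X)
most_suspicious = np.argsort(scores)[::-1][:10]
```

Ranking by `scores` and reviewing the top of the list is usually more useful than the hard labels from `predict`, because the default cutoff is rarely the right operating point.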

One-Class SVM and support vector data description

One-Class SVM learns a tight boundary around normal training data in kernel space. It works well when normal data is tightly clustered and struggles when "normal" is multimodal (many distinct legitimate behaviors). Training cost grows faster than linearly with sample size, so subsampling or feature reduction is common.
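
A small sketch with scikit-learn's `OneClassSVM`, assuming a single tight cluster of normal data. The data is synthetic, and `nu=0.05` (roughly the fraction of training points allowed outside the boundary) is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
# Synthetic normal behavior: one compact cluster around the origin.
train = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(500, 2))

# Scaling first matters because the RBF kernel is distance-based.
model = make_pipeline(
    StandardScaler(),
    OneClassSVM(kernel="rbf", gamma="scale", nu=0.05),
)
model.fit(train)

probe = np.array([[0.1, -0.2], [8.0, 8.0]])
labels = model.predict(probe)  # +1 = inside the boundary, -1 = outside
```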

Autoencoders

An autoencoder compresses input through a bottleneck and reconstructs it. Train on normal data only; at inference, high reconstruction error signals anomaly. Variational autoencoders (VAEs) model uncertainty in the latent space. Autoencoders shine on images, sensor spectra, and embedding vectors where isolation forest's axis-aligned splits are weak. Watch for "normal" modes the bottleneck cannot represent — legitimate novelty gets flagged as anomaly.
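
The reconstruction-error idea can be sketched without a deep learning framework. The example below deliberately swaps in a linear autoencoder computed with PCA (via SVD) instead of a trained neural network, so it shows the scoring logic, not a production model. The data, the two-dimensional bottleneck, and the 99.5th-percentile threshold are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic "normal" data: 5 features driven by 2 latent factors plus small noise.
latent = rng.normal(size=(2000, 2))
mixing = rng.normal(size=(2, 5))
normal = latent @ mixing + 0.05 * rng.normal(size=(2000, 5))

mean = normal.mean(axis=0)
# The top-2 right singular vectors act as tied encoder/decoder weights.
_, _, vt = np.linalg.svd(normal - mean, full_matrices=False)
components = vt[:2]

def reconstruction_error(x):
    """Squared error after squeezing each row through the 2-D bottleneck."""
    centered = np.atleast_2d(x) - mean
    code = centered @ components.T   # encode: 5 features -> 2
    restored = code @ components     # decode: 2 -> 5 features
    return np.sum((centered - restored) ** 2, axis=1)

threshold = np.percentile(reconstruction_error(normal), 99.5)

# Push a normal row off the learned subspace along a direction the bottleneck discards.
anomaly = normal[0] + 2.0 * vt[-1]
is_flagged = reconstruction_error(anomaly)[0] > threshold
```

A nonlinear autoencoder replaces the two matrix products with learned encoder and decoder networks; the thresholding on reconstruction error stays the same.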

Density estimation

Gaussian Mixture Models and kernel density estimation assign a likelihood to each point; a low likelihood signals an anomaly. Local Outlier Factor (LOF) compares the local density of a point to that of its neighbors, which is useful when density varies across regions and a single global density threshold would misjudge points that sit just outside a tight cluster.
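
A sketch of LOF on data with two clusters of very different density, using scikit-learn's `LocalOutlierFactor`. The clusters and the probe point are synthetic; `n_neighbors=20` is the library default, kept here for simplicity.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(3)
dense = rng.normal(loc=[0.0, 0.0], scale=0.1, size=(200, 2))     # tight cluster
sparse = rng.normal(loc=[10.0, 10.0], scale=2.0, size=(200, 2))  # loose cluster
# A point that is close to the tight cluster in absolute terms,
# but far away relative to that cluster's spacing.
near_dense = np.array([[1.0, 1.0]])
X = np.vstack([dense, sparse, near_dense])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)              # -1 marks outliers
scores = -lof.negative_outlier_factor_   # larger = more anomalous
```

The probe point sits closer to its nearest neighbors than many points in the loose cluster do, yet it receives the highest score because its neighborhood is far sparser than that of the dense cluster it borders.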

Supervised and semi-supervised hybrids

When you have some labeled fraud or incident history, train a classifier — but expect severe class imbalance. Techniques from precision-recall evaluation apply: threshold tuning, cost-sensitive learning, and oversampling rare positives. Semi-supervised approaches train on mostly normal data plus a small labeled anomaly set to calibrate scores. In production, many teams run an unsupervised scorer first and a supervised re-ranker on the top-K alerts.

Time series anomalies

Metrics indexed by time need detectors that respect ordering and seasonality. Random shuffling for train/test splits leaks future information — the same discipline as time series forecasting applies.

Residual-based detection

Fit a forecaster (ARIMA, Prophet, seasonal naive, or gradient boosting with lag features). Flag points where the absolute residual exceeds a dynamic threshold, often a multiple of the rolling standard deviation of recent residuals. This separates expected seasonal peaks from true surprises.
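
A minimal sketch with the simplest forecaster on that list, seasonal naive (predict the value from one season ago). The function name, the hourly series with a 24-step season, the 72-step residual window, and the multiplier `k=4.0` are all assumptions for the example.

```python
import numpy as np

def seasonal_naive_flags(series, period, window, k=4.0):
    """Flag points whose seasonal-naive residual exceeds k rolling residual stds."""
    series = np.asarray(series, dtype=float)
    # Residual = actual value minus the value observed one season earlier.
    residuals = series[period:] - series[:-period]
    flags = np.zeros(len(series), dtype=bool)
    for i in range(window, len(residuals)):
        # Only past residuals set the scale, so no future information leaks in.
        scale = residuals[i - window:i].std()
        if scale > 0 and abs(residuals[i]) > k * scale:
            flags[i + period] = True
    return flags

rng = np.random.default_rng(4)
t = np.arange(240)  # ten synthetic "days" of hourly data
traffic = 100 + 30 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 1, size=240)
spike_at = 230
traffic[spike_at] += 40  # injected surprise on the last day

flags = seasonal_naive_flags(traffic, period=24, window=72)
```

The daily peaks swing by 30 units and are not flagged, because the forecast already expects them. One caveat of seasonal naive: a spike becomes the forecast one season later, so it can produce an echo unless flagged points are excluded from later forecasts.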

Streaming windows

Maintain rolling mean, variance, and quantiles over the last N minutes or events. Compare the current bucket to the window. Exponentially weighted moving averages (EWMA) react faster to recent shifts; longer windows resist single-spike noise.
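
A streaming detector of this kind needs only a few numbers of state. The sketch below tracks an exponentially weighted mean and variance; the class name, `alpha`, `k`, the warm-up length, and the sample stream are hypothetical choices.

```python
import math

class EwmaDetector:
    """Track an exponentially weighted mean and variance; flag large deviations."""

    def __init__(self, alpha=0.1, k=4.0, warmup=10):
        self.alpha, self.k, self.warmup = alpha, k, warmup
        self.mean, self.var, self.seen = 0.0, 0.0, 0

    def update(self, x):
        """Score x against the current state, then fold it into the state."""
        self.seen += 1
        if self.seen == 1:
            self.mean = x
            return False
        deviation = x - self.mean
        std = math.sqrt(self.var)
        is_anomaly = (
            self.seen > self.warmup and std > 0 and abs(deviation) > self.k * std
        )
        # Update after scoring so the current point cannot mask itself.
        self.mean += self.alpha * deviation
        self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)
        return is_anomaly

# Hypothetical per-minute request counts with one spike.
stream = [100, 101, 99, 100, 102, 98, 100, 101, 99, 100, 101, 99, 100, 160, 100]
detector = EwmaDetector()
flags = [detector.update(v) for v in stream]
```

A larger `alpha` forgets history faster and adapts sooner to level shifts; a smaller one behaves like a longer window. Note that the spike itself inflates the variance estimate, which temporarily desensitizes the detector; skipping the state update for flagged points is one common mitigation.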

Matrix profiles and discord discovery

For multivariate or shape-based anomalies, matrix profile algorithms find subsequences that are maximally different from their nearest neighbor in the same series — effective for collective anomalies like sensor glitches or repeated micro-stutters in latency.
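
To make the idea concrete, here is a brute-force sketch of a univariate matrix profile. It is quadratic in series length and meant only for illustration; dedicated libraries use much faster algorithms. The function names, the window length `m`, the exclusion zone, and the synthetic signal are assumptions.

```python
import numpy as np

def znorm(window):
    """Z-normalize a window so that shape, not level or scale, is compared."""
    std = window.std()
    return (window - window.mean()) / std if std > 0 else np.zeros_like(window)

def matrix_profile(series, m):
    """Distance from each length-m window to its nearest non-overlapping window."""
    series = np.asarray(series, dtype=float)
    windows = np.array([znorm(series[i:i + m]) for i in range(len(series) - m + 1)])
    n = len(windows)
    profile = np.full(n, np.inf)
    for i in range(n):
        dists = np.linalg.norm(windows - windows[i], axis=1)
        # Exclude overlapping windows, which would be trivially similar.
        lo, hi = max(0, i - m + 1), min(n, i + m)
        dists[lo:hi] = np.inf
        profile[i] = dists.min()
    return profile

rng = np.random.default_rng(6)
t = np.arange(400)
signal = np.sin(2 * np.pi * t / 25) + 0.05 * rng.normal(size=400)
signal[200:210] = 0.0  # injected glitch: a short flat stretch

m = 25
profile = matrix_profile(signal, m)
discord_start = int(np.argmax(profile))  # start of the most unusual window
```

The discord is the window whose best match anywhere else in the series is still poor; here it overlaps the flat stretch even though every individual value in that stretch is within the signal's normal range.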

Multivariate correlation breaks

CPU and request rate usually move together. An anomaly may be a normal CPU reading paired with abnormally low traffic — visible only when features are scored jointly (Mahalanobis distance, copula models, or multivariate autoencoders).
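
A sketch of joint scoring with Mahalanobis distance. The request and CPU figures are synthetic, with CPU constructed to track traffic; the function name and probe points are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
requests = rng.normal(1000.0, 100.0, size=2000)
cpu = 0.05 * requests + rng.normal(0.0, 2.0, size=2000)  # CPU tracks traffic
history = np.column_stack([requests, cpu])

mean = history.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(history, rowvar=False))

def mahalanobis(point):
    """Distance from the historical mean, scaled by the joint covariance."""
    delta = np.asarray(point, dtype=float) - mean
    return float(np.sqrt(delta @ cov_inv @ delta))

typical = mahalanobis([1000.0, 50.0])
# Low traffic with slightly elevated CPU: neither value alone is more than
# about 2.5 standard deviations from its own mean.
broken = mahalanobis([750.0, 55.0])
```

Per-feature z-scores would pass the second point, but the joint distance is large because that CPU level is far above what the history implies for so little traffic.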

Evaluation — why accuracy lies

If 0.1% of transactions are fraudulent, a model that always predicts "normal" achieves 99.9% accuracy and catches zero fraud. Anomaly detection is inherently imbalanced. Report precision, recall, F1, and especially precision at top-K — of the 100 highest-scored alerts, how many are real incidents analysts would confirm?
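
The gap between accuracy and ranking quality shows up even on toy data. In the sketch below, the labels and scores are fabricated for illustration (3 incidents in 1,000 events, with a detector that ranks two of them on top and misses the third), and the helper names are hypothetical.

```python
import numpy as np

def precision_at_k(scores, labels, k):
    """Fraction of the k highest-scored events that are real incidents."""
    top = np.argsort(scores)[::-1][:k]
    return float(np.asarray(labels)[top].mean())

def recall_at_k(scores, labels, k):
    """Fraction of all real incidents that appear in the k highest-scored events."""
    labels = np.asarray(labels)
    top = np.argsort(scores)[::-1][:k]
    return float(labels[top].sum() / labels.sum())

labels = np.zeros(1000, dtype=int)
labels[[3, 500, 999]] = 1  # three real incidents

# Accuracy of a "model" that always predicts normal.
always_normal_accuracy = float((labels == 0).mean())

scores = np.linspace(0.0, 0.5, 1000)  # background scores for normal events
scores[[3, 500]] = [0.9, 0.8]         # two incidents ranked on top
scores[999] = 0.01                    # one incident the detector misses

p_at_2 = precision_at_k(scores, labels, 2)
p_at_10 = precision_at_k(scores, labels, 10)
r_at_10 = recall_at_k(scores, labels, 10)
```

The always-normal baseline scores 99.7% accuracy while finding nothing; precision and recall at K describe what an analyst working the top of the queue would actually see.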

Ground truth is messy: many "anomalies" are never labeled, and yesterday's anomaly becomes today's normal after a product launch. Use time-based holdouts: train on January–March, evaluate April, simulating production lag. Track alert volume and analyst confirmation rate in live shadow mode before paging anyone.
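
A time-based holdout is simple to implement, and keeping it as an explicit function guards against an accidental shuffle. The event structure, the year, and the function name below are hypothetical.

```python
from datetime import date

def time_based_split(events, cutoff):
    """Train on everything strictly before the cutoff, evaluate on the rest."""
    train = [e for e in events if e["day"] < cutoff]
    test = [e for e in events if e["day"] >= cutoff]
    return train, test

# Hypothetical events, one per month from January to April.
events = [{"day": date(2024, month, 1), "score": 0.1 * month} for month in range(1, 5)]
train, test = time_based_split(events, cutoff=date(2024, 4, 1))
```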

Cost asymmetry matters. A missed fraud case may cost $10,000; a false alert costs five minutes of analyst time. Encode that ratio when choosing thresholds — not a universal 0.5 cutoff on a probability score.
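
One way to encode the ratio is to pick the threshold that minimizes total cost on a labeled validation set. The sketch below uses the $10,000 miss cost from the text and assumes, hypothetically, that five analyst minutes cost about $5; the scores and labels are made up.

```python
import numpy as np

def pick_threshold(scores, labels, cost_miss=10_000.0, cost_false_alert=5.0):
    """Return (threshold, cost) for the candidate threshold with the lowest total cost."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    best_threshold, best_cost = None, np.inf
    for t in np.unique(scores):
        alerts = scores >= t
        misses = np.sum(labels & ~alerts)        # real incidents below the threshold
        false_alerts = np.sum(~labels & alerts)  # normal events above it
        cost = misses * cost_miss + false_alerts * cost_false_alert
        if cost < best_cost:
            best_threshold, best_cost = float(t), float(cost)
    return best_threshold, best_cost

# Hypothetical validation scores; the second fraud case scores only 0.40.
scores = [0.95, 0.90, 0.40, 0.35, 0.30, 0.20, 0.10]
labels = [1, 0, 1, 0, 0, 0, 0]
threshold, cost = pick_threshold(scores, labels)
```

With misses this expensive, the cheapest operating point accepts a false alert in order to catch the low-scoring fraud case, which puts the threshold below 0.5.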

Production deployment

Batch vs streaming

Batch jobs re-score yesterday's logs nightly — fine for compliance review. Streaming detectors score each event or metric bucket within seconds — required for fraud blocks and SRE paging. Stream processors need state (rolling windows, learned centroids) with checkpointed recovery after restarts.

Thresholds, hysteresis, and alert fatigue

A fixed threshold on a drifting score produces either silence or noise. Use percentile-based thresholds calibrated on recent normal data, per-segment baselines (per merchant, per region), and hysteresis: fire when the score crosses a high threshold, and clear only when it drops below a lower one. Group related alerts into incidents; suppress duplicates for the same root cause.
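
Hysteresis amounts to a two-state machine. A minimal sketch, with a hypothetical class name and arbitrary thresholds of 0.8 (fire) and 0.6 (clear):

```python
class HysteresisAlert:
    """Fire at a high threshold; clear only after dropping below a lower one."""

    def __init__(self, fire_at=0.8, clear_at=0.6):
        if clear_at >= fire_at:
            raise ValueError("clear_at must be lower than fire_at")
        self.fire_at, self.clear_at = fire_at, clear_at
        self.active = False

    def update(self, score):
        """Return 'fire' or 'clear' on a state change, otherwise None."""
        if not self.active and score >= self.fire_at:
            self.active = True
            return "fire"
        if self.active and score < self.clear_at:
            self.active = False
            return "clear"
        return None

# Hypothetical score sequence that hovers around the firing threshold.
scores = [0.5, 0.85, 0.75, 0.82, 0.65, 0.55, 0.7, 0.9]
alert = HysteresisAlert()
events = [alert.update(s) for s in scores]
```

With a single threshold at 0.8, the same sequence would toggle on every crossing; the gap between the two thresholds absorbs that flapping, so the alert fires once, clears once, and fires again only on the later rise.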

Human feedback loops

Analysts who mark alerts "true positive" or "false positive" generate labels for retraining. Without this loop, detectors decay as products and attackers evolve — the same concept drift problem that affects supervised models.

Observability integration

Anomaly scores should appear alongside raw metrics in your metrics and tracing stack so on-call engineers see why something fired. Log the feature vector, score, model version, and threshold at decision time for postmortems.

Explainability

"Anomaly score 0.97" is not actionable. SHAP values, feature contribution rankings, or simple rules ("amount 12x user's 90-day max AND new device") help analysts trust or dismiss alerts quickly. Regulatory environments often require explainability for automated blocks.

Production checklist

Define what "normal" means per segment — global thresholds fail on heterogeneous users or tenants.
Ship statistical baselines (z-score, IQR, seasonal naive residuals) before complex ML — they are your sanity check.
Choose detectors by data shape: Isolation Forest for tabular, autoencoders for embeddings/images, residual forecasting for seasonal metrics.
Validate with time-based splits and report precision-recall at operational thresholds — never accuracy alone.
Run shadow mode — score and log alerts without paging until confirmation rate is acceptable.
Implement per-segment thresholds, hysteresis, and deduplication to control alert volume.
Capture analyst feedback as labels; schedule retraining when confirmation rate drops.
Log score, features, model version, and threshold for every decision — required for audits and debugging.

Key takeaways

Anomalies are contextual — the same value can be normal at noon and alarming at midnight; segment and season matter.
Start simple — z-score, IQR, and control charts are fast baselines; ML earns its complexity on high-dimensional or subtle patterns.
Isolation Forest and autoencoders cover most tabular and embedding use cases; pick by whether axis-aligned splits or reconstruction error fits your data.
Time order is sacred — use residual forecasting and rolling windows, not random train/test splits on metrics.
Optimize for analyst time — precision at top-K, explainability, and feedback loops matter more than raw recall in production.