Guide

Gaussian processes explained

Harbor Analytics forecasted weekly SKU demand with gradient-boosted trees. Median absolute error was competitive, but the planning team kept over-ordering safety stock because the model returned a single number with no sense of how wrong it might be on sparse categories or during promotional spikes. They replaced the booster with Gaussian process (GP) regression on 180 days of daily sales for mid-volume items: point accuracy stayed within 3% of XGBoost, but the GP's predictive variance widened automatically where data were thin — exactly where planners needed humility. A Gaussian process is a distribution over functions: instead of learning fixed weights, you specify how similar any two inputs should be via a kernel, observe data, and obtain a posterior that gives both a mean prediction and a calibrated uncertainty band at every point. This guide builds GP intuition from kernels through posterior inference, hyperparameter optimization, scaling limits and sparse approximations, works the Harbor demand forecast, provides a method decision table, lists pitfalls, and ends with a practitioner checklist. For sequential black-box tuning that uses GPs as surrogates, see Bayesian optimization; for distribution-free prediction intervals on any model, see conformal prediction.

What a Gaussian process models

Classical regression assumes a parametric form — linear, polynomial, or a fixed neural architecture. A GP takes a nonparametric Bayesian view: any finite set of function values f(x₁), …, f(x_n) is jointly Gaussian, fully specified by a mean function m(x) and a covariance function k(x, x') called the kernel.

Given training inputs X and targets y (with optional observation noise), the posterior at a new point x_* is again Gaussian:

f(x_*) | X, y ~ Normal(μ_*, σ²_*)

The mean μ_* is your point prediction; the variance σ²_* quantifies uncertainty. Far from training data, the posterior reverts toward the prior — uncertainty grows. Near dense clusters of observations, variance shrinks. That behavior is the product feature gradient boosters and plain neural nets do not provide out of the box.

Kernel functions encode inductive bias

The kernel answers: how correlated should outputs be at two inputs? Common choices:

RBF (squared exponential): infinitely smooth functions; strong default for low-dimensional continuous inputs. Length-scale hyperparameter controls how quickly correlation decays with distance.
Matérn family: less smooth than RBF; Matérn-5/2 is a practical compromise for physical and economic time series that are not infinitely differentiable.
Periodic kernel: encodes repeating patterns (weekly seasonality in demand, daily cycles in energy load).
Linear kernel: equivalent to Bayesian linear regression; useful baseline.
Composite kernels: sums and products combine trends plus seasonality — e.g. k = k_trend + k_seasonal + k_noise.

Kernel choice is not cosmetic; it is the main lever for whether the GP extrapolates sensibly. Harbor's demand model uses Matérn-5/2 plus a weekly periodic term on day-of-week index and a separate RBF on promotional-intensity features.

Posterior inference and noise

For noise-free observations, the posterior mean at test points is a kernel-weighted combination of training targets — a generalization of kriging in geostatistics. With homoscedastic Gaussian observation noise variance σ²_n, the model becomes:

y = f(x) + ε, where ε ~ Normal(0, σ²_n)

Closed-form posterior updates require inverting an n × n kernel matrix — O(n³) time and O(n²) memory. That is fine for hundreds or a few thousand points; it breaks at hundred-thousand-row warehouse logs without approximations.

What uncertainty means in practice

Epistemic component: uncertainty because you have not seen data in that region of input space.
Aleatoric component: irreducible noise in observations, modeled via σ²_n.
Credible intervals: a 95% interval uses μ_* ± 1.96 σ_* under Gaussian assumptions.

GPs provide analytic uncertainty given the kernel and noise assumptions. If those assumptions are wrong — heavy-tailed outliers, non-stationary variance — intervals can be miscalibrated. Pair with residual diagnostics or methods like conformal prediction when audit-grade coverage is required.

Hyperparameter optimization

Kernels carry hyperparameters: length-scales, variance scales, periodicity, noise level. Two standard approaches:

Marginal likelihood maximization

Integrate out the function values and optimize the log marginal likelihood of the data with respect to hyperparameters. This is the default in sklearn.gaussian_process.GaussianProcessRegressor and GPyTorch. It automatically balances data fit against model complexity — overly flexible kernels that overfit receive lower marginal likelihood.

Cross-validation

When marginal likelihood optimization is brittle (small n, many hyperparameters), k-fold CV on held-out log predictive density or RMSE is safer. See cross-validation for split hygiene; time-series forecasts need forward-chaining, not random shuffles.

Feature scaling

RBF and Matérn kernels are sensitive to input scale. Standardize continuous features to zero mean and unit variance before fitting; see feature scaling. Categorical variables typically need explicit embedding or one-hot expansion with per-group length-scales.

Scaling: sparse and inducing-point GPs

Exact GPs do not scale to modern tabular datasets with 100k+ rows. Approximations retain uncertainty while reducing complexity:

Inducing points / sparse variational GPs: summarize the dataset with m ≪ n learned or selected pseudo-inputs; inference drops toward O(nm²) or O(m³).
Local GPs: partition input space and fit independent GPs per region; good for geospatial or multi-modal data.
Random Fourier features: approximate stationary kernels with explicit feature maps; fast but uncertainty quality varies.
GPU frameworks: GPyTorch and GPflow implement exact and variational GPs with CUDA acceleration for moderate n.

Rule of thumb: exact GPs up to roughly 5,000 points on CPU; beyond that, plan for sparse variational methods or switch to a scalable point predictor plus conformal intervals.

GP regression vs classification

The closed-form posterior above is for regression. For classification, the latent function is passed through a sigmoid or softmax; the posterior is no longer analytically tractable. Common approaches:

Laplace approximation: Gaussian approximation around the MAP latent function; fast, works well for binary problems.
Variational inference: more flexible, higher cost.
EP (expectation propagation): accurate for multi-class, implementation-heavy.

For high-dimensional image or text classification, deep networks dominate. GPs shine on small-to-medium tabular problems where calibrated uncertainty matters and data are expensive to collect.

Harbor Analytics: weekly demand with uncertainty bands

Harbor's planning squad targeted 420 mid-volume SKUs across three fulfillment regions. Historical daily unit sales (180 days), calendar features (day of week, week of year), promotional flags, and competitor price-index deltas formed the feature vector. The prior XGBoost regressor achieved 11.2% WAPE on a 28-day holdout but gave no interval guidance; planners added a flat 18% safety buffer, inflating carrying cost.

The team fit a composite kernel GP in GPyTorch with 400 inducing points per region:

Matérn-5/2 on continuous price and promo-intensity features (length-scales learned via marginal likelihood).
Weekly periodic kernel on day-of-week index for recurring basket patterns.
White-noise term for daily count volatility.

Holdout WAPE rose slightly to 11.6% — acceptable. The win was operational: 90% posterior intervals achieved 88% empirical coverage on the holdout (close to nominal). SKUs with sparse promo history showed wide bands; the planner reduced safety stock on 60% of lines while flagging 12 high-uncertainty SKUs for manual review. Peak-season weeks where the GP variance doubled triggered an automatic “review forecast” ticket instead of silent over-ordering.

Harbor re-tunes hyperparameters monthly on a rolling 180-day window and compares interval coverage weekly. When coverage drifts below 85%, they fall back to conformalized residuals on top of the GP mean.

Method decision table

Need	Gaussian process	Gradient boosting / RF	Deep network	Kalman filter
Calibrated uncertainty on small tabular data	Strong fit	Quantile regression or ensembles needed	MC dropout / ensembles; often miscalibrated	Strong for linear-Gaussian state space
100k+ rows, many features	Sparse GP or avoid	Strong fit	Strong fit	EKF/UKF if state-space known
Sequential black-box HPO	Surrogate in BO	Not typical	Not typical	N/A
Images, text, audio	Impractical (flat kernels)	Moderate	Strong fit	N/A
Real-time streaming state estimation	Too slow (batch)	Moderate	Moderate	Strong fit
Interpretable similarity structure	Kernel inspectable	SHAP on trees	Harder	Physics model dependent

Common pitfalls

O(n³) surprise: fitting exact GPs on 50k rows without approximation stalls pipelines; profile before productionizing.
Unscaled features: one dollar-denominated column and one 0–1 flag distort length-scale learning; always standardize continuous inputs.
Wrong kernel for non-stationary data: a single stationary RBF cannot capture structural breaks; use changepoint kernels or segment the series.
Overconfident extrapolation: GPs revert to the prior outside training support, but misspecified kernels can still look confident; plot posterior samples.
Ignoring heavy tails: count data and revenue spikes violate Gaussian noise; consider log transforms, Student-t likelihoods, or robust alternatives.
Random CV on time series: shuffled folds leak future into past; use rolling-origin validation.
Conflating GP variance with business risk: posterior variance is model uncertainty under assumptions; stress scenarios may still need separate simulation.

Practitioner checklist

Confirm problem size fits exact GP or plan sparse/inducing-point approximation.
Standardize continuous features; document categorical encoding.
Choose kernel family matching smoothness and seasonality (RBF, Matérn, periodic, composite).
Optimize hyperparameters via marginal likelihood with restarts; fall back to CV if unstable.
Validate on held-out data with proper time-series splits when applicable.
Report point metric (RMSE, WAPE) and interval coverage (e.g. 90% nominal vs empirical).
Plot posterior mean plus credible bands on representative SKUs and edge cases.
Monitor coverage in production; recalibrate with conformal residuals if drift appears.
Document kernel formula and hyperparameters in the model card for audit.
Benchmark against a strong point predictor (XGBoost, linear) to justify GP complexity.

Key takeaways

A Gaussian process is a distribution over functions defined by a kernel that encodes similarity between inputs.
Posterior predictions include analytic uncertainty that grows away from data and shrinks near dense observations.
Kernel and noise hyperparameters are learned via marginal likelihood or cross-validation; scaling inputs is mandatory.
Exact GPs scale O(n³); sparse variational methods extend reach to larger datasets.
GPs excel on small-to-medium tabular regression where honest uncertainty bands drive better decisions than point forecasts alone.