Guide
Gaussian processes explained
Harbor Analytics forecasted weekly SKU demand with gradient-boosted trees. Median absolute error was competitive, but the planning team kept over-ordering safety stock because the model returned a single number with no sense of how wrong it might be on sparse categories or during promotional spikes. They replaced the booster with Gaussian process (GP) regression on 180 days of daily sales for mid-volume items: point accuracy stayed within 3% of XGBoost, but the GP's predictive variance widened automatically where data were thin — exactly where planners needed humility. A Gaussian process is a distribution over functions: instead of learning fixed weights, you specify how similar any two inputs should be via a kernel, observe data, and obtain a posterior that gives both a mean prediction and a calibrated uncertainty band at every point. This guide builds GP intuition from kernels through posterior inference, hyperparameter optimization, scaling limits and sparse approximations, works the Harbor demand forecast, provides a method decision table, lists pitfalls, and ends with a practitioner checklist. For sequential black-box tuning that uses GPs as surrogates, see Bayesian optimization; for distribution-free prediction intervals on any model, see conformal prediction.
What a Gaussian process models
Classical regression assumes a parametric form — linear, polynomial, or a fixed
neural architecture. A GP takes a nonparametric Bayesian view: any finite set of
function values f(x1), …, f(xn) is jointly
Gaussian, fully specified by a mean function m(x) and a covariance
function k(x, x') called the kernel.
Given training inputs X and targets y (with optional
observation noise), the posterior at a new point x* is
again Gaussian:
f(x*) | X, y ~ Normal(μ*, σ2*)
The mean μ* is your point prediction; the variance
σ2* quantifies uncertainty. Far from
training data, the posterior reverts toward the prior — uncertainty grows.
Near dense clusters of observations, variance shrinks. That behavior is the product
feature gradient boosters and plain neural nets do not provide out of the box.
Kernel functions encode inductive bias
The kernel answers: how correlated should outputs be at two inputs? Common choices:
- RBF (squared exponential): infinitely smooth functions; strong default for low-dimensional continuous inputs. Length-scale hyperparameter controls how quickly correlation decays with distance.
- Matérn family: less smooth than RBF; Matérn-5/2 is a practical compromise for physical and economic time series that are not infinitely differentiable.
- Periodic kernel: encodes repeating patterns (weekly seasonality in demand, daily cycles in energy load).
- Linear kernel: equivalent to Bayesian linear regression; useful baseline.
- Composite kernels: sums and products combine trends plus seasonality — e.g.
k = ktrend + kseasonal + knoise.
Kernel choice is not cosmetic; it is the main lever for whether the GP extrapolates sensibly. Harbor's demand model uses Matérn-5/2 plus a weekly periodic term on day-of-week index and a separate RBF on promotional-intensity features.
Posterior inference and noise
For noise-free observations, the posterior mean at test points is a kernel-weighted
combination of training targets — a generalization of kriging in geostatistics.
With homoscedastic Gaussian observation noise variance σ2n,
the model becomes:
y = f(x) + ε, where ε ~ Normal(0, σ2n)
Closed-form posterior updates require inverting an n × n kernel
matrix — O(n3) time and O(n2)
memory. That is fine for hundreds or a few thousand points; it breaks at hundred-thousand-row
warehouse logs without approximations.
What uncertainty means in practice
- Epistemic component: uncertainty because you have not seen data in that region of input space.
- Aleatoric component: irreducible noise in observations, modeled via
σ2n. - Credible intervals: a 95% interval uses
μ* ± 1.96 σ*under Gaussian assumptions.
GPs provide analytic uncertainty given the kernel and noise assumptions. If those assumptions are wrong — heavy-tailed outliers, non-stationary variance — intervals can be miscalibrated. Pair with residual diagnostics or methods like conformal prediction when audit-grade coverage is required.
Hyperparameter optimization
Kernels carry hyperparameters: length-scales, variance scales, periodicity, noise level. Two standard approaches:
Marginal likelihood maximization
Integrate out the function values and optimize the log marginal likelihood
of the data with respect to hyperparameters. This is the default in
sklearn.gaussian_process.GaussianProcessRegressor and GPyTorch. It
automatically balances data fit against model complexity — overly flexible
kernels that overfit receive lower marginal likelihood.
Cross-validation
When marginal likelihood optimization is brittle (small n, many
hyperparameters), k-fold CV on held-out log predictive density or RMSE is safer.
See
cross-validation
for split hygiene; time-series forecasts need forward-chaining, not random shuffles.
Feature scaling
RBF and Matérn kernels are sensitive to input scale. Standardize continuous features to zero mean and unit variance before fitting; see feature scaling. Categorical variables typically need explicit embedding or one-hot expansion with per-group length-scales.
Scaling: sparse and inducing-point GPs
Exact GPs do not scale to modern tabular datasets with 100k+ rows. Approximations retain uncertainty while reducing complexity:
- Inducing points / sparse variational GPs: summarize the dataset with
m ≪ nlearned or selected pseudo-inputs; inference drops toward O(nm2) or O(m3). - Local GPs: partition input space and fit independent GPs per region; good for geospatial or multi-modal data.
- Random Fourier features: approximate stationary kernels with explicit feature maps; fast but uncertainty quality varies.
- GPU frameworks: GPyTorch and GPflow implement exact and variational GPs with CUDA acceleration for moderate
n.
Rule of thumb: exact GPs up to roughly 5,000 points on CPU; beyond that, plan for sparse variational methods or switch to a scalable point predictor plus conformal intervals.
GP regression vs classification
The closed-form posterior above is for regression. For classification, the latent function is passed through a sigmoid or softmax; the posterior is no longer analytically tractable. Common approaches:
- Laplace approximation: Gaussian approximation around the MAP latent function; fast, works well for binary problems.
- Variational inference: more flexible, higher cost.
- EP (expectation propagation): accurate for multi-class, implementation-heavy.
For high-dimensional image or text classification, deep networks dominate. GPs shine on small-to-medium tabular problems where calibrated uncertainty matters and data are expensive to collect.
Harbor Analytics: weekly demand with uncertainty bands
Harbor's planning squad targeted 420 mid-volume SKUs across three fulfillment regions. Historical daily unit sales (180 days), calendar features (day of week, week of year), promotional flags, and competitor price-index deltas formed the feature vector. The prior XGBoost regressor achieved 11.2% WAPE on a 28-day holdout but gave no interval guidance; planners added a flat 18% safety buffer, inflating carrying cost.
The team fit a composite kernel GP in GPyTorch with 400 inducing points per region:
- Matérn-5/2 on continuous price and promo-intensity features (length-scales learned via marginal likelihood).
- Weekly periodic kernel on day-of-week index for recurring basket patterns.
- White-noise term for daily count volatility.
Holdout WAPE rose slightly to 11.6% — acceptable. The win was operational: 90% posterior intervals achieved 88% empirical coverage on the holdout (close to nominal). SKUs with sparse promo history showed wide bands; the planner reduced safety stock on 60% of lines while flagging 12 high-uncertainty SKUs for manual review. Peak-season weeks where the GP variance doubled triggered an automatic “review forecast” ticket instead of silent over-ordering.
Harbor re-tunes hyperparameters monthly on a rolling 180-day window and compares interval coverage weekly. When coverage drifts below 85%, they fall back to conformalized residuals on top of the GP mean.
Method decision table
| Need | Gaussian process | Gradient boosting / RF | Deep network | Kalman filter |
|---|---|---|---|---|
| Calibrated uncertainty on small tabular data | Strong fit | Quantile regression or ensembles needed | MC dropout / ensembles; often miscalibrated | Strong for linear-Gaussian state space |
| 100k+ rows, many features | Sparse GP or avoid | Strong fit | Strong fit | EKF/UKF if state-space known |
| Sequential black-box HPO | Surrogate in BO | Not typical | Not typical | N/A |
| Images, text, audio | Impractical (flat kernels) | Moderate | Strong fit | N/A |
| Real-time streaming state estimation | Too slow (batch) | Moderate | Moderate | Strong fit |
| Interpretable similarity structure | Kernel inspectable | SHAP on trees | Harder | Physics model dependent |
Common pitfalls
- O(n3) surprise: fitting exact GPs on 50k rows without approximation stalls pipelines; profile before productionizing.
- Unscaled features: one dollar-denominated column and one 0–1 flag distort length-scale learning; always standardize continuous inputs.
- Wrong kernel for non-stationary data: a single stationary RBF cannot capture structural breaks; use changepoint kernels or segment the series.
- Overconfident extrapolation: GPs revert to the prior outside training support, but misspecified kernels can still look confident; plot posterior samples.
- Ignoring heavy tails: count data and revenue spikes violate Gaussian noise; consider log transforms, Student-t likelihoods, or robust alternatives.
- Random CV on time series: shuffled folds leak future into past; use rolling-origin validation.
- Conflating GP variance with business risk: posterior variance is model uncertainty under assumptions; stress scenarios may still need separate simulation.
Practitioner checklist
- Confirm problem size fits exact GP or plan sparse/inducing-point approximation.
- Standardize continuous features; document categorical encoding.
- Choose kernel family matching smoothness and seasonality (RBF, Matérn, periodic, composite).
- Optimize hyperparameters via marginal likelihood with restarts; fall back to CV if unstable.
- Validate on held-out data with proper time-series splits when applicable.
- Report point metric (RMSE, WAPE) and interval coverage (e.g. 90% nominal vs empirical).
- Plot posterior mean plus credible bands on representative SKUs and edge cases.
- Monitor coverage in production; recalibrate with conformal residuals if drift appears.
- Document kernel formula and hyperparameters in the model card for audit.
- Benchmark against a strong point predictor (XGBoost, linear) to justify GP complexity.
Key takeaways
- A Gaussian process is a distribution over functions defined by a kernel that encodes similarity between inputs.
- Posterior predictions include analytic uncertainty that grows away from data and shrinks near dense observations.
- Kernel and noise hyperparameters are learned via marginal likelihood or cross-validation; scaling inputs is mandatory.
- Exact GPs scale O(n3); sparse variational methods extend reach to larger datasets.
- GPs excel on small-to-medium tabular regression where honest uncertainty bands drive better decisions than point forecasts alone.
Related reading
- Bayesian optimization explained — GP surrogates, acquisition functions and hyperparameter search
- Conformal prediction explained — distribution-free prediction intervals with coverage guarantees
- Kalman filter explained — recursive Gaussian state estimation for streaming sensors
- Feature scaling explained — standardization before distance-based and kernel methods