Time series forecasting explained

Your ops team needs to know how many servers to provision next Tuesday. Finance wants next quarter's revenue. A retailer must restock before Black Friday. Each problem shares the same structure: ordered observations over time, where yesterday's value influences today's and random shuffling would destroy the signal. Time series forecasting predicts future values from that history — but it breaks many assumptions that work for ordinary tabular machine learning. This guide covers decomposition, classical models like ARIMA, modern gradient boosting and deep learning approaches, how to validate without leakage via walk-forward splits, evaluation metrics, and a checklist before you ship a forecaster into production.

What makes time series different

A time series is a sequence of measurements indexed by time — hourly API latency, daily active users, monthly sales. Unlike IID rows in a customer churn dataset, consecutive points are autocorrelated: today's temperature correlates with yesterday's. That dependence is the signal you exploit, but it also means a random train/test split leaks future information into training and inflates accuracy beyond what production will deliver.

Most series contain three components you can often separate:

Trend — long-term direction (growth, decline, or flat).
Seasonality — repeating patterns at fixed intervals (daily commute peaks, weekly retail cycles, annual holiday spikes).
Residual noise — irregular shocks after trend and seasonality are removed.
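
The three components above can be separated with a classical additive decomposition. The following is a minimal sketch, not a library implementation: the function name `decompose_additive` and the synthetic series are illustrative assumptions, and the code assumes an odd seasonal period so that the centered moving average is symmetric.

```python
from statistics import mean

def decompose_additive(y, period):
    """Split y into trend, seasonal, and residual parts (additive model).

    Assumes an odd `period`; even periods need a 2 x m moving average,
    which is omitted from this sketch.
    """
    half = period // 2
    n = len(y)
    # Centered moving average estimates the trend; undefined at the edges.
    trend = [None] * n
    for t in range(half, n - half):
        trend[t] = mean(y[t - half : t + half + 1])
    # Average the detrended values at each position within the cycle.
    buckets = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            buckets[t % period].append(y[t] - trend[t])
    raw = [mean(b) for b in buckets]
    offset = mean(raw)  # center so the seasonal effects sum to zero
    seasonal = [raw[t % period] - offset for t in range(n)]
    residual = [
        None if trend[t] is None else y[t] - trend[t] - seasonal[t]
        for t in range(n)
    ]
    return trend, seasonal, residual

# Synthetic series: linear trend plus a period-3 pattern, no noise.
pattern = [2.0, -1.0, -1.0]
y = [10 + 0.5 * t + pattern[t % 3] for t in range(12)]
trend, seasonal, residual = decompose_additive(y, period=3)
```

On this noiseless example the seasonal estimates recover the pattern and the residuals are zero wherever the trend is defined. On real data, large or structured residuals suggest the additive model is missing something.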

Stationarity is another core concept. A (weakly) stationary series has a constant mean and variance over time, and the correlation between two points depends only on how far apart they are, not on when they occur. Many models, especially ARIMA, assume stationarity or require differencing (subtracting the previous value from each observation) to achieve it. A series that trends upward forever is non-stationary until you model or difference away the trend.
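
As a small illustration of differencing, the sketch below uses a hypothetical helper named `difference`. A lag of 1 removes a linear trend; a lag equal to the seasonal period removes a repeating pattern.

```python
def difference(y, lag=1):
    """Return y[t] - y[t - lag] for every t where both values exist."""
    return [y[t] - y[t - lag] for t in range(lag, len(y))]

trending = [3, 5, 7, 9, 11]
first_diff = difference(trending)  # constant steps: the trend is gone

weekly = [1, 5, 2, 1, 5, 2, 1, 5, 2]
seasonal_diff = difference(weekly, lag=3)  # the repeating pattern cancels out
```

The differenced series is shorter than the original by `lag` points, and forecasts made on it must be summed back onto the last observed values to return to the original scale.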

Classical forecasting methods

Before reaching for neural networks, classical methods remain strong baselines — fast to fit, interpretable, and often hard to beat on short horizons with clear seasonality.

Moving averages and exponential smoothing

A simple moving average smooths noise by averaging the last k observations. Exponential smoothing weights recent points more heavily. Holt-Winters extends this with separate smoothing equations for level, trend, and seasonal components — the foundation of the ETS (Error-Trend-Seasonality) family used in many demand-planning tools.
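
The two simplest smoothers can be written in a few lines. This is a sketch with illustrative function names; it initializes the level at the first observation, which is one common convention among several, and it leaves out the trend and seasonal equations that Holt-Winters adds.

```python
def moving_average_forecast(y, k):
    """Forecast the next value as the mean of the last k observations."""
    return sum(y[-k:]) / k

def simple_exp_smoothing(y, alpha):
    """Return the smoothed level after each observation.

    level_t = alpha * y_t + (1 - alpha) * level_{t-1}, with level_0 = y_0.
    The last level is the forecast for the next step.
    """
    level = y[0]
    levels = [level]
    for value in y[1:]:
        level = alpha * value + (1 - alpha) * level
        levels.append(level)
    return levels

y = [10, 12, 11, 15]
ma_forecast = moving_average_forecast(y, k=2)
levels = simple_exp_smoothing(y, alpha=0.5)
```

A larger `alpha` reacts faster to recent changes but passes more noise through; a smaller one gives a smoother, slower level.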

ARIMA and SARIMA

ARIMA (AutoRegressive Integrated Moving Average) combines three ideas in one model: autoregression (past values predict the future), integration (differencing to remove trend), and a moving-average term on past forecast errors. You specify orders (p, d, q): the number of autoregressive lags, differencing steps, and moving-average error terms to include. SARIMA adds seasonal AR, differencing, and MA terms for repeating cycles (e.g., weekly seasonality with period 7 on daily data). Auto-ARIMA routines search the order space with information criteria like AIC.
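
To make the (p, d, q) notation concrete, here is a deliberately reduced sketch of ARIMA(1, 1, 0) without an intercept: difference once, then fit a single autoregressive coefficient by least squares. The function names are illustrative assumptions, there is no moving-average term, and production libraries typically estimate these models by maximum likelihood rather than this shortcut.

```python
def fit_ar1_on_differences(y):
    """Estimate phi in d_t = phi * d_{t-1} + e_t, where d_t = y_t - y_{t-1}."""
    d = [y[t] - y[t - 1] for t in range(1, len(y))]
    num = sum(d[t] * d[t - 1] for t in range(1, len(d)))
    den = sum(d[t - 1] ** 2 for t in range(1, len(d)))
    return num / den

def forecast_arima_110(y, phi, steps):
    """Forecast by projecting the differences, then summing them back."""
    last, last_diff = y[-1], y[-1] - y[-2]
    out = []
    for _ in range(steps):
        last_diff = phi * last_diff  # AR(1) step on the differenced series
        last = last + last_diff      # undo the differencing (the "I" in ARIMA)
        out.append(last)
    return out

# Differences are 8, 4, 2, 1: each step is half of the previous one.
y = [0, 8, 12, 14, 15]
phi = fit_ar1_on_differences(y)
forecast = forecast_arima_110(y, phi, steps=2)
```

Here the fitted coefficient is 0.5, so the forecast keeps rising by ever smaller increments.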

Prophet and structural models

Prophet, open-sourced by Facebook (now Meta), decomposes a series into trend, multiple seasonalities (daily, weekly, yearly), and holiday effects via additive regression. It handles missing data and trend changepoints gracefully, which makes it popular for business metrics with known calendar events. Prophet is not magic: it struggles with high-frequency data and very short histories. Still, it gives analysts a readable forecast with uncertainty intervals out of the box.

Machine learning and deep learning approaches

When you have rich exogenous features — promotions, weather, competitor prices — or thousands of related series, ML methods often outperform univariate classical models.

Feature engineering for time series

Tree-based models do not natively understand order; you must encode it. Standard feature engineering for forecasting includes:

Lag features: y_{t-1}, y_{t-7}, and y_{t-365} (roughly the same date last year; y_{t-364} keeps the weekday aligned).
Rolling statistics: 7-day mean, 30-day standard deviation, expanding max, each computed over a window that ends before the row being predicted.
Calendar features: day of week, month, is_holiday, days_until_payday.
Target encoding with care: never compute aggregates that include the row you are predicting (classic leakage).
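
The sketch below builds lag and rolling-mean features so that each row uses only values observed before its target. The function name `make_features` and the feature names are illustrative assumptions; calendar features and target encoding are left out to keep it short.

```python
from statistics import mean

def make_features(y, lags=(1, 7), window=7):
    """Build one feature row per target y[t] using only values before t."""
    start = max(max(lags), window)  # first t with every feature available
    rows = []
    for t in range(start, len(y)):
        row = {f"lag_{k}": y[t - k] for k in lags}
        # The window ends at t-1: the target y[t] is deliberately excluded.
        row[f"roll_mean_{window}"] = mean(y[t - window : t])
        row["target"] = y[t]
        rows.append(row)
    return rows

y = list(range(1, 11))  # 1, 2, ..., 10
rows = make_features(y, lags=(1, 2), window=3)
first = rows[0]  # targets y[3] = 4 using y[0..2] only
```

The leaky variant would slice `y[t - window + 1 : t + 1]`, which silently includes the value the model is asked to predict.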

Gradient boosting on tabular time features

LightGBM and XGBoost on lag-and-calendar features frequently win Kaggle demand-forecasting competitions. They handle missing values, mixed feature types, and nonlinear interactions without hand-tuning ARIMA orders. The trade-off is interpretability and the need for careful cross-validation that respects time order.

Deep learning: LSTM, TFT, and foundation forecasters

Recurrent networks (LSTM, GRU) and temporal convolutional networks learn patterns directly from sequences. The Temporal Fusion Transformer (TFT) combines attention with static covariates and known future inputs (e.g., planned promotions). Recent foundation models (Chronos, TimesFM, Lag-Llama) pre-train on massive multi-domain corpora and zero-shot forecast new series — promising for cold-start problems but heavier to deploy than a tuned LightGBM pipeline.

Validation: never shuffle time

Random k-fold cross-validation on time series is one of the most common mistakes in forecasting projects. If a training fold includes data from March and the validation fold covers February, the model has already seen the future during training.

Use walk-forward validation (rolling-origin evaluation) instead:

Train on data through time T.
Forecast the next h steps (the horizon).
Advance T by one or h steps and repeat.
Average error across all windows.
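
The four steps above can be sketched as follows. This assumes an expanding training window and uses a last-value forecaster as a stand-in for a real model; the function names are illustrative.

```python
def walk_forward_splits(n, initial_train, horizon, step=None):
    """Yield (train_idx, test_idx) ranges with an expanding training window."""
    step = step or horizon
    end = initial_train
    while end + horizon <= n:
        yield range(end), range(end, end + horizon)
        end += step

def naive_forecast(train, horizon):
    # Placeholder model: repeat the last observed value.
    return [train[-1]] * horizon

def walk_forward_mae(y, initial_train, horizon, step=None):
    """Average the per-window MAE over all rolling origins."""
    window_errors = []
    for train_idx, test_idx in walk_forward_splits(len(y), initial_train, horizon, step):
        train = [y[i] for i in train_idx]
        actual = [y[i] for i in test_idx]
        pred = naive_forecast(train, horizon)
        window_errors.append(sum(abs(a - p) for a, p in zip(actual, pred)) / horizon)
    return sum(window_errors) / len(window_errors)

y = [1, 2, 3, 4, 5, 6, 7, 8]
score = walk_forward_mae(y, initial_train=4, horizon=2)  # two windows
```

Every training index precedes every test index in each window, which is the property random k-fold breaks. A sliding window of fixed length is a reasonable alternative when old data is no longer representative.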

Match the validation horizon to your production horizon. A model tuned for one-step-ahead accuracy may fail badly at 30-day forecasts because error compounds. For multiple related series (every SKU in a catalog), apply the same time cutoff to every series in each split, so that no product contributes training rows dated after another product's test rows.

Evaluation metrics

Choose metrics that reflect business cost and scale across series.

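MAE (mean absolute error): average absolute miss in original units; less sensitive to outliers than RMSE.
RMSE (root mean squared error): penalizes large errors more heavily; sensitive to spikes.
MAPE (mean absolute percentage error): scale-free, but undefined when an actual is zero and unstable when actuals are near zero.
sMAPE (symmetric MAPE): divides by the average magnitude of actual and forecast, so it is bounded; still undefined when both are zero, but more stable than MAPE.
MASE (mean absolute scaled error): MAE divided by the in-sample MAE of a naive or seasonal-naive forecast; values below 1 beat that baseline, and the scaling makes models comparable across series with different magnitudes.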
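
The metrics listed above are short enough to write out directly. This is a sketch with illustrative function names; note that sMAPE has several definitions in circulation, and the one below is only one common choice.

```python
import math

def mae(actual, pred):
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)

def rmse(actual, pred):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual))

def mape(actual, pred):
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return sum(abs((a - p) / a) for a, p in zip(actual, pred)) / len(actual)

def smape(actual, pred):
    # One common definition; others differ by a constant factor.
    return sum(2 * abs(a - p) / (abs(a) + abs(p)) for a, p in zip(actual, pred)) / len(actual)

def mase(actual, pred, train, season=1):
    """Scale the forecast MAE by the in-sample MAE of a (seasonal) naive forecast."""
    naive_errors = [abs(train[t] - train[t - season]) for t in range(season, len(train))]
    scale = sum(naive_errors) / len(naive_errors)
    return mae(actual, pred) / scale

train = [90, 100, 110, 100]
actual, pred = [100, 200], [110, 180]
scores = {
    "mae": mae(actual, pred),
    "rmse": rmse(actual, pred),
    "mape": mape(actual, pred),
    "smape": smape(actual, pred),
    "mase": mase(actual, pred, train),
}
```

In this example the MASE is above 1, meaning the forecast did worse on the test points than the naive method did on the training data.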

Report prediction intervals, not just point forecasts. Inventory and staffing decisions need a range ("90% chance demand is between 800 and 1,200 units"). Quantile regression, conformal prediction, or model-native intervals (Prophet, TFT) supply this. A point forecast that looks accurate on average but misses tails will still cause stockouts.
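
One simple way to attach an interval to any point forecaster is split conformal prediction on held-out residuals, sketched below with an illustrative function name. The coverage guarantee assumes the calibration errors are exchangeable with future errors, which time series often violate, so treat the resulting coverage as approximate.

```python
import math

def conformal_interval(point, calib_actual, calib_pred, coverage=0.9):
    """Symmetric interval from the absolute errors on a held-out calibration set."""
    scores = sorted(abs(a - p) for a, p in zip(calib_actual, calib_pred))
    n = len(scores)
    k = math.ceil((n + 1) * coverage)  # finite-sample corrected rank
    if k > n:
        raise ValueError("too few calibration points for this coverage level")
    q = scores[k - 1]
    return point - q, point + q

# Calibration errors of 1..9 units (forecasts of 0 against actuals 1..9).
calib_actual = list(range(1, 10))
calib_pred = [0] * 9
low, high = conformal_interval(100, calib_actual, calib_pred, coverage=0.8)
```

The calibration set should be more recent than the training data and should use forecasts made at the same horizon as the one being reported.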

Forecast horizons and granularity

The right aggregation level is a product decision, not just a modeling one. Hourly forecasts enable real-time autoscaling; monthly forecasts suit budget planning. Finer granularity increases noise and data volume; coarser granularity hides intra-day patterns you might need.

Multi-step forecasts can be direct or recursive. A direct model trains a separate head (or model) for each horizon step. A recursive model feeds its own prediction back as input for the next step; it needs only one model, but errors accumulate along the horizon. Direct methods often win at longer horizons; recursive methods are simpler to maintain.
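
The difference between the two strategies shows up even with a toy model. The sketch below assumes the simplest possible forecaster, a least-squares slope with no intercept, purely to keep the contrast visible; the function names are illustrative.

```python
def fit_slope(y, h):
    """Least-squares slope (no intercept) for y[t + h] ~ a * y[t]."""
    num = sum(y[t] * y[t + h] for t in range(len(y) - h))
    den = sum(y[t] ** 2 for t in range(len(y) - h))
    return num / den

def recursive_forecast(y, steps):
    """One one-step model, applied repeatedly to its own output."""
    a = fit_slope(y, 1)
    out, last = [], y[-1]
    for _ in range(steps):
        last = a * last  # the prediction becomes the next input
        out.append(last)
    return out

def direct_forecast(y, steps):
    """A separate model per horizon step, each fed the last observed value."""
    return [fit_slope(y, h) * y[-1] for h in range(1, steps + 1)]

y = [1, 2, 4, 8, 16, 32]
recursive = recursive_forecast(y, steps=3)
direct = direct_forecast(y, steps=3)
```

On this noiseless doubling series the two agree. With noisy data they diverge: the recursive path compounds any bias in the one-step slope, while each direct model is fit against the error at its own horizon.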

Hierarchical forecasting reconciles forecasts at multiple levels (SKU, category, total) so they add up consistently — important when finance reconciles bottom-up demand plans against top-down targets.
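
The simplest reconciliation method is bottom-up: forecast at the lowest level and sum upward, so every level is consistent by construction. The sketch below uses made-up SKU and category names; more elaborate schemes such as top-down or optimal reconciliation are not shown.

```python
def bottom_up(sku_forecasts, sku_to_category):
    """Sum SKU forecasts into category forecasts and a grand total."""
    by_category = {}
    for sku, value in sku_forecasts.items():
        category = sku_to_category[sku]
        by_category[category] = by_category.get(category, 0) + value
    return by_category, sum(by_category.values())

# Hypothetical catalog used only for illustration.
sku_forecasts = {"tee-s": 40, "tee-m": 60, "mug": 25}
sku_to_category = {"tee-s": "apparel", "tee-m": "apparel", "mug": "kitchen"}
by_category, total = bottom_up(sku_forecasts, sku_to_category)
```

Bottom-up keeps the totals coherent but inherits the noise of the SKU-level forecasts, which is why it is often compared against a separately fitted top-level model.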

Production pitfalls

Data leakage through features

Rolling means that include the current row, future-known promotion flags applied to the wrong timestamps, and global normalizers fit on the full dataset all leak information. Fit scalers and encoders only on the training window of each walk-forward split, the same discipline as fitting preprocessing on the training set alone in tabular ML.
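
As an illustration of the normalizer case, the sketch below fits a standard scaler on the training window only and contrasts it with the leaky alternative. The helper names are assumptions made for this example.

```python
from statistics import mean, pstdev

def fit_scaler(train):
    """Compute scaling statistics from the training window only."""
    mu, sigma = mean(train), pstdev(train)
    return mu, (sigma if sigma > 0 else 1.0)  # guard against constant series

def transform(values, params):
    mu, sigma = params
    return [(v - mu) / sigma for v in values]

y = [2, 4, 6, 8, 100]
train, test = y[:4], y[4:]

params = fit_scaler(train)             # correct: the test point is unseen
test_scaled = transform(test, params)

leaky_params = fit_scaler(y)           # anti-pattern: statistics include the test point
```

With the leaky fit the mean jumps from 5 to 24, so the training features already encode the size of a spike that has not happened yet.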

Cold start and new series

ARIMA needs enough history to detect seasonality. New products have no lags. Mitigations: borrow patterns from similar items (hierarchical pooling), use global models trained across all series, or fall back to naive seasonal baselines until data accumulates.
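
The last mitigation, falling back to naive baselines, can be expressed as a small cascade. This is a sketch with an illustrative function name; the `prior` argument stands in for whatever borrowed estimate is available for a brand-new item, such as the average of similar products.

```python
def fallback_forecast(history, horizon, period, prior=None):
    """Pick the best naive baseline the available history supports."""
    if len(history) >= period:
        # Seasonal naive: repeat the most recent full cycle.
        last_cycle = history[-period:]
        return [last_cycle[i % period] for i in range(horizon)]
    if history:
        # Too short for seasonality: repeat the last value.
        return [history[-1]] * horizon
    if prior is None:
        raise ValueError("no history and no prior estimate for this series")
    # No data at all: use the borrowed estimate.
    return [prior] * horizon

seasonal = fallback_forecast([1, 2, 3, 4, 5, 6, 7], horizon=4, period=3)
short = fallback_forecast([9], horizon=2, period=7)
new_item = fallback_forecast([], horizon=2, period=7, prior=12)
```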

Concept drift and regime change

A demand model trained pre-pandemic may fail after consumer behavior shifts. Monitor forecast error over rolling windows and watch for concept drift: not just a shift in the input distribution but a changed relationship between features and the target. Retrain triggers should fire on sustained degradation in a rolling error metric (MAPE, or MAE or MASE where actuals approach zero), not on single bad weeks, which may be legitimate shocks.
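
A retrain trigger of this kind can be sketched as a rolling error monitor with a patience counter. The class name, thresholds, and defaults below are illustrative assumptions, and the sketch tracks rolling MAE rather than MAPE so that it stays defined when actuals are zero.

```python
from collections import deque

class DriftMonitor:
    """Flag drift only when the rolling MAE stays above a threshold."""

    def __init__(self, baseline_mae, window=28, ratio=1.5, patience=35):
        self.threshold = ratio * baseline_mae
        self.errors = deque(maxlen=window)
        # Patience should exceed the window, otherwise one spike that
        # lingers in the window can trigger a retrain on its own.
        self.patience = patience
        self.breaches = 0

    def update(self, actual, forecast):
        """Record one error; return True when a retrain should fire."""
        self.errors.append(abs(actual - forecast))
        if len(self.errors) == self.errors.maxlen:
            rolling_mae = sum(self.errors) / len(self.errors)
            self.breaches = self.breaches + 1 if rolling_mae > self.threshold else 0
        return self.breaches >= self.patience

def run(errors):
    monitor = DriftMonitor(baseline_mae=10, window=2, ratio=1.5, patience=3)
    return [monitor.update(e, 0) for e in errors]

spike = run([10, 10, 50, 10, 10])  # one bad period, then recovery
shift = run([10, 10, 50, 50, 50])  # error stays elevated
```

The single spike never fires the trigger, while the sustained shift fires on the third consecutive breach.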

Operational concerns

Batch nightly retraining differs from online one-step updates. Store model versions, training data snapshots, and feature definitions so you can reproduce any published forecast. Latency matters if forecasts feed real-time autoscaling — a TFT ensemble may be too slow where a precomputed lookup table suffices.

Production checklist

Define the forecast horizon and granularity to match the business decision (hourly scale-out vs quarterly budget).
Decompose a sample series into trend, seasonality, and residual — sanity-check whether univariate methods are sufficient.
Establish naive baselines (last value, seasonal naive, moving average) before claiming ML wins.
Engineer lag, rolling, and calendar features without leakage; document which features require known-future inputs.
Validate with walk-forward splits at the production horizon — never random shuffle.
Report MAE/RMSE and a scale-free metric (MASE or sMAPE) plus prediction intervals.
Plan for cold-start series and hierarchical reconciliation if forecasting at multiple aggregation levels.
Monitor rolling forecast error and retrain on drift; version models and training data for auditability.

Key takeaways

Time order is sacred — autocorrelation is signal, but random train/test splits leak the future and lie about accuracy.
Decompose before you model — trend and seasonality explain most business series; classical methods remain strong baselines.
ML needs temporal features — lags, rolling stats, and calendar encodings let tree models compete with ARIMA on rich datasets.
Validate walk-forward at the real horizon — one-step accuracy does not guarantee 30-day forecasts.
Ship intervals and monitor drift — point forecasts alone hide tail risk; sustained error spikes demand retraining.