Feature engineering explained

In tabular machine learning, the model only sees numbers you hand it. A signup timestamp, a product category string, and a free-text review are not inherently predictive — but hours_since_signup, category_is_electronics, and review_sentiment_score can be. Feature engineering is the craft of transforming raw columns into those signals: scaling magnitudes, encoding categories, deriving time-based patterns, building interaction terms, and documenting everything so training and production serving stay identical. Deep networks learn representations automatically, yet even LLM pipelines still depend on engineered metadata (user tier, document type, retrieval scores). This guide covers the core techniques, the leakage traps that inflate offline metrics, and how feature stores keep teams honest at scale.

What features are and why engineering matters

A feature is one input dimension the model uses to predict a target. Labels are outcomes observed later (will this user churn? is this email spam?); features must be computable at prediction time from only the information available then. Good features compress domain knowledge into a form a learner can exploit: monotonic relationships that linear models can fit, separable clusters, or sparse indicators that tree ensembles can split on.

Why bother when deep learning learns embeddings end-to-end? Three reasons remain relevant in 2026:

  • Tabular data still dominates fraud, credit, pricing, churn, and ops forecasting — gradient-boosted trees on engineered features often beat generic neural nets on structured rows.
  • Data efficiency — when you have thousands of rows, not millions, thoughtful features beat throwing raw columns at a large model.
  • Interpretability and compliance — regulated industries need to explain why a score changed; engineered features map to business concepts.

Feature engineering is iterative: train a baseline, inspect errors, add a feature that fixes a visible failure mode, repeat. The best teams treat the feature catalog as versioned product code, not notebook scratch work.

Numeric features: scaling and transforms

Models sensitive to magnitude — logistic regression, k-nearest neighbors, neural nets with unnormalized inputs — need features on comparable scales. Common approaches:

  • Standardization (z-score) — subtract mean, divide by standard deviation. Works when distributions are roughly Gaussian.
  • Min-max scaling — squash to [0, 1]. Sensitive to outliers; use when bounds are meaningful (percentages).
  • Robust scaling — median and interquartile range instead of mean/std. Better for heavy-tailed revenue or latency columns.
  • Log and power transforms — log1p(price) tames skewed spend; Box-Cox (strictly positive values only) or Yeo-Johnson when you need automated lambda search.
  • Binning / bucketing — convert continuous age into decade buckets when the relationship is nonlinear and you want tree-friendly splits.

Fit scalers on the training set only, then apply the same parameters to validation, test, and live traffic. Fitting on the full dataset before holdout evaluation leaks distribution statistics (means, variances, quantiles) from the validation and test partitions into training, a subtle form of data leakage.
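
The fit-on-train, apply-everywhere rule can be sketched in plain Python. This is a minimal illustration, not a library API: fit_standardizer, apply_standardizer, and the latency values are hypothetical names and data chosen for the example.

```python
import statistics

def fit_standardizer(train_values):
    """Learn mean and standard deviation from the training column only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    # Guard against a constant column, which would otherwise divide by zero.
    return {"mean": mean, "std": std if std > 0 else 1.0}

def apply_standardizer(values, params):
    """Apply previously fitted parameters; nothing is refit here."""
    return [(v - params["mean"]) / params["std"] for v in values]

# Hypothetical latency column, already split into train and test.
train_latency_ms = [100.0, 120.0, 80.0, 100.0]
test_latency_ms = [140.0, 60.0]

params = fit_standardizer(train_latency_ms)        # fit once, on train
train_scaled = apply_standardizer(train_latency_ms, params)
test_scaled = apply_standardizer(test_latency_ms, params)  # reuse, do not refit
```

Persisting params alongside the model is what lets live traffic be scaled with exactly the values seen in training.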

Missing values

Do not blindly drop rows or fill with zero. Options depend on why data is missing:

  • Missingness indicator — add income_is_missing as its own feature; sometimes absence is predictive.
  • Imputation — median for numeric, mode for categorical, or model-based imputation fit inside cross-validation folds.
  • Learned embeddings — neural models can use a dedicated "unknown" bucket; tree models often prefer explicit sentinel values with indicators.
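
As a rough sketch of the first two options combined, the snippet below fills a numeric column with the training median and keeps a missingness indicator next to it. The function names and income values are made up for illustration, and None stands in for whatever null marker your data uses.

```python
import statistics

def fit_median_imputer(train_values):
    """Median of the observed training values; None marks a missing entry."""
    observed = [v for v in train_values if v is not None]
    return statistics.median(observed)

def impute_with_indicator(values, fill_value):
    """Return (filled_value, is_missing) pairs so absence stays visible."""
    return [(fill_value, 1) if v is None else (v, 0) for v in values]

# Hypothetical income column with gaps.
train_income = [40_000, None, 55_000, 70_000, None]
serving_income = [None, 62_000]

income_median = fit_median_imputer(train_income)   # learned from train only
train_features = impute_with_indicator(train_income, income_median)
serving_features = impute_with_indicator(serving_income, income_median)
```

Inside cross-validation, fit_median_imputer would be called once per fold on that fold's training rows.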

Categorical encoding

Strings and enums must become numbers. The encoding choice affects cardinality, memory, and leakage risk:

  • One-hot encoding — one binary column per category. Fine for low cardinality (country, day-of-week). Explodes with high-cardinality product SKUs unless you cap rare levels to __OTHER__.
  • Ordinal encoding — integer codes when order matters (education level: high school < bachelor < PhD). Avoid arbitrary integer codes for nominal categories: linear and distance-based models will treat red=1, blue=2 as if blue were twice red, and trees will split on an ordering that means nothing.
  • Target encoding (mean encoding) — replace each category with the historical mean target in training data. Powerful for tree models on high-cardinality IDs, but must be computed inside cross-validation folds or you leak the label into features.
  • Hashing trick — map categories to a fixed number of buckets via hash function. Collisions trade accuracy for memory; common in click-through rate models with billions of ad IDs.
  • Embedding layers — in neural nets, learn dense vectors per category, analogous to how transfer learning reuses pretrained representations.
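
Two of these encodings are small enough to sketch directly: one-hot with rare levels capped to __OTHER__, and the hashing trick. The helper names, the min_count threshold, and the choice of SHA-256 are assumptions for this example; here unseen levels are folded into __OTHER__, though a separate __UNKNOWN__ column is an equally reasonable design.

```python
import hashlib
from collections import Counter

OTHER = "__OTHER__"

def fit_one_hot_vocab(train_categories, min_count):
    """Keep levels seen at least min_count times; the rest share OTHER."""
    counts = Counter(train_categories)
    kept = sorted(c for c, n in counts.items() if n >= min_count)
    return kept + [OTHER]

def one_hot(category, vocab):
    """Rare and unseen levels both land in the final OTHER column."""
    key = category if category in vocab[:-1] else OTHER
    return [1 if level == key else 0 for level in vocab]

def hash_bucket(category, n_buckets):
    """Stable bucket index. Python's built-in hash() is salted per process
    for strings, so a cryptographic digest is used to stay reproducible."""
    digest = hashlib.sha256(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

# Hypothetical SKU column: "C" appears once and is treated as rare.
train_skus = ["A", "A", "B", "B", "B", "C"]
vocab = fit_one_hot_vocab(train_skus, min_count=2)

row_known = one_hot("A", vocab)
row_rare = one_hot("C", vocab)
row_unseen = one_hot("never_seen", vocab)
bucket = hash_bucket("ad_123", n_buckets=16)
```

The hashed index never fails on new IDs, at the cost of unrelated categories sometimes sharing a bucket.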

Always handle unseen categories at serve time: map to __UNKNOWN__, use the global mean for target encoding, or rely on hash buckets that never fail — but document the fallback.
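
Fold-safe target encoding with a global-mean fallback might look like the following. It is a simplified sketch: folds are assigned by row index modulo n_folds (real pipelines usually shuffle or split by time), and every function and variable name is hypothetical.

```python
from collections import defaultdict

def category_means(rows):
    """Mean target per category from (category, target) pairs."""
    sums, counts = defaultdict(float), defaultdict(int)
    for category, target in rows:
        sums[category] += target
        counts[category] += 1
    return {c: sums[c] / counts[c] for c in sums}

def out_of_fold_target_encode(rows, n_folds):
    """Encode each row using means computed from the other folds only."""
    encoded = [None] * len(rows)
    for fold in range(n_folds):
        fit_rows = [r for i, r in enumerate(rows) if i % n_folds != fold]
        means = category_means(fit_rows)
        fallback = sum(t for _, t in fit_rows) / len(fit_rows)
        for i, (category, _) in enumerate(rows):
            if i % n_folds == fold:
                # The row's own label never contributes to its encoding.
                encoded[i] = means.get(category, fallback)
    return encoded

def encode_for_serving(category, means, global_mean):
    """Unseen categories fall back to the global training mean."""
    return means.get(category, global_mean)

# Hypothetical (merchant_id, is_fraud) training pairs.
train_rows = [("m1", 1), ("m1", 0), ("m2", 0), ("m2", 0), ("m1", 1), ("m3", 1)]
train_encoded = out_of_fold_target_encode(train_rows, n_folds=2)

# For serving, means come from the full training set.
serving_means = category_means(train_rows)
global_mean = sum(t for _, t in train_rows) / len(train_rows)
unseen_value = encode_for_serving("never_seen", serving_means, global_mean)
```

Note that m3 appears only once, so its training row receives the fallback rather than its own label.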

Datetime, text, and derived features

Timestamps hide structure until you extract it:

  • Calendar parts — hour, day-of-week, week-of-year, is_weekend, is_holiday (join a holiday table).
  • Cyclical encoding — represent hour as sin(2π·hour/24) and cos(...) so 23:00 is close to 00:00.
  • Elapsed time — days_since_last_purchase, account_age_days. Ensure the reference "now" at training matches production (prediction timestamp, not row ingestion time).
  • Recency-weighted aggregates — rolling 7-day click count, exponential moving average of spend.
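
A minimal sketch of calendar parts, cyclical hour encoding, and elapsed time, using only the standard library. The function names and timestamps are illustrative; the reference "now" is passed in explicitly as prediction_time so the same code works offline and online.

```python
import math
from datetime import datetime, timezone

def cyclical_hour(hour):
    """Map an hour in 0..23 onto the unit circle."""
    angle = 2 * math.pi * hour / 24
    return (math.sin(angle), math.cos(angle))

def datetime_features(prediction_time, last_purchase_time):
    """Derive features relative to the prediction timestamp."""
    hour_sin, hour_cos = cyclical_hour(prediction_time.hour)
    day_of_week = prediction_time.weekday()  # Monday = 0, Sunday = 6
    return {
        "hour": prediction_time.hour,
        "day_of_week": day_of_week,
        "is_weekend": int(day_of_week >= 5),
        "hour_sin": hour_sin,
        "hour_cos": hour_cos,
        # Whole days elapsed; partial days are truncated.
        "days_since_last_purchase": (prediction_time - last_purchase_time).days,
    }

# Hypothetical timestamps, both timezone-aware in UTC.
prediction_time = datetime(2026, 1, 3, 23, 0, tzinfo=timezone.utc)
last_purchase_time = datetime(2025, 12, 24, 9, 0, tzinfo=timezone.utc)
features = datetime_features(prediction_time, last_purchase_time)

# On the circle, 23:00 sits next to 00:00 and far from 12:00.
near = math.dist(cyclical_hour(23), cyclical_hour(0))
far = math.dist(cyclical_hour(23), cyclical_hour(12))
```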

For text columns, classical pipelines use TF-IDF n-grams, character n-grams for typos, or hand-crafted regex flags (contains "urgent", has URL). Modern stacks often replace raw text with embedding vectors plus lightweight metadata (language, length, entity counts from NER). Hybrid approaches — embeddings for semantics, engineered flags for compliance rules — are common in production search and moderation.
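
The hand-crafted flags mentioned above are often a few regular expressions plus length counts. The patterns and feature names below are illustrative assumptions, not a recommended rule set.

```python
import re

# Illustrative patterns; real moderation rules would be broader.
URL_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)
URGENT_PATTERN = re.compile(r"\burgent\b", re.IGNORECASE)

def text_flags(text):
    """Cheap metadata features that can sit alongside an embedding vector."""
    return {
        "contains_urgent": int(bool(URGENT_PATTERN.search(text))),
        "has_url": int(bool(URL_PATTERN.search(text))),
        "char_length": len(text),
        "word_count": len(text.split()),
    }

flags = text_flags("URGENT: verify your account at http://example.com now")
clean = text_flags("See you at lunch")
```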

Interaction and aggregation features

Single columns rarely capture joint effects. Useful patterns:

  • Cross terms — price_per_sqft = price / sqft, cart_value_times_discount.
  • Group aggregates — mean order value per user, fraud rate per merchant (again: compute on training folds only to avoid leakage).
  • Rank and percentile — user's spend percentile within cohort normalizes for macro trends.
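
The three patterns above can be sketched as small helpers. All names and numbers are hypothetical, and the group means are fitted on training rows only, in line with the leakage caveat.

```python
from bisect import bisect_right
from collections import defaultdict

def price_per_sqft(price, sqft):
    """Ratio feature; returns None when the denominator is unusable."""
    return price / sqft if sqft and sqft > 0 else None

def fit_group_means(train_rows, key, value):
    """Mean of `value` per `key`, computed from training rows only."""
    totals, counts = defaultdict(float), defaultdict(int)
    for row in train_rows:
        totals[row[key]] += row[value]
        counts[row[key]] += 1
    return {k: totals[k] / counts[k] for k in totals}

def percentile_rank(x, sorted_reference):
    """Fraction of reference values less than or equal to x."""
    return bisect_right(sorted_reference, x) / len(sorted_reference)

# Hypothetical training orders and cohort spend values.
train_orders = [
    {"user": "u1", "order_value": 20.0},
    {"user": "u1", "order_value": 40.0},
    {"user": "u2", "order_value": 100.0},
]
mean_order_value = fit_group_means(train_orders, "user", "order_value")
cohort_spend = sorted([10.0, 20.0, 30.0, 40.0])

ratio = price_per_sqft(300_000, 1_500)
spend_rank = percentile_rank(30.0, cohort_spend)
```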

Tree ensembles (XGBoost, LightGBM, CatBoost) discover some interactions automatically, but explicit domain ratios still help when the relationship is obvious and data is sparse.

Data leakage and train-serve skew

Leakage means the model sees information at training time that would not exist at prediction time — or that directly encodes the label. Offline AUC looks stellar; production fails silently.

Common leakage sources:

  • Target leakage — including post-outcome fields ("refund_issued" to predict "will_churn").
  • Temporal leakage — random train/test split on time-series data lets the model train on future rows. Use chronological splits or rolling-origin validation.
  • Aggregate leakage — computing global category means on the full dataset before splitting.
  • Duplicate entities — the same user appearing in both train and test, even on different rows, still leaks behavioral signal. Split by entity ID when rows from one entity are correlated.
  • Pipeline leakage — fitting imputers or PCA on train+test combined.
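
Two of these sources are addressed purely by how the data is split. The sketch below shows a chronological split and an entity-level split; the row layout, key names, and cutoff are assumptions made for the example.

```python
def chronological_split(rows, time_key, cutoff):
    """Rows strictly before cutoff train; rows at or after cutoff test."""
    train = [r for r in rows if r[time_key] < cutoff]
    test = [r for r in rows if r[time_key] >= cutoff]
    return train, test

def group_split(rows, group_key, test_groups):
    """Keep every row of an entity on one side of the split."""
    train = [r for r in rows if r[group_key] not in test_groups]
    test = [r for r in rows if r[group_key] in test_groups]
    return train, test

# Hypothetical event log: one row per user action, "day" as the time index.
events = [
    {"user": "u1", "day": 1},
    {"user": "u2", "day": 2},
    {"user": "u1", "day": 3},
    {"user": "u3", "day": 4},
]

train_t, test_t = chronological_split(events, "day", cutoff=3)
train_g, test_g = group_split(events, "user", test_groups={"u1"})
```

The two can be combined when data is both temporal and entity-correlated: choose held-out entities first, then apply the cutoff.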

Train-serve skew is the cousin of leakage: training code paths differ from production. Maybe notebooks use pandas string ops but the API serves protobuf enums; maybe training fills nulls with median but live traffic sends null through unchanged. Mitigations:

  • One shared transformation library invoked in both batch training and online inference.
  • Feature stores that log point-in-time correct aggregates (see below).
  • Contract tests comparing training batch output schema to serving output for sample rows.
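
One way to picture the shared library and its contract test: both the batch path and the online path call the same transform function, and a check confirms they agree on sample rows. Everything here (transform, training_batch, serve_one, the parameter names) is a hypothetical stand-in for your own module.

```python
UNKNOWN = "__UNKNOWN__"

def transform(raw, params):
    """Single source of truth for feature logic, used offline and online."""
    income = raw.get("income")
    country = raw.get("country")
    return {
        "income": params["income_median"] if income is None else income,
        "income_is_missing": int(income is None),
        "country": country if country in params["countries"] else UNKNOWN,
    }

def training_batch(rows, params):
    """Offline path: transform a whole batch."""
    return [transform(r, params) for r in rows]

def serve_one(request, params):
    """Online path: transform a single request."""
    return transform(request, params)

def check_contract(rows, params):
    """Batch output and per-request output must match row for row."""
    return training_batch(rows, params) == [serve_one(r, params) for r in rows]

# Parameters fitted during training and shipped with the model bundle.
params = {"income_median": 55_000, "countries": {"DE", "US"}}
samples = [
    {"income": None, "country": "DE"},
    {"income": 100, "country": "FR"},
    {},  # a request with every field absent
]
contract_ok = check_contract(samples, params)
```

With a single shared function the check passes trivially; it earns its keep once the serving path is reimplemented in another language or runtime and the same sample rows are replayed through both.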

Feature selection and validation discipline

More features are not always better. High dimensionality increases overfitting risk and serving cost. Prune with purpose:

  • Filter methods — correlation with target, mutual information (fast screening).
  • Embedded methods — L1 regularization, tree feature importance (watch for correlated feature instability).
  • Recursive elimination — iteratively drop low-importance columns and re-evaluate on a proper validation set.
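
As an illustration of a filter method, the sketch below screens columns by absolute Pearson correlation with the target, computed on training data. The threshold and column names are arbitrary choices for the example, and correlation only captures linear association.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def screen_features(columns, target, min_abs_corr):
    """Names of columns whose |correlation| with the target meets the bar."""
    scores = {name: pearson(values, target) for name, values in columns.items()}
    return sorted(name for name, s in scores.items() if abs(s) >= min_abs_corr)

# Hypothetical training columns and binary target.
train_columns = {
    "hours_since_signup": [1.0, 2.0, 3.0, 4.0],
    "constant_flag": [1.0, 1.0, 1.0, 1.0],
    "noise": [1.0, -1.0, -1.0, 1.0],
}
train_target = [0.0, 0.0, 1.0, 1.0]

kept = screen_features(train_columns, train_target, min_abs_corr=0.3)
```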

Never select features using test-set performance repeatedly — that becomes indirect overfitting. Hold out a final test set untouched until you ship, or use nested cross-validation when data is small.

Feature stores and production pipelines

At scale, teams centralize definitions in a feature store (Feast, Tecton, Vertex Feature Store, etc.). Benefits:

  • Point-in-time correctness — joins historical feature values as they were known at each label timestamp, which keeps future data out of training backfills.
  • Reuse — one "user_30d_spend" definition powers training, batch scoring, and real-time APIs.
  • Monitoring — drift detection on feature distributions alerts before model quality collapses.
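
Point-in-time correctness reduces to an "as of" lookup: for each label timestamp, take the latest feature value recorded at or before it. The sketch below shows the idea with a sorted in-memory log; it is not the API of any particular feature store, and whether a value stamped exactly at the label time counts as known is a design choice you would need to settle.

```python
from bisect import bisect_right

def point_in_time_value(feature_log, label_time):
    """Latest value recorded at or before label_time, else None.

    feature_log is a list of (timestamp, value) pairs sorted by timestamp.
    """
    timestamps = [t for t, _ in feature_log]
    idx = bisect_right(timestamps, label_time)
    return feature_log[idx - 1][1] if idx > 0 else None

# Hypothetical log of user_30d_spend, with integer timestamps for brevity.
user_30d_spend_log = [(1, 10.0), (5, 25.0), (9, 40.0)]

value_at_6 = point_in_time_value(user_30d_spend_log, 6)   # sees the t=5 value
value_at_0 = point_in_time_value(user_30d_spend_log, 0)   # nothing known yet
value_at_9 = point_in_time_value(user_30d_spend_log, 9)   # includes t=9 itself
```

A label at time 6 never sees the value written at time 9, which is exactly the future data a naive latest-value join would pull in.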

Whether or not you adopt a store, version feature definitions alongside model artifacts. When debugging a bad deployment, you need to know which version of each encoding shipped with which model, for example that country encoding v3 went out with model v17.

Production checklist

  1. Document each feature's definition, source table, refresh cadence, and whether it is allowed at prediction time.
  2. Fit all transformers (scalers, encoders, vocabularies) on training data only; persist parameters with the model bundle.
  3. Handle unseen categories and nulls explicitly in serving code — no silent pandas defaults.
  4. Use time-aware splits for any data with temporal ordering; never shuffle future into train.
  5. Compute target encodings and group aggregates inside cross-validation folds.
  6. Share one transformation module between offline training and online inference.
  7. Add contract tests: given fixed raw input, training pipeline output equals serving output.
  8. Monitor feature drift (PSI, the population stability index, or KL divergence) and null-rate spikes in production.
  9. Cap high-cardinality categoricals; bucket rare levels to __OTHER__.
  10. Review error cases quarterly — new engineered features should fix real failure clusters, not chase leaderboard noise.
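
For item 8, the population stability index compares binned feature distributions between training and production. A minimal sketch follows, assuming fixed bin edges chosen from the training data and a small epsilon to keep empty bins out of the logarithm; the names and numbers are illustrative.

```python
import math

def bin_fractions(values, edges):
    """Share of values in each bin; edges are sorted interior cut points."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        idx = sum(1 for e in edges if v >= e)  # number of edges at or below v
        counts[idx] += 1
    return [c / len(values) for c in counts]

def psi(expected_values, actual_values, edges, eps=1e-6):
    """Sum over bins of (actual - expected) * ln(actual / expected)."""
    expected = bin_fractions(expected_values, edges)
    actual = bin_fractions(actual_values, edges)
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # avoid log(0) and division by zero
        total += (a - e) * math.log(a / e)
    return total

# Hypothetical spend feature: training sample vs. two production samples.
training_spend = [5, 15, 25, 35, 45, 55, 65, 75]
stable_spend = [6, 14, 26, 34, 46, 54, 66, 74]
shifted_spend = [65, 70, 75, 80, 85, 90, 95, 100]
edges = [20, 40, 60]

psi_stable = psi(training_spend, stable_spend, edges)
psi_shifted = psi(training_spend, shifted_spend, edges)
```

The stable sample lands in the same bins as training and scores zero, while the shifted sample piles into the top bin and scores far higher; the alerting threshold is something to calibrate per feature.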

Key takeaways

  • Features encode domain knowledge — raw columns are a starting point, not the finished product.
  • Encoding and scaling choices matter — especially for linear models, distance metrics, and neural nets.
  • Leakage inflates offline metrics — temporal splits and fold-safe target encoding are non-negotiable.
  • Train-serve parity — the production path must run the same transforms as training.
  • Deep learning reduces but does not eliminate engineering — metadata, retrieval scores, and tabular side channels still need careful design.
