
Feature selection explained

Your fraud model ingests 180 columns after feature engineering — device fingerprints, velocity counters, merchant categories, and dozens of interaction terms. Offline AUC looks strong, but production latency spikes and the model memorizes rare combinations that never repeat. Most columns add noise, not signal. Feature selection chooses a smaller subset of inputs that preserves predictive power while cutting overfitting risk, training cost, and serving complexity. It is distinct from engineering (creating features) and from dimensionality reduction (transforming features into new axes). Selection keeps original columns and drops the rest. Done inside proper cross-validation folds, it improves generalization. Done on the full training set before splitting, it leaks information and inflates validation scores. This guide covers filter, wrapper, and embedded methods, multicollinearity, high-dimensional pitfalls, a worked credit-scoring example, a method decision table, common mistakes, and a practitioner checklist.

Why feature selection matters

The curse of dimensionality means that as you add features, data becomes sparse in high-dimensional space: distances between points concentrate, so distance metrics stop separating classes, and models can need exponentially more rows to generalize. Irrelevant features also increase variance, because the model fits noise in columns that have no real relationship to the label.
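
The distance-concentration effect can be checked numerically. The sketch below assumes uniformly random points (an illustrative choice, not data from this guide) and measures how much farther the farthest neighbor is than the nearest one; `distance_contrast` is a hypothetical helper name.

```python
import numpy as np

def distance_contrast(n_points: int, n_dims: int, seed: int = 0) -> float:
    """Relative gap between the farthest and nearest point from a random query."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(size=(n_points, n_dims))
    query = rng.uniform(size=n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return float((dists.max() - dists.min()) / dists.min())

# Same number of rows, growing number of columns.
contrast_by_dim = {d: distance_contrast(500, d) for d in (2, 20, 200, 2000)}
for d, c in contrast_by_dim.items():
    print(f"dims={d:>4}  relative contrast={c:.2f}")
```

As the column count grows with the row count held fixed, the contrast shrinks toward zero, which is why nearest-neighbor style reasoning degrades on wide tables.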

Practical benefits of a lean feature set:

  • Generalization — fewer degrees of freedom, less room to memorize training quirks.
  • Training speed — fit time for gradient boosting and linear solvers grows with feature count.
  • Serving cost — fewer upstream computations and smaller model payloads at inference.
  • Interpretability — compliance teams can audit twelve features more easily than two hundred.
  • Stability — redundant correlated columns cause coefficient or importance scores to swing between retrains.

Feature selection is most valuable on tabular structured data where you control the column list. Deep vision and language models learn internal representations; selection there usually means architecture pruning or attention analysis, not dropping raw pixels.

Three families of selection methods

Textbooks group techniques by when the learning algorithm participates in the choice. Each family trades compute for optimality.

Filter methods (score features independently)

Filters rank each column using a statistical test that ignores the final model. They are fast and scale to thousands of columns, but treat features in isolation — two weak individual predictors that interact strongly may both rank low.

  • Variance threshold — drop near-constant columns (e.g. 99.9% zeros).
  • Correlation with target — Pearson for regression, point-biserial for binary labels.
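  • Chi-square / ANOVA F-test — chi-square tests independence between non-negative categorical (or count) features and the class; the ANOVA F-test compares a numeric feature's mean across classes.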
  • Mutual information — captures nonlinear dependence; useful when relationships are not linear.

Typical workflow: compute scores on the training fold only, keep top-k or all features above a threshold, then train the model. Filters are a strong first pass before expensive wrappers.
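
The workflow above could look roughly like this in scikit-learn. This is a minimal sketch on synthetic data from `make_classification`; the dataset, `k=10`, and the step names are illustrative assumptions. Because the filters sit inside the pipeline, `cross_val_score` refits them on each training fold.

```python
from functools import partial

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, VarianceThreshold, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a tabular dataset (assumption, not real data).
X, y = make_classification(n_samples=600, n_features=40, n_informative=6,
                           n_redundant=4, n_clusters_per_class=1, random_state=0)
X = np.hstack([X, np.zeros((X.shape[0], 1))])  # append one constant column

filter_pipeline = Pipeline([
    ("variance", VarianceThreshold()),  # drops zero-variance columns
    ("mi_top_k", SelectKBest(partial(mutual_info_classif, random_state=0), k=10)),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Filters are refit on the training part of every fold.
cv_auc = cross_val_score(filter_pipeline, X, y, cv=5, scoring="roc_auc")

# Refit on all training data to read off the surviving columns.
filter_pipeline.fit(X, y)
n_after_variance = int(filter_pipeline.named_steps["variance"].get_support().sum())
kept = filter_pipeline.named_steps["mi_top_k"].get_support(indices=True)
print(f"CV AUC {cv_auc.mean():.3f}; {n_after_variance} columns after variance filter; kept {len(kept)}")
```

Note that `kept` indexes the columns that survive the variance step, not the original column positions.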

Wrapper methods (search subsets by model performance)

Wrappers treat feature subsets as a search problem. They train the target model (or a proxy) on candidate subsets and keep the combination with the best validation score. They are accurate but combinatorially expensive: exhaustive search over 30 features already means 2^30 - 1, roughly 1.07 billion, non-empty subsets.

  • Forward selection — start empty, greedily add the feature that most improves validation metric.
  • Backward elimination — start with all features, drop the least useful one per step.
  • Recursive Feature Elimination (RFE) — train model, remove lowest-importance features, repeat until k remain.
  • Sequential feature selector (SFS) — sklearn implementation with forward/backward and cross-validated scoring.

Use wrappers when you have moderate feature counts (roughly 20–80) and offline compute budget. Always nest selection inside cross-validation — never pick features on the full dataset then evaluate on a holdout drawn from the same rows.
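
As a rough illustration of a wrapper, the sketch below runs cross-validated forward selection with scikit-learn's `SequentialFeatureSelector`. The synthetic dataset, the logistic-regression estimator, and the target of 4 features are assumptions made for the example. It also computes the exhaustive-search subset count for comparison.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Exhaustive search cost: every non-empty subset of 30 features.
n_subsets_30 = 2 ** 30 - 1
print(f"non-empty subsets of 30 features: {n_subsets_30:,}")

# Synthetic data; with shuffle=False the 4 informative columns come first.
X, y = make_classification(n_samples=400, n_features=12, n_informative=4,
                           n_redundant=0, n_clusters_per_class=1,
                           shuffle=False, random_state=0)

sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=4,
    direction="forward",   # "backward" gives backward elimination
    scoring="roc_auc",
    cv=5,
)
sfs.fit(X, y)
selected = sfs.get_support(indices=True)
print("selected column indices:", selected)
```

Greedy search here fits 12 + 11 + 10 + 9 candidate models per fold instead of enumerating subsets, which is the trade the wrapper family makes. To evaluate the resulting model honestly, the selector itself would still need to sit inside an outer cross-validation loop.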

Embedded methods (selection during training)

Embedded methods bake selection into the learning algorithm itself. The model and the subset choice happen in one step.

  • L1 regularization (Lasso) — drives logistic or linear regression coefficients to exactly zero, effectively removing features.
  • Elastic Net — mixes L1 and L2; keeps correlated groups instead of picking one arbitrarily.
  • Tree-based importance — split gain or permutation importance from random forests and gradient boosting; drop features below a threshold.
  • Regularized boosting — XGBoost/LightGBM with strong L1 penalties on leaf weights.

Embedded methods are the production sweet spot for tabular ML: one training pass, reasonable cost, and selection aligned with the model you actually deploy.
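
A minimal sketch of embedded selection with an L1-penalized logistic regression follows. The synthetic data, `C=0.05`, and the explicit `threshold=1e-5` are illustrative assumptions; in practice `C` would be tuned inside cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; with shuffle=False the 5 informative columns come first.
X, y = make_classification(n_samples=500, n_features=30, n_informative=5,
                           n_redundant=0, n_clusters_per_class=1,
                           shuffle=False, random_state=0)

# Smaller C means a stronger L1 penalty and more coefficients at exactly zero.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)

# Scale first so the penalty treats all columns comparably.
selector = make_pipeline(StandardScaler(), SelectFromModel(l1_model, threshold=1e-5))
selector.fit(X, y)

mask = selector[-1].get_support()  # True where the coefficient survived
print(f"kept {int(mask.sum())} of {mask.size} features")
```

The same `SelectFromModel` wrapper accepts tree ensembles, in which case the threshold applies to `feature_importances_` rather than coefficients.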

Multicollinearity and redundant features

Two columns measuring the same concept — annual_income and monthly_income × 12 — inflate variance in linear models and split importance across both in tree ensembles. High pairwise correlation (|r| > 0.9) is a signal to drop or combine.

Strategies:

  • Correlation clustering — group features by correlation matrix, keep one representative per cluster.
  • Variance Inflation Factor (VIF) — for linear models, drop features with VIF above 5–10.
  • Domain deduplication — prefer the feature that is cheaper to compute at serving time.
  • PCA or embeddings — when you need all information but not all columns; that is reduction, not selection.

Tree models tolerate correlated features better than linear models, but redundancy still hurts interpretability and can destabilize SHAP explanations across retrains.
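
The correlation and VIF checks above can be sketched with pandas and NumPy. The column values are synthetic, `drop_correlated` is a hypothetical helper, and column order stands in for preference (for example, cheaper-to-serve columns first). The VIF line uses the identity that VIFs are the diagonal of the inverse correlation matrix.

```python
import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.9) -> list[str]:
    """Keep columns in order, skipping any whose |r| with a kept column exceeds threshold."""
    corr = df.corr().abs()
    kept: list[str] = []
    for col in df.columns:
        if all(corr.loc[col, k] <= threshold for k in kept):
            kept.append(col)
    return kept

# Synthetic example: annual income is monthly income times 12 plus small noise.
rng = np.random.default_rng(0)
monthly = rng.normal(5000, 1500, size=1000)
df = pd.DataFrame({
    "monthly_income": monthly,
    "annual_income": monthly * 12 + rng.normal(0, 500, size=1000),
    "utilization_ratio": rng.uniform(0, 1, size=1000),
})

kept = drop_correlated(df, threshold=0.9)

# VIF_j = 1 / (1 - R_j^2), read from the inverse correlation matrix.
vif = pd.Series(np.diag(np.linalg.inv(df.corr().to_numpy())), index=df.columns)
print(kept)
print(vif.round(1))
```

In a real pipeline both checks would be computed on training folds only, like any other selection step.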

Leakage and cross-validation discipline

The most common feature-selection mistake is fitting the selector on all training data (including rows that later appear in validation folds). The selector sees label information from validation rows indirectly, and your reported AUC is optimistic.

Correct pattern: wrap the entire pipeline — imputation, scaling, feature selection, model — in a single sklearn Pipeline and run cross-validation on that pipeline. Selection happens independently within each training fold; validation folds are untouched until scoring.

  • Never use test-set labels to rank features, even once.
  • Fit scalers and selectors only on training folds, transform validation with those fitted objects.
  • Log the final feature list from a refit on full training data after hyperparameter tuning — not before.
  • Monitor selected features across retrains; wild churn signals unstable selection or data drift.
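
The leak is easy to reproduce. In this sketch the features are pure noise and the labels are random, so any honest AUC should hover near 0.5. The sizes (100 rows, 2000 columns, top 20 by F-test) are arbitrary choices for the demonstration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))   # pure noise features
y = rng.integers(0, 2, size=100)   # labels unrelated to X

# Leaky: the selector sees every row's label before cross-validation starts.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_auc = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y,
                            cv=5, scoring="roc_auc").mean()

# Correct: the selector lives in the pipeline and is refit on each training fold.
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000))
honest_auc = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()

print(f"leaky AUC:  {leaky_auc:.2f}")
print(f"honest AUC: {honest_auc:.2f}")
```

The only difference between the two runs is where the selector is fitted, which is why the pipeline form is the safer default.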

High-dimensional settings (p >> n)

When you have more features than rows — genomics, text bag-of-words, wide event logs — unregularized selection overfits aggressively. A filter pass followed by L1-regularized logistic regression (or stability selection: rerun Lasso on bootstrap samples and keep features selected in most runs) is the standard recipe.

Avoid forward wrapper search when p is in the thousands; each subset evaluation retrains a model and multiplies compute. Mutual-information filters plus embedded L1 scale better. For text, prefer learned embeddings over selecting individual vocabulary tokens.
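
Stability selection as described above could be sketched like this. The function name, the 30 bootstrap rounds, `C=0.1`, and the 70% keep threshold are all assumptions for illustration, and the synthetic features are assumed to be on comparable scales already.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def stability_selection(X, y, n_bootstraps=30, C=0.1, keep_fraction=0.7, seed=0):
    """Refit an L1 model on bootstrap samples; keep features selected in most runs."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = X.shape
    counts = np.zeros(n_cols)
    for _ in range(n_bootstraps):
        idx = rng.choice(n_rows, size=n_rows, replace=True)  # bootstrap sample
        model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
        model.fit(X[idx], y[idx])
        counts += model.coef_[0] != 0  # count features with nonzero weight
    freq = counts / n_bootstraps
    return np.flatnonzero(freq >= keep_fraction), freq

# p >> n: 80 rows, 500 columns, 5 informative columns placed first.
X, y = make_classification(n_samples=80, n_features=500, n_informative=5,
                           n_redundant=0, n_clusters_per_class=1,
                           shuffle=False, random_state=0)

stable, freq = stability_selection(X, y)
print(f"{len(stable)} features selected in at least 70% of runs")
```

Features that only appear in a handful of runs are the ones a single Lasso fit would have picked up by chance.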

Worked example: credit default with 48 features

A lender models default risk with 48 applicant attributes: income, debt ratios, bureau tradelines, employment tenure, and engineered velocity features. Baseline gradient boosting on all 48 columns achieves 0.81 validation AUC but serving requires 48 upstream queries.

  1. Filter pass — drop 6 near-zero-variance columns, then drop one column from each of 4 pairs with |r| > 0.95 (keeping the lower-latency column of each pair). 48 - 6 - 4 = 38 remain.
  2. Embedded pass — train LightGBM with feature_fraction=0.8 and inspect gain-based importance; discard features that contribute less than 0.1% of total gain. 16 remain.
  3. Wrapper confirmation — sequential forward selection on the 16, adding one at a time with 5-fold CV. Top 11 features reach 0.812 AUC; adding more features does not improve CV mean beyond noise.
  4. Deploy — production pipeline computes only 11 features; latency drops 60%, AUC holds on the untouched test set.

The final set includes debt_to_income, months_since_last_delinquency, and utilization_ratio — interpretable to underwriters — while dropping redundant bureau score variants that added variance without lift.
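
Step 3 stops when extra features no longer beat CV noise. One possible way to make that rule explicit is the one-standard-error heuristic, sketched below. It is a swapped-in convention, not necessarily the rule the lender used, and the `toy_means` and `toy_stds` values are invented for illustration, not results from this example.

```python
import numpy as np

def smallest_k_within_noise(cv_means, cv_stds, n_folds):
    """Smallest subset size whose CV mean is within one standard error of the best."""
    means = np.asarray(cv_means, dtype=float)
    sems = np.asarray(cv_stds, dtype=float) / np.sqrt(n_folds)
    best = int(np.argmax(means))
    cutoff = means[best] - sems[best]
    # Position 0 corresponds to a subset of size 1.
    return int(np.flatnonzero(means >= cutoff)[0]) + 1

# Invented CV AUC means and fold standard deviations for subset sizes 1..6.
toy_means = [0.70, 0.76, 0.79, 0.805, 0.806, 0.807]
toy_stds = [0.02, 0.02, 0.02, 0.02, 0.02, 0.02]

chosen_k = smallest_k_within_noise(toy_means, toy_stds, n_folds=5)
print("chosen subset size:", chosen_k)
```

With these toy numbers the best mean is at size 6, but size 4 is already inside one standard error of it, so the smaller set wins.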

Method decision table

Method family                     | Best when                                                  | Compute cost | Captures interactions
----------------------------------|------------------------------------------------------------|--------------|---------------------------------------
Filter (MI, chi-square)           | Thousands of columns, first-pass pruning                   | Low          | No
Wrapper (RFE, SFS)                | 20–80 features, moderate budget, need optimal subset       | High         | Yes (via model)
Embedded (Lasso, tree importance) | Production tabular models, regularized linear or boosting  | Medium       | Partial (trees yes, Lasso linear only)
No selection (use all)            | Deep nets on raw inputs, or very few strong domain features | Low upfront  | Model-dependent

Common pitfalls

  • Selecting on the full dataset before CV — leaks label information into validation scores.
  • Chasing training AUC — adding features that help train but hurt test; use held-out validation.
  • Ignoring serving cost — keeping a feature that requires an expensive third-party API call.
  • Unstable selection across retrains — if the feature set changes wildly weekly, the process is too aggressive or data is drifting.
  • Dropping protected-proxy columns without review — zip code may proxy for demographics; legal review matters beyond statistical importance.
  • Confusing importance with causality — selected features predict; they do not prove mechanism.
  • Double-dipping with hyperparameter tuning — tune model and feature count in the same loop without nested CV and you overfit both.
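
The double-dipping pitfall is usually avoided with nested cross-validation. In this sketch an inner `GridSearchCV` tunes the feature count and the regularization strength together, and an outer loop scores the whole tuning procedure. The synthetic data and the grid values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=25, n_informative=5,
                           n_redundant=0, n_clusters_per_class=1, random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(f_classif)),
    ("model", LogisticRegression(max_iter=1000)),
])
param_grid = {"select__k": [3, 5, 10], "model__C": [0.1, 1.0]}

# Inner loop: picks k and C using only the outer training fold.
inner_search = GridSearchCV(pipe, param_grid, cv=3, scoring="roc_auc")

# Outer loop: scores the tuned pipeline on rows the inner loop never saw.
outer_auc = cross_val_score(inner_search, X, y, cv=5, scoring="roc_auc")
print(f"nested CV AUC: {outer_auc.mean():.3f} +/- {outer_auc.std():.3f}")
```

The outer score estimates how the tuning procedure generalizes; the final feature count still comes from refitting the search on all training data.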

Practitioner checklist

  • Start with variance and correlation filters to remove obvious dead weight.
  • Nest selection inside cross-validated pipelines — never on the full training set alone.
  • Prefer embedded methods (L1, tree importance) for production tabular models.
  • Use wrappers only when feature count is moderate and compute allows.
  • Check multicollinearity before interpreting linear coefficients.
  • Log the final feature list and version it with each model release.
  • Validate selected features on an untouched test set once, after tuning.
  • Measure serving latency and upstream dependency count, not just AUC.
  • Monitor feature-importance stability across scheduled retrains.
  • Document why each retained feature is business-meaningful for audit trails.
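
For the stability item in the checklist, one simple option is to compare the logged feature lists of consecutive releases with Jaccard similarity. The helper name and the example feature lists are hypothetical, as is any alert threshold a team might set on the result.

```python
def jaccard(features_a, features_b) -> float:
    """Share of features common to both lists, out of all features in either list."""
    a, b = set(features_a), set(features_b)
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)

# Hypothetical feature lists logged with two consecutive model releases.
last_release = ["debt_to_income", "months_since_last_delinquency", "utilization_ratio"]
this_release = ["debt_to_income", "utilization_ratio", "employment_tenure"]

overlap = jaccard(last_release, this_release)
print(f"feature-set overlap: {overlap:.2f}")
```

A value near 1.0 means the selection is stable between retrains; a sharp drop is a prompt to check for data drift or an overly aggressive threshold.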

Key takeaways

  • Feature selection keeps a subset of the original columns; engineering creates new columns; PCA transforms columns into new axes.
  • Filter, wrapper, and embedded methods trade speed, accuracy, and compute — most production pipelines combine a filter pass with embedded selection.
  • Cross-validation nesting is non-negotiable; selection on the full training set leaks information.
  • Multicollinearity makes linear models unstable and splits importance across redundant columns.
  • The goal is not the smallest model — it is the smallest set that generalizes and is cheap to serve.
