Feature selection explained

A credit-risk model with 400 engineered columns trains slowly, overfits noise, and confuses auditors who ask why browser_timezone_offset predicts default. Feature selection answers a practical question: which inputs actually carry signal, and which can you drop without hurting generalization? Unlike feature engineering, which creates new signals, selection prunes the set you already have. The payoff is faster training, lower inference cost, simpler explanations, and often better out-of-sample performance when redundant or leaky columns are removed. This guide covers filter, wrapper, and embedded methods, multicollinearity diagnostics, cross-validation-safe pipelines, a Harbor Payments credit-scoring worked example, a method decision table, pitfalls, and a production checklist — building on machine learning fundamentals and cross-validation discipline.

Why selection beats using every column

More features are not automatically better. Each additional dimension increases the volume of input space your model must cover, which is the curse of dimensionality: with a fixed number of rows the data become sparse, and the model fills the empty regions with interpolation that does not generalize. Redundant features inflate variance in linear models and split importance in trees without adding information. Noisy columns act as random memorization hooks, especially when the number of features p approaches or exceeds the number of rows n.

Selection also serves operations: fewer features mean smaller serialized models, lower latency at scoring time, and cheaper feature-store backfills. For regulated domains, a 25-feature fraud or credit model is easier to document than a 400-column black box. The goal is not minimalism for its own sake — it is keeping columns that improve validation metrics while dropping those that only help the training set.

Selection vs extraction

Feature selection keeps a subset of original columns. Feature extraction (PCA, autoencoders) builds new combined dimensions that are harder to interpret but can compress correlated groups. Start with selection when interpretability and compliance matter; reach for extraction when you need aggressive compression and can sacrifice per-column explanations.

Filter methods: score columns before training

Filters rank or threshold features using statistics computed on the data alone — no iterative model fitting. They are fast, parallelizable, and a good first pass on high-dimensional tabular data.

Variance and missingness thresholds

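Drop near-constant columns: if 99.8% of rows share the same value, the feature can separate at most 0.2% of rows and rarely earns its place. When the positive class is itself rare, as in fraud, check the rare values against the label before dropping. Likewise remove columns with excessive missing rates unless missingness itself is informative (encode as a flag, then re-evaluate).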
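The two thresholds can be sketched as a small label-free filter. This is a minimal illustration with pandas: the 99.8% cutoff comes from the paragraph above and the 40% missing-rate cutoff from the worked example later in this guide, while the function name, column names, and synthetic data are assumptions made up for the demo.

```python
import numpy as np
import pandas as pd


def drop_dead_columns(df, max_dominant_share=0.998, max_missing_rate=0.40):
    """Split columns into kept and dropped using two label-free checks."""
    kept, dropped = [], {}
    for col in df.columns:
        missing_rate = df[col].isna().mean()
        if missing_rate > max_missing_rate:
            dropped[col] = f"missing_rate={missing_rate:.3f}"
            continue
        # Share of non-missing rows held by the single most frequent value.
        shares = df[col].value_counts(normalize=True, dropna=True)
        if shares.empty or shares.iloc[0] >= max_dominant_share:
            dropped[col] = "near-constant"
            continue
        kept.append(col)
    return kept, dropped


# Synthetic stand-in data; all column names are hypothetical.
rng = np.random.default_rng(0)
n_rows = 5000
df = pd.DataFrame({
    "debt_to_income": rng.normal(0.35, 0.10, n_rows),
    "legacy_device_flag": np.zeros(n_rows),  # constant column
    "optional_survey_score": np.where(rng.random(n_rows) < 0.7, np.nan, 1.0),
})
# Encode missingness as its own flag so it can be re-evaluated as a feature.
df["optional_survey_score_missing"] = df["optional_survey_score"].isna().astype(int)

kept, dropped = drop_dead_columns(df)
print("kept:", kept)
print("dropped:", dropped)
```

Returning the reason for each drop keeps the filter auditable, which matters when someone later asks why a column disappeared.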

Correlation and redundancy

For numeric pairs with |r| > 0.95, keep one representative. Pearson correlation measures only linear association, so it can understate redundancy when relationships are nonlinear; Spearman rank correlation catches monotonic ties, and mutual information catches more general dependence that Pearson misses.
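A common way to apply the 0.95 rule is to scan the upper triangle of the correlation matrix. The sketch below uses Spearman rank correlation so that monotonic transforms such as a log are also caught. Keeping the first column of each pair is an arbitrary tie-break chosen for the demo, and the function and column names are hypothetical.

```python
import numpy as np
import pandas as pd


def drop_redundant(df, threshold=0.95, method="spearman"):
    """Drop the later column of every pair whose |correlation| exceeds threshold."""
    corr = df.corr(method=method).abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Synthetic stand-in data with two redundant views of income.
rng = np.random.default_rng(0)
n_rows = 1000
annual_income = rng.lognormal(mean=11.0, sigma=0.4, size=n_rows)
df = pd.DataFrame({
    "annual_income": annual_income,
    "monthly_income": annual_income / 12,     # linear duplicate
    "log_income": np.log(annual_income),      # monotonic, nonlinear duplicate
    "months_on_file": rng.integers(1, 240, size=n_rows).astype(float),
})

reduced, to_drop = drop_redundant(df)
print("dropped:", to_drop)
print("remaining:", list(reduced.columns))
```

In practice the choice of which member of a pair to keep is better made with domain knowledge than by column order.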

Chi-square and mutual information

For classification, chi-square tests independence between each feature and the label; it needs non-negative inputs such as counts, one-hot indicators, or binned numerics. Mutual information (MI) measures how much knowing a feature reduces uncertainty about the target, and it captures nonlinear relationships chi-square may miss. sklearn.feature_selection.SelectKBest with mutual_info_classif is a common pattern: keep the top-k scorers, then train your real model on the subset.

Filters ignore feature interactions: two weak columns might combine into a strong signal that univariate MI never surfaces. That is why filters are often stage one, not the final word.
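The SelectKBest pattern can be sketched on synthetic data where two columns carry signal and six are noise. The data, column names, and k=2 are assumptions for the demo. Each column is scored on its own, which is exactly why interactions stay invisible to this step.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in data: two informative columns, six pure-noise columns.
rng = np.random.default_rng(0)
n_rows = 2000
names = np.array(["signal_a", "signal_b"] + [f"noise_{i}" for i in range(6)])
X = rng.normal(size=(n_rows, len(names)))
y = (X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=n_rows) > 0).astype(int)

# mutual_info_classif uses a randomized estimator, so pin its seed.
score_func = partial(mutual_info_classif, random_state=0)
selector = SelectKBest(score_func=score_func, k=2).fit(X, y)

selected = names[selector.get_support()].tolist()
for name, score in sorted(zip(names, selector.scores_), key=lambda t: -t[1]):
    print(f"{name:10s} MI={score:.3f}")
print("selected:", selected)
```

On real data, fit the selector inside a Pipeline so the scores are computed only on training folds.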

Wrapper methods: search with your actual model

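Wrappers treat feature subsets as a search problem. They train (or partially train) a model on candidate subsets and score them with cross-validation. They are usually more accurate than filters on interaction-heavy problems but far more expensive: an exhaustive search over p features means 2^p subsets, which is why practical wrappers rely on greedy strategies.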

Recursive feature elimination (RFE)

Train a model with all features, rank by importance (coefficient magnitude for linear models, feature_importances_ for tree ensembles), drop the weakest, repeat until k features remain. RFECV wraps RFE in cross-validation to pick k automatically. RFE with logistic regression or gradient boosting is a workhorse for tabular credit and fraud pipelines.
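A minimal RFECV sketch follows, assuming synthetic data that is already on a common scale (real data would need scaling first). The four informative columns, the coefficients, and the seeds are made up for illustration.

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data: columns 0-3 drive the label, columns 4-11 are noise.
rng = np.random.default_rng(0)
n_rows, n_cols = 1500, 12
X = rng.normal(size=(n_rows, n_cols))
logit = 2.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + 1.5 * X[:, 3]
y = (logit + rng.logistic(size=n_rows) > 0).astype(int)

selector = RFECV(
    estimator=LogisticRegression(max_iter=1000),  # ranks by |coefficient|
    step=1,                                       # drop one feature per round
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="roc_auc",
    min_features_to_select=1,
)
selector.fit(X, y)

print("chosen k:", selector.n_features_)
print("kept columns:", np.flatnonzero(selector.support_).tolist())
print("ranking (1 = kept):", selector.ranking_.tolist())
```

The chosen k is whichever subset size had the best mean cross-validated score, so it may include a few noise columns when their removal does not change the score.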

Sequential forward and backward selection

Forward selection starts empty and adds the feature that most improves validation score. Backward elimination starts full and removes the least helpful. Both are greedy — they can miss optimal combinations — but with 50–150 candidates they often beat univariate filters on structured data.
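Both directions are available in scikit-learn through SequentialFeatureSelector. The sketch below runs forward and backward search on a small synthetic problem. The data, the target subset size of 3, and the scoring choice are assumptions for the demo.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: columns 0-2 drive the label, columns 3-7 are noise.
rng = np.random.default_rng(1)
n_rows, n_cols = 1000, 8
X = rng.normal(size=(n_rows, n_cols))
logit = 2.0 * X[:, 0] + 1.5 * X[:, 1] - 1.5 * X[:, 2]
y = (logit + rng.logistic(size=n_rows) > 0).astype(int)


def run_search(direction):
    """Greedy search that adds (forward) or removes (backward) one column per step."""
    sfs = SequentialFeatureSelector(
        LogisticRegression(max_iter=1000),
        n_features_to_select=3,
        direction=direction,
        scoring="roc_auc",
        cv=5,
    )
    sfs.fit(X, y)
    return np.flatnonzero(sfs.get_support()).tolist()


forward_cols = run_search("forward")
backward_cols = run_search("backward")
print("forward :", forward_cols)
print("backward:", backward_cols)
```

The two directions agree here because the signal is additive; on real data they can disagree, which is itself a useful stability check.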

Cost control

Limit search depth: run filters first to cut 400 columns to 80, then RFE on the survivors. Use stratified k-fold with a consistent random seed. Log every subset score so you can audit why a column survived.
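The filter-then-RFE staging can be sketched as a single Pipeline, scaled down here from 400 → 80 columns to 60 → 15 → 5 so it runs quickly. The ANOVA F-test filter (f_classif) is swapped in as a cheap stand-in for whichever filter you use, and the data, sizes, and seed are assumptions.

```python
import numpy as np
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in data: columns f00-f03 drive the label, the rest are noise.
rng = np.random.default_rng(6)
n_rows, n_cols = 1500, 60
names = np.array([f"f{i:02d}" for i in range(n_cols)])
X = rng.normal(size=(n_rows, n_cols))
logit = 2.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + 1.5 * X[:, 3]
y = (logit + rng.logistic(size=n_rows) > 0).astype(int)

pipe = Pipeline([
    ("filter", SelectKBest(f_classif, k=15)),  # cheap stage: 60 -> 15
    ("rfe", RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)),  # 15 -> 5
    ("clf", LogisticRegression(max_iter=1000)),
])

# Fixed seed so every candidate pipeline is compared on identical folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
print(f"AUC: {scores.mean():.3f} +/- {scores.std():.3f}")

# Refit on all training data to read off which columns survived both stages.
pipe.fit(X, y)
after_filter = names[pipe.named_steps["filter"].get_support()]
final = after_filter[pipe.named_steps["rfe"].get_support()].tolist()
print("survivors:", final)
```

Because both stages live in the Pipeline, the expensive RFE step only ever sees the filtered columns, and both stages are refit inside every fold.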

Embedded methods: selection inside training

Embedded methods bake sparsity or importance into the learning algorithm itself. No separate search loop — regularization or tree structure does the pruning.

L1 (Lasso) and elastic net

Lasso regression penalizes the sum of absolute coefficients, which drives the coefficients of weak features exactly to zero and yields a sparse linear model. Elastic net mixes L1 and L2, which works better when many correlated features should survive as a group rather than one arbitrary winner. After scaling numeric features, fit Lasso with cross-validated alpha (LassoCV in scikit-learn) and keep columns with nonzero coefficients.
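The scale-then-LassoCV recipe can be sketched as follows. LassoCV is a regressor, so the demo uses a continuous target. The synthetic data, column names, and coefficients are assumptions, and the columns are deliberately put on very different scales to show why the scaler comes first.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: only x0, x1, x2 influence the target.
rng = np.random.default_rng(2)
n_rows, n_cols = 500, 20
names = np.array([f"x{i}" for i in range(n_cols)])
Z = rng.normal(size=(n_rows, n_cols))
y = 3.0 * Z[:, 0] - 2.0 * Z[:, 1] + 1.5 * Z[:, 2] + rng.normal(size=n_rows)
# Give every column a different unit so raw coefficients are not comparable.
X = Z * rng.uniform(1.0, 1000.0, size=n_cols)

# Scaling puts all columns under the same penalty; LassoCV picks alpha by 5-fold CV.
model = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
model.fit(X, y)

lasso = model.named_steps["lassocv"]
kept = names[lasso.coef_ != 0].tolist()
print(f"alpha chosen by CV: {lasso.alpha_:.4f}")
print(f"kept {len(kept)} of {n_cols} columns:", kept)
```

The CV-optimal alpha often lets a few noise columns through with tiny coefficients, so treat the nonzero set as a shortlist rather than a final answer.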

Tree-based importance

Random forests and gradient boosting report split-gain importance. Train once, threshold on cumulative importance (e.g., keep features summing to 95% of total gain). Beware: importance favors high-cardinality columns; use permutation importance on a held-out set to validate that a “top” feature actually hurts metrics when shuffled.
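The contrast between built-in importance and permutation importance can be sketched with one real signal and two noise columns, one continuous (a distinct value per row) and one binary. The data and names are synthetic assumptions. The point is only that the built-in ranking tends to credit the continuous noise column, while shuffling it on held-out rows barely moves the metric.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data.
rng = np.random.default_rng(3)
n_rows = 2000
names = ["signal", "noise_continuous", "noise_binary"]
signal = rng.normal(size=n_rows)
X = np.column_stack([
    signal,
    rng.random(n_rows),               # distinct value per row, no signal
    rng.integers(0, 2, size=n_rows),  # two values, no signal
])
y = (signal + 0.5 * rng.normal(size=n_rows) > 0).astype(int)

X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Shuffle one column at a time on held-out rows and measure the AUC drop.
perm = permutation_importance(
    forest, X_hold, y_hold, scoring="roc_auc", n_repeats=10, random_state=0
)
for name, builtin, drop in zip(names, forest.feature_importances_, perm.importances_mean):
    print(f"{name:17s} built-in={builtin:.3f}  permutation AUC drop={drop:.3f}")
```

A column whose permutation drop is indistinguishable from zero on the holdout is a candidate for removal regardless of its built-in rank.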

Regularized generalized linear models

Logistic regression with L1 penalty performs embedded selection for classification. For wide sparse text matrices, linear SVMs with L1 are an alternative. Embedded paths shine when p >> n and you need one training pass.

Multicollinearity and stability

Highly correlated features do not always hurt tree models, but they destabilize linear coefficients: small data shifts flip signs and magnitudes. The variance inflation factor (VIF) measures how much the variance of coefficient j is inflated by correlation with the other columns: VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on all remaining columns. Rule of thumb: investigate any VIF above roughly 5 to 10, then drop or combine the collinear group.

Domain knowledge resolves ambiguity filters cannot: annual_income and monthly_income are redundant — keep one. Pair VIF screening with business semantics before trusting automated drops.
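VIF follows directly from its definition, VIF_j = 1 / (1 - R_j^2), where R_j^2 is obtained by regressing column j on the other columns. The sketch below computes it with scikit-learn's LinearRegression. The helper name and the synthetic income data are assumptions for the demo.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the others."""
    vifs = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        vifs.append(np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2))
    return np.array(vifs)


# Synthetic stand-in data: two near-duplicate income columns and one unrelated column.
rng = np.random.default_rng(4)
n_rows = 2000
annual_income = rng.normal(60_000, 15_000, size=n_rows)
monthly_income = annual_income / 12 + rng.normal(0, 100, size=n_rows)
months_on_file = rng.integers(1, 240, size=n_rows).astype(float)

names = ["annual_income", "monthly_income", "months_on_file"]
X = np.column_stack([annual_income, monthly_income, months_on_file])
vifs = variance_inflation_factors(X)
for name, value in zip(names, vifs):
    print(f"{name:15s} VIF={value:.1f}")
```

Both income columns show a large VIF because each is almost fully explained by the other. VIF cannot say which one to keep, and that decision is where business semantics come in.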

Worked example: Harbor Payments credit default model

Harbor Payments trains a logistic default model on 380 applicant features: bureau tradelines, cash-flow aggregates, device fingerprints, and merchant category one-hots. Offline AUC is 0.91 with all columns; production inference p95 latency is 42 ms — too slow for real-time checkout.

Stage 1: filter pass

Drop 47 near-zero-variance device flags. Remove 12 columns with >40% missing (no missingness signal after encoding). Chi-square on binned numerics eliminates 89 weak category dummies. MI keeps top 120 of the remainder. Runtime: seconds.

Stage 2: embedded LassoCV

Standard-scale continuous columns and replace the high-cardinality merchant one-hots with a frequency encoding, both inside a pipeline. LassoCV with 5-fold stratified CV zeros out 61 of 120 features, leaving 59. Validation AUC: 0.908, a negligible loss going from 380 to 59 columns.

Stage 3: RFE confirmation

RFECV with logistic regression on the 59 survivors suggests 44 features are sufficient (AUC 0.907). Permutation importance on the holdout month confirms debt_to_income, months_on_file, and avg_daily_balance_90d drive most gain. Three device timezone columns had survived the earlier stages, but permutation shows near-zero impact, so they are removed, leaving 41.

Outcome

Final model: 41 features, validation AUC 0.907, inference p95 11 ms. Compliance documentation lists each retained column with MI rank, Lasso coefficient sign, and permutation delta-AUC. Retrain monthly inside a sklearn Pipeline so selection steps refit only on training folds.

Method decision table

Method	Speed	Captures interactions	Best when
Variance / missing filter	Very fast	No	First pass on any wide table; removes obvious dead columns
Mutual information / chi-square	Fast	No (univariate)	Classification with hundreds of candidates; quick shortlist
RFE / sequential search	Slow	Yes	Moderate feature count after filtering; need model-aware subset
Lasso / elastic net	Medium	Partial (linear)	High-dimensional linear models; p >> n sparse solutions
Tree importance + permutation	Medium	Yes	Nonlinear tabular; validate importance with shuffle tests
PCA / autoencoders	Medium	Yes (latent)	Interpretability optional; aggressive compression needed

CV-safe pipelines (non-negotiable)

Fitting a selector on the full dataset before cross-validation leaks label information into every fold. The fix: nest selection inside each training fold.

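from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Each step is refit on the training part of every outer fold, so the
# selector never sees the labels of the rows it is later scored on.
pipe = Pipeline([
    # Scale first: the L1 penalty is sensitive to feature units.
    ("scale", StandardScaler()),
    # Keep the columns whose L1-penalized coefficients are not shrunk to (near) zero.
    ("select", SelectFromModel(LogisticRegressionCV(penalty="l1", solver="saga", cv=5))),
    # Final classifier, trained only on the surviving columns.
    ("clf", LogisticRegressionCV(cv=5)),
])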

cross_val_score(pipe, X, y, cv=5) now reflects honest generalization. Persist the entire pipeline with joblib so production applies the same scale-then-select-then-predict steps fitted on the final training window.
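The size of the leak is easiest to see on data with no signal at all. In the sketch below the features are pure noise and the labels are random, so any honest accuracy estimate should sit near 0.5. The sizes, seed, and the ANOVA F-test filter are assumptions chosen to make the effect visible.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in data: labels are unrelated to every column.
rng = np.random.default_rng(5)
n_rows, n_cols, k = 100, 2000, 20
X = rng.normal(size=(n_rows, n_cols))
y = rng.permutation(np.repeat([0, 1], n_rows // 2))

# Leaky: the selector sees every label before the folds are cut.
X_leaky = SelectKBest(f_classif, k=k).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5).mean()

# Honest: the selector is refit on the training part of each fold.
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=k)),
    ("clf", LogisticRegression(max_iter=1000)),
])
honest = cross_val_score(pipe, X, y, cv=5).mean()

print(f"selection before CV: accuracy={leaky:.2f}")
print(f"selection inside CV: accuracy={honest:.2f}")
```

The leaky estimate looks like a working model even though there is nothing to learn, which is why the selector belongs inside the Pipeline.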

Common pitfalls

Selecting on the test set — any tuning of k or thresholds using holdout data inflates reported metrics; use nested CV or a dedicated validation split.
Univariate filters on interacting features — pairwise MI or wrapper search may be required when signal lives in combinations.
Ignoring leakage columns — features computed after the prediction point (post-default collections activity) score high in MI but fail in production; audit timelines before selection.
Trusting default tree importance — high-cardinality categoricals dominate split counts; confirm with permutation importance.
Unscaled Lasso on mixed units: income in dollars and age in years are penalized on different scales, so always scale before L1.
Assuming a stable feature set across retrains: monthly retrains may swap borderline columns, so monitor set overlap and coefficient sign stability.
Dropping protected proxies carelessly — zip code may proxy for demographics; selection does not remove fairness obligations.

Production checklist

Document baseline metrics with all features vs selected subset on the same CV splits.
Run variance, missingness, and MI filters as stage one on wide tables.
Check VIF on linear model finalists; resolve redundant business duplicates manually.
Nest selectors inside Pipeline; never fit selectors outside CV loops.
Validate surviving features with permutation importance on a recent holdout month.
Log selected feature names, version, and selection method in the model registry.
Measure inference latency and model size before and after pruning.
Monitor feature-set drift: alert when >20% of columns change between retrains.
Publish a model card listing each retained column and why it survived.
Re-run selection when label definition or data source schema changes.
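The drift check in this list can be sketched in plain Python. How "percent of columns changed" is measured is an assumption here: the sketch uses the share of the combined feature set that appears in only one of the two retrains, and it also reports coefficient sign flips among shared columns. The coefficient values and some feature names are made up for illustration.

```python
def feature_set_change(previous, current):
    """Share of the combined feature set that appears in only one retrain."""
    prev, curr = set(previous), set(current)
    union = prev | curr
    return len(prev ^ curr) / len(union) if union else 0.0


def sign_flips(prev_coefs, curr_coefs):
    """Features present in both retrains whose coefficient changed sign."""
    shared = prev_coefs.keys() & curr_coefs.keys()
    return sorted(f for f in shared if prev_coefs[f] * curr_coefs[f] < 0)


# Hypothetical coefficients from two consecutive monthly retrains.
march = {
    "debt_to_income": 0.82,
    "months_on_file": -0.41,
    "avg_daily_balance_90d": -0.37,
    "utilization_ratio": 0.22,
    "recent_inquiries": 0.15,
}
april = {
    "debt_to_income": 0.79,
    "months_on_file": -0.44,
    "avg_daily_balance_90d": 0.05,
    "utilization_ratio": 0.20,
    "merchant_frequency": 0.11,
}

change = feature_set_change(march, april)
flips = sign_flips(march, april)
alert = change > 0.20  # threshold taken from the checklist above

print(f"feature-set change: {change:.0%}")
print("sign flips:", flips)
print("alert:", alert)
```

Logging these two numbers with every retrain gives the model registry a simple history of how stable the selected set has been.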

Key takeaways

Feature selection prunes; engineering creates. Use both on wide tabular problems: engineer candidate features first, then prune them.
Filters are fast first passes; wrappers and embedded methods capture model-specific signal — combine stages rather than picking one religion.
Nested pipelines prevent selection leakage — the selector must refit inside each training fold.
Permutation importance validates tree rankings — never ship based on split gain alone.
Fewer strong features beat hundreds of weak ones for speed, stability, and auditability.