
Logistic regression explained

Linear regression predicts a number; logistic regression predicts the probability that an example belongs to the positive class. Banks use it for credit default, clinics for disease risk, marketers for click-through, and engineers as a fast baseline before trying gradient-boosted trees or neural nets. The model is deceptively simple: a weighted sum of features is passed through a sigmoid, which squashes the score into the 0–1 range. Training maximizes likelihood (or, equivalently, minimizes log-loss), and each learned coefficient has a clean interpretation as an additive change in log-odds or, once exponentiated, as a multiplier on the odds. This guide walks through the sigmoid and decision boundary, the log-loss objective, regularization, multiclass softmax extensions, calibration, and when logistic regression is the right tool. It builds on machine learning fundamentals and pairs naturally with ROC-AUC evaluation.

From linear scores to probabilities

Given feature vector x and weights w plus bias b, logistic regression computes a linear score z = w·x + b. Instead of returning z directly, it applies the sigmoid (logistic) function:

σ(z) = 1 / (1 + e^(−z))

The output p = σ(z) is interpreted as P(y = 1 | x). The sigmoid is smooth, differentiable, and maps any real number to (0, 1). Large positive z pushes probability toward 1; large negative z toward 0; z = 0 gives exactly 0.5.
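As a minimal sketch, the sigmoid fits in a few lines of Python. The function name and the two-branch form are illustrative choices rather than any library's implementation; the branching only avoids overflow in math.exp when z is very negative.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        # e^(-z) <= 1 here, so the exponential cannot overflow.
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the algebraically equivalent form e^z / (1 + e^z).
    ez = math.exp(z)
    return ez / (1.0 + ez)

for z in (-5.0, 0.0, 5.0):
    print(f"z = {z:+.1f} -> p = {sigmoid(z):.4f}")
```

The symmetry σ(−z) = 1 − σ(z) is visible in the output: the probabilities for z = −5 and z = +5 add up to 1.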

Geometrically, the decision boundary where p = 0.5 is the hyperplane w·x + b = 0 — a line in two dimensions, a plane in three. Logistic regression is a linear classifier: it can only separate classes with a single linear boundary in feature space. Nonlinear problems need feature engineering (polynomial terms, binning) or nonlinear models like tree ensembles.
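To make the limitation concrete, here is a small sketch on a made-up one-dimensional dataset in which the positive class sits at both extremes. No single cutoff on x separates the classes, but adding x² as a feature makes a linear rule sufficient. The data and the weights are hand-picked for illustration, not learned.

```python
# Toy data (made up for illustration): positives sit at both extremes of x.
xs = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
ys = [1, 1, 0, 0, 0, 1, 1]

def predict_linear(features, weights, bias):
    """Predict class 1 when the linear score w.x + b is positive."""
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return 1 if score > 0 else 0

def best_threshold_accuracy(xs, ys):
    """Best accuracy any single cut on raw x can reach, in either orientation."""
    best = 0.0
    cuts = [min(xs) - 1.0] + sorted(xs)
    for cut in cuts:
        for sign in (1, -1):
            preds = [1 if sign * (x - cut) > 0 else 0 for x in xs]
            acc = sum(p == y for p, y in zip(preds, ys)) / len(ys)
            best = max(best, acc)
    return best

# With the engineered feature x^2, the rule x^2 - 1 > 0 is linear in [x, x^2].
weights, bias = [0.0, 1.0], -1.0
engineered_preds = [predict_linear([x, x * x], weights, bias) for x in xs]
engineered_acc = sum(p == y for p, y in zip(engineered_preds, ys)) / len(ys)

print("best accuracy with one cut on x:", round(best_threshold_accuracy(xs, ys), 3))
print("accuracy with features [x, x^2]:", engineered_acc)
```

The boundary is still a hyperplane, only now in the expanded feature space; mapped back to the original x axis it becomes the two points x = −1 and x = 1.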

Odds, log-odds, and coefficient interpretation

Odds are the ratio of the probability of success to the probability of failure: odds = p / (1 − p). The model's linear part is the log-odds (logit): log(p / (1 − p)) = w·x + b. A one-unit increase in feature x_j therefore adds w_j to the log-odds, which multiplies the odds by e^(w_j), holding other features fixed. If w_age = 0.04, each additional year multiplies the odds by e^0.04 ≈ 1.04, a 4% increase in odds per year.
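The arithmetic behind this interpretation is easy to check. The sketch below reuses the w_age = 0.04 example; the helper names are hypothetical.

```python
import math

def odds_multiplier(coefficient: float, delta: float = 1.0) -> float:
    """Factor by which the odds change when a feature increases by `delta`."""
    return math.exp(coefficient * delta)

def probability_to_odds(p: float) -> float:
    return p / (1.0 - p)

def odds_to_probability(odds: float) -> float:
    return odds / (1.0 + odds)

w_age = 0.04
print(f"one extra year:  odds x {odds_multiplier(w_age):.4f}")
print(f"ten extra years: odds x {odds_multiplier(w_age, 10):.4f}")

# The same odds multiplier moves the probability by different amounts
# depending on the starting point.
for p in (0.05, 0.50):
    new_p = odds_to_probability(probability_to_odds(p) * odds_multiplier(w_age, 10))
    print(f"p = {p:.2f} -> {new_p:.4f} after ten extra years")
```

Note that the multiplier applies to the odds, not to the probability: a 4% increase in odds is close to a 4% relative increase in probability only when p is small.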

This interpretability is why logistic regression remains popular in regulated industries: you can document which features push risk up or down and by how much, something opaque deep nets struggle to provide without post-hoc tools.

Training: maximum likelihood and log-loss

Unlike linear regression (which uses squared error), logistic regression is trained with binary cross-entropy, also called log-loss. For a single example with true label y ∈ {0, 1} and predicted probability p:

L = −[y log(p) + (1 − y) log(1 − p)]

The loss penalizes confident wrong predictions heavily: predicting p = 0.99 when y = 0 costs −log(0.01) ≈ 4.6, whereas the same prediction when y = 1 costs only about 0.01. Across the training set, we minimize the average log-loss, which is equivalent to maximizing the likelihood of the observed labels under the model.
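A direct translation of the formula, as a sketch: the function names and the clipping constant eps are assumptions added here so that log(0) never occurs.

```python
import math

def log_loss_single(y: int, p: float, eps: float = 1e-15) -> float:
    """Binary cross-entropy for one example; p is clipped to avoid log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def mean_log_loss(labels, probs) -> float:
    """Average log-loss over a dataset."""
    return sum(log_loss_single(y, p) for y, p in zip(labels, probs)) / len(labels)

print(f"confident and right (y=1, p=0.99): {log_loss_single(1, 0.99):.4f}")
print(f"confident and wrong (y=0, p=0.99): {log_loss_single(0, 0.99):.4f}")
print(f"uninformative       (y=1, p=0.50): {log_loss_single(1, 0.50):.4f}")
```

The asymmetry is the point: being confidently wrong costs far more than being confidently right saves.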

There is no closed-form solution as there is for ordinary least squares. The log-loss is convex in the weights, so any minimum an iterative optimizer finds is a global one; common choices are gradient descent, Newton's method, and L-BFGS. Libraries such as scikit-learn's LogisticRegression default to efficient solvers and apply L2 regularization by default. In practice, training is usually fast on tabular data, even with thousands of features and millions of rows.
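For intuition, here is a minimal sketch of plain batch gradient descent on the average log-loss, using the fact that its gradient with respect to the weights is the mean of (p − y)·x. The learning rate, epoch count, and toy data are arbitrary assumptions; production solvers are considerably more sophisticated.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Unregularized batch gradient descent on the average log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # derivative of the log-loss with respect to the score z
            for j in range(d):
                grad_w[j] += err * xi[j] / n
            grad_b += err / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# Toy, overlapping 1-D data (made up): larger x tends to mean class 1.
X = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(X, y)
print("weight:", round(w[0], 3), "bias:", round(b, 3))
print("P(y=1 | x=1.0):", round(sigmoid(w[0] * 1.0 + b), 3))
print("P(y=1 | x=4.0):", round(sigmoid(w[0] * 4.0 + b), 3))
```

The toy classes deliberately overlap. On perfectly separable data the unregularized weights would keep growing without bound, which is one reason library defaults include a penalty.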

Class imbalance and sample weights

When positives are rare (fraud, disease), the model can assign a low probability to nearly every example, predict the majority class everywhere at the default 0.5 cutoff, and still achieve a low log-loss. Mitigations:

Class weights: upweight minority-class errors in the loss.
Resampling: oversample positives or undersample negatives, on the training split only; resampling before the split lets duplicated rows leak into validation.
Threshold tuning: pick a cutoff other than 0.5 using validation data and metrics such as precision and recall rather than default accuracy.
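Threshold tuning, the last of these mitigations, can be sketched as a search over candidate cutoffs on validation scores. Maximizing F1 is only one possible criterion, chosen here for illustration; the validation data and function names are hypothetical.

```python
def precision_recall_f1(labels, scores, threshold):
    """Precision, recall, and F1 when predicting 1 for scores >= threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def best_f1_threshold(labels, scores):
    """Try each distinct score as a cutoff and keep the one with the highest F1."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: precision_recall_f1(labels, scores, t)[2])

# Hypothetical validation labels and predicted probabilities (rare positives).
val_labels = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
val_scores = [0.02, 0.05, 0.08, 0.10, 0.12, 0.15, 0.20, 0.30, 0.35, 0.45]

print("at cutoff 0.5:", precision_recall_f1(val_labels, val_scores, 0.5))
t = best_f1_threshold(val_labels, val_scores)
print("best cutoff:", t, precision_recall_f1(val_labels, val_scores, t))
```

With rare positives, no score in this example reaches 0.5, so the default cutoff flags nothing. If false positives and false negatives have different costs, replace F1 with the expected cost under your own cost matrix.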

Regularization: L1, L2, and the bias-variance tradeoff

With many correlated features, logistic regression can overfit — weights blow up to fit noise. L2 regularization (ridge) adds λ Σ w_j² to the loss, shrinking coefficients toward zero without eliminating them. L1 regularization (lasso) adds λ Σ |w_j| and can drive some weights exactly to zero, performing automatic feature selection.
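The difference between the two penalties shows up in how they shrink individual weights. The sketch below applies one shrinkage step of each kind to a made-up weight vector. For L1 it uses soft-thresholding, the update used by proximal-gradient methods; that is an illustrative choice and not necessarily what a given library does internally. The step size and λ are arbitrary.

```python
def l2_penalty(weights, lam):
    """Ridge penalty: lam * sum of squared weights (bias excluded)."""
    return lam * sum(w * w for w in weights)

def l1_penalty(weights, lam):
    """Lasso penalty: lam * sum of absolute weights (bias excluded)."""
    return lam * sum(abs(w) for w in weights)

def l2_shrink(w, lr, lam):
    """One gradient step on the L2 penalty alone; d/dw of lam * w^2 is 2 * lam * w."""
    return w - lr * 2.0 * lam * w

def soft_threshold(w, amount):
    """Move w toward zero by `amount`, snapping to exactly 0 if it would cross."""
    if w > amount:
        return w - amount
    if w < -amount:
        return w + amount
    return 0.0

weights = [2.0, -0.03, 0.5, 0.01]  # made-up values
lr, lam = 0.5, 0.1

after_l2 = [l2_shrink(w, lr, lam) for w in weights]
after_l1 = [soft_threshold(w, lr * lam) for w in weights]

print("after L2 step:", after_l2)  # every weight scaled down, none exactly zero
print("after L1 step:", after_l1)  # small weights set exactly to zero
```

L2 shrinks each weight in proportion to its size, so small weights get smaller but survive. L1 subtracts a fixed amount, which removes weights whose magnitude is below that amount.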

The hyperparameter C in scikit-learn is the inverse of regularization strength: large C means less penalty (more complex model). Tune C on a held-out validation set or via cross-validation — never on the test set you report final metrics on.

Always scale numeric features (standardization or min-max) before training regularized logistic regression. Otherwise, features with larger raw ranges receive smaller effective penalties.
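A minimal standardization sketch using only the standard library follows. The statistics are computed on the training rows and then reused unchanged for new data; the column meanings (age, income) and all values are made up.

```python
import statistics

def fit_standardizer(train_rows):
    """Compute per-column mean and standard deviation from training rows only."""
    columns = list(zip(*train_rows))
    means = [statistics.fmean(col) for col in columns]
    # Population standard deviation; fall back to 1.0 for constant columns.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds

def transform(rows, means, stds):
    """Apply previously fitted means and standard deviations to any rows."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Columns: age in years, income in currency units (very different raw ranges).
train = [[25.0, 30_000.0], [35.0, 50_000.0], [45.0, 70_000.0]]
test = [[40.0, 100_000.0]]

means, stds = fit_standardizer(train)
train_scaled = transform(train, means, stds)
test_scaled = transform(test, means, stds)  # reuses the training statistics

print("train:", [[round(v, 3) for v in row] for row in train_scaled])
print("test: ", [[round(v, 3) for v in row] for row in test_scaled])
```

After scaling, both columns have unit variance on the training data, so a single penalty strength treats them comparably.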

Multiclass extensions: one-vs-rest and softmax

Binary logistic regression extends to K > 2 classes in two common ways:

One-vs-rest (OvR): train K separate binary classifiers, each distinguishing one class from all others. Simple and parallelizable, but the K probabilities are estimated independently and do not sum to 1 unless they are renormalized.
Multinomial (softmax) logistic regression: one coefficient vector per class; softmax converts the K scores into a proper probability distribution that sums to 1. Preferred when classes are mutually exclusive and you need a coherent distribution over them.
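Both options are small enough to sketch directly. The function names are illustrative; subtracting the maximum score inside softmax is a standard trick that leaves the result unchanged while avoiding overflow.

```python
import math

def softmax(scores):
    """Convert K real-valued scores into probabilities that sum to 1."""
    m = max(scores)  # shifting by the max does not change the output
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def ovr_normalize(binary_probs):
    """Rescale K independent one-vs-rest probabilities so they sum to 1."""
    total = sum(binary_probs)
    return [p / total for p in binary_probs]

print("softmax:", [round(p, 4) for p in softmax([2.0, 1.0, 0.1])])

# Three hypothetical OvR classifiers, each producing its own P(class k vs rest).
raw_ovr = [0.6, 0.3, 0.3]
print("OvR raw sum:", round(sum(raw_ovr), 2))
print("OvR normalized:", [round(p, 4) for p in ovr_normalize(raw_ovr)])
```

With K = 2 and scores [z, 0], softmax reduces to the sigmoid of z, which is why binary logistic regression is a special case of the multinomial model.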

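In scikit-learn, LogisticRegression fits the multinomial (softmax) model for multiclass problems when used with solvers such as lbfgs or saga. Older code requests it explicitly with multi_class='multinomial', a parameter that is deprecated in recent releases. For extreme class counts (thousands of labels), OvR or hierarchical approaches may be more practical.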

Calibration: when predicted probabilities lie

Logistic regression outputs are often reasonably calibrated out of the box — if the model says 0.7, roughly 70% of those cases should be positive. Complex models (gradient-boosted trees, neural nets) frequently produce sharp scores that rank well but miscalibrate probabilities.

Check calibration with a reliability diagram: bin predictions by probability, plot mean predicted vs mean observed rate. If curves deviate from the diagonal, apply Platt scaling (fit a secondary logistic regression on validation scores) or isotonic regression. Calibration matters when probabilities drive decisions — loan pricing, insurance premiums, or budget allocation — not just ranking.
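The table of numbers behind a reliability diagram takes only a few lines to compute. This sketch uses equal-width bins; the bin count and validation data are made up, and plotting is left out.

```python
def reliability_bins(labels, probs, n_bins=5):
    """Group predictions into equal-width bins and compare predicted vs observed.

    Returns (mean_predicted, observed_rate, count) for each non-empty bin.
    """
    stats = [[0.0, 0, 0] for _ in range(n_bins)]  # [sum of probs, positives, count]
    for y, p in zip(labels, probs):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        stats[idx][0] += p
        stats[idx][1] += y
        stats[idx][2] += 1
    return [(s / n, pos / n, n) for s, pos, n in stats if n > 0]

# Hypothetical validation labels and predicted probabilities.
labels = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1]
probs = [0.05, 0.10, 0.15, 0.30, 0.35, 0.45, 0.55, 0.65, 0.75, 0.90, 0.95]

for mean_pred, observed, count in reliability_bins(labels, probs):
    print(f"predicted {mean_pred:.3f}  observed {observed:.2f}  n={count}")
```

Plotting observed rate against mean predicted probability gives the reliability curve. With bins this small the observed rates are noisy, so real checks need far more examples per bin.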

Logistic regression vs other models

Model	Strengths	Weaknesses
Logistic regression	Fast, interpretable coefficients, decent calibration, strong tabular baseline	Linear decision boundary only; weak on raw images/text without features
Gradient-boosted trees	Handles nonlinearities and interactions; often wins tabular Kaggle competitions	Slower inference; less interpretable; may need calibration
Neural networks	Flexible on unstructured data (vision, language)	Needs more data and compute; opaque without explanation tools
Naive Bayes	Very fast training; good text baseline with bag-of-words	Strong independence assumptions; often miscalibrated

A practical workflow: start with logistic regression on properly engineered features, measure ROC-AUC and calibration, then try gradient boosting if you need more accuracy and can sacrifice some interpretability.

Common mistakes

Using accuracy on imbalanced data: a model that never flags fraud can score 99% accuracy. Report precision, recall, F1, and AUC.
Data leakage in features: including post-outcome columns (e.g. "chargeback filed") makes coefficients look miraculous and fails in production.
Forgetting to scale: regularization and convergence suffer when features have different units.
Interpreting coefficients causally: logistic regression shows association, not causation, unless the data come from a randomized experiment.
Ignoring multicollinearity: correlated features inflate coefficient variance; consider PCA, dropping collinear columns, or an L2 penalty.
Deploying without monitoring: population drift changes calibration; retrain or recalibrate when feature distributions shift.
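The first mistake in this list is easy to reproduce. In the sketch below, on a made-up dataset with 1% fraud, a "model" that never flags anything scores 99% accuracy while catching no fraud at all.

```python
def accuracy(labels, preds):
    """Fraction of predictions that match the labels."""
    return sum(y == p for y, p in zip(labels, preds)) / len(labels)

def recall(labels, preds):
    """Fraction of actual positives that were predicted positive."""
    positives = sum(labels)
    true_pos = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    return true_pos / positives if positives else 0.0

# Hypothetical dataset: 990 legitimate transactions, 10 fraudulent ones.
labels = [0] * 990 + [1] * 10
never_flag = [0] * 1000  # always predicts "not fraud"

print("accuracy:", accuracy(labels, never_flag))
print("recall:  ", recall(labels, never_flag))
```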

Production checklist

Split data train / validation / test with temporal or group constraints if leakage is possible.
Engineer and scale features; fit scaler on training data only.
Tune regularization C via cross-validation on training data.
Handle class imbalance with weights or resampling; pick threshold on validation set for your cost matrix.
Evaluate AUC, precision-recall, and calibration on held-out test data.
Document coefficient signs and magnitudes for stakeholders.
Serialize model + scaler + feature list as a versioned artifact.
Monitor score distribution and outcome rates in production; alert on drift.

Key takeaways

Logistic regression maps linear scores through a sigmoid to produce class probabilities with interpretable log-odds coefficients.
Log-loss training penalizes confident mistakes and is equivalent to maximum likelihood for Bernoulli labels.
L1/L2 regularization controls complexity; always scale features and tune C with proper validation.
Multiclass uses OvR or softmax; choose based on whether you need a proper probability distribution.
Calibration and thresholds matter as much as AUC when probabilities drive business decisions.