Logistic regression explained
Linear regression predicts a number; logistic regression predicts the probability that an example belongs to the positive class. Banks use it for credit default, clinics for disease risk, marketers for click-through, and engineers as a fast baseline before trying gradient-boosted trees or neural nets. The model is deceptively simple: a weighted sum of features is passed through a sigmoid, which squashes the score into the 0–1 range. Training maximizes likelihood (equivalently, minimizes log-loss), and each learned coefficient has a clean interpretation as a change in log-odds, so its exponential is an odds multiplier. This guide walks through the sigmoid and decision boundary, the log-loss objective, regularization, multiclass softmax extensions, calibration, and when logistic regression is the right tool. It builds on machine learning fundamentals and pairs naturally with ROC-AUC evaluation.
From linear scores to probabilities
Given a feature vector x, weights w, and bias b, logistic regression computes a linear score z = w·x + b. Instead of returning z directly, it applies the sigmoid (logistic) function:
σ(z) = 1 / (1 + e^{−z})
The output p = σ(z) is interpreted as P(y = 1 | x). The sigmoid is smooth, differentiable, and maps any real number to (0, 1). Large positive z pushes the probability toward 1; large negative z pushes it toward 0; z = 0 gives exactly 0.5.
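As a minimal sketch of the score-to-probability step, the following pure-Python snippet computes z = w·x + b and applies the sigmoid. The weights, bias, and feature values are made up for illustration, and the two-branch form of the sigmoid is one common way to avoid overflow for very negative z, not the only one.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, exp(z) is small, so this form cannot overflow.
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(w, x, b):
    """Return P(y = 1 | x) for weights w, features x, and bias b."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return sigmoid(z)


# Hypothetical weights and inputs, for illustration only.
w, b = [0.8, -0.5], 0.1
for x in ([0.0, 0.0], [3.0, 1.0], [-3.0, 2.0]):
    print(x, round(predict_proba(w, x, b), 3))
```

Positive scores land above 0.5 and negative scores below it, which is why thresholding p at 0.5 is the same as checking the sign of z.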
Geometrically, the decision boundary where
p = 0.5 is the hyperplane w·x + b = 0 — a line in
two dimensions, a plane in three. Logistic regression is a linear
classifier: it can only separate classes with a single linear boundary
in feature space. Nonlinear problems need feature engineering (polynomial
terms, binning) or nonlinear models like tree ensembles.
Odds, log-odds, and coefficient interpretation
Odds are the ratio of the probability of success to the probability of failure:
odds = p / (1 − p). The model's linear part is the log-odds (logit):
log(p / (1 − p)) = w·x + b. Each coefficient w_j tells you that a one-unit
increase in feature x_j multiplies the odds by e^{w_j}, holding other features fixed.
If w_age = 0.04, each additional year of age multiplies the odds
by e^{0.04} ≈ 1.04, a 4% increase in odds per year.
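The arithmetic behind the odds interpretation can be checked directly. This sketch reuses the 0.04 age coefficient from the text; the helper names and the baseline probability of 0.20 are assumptions chosen for illustration.

```python
import math


def odds(p: float) -> float:
    """Convert a probability into odds."""
    return p / (1.0 - p)


def prob_from_odds(o: float) -> float:
    """Convert odds back into a probability."""
    return o / (1.0 + o)


def logit(p: float) -> float:
    """Log-odds, the inverse of the sigmoid."""
    return math.log(odds(p))


def odds_multiplier(w_j: float, delta: float = 1.0) -> float:
    """Factor by which the odds change when x_j increases by delta."""
    return math.exp(w_j * delta)


w_age = 0.04
print(round(odds_multiplier(w_age), 3))        # about 1.041 per year
print(round(odds_multiplier(w_age, 10), 3))    # about 1.492 per decade

# The odds multiplier is constant, but the change in probability
# depends on the starting point (0.20 is an arbitrary baseline).
baseline = 0.20
shifted = prob_from_odds(odds(baseline) * odds_multiplier(w_age, 10))
print(round(shifted, 3))
```

Note that the multiplier applies to odds, not to the probability itself, so a 4% increase in odds is not a 4 percentage point increase in risk.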
This interpretability is why logistic regression remains popular in regulated industries: you can document which features push risk up or down and by how much, something opaque deep nets struggle to provide without post-hoc tools.
Training: maximum likelihood and log-loss
Unlike linear regression (which uses squared error), logistic regression is
trained with binary cross-entropy, also called
log-loss. For a single example with true label
y ∈ {0, 1} and predicted probability p:
L = −[y log(p) + (1 − y) log(1 − p)]
The loss penalizes confident wrong predictions heavily: with natural logs, predicting
p = 0.99 when y = 0 costs −log(0.01) ≈ 4.6, while the same prediction when y = 1
costs only about 0.01. Across the training set, we minimize the average log-loss,
which is equivalent to maximizing the likelihood of the observed labels under the model.
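A short sketch of the log-loss formula makes the asymmetry concrete. The clipping constant `eps` is an assumption added here so that log(0) is never evaluated; it is not part of the definition.

```python
import math


def log_loss_single(y: int, p: float, eps: float = 1e-15) -> float:
    """Binary cross-entropy for one example with label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def mean_log_loss(ys, ps) -> float:
    """Average log-loss over a set of labels and predicted probabilities."""
    return sum(log_loss_single(y, p) for y, p in zip(ys, ps)) / len(ys)


print(round(log_loss_single(0, 0.99), 3))  # confident and wrong: about 4.605
print(round(log_loss_single(1, 0.99), 3))  # confident and right: about 0.01
print(round(log_loss_single(1, 0.5), 3))   # uninformative guess: about 0.693
```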
Unlike ordinary least squares, there is no closed-form solution. Optimizers find
the weights iteratively with gradient descent, Newton methods, or L-BFGS. Libraries such as
scikit-learn's LogisticRegression default to an efficient solver (lbfgs)
with L2 regularization built in. Because the log-loss is convex, these solvers
typically converge quickly, even on tabular data with thousands of features and millions of rows.
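To illustrate iterative training, here is a minimal batch gradient descent sketch in pure Python. It is not how scikit-learn fits the model; the toy dataset, learning rate, and step count are arbitrary assumptions, and the only fact it relies on is that the gradient of the average log-loss with respect to w is the mean of (p − y)·x.

```python
import math


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def score(w, x, b):
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def mean_log_loss(X, y, w, b, eps=1e-15):
    total = 0.0
    for xi, yi in zip(X, y):
        p = min(max(sigmoid(score(w, xi, b)), eps), 1.0 - eps)
        total -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total / len(X)


def fit_logistic_gd(X, y, lr=0.1, n_steps=2000):
    """Plain batch gradient descent on the average log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(score(w, xi, b)) - yi  # (p - y)
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Toy one-feature dataset with overlapping classes (illustrative only).
X = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic_gd(X, y)
print("loss before:", round(mean_log_loss(X, y, [0.0], 0.0), 3))
print("loss after: ", round(mean_log_loss(X, y, w, b), 3))
```

The classes overlap on purpose: with perfectly separable data and no regularization, the weights would keep growing instead of settling at a finite optimum.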
Class imbalance and sample weights
When positives are rare (fraud, disease), the model may predict the majority class for almost every example at the default 0.5 threshold and still achieve a low average log-loss. Mitigations:
- Class weights — upweight minority-class errors in the loss.
- Resampling — oversample positives or undersample negatives (watch for leakage if duplicates appear in validation).
- Threshold tuning — pick a cutoff other than 0.5 using validation data and metrics like precision-recall rather than default accuracy.
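Two of these mitigations can be sketched in a few lines. The weighting formula n / (2 · n_class) is the "balanced" heuristic for two classes, and F1 is used as the tuning metric only as an example; the validation labels, scores, and candidate thresholds are invented for illustration.

```python
def balanced_class_weights(y):
    """Weights inversely proportional to class frequency (two classes)."""
    n = len(y)
    n_pos = sum(y)
    n_neg = n - n_pos
    return {0: n / (2.0 * n_neg), 1: n / (2.0 * n_pos)}


def f1_at_threshold(y_true, probs, threshold):
    """F1 score when predicting positive for probs >= threshold."""
    tp = sum(1 for yt, p in zip(y_true, probs) if p >= threshold and yt == 1)
    fp = sum(1 for yt, p in zip(y_true, probs) if p >= threshold and yt == 0)
    fn = sum(1 for yt, p in zip(y_true, probs) if p < threshold and yt == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(y_true, probs, candidates):
    """Pick the candidate cutoff with the highest F1 on validation data."""
    return max(candidates, key=lambda t: f1_at_threshold(y_true, probs, t))


# Hypothetical validation set: 2 positives out of 10.
y_val = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
p_val = [0.02, 0.05, 0.10, 0.08, 0.30, 0.12, 0.04, 0.35, 0.20, 0.45]

print(balanced_class_weights(y_val))
print(f1_at_threshold(y_val, p_val, 0.5))  # no positives flagged at 0.5
print(best_threshold(y_val, p_val, [0.1, 0.2, 0.3, 0.4, 0.5]))
```

In practice the metric should reflect the actual cost of false positives and false negatives, and the threshold should be chosen on validation data, not on the test set.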
Regularization: L1, L2, and the bias-variance tradeoff
With many correlated features, logistic regression can overfit: weights blow
up to fit noise. L2 regularization (ridge) adds
λ Σ w_j² to the loss, shrinking coefficients toward
zero without eliminating them. L1 regularization (lasso) adds
λ Σ |w_j| and can drive some weights exactly to zero,
performing automatic feature selection.
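The difference between shrinking and zeroing can be seen in a single update step. The sketch below assumes a plain gradient step for the L2 term and a proximal (soft-thresholding) step for the L1 term; real solvers differ in detail, and the learning rate and λ values are arbitrary.

```python
def l2_penalty(w, lam):
    """lambda * sum of squared weights."""
    return lam * sum(wj * wj for wj in w)


def l1_penalty(w, lam):
    """lambda * sum of absolute weights."""
    return lam * sum(abs(wj) for wj in w)


def l2_shrink(wj, lr, lam):
    """Gradient step on lam * wj**2 alone: scales the weight down."""
    return wj - lr * 2.0 * lam * wj


def l1_soft_threshold(wj, lr, lam):
    """Proximal step for lam * |wj|: moves toward zero and clips at zero."""
    t = lr * lam
    if wj > t:
        return wj - t
    if wj < -t:
        return wj + t
    return 0.0


lr, lam = 0.1, 0.5
for wj in (1.0, 0.03, -0.02):
    print(wj, "->", round(l2_shrink(wj, lr, lam), 4), "(L2)",
          round(l1_soft_threshold(wj, lr, lam), 4), "(L1)")
```

Small weights survive the L2 step as smaller nonzero values, while the L1 step sets them to exactly zero once they fall inside the threshold.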
The hyperparameter C in scikit-learn is the inverse of
regularization strength: large C means less penalty (a more complex
model). Tune C on a held-out validation set or via
cross-validation, never on the test set you report final metrics on.
Always scale numeric features (standardization or min-max) before training regularized logistic regression. Otherwise, features with larger raw ranges receive smaller effective penalties.
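A minimal standardization sketch, assuming a single numeric column and population standard deviation. The income values are invented; the point is only that the mean and standard deviation come from the training split and are then reused unchanged on validation data.

```python
import statistics


def fit_standardizer(train_column):
    """Learn mean and standard deviation from training data only."""
    mean = statistics.fmean(train_column)
    std = statistics.pstdev(train_column)
    return mean, (std if std > 0 else 1.0)  # guard against constant columns


def transform(column, mean, std):
    """Apply previously learned statistics to any split."""
    return [(v - mean) / std for v in column]


# Hypothetical raw incomes; the validation values never influence the fit.
train_income = [30000.0, 45000.0, 60000.0, 75000.0]
val_income = [52500.0, 90000.0]

mean, std = fit_standardizer(train_income)
train_scaled = transform(train_income, mean, std)
val_scaled = transform(val_income, mean, std)

print([round(v, 3) for v in train_scaled])
print([round(v, 3) for v in val_scaled])
```

After scaling, every feature is measured in standard deviations, so the same penalty strength means the same thing for each coefficient.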
Multiclass extensions: one-vs-rest and softmax
Binary logistic regression extends to K > 2 classes in two
common ways:
- One-vs-rest (OvR): train K separate binary classifiers, each distinguishing one class from all others. Simple and parallelizable; the per-class probabilities are not guaranteed to sum to 1 unless they are renormalized.
- Multinomial (softmax) logistic regression: one coefficient vector per class; softmax converts K scores into a proper probability distribution that sums to 1. Preferred when you need mutually exclusive class probabilities.
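scikit-learn's multi_class='multinomial' option with the
lbfgs or saga solver fits softmax directly. In scikit-learn 1.5 and later
the multi_class parameter is deprecated, and these solvers fit the multinomial
model by default when there are three or more classes. For
extreme class counts (thousands of labels), OvR or hierarchical approaches
may be more practical.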
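The softmax step itself is short. This sketch assumes the K class scores have already been computed as one linear score per class; subtracting the maximum before exponentiating is a standard trick to avoid overflow and does not change the result.

```python
import math


def softmax(scores):
    """Convert K real-valued scores into probabilities that sum to 1."""
    m = max(scores)  # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three classes.
probs = softmax([2.0, 0.5, -1.0])
print([round(p, 3) for p in probs], "sum =", round(sum(probs), 6))

# With two classes and the second score fixed at 0, softmax reduces
# to the binary sigmoid applied to the first score.
z = 2.0
print(round(softmax([z, 0.0])[0], 6), round(1.0 / (1.0 + math.exp(-z)), 6))
```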
Calibration: when predicted probabilities lie
Logistic regression outputs are often reasonably calibrated out of the box — if the model says 0.7, roughly 70% of those cases should be positive. Complex models (gradient-boosted trees, neural nets) frequently produce sharp scores that rank well but miscalibrate probabilities.
Check calibration with a reliability diagram: bin predictions by probability, plot mean predicted vs mean observed rate. If curves deviate from the diagonal, apply Platt scaling (fit a secondary logistic regression on validation scores) or isotonic regression. Calibration matters when probabilities drive decisions — loan pricing, insurance premiums, or budget allocation — not just ranking.
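The numbers behind a reliability diagram can be computed without a plotting library. This sketch assumes equal-width probability bins and skips empty ones; the labels and scores are invented, and plotting the resulting pairs against the diagonal is left out.

```python
def reliability_bins(y_true, probs, n_bins=5):
    """Return (mean predicted, observed positive rate, count) per bin."""
    bins = [[] for _ in range(n_bins)]
    for yt, p in zip(y_true, probs):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, yt))

    rows = []
    for members in bins:
        if not members:
            continue  # nothing to report for an empty bin
        mean_pred = sum(p for p, _ in members) / len(members)
        observed = sum(yt for _, yt in members) / len(members)
        rows.append((mean_pred, observed, len(members)))
    return rows


# Hypothetical validation labels and predicted probabilities.
y_val = [0, 0, 1, 1, 1, 0]
p_val = [0.10, 0.15, 0.90, 0.85, 0.95, 0.50]

for mean_pred, observed, count in reliability_bins(y_val, p_val):
    print(f"predicted {mean_pred:.2f}  observed {observed:.2f}  n={count}")
```

A well-calibrated model produces rows where the predicted and observed columns roughly match; bins with very few examples are noisy and should be read with caution.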
Logistic regression vs other models
| Model | Strengths | Weaknesses |
|---|---|---|
| Logistic regression | Fast, interpretable coefficients, decent calibration, strong tabular baseline | Linear decision boundary only; weak on raw images/text without features |
| Gradient-boosted trees | Handles nonlinearities and interactions; often wins Kaggle tabular competitions | Slower inference; less interpretable; may need calibration |
| Neural networks | Flexible on unstructured data (vision, language) | Needs more data and compute; opaque without explanation tools |
| Naive Bayes | Very fast training; good text baseline with bag-of-words | Strong independence assumptions; often miscalibrated |
A practical workflow: start with logistic regression on properly engineered features, measure ROC-AUC and calibration, then try gradient boosting if you need more accuracy and can sacrifice some interpretability.
Common mistakes
- Using accuracy on imbalanced data — a model that never flags fraud can score 99% accuracy. Report precision, recall, F1, and AUC.
- Data leakage in features — including post-outcome columns (e.g. "chargeback filed") makes coefficients look miraculous and fails in production.
- Forgetting to scale — regularization and convergence suffer when features have different units.
- Extrapolating coefficients causally — logistic regression shows association, not causation, unless you designed a randomized experiment.
- Ignoring multicollinearity — correlated features inflate coefficient variance; consider PCA, dropping collinear columns, or L2 penalty.
- Deploying without monitoring — population drift changes calibration; retrain or recalibrate when feature distributions shift.
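The first mistake in the list is easy to reproduce. The sketch below builds a synthetic dataset with a 1% positive rate (an assumption chosen to match the 99% figure) and scores a model that never predicts the positive class.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def recall(y_true, y_pred):
    """Fraction of actual positives that were flagged."""
    positives = sum(y_true)
    if positives == 0:
        return 0.0
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return tp / positives


# Synthetic labels: 10 fraud cases among 1000 transactions.
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000  # a "model" that never flags fraud

print("accuracy:", accuracy(y_true, y_pred))  # 0.99
print("recall:  ", recall(y_true, y_pred))    # 0.0
```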
Production checklist
- Split data train / validation / test with temporal or group constraints if leakage is possible.
- Engineer and scale features; fit scaler on training data only.
- Tune regularization C via cross-validation on training data.
- Handle class imbalance with weights or resampling; pick threshold on validation set for your cost matrix.
- Evaluate AUC, precision-recall, and calibration on held-out test data.
- Document coefficient signs and magnitudes for stakeholders.
- Serialize model + scaler + feature list as a versioned artifact.
- Monitor score distribution and outcome rates in production; alert on drift.
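As one possible shape for the versioned artifact in the checklist, the sketch below bundles weights, scaler statistics, and the feature list into a single JSON document. The field names, the example values, and the version string are all hypothetical; a real deployment might use a different format or a model registry.

```python
import json


def dump_artifact(weights, bias, scaler_means, scaler_stds,
                  feature_names, version):
    """Serialize model, scaler, and feature list as one JSON string."""
    if not (len(weights) == len(scaler_means)
            == len(scaler_stds) == len(feature_names)):
        raise ValueError("weights, scaler statistics, and feature names must align")
    payload = {
        "version": version,
        "feature_names": list(feature_names),
        "weights": list(weights),
        "bias": bias,
        "scaler_means": list(scaler_means),
        "scaler_stds": list(scaler_stds),
    }
    return json.dumps(payload, sort_keys=True)


def load_artifact(text):
    """Parse an artifact and check that its pieces still line up."""
    payload = json.loads(text)
    if len(payload["weights"]) != len(payload["feature_names"]):
        raise ValueError("artifact is inconsistent")
    return payload


# Hypothetical model with two features.
artifact_text = dump_artifact(
    weights=[0.04, -1.2],
    bias=-0.3,
    scaler_means=[41.0, 0.35],
    scaler_stds=[12.0, 0.2],
    feature_names=["age", "utilization"],
    version="example-001",
)
restored = load_artifact(artifact_text)
print(restored["version"], restored["feature_names"])
```

Keeping the feature order inside the artifact guards against a common production bug in which columns arrive in a different order than the one used in training.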
Key takeaways
- Logistic regression maps linear scores through a sigmoid to produce class probabilities with interpretable log-odds coefficients.
- Log-loss training penalizes confident mistakes and is equivalent to maximum likelihood for Bernoulli labels.
- L1/L2 regularization controls complexity; always scale features and tune C with proper validation.
- Multiclass uses OvR or softmax; choose based on whether you need a proper probability distribution.
- Calibration and thresholds matter as much as AUC when probabilities drive business decisions.
Related reading
- Machine learning fundamentals explained — supervised learning, train/test splits, and the bridge from classical models to deep learning
- Loss functions explained — binary cross-entropy, label smoothing, and why the objective must match your task
- ROC-AUC explained — threshold-independent ranking metrics for binary classifiers
- Overfitting and cross-validation explained — k-fold validation and regularization for reliable model selection