Linear regression explained
A rental platform wants to estimate monthly rent from square footage, bedroom
count, and neighborhood. A marketing team forecasts revenue from ad spend.
Both problems share the same structure: predict a continuous
number from measurable inputs. Linear regression
models that relationship as a weighted sum of features plus an intercept —
the simplest supervised learner and still one of the most useful. This guide
covers the least-squares objective and closed-form OLS solution, gradient
descent as an alternative, interpreting coefficients and
R², evaluation metrics (RMSE, MAE), Ridge and Lasso
regularization, polynomial and interaction terms, classical assumption
checks, a housing-price worked example, a model-selection decision table,
common pitfalls, and a practitioner checklist. For the classification sibling,
see logistic regression.
The model
Given features x₁, x₂, …, xₙ and a continuous target
y, linear regression predicts:
ŷ = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ
In matrix form: ŷ = Xβ, where X is the design
matrix (rows = observations, columns = features plus a column of ones for
the intercept) and β is the coefficient vector. The model
assumes the true relationship is approximately linear in the features you
provide, which is why feature engineering (log transforms, polynomials,
interactions) often matters more than switching to a fancier algorithm.
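As a minimal sketch of the matrix form, the snippet below builds a design matrix with a leading column of ones and evaluates ŷ = Xβ. The helper names (`add_intercept`, `predict`), the listings, and the coefficient values are all made up for illustration.

```python
import numpy as np

def add_intercept(features: np.ndarray) -> np.ndarray:
    """Prepend a column of ones so that beta[0] acts as the intercept."""
    ones = np.ones((features.shape[0], 1))
    return np.hstack([ones, features])

def predict(features: np.ndarray, beta: np.ndarray) -> np.ndarray:
    """Return y_hat = X @ beta for a coefficient vector [b0, b1, ..., bn]."""
    return add_intercept(features) @ beta

# Hypothetical listings: square feet and bedroom count (illustrative values).
features = np.array([[650.0, 1.0], [900.0, 2.0], [1200.0, 3.0]])
# Made-up coefficients: intercept, dollars per square foot, dollars per bedroom.
beta = np.array([500.0, 1.5, 100.0])

y_hat = predict(features, beta)
print(y_hat)
```

Each prediction is just the intercept plus the weighted features for that row, which is why the same code works for any number of columns.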
Training finds coefficients that minimize prediction error on historical data. Unlike logistic regression, there is no sigmoid: the output is an unbounded real number. That makes linear regression ideal for price, demand, temperature, latency, and any other quantity where "probability" is not the right framing.
Ordinary least squares (OLS)
The standard objective is the sum of squared residuals (equivalently, the mean squared error once divided by the number of observations):
min Σ (yᵢ − ŷᵢ)²
Squaring penalizes large errors disproportionately and yields a smooth,
convex loss with a unique global minimum (when X has full
column rank). The closed-form solution — when you can compute it — is:
β = (XᵀX)⁻¹ Xᵀy
Libraries like scikit-learn's LinearRegression solve this via
LAPACK (SVD-based) for numerical stability. For very wide or sparse
matrices, iterative methods (coordinate descent, stochastic gradient
descent) scale better than materializing XᵀX.
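The closed-form solution can be checked numerically. The sketch below uses synthetic data (an assumption, not data from this guide) and compares the normal equations with NumPy's SVD-based `np.linalg.lstsq`. It solves the linear system rather than forming the inverse explicitly, which is usually the safer choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic data: three features plus an intercept column of ones.
X_raw = rng.normal(size=(n, 3))
X = np.column_stack([np.ones(n), X_raw])
true_beta = np.array([2.0, 0.5, -1.0, 3.0])  # made-up ground truth
y = X @ true_beta + rng.normal(scale=0.1, size=n)

# Normal equations: solve (X^T X) beta = X^T y without computing the inverse.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# SVD-based least squares, which stays stable when X^T X is ill-conditioned.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_normal)
print(beta_lstsq)
```

On well-conditioned data the two estimates agree to numerical precision; they can diverge when columns are nearly collinear.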
Gradient descent updates coefficients in the direction that reduces MSE: compute the residual vector r = Xβ − y, form the gradient (2/n)Xᵀr, and step against it scaled by a learning rate. Mini-batch SGD handles datasets that do not fit in memory. Because the loss is convex, gradient descent converges to the same solution as closed-form OLS given enough iterations and a suitable learning rate (assuming X has full column rank).
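A minimal full-batch gradient descent sketch, assuming synthetic data and a fixed learning rate chosen by hand for this example. The function name `fit_gradient_descent` is illustrative. It compares the result with the least-squares solution.

```python
import numpy as np

def fit_gradient_descent(X, y, learning_rate=0.1, n_iters=2000):
    """Minimize mean squared error with plain full-batch gradient descent."""
    n_samples, n_cols = X.shape
    beta = np.zeros(n_cols)
    for _ in range(n_iters):
        residuals = X @ beta - y                          # r = X beta - y
        gradient = (2.0 / n_samples) * (X.T @ residuals)  # d(MSE)/d(beta)
        beta -= learning_rate * gradient
    return beta

rng = np.random.default_rng(1)
X_raw = rng.normal(size=(300, 2))
X = np.column_stack([np.ones(300), X_raw])  # intercept column plus two features
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=0.2, size=300)

beta_gd = fit_gradient_descent(X, y)
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_gd, beta_ols)
```

The features here are already on a unit scale. With unscaled features the same learning rate may diverge or converge very slowly, which is one practical reason to standardize first.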
Interpreting coefficients
Each βⱼ answers: holding all other features fixed, how
much does y change when xⱼ increases by one
unit? That ceteris paribus interpretation is linear regression's
superpower — stakeholders can read coefficients like a spreadsheet formula.
Caveats apply immediately:
- Scale matters. A coefficient on "income in dollars" is tiny; on "income in thousands" it is 1,000 times larger. Standardize features when comparing relative importance or when using regularization.
- Correlation is not causation. A positive coefficient on ice-cream sales in a model of drowning deaths reflects summer weather, not causality.
- Multicollinearity inflates variance. When two features move together (square footage and bedroom count), individual coefficients become unstable even if joint predictions remain accurate. Check variance inflation factors (VIF) or use feature selection to drop redundant columns.
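Variance inflation factors can be computed directly from their definition, VIFⱼ = 1 / (1 − Rⱼ²), where Rⱼ² comes from regressing feature j on the remaining features. The sketch below is one way to do that with NumPy; the function name and the synthetic columns are assumptions for illustration.

```python
import numpy as np

def variance_inflation_factors(features):
    """VIF for each column: 1 / (1 - R^2) from regressing it on the others."""
    n_samples, n_features = features.shape
    vifs = np.empty(n_features)
    for j in range(n_features):
        target = features[:, j]
        others = np.delete(features, j, axis=1)
        design = np.column_stack([np.ones(n_samples), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        vifs[j] = 1.0 / (1.0 - r_squared)
    return vifs

rng = np.random.default_rng(2)
sqft = rng.normal(1000, 300, size=500)
bedrooms = sqft / 400 + rng.normal(0, 0.3, size=500)  # built to track sqft
transit_km = rng.normal(2.0, 1.0, size=500)           # independent of the others

vifs = variance_inflation_factors(np.column_stack([sqft, bedrooms, transit_km]))
print(vifs)
```

In this synthetic setup the two correlated columns get clearly elevated VIFs while the independent column stays near 1. Where to draw the line for "too high" is a judgment call.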
The intercept β₀ is the predicted y when all
features are zero — often meaningless (zero bedrooms and zero square feet
is not a real house) but necessary for the line to pass through the data
cloud correctly.
Evaluation metrics
Fit quality is measured on held-out data, never training residuals alone — see cross-validation for why.
- RMSE (root mean squared error): square root of the average squared residual. Same units as y; penalizes large misses. Default choice for regression.
- MAE (mean absolute error): average absolute residual. More robust to outliers than RMSE; easier to explain ("off by $120 on average").
- R² (coefficient of determination): fraction of variance in y explained by the model. For OLS with an intercept it lies between 0 and 1 on training data, but it can be negative on test data if the model is worse than predicting the mean. High R² on training data with low test R² signals overfitting.
- Adjusted R²: penalizes each added feature for the degree of freedom it uses, so it rises only when a new feature improves in-sample fit enough to offset the extra parameter. Prefer it over plain R² when comparing models with different feature counts.
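The four metrics above can be computed in a few lines. This is a sketch with made-up rents and predictions; `regression_metrics` is a hypothetical helper, and the adjusted R² uses the usual formula 1 − (1 − R²)(n − 1)/(n − p − 1) for p features.

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_features):
    """Return RMSE, MAE, R^2, and adjusted R^2 for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    n = y_true.size
    rmse = np.sqrt(np.mean(residuals ** 2))
    mae = np.mean(np.abs(residuals))
    ss_res = np.sum(residuals ** 2)                  # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # variation around the mean
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    return {"rmse": rmse, "mae": mae, "r2": r2, "adj_r2": adj_r2}

# Made-up monthly rents and predictions, in dollars.
y_true = [1000.0, 1500.0, 2000.0, 2500.0, 3000.0]
y_pred = [1100.0, 1400.0, 2000.0, 2700.0, 2900.0]
metrics = regression_metrics(y_true, y_pred, n_features=2)
print(metrics)
```

Note that RMSE (about 118) exceeds MAE (100) here because the single $200 miss is weighted more heavily once squared.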
Report confidence intervals on coefficients (via bootstrap or analytical standard errors) when decisions depend on whether a coefficient is significantly different from zero — the same logic as hypothesis testing in experimental analysis.
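One way to get bootstrap intervals is to resample rows with replacement, refit, and take percentiles of the refitted coefficients. The sketch below assumes synthetic data in which the second feature has no true effect; the function name and the 95% level are illustrative choices.

```python
import numpy as np

def bootstrap_coef_intervals(X, y, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap intervals for OLS coefficients (rows resampled)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    draws = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # sample row indices with replacement
        coef, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        draws[b] = coef
    lower, upper = np.percentile(
        draws, [100 * alpha / 2, 100 * (1 - alpha / 2)], axis=0
    )
    return lower, upper

rng = np.random.default_rng(5)
n = 400
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = 1.0 + 2.0 * x1 + rng.normal(size=n)  # x2 has no true effect by construction

lower, upper = bootstrap_coef_intervals(X, y)
for name, lo, hi in zip(["intercept", "x1", "x2"], lower, upper):
    print(f"{name}: [{lo:.3f}, {hi:.3f}]")
```

An interval that excludes zero, as the one for `x1` should here, is the bootstrap analogue of a significant coefficient.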
Regularization: Ridge, Lasso, and Elastic Net
Unregularized OLS can overfit when you have many features relative to samples, or when features are collinear. Regularization adds a penalty on coefficient magnitude:
- Ridge (L2): adds λ Σ βⱼ². Shrinks all coefficients toward zero but rarely to exactly zero. Handles multicollinearity gracefully by distributing weight across correlated features.
- Lasso (L1): adds λ Σ |βⱼ|. Can zero out coefficients entirely, which gives built-in feature selection. Unstable when features are highly correlated (picks one arbitrarily).
- Elastic Net: combines L1 and L2. Best of both when you have grouped correlated features and want sparsity.
The hyperparameter λ (or α in scikit-learn)
controls penalty strength. Tune it with cross-validation on a log-spaced
grid — never on the final test set. Regularization interacts with feature
scaling: always standardize before Ridge/Lasso.
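The sketch below puts those pieces together for Ridge: standardize using training statistics only, fit the closed-form solution (XᵀX + λI)⁻¹Xᵀy with an unpenalized intercept, and pick λ by 5-fold cross-validation on a log-spaced grid. The data is synthetic and all helper names are assumptions. Libraries may scale the penalty differently, so λ values are not directly comparable across implementations.

```python
import numpy as np

def standardize(train, other):
    """Scale both arrays using the mean and std of the training array only."""
    mean, std = train.mean(axis=0), train.std(axis=0)
    return (train - mean) / std, (other - mean) / std

def fit_ridge(X, y, lam):
    """Closed-form ridge on centered features; the intercept is not penalized."""
    intercept = y.mean()
    penalty = lam * np.eye(X.shape[1])
    coef = np.linalg.solve(X.T @ X + penalty, X.T @ (y - intercept))
    return intercept, coef

def cv_rmse(X, y, lam, n_folds=5, seed=0):
    """Average validation RMSE over shuffled K folds for one penalty value."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = []
    for k in range(n_folds):
        val_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        X_tr, X_val = standardize(X[train_idx], X[val_idx])
        intercept, coef = fit_ridge(X_tr, y[train_idx], lam)
        pred = intercept + X_val @ coef
        errors.append(np.sqrt(np.mean((y[val_idx] - pred) ** 2)))
    return float(np.mean(errors))

# Synthetic data with two nearly collinear columns and one independent column.
rng = np.random.default_rng(3)
n = 200
base = rng.normal(size=n)
X = np.column_stack([
    base + rng.normal(scale=0.1, size=n),
    base + rng.normal(scale=0.1, size=n),
    rng.normal(size=n),
])
y = 3.0 * base + 1.5 * X[:, 2] + rng.normal(scale=1.0, size=n)

lambdas = np.logspace(-3, 3, 13)  # log-spaced grid of penalty strengths
scores = [cv_rmse(X, y, lam) for lam in lambdas]
best_lam = float(lambdas[int(np.argmin(scores))])

# Shrinkage check: a stronger penalty gives a smaller coefficient vector.
X_std, _ = standardize(X, X)
_, coef_weak = fit_ridge(X_std, y, lam=1e-3)
_, coef_strong = fit_ridge(X_std, y, lam=1e3)
print(best_lam, coef_weak, coef_strong)
```

Standardizing inside each fold, rather than once on the full dataset, keeps validation rows from leaking into the scaling statistics.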
Beyond straight lines
"Linear" refers to linearity in coefficients, not necessarily in raw inputs. Common extensions:
- Polynomial features: add x², x³, or cross-terms like x₁x₂ to capture curvature and interaction effects while staying in the linear-regression framework.
- Log transforms: model log(y) when the target is right-skewed (prices, counts). Coefficients then approximate percentage changes.
- Dummy variables: encode categories (neighborhood, product line) as 0/1 columns. One category is dropped to avoid perfect multicollinearity (the dummy-variable trap).
Each added term increases model flexibility and overfitting risk. Use
adjusted R², cross-validated RMSE, or regularization to keep
complexity honest.
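To illustrate "linear in coefficients", the sketch below fits an exactly quadratic curve with ordinary least squares by adding an x² column, then shows hypothetical helpers for squared and interaction terms and for dummy encoding with one category dropped. All names and values are illustrative.

```python
import numpy as np

def expand_features(x1, x2):
    """Columns: x1, x2, x1^2, x2^2, and the interaction x1*x2."""
    return np.column_stack([x1, x2, x1 ** 2, x2 ** 2, x1 * x2])

def one_hot_drop_first(labels):
    """0/1 columns for every category except the first (sorted) one."""
    categories = sorted(set(labels))
    kept = categories[1:]  # the dropped category becomes the baseline
    columns = np.array(
        [[float(label == cat) for cat in kept] for label in labels]
    )
    return columns, kept

# A quadratic relationship is still a linear model in (1, x, x^2).
x = np.linspace(-2.0, 2.0, 20)
y = 1.0 + 2.0 * x + 3.0 * x ** 2
design = np.column_stack([np.ones_like(x), x, x ** 2])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)

expanded = expand_features(np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, 1.0]))
dummies, kept = one_hot_drop_first(["north", "south", "east", "south"])
print(beta, expanded, dummies, kept, sep="\n")
```

Here "east" is the baseline, so the coefficients on the "north" and "south" columns would be read as differences relative to east.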
Classical assumptions
OLS inference (p-values on coefficients, confidence intervals) rests on assumptions worth checking on residuals:
- Linearity: residuals vs fitted values should show no systematic curve. Try polynomial terms or splines if you see a pattern.
- Independence: observations should not influence each other. Time-series and spatial data violate this — use specialized models.
- Homoscedasticity: residual spread should be constant across fitted values. A funnel shape suggests weighted least squares or a log transform.
- Normality of residuals: needed for exact p-values in small samples; less critical with large n thanks to the central limit theorem.
Violating assumptions does not always ruin predictions — OLS can still minimize squared error — but it undermines statistical inference and may produce miscalibrated uncertainty estimates.
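A residual plot is the usual check, but a rough numeric signal for a funnel shape is the correlation between fitted values and absolute residuals. The sketch below is a heuristic on synthetic data, not a formal test such as Breusch-Pagan; the function name is an assumption.

```python
import numpy as np

def residual_spread_trend(X, y):
    """Correlation between fitted values and |residuals| after an OLS fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    residuals = y - fitted
    return float(np.corrcoef(fitted, np.abs(residuals))[0, 1])

rng = np.random.default_rng(6)
n = 2000
x = rng.uniform(1.0, 10.0, size=n)
X = np.column_stack([np.ones(n), x])

# Constant noise: residual spread should not depend on the fitted value.
y_const = 3.0 + 2.0 * x + rng.normal(scale=1.0, size=n)
# Noise that grows with x: the classic funnel shape.
y_funnel = 3.0 + 2.0 * x + rng.normal(scale=1.0, size=n) * 0.5 * x

trend_const = residual_spread_trend(X, y_const)
trend_funnel = residual_spread_trend(X, y_funnel)
print(trend_const, trend_funnel)
```

A value near zero is consistent with constant spread; a clearly positive value suggests the spread grows with the prediction and that a log transform or weighted least squares may be worth trying.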
Worked example: predicting rent
Suppose you have 8,000 rental listings with features: square feet, bedrooms, bathrooms, distance to transit (km), and a neighborhood dummy. Target: monthly rent in dollars.
- Split: 80/20 train/test; set aside another fold for hyperparameter tuning if using Ridge.
- Clean: drop rows with missing values or impute medians; cap extreme outliers (a 50,000 sq ft listing is likely data entry error).
- Engineer: log-transform square feet; add a bedrooms × bathrooms interaction if studio vs family units behave differently.
- Fit: start with OLS baseline. Report test RMSE ($187) and R² (0.72).
- Diagnose: residual plot shows heteroscedasticity at high rents; refit with log(rent) as target; RMSE on log scale improves; back-transform predictions with bias correction.
- Regularize: Ridge with α = 1.2 via 5-fold CV shaves 3% off test RMSE when you add 40 neighborhood dummies.
- Interpret: in the log(rent) model, a coefficient on log-sqft ≈ 0.85 means a 10% larger unit associates with ~8.5% higher rent, holding other features fixed.
Ship the Ridge model if it wins on CV; keep OLS coefficients in a dashboard for stakeholder transparency.
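The log-log interpretation and the back-transform step can be sketched on synthetic data. The numbers below are generated, not the 8,000 listings from the example, and the bias correction shown is Duan's smearing estimator, which is one common choice; the guide does not say which correction it has in mind.

```python
import numpy as np

# Synthetic listings generated with a known elasticity of 0.85 (an assumption).
rng = np.random.default_rng(4)
n = 5000
sqft = rng.uniform(400, 2500, size=n)
log_rent = 1.5 + 0.85 * np.log(sqft) + rng.normal(scale=0.2, size=n)
rent = np.exp(log_rent)

# Fit log(rent) on log(sqft) with an intercept.
X = np.column_stack([np.ones(n), np.log(sqft)])
beta, *_ = np.linalg.lstsq(X, log_rent, rcond=None)
elasticity = beta[1]

# Exact effect of a 10% larger unit: 1.10 ** elasticity - 1 (about 8.4%).
pct_change_for_10pct = 1.10 ** elasticity - 1.0

# Naive back-transform exp(fitted) underestimates the mean rent;
# Duan's smearing factor rescales by the average of exp(residual).
residuals = log_rent - X @ beta
smearing_factor = np.mean(np.exp(residuals))
naive_pred = np.exp(X @ beta)
corrected_pred = naive_pred * smearing_factor

print(elasticity, pct_change_for_10pct, smearing_factor)
```

The "multiply the coefficient by the percentage" shortcut (0.85 × 10% = 8.5%) is an approximation that works well for small changes; the exact figure comes from the power expression above.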
When to use linear regression
| Scenario | Linear regression | Alternative |
|---|---|---|
| Interpretable coefficients required | Strong fit | Shallow decision tree (less stable) |
| Small tabular dataset (< 10k rows) | Strong baseline | Gradient boosting if nonlinear |
| Binary or categorical target | Wrong tool | Logistic regression or classifier |
| Heavy nonlinear interactions, images, text | Weak alone | Neural nets, gradient boosting |
| Many correlated features, need sparsity | Lasso / Elastic Net | PCA + regression |
| Extrapolation beyond training range | Dangerous | Constrain predictions or use domain bounds |
Common pitfalls
- Data leakage: including a feature derived from the target (e.g. "days until sold" when predicting sale price) inflates R² artificially.
- Extrapolation: linear models confidently predict nonsense outside the training range. Clamp or flag out-of-distribution inputs.
- Ignoring outliers: a single bad row can swing OLS coefficients. Use robust regression (Huber) or winsorize extremes.
- Chasing training R²: adding features never lowers in-sample R². Judge on held-out RMSE.
- Unscaled regularization: Ridge/Lasso penalties are not comparable across features on different scales.
- Confusing coefficient magnitude with importance: a small coefficient on a high-variance feature may matter more than a large coefficient on a rarely-varying column.
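For the extrapolation pitfall, one simple guard is to record per-feature minimums and maximums at training time and then flag or clamp incoming rows that fall outside them. The sketch below uses hypothetical helper names and made-up listings. A per-feature range check will not catch unusual combinations of individually in-range values.

```python
import numpy as np

def fit_feature_bounds(train_features):
    """Record the per-feature minimum and maximum seen during training."""
    return train_features.min(axis=0), train_features.max(axis=0)

def flag_and_clamp(features, lower, upper):
    """Return clamped features and a per-row flag for out-of-range inputs."""
    out_of_range = np.any((features < lower) | (features > upper), axis=1)
    return np.clip(features, lower, upper), out_of_range

# Made-up training listings: square feet and bedroom count.
train = np.array([[400.0, 1.0], [900.0, 2.0], [2500.0, 4.0]])
lower, upper = fit_feature_bounds(train)

# The second row mimics a data-entry error like a 50,000 sq ft listing.
new_rows = np.array([[1000.0, 2.0], [50000.0, 3.0]])
clamped, flags = flag_and_clamp(new_rows, lower, upper)
print(clamped)
print(flags)
```

Whether to clamp, reject, or route flagged rows for manual review depends on the application; the flag itself is the important part.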
Practitioner checklist
- Confirm target is continuous and roughly linear in chosen features.
- Split train/validation/test; never tune on test data.
- Engineer transforms (log, polynomial, interactions) before abandoning linear models.
- Check residual plots for nonlinearity and heteroscedasticity.
- Inspect VIF or correlation matrix for multicollinearity.
- Standardize features before Ridge, Lasso, or comparing coefficient magnitudes.
- Cross-validate regularization strength on a log-spaced grid.
- Report test RMSE and MAE with confidence intervals via bootstrap.
- Document coefficient interpretations for stakeholders.
- Monitor production drift — refit when feature distributions shift.
Key takeaways
- Linear regression predicts continuous targets as a weighted sum of features — fast, interpretable, and often hard to beat on tabular data.
- OLS minimizes squared error with a closed-form solution; Ridge and Lasso add penalties for high-dimensional or collinear settings.
- Coefficients are easiest to act on when features are on comparable scales, not strongly collinear, and causally defensible.
- Judge models on held-out RMSE, not training R².
- Feature engineering and regularization matter more than switching to a black-box model when the relationship is approximately linear.
Related reading
- Logistic regression explained — sigmoid mapping, log-loss, and binary classification
- Feature engineering explained — transforms and encodings that make linear models work
- Overfitting and cross-validation explained — honest evaluation and hyperparameter tuning
- Bias-variance tradeoff explained — why simple models sometimes generalize better