
Linear regression explained

A rental platform wants to estimate monthly rent from square footage, bedroom count, and neighborhood. A marketing team forecasts revenue from ad spend. Both problems share the same structure: predict a continuous number from measurable inputs. Linear regression models that relationship as a weighted sum of features plus an intercept. It is one of the simplest supervised learners and still one of the most useful. This guide covers the least-squares objective and closed-form OLS solution, gradient descent as an alternative, interpreting coefficients and R², evaluation metrics (RMSE, MAE), Ridge, Lasso, and Elastic Net regularization, polynomial and interaction terms, classical assumption checks, a rent-prediction worked example, a model-selection decision table, common pitfalls, and a practitioner checklist. For the classification sibling, see logistic regression.

The model

Given features x₁, x₂, …, xₙ and a continuous target y, linear regression predicts:

ŷ = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ

In matrix form: ŷ = Xβ, where X is the design matrix (rows = observations, columns = features plus a column of ones for the intercept) and β is the coefficient vector. The model assumes the true relationship is approximately linear in the features you provide — which is why feature engineering (log transforms, polynomials, interactions) often matters more than switching to a fancier algorithm.
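
As a small illustration of the matrix form, the sketch below builds a design matrix with a leading column of ones and computes ŷ = Xβ with NumPy. The listings and coefficient values are made up for the example, not taken from any dataset.

```python
import numpy as np

# Three hypothetical listings: [square feet, bedrooms]
features = np.array([[650.0, 1.0],
                     [900.0, 2.0],
                     [1200.0, 3.0]])

# Prepend a column of ones so the first coefficient acts as the intercept.
X = np.column_stack([np.ones(len(features)), features])

# Made-up coefficients: intercept, dollars per square foot, dollars per bedroom.
beta = np.array([500.0, 1.5, 200.0])

# One matrix-vector product yields a prediction for every row.
y_hat = X @ beta
print(y_hat)
```

Each prediction is simply the intercept plus the feature values times their coefficients, which is why the model can be read like a spreadsheet formula.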

Training finds coefficients that minimize prediction error on historical data. Unlike logistic regression, there is no sigmoid: the output is an unbounded real number. That makes linear regression ideal for price, demand, temperature, latency, and any other quantity where "probability" is not the right framing.

Ordinary least squares (OLS)

The standard objective is the sum of squared residuals (dividing by the number of observations gives the mean squared error, which has the same minimizer):

min Σ (yᵢ − ŷᵢ)²

Squaring penalizes large errors disproportionately and yields a smooth, convex loss with a unique global minimum (when X has full column rank). The closed-form solution — when you can compute it — is:

β = (XᵀX)⁻¹ Xᵀy

Libraries like scikit-learn's LinearRegression do not invert XᵀX explicitly; they solve the least-squares problem with an SVD-based LAPACK routine for numerical stability. For very wide or sparse matrices, iterative methods (coordinate descent, stochastic gradient descent) scale better than materializing XᵀX.
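
A minimal sketch comparing the two routes on synthetic data: the normal equations solved as a linear system, and NumPy's SVD-based least-squares routine. The data-generating coefficients and noise level are assumptions chosen for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
features = rng.normal(size=(n, 2))
X = np.column_stack([np.ones(n), features])   # intercept column plus two features
true_beta = np.array([3.0, 2.0, -1.0])        # assumed values for the demo
y = X @ true_beta + rng.normal(scale=0.1, size=n)

# Normal equations: solve (X^T X) beta = X^T y as a linear system
# instead of forming the inverse explicitly.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# SVD-based least squares; better behaved when columns are nearly collinear.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_normal)
print(beta_lstsq)
```

On a well-conditioned problem like this one the two answers agree to many decimal places; the difference shows up when XᵀX is close to singular.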

Gradient descent updates coefficients in the direction that reduces MSE: compute the residual vector, form the gradient of the loss with respect to β (proportional to Xᵀ(Xβ − y)), and step against it, scaled by a learning rate. Mini-batch SGD handles datasets that do not fit in memory. Because the loss is convex, gradient descent reaches the same coefficients as the closed-form solution, given enough iterations and a suitable learning rate.
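
The update rule can be sketched in a few lines of NumPy. This is a plain full-batch version under simple assumptions (features already on a similar scale, a fixed learning rate of 0.1, synthetic data); the function name fit_gd is hypothetical.

```python
import numpy as np

def fit_gd(X, y, lr=0.1, n_iters=2000):
    """Full-batch gradient descent on mean squared error."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iters):
        residual = X @ beta - y                 # prediction minus target
        grad = (2.0 / n) * (X.T @ residual)     # gradient of MSE w.r.t. beta
        beta -= lr * grad                       # step against the gradient
    return beta

rng = np.random.default_rng(1)
n = 300
features = rng.normal(size=(n, 2))              # roughly unit scale already
X = np.column_stack([np.ones(n), features])
y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(scale=0.2, size=n)

beta_gd = fit_gd(X, y)
beta_exact, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_gd, beta_exact)
```

If the features were on very different scales, the same learning rate could diverge or crawl, which is one practical reason to standardize before using iterative solvers.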

Interpreting coefficients

Each βⱼ answers: holding all other features fixed, how much does y change when xⱼ increases by one unit? That ceteris paribus interpretation is linear regression's superpower — stakeholders can read coefficients like a spreadsheet formula.

Caveats apply immediately:

Scale matters. A coefficient on "income in dollars" is tiny; on "income in thousands" it is 1,000 times larger. Standardize features when comparing relative importance or when using regularization.
Correlation is not causation. A positive coefficient on ice-cream sales in a model of drowning deaths reflects summer weather driving both, not ice cream causing drownings.
Multicollinearity inflates variance. When two features move together (square footage and bedroom count), individual coefficients become unstable even if joint predictions remain accurate. Check variance inflation factors (VIF) or use feature selection to drop redundant columns.
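
The VIF check mentioned above can be sketched directly from its definition: regress each feature on the others and compute 1 / (1 − R²). The helper name vif and the synthetic listings are assumptions for illustration; bedroom count is deliberately generated to track square footage.

```python
import numpy as np

def vif(features):
    """Return one variance inflation factor per column (no intercept column)."""
    n, p = features.shape
    out = np.empty(p)
    for j in range(p):
        target = features[:, j]
        others = np.delete(features, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(2)
sqft = rng.normal(1000, 250, size=500)
bedrooms = sqft / 400 + rng.normal(0, 0.3, size=500)   # strongly tied to sqft
transit_km = rng.normal(2, 1, size=500)                # unrelated to the others

vifs = vif(np.column_stack([sqft, bedrooms, transit_km]))
print(vifs)
```

The two related columns get noticeably inflated values while the independent one stays near 1. Common rules of thumb treat values above roughly 5 or 10 as worth investigating, though the cutoff is a convention rather than a hard rule.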

The intercept β₀ is the predicted y when all features are zero. That value is often meaningless (zero bedrooms and zero square feet is not a real house), but the term is still necessary: without it, the fitted line is forced through the origin.

Evaluation metrics

Fit quality is measured on held-out data, never training residuals alone — see cross-validation for why.

RMSE (root mean squared error): square root of average squared residual. Same units as y; penalizes large misses. Default choice for regression.
MAE (mean absolute error): average absolute residual. Less sensitive to outliers than RMSE; easier to explain ("off by $120 on average").
R² (coefficient of determination): fraction of variance in y explained by the model. For OLS with an intercept it lies between 0 and 1 on training data; on test data it can be negative if the model is worse than predicting the mean. High R² on training data with low test R² signals overfitting.
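Adjusted R²: rescales R² by the number of features relative to the sample size, so it rises only when a new feature improves the fit by more than chance alone would. Prefer it to plain R² when comparing models with different feature counts on the same data.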
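
The four metrics above are short formulas. The sketch below computes them on a tiny made-up set of rents and predictions; n_features = 1 is an assumption used only to show the adjusted R² formula.

```python
import numpy as np

y_true = np.array([1000.0, 1500.0, 2000.0, 2500.0])   # hypothetical rents
y_pred = np.array([1100.0, 1400.0, 2100.0, 2300.0])   # hypothetical predictions
n_features = 1                                         # assumed for adjusted R²

residual = y_true - y_pred
rmse = np.sqrt(np.mean(residual ** 2))
mae = np.mean(np.abs(residual))

ss_res = np.sum(residual ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

n = len(y_true)
adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)

print(rmse, mae, r2, adj_r2)
```

RMSE comes out larger than MAE here because the single 200-dollar miss is squared before averaging.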

Report confidence intervals on coefficients (via bootstrap or analytical standard errors) when decisions depend on whether a coefficient is significantly different from zero — the same logic as hypothesis testing in experimental analysis.
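
One way to get such intervals is a percentile bootstrap over rows. The sketch below is a simplified version under assumptions (independent rows, synthetic data, 500 resamples); the function name bootstrap_coef_ci is hypothetical.

```python
import numpy as np

def bootstrap_coef_ci(X, y, n_boot=500, level=0.95, seed=0):
    """Percentile bootstrap intervals for OLS coefficients (rows resampled)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    draws = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)        # sample rows with replacement
        coef, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        draws[b] = coef
    tail = (1.0 - level) / 2.0 * 100.0
    lower, upper = np.percentile(draws, [tail, 100.0 - tail], axis=0)
    return lower, upper

rng = np.random.default_rng(3)
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = 2.0 + 3.0 * x1 + 0.0 * x2 + rng.normal(scale=1.0, size=n)  # x2 has no effect

lower, upper = bootstrap_coef_ci(X, y)
for name, lo, hi in zip(["intercept", "x1", "x2"], lower, upper):
    print(f"{name}: [{lo:.2f}, {hi:.2f}]")
```

An interval that sits well away from zero, as for x1 here, supports treating the coefficient as distinguishable from zero; an interval straddling zero does not.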

Regularization: Ridge, Lasso, and Elastic Net

Unregularized OLS can overfit when you have many features relative to samples, or when features are collinear. Regularization adds a penalty on coefficient magnitude:

Ridge (L2): adds λ Σ βⱼ². Shrinks all coefficients toward zero but rarely to exactly zero. Handles multicollinearity gracefully by distributing weight across correlated features.
Lasso (L1): adds λ Σ |βⱼ|. Can zero out coefficients entirely — built-in feature selection. Unstable when features are highly correlated (picks one arbitrarily).
Elastic Net: combines L1 and L2. Best of both when you have grouped correlated features and want sparsity.

The hyperparameter λ (or α in scikit-learn) controls penalty strength. Tune it with cross-validation on a log-spaced grid — never on the final test set. Regularization interacts with feature scaling: always standardize before Ridge/Lasso.
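
The tuning loop can be sketched with closed-form Ridge and a hand-rolled 5-fold split. This is an illustration under assumptions: the penalty is added to the sum of squared residuals, the intercept is left unpenalized, standardization is refit inside each training fold, and the data and helper names (ridge_fit, cv_rmse) are made up.

```python
import numpy as np

def ridge_fit(features, y, lam):
    """Standardize, then solve (Z^T Z + lam I) b = Z^T (y - mean(y))."""
    mean, std = features.mean(axis=0), features.std(axis=0)
    Z = (features - mean) / std
    y_mean = y.mean()                           # unpenalized intercept
    p = Z.shape[1]
    coefs = np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ (y - y_mean))

    def predict(new_features):
        return y_mean + ((new_features - mean) / std) @ coefs

    return coefs, predict

def cv_rmse(features, y, lam, k=5):
    """K-fold cross-validated RMSE for one penalty strength."""
    idx = np.arange(len(y))
    sq_errs = []
    for hold in np.array_split(idx, k):
        train = np.setdiff1d(idx, hold)
        _, predict = ridge_fit(features[train], y[train], lam)
        sq_errs.append((y[hold] - predict(features[hold])) ** 2)
    return np.sqrt(np.mean(np.concatenate(sq_errs)))

rng = np.random.default_rng(7)
n = 120
base = rng.normal(size=n)
features = np.column_stack([base + 0.1 * rng.normal(size=n),   # two nearly
                            base + 0.1 * rng.normal(size=n),   # duplicate columns
                            rng.normal(size=n)])
y = 3.0 * base + features[:, 2] + rng.normal(scale=0.5, size=n)

lambdas = np.logspace(-3, 3, 13)                # log-spaced grid
scores = [cv_rmse(features, y, lam) for lam in lambdas]
best_lam = lambdas[int(np.argmin(scores))]

coefs_weak, _ = ridge_fit(features, y, 1e-3)
coefs_strong, _ = ridge_fit(features, y, 1e3)
print(best_lam, coefs_weak, coefs_strong)
```

The strongly penalized fit has visibly smaller coefficients than the weakly penalized one. In scikit-learn the same idea is usually expressed with a scaling step and a cross-validated estimator in a pipeline, so that scaling is also refit per fold.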

Beyond straight lines

"Linear" refers to linearity in coefficients, not necessarily in raw inputs. Common extensions:

Polynomial features: add x², x³, or cross-terms like x₁x₂ to capture curvature and interaction effects while staying in the linear-regression framework.
Log transforms: model log(y) when the target is right-skewed (prices, counts). A coefficient β then means that a one-unit increase in the feature multiplies y by e^β, which is roughly a 100·β percent change when β is small.
Dummy variables: encode categories (neighborhood, product line) as 0/1 columns. One category is dropped to avoid perfect multicollinearity (the dummy-variable trap).

Each added term increases model flexibility and overfitting risk. Use adjusted R², cross-validated RMSE, or regularization to keep complexity honest.
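
To make "linear in coefficients" concrete, the sketch below fits a curved relationship with ordinary least squares by adding an x² column to the design matrix. The quadratic data-generating process is an assumption for the demo, and the RMSE shown is in-sample, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(-2, 2, 100)
y = 1.0 + 0.5 * x + 2.0 * x**2 + rng.normal(scale=0.1, size=x.size)

def fit_and_rmse(X, y):
    """Fit OLS and return coefficients plus in-sample RMSE."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, np.sqrt(np.mean((y - X @ beta) ** 2))

X_linear = np.column_stack([np.ones_like(x), x])          # straight line
X_quad = np.column_stack([np.ones_like(x), x, x**2])      # adds a squared term

beta_linear, rmse_linear = fit_and_rmse(X_linear, y)
beta_quad, rmse_quad = fit_and_rmse(X_quad, y)
print(rmse_linear, rmse_quad, beta_quad)
```

The solver is unchanged; only the columns of X differ. The same trick applies to interaction terms such as x₁x₂, and the overfitting caveat above applies with every column added.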

Classical assumptions

OLS inference (p-values on coefficients, confidence intervals) rests on assumptions worth checking on residuals:

Linearity: residuals vs fitted values should show no systematic curve. Try polynomial terms or splines if you see a pattern.
Independence: observations should not influence each other. Time-series and spatial data violate this — use specialized models.
Homoscedasticity: residual spread should be constant across fitted values. A funnel shape suggests weighted least squares or a log transform.
Normality of residuals: needed for exact p-values in small samples; less critical with large n thanks to the central limit theorem.

Violating assumptions does not always ruin predictions — OLS can still minimize squared error — but it undermines statistical inference and may produce miscalibrated uncertainty estimates.
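
Residual plots are the usual diagnostic, but a rough numeric version of the homoscedasticity check is easy to sketch: sort residuals by fitted value and compare their spread in the lower and upper halves. The synthetic data below is built so that noise grows with x; the split into halves is an arbitrary simplification, not a formal test.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(1, 10, size=1000)
y = 100 + 50 * x + rng.normal(scale=5 * x)     # noise standard deviation grows with x

X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
resid = y - fitted

# Compare residual spread for low versus high fitted values.
order = np.argsort(fitted)
half = len(order) // 2
low_spread = resid[order[:half]].std()
high_spread = resid[order[half:]].std()
spread_ratio = high_spread / low_spread
print(spread_ratio)
```

A ratio well above 1 corresponds to the funnel shape described above. Formal tests for heteroscedasticity exist (Breusch-Pagan is a common one) and are preferable when the decision matters.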

Worked example: predicting rent

Suppose you have 8,000 rental listings with features: square feet, bedrooms, bathrooms, distance to transit (km), and neighborhood (encoded as dummy variables). Target: monthly rent in dollars.

Split: 80/20 train/test; set aside another fold for hyperparameter tuning if using Ridge.
Clean: drop rows with missing values or impute medians; cap extreme outliers (a 50,000 sq ft listing is likely data entry error).
Engineer: log-transform square feet; add bedrooms × bathrooms interaction if studio vs family units behave differently.
Fit: start with OLS baseline. Report test RMSE ($187) and R² (0.72).
Diagnose: residual plot shows heteroscedasticity at high rents — refit with log(rent) as target; RMSE on log scale improves; back-transform predictions with bias correction.
Regularize: Ridge with α = 1.2 via 5-fold CV shaves 3% off test RMSE when you add 40 neighborhood dummies.
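Interpret: in the log(rent) model, the coefficient on log-sqft ≈ 0.85 is an elasticity: a 10% larger unit associates with about 8.5% higher rent (0.85 × 10%; the exact figure, 1.10^0.85 − 1, is about 8.4%), holding other features fixed.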

Ship the Ridge model if it wins on CV; keep OLS coefficients in a dashboard for stakeholder transparency.
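
Two steps of the example lend themselves to a short sketch: the elasticity arithmetic from the Interpret step, and the back-transform from the Diagnose step. The bias correction used here is a smearing factor (the mean of the exponentiated log-scale residuals), which is one common choice and an assumption on my part; the guide does not say which correction was applied. The data is synthetic.

```python
import numpy as np

# Interpret: effect of a 10% larger unit with an elasticity of 0.85.
elasticity = 0.85
pct_change = (1.10 ** elasticity - 1.0) * 100.0
print(f"{pct_change:.2f}% higher rent")

# Diagnose: fit on log(rent), then back-transform to dollars.
rng = np.random.default_rng(6)
log_sqft = np.log(rng.uniform(400, 2000, size=2000))
log_rent = 1.5 + elasticity * log_sqft + rng.normal(scale=0.3, size=2000)

X = np.column_stack([np.ones_like(log_sqft), log_sqft])
beta, *_ = np.linalg.lstsq(X, log_rent, rcond=None)
log_resid = log_rent - X @ beta

naive = np.exp(X @ beta)                   # tends to underestimate mean rent
smear = np.mean(np.exp(log_resid))         # smearing factor, always above 1 here
corrected = naive * smear

actual_mean = np.exp(log_rent).mean()
print(naive.mean(), corrected.mean(), actual_mean)
```

Exponentiating a log-scale prediction recovers something closer to the median than the mean, which is why the uncorrected average falls short of the actual average rent.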

When to use linear regression

Scenario	Linear regression	Alternative
Interpretable coefficients required	Strong fit	Shallow decision tree (less stable)
Small tabular dataset (< 10k rows)	Strong baseline	Gradient boosting if nonlinear
Binary or categorical target	Wrong tool	Logistic regression or classifier
Heavy nonlinear interactions, images, text	Weak alone	Neural nets, gradient boosting
Many correlated features, need sparsity	Lasso / Elastic Net	PCA + regression
Extrapolation beyond training range	Dangerous	Constrain predictions or use domain bounds

Common pitfalls

Data leakage: including a feature derived from the target (e.g. "days until sold" when predicting sale price) inflates R² artificially.
Extrapolation: linear models confidently predict nonsense outside the training range. Clamp or flag out-of-distribution inputs.
Ignoring outliers: a single bad row can swing OLS coefficients. Use robust regression (Huber) or winsorize extremes.
Chasing training R²: adding features never lowers in-sample R², even when the features are pure noise. Judge on held-out RMSE.
Unscaled regularization: Ridge/Lasso penalties are not comparable across features on different scales.
Confusing coefficient size with importance: a small coefficient on a high-variance feature may matter more than a large coefficient on a rarely-varying column.
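
For the extrapolation pitfall, a simple guard is to flag any input that falls outside the per-feature range seen in training. The sketch below uses made-up listings and a hypothetical helper name; a per-feature range check is a deliberately crude notion of "out of distribution" and will miss unusual combinations of in-range values.

```python
import numpy as np

def flag_out_of_range(train_features, new_features):
    """True for each new row with any feature outside the training min/max."""
    lo = train_features.min(axis=0)
    hi = train_features.max(axis=0)
    return np.any((new_features < lo) | (new_features > hi), axis=1)

# Columns: [square feet, bedrooms]
train = np.array([[400.0, 0.0],
                  [900.0, 2.0],
                  [2500.0, 4.0]])
new = np.array([[1200.0, 3.0],     # inside the training range
                [50000.0, 2.0],    # square footage far outside
                [800.0, 6.0]])     # more bedrooms than ever seen

flags = flag_out_of_range(train, new)
print(flags)
```

Flagged rows can be routed to a fallback, clamped to the training range, or shown with a warning, depending on what the application tolerates.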

Practitioner checklist

Confirm target is continuous and roughly linear in chosen features.
Split train/validation/test; never tune on test data.
Engineer transforms (log, polynomial, interactions) before abandoning linear models.
Check residual plots for nonlinearity and heteroscedasticity.
Inspect VIF or correlation matrix for multicollinearity.
Standardize features before Ridge, Lasso, or comparing coefficient magnitudes.
Cross-validate regularization strength on a log-spaced grid.
Report test RMSE and MAE with confidence intervals via bootstrap.
Document coefficient interpretations for stakeholders.
Monitor production drift — refit when feature distributions shift.

Key takeaways

Linear regression predicts continuous targets as a weighted sum of features — fast, interpretable, and often hard to beat on tabular data.
OLS minimizes squared error with a closed-form solution; Ridge and Lasso add penalties for high-dimensional or collinear settings.
Coefficients are actionable only when features are comparably scaled, not strongly collinear, and causally defensible.
Judge models on held-out RMSE, not training R².
Feature engineering and regularization matter more than switching to a black-box model when the relationship is approximately linear.