Linear algebra for machine learning explained
You do not need a math PhD to train a model — but you do need a working mental model of linear algebra. Every mini-batch is a matrix. Every dense layer is a matrix multiply. Every embedding similarity search is a dot product. Every weight update is a vector step along a gradient. When shapes mismatch or gradients vanish, the bug is almost always linear algebra wearing a framework error message. This guide covers the vectors, matrices, and operations that appear daily in machine learning: dot products and cosine similarity, matrix multiplication in transformer layers, norms behind L2 regularization, covariance and PCA intuition, eigenvalues at a practitioner level, gradients as vectors driving backpropagation, a Harbor product-embedding worked example, an operation decision table, common pitfalls, and a checklist before you ship a model.
Why linear algebra is the language of ML
Machine learning at scale is batch linear algebra with nonlinear activations sprinkled between matrix multiplies. A mini-batch of 32 images, each flattened to 784 pixels, is a 32 × 784 matrix. A fully connected layer mapping 784 inputs to 128 hidden units is a 784 × 128 weight matrix; multiplying the batch by it on the right yields a 32 × 128 matrix of activations. GPUs are fast precisely because they parallelize these operations across thousands of cores.
Three ideas unlock most practitioner confusion:
- Shape discipline — inner dimensions must align for multiplication; transposes swap row/column orientation.
- Geometry intuition — vectors are points or directions in space; dot products measure alignment; matrices transform space.
- Calculus connection — gradients are vectors of partial derivatives pointing uphill; optimizers walk the opposite direction.
Vectors and dot products
A vector is an ordered list of numbers — a point in n-dimensional space or a direction with magnitude. In ML, feature vectors describe one example; weight vectors parameterize a model; embedding vectors represent words, users, or products in a learned space.
The dot product (inner product) of vectors a and b sums element-wise products: $a \cdot b = \sum_i a_i b_i$. Geometrically, $a \cdot b = |a|\,|b| \cos\theta$: it is large when the vectors point the same direction, zero when they are orthogonal, and negative when they are opposed.
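As a minimal sketch of the two definitions above, the NumPy snippet below computes the same dot product algebraically and then recovers the angle from the geometric form. The vectors are made-up illustrative values, not data from any real model.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, -5.0, 6.0])

# Algebraic definition: sum of element-wise products.
dot_sum = float(np.sum(a * b))   # 1*4 + 2*(-5) + 3*6 = 12
dot_np = float(a @ b)            # same value via the matmul operator

# Geometric definition: a . b = |a| |b| cos(theta), so solve for theta.
cos_theta = dot_np / (np.linalg.norm(a) * np.linalg.norm(b))
theta_deg = float(np.degrees(np.arccos(cos_theta)))

# Orthogonal vectors give zero; opposed vectors give a negative value.
orthogonal = float(np.array([1.0, 0.0]) @ np.array([0.0, 1.0]))
opposed = float(a @ (-a))        # -(1 + 4 + 9) = -14
```

Because the dot product here is positive, the recovered angle is below 90 degrees, which matches the "same general direction" reading.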
Cosine similarity
Cosine similarity normalizes the dot product by the vector lengths: $\cos\theta = (a \cdot b) / (|a|\,|b|)$. Recommendation systems and vector databases use it because magnitude often carries less meaning than direction: two product embeddings with similar purchase patterns should score high even if one of them reflects a much larger activity volume.
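The following sketch illustrates why normalizing by length matters. The function name `cosine_similarity`, the `eps` guard, and the three vectors are illustrative assumptions rather than part of any particular library.

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-12):
    """Cosine of the angle between a and b; eps guards against zero-length vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

low_volume = np.array([1.0, 2.0, 0.5])
high_volume = 10.0 * low_volume          # same direction, ten times the magnitude
different = np.array([-2.0, 1.0, 0.0])   # orthogonal to low_volume

same_direction = cosine_similarity(low_volume, high_volume)   # approximately 1
unrelated = cosine_similarity(low_volume, different)          # 0

# The raw dot product, by contrast, grows with magnitude.
raw_small = float(low_volume @ low_volume)
raw_large = float(low_volume @ high_volume)
```

The raw dot product is ten times larger for the scaled vector, while the cosine score stays at roughly 1 for both.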
Broadcasting
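Frameworks like NumPy and PyTorch broadcast smaller tensors across larger ones, for example adding a bias vector to every row of a batch matrix without explicit loops. Broadcasting is linear algebra with convenience syntax, and it can go wrong in two ways: incompatible shapes raise an error, while shapes that are compatible but unintended, such as (n,) against (n, 1), silently produce a result with the wrong shape and values.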
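A small NumPy sketch of the three cases mentioned above: the intended bias add, a silent mis-broadcast, and an outright shape error. All array values are invented for illustration.

```python
import numpy as np

X = np.arange(12.0).reshape(4, 3)      # batch of 4 examples, 3 features
bias = np.array([10.0, 20.0, 30.0])    # shape (3,): one bias per feature

out = X + bias                         # (4, 3) + (3,) -> (4, 3); bias added to every row

# Silent pitfall: (4,) minus (4, 1) broadcasts to (4, 4) instead of (4,).
pred = np.array([1.0, 2.0, 3.0, 4.0])  # shape (4,)
target = pred.reshape(4, 1)            # shape (4, 1), same values
wrong = pred - target                  # shape (4, 4), mostly nonzero
right = pred - target.squeeze(-1)      # shape (4,), all zeros

# Incompatible shapes raise instead of broadcasting.
try:
    X + np.ones(4)                     # trailing dimensions 3 and 4 do not match
    raised = False
except ValueError:
    raised = True
```

The `wrong` case is the dangerous one: no exception is raised, and a later `.mean()` would quietly average sixteen numbers instead of four.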
Matrices and multiplication
A matrix is a rectangular grid of numbers. An m × n matrix has m rows and n columns. Matrix A (m × n) times matrix B (n × p) yields C (m × p). The inner dimensions n must match; a mismatch here is one of the most common bugs in custom layers.
In a neural network, if input batch X has shape (batch, in_features) and weights W have shape (in_features, out_features), then XW produces activations of shape (batch, out_features). Bias is added via broadcasting. Stacking layers chains these multiplies with activation functions between them.
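The shapes described above can be sketched in NumPy as follows. The helper name `dense_relu`, the choice of ReLU as the activation, and the random initialization are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, in_features, out_features = 32, 784, 128

X = rng.standard_normal((batch, in_features))                # (32, 784)
W = rng.standard_normal((in_features, out_features)) * 0.01  # (784, 128)
b = np.zeros(out_features)                                   # (128,)

def dense_relu(X, W, b):
    """One dense layer: matrix multiply, broadcast bias add, ReLU activation."""
    # Fail early with a readable message if the inner dimensions differ.
    assert X.shape[1] == W.shape[0], f"inner dims differ: {X.shape} vs {W.shape}"
    Z = X @ W + b              # (batch, out_features); b broadcasts over rows
    return np.maximum(Z, 0.0)  # element-wise nonlinearity

H = dense_relu(X, W, b)        # (32, 128)
```

Stacking layers amounts to calling a function like this repeatedly, with each output width matching the next layer's input width.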
Transpose and layout
Transpose flips rows and columns: (m × n) becomes (n × m). Attention in transformers computes $QK^\top$, the query matrix times the transposed key matrix, to score how much each token attends to every other token. Row-major vs column-major memory layout rarely matters in Python but matters when reading CUDA kernel docs or optimizing custom ops.
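A minimal sketch of the attention score computation, assuming a single head with no batching. The division by the square root of the key dimension follows the usual scaled dot-product formulation and is an addition not spelled out in the text above; the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_k = 5, 8

Q = rng.standard_normal((seq_len, d_k))   # one query vector per token
K = rng.standard_normal((seq_len, d_k))   # one key vector per token

# (seq_len, d_k) @ (d_k, seq_len) -> (seq_len, seq_len)
scores = Q @ K.T / np.sqrt(d_k)

# Entry [i, j] is just the scaled dot product of query i with key j.
single = float(Q[1] @ K[3] / np.sqrt(d_k))
```

Without the transpose, `Q @ K` would fail here because the inner dimensions would be 8 and 5.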
Identity and inverse (when they matter)
The identity matrix I leaves vectors unchanged: $Ix = x$. An inverse $A^{-1}$ satisfies $AA^{-1} = I$. It is used explicitly in closed-form linear regression, $w = (X^\top X)^{-1} X^\top y$, but almost never computed directly in deep learning because inversion is expensive and numerically fragile; iterative optimizers replace it.
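To make the closed-form solution concrete, the sketch below fits noiseless synthetic data three ways: with an explicit inverse, with a linear solve, and with a least-squares routine. The data and the true weights are invented; the latter two routes are generally preferred in practice because they avoid forming the inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.standard_normal((n, d))
w_true = np.array([2.0, -1.0, 0.5])   # hypothetical ground-truth weights
y = X @ w_true                        # noiseless targets, so recovery is exact

# Textbook normal equations with an explicit inverse.
w_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# Same system solved without forming the inverse.
w_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Least squares directly on X, which avoids squaring the condition number.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

All three agree on this well-conditioned toy problem; they start to diverge when columns of X are nearly collinear.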
Norms, regularization and distance
A norm measures vector size. The L2 norm (Euclidean length) is $|x| = \sqrt{\sum_i x_i^2}$. L2 regularization (weight decay) penalizes large weights by adding $\lambda |w|^2$ to the loss, encouraging smaller, smoother solutions that often generalize better.
The L1 norm sums absolute values: $|x|_1 = \sum_i |x_i|$. L1 regularization pushes many weights exactly to zero, producing sparse models, which is useful when you want feature selection. L2 distance between two points is the norm of their difference; k-nearest neighbors and clustering (k-means) rely on it.
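The norms and penalties above can be checked by hand on a tiny vector. In this sketch the weight vector, the regularization strength `lam`, and the `data_loss` value are all hypothetical numbers chosen so the arithmetic is easy to follow.

```python
import numpy as np

w = np.array([3.0, -4.0, 0.0])

l2 = float(np.linalg.norm(w))          # sqrt(9 + 16 + 0) = 5
l1 = float(np.linalg.norm(w, ord=1))   # 3 + 4 + 0 = 7

lam = 0.01                             # hypothetical regularization strength
data_loss = 0.25                       # hypothetical unregularized loss value
l2_penalty = lam * float(w @ w)        # lam * |w|^2 = 0.01 * 25 = 0.25
l1_penalty = lam * l1                  # lam * |w|_1 = 0.01 * 7 = 0.07
total_loss = data_loss + l2_penalty    # 0.50

# L2 distance between two points is the norm of their difference.
p = np.array([1.0, 1.0])
q = np.array([4.0, 5.0])
dist = float(np.linalg.norm(p - q))    # sqrt(9 + 16) = 5
```

Note that the L2 penalty uses the squared norm, so `w @ w` is enough and no square root is needed.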
Covariance, PCA and eigenvalues (intuition)
The covariance matrix of features describes how variables move together — diagonal entries are variances, off-diagonals are covariances. Highly correlated features carry redundant information; the covariance matrix captures that structure.
Principal Component Analysis (PCA) finds the directions along which data varies most, which are the eigenvectors of the covariance matrix. An eigenvector v of a matrix A satisfies $Av = \lambda v$: the matrix only stretches or flips v along its own line, without turning it. The scalar λ is the eigenvalue; when A is the covariance matrix, λ equals the variance of the data along v. PCA projects high-dimensional data onto the top eigenvectors, reducing dimensionality while preserving most variance. This is common for visualization, compression, and preprocessing tabular data before classical ML.
You rarely hand-compute eigendecompositions in deep learning pipelines, but the intuition explains why correlated inputs slow training, why whitening helps, and why singular value decomposition (SVD) appears in recommendation matrix factorization and latent semantic analysis.
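For intuition, PCA can be sketched directly from the covariance matrix on synthetic data. The three features below are fabricated so that two are strongly correlated and one is low-variance noise; a production pipeline would more likely call a library PCA routine than do this by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
t = rng.standard_normal(n)
# Two strongly correlated features plus one independent, low-variance feature.
X = np.column_stack([
    t,
    2.0 * t + 0.1 * rng.standard_normal(n),
    0.1 * rng.standard_normal(n),
])

Xc = X - X.mean(axis=0)                  # center each feature
cov = Xc.T @ Xc / (n - 1)                # sample covariance matrix, shape (3, 3)

eigvals, eigvecs = np.linalg.eigh(cov)   # eigh suits symmetric matrices; ascending order
order = np.argsort(eigvals)[::-1]        # reorder to descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

v, lam = eigvecs[:, 0], eigvals[0]       # top principal direction and its variance
Z = Xc @ eigvecs[:, :1]                  # project onto the top component, shape (500, 1)
explained = eigvals / eigvals.sum()      # fraction of variance per component
```

Because the first two features are nearly redundant, a single component captures almost all of the variance in this toy dataset.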
Gradients: calculus meets vectors
A gradient is a vector of partial derivatives — one entry per parameter telling you how loss changes if you nudge that weight infinitesimally. For loss L(w) with weights w, the gradient ∇L points toward steepest ascent. Gradient descent steps w ← w - η∇L — move opposite the gradient with learning rate η.
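The update rule above can be sketched on a mean-squared-error problem where the gradient has a simple closed form. The data, the learning rate of 0.1, and the 200 iterations are illustrative choices, not recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
w_true = np.array([1.5, -0.5])        # hypothetical target weights
y = X @ w_true

def loss(w):
    """Mean squared error of the linear model X @ w."""
    r = X @ w - y
    return float(r @ r) / len(y)

def grad(w):
    """Gradient of the loss: one partial derivative per weight."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

w = np.zeros(2)
eta = 0.1                             # learning rate
losses = [loss(w)]
for _ in range(200):
    w = w - eta * grad(w)             # step opposite the gradient
    losses.append(loss(w))
```

The gradient has the same shape as `w`, and the recorded losses shrink toward zero as `w` approaches `w_true`.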
Backpropagation applies the chain rule to compute ∇L efficiently through layered graphs. The Jacobian of a vector-valued function is a matrix of all first-order partial derivatives — RNNs and normalizing flows care deeply about Jacobian structure; for standard feedforward nets, framework autodiff hides the details.
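To show what autodiff hides, here is a hand-written backward pass for a single linear layer with a squared-error loss, checked against central finite differences. The loss choice and the tiny shapes are assumptions made to keep the example readable.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))   # (batch, in_features)
W = rng.standard_normal((3, 2))   # (in_features, out_features)
Y = rng.standard_normal((4, 2))   # targets, (batch, out_features)

def loss_fn(W):
    """Half the sum of squared errors of the linear layer X @ W."""
    return 0.5 * float(np.sum((X @ W - Y) ** 2))

# Chain rule: dL/dZ = Z - Y, then dL/dW = X^T (dL/dZ).
dZ = X @ W - Y                    # (4, 2)
dW = X.T @ dZ                     # (3, 2), same shape as W

# Numerical check: perturb one weight at a time.
eps = 1e-6
dW_num = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        step = np.zeros_like(W)
        step[i, j] = eps
        dW_num[i, j] = (loss_fn(W + step) - loss_fn(W - step)) / (2 * eps)
```

If the analytic gradient were written as `dZ.T @ X`, it would come out with shape (2, 3), which is the transposed-gradient symptom that usually signals a backward-pass bug.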
Worked example: Harbor product embeddings
Harbor Shop wants "customers who viewed this also viewed" recommendations without hand-coded category rules. The team trains a shallow embedding model:
- Each of 50,000 products maps to a 64-dimensional embedding vector stored in matrix E (50000 × 64).
- A user's recent views form a context vector — the mean of those product embeddings, shape (64,).
- Candidate scores are dot products between the context vector and every product embedding — implemented as one matrix-vector multiply E c producing 50,000 scores.
- Top-10 highest scores (excluding already-viewed items) surface as recommendations.
Training uses contrastive loss: push dot products high for co-purchased pairs, low for random negatives. L2 normalization on embeddings makes dot product equivalent to cosine similarity — stabilizing scale across popular vs niche products. Before launch, engineers verify E has no NaN rows, embedding norms are bounded, and batch matrix shapes align when they add a two-tower variant with separate user and item matrices.
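The retrieval step of the Harbor example might look roughly like the sketch below. The random matrix stands in for trained embeddings, and the `recommend` helper and the viewed product ids are hypothetical; only the shapes (50,000 × 64), the mean-pooled context vector, and the top-10 selection come from the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
num_products, dim = 50_000, 64

# Stand-in for trained embeddings; real values would come from the model.
E = rng.standard_normal((num_products, dim))
E /= np.linalg.norm(E, axis=1, keepdims=True)   # L2-normalize each row

def recommend(E, viewed_ids, k=10):
    """Return k product ids ranked by dot product with the mean of viewed embeddings."""
    c = E[viewed_ids].mean(axis=0)              # context vector, shape (64,)
    scores = E @ c                              # one matrix-vector multiply, shape (50000,)
    scores[viewed_ids] = -np.inf                # exclude already-viewed items
    top = np.argpartition(scores, -k)[-k:]      # unordered top-k in linear time
    return top[np.argsort(scores[top])[::-1]]   # sort those k by descending score

viewed = np.array([12, 345, 6789])              # hypothetical product ids
recs = recommend(E, viewed)

# Pre-launch sanity checks mentioned in the text.
assert not np.isnan(E).any()
assert np.allclose(np.linalg.norm(E, axis=1), 1.0)
```

The context vector is a mean of unit vectors and so is generally shorter than unit length, but that rescales every score by the same factor and leaves the ranking unchanged.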
Operation decision table
| Goal | Operation | Typical ML use |
|---|---|---|
| Measure feature alignment | Dot product / cosine similarity | Embeddings, attention scores, retrieval |
| Transform batch of features | Matrix multiply | Dense layers, linear regression, projections |
| Penalize large weights | L2 norm squared | Weight decay regularization |
| Sparsify features or weights | L1 norm | Lasso, feature selection, sparse linear models |
| Reduce input dimensionality | PCA / SVD | Tabular preprocessing, visualization, compression |
| Update model parameters | Gradient vector step | SGD, Adam, all training loops |
| Combine correlated feature info | Covariance matrix | Whitening, Mahalanobis distance, Gaussian models |
Common pitfalls
- Shape mismatch: multiplying (batch, 512) by (256, 128) fails loudly only if your framework catches it; always print .shape on new layers.
- Confusing dot product with element-wise multiply: the Hadamard product is not the same as the inner product; attention uses matmul, not element-wise ops.
- Ignoring numerical stability: softmax and attention subtract row maxima before exp to avoid overflow; dot products of long, large-magnitude vectors can overflow float16 in mixed-precision training.
- Treating correlation as causation in PCA — principal components capture variance, not causal structure; interpret with care.
- Inverting ill-conditioned matrices: a near-singular $X^\top X$ blows up closed-form solutions; use ridge regression or iterative methods instead.
- Gradient shape bugs: a gradient must match its parameter's shape exactly; a gradient that comes out transposed relative to its weight matrix usually means the backward pass is wrong.
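The numerical-stability pitfall in the list above can be reproduced in a few lines. This sketch compares a naive softmax with the max-subtraction version; the score values are deliberately extreme, and the `softmax` helper is written out here rather than taken from a library.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax that subtracts each row's maximum before exponentiating."""
    shifted = scores - scores.max(axis=-1, keepdims=True)   # largest entry becomes 0
    e = np.exp(shifted)                                     # all values in (0, 1]
    return e / e.sum(axis=-1, keepdims=True)

scores = np.array([[1000.0, 1001.0, 1002.0],
                   [-1.0, 0.0, 1.0]])

# Naive version: exp(1000) overflows to inf, and inf / inf is nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

stable = softmax(scores)
```

Subtracting a per-row constant does not change the softmax output, which is why both rows of `stable` are identical here even though their raw scores differ by about a thousand.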
Practitioner checklist
- Write tensor shapes as comments on every custom layer forward pass.
- Normalize embeddings when using dot-product retrieval at scale.
- Check condition number or use regularization before matrix inversion.
- Visualize PCA of validation embeddings to spot collapsed or disconnected clusters.
- Unit-test a single forward/backward step with small known matrices.
- Profile whether your bottleneck is matmul (compute-bound) or memory bandwidth.
- When metrics degrade, check for covariance drift: a shift in the input distribution changes the geometry the model was trained on.
- Keep a reference sheet of (batch, seq, dim) conventions for your codebase.
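The conditioning item in the checklist above can be sketched with `np.linalg.cond`. The nearly duplicated feature and the ridge strength `lam` are invented for illustration; what counts as "too large" a condition number depends on the precision and the problem.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.standard_normal(100)
x2 = x1 + 1e-6 * rng.standard_normal(100)   # nearly a copy of x1
X = np.column_stack([x1, x2])

gram = X.T @ X                               # the matrix the normal equations invert
cond_plain = float(np.linalg.cond(gram))     # very large: close to singular

lam = 1e-2                                   # hypothetical ridge strength
cond_ridge = float(np.linalg.cond(gram + lam * np.eye(2)))
```

Adding `lam` to the diagonal lifts the smallest eigenvalue away from zero, which is the linear-algebra reason ridge regression stabilizes the closed-form solution.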
Key takeaways
- Linear algebra is how ML represents data (matrices), transforms it (multiplication), and learns (gradient vectors).
- Dot products measure similarity; matrix multiply implements every dense layer and attention score block.
- Norms underpin L1/L2 regularization and distance-based algorithms like k-NN.
- Covariance, PCA, and eigenvalues explain redundancy in features and dimensionality reduction.
- Shape discipline prevents more production bugs than any hyperparameter sweep.
Related reading
- Machine learning fundamentals explained — supervised learning, loss, and evaluation before the math deepens
- Backpropagation explained — chain rule and gradient flow through computational graphs
- Transformer architecture explained: $QK^\top$ attention as batched matrix multiply
- Linear regression explained — closed-form solution using normal equations