Convolutional neural networks (CNN) explained

Feed a 224×224 RGB photo into a fully connected neural network and the input alone has 150,528 values, so a first hidden layer of just 1,000 units already needs about 150 million weights. The model still treats pixel (12, 40) as unrelated to pixel (13, 41), even though real objects span neighborhoods of pixels. Convolutional neural networks (CNNs) fix both problems: they share weights across space so edges and textures are detected everywhere in the frame, and they build a hierarchy from local strokes to global shapes. This guide walks through the convolution operation, kernels and feature maps, pooling, receptive fields, landmark architectures from LeNet to ResNet, how CNNs connect to computer vision tasks, and practical training with transfer learning and data augmentation.

Why convolutions beat dense layers on images

Images have three properties that dense (fully connected) layers ignore:

Locality — a cat's ear is made of nearby pixels, not random pixels across the frame.
Translation equivariance — if the cat shifts left, the "ear detector" should shift left too, not require relearning.
Parameter efficiency — a 3×3 kernel with 9 weights applied across a 512×512 image replaces learning separate weights for every position.
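To put rough numbers on the parameter-efficiency point, here is a minimal sketch that counts weights only (biases ignored). The layer sizes (1,000 hidden units, 64 filters) are illustrative assumptions, not values from any particular model.

```python
def dense_weights(height, width, channels, hidden_units):
    """Weights (no biases) in a fully connected layer fed the flattened image."""
    return height * width * channels * hidden_units


def conv_weights(kernel_size, in_channels, out_channels):
    """Weights (no biases) in a conv layer; the count does not depend on image size."""
    return kernel_size * kernel_size * in_channels * out_channels


# Hypothetical layer sizes, chosen only for illustration.
dense = dense_weights(224, 224, 3, hidden_units=1000)
conv = conv_weights(kernel_size=3, in_channels=3, out_channels=64)
print(dense)  # 150528000
print(conv)   # 1728
```

The conv count stays the same whether the image is 224×224 or 512×512, because the same kernel is reused at every position.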

A CNN stacks layers that apply the same small filter everywhere, then progressively downsample spatial resolution while increasing channel depth. Early channels respond to oriented edges and color blobs; deeper channels compose those into parts (wheels, eyes) and whole objects. This inductive bias is why CNNs dominated deep learning for vision for a decade before vision transformers matured — and why CNN backbones still ship on phones and embedded cameras today.

The convolution operation step by step

At the heart of every CNN is 2D convolution (technically cross-correlation in most frameworks, but practitioners say "conv").

Kernels, stride, and padding

A kernel (filter) is a small matrix of learnable weights — often 3×3 or 5×5. The kernel slides across the input height and width. At each position it element-wise multiplies overlapping input values, sums the products, adds a bias, and writes one number to the output feature map.
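The sliding-window arithmetic can be sketched in plain Python for a single channel with stride 1 and no padding. This is a deliberately slow illustration, not how frameworks implement it; the function name and the hand-picked edge kernel are assumptions made for the example (in a real CNN the kernel values are learned).

```python
def conv2d_single(image, kernel, bias=0.0):
    """Stride-1, no-padding cross-correlation of one channel with one kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = bias
            # Multiply the overlapping window element-wise and sum the products.
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


# A 4x4 image with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-picked kernel that responds to dark-to-bright vertical edges.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]
feature_map = conv2d_single(image, vertical_edge)
print(feature_map)  # [[3.0, 3.0], [3.0, 3.0]]
```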

Three hyperparameters control output size:

Stride — how many pixels the kernel jumps per step. Stride 2 halves spatial dimensions, trading resolution for speed.
Padding — zero-padding around the border preserves spatial size when stride is 1 ("same" padding).
Kernel size — larger kernels see more context per step but cost more compute; modern stacks prefer stacked 3×3 layers over single 7×7 convolutions.

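Output size along one dimension follows: floor((input + 2×padding − kernel) / stride) + 1. For example, a 224-pixel input with a 3×3 kernel, padding 1, and stride 1 stays at 224. A stack of conv layers grows the receptive field, the region of the input that influences one output neuron. Two stacked 3×3 layers have the receptive field of one 5×5 layer but with more nonlinearity and fewer parameters (18 weights versus 25 per input-output channel pair).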
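Both the output-size formula and the receptive-field claim are easy to check numerically. The sketch below assumes integer (floor) division, which is how most frameworks round, and uses the standard recurrence for receptive field growth; the function names are made up for this example.

```python
def conv_output_size(input_size, kernel, stride=1, padding=0):
    """Output length along one dimension, rounding down as most frameworks do."""
    return (input_size + 2 * padding - kernel) // stride + 1


def receptive_field(layers):
    """Receptive field of a stack of (kernel, stride) layers, listed input to output."""
    field, jump = 1, 1
    for kernel, stride in layers:
        field += (kernel - 1) * jump  # each layer widens the field by (k - 1) input steps
        jump *= stride                # striding spreads later layers' steps further apart
    return field


print(conv_output_size(224, kernel=3, stride=1, padding=1))  # 224
print(conv_output_size(224, kernel=7, stride=2, padding=3))  # 112
print(receptive_field([(3, 1), (3, 1)]))                     # 5
print(receptive_field([(3, 1), (3, 1), (3, 1)]))             # 7
```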

Channels and depth

RGB input has three channels. A conv layer with 64 filters outputs 64 feature maps, and each filter learns a different pattern (vertical edge, red blob, high-frequency texture). Each filter in the next layer convolves across all 64 input channels and sums the results into one output map, so a layer's output depth equals its number of filters. Architectures typically raise the filter count in deeper layers while spatial size shrinks.
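Because every filter spans all input channels, the parameter count of a conv layer depends on the channel counts on both sides. A small sketch, assuming square kernels and one bias per filter (the helper name is hypothetical):

```python
def conv_layer_params(in_channels, out_channels, kernel_size):
    """Learnable parameters: one k x k x in_channels kernel plus one bias per filter."""
    weights_per_filter = kernel_size * kernel_size * in_channels
    return out_channels * (weights_per_filter + 1)


first = conv_layer_params(in_channels=3, out_channels=64, kernel_size=3)
second = conv_layer_params(in_channels=64, out_channels=64, kernel_size=3)
print(first)   # 1792
print(second)  # 36928
```

The second layer is about twenty times larger than the first even though both use 64 filters of size 3×3, purely because its input has 64 channels instead of 3.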

Pooling, activation, and normalization

Pooling layers

Max pooling (most common) takes the maximum value in each 2×2 window with stride 2, halving width and height. It provides local translation invariance — a slightly shifted edge still activates the same pooled cell — and reduces compute for deeper layers. Average pooling appears in some architectures and as a global pooling step before classification.
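The invariance claim can be seen on a toy feature map. This is a minimal sketch of 2×2 max pooling with stride 2 that assumes even height and width; the example activations are invented for illustration.

```python
def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2; assumes even height and width."""
    pooled = []
    for i in range(0, len(feature_map), 2):
        row = []
        for j in range(0, len(feature_map[0]), 2):
            window = (
                feature_map[i][j], feature_map[i][j + 1],
                feature_map[i + 1][j], feature_map[i + 1][j + 1],
            )
            row.append(max(window))
        pooled.append(row)
    return pooled


original = [
    [0, 9, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 5],
    [0, 0, 0, 0],
]
# Same activations moved by one pixel, each staying inside its 2x2 window.
shifted = [
    [0, 0, 0, 0],
    [9, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 5, 0],
]
print(max_pool_2x2(original))  # [[9, 0], [0, 5]]
print(max_pool_2x2(shifted))   # [[9, 0], [0, 5]]
```

The invariance is only local: a shift that moves an activation across a window boundary does change the pooled output.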

Modern designs sometimes replace pooling with strided convolutions, letting the network learn how to downsample rather than fixing max aggregation.

Nonlinearities and batch norm

Conv layers are linear; ReLU (max(0, x)) after each conv introduces nonlinearity so stacked layers can approximate complex functions. Batch normalization standardizes each channel's activations using mini-batch statistics, which stabilizes training of deep stacks and allows higher learning rates. GroupNorm and LayerNorm are common alternatives when batches are too small for reliable batch statistics, as in detection training or on memory-constrained devices.
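A minimal sketch of the two operations on one channel's activations across a mini-batch of four. It shows training-time batch norm only (at inference, frameworks use running averages instead of batch statistics), and gamma and beta stand in for the learnable scale and shift.

```python
import math


def relu(values):
    """Element-wise max(0, x)."""
    return [max(0.0, v) for v in values]


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize one channel's activations across a mini-batch, then scale and shift."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(variance + eps) + beta for v in values]


channel_activations = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(channel_activations)
activated = relu(normalized)
print([round(v, 3) for v in activated])  # [0.0, 0.0, 0.447, 1.342]
```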

Classic CNN architectures

Understanding the lineage helps you pick backbones and read papers:

LeNet-5 (1998)

Yann LeCun's digit recognizer: conv → pool → conv → pool → dense. Proved convolutions work for grayscale OCR. Tiny by modern standards but the template persists.

AlexNet (2012)

Won ImageNet with ReLU, dropout, and GPU training. Showed deep CNNs crush hand-crafted features. The network was split across two GPUs because 2012 cards lacked the memory to hold it on one.

VGG (2014)

Uniform 3×3 stacks: simple but heavy. VGG-16 has ~138M parameters. Still widely used as a perceptual-loss backbone in style transfer, where distances between its intermediate feature maps serve as a proxy for perceptual similarity.

ResNet (2015)

Residual connections (skip connections) add the layer input to its output: y = F(x) + x. Gradients flow directly through shortcuts, enabling 50–152+ layer networks without the accuracy degradation seen in plain deep stacks. ResNet-50 is a common default pretrained backbone for transfer learning on small image datasets. Later designs such as ResNeXt, EfficientNet, and ConvNeXt refine width, depth, and compound scaling.
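The formula y = F(x) + x can be sketched on a flat list of features. Here `transform` is a stand-in for the block's conv layers, and `zero_transform` is a hypothetical F whose weights are all zero; both names are assumptions for this example.

```python
def residual_block(x, transform):
    """Return F(x) + x element-wise, where transform plays the role of F."""
    fx = transform(x)
    if len(fx) != len(x):
        raise ValueError("F(x) must match the shape of x for an identity shortcut")
    return [f + xi for f, xi in zip(fx, x)]


def zero_transform(x):
    """Stand-in for conv layers whose weights are all zero."""
    return [0.0 for _ in x]


features = [0.5, -1.0, 2.0]
print(residual_block(features, zero_transform))  # [0.5, -1.0, 2.0]
```

With F set to zero the block passes its input through unchanged, which hints at why extra residual blocks are easy to add: the network only has to learn a correction on top of the identity.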

Beyond classification backbones

Detectors (Faster R-CNN, YOLO, RetinaNet) and segmentation networks (U-Net, FPN-based decoders) attach task-specific heads or decoders to CNN encoders. Encoder–decoder shapes with skip connections preserve spatial detail for pixel-wise masks, which is why they are common in medical imaging and satellite analysis.

Training CNNs in practice

Data and augmentation

CNNs are data-hungry. With fewer than ~1,000 images per class, start from ImageNet-pretrained weights rather than random init. Pair with augmentation: random crops, horizontal flips, color jitter, and (carefully) rotation. Augmentation that produces images unlike anything the model will see at inference can hurt, so keep augmented samples within the range of realistic inputs and maintain train-serve parity.
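Two of the augmentations named above, random crop and horizontal flip, can be sketched on an image stored as a nested list. This is an illustration under simplifying assumptions (single channel, hypothetical function names, a seeded generator for repeatability); real pipelines would use a library's transforms.

```python
import random


def horizontal_flip(image):
    """Mirror each row left to right."""
    return [list(reversed(row)) for row in image]


def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random valid position."""
    top = rng.randint(0, len(image) - crop_h)      # randint is inclusive on both ends
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def augment(image, crop_h, crop_w, rng, flip_prob=0.5):
    """Random crop followed by a coin-flip horizontal mirror."""
    out = random_crop(image, crop_h, crop_w, rng)
    if rng.random() < flip_prob:
        out = horizontal_flip(out)
    return out


rng = random.Random(0)  # fixed seed so the sketch is repeatable
image = [[r * 4 + c for c in range(4)] for r in range(4)]
sample = augment(image, 3, 3, rng)
print(len(sample), len(sample[0]))  # 3 3
```

At inference these random steps are replaced by a fixed resize or center crop.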

Loss functions and metrics

Multi-class classification uses softmax cross-entropy. Imbalanced datasets may need class weights or focal loss. Track per-class precision and recall, not accuracy alone — 99% accuracy means nothing if one rare defect class is always missed.
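The accuracy pitfall is easy to reproduce. The sketch below uses invented inspection data (99 good parts, 1 defect) and a model that always predicts "ok"; the helper is a plain-Python stand-in for what a metrics library would report.

```python
def per_class_metrics(y_true, y_pred, classes):
    """Precision and recall per class from paired label lists."""
    metrics = {}
    for cls in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        metrics[cls] = {"precision": precision, "recall": recall}
    return metrics


# Hypothetical data: the model never predicts the rare class.
y_true = ["ok"] * 99 + ["defect"]
y_pred = ["ok"] * 100
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
metrics = per_class_metrics(y_true, y_pred, ["ok", "defect"])
print(accuracy)                     # 0.99
print(metrics["defect"]["recall"])  # 0.0
```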

Optimization

Adam or SGD with momentum are standard. Use learning-rate warmup and cosine decay for fine-tuning. Freeze the backbone, train only the head for a few epochs, then unfreeze the top blocks with a lower learning rate. Watch validation loss for overfitting, because CNNs memorize small sets quickly.
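One common way to combine warmup with cosine decay is a linear ramp followed by a half-cosine. The sketch below is one such formulation, not the only one; the step counts and peak rate are placeholder values.

```python
import math


def lr_at_step(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay toward min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Placeholder settings: 100 steps, 10 of them warmup, peak learning rate 1e-3.
schedule = [
    lr_at_step(s, total_steps=100, warmup_steps=10, peak_lr=1e-3)
    for s in range(100)
]
print(schedule[0], schedule[9], schedule[-1])
```

The rate climbs from a tenth of the peak to the peak over the first ten steps, then falls smoothly to near zero by the final step.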

Hardware and batch size

Training is GPU-bound; inference on CPU is viable for small models with quantization. Batch size affects batch-norm statistics: if you shrink batches on a memory-limited GPU, consider GroupNorm. Gradient accumulation over several forward passes restores the effective batch size for the optimizer, but batch-norm statistics are still computed per micro-batch.

CNNs vs vision transformers (ViT)

Vision transformers slice images into patches and run self-attention — flexible global context from layer one. ViTs excel when pretrained on huge datasets (JFT, LAION) and often beat CNNs on accuracy leaderboards.
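To get a feel for what patch slicing implies, the sketch below counts patch tokens and the token pairs that full self-attention compares in one layer. The 16-pixel patch size is an assumption taken from the common ViT configuration, and the helper name is invented for this example.

```python
def patch_grid(image_size, patch_size):
    """Return (number of patch tokens, token pairs compared by full self-attention)."""
    if image_size % patch_size:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    tokens = per_side * per_side
    return tokens, tokens * tokens


print(patch_grid(224, 16))  # (196, 38416)
```

Every token attends to every other token from the first layer, which is where the global context comes from and also why cost grows quadratically with the number of patches.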

CNNs still win when:

Data is limited — inductive bias helps small medical or industrial datasets.
Latency and memory are tight — mobile-optimized CNNs (MobileNet, EfficientNet-Lite) run on NPUs with mature tooling.
Spatial priors matter — dense prediction tasks with fixed resolution grids.

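ConvNeXt (a CNN that adopts transformer-era design choices) and Swin Transformer (a transformer with local attention windows and a CNN-like feature hierarchy) blur the line. Study your constraints (accuracy, ms latency, watt budget) before picking a religion.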

Production deployment considerations

A notebook metric is not a product. Before shipping:

Export format — ONNX, TorchScript, or TensorFlow SavedModel for cross-runtime deployment; TensorRT or Core ML for hardware-specific fusion.
Input pipeline — resize, center crop, and normalization (ImageNet mean/std) must match training exactly.
Quantization — INT8 post-training quantization shrinks FP32 models about 4× with small accuracy loss on many CNNs; validate on a holdout set after quantizing.
Monitoring — track input distribution shift (blur, lighting, new camera sensors) and trigger retraining when drift alarms fire.
Explainability — Grad-CAM heatmaps show which regions drove a classification, useful for regulated domains and debugging false positives.
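The 4× figure in the quantization item comes from storing each weight in 1 byte instead of the 4 bytes of FP32. A minimal sketch of symmetric per-tensor quantization follows; real toolchains typically add per-channel scales and calibration data, and the weight values here are made up.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.02, -0.5, 0.31, 1.27, -1.0]  # made-up example values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q)
print(max_error <= scale / 2 + 1e-12)  # True
```

The rounding error per weight is bounded by half the scale, which is why a single outlier weight (which inflates the scale) can cost accuracy and why validation after quantizing matters.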

Common anti-patterns

Training from scratch on 500 images — use pretrained backbones unless you have millions of labeled examples.
Leaking test images into augmentation tuning — tune on the validation set and keep the test set untouched until final eval.
Mismatched preprocessing at serve time — the classic "works in Jupyter, fails in production" bug.
Ignoring aspect ratio — squashing portraits into squares distorts faces; letterbox or center crop consistently.
Chasing ImageNet accuracy on your factory defect set — domain-specific fine-tune beats a bigger general model.
Stacking depth without residuals — plain deep VGG-style stacks are harder to train than ResNet blocks.
No failure analysis — inspect confusion matrices and hard negatives before adding model capacity.
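For the aspect-ratio item, letterboxing means scaling the image to fit inside the target square and padding the remainder. The sketch below only computes the resize and padding amounts (no pixel work); the function name and the returned dictionary layout are assumptions for this example.

```python
def letterbox_params(src_w, src_h, target):
    """Scale to fit inside a target x target square, then pad the short side evenly."""
    scale = min(target / src_w, target / src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    pad_x, pad_y = target - new_w, target - new_h
    left, top = pad_x // 2, pad_y // 2
    # pad is ordered (left, top, right, bottom)
    return {"size": (new_w, new_h), "pad": (left, top, pad_x - left, pad_y - top)}


print(letterbox_params(480, 640, 224))
# {'size': (168, 224), 'pad': (28, 0, 28, 0)}
```

A 480×640 portrait keeps its 3:4 proportions at 168×224 and gains 28 pixels of padding on each side, instead of being stretched to 224×224. Whatever choice is made, the same function should run in both training and serving.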

Production checklist

Define task: classification, detection, or segmentation — pick head and metrics accordingly.
Audit label quality and class balance; fix annotation guidelines before training.
Start from a pretrained backbone (ResNet-50, EfficientNet-B0, or domain-specific checkpoint).
Lock augmentation policy and document train vs inference preprocessing.
Split train/val/test with group-aware splits if images come from same sessions or devices.
Track per-class metrics, confusion matrix, and calibration on validation data.
Export to target runtime; benchmark latency and memory on production hardware.
Validate quantized models if deploying INT8 on edge devices.
Log prediction distributions and input stats for drift monitoring post-launch.
Version datasets, weights, and preprocessing configs alongside MLOps artifacts.
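The group-aware split from the checklist can be sketched as follows: whole groups (sessions, devices, patients) go to one side of the split so near-duplicate images never straddle train and validation. The record layout and session names are hypothetical.

```python
import random


def group_split(samples, group_of, val_fraction=0.2, seed=0):
    """Assign whole groups to train or val so no group appears in both."""
    groups = sorted({group_of(s) for s in samples})
    random.Random(seed).shuffle(groups)
    n_val = max(1, round(len(groups) * val_fraction))
    val_groups = set(groups[:n_val])
    train = [s for s in samples if group_of(s) not in val_groups]
    val = [s for s in samples if group_of(s) in val_groups]
    return train, val


# Hypothetical records: (filename, capture_session), five images per session.
samples = [(f"img_{i}.jpg", f"session_{i // 5}") for i in range(50)]
train, val = group_split(samples, group_of=lambda s: s[1])
print(len(train), len(val))  # 40 10
```

The same idea extends to a three-way train/val/test split by carving the test groups out first.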

Key takeaways

CNNs exploit spatial locality and weight sharing to learn hierarchical visual features efficiently.
Conv → activation → pool stacks shrink spatial size and grow semantic depth; receptive field grows with depth.
ResNet skip connections made very deep training stable, and ResNets remain a common default fine-tuning backbone.
Transfer learning and augmentation are mandatory on small real-world datasets.
CNNs still compete with ViTs on edge latency, limited data, and mature deployment tooling — choose by constraints, not hype.