
Object detection explained

Object detection answers two questions at once: what is in an image, and where each instance lives. Unlike image classification, which assigns a single label to the whole frame, detection outputs a set of bounding boxes — axis-aligned rectangles (or occasionally rotated polygons) each tagged with a class and confidence score. That spatial grounding powers self-driving perception stacks, warehouse robots, retail shelf analytics, security cameras, and photo apps that find faces and pets. Under the hood, detectors are built on convolutional neural networks that learn hierarchical visual features; this guide focuses on the detection-specific machinery — box regression, overlap metrics, architecture families, evaluation, and the production pitfalls that separate demo screenshots from reliable systems.

Detection vs classification vs segmentation

The vision pipeline branches early. Image classification outputs one label per image ("cat"). Object detection outputs zero or more boxes per image, each with a label ("cat at [x,y,w,h], dog at [x,y,w,h]"). Semantic segmentation assigns a class to every pixel; instance segmentation (Mask R-CNN and descendants) combines detection boxes with per-instance pixel masks.

Choose detection when you need to count instances, track objects across video frames, crop regions for downstream OCR or quality inspection, or trigger alerts when a specific object enters a zone. Segmentation costs more compute and annotation labor; detection is often the sweet spot for real-time edge deployment. For broader context on how these tasks relate, see computer vision fundamentals.

Bounding boxes, IoU, and matching predictions to truth

A standard bounding box is four numbers: top-left corner (x, y) plus width and height, or alternatively two corners (x₁, y₁, x₂, y₂). Models typically regress offsets relative to anchor boxes — predefined templates at multiple scales and aspect ratios tiled across a feature map — though modern anchor-free detectors (FCOS, CenterNet) predict box centers and distances to edges directly, simplifying hyperparameter tuning.
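
As a minimal sketch of the two box formats and of anchor-relative regression, the Python below converts between formats and encodes and decodes offsets using the center/log-size parameterization popularized by the R-CNN family. The function names are hypothetical, and a given detector may scale or normalize its offsets differently.

```python
import math

def xywh_to_xyxy(box):
    """(x, y, w, h) with a top-left origin -> (x1, y1, x2, y2)."""
    x, y, w, h = box
    return (x, y, x + w, y + h)

def _center_form(box_xyxy):
    """(x1, y1, x2, y2) -> (cx, cy, w, h)."""
    x1, y1, x2, y2 = box_xyxy
    w, h = x2 - x1, y2 - y1
    return (x1 + w / 2, y1 + h / 2, w, h)

def encode_offsets(gt_xyxy, anchor_xyxy):
    """Regression targets (tx, ty, tw, th) of a ground-truth box relative to an anchor."""
    ax, ay, aw, ah = _center_form(anchor_xyxy)
    gx, gy, gw, gh = _center_form(gt_xyxy)
    return ((gx - ax) / aw, (gy - ay) / ah, math.log(gw / aw), math.log(gh / ah))

def decode_offsets(offsets, anchor_xyxy):
    """Invert encode_offsets: turn predicted offsets back into an (x1, y1, x2, y2) box."""
    ax, ay, aw, ah = _center_form(anchor_xyxy)
    tx, ty, tw, th = offsets
    cx, cy = ax + tx * aw, ay + ty * ah
    w, h = aw * math.exp(tw), ah * math.exp(th)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
```

Expressing targets relative to the anchor's size keeps them in a similar numeric range for small and large objects, which is one reason this form is common.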

Intersection over Union (IoU) measures overlap between a predicted box and a ground-truth box: area of intersection divided by area of union. IoU = 1.0 means perfect overlap; IoU = 0 means no overlap. Anchor-based training assigns positive labels to anchors whose IoU with a ground-truth box meets an upper threshold and negative labels to anchors that fall below a lower one. The values depend on the detector: RetinaNet uses 0.5 and 0.4, while the Faster R-CNN region proposal network uses 0.7 and 0.3. Anchors in the ambiguous middle band are usually ignored to reduce label noise.
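
The IoU definition and the threshold-based labeling rule can be sketched in a few lines of plain Python. Boxes are assumed to be (x1, y1, x2, y2) tuples; `assign_label` and its default thresholds are illustrative, not any particular framework's API.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def assign_label(anchor, gt_boxes, pos_thr=0.5, neg_thr=0.4):
    """Return 1 (positive), 0 (negative) or -1 (ignored) for one anchor."""
    best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
    if best >= pos_thr:
        return 1
    if best < neg_thr:
        return 0
    return -1  # ambiguous band: excluded from the loss
```

Real training code vectorizes this over thousands of anchors, and some detectors add rules such as always keeping the best-matching anchor for each ground-truth box.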

At inference, a detector may emit hundreds of overlapping boxes for the same object. Non-maximum suppression (NMS) keeps the highest-confidence box and discards others whose IoU with it exceeds a threshold (typically 0.45–0.65). Soft-NMS and learned NMS variants reduce the tendency to suppress true positives in crowded scenes.
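
A minimal sketch of greedy NMS in plain Python follows. It assumes (x1, y1, x2, y2) boxes that all belong to one class; in practice NMS is usually applied per class, and the 0.5 default here is only an example value.

```python
def _iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thr=0.5):
    """Return indices of the boxes kept by greedy NMS, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a box already kept.
        if all(_iou(boxes[i], boxes[k]) <= iou_thr for k in keep):
            keep.append(i)
    return keep
```

For example, `nms([(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)], [0.9, 0.8, 0.7])` keeps the first and third boxes, because the second overlaps the first with an IoU of about 0.68.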

Two-stage vs one-stage architectures

Two-stage detectors (R-CNN family)

R-CNN (2014) generated region proposals with selective search, then ran a CNN on each cropped region to classify it: accurate but slow. Fast R-CNN computed convolutional features once per image and shared them across proposals. Faster R-CNN replaced selective search with a Region Proposal Network (RPN) that is trained jointly with the detector, and became the workhorse for accuracy-critical applications. Mask R-CNN extended Faster R-CNN with a mask head for instance segmentation.

Two-stage pipelines separate "where to look" from "what is it," which tends to improve precision on small or crowded objects at the cost of latency — often 5–15 FPS on a GPU without heavy optimization.

One-stage detectors (YOLO, SSD, RetinaNet)

YOLO (You Only Look Once) treats detection as a single regression problem over a grid: each cell predicts boxes and class probabilities directly. YOLOv5 through YOLO11 (often written YOLOv11) and the Ultralytics ecosystem dominate real-time deployment; YOLOv8/v9 balance speed and mAP for edge and cloud alike. SSD (Single Shot Detector) predicts from feature maps at several scales so that one pass covers both large and small objects. RetinaNet introduced focal loss, which down-weights easy examples (mostly background) so that training concentrates on hard ones, closing the accuracy gap with two-stage models while staying single-pass.

One-stage models trade some precision on tiny objects for throughput — 30–100+ FPS on modern GPUs, enabling live video analytics. For latency-sensitive robotics or mobile AR, start with a one-stage baseline and only move to two-stage if mAP on your target classes is insufficient.

Training: labels, losses, and data discipline

Detection training requires bounding-box annotations, which are far more expensive than classification labels. Public datasets set the benchmark: COCO (80 classes, 330K images, crowded scenes), Pascal VOC (20 classes, simpler), Open Images (millions of boxes). For domain-specific work (defect inspection, medical imaging, satellite), plan the annotation budget early: even 500 well-labeled images, combined with transfer learning from a COCO-pretrained backbone, often beat 5,000 noisy ones.

Loss functions combine classification loss (cross-entropy or focal loss per box) with localization loss (Smooth L1 or IoU-based GIoU/DIoU/CIoU for box regression). Hard example mining and balanced sampling prevent the model from collapsing to "background everywhere." Augmentations — random flip, scale jitter, mosaic (YOLO-style stitching of four images), color jitter — improve robustness; see data augmentation for principles that apply across vision tasks.
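
To make the two loss terms concrete, here is a scalar sketch of binary focal loss and Smooth L1 in plain Python. The defaults (alpha = 0.25, gamma = 2) follow the values reported for RetinaNet; the function names and the `beta` parameter are illustrative, and real implementations operate on tensors and sum over all anchors.

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Binary focal loss for one prediction.

    p: predicted probability of the positive class; y: 1 for object, 0 for background.
    """
    p_t = p if y == 1 else 1.0 - p              # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    # (1 - p_t) ** gamma shrinks the loss of examples that are already well classified.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))

def smooth_l1(x, beta=1.0):
    """Smooth L1 on one regression residual: quadratic near zero, linear beyond beta."""
    ax = abs(x)
    return 0.5 * ax * ax / beta if ax < beta else ax - 0.5 * beta
```

With gamma set to 0 the focal term reduces to alpha-weighted cross-entropy, which is a quick sanity check when wiring it into a training loop.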

Class imbalance is endemic: "person" and "car" dominate COCO; rare classes need oversampling or focal-loss tuning. Review per-class precision-recall curves, not just headline mAP.

Evaluation: mAP, precision-recall, and what the numbers mean

Mean Average Precision (mAP) is the standard detection metric. For each class, sort predictions by confidence, compute precision and recall at each threshold, integrate the precision-recall curve to get Average Precision (AP), then average across classes. mAP@0.5 uses IoU ≥ 0.5 as a match; mAP@0.5:0.95 (COCO's primary metric) averages AP across IoU thresholds from 0.5 to 0.95 in steps of 0.05 — a much stricter test of localization quality.
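
The AP recipe above can be sketched for a single class at a single IoU threshold as follows. The code assumes each detection has already been marked true or false positive by IoU matching, and it integrates the precision envelope over every recall step (all-point interpolation); COCO's official evaluator samples the curve at fixed recall points instead, so its numbers can differ slightly. The function names are hypothetical.

```python
def average_precision(scores, is_tp, num_gt):
    """AP for one class: area under the monotone precision-recall envelope."""
    if num_gt == 0:
        return 0.0
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Make precision non-increasing from right to left (the "envelope").
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

def mean_ap(ap_per_class):
    """Average the per-class AP values (dict: class name -> AP)."""
    return sum(ap_per_class.values()) / len(ap_per_class) if ap_per_class else 0.0

# IoU thresholds for mAP@0.5:0.95: 0.50, 0.55, ..., 0.95
COCO_IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]
```

To get mAP@0.5:0.95, the true/false-positive flags are recomputed at each threshold in `COCO_IOU_THRESHOLDS` and the resulting mAP values are averaged.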

A model with high mAP@0.5 but low mAP@0.5:0.95 produces boxes that are "roughly right" but sloppy on edges: acceptable for coarse counting, unacceptable for robotic grasping. Report both on your validation set. Track per-class AP to catch classes the headline mAP hides. On imbalanced custom datasets, also measure performance at the operating point your application cares about: recall at a fixed false-positive rate, or precision at a fixed recall floor (e.g., "miss no more than 1% of defects" means recall of at least 99%).
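
One way to turn an application requirement into a concrete operating point is to scan the ranked detections for the highest confidence threshold that still reaches a target recall, then read off precision there. The sketch below assumes detections are already marked true or false positive and that scores are distinct; `threshold_for_recall` is a hypothetical helper, not a library function.

```python
def threshold_for_recall(scores, is_tp, num_gt, target_recall=0.99):
    """Return (confidence threshold, precision) at the first point where recall
    reaches target_recall, or None if the detector never gets there."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    for i in order:
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        if num_gt > 0 and tp / num_gt >= target_recall:
            return scores[i], tp / (tp + fp)
    return None
```

A `None` result is informative in itself: it means no threshold satisfies the requirement, so the fix lies in data or model changes rather than in threshold tuning.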

Video, tracking, and multi-object association

Running a detector frame-by-frame gives independent boxes with no identity continuity. Multi-object tracking (MOT) layers association on top: match current detections to previous tracks via IoU, appearance embeddings (ReID models), or motion models (Kalman filters). Pipelines like ByteTrack and BoT-SORT pair a fast YOLO detector with lightweight tracking for surveillance and sports analytics.
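
The IoU-based association step can be sketched with a simple greedy matcher. This is a deliberately simplified illustration: the function name and the 0.3 threshold are assumptions, and production trackers typically solve an optimal assignment problem and add motion prediction or appearance cues on top.

```python
def _iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def associate(tracks, detections, iou_thr=0.3):
    """Greedily match existing tracks to new detections by IoU.

    tracks: dict of track id -> last known box; detections: list of boxes.
    Returns (matches: track id -> detection index, unmatched detection indices).
    """
    pairs = [(_iou(tbox, dbox), tid, j)
             for tid, tbox in tracks.items()
             for j, dbox in enumerate(detections)]
    pairs.sort(key=lambda p: p[0], reverse=True)  # best overlaps first
    matches, used = {}, set()
    for overlap, tid, j in pairs:
        if overlap < iou_thr:
            break  # every remaining pair overlaps even less
        if tid in matches or j in used:
            continue
        matches[tid] = j
        used.add(j)
    unmatched = [j for j in range(len(detections)) if j not in used]
    return matches, unmatched
```

Unmatched detections would normally start new tracks, and tracks that stay unmatched for several frames would be retired.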

Temporal smoothing — exponential moving average on box coordinates, or running detection every N frames with interpolation — reduces flicker without retraining. For safety-critical systems, require consecutive-frame confirmation before triggering an action.
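
Both ideas fit in a few lines. The sketch below shows an exponential moving average over box coordinates and a consecutive-frame confirmation gate; the class names, the smoothing factor, and the three-frame requirement are illustrative choices rather than recommended settings.

```python
class BoxSmoother:
    """Exponential moving average over (x1, y1, x2, y2) coordinates of one track."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha  # weight of the newest observation
        self.state = None

    def update(self, box):
        if self.state is None:
            self.state = tuple(float(v) for v in box)
        else:
            a = self.alpha
            self.state = tuple(a * new + (1 - a) * old
                               for new, old in zip(box, self.state))
        return self.state


class ConfirmationGate:
    """Fire only after `required` consecutive frames contain the detection."""

    def __init__(self, required=3):
        self.required = required
        self.streak = 0

    def update(self, detected):
        self.streak = self.streak + 1 if detected else 0
        return self.streak >= self.required
```

A higher `alpha` follows fast motion more closely but smooths less; a higher `required` count suppresses more one-frame false alarms at the cost of added reaction delay.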

Deployment: latency, quantization, and domain shift

Production detectors face constraints demos ignore. Latency budgets dictate model size: YOLO-nano for mobile NPUs, YOLO-medium for edge GPUs, larger backbones for server batch inference. TensorRT, ONNX Runtime, and OpenVINO fuse layers and optimize kernels; INT8 quantization often cuts inference time 2–4× with modest mAP loss if calibration data matches deployment distribution.

Domain shift is the silent killer: a model trained on daylight COCO photos degrades on night-vision warehouse cameras. Monitor live precision/recall with human spot checks or a golden test set captured from production cameras. Retrain or fine-tune when drift exceeds thresholds — the same discipline as model drift monitoring in tabular ML.

Input resolution trades accuracy for speed: 640×640 is a common YOLO default; dropping to 416×416 may halve latency but hurt small-object recall. Profile on target hardware, not your training workstation.

Decision table: which detector family fits?

Requirement	Recommended starting point	Why
Real-time video (>30 FPS) on edge GPU	YOLOv8/v9 small or medium	Single-pass, mature export tooling, strong speed/mAP tradeoff
Highest mAP on crowded small objects	Faster R-CNN or Cascade R-CNN	Two-stage refinement; better on dense scenes at cost of latency
Instance masks needed	Mask R-CNN or YOLO-seg variant	Per-pixel instance boundaries for grasping or editing
Mobile / NPU deployment	YOLO-nano, MobileNet-SSD, or INT8-quantized one-stage	Small footprint; test on-device early
<1,000 labeled images, new domain	Transfer-learn YOLO or Faster R-CNN from COCO weights	Pretrained backbone features reduce data hunger
Open-vocabulary ("find the red wrench")	Grounding DINO, OWL-ViT, or CLIP + detector hybrid	Text-conditioned detection beyond fixed class lists

Common anti-patterns

Trusting COCO mAP on custom data — pretrain metrics rarely transfer; evaluate on your own labeled validation set.
Ignoring NMS tuning — default thresholds cause duplicate boxes or missed neighbors in crowds.
Training on mismatched aspect ratios — stretching portrait warehouse feeds to square inputs distorts object shapes.
No hard-negative mining — backgrounds that look like objects (posters, shadows) cause false positives in production.
Skipping per-class metrics — 45% mAP can mean 80% on "car" and 5% on your critical rare class.
Deploying without latency profiling on target hardware — laptop FPS misleads for Jetson, mobile, or browser WebGPU paths.
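
For the aspect-ratio anti-pattern above, a common alternative to stretching is letterboxing: scale the frame uniformly and pad the remainder. The sketch below only computes the geometry (scale, padding, and box mapping in both directions) under the assumption of a square target and symmetric padding; the function names are hypothetical and the actual pixel resize is left to whatever image library is in use.

```python
def letterbox_params(src_w, src_h, dst=640):
    """Uniform scale and symmetric padding that fit a src_w x src_h frame
    into a dst x dst input without changing its aspect ratio."""
    scale = min(dst / src_w, dst / src_h)
    pad_x = (dst - src_w * scale) / 2
    pad_y = (dst - src_h * scale) / 2
    return scale, pad_x, pad_y

def to_letterbox(box, params):
    """Map an (x1, y1, x2, y2) box from source pixels to model-input pixels."""
    scale, pad_x, pad_y = params
    x1, y1, x2, y2 = box
    return (x1 * scale + pad_x, y1 * scale + pad_y,
            x2 * scale + pad_x, y2 * scale + pad_y)

def from_letterbox(box, params):
    """Map a predicted box from model-input pixels back to source pixels."""
    scale, pad_x, pad_y = params
    x1, y1, x2, y2 = box
    return ((x1 - pad_x) / scale, (y1 - pad_y) / scale,
            (x2 - pad_x) / scale, (y2 - pad_y) / scale)
```

For a 1280×720 frame and a 640×640 input, this gives a scale of 0.5 with 140 pixels of padding above and below, so objects keep their shape at both training and inference time.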

Production checklist

Task defined: detection sufficient, or need segmentation/tracking?
Annotation guidelines documented (box tightness, occluded-object policy, class list).
Train/val/test split stratified by scene type and lighting conditions.
Baseline trained with COCO-pretrained backbone; mAP@0.5 and mAP@0.5:0.95 reported per class.
NMS threshold tuned on validation crowded scenes.
Latency profiled on deployment hardware at chosen input resolution.
Quantization evaluated if edge deployment; calibration set matches live distribution.
Video pipeline includes tracking or temporal confirmation if actions depend on identity.
Drift monitoring plan: golden set, periodic human audit, retrain trigger defined.
Failure modes documented: low light, motion blur, occlusion, adversarial inputs.

Key takeaways

Object detection localizes and classifies every instance — boxes plus labels, not a single image-level tag.
IoU and NMS are the glue between raw network outputs and clean predictions.
Two-stage models favor accuracy on hard scenes; one-stage (YOLO family) favors real-time throughput.
mAP@0.5:0.95 punishes sloppy boxes; always evaluate on your domain, not just public benchmarks.
Production success depends on annotation quality, domain-matched augmentation, and deployment profiling as much as architecture choice.