Image segmentation explained

Image segmentation assigns meaning at pixel resolution. Where object detection draws rectangles around instances, segmentation paints the scene: every pixel belongs to a class, an object instance, or background. That fine-grained map powers lane parsing for autonomous driving, tumor outlining in medical imaging, satellite land-use mapping, photo background removal, and industrial defect inspection. Models build on convolutional neural networks and, increasingly, vision transformers. This guide covers the three segmentation families, encoder-decoder architectures, loss functions and metrics, annotation economics, a worked example (a defect-inspection pipeline at Harbor Manufacturing), a method decision table, common pitfalls, and a practitioner checklist.

Three segmentation families

Segmentation tasks differ in what they ask the model to output. Picking the wrong family wastes annotation budget and produces metrics that look good in notebooks but fail downstream.

Semantic segmentation

Every pixel gets a class label — road, sidewalk, car, sky — but two overlapping cars of the same class share one blob. The output is an H × W label map (or one-hot channels per class). Use semantic segmentation when you care about regions, not individual object counts: drivable surface, vegetation cover, organ tissue types.
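
To make the two output layouts concrete, here is a minimal NumPy sketch that converts an H × W label map into one-hot channels and back. The class IDs (0 = road, 1 = car, 2 = sky) and the tiny 2 × 3 image are made up for illustration.

```python
import numpy as np

# Hypothetical 3-class label map (0 = road, 1 = car, 2 = sky) for a 2 x 3 image.
label_map = np.array([[0, 0, 2],
                      [1, 1, 2]])
num_classes = 3

# One-hot encoding: shape (num_classes, H, W), one binary channel per class.
one_hot = (np.arange(num_classes)[:, None, None] == label_map[None]).astype(np.uint8)

# argmax over the class axis recovers the original label map.
recovered = one_hot.argmax(axis=0)
```

Each pixel is set in exactly one channel, which is why the two car pixels cannot be told apart as separate objects in this representation.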

Instance segmentation

Each object instance gets its own binary mask, even when classes repeat. Two people produce two separate masks. Architectures like Mask R-CNN first propose boxes, then predict a mask per proposal. Use instance segmentation when you must count, measure area per object, or track individuals across video frames.

Panoptic segmentation

Panoptic unifies semantic "stuff" (sky, grass — amorphous regions) with instance "things" (people, cars — countable objects). Every pixel has exactly one ID: either a stuff class or a thing instance ID. Modern benchmarks (COCO panoptic, Cityscapes) evaluate both branches with PQ (panoptic quality). Use panoptic when a scene mixes countable objects and continuous regions — street scenes, retail aisles, warehouse floors.
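
One way to store "exactly one ID per pixel" is to pack the class and the instance index into a single integer. The sketch below assumes the encoding class_id * 1000 + instance_id, with instance 0 reserved for stuff; the offset and the toy class IDs are assumptions for illustration, and real datasets define their own encodings.

```python
import numpy as np

# Assumed convention for this sketch: panoptic_id = class_id * OFFSET + instance_id,
# with instance_id = 0 for stuff classes.
OFFSET = 1000

# Toy 2 x 4 scene: 0 = sky (stuff), 1 = car (thing), 2 = person (thing).
semantic = np.array([[0, 0, 1, 1],
                     [0, 2, 2, 1]])
# Per-class instance index; 0 everywhere a stuff class is present.
instance = np.array([[0, 0, 1, 1],
                     [0, 1, 1, 2]])

panoptic = semantic * OFFSET + instance

# Both pieces of information can be read back from the single ID map.
class_of = panoptic // OFFSET
instance_of = panoptic % OFFSET
segments = np.unique(panoptic)   # one entry per stuff region or thing instance
```

Here `segments` contains four IDs: one for sky, two for the two cars, and one for the person.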

Encoder-decoder architectures

Segmentation networks typically follow an encoder-decoder pattern: a backbone (ResNet, EfficientNet, Swin Transformer) downsamples the image to capture context, then a decoder upsamples back to full resolution.

U-Net and skip connections

U-Net concatenates high-resolution encoder features into the decoder via skip connections, preserving fine edges lost during pooling. It remains the workhorse for medical imaging and any task where boundary precision matters more than real-time speed. Variants (U-Net++, Attention U-Net) add nested skip paths or attention gates to sharpen boundaries.
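
The shape bookkeeping of a skip connection can be sketched in a few lines of NumPy. This is only an illustration of the concatenation step: random arrays stand in for learned activations, nearest-neighbour upsampling stands in for whatever upsampling layer a given U-Net uses, and the channel counts are arbitrary.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

# Hypothetical feature maps; random values stand in for learned activations.
rng = np.random.default_rng(0)
encoder_feat = rng.standard_normal((64, 128, 128))   # high resolution, saved before pooling
decoder_feat = rng.standard_normal((128, 64, 64))    # coarse features from deeper in the network

# Bring the decoder features back to the encoder's resolution...
upsampled = upsample2x(decoder_feat)                  # (128, 128, 128)

# ...then stack both along the channel axis so later convolutions see
# coarse context and fine edges side by side.
merged = np.concatenate([encoder_feat, upsampled], axis=0)   # (192, 128, 128)
```

In a real network a convolution block follows the concatenation and reduces the channel count again.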

FCN, DeepLab, and atrous convolution

Fully Convolutional Networks (FCN) replaced the fully connected classification layers with convolutions and learned upsampling, showing that dense prediction works on inputs of arbitrary size. DeepLab introduced atrous (dilated) convolutions to enlarge receptive fields without shrinking spatial resolution, plus ASPP (Atrous Spatial Pyramid Pooling) to capture multi-scale context. DeepLabv3+, which adds a lightweight decoder, remains a common strong baseline on Cityscapes and ADE20K.
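
A one-dimensional toy shows what dilation does to the receptive field. The sketch below is plain Python with no padding, so the output gets shorter as the span grows; real layers pad the input to keep the resolution. With kernel size k and dilation d, each output covers (k - 1) * d + 1 inputs while still using only k weights.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """'Valid' 1-D cross-correlation with gaps of `dilation` between kernel taps."""
    span = (len(kernel) - 1) * dilation + 1   # inputs covered by one output value
    return [
        sum(kernel[k] * signal[i + k * dilation] for k in range(len(kernel)))
        for i in range(len(signal) - span + 1)
    ]

signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 0, -1]   # simple difference filter

dense = dilated_conv1d(signal, kernel, dilation=1)    # each output spans 3 inputs
atrous = dilated_conv1d(signal, kernel, dilation=2)   # each output spans 5 inputs
```

On this ramp signal the dense filter returns -2 everywhere (difference across 2 steps) and the dilated one returns -4 (difference across 4 steps), with the same three weights.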

Instance and panoptic heads

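Mask R-CNN extends Faster R-CNN: alongside the box and class heads, a parallel mask head (a small FCN applied per RoI) predicts instance masks. YOLACT pursues real-time instance masks by predicting image-wide prototype masks and combining them with per-instance coefficients; SOLO drops box proposals and assigns masks by object location on a grid. Panoptic FPN adds a semantic branch to Mask R-CNN on a shared FPN backbone, a two-branch design. Mask2Former instead handles stuff and things in a single transformer decoder whose learned queries each predict a mask and a class, simplifying the two-branch pipelines of earlier panoptic systems.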

Loss functions and class imbalance

Segmentation datasets are notoriously imbalanced: most pixels are background or dominant "stuff" classes. A naive cross-entropy loss lets the model cheat by predicting background everywhere.

  • Weighted cross-entropy — upweight rare classes by inverse frequency; simple but sensitive to weight tuning.
  • Dice loss / F1 loss — optimizes overlap directly; popular in medical imaging with tiny foreground regions.
  • Focal loss — down-weights easy pixels so the gradient concentrates on hard ones; borrowed from object detection for dense prediction.
  • Lovász-Softmax — differentiable surrogate for IoU; aligns training objective with evaluation metric.

Combine losses when boundaries and class balance both matter: e.g. cross-entropy plus Dice on the foreground channel. Monitor per-class IoU, not just mean IoU — a high mIoU can hide a collapsed rare class.
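
The following NumPy sketch shows one way to combine cross-entropy with a soft Dice term on a single foreground channel. It is a forward-pass illustration only (no gradients), and the function names, the equal 0.5 weights, and the 4 × 4 toy image are assumptions rather than a reference implementation.

```python
import numpy as np

def cross_entropy(probs, target, eps=1e-7):
    """Mean pixel-wise cross-entropy. probs: (C, H, W) softmax output, target: (H, W) int labels."""
    h, w = target.shape
    # Pick, for every pixel, the probability assigned to its true class.
    picked = probs[target, np.arange(h)[:, None], np.arange(w)[None, :]]
    return float(-np.log(picked + eps).mean())

def soft_dice_loss(fg_prob, fg_target, eps=1e-7):
    """1 - Dice for one channel. fg_prob: (H, W) probabilities, fg_target: (H, W) in {0, 1}."""
    intersection = (fg_prob * fg_target).sum()
    return float(1.0 - (2.0 * intersection + eps) / (fg_prob.sum() + fg_target.sum() + eps))

def combined_loss(probs, target, fg_class=1, w_ce=0.5, w_dice=0.5):
    """Weighted sum of cross-entropy over all classes and Dice on one foreground class."""
    dice = soft_dice_loss(probs[fg_class], (target == fg_class).astype(float))
    return w_ce * cross_entropy(probs, target) + w_dice * dice

# Toy case: a single foreground pixel in a 4 x 4 image.
target = np.zeros((4, 4), dtype=int)
target[0, 0] = 1

# A "lazy" model that predicts background with 99% confidence everywhere.
lazy = np.stack([np.full((4, 4), 0.99), np.full((4, 4), 0.01)])

lazy_ce = cross_entropy(lazy, target)
lazy_dice = soft_dice_loss(lazy[1], (target == 1).astype(float))
```

On this toy input the lazy prediction gets a mild cross-entropy (about 0.3) but a Dice loss close to 1, which is the behaviour that makes the Dice term useful for tiny foreground regions.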

Evaluation metrics

Segmentation quality is measured by overlap between predicted and ground-truth masks.

IoU and mIoU

Intersection over Union (IoU) = area of overlap divided by area of union. For class c, IoU_c compares the predicted and true pixel sets of that class. Mean IoU (mIoU) averages IoU_c across classes, with void/ignore pixels usually excluded. COCO instance segmentation reports mask mAP (AP averaged over IoU thresholds from 0.50 to 0.95 in steps of 0.05), analogous to detection mAP.
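
A compact way to compute per-class IoU is through a confusion matrix. The sketch below assumes integer label maps and an ignore value of 255 for void pixels; both are common conventions but are assumptions here, as is the function name.

```python
import numpy as np

def per_class_iou(pred, target, num_classes, ignore_index=255):
    """Return one IoU per class (NaN if the class appears in neither map)."""
    valid = target != ignore_index
    # Confusion matrix: rows = true class, columns = predicted class.
    cm = np.bincount(
        target[valid] * num_classes + pred[valid],
        minlength=num_classes ** 2,
    ).reshape(num_classes, num_classes)
    intersection = np.diag(cm).astype(float)
    union = cm.sum(axis=0) + cm.sum(axis=1) - intersection
    return np.where(union > 0, intersection / np.maximum(union, 1), np.nan)

target = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 255]])   # 255 marks a void pixel
pred = np.array([[0, 0, 1, 0],
                 [0, 1, 1, 1]])

ious = per_class_iou(pred, target, num_classes=2)
miou = float(np.nanmean(ious))
```

For this toy pair, class 0 has IoU 3/5 = 0.6 and class 1 has IoU 2/4 = 0.5, so mIoU is 0.55; the void pixel does not count either way.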

Boundary and human judgment

High mIoU can still look wrong at object edges. Boundary F-score measures precision/recall within a pixel band around true contours — useful for surgical or manufacturing applications where a 2-pixel shift changes a pass/fail gate. For subjective tasks (portrait matting), A/B human preference often beats any scalar metric.

Annotation workflows and data discipline

Segmentation is annotation-expensive. A single Cityscapes frame can take around 90 minutes to annotate with polygons. Budget accordingly.

  • Polygon vs brush vs superpixels — polygons are precise for straight edges; brush tools speed organic shapes; superpixel pre-segmentation reduces click count.
  • Weak supervision — bounding boxes, scribbles, or point clicks can train models via GrabCut-style refinement or pseudo-label propagation; expect 5–15 point mIoU gap vs full masks unless iterated.
  • Ignore regions — mark ambiguous pixels (reflections, occlusions) with a void class so they do not poison loss.
  • Domain coverage — lighting, camera angle, season, and sensor noise shift pixel statistics; data augmentation (color jitter, random crop, horizontal flip where valid) and deliberate domain sampling beat hoping the model generalizes.

For small datasets, use transfer learning: start from a backbone pretrained on ImageNet or COCO, fine-tune the decoder and the last encoder blocks first, then unfreeze the remaining (earlier) encoder layers if validation mIoU plateaus.

Worked example: Harbor defect inspection

Harbor Manufacturing runs a conveyor camera that photographs metal brackets after stamping. The quality team needs to flag scratch, dent, and discoloration regions — not just "defective yes/no" — so root-cause engineers can correlate defect shape with press tooling wear.

Task definition

Semantic segmentation with four classes: background (bracket surface), scratch, dent, discoloration. Instance boundaries between two scratches on the same bracket are merged — the team cares about defect type area, not counting separate scratch events. If they later need per-scratch counts, upgrade to instance segmentation without re-annotating from scratch (masks become instance seeds).
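
The "masks become instance seeds" idea can be sketched with connected-component labelling: every separate blob in a class mask becomes a candidate instance. The pure-Python flood fill below is an illustration under the assumption of 4-connectivity; the function name and toy mask are hypothetical, and a production pipeline would more likely use an image-processing library.

```python
from collections import deque

def label_components(mask):
    """4-connected component labelling of a binary mask given as a list of rows.

    Returns (labels, count): labels holds 0 for background and 1..count per blob.
    """
    h, w = len(mask), len(mask[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for r in range(h):
        for c in range(w):
            if not mask[r][c] or labels[r][c]:
                continue
            # Unvisited foreground pixel: start a new blob and flood-fill it.
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not labels[ny][nx]:
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count

# Toy scratch-class mask with two separate scratches.
scratch_mask = [
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
]
labels, n_scratches = label_components(scratch_mask)
```

Scratches that touch or cross end up in one blob, so these seeds would still need an annotator's review before serving as instance labels.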

Pipeline

  1. Collect 4,000 images across three press lines and two lighting setups; hold out 20% by production line, not random frames, to test domain shift.
  2. Annotators paint defects in CVAT with a 3-pixel brush; void-label specular glare hotspots.
  3. Train U-Net with ResNet-34 encoder; loss = 0.5 cross-entropy + 0.5 Dice on defect classes; input 512×512 random crops.
  4. Validation mIoU = 0.81 overall; scratch IoU = 0.74 (thinnest class). Boundary F-score at 3 px = 0.79.
  5. Deploy ONNX model on edge GPU; post-process with morphological opening to remove sub-10 px noise blobs; alert if total defect area exceeds 2% of bracket mask.
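
The area-based alert in step 5 comes down to a pixel count. The sketch below assumes class IDs 1 to 3 for scratch, dent, and discoloration, and a separate boolean mask of bracket pixels; both are assumptions for illustration, and the morphological opening step is left out.

```python
import numpy as np

DEFECT_CLASSES = (1, 2, 3)   # assumed IDs: scratch, dent, discoloration
AREA_THRESHOLD = 0.02        # 2% of the bracket area, as in step 5

def defect_fraction(pred, bracket_mask):
    """Share of bracket pixels predicted as any defect class."""
    bracket_pixels = int(bracket_mask.sum())
    if bracket_pixels == 0:
        return 0.0
    defect_pixels = int((np.isin(pred, DEFECT_CLASSES) & bracket_mask).sum())
    return defect_pixels / bracket_pixels

def should_alert(pred, bracket_mask, threshold=AREA_THRESHOLD):
    return defect_fraction(pred, bracket_mask) > threshold

# Toy 100 x 100 prediction where the bracket fills the whole frame.
bracket_mask = np.ones((100, 100), dtype=bool)
pred = np.zeros((100, 100), dtype=int)

pred[10:20, 10:25] = 2                       # 150 px dent -> 1.5% of the bracket
alert_small = should_alert(pred, bracket_mask)

pred[50:60, 50:60] = 1                       # +100 px scratch -> 2.5% in total
alert_large = should_alert(pred, bracket_mask)
```

The first prediction stays under the 2% gate and the second crosses it.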

After six weeks, false reject rate dropped 34% versus the previous blob-detection heuristic — because pixel masks distinguished harmless oil smudges (discoloration class, low area) from structural dents.

Method decision table

Your need                           | Recommended approach           | Typical architecture
Region types, no instance count     | Semantic segmentation          | U-Net, DeepLabv3+, SegFormer
Count and mask each object          | Instance segmentation          | Mask R-CNN, YOLACT, SOLOv2
Street scenes, mixed stuff + things | Panoptic segmentation          | Mask2Former, Panoptic FPN
Tiny foreground, medical volumes    | Semantic + Dice/Lovász loss    | Attention U-Net, nnU-Net
Real-time video on edge device      | Lightweight semantic or YOLACT | MobileNet encoder, TensorRT INT8
Interactive click-to-segment        | Promptable foundation model    | SAM (Segment Anything), fine-tuned decoder
Only have bounding box labels       | Weakly supervised / BoxInst    | Box-to-mask propagation, iterative pseudo-labels

Production deployment considerations

Training mIoU on a curated validation set is the beginning, not the end.

  • Resolution mismatch — training on 512 px crops but inferring on 4K frames loses thin structures; use sliding-window inference with overlap blending or train at production resolution.
  • Calibration — tune the softmax decision threshold per class; the 0.5 default rarely maximizes F1 on imbalanced defect classes.
  • Latency vs quality — full-resolution U-Net may miss a 30 fps line-speed budget; distill to a smaller encoder or run detection first to crop ROIs.
  • Monitoring — track per-class pixel prevalence over time; sudden background inflation signals camera drift or lighting failure.
  • Human-in-the-loop — route low-confidence frames to review; feed corrections back as hard-example retraining (see active learning).
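
Sliding-window inference with overlap blending, mentioned under resolution mismatch, can be sketched as follows. The sketch averages class probabilities uniformly wherever tiles overlap; weighted windows are a common alternative. `predict_fn` is a hypothetical stand-in for the deployed model, and the tile and overlap sizes are placeholders.

```python
import numpy as np

def tile_starts(length, tile, stride):
    """Start offsets so tiles cover [0, length); the last tile sits flush with the end."""
    if length <= tile:
        return [0]
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)
    return starts

def sliding_window_predict(image, predict_fn, num_classes, tile=512, overlap=64):
    """Average per-class probabilities over overlapping tiles.

    image: (H, W, ...) array. predict_fn: hypothetical callable that maps an
    (h, w, ...) tile to (num_classes, h, w) probabilities.
    """
    h, w = image.shape[:2]
    stride = tile - overlap
    prob_sum = np.zeros((num_classes, h, w), dtype=np.float64)
    hits = np.zeros((h, w), dtype=np.float64)
    for y in tile_starts(h, tile, stride):
        for x in tile_starts(w, tile, stride):
            patch = image[y:y + tile, x:x + tile]
            ph, pw = patch.shape[:2]
            prob_sum[:, y:y + ph, x:x + pw] += predict_fn(patch)
            hits[y:y + ph, x:x + pw] += 1
    return prob_sum / hits   # every pixel is covered at least once

def fake_model(patch):
    """Stand-in for a real network: always 30% background, 70% defect."""
    h, w = patch.shape[:2]
    return np.stack([np.full((h, w), 0.3), np.full((h, w), 0.7)])

frame = np.zeros((10, 13))
blended = sliding_window_predict(frame, fake_model, num_classes=2, tile=4, overlap=1)
```

With a constant stand-in model the blended output is constant too, which makes a quick sanity check that the tiling covers the whole frame and the averaging is normalized correctly.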

Common mistakes

  • Optimizing accuracy instead of IoU — 97% pixel accuracy is meaningless when 95% of pixels are background.
  • Random train/val splits on video frames — adjacent frames leak; split by sequence or camera session.
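  • Skipping void/ignore labels — if ambiguous border pixels are forced into a class instead of being masked out, they teach the model the wrong boundary.
  • Resizing label maps without nearest-neighbor interpolation — bilinear resizing blends class IDs into values that match no real class and blurs one-hot masks, so the IoU computed afterwards is unreliable.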
  • Chasing leaderboard architectures on 200 images — Mask2Former needs data; U-Net with a pretrained encoder wins on small sets.
  • Skipping morphological or CRF post-processing — when the downstream system needs clean regions, a 5-line opening/closing pass beats a bigger model.
  • Confusing detection boxes with segmentation — if pixel boundaries matter, boxes plus GrabCut is a stopgap, not the product.
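
To avoid the frame-leak mistake above, the split can be made over whole sequences instead of individual frames. The sketch below is plain Python; the frame records, the `sequence` field, and the function name are hypothetical. Note that `val_fraction` applies to the number of groups, not the number of frames.

```python
import random
from collections import defaultdict

def split_by_group(frames, group_key, val_fraction=0.2, seed=0):
    """Split frames so that every group (sequence, camera session) lands on one side only."""
    groups = defaultdict(list)
    for frame in frames:
        groups[group_key(frame)].append(frame)
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)          # reproducible shuffle of group keys
    n_val = max(1, round(len(keys) * val_fraction))
    val = [f for k in keys[:n_val] for f in groups[k]]
    train = [f for k in keys[n_val:] for f in groups[k]]
    return train, val

# Hypothetical dataset: 5 sequences with 4 frames each.
frames = [
    {"path": f"seq{s}/frame{i:03d}.png", "sequence": s}
    for s in range(5)
    for i in range(4)
]
train, val = split_by_group(frames, group_key=lambda f: f["sequence"])
```

The same function works for splitting by production line or camera by changing `group_key`.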

Production checklist

  • Define semantic vs instance vs panoptic up front with downstream consumer requirements.
  • Establish annotation guide with edge-case examples (occlusion, glare, class ambiguity).
  • Split validation by domain (camera, site, time) — not i.i.d. random frames.
  • Report per-class IoU and boundary metrics, not only mIoU.
  • Use pretrained encoders and augment aggressively before collecting more labels.
  • Match training and inference resolution strategy (crop, slide, multi-scale).
  • Calibrate per-class thresholds on a hold-out operational set.
  • Monitor prediction statistics in production; alert on distribution shift.
  • Version datasets and masks; segmentation regressions are hard to eyeball.
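
Per-class threshold calibration from the checklist can be as simple as sweeping candidate thresholds on hold-out pixels and keeping the one with the best F1. The sketch below handles one class at a time; the function name, the candidate grid, and the six toy probabilities are assumptions for illustration.

```python
import numpy as np

def best_threshold(probs, truth, candidates=None):
    """Return (threshold, f1) maximizing F1 for one class.

    probs: 1-D array of predicted probabilities for the class.
    truth: 1-D boolean array, True where the pixel really belongs to the class.
    """
    if candidates is None:
        candidates = np.linspace(0.05, 0.95, 19)   # 0.05, 0.10, ..., 0.95
    best_t, best_f1 = 0.5, -1.0
    for t in candidates:
        pred = probs >= t
        tp = np.sum(pred & truth)
        fp = np.sum(pred & ~truth)
        fn = np.sum(~pred & truth)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:                            # ties keep the lower threshold
            best_t, best_f1 = float(t), float(f1)
    return best_t, best_f1

# Toy hold-out pixels for a rare defect class.
probs = np.array([0.10, 0.22, 0.31, 0.37, 0.42, 0.81])
truth = np.array([False, False, True, False, True, True])

threshold, f1 = best_threshold(probs, truth)
_, f1_at_default = best_threshold(probs, truth, candidates=[0.5])
```

On this toy data the default 0.5 threshold finds only one of the three true pixels (F1 = 0.5), while a lower threshold recovers all three at the cost of one false positive.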

Key takeaways

  • Segmentation is pixel-level classification — three families (semantic, instance, panoptic) answer different downstream questions.
  • Encoder-decoder CNNs and transformers recover spatial detail via skip connections, atrous convolutions, or mask-query decoders.
  • Class imbalance dominates training — weighted CE, Dice, focal, and Lovász losses align optimization with rare foreground pixels.
  • IoU and mIoU are the core metrics — supplement with boundary scores when edge precision gates quality.
  • Annotation cost drives feasibility — weak labels, transfer learning, and foundation-model prompting reduce paint time.
