
Computer vision fundamentals explained

Your phone unlocks on your face. A warehouse robot picks the right box. A radiologist gets a second opinion on a scan. All of it runs on computer vision (CV): algorithms that turn raw pixels into labels, boxes, masks, or 3D structure. CV predates deep learning (edge detectors date to the 1960s, and the classic optical-flow algorithms to the early 1980s), but the 2012 ImageNet breakthrough with convolutional neural networks (CNNs) shifted the field from hand-crafted features to learned representations. Today, vision powers diffusion image generators, autonomous driving stacks, retail checkout cameras, and multimodal LLMs that “see” screenshots. This guide walks through how images become tensors, the core task types, landmark architectures from ResNet to YOLO to vision transformers, data and evaluation discipline, and what breaks when you ship models to production.

What computer vision actually does

At the lowest level, a digital image is a grid of numbers. An 8-bit RGB photo stores three channels (red, green, blue) per pixel, each 0–255. A 1920×1080 frame is roughly six million integers — far too high-dimensional for naive tabular machine learning. Computer vision builds pipelines that compress spatial structure into useful features: edges, textures, parts, whole objects, and scene context.

Deep-learning CV models consume tensors — multidimensional arrays — usually shaped [batch, channels, height, width]. Preprocessing normalizes pixel values (often to 0–1 or z-scored), resizes to a fixed input (224×224 for many classifiers), and may apply color jitter or random crops during training. The model outputs predictions; post-processing maps logits to class names, bounding boxes, or pixel masks.
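
The preprocessing step above can be sketched in plain Python. This is a minimal illustration, assuming a tiny made-up 2×2 image and simple 0–1 scaling; real pipelines use array libraries and usually add resizing and per-channel normalization. The helper names (`to_chw_unit_range`, `shape`) are hypothetical.

```python
# Hypothetical 2x2 RGB image in height-width-channel (HWC) order, 8-bit values.
image_hwc = [
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [255, 255, 255]],
]

def to_chw_unit_range(img):
    """Scale 0-255 pixels to 0-1 and reorder HWC -> CHW."""
    height, width, channels = len(img), len(img[0]), len(img[0][0])
    return [
        [[img[y][x][c] / 255.0 for x in range(width)] for y in range(height)]
        for c in range(channels)
    ]

def shape(nested):
    """Return the dimensions of a regular nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return dims

chw = to_chw_unit_range(image_hwc)
batch = [chw]  # add a leading batch dimension of size 1
print(shape(batch))  # dimensions in [batch, channels, height, width] order
print(chw[0])        # the red channel as a 2x2 grid
```

Whatever scaling and channel order you pick here has to be reproduced exactly at inference time, since the model only ever sees the transformed numbers.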

Core task families

Image classification — One label per image: “cat,” “dog,” “defect.” Simplest task; backbone for many apps.
Object detection — Multiple objects with axis-aligned boxes and class scores. Used in surveillance, robotics, document parsing.
Semantic segmentation — A class label per pixel (road vs sidewalk vs pedestrian). Common in medical imaging and autonomous driving.
Instance segmentation — Like semantic segmentation but separates individual object instances (Mask R-CNN).
Keypoint / pose estimation — Joint locations for humans or articulated objects.
Image-to-image — Super-resolution, denoising, style transfer, and generative models that output full images.

Pick the task that matches your product question. Retail shrink detection needs detection, not classification. A quality-control line may only need binary classification with a heatmap explainability overlay.

Convolutional neural networks: the workhorse

CNNs exploit translation equivariance: a cat in the top-left corner activates similar filters as a cat in the bottom-right. Convolution layers slide small learned kernels across the image, building hierarchical features — early layers detect edges and color blobs; deeper layers assemble eyes, wheels, and faces. Pooling (max or average) downsamples spatial resolution, widening receptive fields while reducing compute.
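
To make the sliding-kernel idea concrete, here is a small sketch in plain Python. It assumes a hand-picked 2×2 kernel standing in for a learned filter and a toy 5×5 image; all function names are illustrative. It shows that moving the input pattern moves the strongest response by the same amount, and that 2×2 max pooling halves the feature-map resolution.

```python
def conv2d_valid(image, kernel):
    """Slide a kernel over a 2D image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[y + i][x + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]

def max_pool2x2(fmap):
    """Downsample by taking the max over non-overlapping 2x2 windows."""
    return [
        [max(fmap[y][x], fmap[y][x + 1], fmap[y + 1][x], fmap[y + 1][x + 1])
         for x in range(0, len(fmap[0]) - 1, 2)]
        for y in range(0, len(fmap) - 1, 2)
    ]

def argmax2d(fmap):
    """Return (row, col) of the largest value in a 2D list."""
    return max(
        ((y, x) for y in range(len(fmap)) for x in range(len(fmap[0]))),
        key=lambda pos: fmap[pos[0]][pos[1]],
    )

def image_with_patch(size, top, left):
    """Blank size x size image with a 2x2 block of ones at (top, left)."""
    img = [[0] * size for _ in range(size)]
    for dy in range(2):
        for dx in range(2):
            img[top + dy][left + dx] = 1
    return img

kernel = [[1, 1], [1, 1]]  # stand-in for a learned "bright blob" filter

top_left = conv2d_valid(image_with_patch(5, 0, 0), kernel)
shifted = conv2d_valid(image_with_patch(5, 2, 2), kernel)

print(argmax2d(top_left), argmax2d(shifted))  # peak location follows the patch
print(max_pool2x2(top_left), max_pool2x2(shifted))
```

The same kernel weights are reused at every position, which is why the filter responds to the pattern wherever it appears.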

A typical classifier stacks conv blocks, global average pooling, and a fully connected softmax head. Training uses labeled images and backpropagation, the same gradient machinery described in our deep learning guide. Loss functions vary by task: cross-entropy for classification, and a combination of classification and box-regression losses for detection (e.g. smooth L1, GIoU).
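
As a rough sketch of the two loss families mentioned above, the following plain-Python functions compute softmax cross-entropy for one example and smooth L1 for one box coordinate. The `beta` parameter and the function names are illustrative choices, not taken from any particular library.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[target_index])

def smooth_l1(pred, target, beta=1.0):
    """Quadratic near zero, linear for large errors (less outlier-sensitive)."""
    diff = abs(pred - target)
    if diff < beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta

confident_right = cross_entropy([4.0, 0.0, 0.0], target_index=0)
confident_wrong = cross_entropy([4.0, 0.0, 0.0], target_index=1)
print(confident_right < confident_wrong)  # wrong-but-confident costs more
print(smooth_l1(0.5, 0.0), smooth_l1(3.0, 0.0))
```

A detector typically sums a classification term like the first with a weighted box term like the second; the weighting is a tuning choice.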

Landmark CNN architectures

LeNet / AlexNet — Historical foundations; AlexNet proved deep CNNs on GPU at ImageNet scale.
VGG — Uniform 3×3 stacks; simple but heavy.
ResNet — Skip connections make very deep networks trainable by easing gradient flow (mitigating, not eliminating, vanishing gradients); ResNet-50 remains a fine default backbone.
EfficientNet / MobileNet — Compound scaling or depthwise separable convs for mobile and edge latency budgets.

In practice you rarely train ResNet from scratch on ImageNet. Transfer learning downloads pretrained weights, replaces the final layer for your class count, and fine-tunes on hundreds to thousands of domain images. That works because low-level filters (edges, textures) generalize across domains, so mostly the higher-level, task-specific features need relearning.

Object detection: from two-stage to one-stage

Detection must answer what and where. Early two-stage detectors (R-CNN family, Faster R-CNN) propose regions then classify each — accurate but slower. One-stage detectors (YOLO, SSD, RetinaNet) predict boxes and classes in a single forward pass, trading some accuracy for real-time throughput on GPU or NPU.

YOLO in plain terms

You Only Look Once divides the image into a grid. Each cell predicts bounding boxes, objectness scores, and class probabilities. Non-maximum suppression (NMS) removes duplicate boxes overlapping the same object. Modern YOLO versions (v8, v9, v10) iterate on anchor-free designs, improved loss functions, and training tricks. For robotics or live video, YOLO-class models often beat two-stage pipelines on frames-per-second.
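
Non-maximum suppression is easy to sketch. The version below is a minimal greedy NMS in plain Python, assuming boxes in `(x1, y1, x2, y2)` corner format and an overlap threshold of 0.5; production code would use a vectorized library routine and usually runs NMS per class.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop boxes that overlap it too much, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep

# Two near-duplicate boxes on one object plus one box on a separate object.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # indices of the boxes that survive
```

The threshold is a trade-off: set it too low and nearby distinct objects get merged, too high and duplicates survive.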

Evaluation metrics

IoU (Intersection over Union) — Overlap between predicted and ground-truth box; threshold (0.5 common) defines a “hit.”
mAP (mean Average Precision) — Aggregates precision-recall across classes and IoU thresholds; standard leaderboard metric on COCO.
Latency and FPS — Product metrics: a 95% mAP model is useless at 2 FPS on a drone.
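
For intuition on how average precision is built from hits and misses, here is a simplified sketch for a single class at a single IoU threshold. It assumes the detections are already sorted by descending confidence and already marked as hit or miss, and it integrates the precision-recall curve over all points. COCO's official evaluation differs in detail (it samples recall at fixed points and averages over several IoU thresholds), so treat this as an illustration rather than a reference implementation.

```python
def average_precision(hits, num_ground_truth):
    """hits: booleans for detections sorted by descending confidence."""
    if not hits or num_ground_truth == 0:
        return 0.0
    precisions, recalls = [], []
    true_positives = 0
    for rank, hit in enumerate(hits, start=1):
        true_positives += int(hit)
        precisions.append(true_positives / rank)
        recalls.append(true_positives / num_ground_truth)
    # Replace each precision with the best precision at any higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum rectangle areas under the precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

def mean_average_precision(per_class_ap):
    """mAP is the plain mean of the per-class AP values."""
    return sum(per_class_ap) / len(per_class_ap)

ap_cat = average_precision([True, False, True], num_ground_truth=2)
ap_dog = average_precision([True, True], num_ground_truth=2)
print(ap_cat, ap_dog, mean_average_precision([ap_cat, ap_dog]))
```

A false positive ranked above a true positive lowers AP, which is why confidence ordering matters as much as the raw hit count.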

Always report metrics on a held-out test set captured in conditions matching deployment — different lighting, camera angles, and lens dirt destroy lab accuracy.

Segmentation and dense prediction

Segmentation assigns a label to every pixel. Encoder-decoder architectures (U-Net, DeepLab) downsample for context, then upsample to recover resolution; U-Net in particular uses skip connections to preserve sharp boundaries, which is critical in medical scans where an error of a few pixels can change a diagnostic workflow. Instance segmentation adds a detection step: Mask R-CNN predicts boxes, classes, and a mask per instance.

Segmentation metrics include pixel accuracy (misleading on imbalanced classes) and mean IoU (mIoU) averaged across classes. For rare defects in manufacturing, track per-class recall — missing 2% of cracks may be unacceptable even if overall pixel accuracy is 99%.
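
The gap between pixel accuracy and mIoU shows up clearly on a toy example. The sketch below assumes a flattened 100-pixel mask with 2 crack pixels and a degenerate model that predicts background everywhere; the numbers are made up for illustration.

```python
def pixel_accuracy(pred, truth):
    """Fraction of pixels whose predicted class matches the label."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

def per_class_iou(pred, truth, cls):
    """IoU for one class over flattened masks; None if the class never appears."""
    inter = sum(p == cls and t == cls for p, t in zip(pred, truth))
    union = sum(p == cls or t == cls for p, t in zip(pred, truth))
    return inter / union if union else None

def mean_iou(pred, truth, num_classes):
    """Average IoU over classes that appear in prediction or ground truth."""
    ious = [per_class_iou(pred, truth, c) for c in range(num_classes)]
    ious = [v for v in ious if v is not None]
    return sum(ious) / len(ious)

def recall(pred, truth, cls):
    """Fraction of ground-truth pixels of a class that were found."""
    positives = sum(t == cls for t in truth)
    found = sum(p == cls and t == cls for p, t in zip(pred, truth))
    return found / positives if positives else None

truth = [0] * 98 + [1] * 2   # 0 = background, 1 = crack
pred = [0] * 100             # model that never predicts "crack"

print(pixel_accuracy(pred, truth))   # looks excellent
print(mean_iou(pred, truth, 2))      # exposes the failure
print(recall(pred, truth, 1))        # the number the factory cares about
```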

Vision transformers and multimodal models

Convolution is not the only path. Vision Transformers (ViT) split an image into fixed-size patches, linearly embed each patch, add positional encoding, and run standard transformer self-attention layers. ViTs need large datasets (or strong pretraining on JFT or LAION) but scale well and unify vision with language in multimodal models.
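
The patch-splitting step can be sketched without any framework. The code below assumes a single-channel image stored as a nested list and non-overlapping square patches; the linear embedding and positional encoding that follow in a real ViT are omitted. Function names are illustrative.

```python
def num_patches(height, width, patch):
    """Number of non-overlapping patch x patch tiles in an image."""
    return (height // patch) * (width // patch)

def patchify(image, patch):
    """Split a 2D grid into flattened patch x patch tiles, row by row."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by patch size")
    return [
        [image[y + i][x + j] for i in range(patch) for j in range(patch)]
        for y in range(0, height, patch)
        for x in range(0, width, patch)
    ]

# A 224x224 input with 16x16 patches gives the token count the transformer sees.
print(num_patches(224, 224, 16))

# Toy 4x4 image with pixel values 0..15, split into 2x2 patches.
toy = [[4 * row + col for col in range(4)] for row in range(4)]
tokens = patchify(toy, 2)
print(tokens)
```

Because self-attention cost grows with the square of the token count, halving the patch size roughly quadruples the number of tokens and raises attention cost far more than that.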

CLIP (Contrastive Language-Image Pretraining) trains image and text encoders on paired captions so “photo of a dog” embeddings align with dog images. That enables zero-shot classification and powers generative systems: Stable Diffusion uses a CLIP text encoder for conditioning. Multimodal LLMs (GPT-4V-class, open models with vision adapters) feed patch embeddings into the language model — the same token budget concerns from LLM context guides apply to long screenshots and video frames.
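
Zero-shot classification with aligned embeddings reduces to a nearest-neighbor lookup. The sketch below uses made-up 3-dimensional vectors in place of real encoder outputs (actual CLIP embeddings have hundreds of dimensions and come from the trained image and text encoders); only the comparison logic is the point.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_label(image_vec, prompt_vecs):
    """Pick the text prompt whose embedding is closest to the image embedding."""
    return max(prompt_vecs, key=lambda prompt: cosine(image_vec, prompt_vecs[prompt]))

# Hypothetical embeddings; a real system would get these from the two encoders.
prompt_vecs = {
    "photo of a dog": [0.9, 0.1, 0.0],
    "photo of a cat": [0.1, 0.9, 0.0],
    "photo of a car": [0.0, 0.1, 0.9],
}
image_vec = [0.8, 0.3, 0.1]

print(zero_shot_label(image_vec, prompt_vecs))
```

Adding a new class means adding a new prompt, with no retraining, which is what makes the approach "zero-shot".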

CNN vs ViT: practical choice

Small datasets, edge deployment — EfficientNet, MobileNet, or distilled CNNs often win.
Large pretraining budget, multimodal fusion — ViT backbones or CLIP features.
Real-time video detection — Specialized CNN/YOLO stacks still dominate latency-sensitive paths.

Data: the hidden 80% of CV projects

Public datasets (ImageNet, COCO, Open Images) bootstrap research, but production models live or die on domain-specific labels. Annotation is expensive: bounding boxes take seconds per object; pixel masks take minutes. Strategies that work:

Active learning — Model flags uncertain images for human labelers first.
Weak supervision — Image-level labels for detection via class activation maps; noisy but cheap.
Synthetic data — Rendered scenes in Unity/Unreal with perfect masks; domain gap remains when transferring to real cameras.
Augmentation — Random flips, crops, color jitter, mosaic (YOLO), cutout — improves generalization if physically plausible.
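
One detail that is easy to miss with augmentation for detection: geometric transforms must be applied to the labels as well as the pixels. The sketch below shows a horizontal flip, assuming boxes in `(x1, y1, x2, y2)` pixel-edge coordinates and a toy single-channel image.

```python
def hflip_image(image):
    """Mirror a 2D image left to right."""
    return [row[::-1] for row in image]

def hflip_box(box, image_width):
    """Mirror a (x1, y1, x2, y2) box so it still covers the same object."""
    x1, y1, x2, y2 = box
    return (image_width - x2, y1, image_width - x1, y2)

# 1x4 toy image: the "object" occupies the two left-most pixels.
image = [[1, 1, 0, 0]]
box = (0, 0, 2, 1)

flipped_image = hflip_image(image)
flipped_box = hflip_box(box, image_width=4)
print(flipped_image, flipped_box)
```

Whether a flip is physically plausible depends on the domain: it is harmless for most product photos but changes meaning for text or for left/right-specific anatomy.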

Document label schema versioning. Renaming a class mid-project without retraining corrupts metrics. Store raw images immutably; keep annotations in JSON or COCO format with provenance (who labeled, when, which model version trained on them).
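
One way to carry provenance alongside each label is a small typed record serialized to JSON. The field names below (`schema_version`, `labeled_by`, `trained_model_versions`, and so on) and all values are hypothetical; only the `[x, y, width, height]` box layout follows the COCO convention.

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class Annotation:
    image_id: str
    category: str
    bbox_xywh: tuple              # COCO-style [x, y, width, height] in pixels
    schema_version: str           # bump when class names or definitions change
    labeled_by: str
    labeled_at: str               # ISO 8601 date
    trained_model_versions: tuple = ()  # models that consumed this label

record = Annotation(
    image_id="img_000123",
    category="crack",
    bbox_xywh=(34, 50, 120, 80),
    schema_version="2",
    labeled_by="annotator_07",
    labeled_at="2024-01-15",
    trained_model_versions=("detector-v3",),
)

serialized = json.dumps(asdict(record), sort_keys=True)
print(serialized)
```

Keeping the schema version on every record lets you filter or remap old labels when a class is renamed, instead of silently mixing definitions.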

Training loop and hardware

A minimal training pipeline: dataloader with augmentation → forward pass → loss → backward pass → optimizer step (AdamW common) → validation every N epochs → early stopping on val mAP or loss. Mixed precision (FP16/BF16) on NVIDIA GPUs reduces activation memory and speeds up matrix math on tensor cores, which often lets you fit a noticeably larger batch; the exact gain depends on the model. Learning rate warmup plus cosine decay is a safe default.
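
The warmup-plus-cosine schedule is a short formula. The sketch below assumes linear warmup over a fixed number of steps followed by a single cosine decay to `min_lr`; the default values are placeholders, not recommendations.

```python
import math

def lr_at(step, total_steps, base_lr=1e-3, warmup_steps=100, min_lr=0.0):
    """Learning rate for a given step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Ramp up linearly so early, noisy gradients take small steps.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

for step in (0, 99, 100, 550, 1000):
    print(step, lr_at(step, total_steps=1000))
```

In a framework you would normally use its built-in scheduler classes rather than computing the rate by hand; the function here only shows the shape of the curve.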

CV is compute-hungry. ImageNet-scale pretraining needs multi-GPU nodes; fine-tuning a detector on 10k images may take hours on a single RTX-class card. Cloud spot instances cut cost; checkpoint frequently. For inference cost trade-offs (quantization, batching), see our model quantization guide — INT8 TensorRT engines apply to CNNs as well as LLMs.

Production deployment and failure modes

Research notebooks rarely ship. Production adds:

Preprocessing parity — Training resize/normalize must match serving code byte-for-byte.
Model formats — ONNX, TensorRT, Core ML, TFLite for edge; TorchScript for PyTorch services.
Monitoring — Track input drift (brightness histogram shift), latency p99, and human override rate.
Fallbacks — When confidence is below a threshold, route to human review instead of auto-acting.
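
Two of the items above, drift monitoring and confidence fallbacks, can be sketched in a few lines. The example assumes 8-bit brightness values, a coarse 4-bin histogram, and placeholder thresholds (0.8 for confidence); real systems would tune these on historical data. Function names are illustrative.

```python
def brightness_histogram(pixels, bins=4):
    """Normalized histogram of 0-255 brightness values."""
    counts = [0] * bins
    for p in pixels:
        counts[min(p * bins // 256, bins - 1)] += 1
    return [c / len(pixels) for c in counts]

def histogram_shift(reference, live):
    """Total variation distance: 0 means identical, 1 means no overlap."""
    return 0.5 * sum(abs(r - l) for r, l in zip(reference, live))

def route(label, confidence, threshold=0.8):
    """Act automatically only when the model is confident enough."""
    if confidence >= threshold:
        return {"action": "auto", "label": label}
    return {"action": "human_review", "label": None}

reference = brightness_histogram([10, 70, 130, 200])  # training-time lighting
live = brightness_histogram([5, 20, 40, 70])          # darker production feed

print(histogram_shift(reference, live))
print(route("defect", 0.95), route("defect", 0.55))
```

Alerting on the shift score gives an early warning that inputs have changed even before labeled feedback arrives.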

Common failures

Spurious correlations — Model learns “watermark = class A” because of dataset bias.
Adversarial patches — Stickers that fool stop-sign detectors; defenses include adversarial training and sensor fusion.
Demographic bias — Face recognition accuracy gaps across skin tones; mitigate with balanced training data and fairness audits.
Overconfidence — Softmax scores are not calibrated probabilities; fit temperature scaling on a held-out calibration set, separate from the training data, before relying on confidence thresholds.
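
Temperature scaling, mentioned in the last item, divides the logits by a single scalar T chosen to minimize negative log-likelihood on a calibration set. The sketch below uses a simple grid search and a made-up four-example calibration set in which the model is very confident but only right three times out of four; real implementations usually optimize T with gradient descent.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels."""
    losses = [-math.log(softmax(row, temperature)[y])
              for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)

def fit_temperature(logit_rows, labels, candidates):
    """Pick the candidate temperature with the lowest calibration NLL."""
    return min(candidates, key=lambda t: nll(logit_rows, labels, t))

# Hypothetical calibration set: same confident logits, one label disagrees.
calib_logits = [[8.0, 0.0]] * 4
calib_labels = [0, 0, 0, 1]
candidates = [0.5 + 0.25 * i for i in range(39)]  # 0.5, 0.75, ..., 10.0

temperature = fit_temperature(calib_logits, calib_labels, candidates)
print(temperature)
print(softmax([8.0, 0.0])[0], softmax([8.0, 0.0], temperature)[0])
```

Dividing by a positive T never changes which class has the highest score, so accuracy is unaffected; only the reported confidence moves.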

Regulatory contexts (medical devices, automotive) require traceability from training data to deployed weights — treat CV models like audited software, not magic black boxes.

How CV connects to generative AI

Generative vision reversed the pipeline: instead of image → label, models learn label (or noise) → image. Diffusion models denoise latents into photorealistic output; GANs (still used in some real-time face filters) pit a generator against a discriminator. CV fundamentals still matter: understanding VAE latents, perceptual loss, and CLIP guidance helps explain why generated images often get hands and in-image text wrong.

Video generation stacks add temporal consistency — optical flow, attention across frames, and enormous compute. For most product teams, calling a hosted API beats training video models in-house; the differentiator is integration, guardrails, and domain-specific fine-tuning on your asset library.

Decision checklist

Define the task: classification, detection, or segmentation — do not jump to detection if one label per image suffices.
Estimate label budget; start with transfer learning on a pretrained backbone.
Split train/val/test by capture session, not random frames, to detect overfitting to background.
Pick metrics aligned with business risk (recall on defects vs overall accuracy).
Benchmark latency on target hardware, not just cloud GPU mAP.
Plan monitoring for drift and a human-in-the-loop path for low-confidence predictions.
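
The session-based split from the checklist can be made deterministic by hashing the session identifier, so every frame from one capture session lands in the same split. This is a sketch assuming string session IDs and a 80/10/10 split; the file names and session names are made up.

```python
import hashlib

def split_for(session_id, val_pct=10, test_pct=10):
    """Map a capture session to train/val/test using a stable hash."""
    digest = hashlib.sha256(session_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"

# Hypothetical (session, frame) pairs.
frames = [
    ("session_a", "a_001.jpg"), ("session_a", "a_002.jpg"),
    ("session_b", "b_001.jpg"), ("session_c", "c_001.jpg"),
]
assignments = {name: split_for(session) for session, name in frames}
print(assignments)
```

Because the assignment depends only on the session ID, adding new sessions later never moves existing frames between splits.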