Vision Transformers (ViT) explained

Your warehouse team photographs every inbound pallet for damage detection. A ResNet-50 fine-tuned on 4,000 labeled images hits 91% accuracy, but misses hairline cracks that span half the frame, because the model's early conv layers see only small local windows and never the full context at once. Vision Transformers (ViT, Dosovitskiy et al., 2020) treat an image as a sequence of patch tokens and run a standard transformer encoder over them. Every patch attends to every other patch from layer one, giving a global receptive field without deep stacking. ViT underperforms CNNs when trained from scratch on small datasets, but matches or beats them when pretrained on large image corpora and fine-tuned downstream. This guide covers ViT architecture, CNN inductive-bias trade-offs, pretraining and fine-tuning recipes, Swin and hybrid variants, a Harbor Supply packaging-defect worked example, an architecture decision table, common pitfalls, and a practitioner checklist. It sits alongside our computer vision overview and object detection guide.

What a Vision Transformer is

A ViT is a transformer encoder applied to images. Instead of word tokens, the input is a grid of patch embeddings: the image is split into fixed-size squares (commonly 16×16 pixels), each flattened and linearly projected to a d-dimensional vector. A learnable class token ([CLS]) is prepended — analogous to BERT — and positional encodings are added so the model knows where each patch came from. Stacked multi-head self-attention and feed-forward blocks (identical in spirit to NLP transformers) process the sequence; the final [CLS] representation feeds a classification head.

ViT is not a drop-in replacement for every vision task. It excels at image-level classification and as a backbone for detection and segmentation when adapted with task-specific heads (e.g. ViTDet for detection, Mask2Former with a Swin backbone for segmentation). It is less mature for tiny embedded devices without quantization, and naive ViT ignores local structure that convolutions exploit for free, which matters when data is scarce.

Patch size sets the token budget

A 224×224 image with 16×16 patches yields (224/16)² = 196 patch tokens plus one CLS token, for 197 tokens per image. Halving patch size to 8×8 quadruples the patch count to 784. Because self-attention cost grows with the square of the token count, that is roughly a 16-fold increase in attention compute. Production teams usually keep patches at 14 or 16 pixels for classification and reserve finer patches for high-resolution fine-tuning or hybrid models.
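
The token arithmetic above can be checked with a short sketch. It assumes square images, a single [CLS] token, and attention cost proportional to the square of the sequence length; the function names are illustrative, not from any library.

```python
def vit_token_count(image_size: int, patch_size: int, extra_tokens: int = 1) -> int:
    """Patch tokens for a square image plus extra tokens such as [CLS]."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side + extra_tokens


def relative_attention_cost(tokens_a: int, tokens_b: int) -> float:
    """Ratio of pairwise attention scores (N^2) between two sequence lengths."""
    return (tokens_b ** 2) / (tokens_a ** 2)


coarse = vit_token_count(224, 16)  # 196 patches + 1 CLS
fine = vit_token_count(224, 8)     # 784 patches + 1 CLS
ratio = relative_attention_cost(coarse, fine)
print(coarse, fine, round(ratio, 1))
```

The ratio lands just under 16 rather than exactly 16 because the single CLS token is counted in both sequence lengths.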

CNN inductive biases vs transformer flexibility

Convolutional networks bake in two useful assumptions: locality (nearby pixels matter more) and translation equivariance (a cat in the top-left is the same feature as a cat in the bottom-right). These biases act as regularizers — a ResNet can learn useful features from thousands of labeled images.

ViT has almost no built-in spatial prior. Attention is global from the first layer; the model must learn locality from data. The original ViT paper showed that on ImageNet-1k alone (1.3M images), ViT-Large underperformed ResNets of similar compute. Pretrained on JFT-300M (roughly 300 million images), the same ViT surpassed them. The lesson: ViT is data-hungry at pretraining time but transfers powerfully once you have a strong checkpoint.

Modern practice rarely chooses pure ViT vs pure CNN in isolation. Designs that borrow from both camps, like Swin Transformer (shifted-window attention for local + hierarchical features) and ConvNeXt (a pure ConvNet modernized with transformer-era design choices and training recipes), are often among the strongest options once accuracy is weighed against compute.

Architecture walkthrough

1. Patch embedding

Patches can be created with a strided convolution (kernel = patch size, stride = patch size). This is mathematically equivalent to flattening each patch and applying one shared linear projection, but runs efficiently on GPU. Input shape [B, 3, H, W] becomes [B, N, d] where N = (H/p)(W/p).
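
To make the shape change concrete, here is a minimal pure-Python sketch of the flatten-and-project view of patch embedding, for a single image without the batch dimension. The toy image, the weights, and the function names are all hypothetical; a real model would use a learned strided convolution on tensors.

```python
def patchify(image, patch):
    """Split a [C][H][W] nested-list image into flattened patches.

    Returns N = (H/patch) * (W/patch) rows, each of length C * patch * patch,
    in row-major patch order (left to right, then top to bottom).
    """
    channels, height, width = len(image), len(image[0]), len(image[0][0])
    if height % patch or width % patch:
        raise ValueError("H and W must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = [
                image[c][top + dy][left + dx]
                for c in range(channels)
                for dy in range(patch)
                for dx in range(patch)
            ]
            patches.append(flat)
    return patches


def project(patches, weight, bias):
    """Linear projection: each flattened patch (length P) becomes a d-dim token.

    weight has d rows of length P, i.e. one flattened conv kernel per output channel.
    """
    return [
        [sum(w * x for w, x in zip(row, p)) + b for row, b in zip(weight, bias)]
        for p in patches
    ]


# Toy input: 1 channel, 4x4 image, 2x2 patches, so N = 4 patches of length 4.
image = [[[float(r * 4 + c) for c in range(4)] for r in range(4)]]
patches = patchify(image, patch=2)
# Hypothetical d = 2 projection: output 0 sums the patch, output 1 copies its top-left pixel.
weight = [[1.0, 1.0, 1.0, 1.0], [1.0, 0.0, 0.0, 0.0]]
tokens = project(patches, weight, bias=[0.0, 0.0])
print(len(tokens), len(tokens[0]))  # N tokens, each of dimension d
```

Each row of the weight matrix plays the role of one convolution filter, which is why a convolution with kernel and stride equal to the patch size computes the same tokens.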

2. Positional encoding

ViT uses learnable 1D positional embeddings (one vector per patch index), not the sinusoidal scheme from the original "Attention Is All You Need" paper. At fine-tune time, if input resolution changes, the patch positional embeddings are reshaped to their 2D grid and interpolated to the new grid size. This is a standard trick when moving from 224px pretraining to 384px or 512px downstream.
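
The interpolation step can be sketched as a resize of the patch-position grid. The version below is a simplified illustration: it uses bilinear interpolation with corner-aligned sampling, which is an assumption made for brevity (libraries commonly use bicubic interpolation), and it covers only the patch grid, since the [CLS] position embedding is normally kept as-is. The function name is hypothetical.

```python
def resize_pos_grid(grid, new_h, new_w):
    """Bilinearly resize a [gh][gw][d] grid of patch position embeddings."""
    old_h, old_w, dim = len(grid), len(grid[0]), len(grid[0][0])

    def source_coord(i, new_n, old_n):
        # Map output index i onto the old axis so that the corners line up.
        return 0.0 if new_n == 1 else i * (old_n - 1) / (new_n - 1)

    out = []
    for i in range(new_h):
        y = source_coord(i, new_h, old_h)
        y0 = int(y)
        y1 = min(y0 + 1, old_h - 1)
        fy = y - y0
        row = []
        for j in range(new_w):
            x = source_coord(j, new_w, old_w)
            x0 = int(x)
            x1 = min(x0 + 1, old_w - 1)
            fx = x - x0
            row.append([
                (1 - fy) * ((1 - fx) * grid[y0][x0][k] + fx * grid[y0][x1][k])
                + fy * ((1 - fx) * grid[y1][x0][k] + fx * grid[y1][x1][k])
                for k in range(dim)
            ])
        out.append(row)
    return out


# 224px / 16 gives a 14x14 grid at pretraining; 384px / 16 gives 24x24 downstream.
# Toy embeddings: each position stores its own (row, col) so the result is easy to inspect.
old = [[[float(r), float(c)] for c in range(14)] for r in range(14)]
new = resize_pos_grid(old, 24, 24)
print(len(new), len(new[0]), new[0][0], new[-1][-1])
```

In the toy grid the corner embeddings are preserved and interior positions fall between their old neighbors, which is the behavior that lets a 224px checkpoint start fine-tuning at a higher resolution with sensible position information.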

3. Transformer encoder blocks

Each block: LayerNorm → multi-head self-attention → residual → LayerNorm → MLP (expand 4×, GELU) → residual. Pre-LayerNorm (norm before sublayers) is standard in recent checkpoints and trains more stably than post-norm. Attention complexity is O(N²) in patch count — the main scaling bottleneck for high-resolution images.
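
The block order above can be written out as a small pure-Python sketch. It is deliberately stripped down and rests on several assumptions: single-head attention with identity Q/K/V projections, LayerNorm without learned scale and shift, and a GELU-only stand-in for the MLP instead of the learned 4× expansion. It shows the pre-norm wiring and the N × N score matrix, not a trainable layer.

```python
import math


def layer_norm(token, eps=1e-6):
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(token) / len(token)
    var = sum((v - mean) ** 2 for v in token) / len(token)
    return [(v - mean) / math.sqrt(var + eps) for v in token]


def self_attention(tokens):
    """Single-head attention with identity Q/K/V projections (a simplification).

    Every token scores every other token, which is the source of the O(N^2) cost.
    """
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([
            sum(w / total * v[i] for w, v in zip(weights, tokens)) for i in range(dim)
        ])
    return out


def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def mlp(token):
    """Stand-in for the MLP sublayer: GELU only, no learned weights."""
    return [gelu(v) for v in token]


def pre_norm_block(tokens):
    # Sublayer 1: LayerNorm, then attention, then residual add.
    attended = self_attention([layer_norm(t) for t in tokens])
    tokens = [[x + a for x, a in zip(t, att)] for t, att in zip(tokens, attended)]
    # Sublayer 2: LayerNorm, then MLP, then residual add.
    return [[x + m for x, m in zip(t, mlp(layer_norm(t)))] for t in tokens]


sequence = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0], [0.5, 0.5, 1.5, 1.5]]
output = pre_norm_block(sequence)
print(len(output), len(output[0]))  # sequence length and width are unchanged
```

Because normalization happens inside each residual branch, the skip path carries the unnormalized input straight through, which is the property usually credited for the more stable training of pre-norm stacks.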

4. Classification head

The [CLS] output after the final block passes through a linear layer to class logits. For fine-tuning, practitioners often replace the head, use a small learning rate on the backbone, and a higher rate on the head — or apply linear probing first (freeze backbone, train head only) to sanity-check data quality before full fine-tune.
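
One way to express the split between backbone and head learning rates is to group parameters by name. The sketch below is framework-agnostic and uses hypothetical parameter names and a hypothetical "classifier." prefix; the returned dicts only mirror the general shape of optimizer parameter groups and hold names rather than tensors. In a real framework, a linear probe would also need the backbone parameters marked as not trainable.

```python
def build_param_groups(param_names, backbone_lr, head_lr,
                       head_prefix="classifier.", linear_probe=False):
    """Split parameter names into backbone and head groups with separate learning rates.

    With linear_probe=True the backbone group is dropped, so only the head trains.
    """
    head = [n for n in param_names if n.startswith(head_prefix)]
    backbone = [n for n in param_names if not n.startswith(head_prefix)]
    groups = [{"name": "head", "params": head, "lr": head_lr}]
    if not linear_probe:
        groups.insert(0, {"name": "backbone", "params": backbone, "lr": backbone_lr})
    return groups


# Hypothetical parameter names, for illustration only.
names = [
    "embeddings.patch.weight",
    "encoder.layer.0.attention.weight",
    "encoder.layer.11.mlp.weight",
    "classifier.weight",
    "classifier.bias",
]
probe = build_param_groups(names, backbone_lr=3e-5, head_lr=1e-3, linear_probe=True)
full = build_param_groups(names, backbone_lr=3e-5, head_lr=1e-3)
print([g["name"] for g in probe])
print([(g["name"], g["lr"], len(g["params"])) for g in full])
```

Running the probe configuration first and the full configuration second matches the order suggested above: a cheap check that the labels are learnable, then the full fine-tune.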

Pretraining, fine-tuning, and scale

Open checkpoints from Google (ViT-B/16, ViT-L/16), Meta (DeiT, data-efficient training with distillation), and Microsoft (Swin) are available on Hugging Face, and many are also packaged in torchvision or timm. Typical fine-tuning recipe:

Resolution: start at 224px; increase to 384px for +1–2% accuracy if latency allows.
Augmentation: RandAugment, Mixup, CutMix — ViT benefits more from aggressive augmentation than CNNs on small sets.
Optimizer: AdamW with cosine decay, weight decay 0.05, warmup 5–10 epochs.
Regularization: stochastic depth (drop path), label smoothing, early stopping on validation loss.
Layer-wise LR decay: lower learning rates for early blocks, higher for later blocks and head — reduces catastrophic forgetting of pretrained features.
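
Two items from the recipe, layer-wise LR decay and warmup followed by cosine decay, reduce to short formulas. The sketch below uses assumed values: 12 encoder blocks (as in ViT-B), a decay factor of 0.75, and 40 epochs with 5 warmup epochs. The function names and the exact depth indexing are illustrative choices, not a specific library's convention.

```python
import math


def layerwise_lrs(base_lr, num_blocks, decay):
    """Learning rate per depth.

    Index 0 is the patch embedding, 1..num_blocks are encoder blocks, and
    num_blocks + 1 is the head. The head gets base_lr; each step toward the
    input multiplies the rate by `decay`.
    """
    top = num_blocks + 1
    return [base_lr * decay ** (top - depth) for depth in range(top + 1)]


def warmup_cosine(epoch, total_epochs, warmup_epochs):
    """Multiplier in [0, 1]: linear warmup, then cosine decay toward zero."""
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


rates = layerwise_lrs(base_lr=1e-3, num_blocks=12, decay=0.75)
schedule = [warmup_cosine(e, total_epochs=40, warmup_epochs=5) for e in range(40)]
print(f"embed lr={rates[0]:.2e}, last block lr={rates[12]:.2e}, head lr={rates[13]:.2e}")
print(f"epoch 0 x{schedule[0]:.2f}, epoch 4 x{schedule[4]:.2f}, epoch 39 x{schedule[39]:.3f}")
```

The effective rate for a given layer at a given epoch is the product of its entry in rates and the schedule multiplier for that epoch.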

DeiT showed that distilling a ViT from a strong CNN teacher lets smaller models reach competitive accuracy without JFT-scale pretraining — important when you cannot train from scratch but need a compact deployable model.
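
As a rough illustration of the distillation idea, the sketch below follows the hard-label variant described in the DeiT paper: the student's class output is trained against the true label, a separate distillation output is trained against the teacher's predicted class, and the two cross-entropies are averaged. The logits are made-up numbers for a single example, and the function names are hypothetical.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one example against an integer class label."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, label):
    """Average of two cross-entropies: class output vs the true label, and
    distillation output vs the teacher's argmax prediction."""
    teacher_label = max(range(len(teacher_logits)), key=lambda i: teacher_logits[i])
    return (0.5 * cross_entropy(cls_logits, label)
            + 0.5 * cross_entropy(dist_logits, teacher_label))


# Made-up logits for one 4-class example.
loss = hard_distillation_loss(
    cls_logits=[2.0, 0.1, 0.1, 0.1],
    dist_logits=[1.5, 0.2, 0.3, 0.1],
    teacher_logits=[3.0, 0.5, 0.2, 0.1],
    label=0,
)
print(round(loss, 3))
```

Only the teacher's top class enters the loss here, so the teacher can be any frozen classifier that produces logits over the same label set.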

Worked example: Harbor Supply packaging defect classifier

Harbor Supply receives 12,000 daily SKU photos from fulfillment cameras. Labels: ok, crushed_box, label_misaligned, seal_broken (4-class). Baseline: ResNet-50 fine-tuned 30 epochs, 224px, 88.4% top-1 val accuracy; confused crushed_box with shadow artifacts on dark conveyor belts.

ViT-B/16 (ImageNet-21k pretrained, Hugging Face google/vit-base-patch16-224-in21k):

Froze backbone 5 epochs (linear probe): 84.1% — confirms labels are learnable.
Full fine-tune 40 epochs, LR 3e-5 backbone / 1e-3 head, RandAugment + Mixup 0.2, batch 32 on one A10G: 93.6% val top-1.
Resolution bump to 384px (+12 epochs): 94.8%; inference 14ms → 38ms per image — acceptable for post-pack QA station, not inline 60fps belt.
Errors shifted from shadow confusion to rare seal_broken on reflective tape — addressed with 200 targeted hard-negative samples, not architecture change.

Deployment: ONNX export, TensorRT FP16 on edge GPU. Fallback ResNet-50 kept for sub-10ms preview tier on low-end cameras. Key takeaway: ViT's global attention helped on defects spanning large box faces; the win came from pretrained weights + resolution, not from abandoning CNNs everywhere.
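
The latency trade-off in this example can be sanity-checked with simple arithmetic. The sketch below assumes one image per inference call with no batching, and ignores pre- and post-processing time; the helper names are illustrative.

```python
def frame_budget_ms(fps):
    """Time available per frame at a given frame rate."""
    return 1000.0 / fps


def fits_inline(latency_ms, fps):
    """True if one inference finishes within a single frame interval."""
    return latency_ms <= frame_budget_ms(fps)


def daily_gpu_seconds(images_per_day, latency_ms):
    """Total sequential inference time for one day's images."""
    return images_per_day * latency_ms / 1000.0


for label, latency in [("224px", 14.0), ("384px", 38.0)]:
    print(label, fits_inline(latency, fps=60), daily_gpu_seconds(12_000, latency))
```

At 60 fps the per-frame budget is about 16.7 ms, so the 14 ms model would fit inline and the 38 ms model would not. For offline QA, 12,000 images at 38 ms each come to well under ten minutes of sequential inference per day.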

Architecture decision table

Approach	Best when	Watch out for
CNN (ResNet, EfficientNet)	<5k labeled images, edge latency <10ms, mature tooling	Global context needs very deep stacks or FPN
ViT (pretrained)	5k–500k labels, image classification, have GPU fine-tune budget	Attention quadratic in patches; needs aug + LR tuning
Swin / hierarchical ViT	Detection, segmentation, high-res inputs, need CNN-like efficiency	More complex implementation than vanilla ViT
CLIP zero-shot / linear probe	Novel classes, few labels, text descriptions of categories available	Domain gap if photos differ wildly from web pretraining
Train ViT from scratch	Millions+ in-domain images, research setting	Below roughly 100M images, almost always worse than fine-tuning pretrained weights

Common pitfalls

Training ViT from scratch on 2k images. Use a pretrained checkpoint or a CNN; ViT needs scale or distillation (DeiT) to compete.
Ignoring resolution mismatch. Fine-tuning at a new resolution requires positional embedding interpolation — frameworks do this automatically only if you configure image size correctly.
Carrying over BatchNorm batch-size worries. ViT uses LayerNorm, not BatchNorm, so small batches do not degrade normalization statistics. The learning rate may still need rescaling (for example with the linear scaling rule) when batch size changes.
Attention maps as explanations. ViT attention weights are not reliable saliency maps; use Grad-CAM or dedicated explainability tools from our XAI guide.
Deploying FP32 giant models. ViT-B has ~86M parameters, about 344 MB of weights in FP32; quantize (INT8) or distill into a smaller architecture such as MobileViT for mobile. Profile latency before committing.
Skipping baseline. A tuned EfficientNet-B0 often beats an improperly fine-tuned ViT — always benchmark both on your data.
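
Two of the pitfalls above come down to back-of-envelope numbers: rescaling the learning rate when batch size changes, and estimating weight storage at different precisions. The sketch below assumes the linear scaling rule (a heuristic, not a guarantee) and counts weights only, not activations or runtime overhead. The starting values (LR 3e-5 at batch 32, ~86M parameters) are taken from this guide; the target batch size of 128 is an arbitrary example.

```python
def scaled_lr(base_lr, base_batch, new_batch):
    """Linear scaling rule: learning rate grows in proportion to batch size."""
    return base_lr * new_batch / base_batch


def weight_memory_mb(num_params, bytes_per_param):
    """Approximate weight storage in MB (10^6 bytes), weights only."""
    return num_params * bytes_per_param / 1e6


VIT_B_PARAMS = 86_000_000  # approximate parameter count for ViT-B

print(f"lr at batch 128: {scaled_lr(3e-5, 32, 128):.1e}")
for name, nbytes in [("FP32", 4), ("FP16", 2), ("INT8", 1)]:
    print(name, weight_memory_mb(VIT_B_PARAMS, nbytes), "MB")
```

Treat the scaled rate as a starting point for a short sweep rather than a final setting, and confirm any memory estimate by profiling on the target device.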

Practitioner checklist

Start with a strong pretrained checkpoint (ImageNet-21k or CLIP), not random init.
Run linear probe before full fine-tune to validate labels and class balance.
Use RandAugment + Mixup/CutMix; ViT responds well to heavy augmentation.
Apply layer-wise LR decay and AdamW with cosine schedule.
Benchmark against ResNet/EfficientNet at same resolution and compute budget.
Increase input resolution if defects are small relative to frame size.
For detection/segmentation, prefer Swin or ViT-detector heads over raw CLS-only ViT.
Export with ONNX/TensorRT; measure P50/P95 latency on target hardware.
Monitor out-of-distribution inputs (lighting, camera angle) — transformers overfit texture shortcuts too.
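
For the latency item in the checklist, a minimal timing harness might look like the sketch below. It uses only the standard library and a placeholder workload in place of a real model call. On a GPU, the device would also need to be synchronized before each clock read, which this sketch does not attempt. The function names, warmup count, and run count are assumptions.

```python
import statistics
import time


def measure_latency_ms(fn, warmup=5, runs=50):
    """Time repeated calls to fn and return P50 and P95 latency in milliseconds."""
    for _ in range(warmup):
        fn()  # warm caches before timing
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    cuts = statistics.quantiles(samples, n=100)  # 99 percentile cut points
    return {"p50": cuts[49], "p95": cuts[94]}


def fake_inference():
    # Placeholder workload; swap in the real model call on the target hardware.
    sum(i * i for i in range(20_000))


stats = measure_latency_ms(fake_inference)
print(f"P50={stats['p50']:.2f} ms, P95={stats['p95']:.2f} ms")
```

Reporting P95 alongside P50 matters for station throughput, since occasional slow inferences are what cause queues to build up.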