
CLIP explained: contrastive language-image pretraining

Harbor Commerce's catalog had 180,000 SKUs with inconsistent photo metadata. Merchants tagged “blue jacket” while buyers searched “navy windbreaker rain shell.” Keyword search missed visually similar items and returned nothing for lifestyle photos that never received alt text. The refactor: embed every product image with a frozen CLIP ViT-B/32 encoder, store 512-d vectors in the existing vector index, and route text queries through the paired text encoder. A query for “waterproof hiking layer” now surfaces Gore-Tex shells even when titles omit those words. Click-through on image search rose 22%; manual re-tagging backlog dropped by half because new uploads inherit semantic retrieval without hand labels.

CLIP (Contrastive Language-Image Pretraining), introduced by Radford et al. at OpenAI in 2021, trains separate image and text encoders on 400 million noisy web image-caption pairs so that matching pairs land near each other in a shared embedding space. The model never sees ImageNet class IDs during training, yet at inference it can zero-shot classify by comparing an image embedding against text prompts like “a photo of a {label}.” That capability powers semantic image search, content moderation filters, Stable Diffusion text conditioning, and the vision backbones inside modern vision-language models. This guide explains the dual-encoder architecture, the symmetric InfoNCE loss, prompt engineering for zero-shot use, and the OpenCLIP and SigLIP successors. It then documents the Harbor Commerce visual search refactor, compares techniques in a decision table, lists common pitfalls, and provides a production checklist. It complements our contrastive learning guide and ViT guide.

Dual encoders and shared embedding space

CLIP is not a single fused transformer that ingests pixels and tokens together. It uses two independent encoders whose outputs are L2-normalized to unit vectors in a common dimension (512 for ViT-B/32, 768 for ViT-L/14):

  • Image encoder — ResNet-50/101 or a Vision Transformer (ViT) that splits a 224×224 image into fixed-size patches (32×32 for ViT-B/32, 14×14 for ViT-L/14), runs self-attention, and projects the CLS token to the embedding dimension.
  • Text encoder — A causal Transformer (GPT-style) with, in the base configuration, 12 layers, 8 heads, 512 width, and a 49,152 BPE vocabulary. It reads up to 77 tokens and projects the end-of-text position to the same embedding dimension.

Similarity between an image I and caption T is the dot product sim(I, T) = fI(I) · fT(T) after normalization — equivalent to cosine similarity. Training pulls true pairs together and pushes the other N − 1 captions in a mini-batch away from each image (and symmetrically for each caption). No classification head is required; the geometry of the space is the task representation.
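
To make the geometry concrete, here is a minimal sketch of the normalize-then-dot-product step. The 3-d vectors are toy stand-ins for real 512-d encoder outputs, and the helper names `l2_normalize` and `similarity` are illustrative, not part of any CLIP library.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def similarity(image_emb, text_emb):
    """Dot product of two unit vectors, which equals their cosine similarity."""
    return sum(a * b for a, b in zip(image_emb, text_emb))


# Toy vectors (assumption): real embeddings come from the two encoders.
image_emb = l2_normalize([3.0, 4.0, 0.0])
caption_match = l2_normalize([6.0, 8.0, 0.0])  # same direction, different length
caption_other = l2_normalize([0.0, 0.0, 5.0])  # orthogonal direction

print(similarity(image_emb, caption_match))
print(similarity(image_emb, caption_other))
```

Because both vectors are unit length, the score depends only on direction: the rescaled caption scores as a perfect match and the orthogonal one scores zero.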

Why decoupled encoders scale

Dual encoders allow asymmetric inference: precompute image embeddings offline (Harbor Commerce indexes 180k vectors once) while encoding live text queries at search time. Fused vision-language models like LLaVA must run a full multimodal forward pass per query-image pair, which is accurate but expensive at catalog scale. CLIP trades fine-grained grounding for retrieval speed, which is why RAG pipelines often use CLIP or SigLIP for first-stage candidate retrieval before a heavier cross-encoder rerank, mirroring the text-side pattern in our text embedding guide.
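
The asymmetric pattern can be sketched as an index that is filled once offline and then scanned per query. This is an illustration under stated assumptions: `EmbeddingIndex` is a hypothetical brute-force stand-in for a real ANN index such as HNSW, the SKU ids are invented, and the 2-d unit vectors replace real encoder outputs.

```python
import heapq


class EmbeddingIndex:
    """Brute-force stand-in for an ANN index over unit-length image vectors."""

    def __init__(self):
        self._items = []  # list of (item_id, vector) pairs

    def add(self, item_id, vector):
        """Offline step: store a precomputed image embedding."""
        self._items.append((item_id, vector))

    def search(self, query_vector, k=3):
        """Online step: score every stored vector against one text query."""
        scored = (
            (sum(a * b for a, b in zip(query_vector, vec)), item_id)
            for item_id, vec in self._items
        )
        top = heapq.nlargest(k, scored)
        return [(item_id, score) for score, item_id in top]


# Offline: index the catalog once (toy unit vectors, hypothetical SKUs).
index = EmbeddingIndex()
index.add("sku-shell", [1.0, 0.0])
index.add("sku-boot", [0.0, 1.0])
index.add("sku-parka", [0.6, 0.8])

# Online: only the text query is encoded per request.
query_vector = [0.8, 0.6]  # stand-in for the text encoder output
results = index.search(query_vector, k=2)
print(results)
```

Only `search` runs at request time, so per-query cost is one text encode plus a vector lookup, regardless of how expensive the image encoder is.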

Contrastive training objective

Given a batch of N image-text pairs, CLIP forms an N × N similarity matrix. The diagonal entries are positives; all off-diagonal entries are in-batch negatives. The loss is symmetric InfoNCE (softmax cross-entropy) over rows and columns:

For image i, classify which of the N captions matches by treating sim(Ii, Tj) / τ as logits, where τ is a learned temperature scalar (initialized to 0.07 and, in practice, parameterized as a log-scale value that is exponentiated to give the logit multiplier 1/τ). The same operation runs with captions as queries and images as keys, and the two cross-entropy losses are averaged. This in-batch negative strategy is the same InfoNCE framework used by SimCLR, but CLIP's negatives are cross-modal rather than augmented views of the same image.
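
A small pure-Python sketch of the symmetric loss may help. It assumes the embeddings are already L2-normalized and uses a fixed temperature of 0.07 for simplicity, whereas CLIP learns this value; the function names are illustrative.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def symmetric_info_nce(image_embs, text_embs, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy.

    Pair i is (image_embs[i], text_embs[i]); every other pairing in the
    batch acts as a negative.
    """
    n = len(image_embs)
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_embs]
        for img in image_embs
    ]
    # Rows: each image picks its caption among the N captions.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Columns: each caption picks its image among the N images.
    loss_t2i = -sum(
        log_softmax([logits[i][j] for i in range(n)])[j] for j in range(n)
    ) / n
    return 0.5 * (loss_i2t + loss_t2i)


images = [[1.0, 0.0], [0.0, 1.0]]
aligned = symmetric_info_nce(images, [[1.0, 0.0], [0.0, 1.0]])
shuffled = symmetric_info_nce(images, [[0.0, 1.0], [1.0, 0.0]])
print(aligned, shuffled)
```

Correctly paired embeddings give a loss near zero, while swapping the captions makes the loss large, which is the gradient signal that pulls true pairs together.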

Data and compute at scale

Original CLIP trained on WIT (WebImageText): roughly 400 million pairs scraped from the public web with minimal filtering. Captions are noisy (“img_3847.jpg” sits next to useful descriptions), but volume compensates. Open-source reimplementations like OpenCLIP and LAION-trained checkpoints replicate the recipe on LAION-400M/5B. Training ViT-L/14 at 336px resolution required thousands of GPU-days; most product teams fine-tune smaller checkpoints or use frozen encoders as Harbor Commerce does.

Zero-shot classification and prompt engineering

Because CLIP never saw fixed class indices, you classify at inference by embedding candidate text prompts and picking the highest-similarity match. ImageNet evaluation uses templates such as “a photo of a {class name}” or an ensemble of 80 prompt variants averaged per class. Small wording changes shift accuracy by several points — prompt engineering is not optional.

Prompt patterns that work

  • Context prefix — “a photo of”, “a rendering of”, “a satellite photo of” disambiguate domain.
  • Ensembling — Average embeddings of multiple templates per class and re-normalize the mean before computing similarities and softmax; this reduces variance from any single phrasing.
  • Negative prompts — For moderation, compare against “safe product photo” vs “explicit content” rather than single-class scoring.
  • Hierarchy — Coarse pass (“footwear”) then fine pass (“running shoe”) when label space exceeds a few hundred classes.
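
The ensembling pattern above can be sketched as follows. The example assumes the template embeddings were already produced by the text encoder (toy 2-d vectors here), and the `logit_scale` of 100 is an assumed value standing in for the model's learned logit multiplier.

```python
import math


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def class_prototype(template_embs):
    """Average the unit-length template embeddings, then renormalize."""
    dim = len(template_embs[0])
    mean = [sum(e[d] for e in template_embs) / len(template_embs) for d in range(dim)]
    return l2_normalize(mean)


def zero_shot_classify(image_emb, templates_by_class, logit_scale=100.0):
    """Softmax over scaled cosine similarities to one prototype per class."""
    labels = list(templates_by_class)
    logits = [
        logit_scale
        * sum(a * b for a, b in zip(image_emb, class_prototype(templates_by_class[label])))
        for label in labels
    ]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}


# Toy embeddings (assumption): two prompt templates per class.
templates_by_class = {
    "running shoe": [l2_normalize([1.0, 0.2]), l2_normalize([1.0, -0.1])],
    "hiking boot": [l2_normalize([0.1, 1.0]), l2_normalize([-0.2, 1.0])],
}
image_emb = l2_normalize([0.9, 0.3])
probs = zero_shot_classify(image_emb, templates_by_class)
print(max(probs, key=probs.get))
```

Class prototypes can be computed once and cached, so adding templates raises only the one-time text encoding cost, not per-image inference cost.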

Zero-shot ImageNet top-1 reached ~76% with ViT-L/14 — competitive with supervised ResNet-50 despite no labeled ImageNet training. Fine-grained distinctions (breeds, medical imaging subtypes) still benefit from labeled fine-tuning or specialist models.

Harbor Commerce visual search refactor

The legacy pipeline stored merchant-supplied tags and ran PostgreSQL full-text search. Problems: tag drift across locales, no cross-language retrieval, lifestyle photos without tags invisible to search, and synonym gaps (“sneaker” vs “trainer”). The CLIP migration proceeded in four stages:

  1. Offline embedding — Nightly job resizes product hero images to 224px, runs ViT-B/32 image encoder, upserts vectors into pgvector with HNSW index (cosine distance).
  2. Query path — User text query encoded by text encoder; top-50 ANN candidates fetched; optional cross-encoder rerank on title+description for final ordering.
  3. Hybrid fallback — BM25 on SKU and brand fields unioned with vector hits; reciprocal rank fusion merges lists when queries contain exact model numbers.
  4. Moderation hook — Upload pipeline scores images against a small set of safety prompts; flagged items queue for human review before indexing.
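
The fusion step in stage 3 can be sketched with reciprocal rank fusion. This is a minimal illustration rather than Harbor Commerce's actual code: the SKU ids are invented, and the constant `k=60` is a commonly used default that should be treated as an assumption to tune.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked id lists by summing 1 / (k + rank) for each appearance.

    Items found by several retrievers accumulate score and rise to the top.
    Ties are broken by item id so the output order is deterministic.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] = scores.get(item_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda item_id: (-scores[item_id], item_id))


# Hypothetical result lists for one query.
vector_hits = ["sku-101", "sku-205", "sku-330"]  # from the CLIP vector index
bm25_hits = ["sku-205", "sku-777"]               # from BM25 on SKU and brand

fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)
```

Rank fusion uses only positions, so it avoids having to calibrate cosine similarities against BM25 scores, which live on different scales.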

Latency budget: 12 ms text encode (CPU ONNX), 4 ms HNSW search at 180k vectors, 45 ms rerank on top-20. The stages sum to 61 ms, and total p95 stays under 80 ms, which is acceptable for catalog browse. Fine-tuning CLIP on in-house click pairs is queued for a later iteration; frozen weights already beat the keyword baseline.

OpenCLIP, SigLIP and successors

OpenAI released pretrained weights and inference code, but not the WIT training data or the training pipeline; the community rebuilt training with open data and code:

  • OpenCLIP — Reproducible training on LAION; supports CoCa (contrastive + captioning) and larger ViT-G models. Default choice when you need transparent data provenance.
  • SigLIP — Replaces softmax InfoNCE with a sigmoid loss per pair, removing the batch-wide softmax normalization so training behaves well at both modest and very large batch sizes, and improving retrieval metrics on DataComp benchmarks.
  • EVA-CLIP / DFN — Better ViT initialization and filtered datasets push zero-shot ImageNet past 80%.
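
To contrast SigLIP's objective with softmax InfoNCE, here is a hedged sketch of a pairwise sigmoid loss. Each image-text pair is scored as an independent binary decision (match or no match). The `scale` and `bias` defaults are assumed starting values for illustration; in training both would be learned.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_sigmoid_loss(image_embs, text_embs, scale=10.0, bias=-10.0):
    """Sum of independent binary losses over all N x N pairs, divided by N.

    The label is +1 on the diagonal (true pairs) and -1 elsewhere, so no
    softmax normalization across the batch is needed.
    """
    n = len(image_embs)
    total = 0.0
    for i, img in enumerate(image_embs):
        for j, txt in enumerate(text_embs):
            label = 1.0 if i == j else -1.0
            logit = scale * sum(a * b for a, b in zip(img, txt)) + bias
            total += log_sigmoid(label * logit)
    return -total / n


images = [[1.0, 0.0], [0.0, 1.0]]
aligned = pairwise_sigmoid_loss(images, [[1.0, 0.0], [0.0, 1.0]])
shuffled = pairwise_sigmoid_loss(images, [[0.0, 1.0], [1.0, 0.0]])
print(aligned, shuffled)
```

Since every term depends on a single pair, the loss can be computed in chunks without first gathering the full similarity matrix.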

For generative pipelines, CLIP text encoders (or OpenCLIP variants) supply conditioning vectors to UNet cross-attention in latent diffusion. Newer models like SD3 use separate text encoders (T5 + CLIP), but CLIP remains the lightweight option for retrieval and routing layers in multimodal stacks.

Technique decision table

Goal | Recommended approach | Rationale or caveat
Large-scale image-text retrieval | Frozen CLIP or SigLIP dual encoder + ANN index | Cross-attention VLMs are too slow per pair at millions of items
Zero-shot classification (< 1k classes) | CLIP with prompt ensembling | Supervised fine-tuning wins if labeled data exists
Diffusion text conditioning | CLIP or OpenCLIP text encoder (possibly + T5) | Raw BPE tokens without pretrained alignment underperform
Pixel-level segmentation / VQA | LLaVA, Florence, or SAM + language head | CLIP embeddings are global; no spatial map without extensions (DenseCLIP, MaskCLIP)
Self-supervised vision pretrain (no text) | DINOv2, SimCLR, MoCo | CLIP requires paired captions; pure vision data is wasted
Fine-grained medical or satellite imagery | Domain fine-tuned ViT or specialist CLIP on in-house pairs | Web-scale CLIP under-represents niche modalities

Common pitfalls

  • Wrong image preprocessing — CLIP expects specific resize, center crop, and normalization constants; ImageNet defaults shift embeddings.
  • Single prompt per class — Zero-shot accuracy swings wildly; ensemble templates or learnable prompt vectors (CoOp) stabilize results.
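  • Ignoring text length limit — The context window is 77 tokens; depending on the tokenizer wrapper, longer inputs are truncated silently or rejected with an error, so chunk or summarize long descriptions before encoding.
  • Cosine vs dot product after normalization — Always L2-normalize before ANN index build; with unnormalized vectors, inner-product ranking no longer matches cosine ranking, so the index metric and query scoring can disagree.
  • Batch size too small for fine-tuning — Contrastive learning needs large batches for negatives; plain gradient accumulation does not add in-batch negatives, so gather embeddings across devices, use gradient caching, or keep a memory bank.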
  • Evaluating on training captions — Web-scraped alt text in eval leaks near-duplicates; hold out by image hash or source domain.
  • Assuming CLIP understands counting or spatial relations — “three red balls” vs “two red balls” often confuses global embeddings; use VLM reasoning for compositional queries.
  • Stale index after image CDN changes — Re-embed when hero URLs update; hashing filenames alone misses crop and background edits, so hash the image content instead.
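
For the stale-index pitfall, one option is to key re-embedding on a hash of the image bytes rather than the filename. The sketch below is an assumption about how such a check could look; `needs_reembedding` and the `indexed_hashes` mapping are hypothetical names, not part of the pipeline described above.

```python
import hashlib


def content_hash(image_bytes):
    """SHA-256 of the raw image bytes; changes whenever the file content changes."""
    return hashlib.sha256(image_bytes).hexdigest()


def needs_reembedding(sku, image_bytes, indexed_hashes):
    """True when the SKU is new or its image bytes differ from the indexed version."""
    return indexed_hashes.get(sku) != content_hash(image_bytes)


# Hypothetical state: hash recorded when the vector was last written.
indexed_hashes = {"sku-101": content_hash(b"original pixels")}

print(needs_reembedding("sku-101", b"original pixels", indexed_hashes))
print(needs_reembedding("sku-101", b"cropped pixels", indexed_hashes))
print(needs_reembedding("sku-999", b"new upload", indexed_hashes))
```

A byte-level hash also flags lossless re-encodes of identical pixels, which costs an unnecessary re-embed but never leaves a stale vector in place.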

Production checklist

  • Pick checkpoint size (ViT-B/32 for latency, ViT-L/14 for quality) and document OpenAI vs OpenCLIP provenance.
  • Match preprocessing script to training recipe (224 vs 336 resolution, interpolation mode).
  • L2-normalize all stored vectors; use cosine distance in ANN index (HNSW, ScaNN, or FAISS).
  • Design prompt templates for zero-shot or query expansion; A/B ensemble vs single prompt.
  • Hybridize with BM25 for exact SKU, brand, and model-number matches.
  • Monitor embedding drift when swapping checkpoint versions; plan re-index migration.
  • Cap text input at 77 tokens; summarize long catalog descriptions for encoding.
  • Add safety prompt scoring on user uploads before indexing UGC images.
  • Benchmark recall@k on held-out query-image pairs from click logs, not just offline caption retrieval.
  • Profile CPU ONNX vs GPU batch encode for nightly catalog refresh windows.
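
The recall@k item can be made concrete with a short evaluation helper. This sketch uses the hit-rate reading of recall@k (the share of queries with at least one relevant item in the top k); the query strings and item ids are invented, and in practice the relevance sets would come from held-out click logs.

```python
def recall_at_k(results_by_query, relevant_by_query, k=10):
    """Fraction of queries with at least one relevant item in the top k results."""
    if not relevant_by_query:
        raise ValueError("no evaluation queries provided")
    hits = 0
    for query, relevant in relevant_by_query.items():
        top_k = results_by_query.get(query, [])[:k]
        if any(item_id in relevant for item_id in top_k):
            hits += 1
    return hits / len(relevant_by_query)


# Hypothetical held-out data: ranked search results and clicked items.
results_by_query = {
    "waterproof hiking layer": ["sku-1", "sku-2", "sku-3"],
    "navy windbreaker": ["sku-4", "sku-5", "sku-6"],
}
relevant_by_query = {
    "waterproof hiking layer": {"sku-2"},
    "navy windbreaker": {"sku-9"},
}

print(recall_at_k(results_by_query, relevant_by_query, k=2))
```

Tracking this number before and after a checkpoint swap gives a direct check on whether a re-index actually improved retrieval for real queries.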

Key takeaways

  • CLIP trains separate image and text encoders with symmetric InfoNCE so matching pairs share direction in embedding space.
  • Zero-shot classification works by comparing image embeddings to text prompt embeddings — no fixed label head required.
  • Harbor Commerce replaced keyword tags with frozen CLIP vectors plus hybrid BM25, improving image search CTR 22%.
  • Dual encoders excel at retrieval scale; cross-attention VLMs excel at grounding and VQA — compose both in production.
  • OpenCLIP and SigLIP extend the recipe with open data, larger models, and a pairwise sigmoid loss.

Related reading