Guide

Vision-language models explained

A customer attaches a screenshot of a broken checkout button to a support ticket. Your text-only LLM reads the subject line — “payment failed” — and suggests clearing cookies. The image shows a 402 error banner and a disabled “Pay with card” control that the subject never mentioned. Vision-language models (VLMs) close that gap: they ingest pixels and text in one forward pass, answering questions about charts, UI states, product photos, and scanned documents. Architectures range from dual-encoder systems like CLIP (separate image and text towers aligned in embedding space) to unified transformers where image patches become tokens beside words (GPT-4V, Gemini, LLaVA). This guide explains how VLMs differ from pure vision transformers and text-only LLMs, the main architecture families, training stages, production costs for image tokens, multimodal RAG patterns, a Harbor Support screenshot-triage worked example, an architecture decision table, common pitfalls, and a deployment checklist alongside our OCR and LLM fine-tuning guides.

What vision-language models do

A VLM maps image + text inputs to text outputs (or embeddings). Core capabilities in production today:

  • Image captioning — generate a natural-language description of scene content.
  • Visual question answering (VQA) — answer free-form questions about an image (“What color is the error banner?”).
  • Document and chart understanding — read tables, plots, and forms without a separate OCR pipeline (though OCR specialists still win on degraded scans).
  • Visual grounding — point to regions that support an answer (bounding boxes or segmentation masks).
  • Multimodal agents — browse UI screenshots, interpret diagrams, or verify physical-world states in robotics workflows.

VLMs are not a replacement for every vision task. Object detection at 60 FPS on edge hardware, medical imaging with regulatory traceability, and sub-millimeter metrology still favor specialized CNN/YOLO stacks. VLMs win when the task is semantic — understanding what an image means in language.

Architecture families

Dual-encoder (CLIP-style)

CLIP trains separate image and text encoders on 400M image–text pairs with a contrastive loss: matching pairs are pulled together in embedding space, non-matches pushed apart. At inference you embed an image once and compare cosine similarity to text candidates — fast retrieval, zero-shot classification, and image search. CLIP does not generate long answers; it scores alignment. Production pattern: CLIP retrieves candidate images or labels, then a text LLM reasons over metadata.

Cross-attention fusion (LLaVA, Flamingo, BLIP-2)

A frozen or lightly tuned vision encoder (often a ViT) produces patch features. A projection layer maps them into the LLM’s hidden dimension. Image tokens are prepended to the text prompt; the language model attends to visual tokens via cross-attention or early fusion in the transformer stack. LLaVA-style models are cheap to adapt: freeze the vision tower, train only the connector and instruction-tune the LLM on multimodal chat data.

Native multimodal transformers

Frontier models (GPT-4o, Gemini 1.5, Claude 3) train image and text in a unified token stream from the start. Images are patchified like ViT; high-resolution inputs may use tiling or dynamic resolution to control token count. These models generalize best but are available only via API or very large open weights. Token economics dominate: a single 1024×1024 image can cost 500–2000+ equivalent text tokens depending on the provider’s patch schedule.

Training pipeline

Stage 1: Alignment pretraining

Learn a shared representation from image–text pairs (web alt text, captions, synthetic data). Contrastive objectives (CLIP), captioning losses (BLIP), or prefix-language-modeling teach the vision side to speak the LLM’s language.

Stage 2: Instruction tuning

Fine-tune on curated multimodal conversations: VQA, follow-up questions, refusal when the image is ambiguous. Quality beats quantity — 100k well-filtered examples often outperform millions of noisy web pairs. Pair with synthetic data for domain-specific UI or product catalogs, but audit for hallucinated visual details.

Stage 3: RLHF / preference optimization (optional)

Human raters prefer answers that cite visible evidence, refuse when the image is unreadable, and avoid inventing text not present in the frame. Multimodal RLHF is expensive; many teams stop at instruction tuning plus automated rubrics.

Image tokens and cost

Every VLM bill is really an image token budget problem:

  • Fixed resolution — resize to 224×224 or 336×336; predictable cost, may lose fine print.
  • Dynamic tiling — split high-res images into overlapping crops; better for documents, multiplies tokens.
  • Any-resolution (AnyRes) — choose tile count based on aspect ratio; LLaVA-NeXT pattern.

Rule of thumb: always downsample user uploads before the model sees them unless the task requires reading 8pt legal text. Cache vision embeddings when the same product image is queried thousands of times (e-commerce Q&A). Route simple icon classification to small models or CLIP; reserve frontier VLMs for ambiguous screenshots.

Multimodal RAG

Text RAG retrieves chunks; visual RAG retrieves images, pages, or video frames, then conditions a VLM on the top-k results.

  1. Index — embed images with CLIP or a VLM encoder; store captions and OCR text as metadata for hybrid search.
  2. Query — user question may be text-only (“show me last quarter’s revenue chart”) or include a new photo (“is this the same defect as ticket #4412?”).
  3. Retrieve — vector search on image embeddings + BM25 on extracted text.
  4. Generate — pass retrieved images inline in the VLM context with citation instructions; grade faithfulness like text RAG.

Failure mode: retrieving visually similar but semantically wrong images (all red error dialogs look alike). Add structured metadata (app version, SKU, timestamp) and re-rank with a cross-encoder.

Worked example: Harbor Support screenshot triage

Harbor Support receives 2,400 tickets/week; 38% include screenshots. The team deploys a VLM-assisted triage lane:

  1. Baseline — text-only classifier on subject + body; 62% correct priority (P1/P2/P3); misses visual-only signals (stack traces in images, wrong-language UI).
  2. Pipeline — resize uploads to max 1280px; run open LLaVA-1.6-7B on-prem for privacy; prompt: “List visible error messages, UI state, and product area. JSON only.”
  3. Merge — concatenate VLM JSON summary with ticket text; feed to existing text classifier.
  4. Result — 74% priority accuracy (+12 pp); P1 recall on payment failures up 19 pp because 402/403 banners are read from pixels.
  5. Cost — ~1.8s GPU latency per image; batch off-peak for non-urgent queue; urgent P1 keywords skip VLM and page on-call immediately.
  6. Guardrails — VLM output is features, not customer-facing text; human agents see the summary plus thumbnail; model refuses when image is blank or NSFW (separate filter).
  7. Iteration — fine-tune connector on 3k labeled Harbor screenshots; swap 7B for API GPT-4o-mini on complex multi-panel dashboards only (15% of volume).

The pattern — cheap open VLM for bulk triage, frontier API for edge cases — is how most support teams land multimodal without 10× inference bills.

Architecture decision table

NeedBest approachWhy
Image search / zero-shot labelsCLIP dual encoderFast embedding similarity; no generation needed.
Chat over user photos (moderate quality)LLaVA-class cross-attention + 7B–13B LLMSelf-hostable; instruction-tuned for VQA.
Complex documents, charts, multi-image reasoningFrontier unified VLM APIHigher capability; pay per image token.
High-volume OCR on clean scansDedicated OCR + text LLMCheaper and more accurate than VLM on degraded text.
Real-time video (30 fps)Specialized detection + track, VLM on keyframesFull VLM per frame is cost-prohibitive.
Product catalog Q&ACLIP retrieval + text RAG + VLM verifyCache embeddings; VLM only on top-3 images.
On-device mobileQuantized small VLM (MobileVLM, Phi-3-Vision)Latency and privacy; accept quality tradeoff.
Regulated medical imagingFrozen specialist CNN + human reviewVLM hallucination risk is unacceptable.

Common pitfalls

  • Hallucinating text in images — VLMs invent error codes not visible; require JSON fields with confidence or “not visible” option.
  • Ignoring EXIF orientation — sideways photos break reading order; normalize rotation before inference.
  • Full-resolution uploads — 4K screenshots explode token bills; cap longest edge (1280–2048px).
  • Using VLMs for pure OCR — on noisy fax scans, Tesseract or PaddleOCR + LLM beats end-to-end VLM on cost and accuracy.
  • No PII redaction — screenshots contain emails and card last-four; blur regions before logging or third-party APIs.
  • Prompt injection via images — text in images can say “ignore prior instructions”; treat image text as untrusted input.
  • Evaluating on clean stock photos — production images are blurry, cropped, and dark; build an internal “bad screenshot” set.
  • Single-model for all locales — UI language in image may differ from ticket body; pass locale hint or use multilingual VLM.

Production checklist

  • Define max image dimensions and accepted formats (JPEG, PNG, WebP; reject SVG exploits).
  • Strip EXIF GPS and serial metadata before storage.
  • Benchmark image-token cost per task at p50 and p95 resolution.
  • Build a 200+ image eval set with human labels (VQA, priority, OCR fields).
  • Compare VLM-only vs OCR+LLM vs CLIP-retrieval baselines on that set.
  • Log VLM summaries separately from user-visible replies for audit.
  • Add NSFW and blank-image pre-filters before GPU inference.
  • Cache vision embeddings for repeated catalog images.
  • Document refusal behavior when image is unreadable.
  • Monitor hallucination rate with periodic human review of sampled outputs.

Key takeaways

  • Vision-language models fuse image understanding with language generation — essential for screenshots, documents, and visual search.
  • Architectures span dual-encoder retrieval (CLIP), cross-attention adapters (LLaVA), and native multimodal transformers (frontier APIs).
  • Training is alignment pretrain plus instruction tuning; RLHF optional for citation quality.
  • Cost is driven by image tokens — resize, tile wisely, and cascade to smaller models.
  • Multimodal RAG combines image retrieval with VLM generation; metadata beats pure visual similarity.

Related reading