
Optical character recognition (OCR) explained

A warehouse clerk photographs a crumpled supplier invoice. A commuter snaps a parking sign. A historian digitizes a 1920s ledger page. Each image holds text that software must read without a human retyping every character. Optical character recognition (OCR) is the branch of computer vision that converts pixels into Unicode strings, often as the first step in search, compliance archives, or retrieval-augmented generation (RAG) over scanned documents. This guide covers classical and neural OCR pipelines, detection versus recognition, layout and table parsing, evaluation with character and word error rates, a Harbor Supply invoice digitization worked example, an approach decision table, common pitfalls, and a practitioner checklist, and it notes where OCR meets object detection and multimodal AI.

What OCR must solve

Printed text on clean PDFs is the easy case. Production OCR faces skewed phone photos, low DPI faxes, bleed-through from the reverse page, mixed fonts, handwriting, stamps overlapping lines, and multilingual invoices where English SKUs sit beside Chinese vendor names. The system must answer three questions: where is text on the page, what characters does each region contain, and how do those regions relate (columns, tables, headers, footnotes)?

Document OCR vs scene text

  • Document OCR — flat pages, high contrast, structured reading order; typical inputs are scans, PDF renders, and mobile captures of receipts.
  • Scene text — arbitrary text in the wild (signs, product labels, license plates); needs robust text detection before recognition and handles perspective distortion.
  • Handwriting OCR (HWR) — cursive and inconsistent glyphs; often needs dedicated models or human-in-the-loop verification for legal records.

Teams that conflate these domains ship pipelines tuned for A4 scans that fail on curved bottle labels — or over-engineer scene-text detectors for born-digital PDFs where text extraction without OCR is faster and lossless.

The OCR pipeline: preprocess, detect, recognize, structure

End-to-end systems exist, but most production stacks chain specialized stages. Each stage can be swapped independently when accuracy or latency requirements change.

Image preprocessing

Raw scans arrive rotated, shadowed, or compressed. Common steps include deskewing, denoising, binarization (a global Otsu threshold or a locally adaptive one such as Sauvola), contrast normalization, and dewarping for curved pages. Over-aggressive binarization erases faint thermal-receipt ink; under-processing leaves JPEG blocking artifacts that confuse recognizers. Keep preprocessing configurable per document class rather than one global filter chain.
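
As a rough illustration of the binarization step, the sketch below implements Otsu's method in plain Python: it picks the single global gray level that maximizes between-class variance. The function names and the flat list-of-pixels input are assumptions made for this example; a production pipeline would more likely call an image library and, for unevenly lit pages, a local method such as Sauvola.

```python
def otsu_threshold(pixels):
    """Return the gray level (0-255) that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_threshold, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for level in range(256):
        weight_bg += hist[level]          # pixels at or below this level
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg     # pixels above this level
        if weight_fg == 0:
            break
        sum_bg += level * hist[level]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarize(pixels, threshold):
    """Map pixels at or below the threshold to 0 (ink), the rest to 255 (paper)."""
    return [0 if p <= threshold else 255 for p in pixels]


# Toy "scan": 40 dark ink pixels and 60 bright paper pixels.
scan = [30] * 40 + [220] * 60
threshold = otsu_threshold(scan)
print(threshold, binarize(scan, threshold).count(0))
```

Because the threshold is global, faint ink that sits close to the paper level can land on the wrong side of it, which is one reason to keep this step configurable per document class.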

Text detection (localization)

Detection finds bounding boxes or polygons around text lines and words. Classical methods use connected components and projection profiles; modern systems use CNN or transformer detectors (EAST, DBNet, CRAFT) that output quadrilaterals robust to rotation. For full pages, layout models segment blocks (title, paragraph, table, figure) before line-level detection — critical when columns would otherwise merge into gibberish reading order.
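
The classical projection-profile idea can be sketched in a few lines: sum the ink in each pixel row and treat runs of non-empty rows as text lines. The input format (rows of 0/1 values) and the `min_ink` parameter are assumptions for illustration; this only works on deskewed, single-column pages and is not a substitute for a learned detector.

```python
def find_text_lines(binary_rows, min_ink=1):
    """Return inclusive (top, bottom) row ranges that contain ink.

    binary_rows: list of rows, each a list with 1 for ink and 0 for paper.
    """
    profile = [sum(row) for row in binary_rows]   # horizontal projection
    lines, start = [], None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                             # a text line begins
        elif ink < min_ink and start is not None:
            lines.append((start, y - 1))          # the line ended on the previous row
            start = None
    if start is not None:                         # line runs to the bottom edge
        lines.append((start, len(profile) - 1))
    return lines


page = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 1, 0, 0],
]
print(find_text_lines(page))  # [(1, 2), (5, 5)]
```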

Text recognition (transcription)

Given a cropped line image, recognition predicts the character sequence. Historical pipelines used Hidden Markov Models on hand-crafted features. Deep learning replaced them with CRNN (CNN encoder + BiLSTM + CTC loss) and attention-based seq2seq decoders. Transformer recognizers treat line images as sequences of visual tokens; TrOCR decodes them autoregressively, while other designs use non-autoregressive heads. These models are strong on printed English and increasingly multilingual with fine-tuning.
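
To make the CTC part of a CRNN concrete, here is a minimal sketch of greedy (best-path) CTC decoding: take the highest-scoring class per frame, collapse consecutive repeats, then drop blanks. The function name, the toy alphabet, and the convention that index 0 is the blank are assumptions for this example; real systems often use beam search with a language model instead.

```python
def ctc_greedy_decode(frame_scores, alphabet, blank=0):
    """Decode per-frame class scores into text with the best-path CTC rule."""
    # Highest-scoring class index for each time step.
    best_path = [max(range(len(scores)), key=scores.__getitem__)
                 for scores in frame_scores]
    decoded, previous = [], blank
    for index in best_path:
        # Emit a symbol only when it differs from the previous frame and is not blank.
        if index != previous and index != blank:
            decoded.append(alphabet[index])
        previous = index
    return "".join(decoded)


alphabet = ["<blank>", "A", "B"]   # index 0 reserved for the CTC blank
frames = [
    [0.10, 0.80, 0.10],   # A
    [0.10, 0.70, 0.20],   # A (repeat, collapsed)
    [0.90, 0.05, 0.05],   # blank separates the two A's
    [0.20, 0.60, 0.20],   # A
    [0.10, 0.20, 0.70],   # B
]
print(ctc_greedy_decode(frames, alphabet))  # AAB
```

The blank between the second and third frames is what lets the decoder output a doubled letter; without it the two A's would collapse into one.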

Post-processing and structure

Raw strings benefit from lexicons (correcting “0” vs “O” in SKU contexts), language models that rescore improbable words, and regex validators for dates and currency. Layout parsing and table extraction rebuild rows and columns — often with dedicated models (Table Transformer, layout-aware LMs) — so downstream ERP systems receive JSON, not a single flattened paragraph.
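
A context-aware correction pass might look like the sketch below. It assumes a hypothetical SKU format of three letters, a hyphen, and five digits (the guide does not specify one) and an ISO-style date format; both are placeholders to adapt per vendor. Confusable characters are mapped toward digits in the numeric part and toward letters in the alphabetic part, and anything that still fails the pattern is rejected rather than guessed.

```python
import re
from datetime import datetime

# Common OCR confusions, applied only where the expected character class is known.
TO_DIGIT = str.maketrans({"O": "0", "o": "0", "I": "1", "l": "1", "S": "5", "B": "8"})
TO_LETTER = str.maketrans({"0": "O", "1": "I", "5": "S", "8": "B"})
SKU_PATTERN = re.compile(r"[A-Z]{3}-\d{5}")   # hypothetical SKU format


def normalize_sku(raw):
    """Return a repaired SKU string, or None if it cannot be made valid."""
    prefix, sep, suffix = raw.strip().partition("-")
    if not sep:
        return None
    candidate = prefix.translate(TO_LETTER).upper() + "-" + suffix.translate(TO_DIGIT)
    return candidate if SKU_PATTERN.fullmatch(candidate) else None


def parse_invoice_date(raw, fmt="%Y-%m-%d"):
    """Return a date if the string is a real calendar date in fmt, else None."""
    try:
        return datetime.strptime(raw.strip(), fmt).date()
    except ValueError:
        return None


print(normalize_sku("H5P-O04l7"))        # HSP-00417
print(normalize_sku("HSP-0041"))         # None
print(parse_invoice_date("2024-02-30"))  # None
```

Returning `None` instead of a best guess keeps the failure visible, so the field can be routed to review rather than silently passed downstream.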

Classical vs neural vs cloud document AI

Tesseract remains useful for clean scans, broad language coverage (more than 100 languages), and offline deployments. It expects reasonable preprocessing and struggles on scene text without custom training. Open-source neural stacks such as PaddleOCR, EasyOCR, and docTR bundle detection and recognition with pretrained weights and GPU acceleration.

Cloud APIs (Google Document AI, Amazon Textract, Azure Document Intelligence) add managed layout, form key-value extraction, and confidence scores per field. You trade per-page cost and data residency constraints against faster time-to-value. For regulated industries, on-prem models plus edge inference keep documents inside the network boundary.

When multimodal LLMs enter the picture

Vision-language models can read document images directly and emit structured JSON — attractive for one-off forms with varied layouts. They are slower and costlier per page than specialized OCR, and may hallucinate field values not present in the scan. A robust pattern: specialized OCR for bulk extraction, LLM verification only on low-confidence fields or exception queues.
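
One way to sketch that pattern: split OCR fields by confidence, and accept a value proposed by an LLM only if it can be found in the OCR text, as a cheap guard against hallucinated values. The `Field` class, both function names, and the 0.85 default (borrowed from the worked example's fallback threshold) are illustrative assumptions, and the LLM call itself is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class Field:
    name: str
    value: str
    confidence: float   # recognizer confidence in [0, 1]


def route_fields(fields, threshold=0.85):
    """Split fields into (accepted, needs_verification) by confidence."""
    accepted = [f for f in fields if f.confidence >= threshold]
    needs_verification = [f for f in fields if f.confidence < threshold]
    return accepted, needs_verification


def is_grounded(value, ocr_text):
    """True if the proposed value appears in the OCR text, ignoring case and spaces."""
    def squash(s):
        return "".join(s.split()).lower()
    return bool(value.strip()) and squash(value) in squash(ocr_text)


fields = [
    Field("invoice_number", "INV-2041", 0.97),
    Field("total", "1,249.00", 0.62),
]
accepted, to_verify = route_fields(fields)
print([f.name for f in accepted], [f.name for f in to_verify])

ocr_text = "Invoice No: INV-2041  Total due 1,249.00"
print(is_grounded("1,249.00", ocr_text), is_grounded("1,294.00", ocr_text))
```

The grounding check is intentionally strict: a value that is not literally present in the scan text goes to the exception queue even if it looks plausible.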

Evaluation: CER, WER, and field-level accuracy

OCR quality is measured at character, word, and field granularity — analogous to how speech recognition uses word error rate (WER).

  • Character error rate (CER) — edit distance at the character level; sensitive to single-digit invoice errors (9 vs 4).
  • Word error rate (WER) — substitutions, insertions, deletions per word; better for prose blocks.
  • Field accuracy — exact match on extracted invoice number, total, tax ID; what finance teams actually care about.
  • Reading order — correct transcription in wrong sequence fails RAG chunking; evaluate with layout-aware metrics on ICDAR and PubLayNet-style benchmarks.
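
CER and WER are both Levenshtein edit distance divided by the reference length, computed over characters and whitespace-separated words respectively. The sketch below is a minimal pure-Python version; the function names are chosen for this example, and it assumes a non-empty reference.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    """Character error rate; the reference must be non-empty."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate over whitespace-separated tokens; reference must be non-empty."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


truth = "Total due 1,249.00"
ocr = "Total due 1,244.00"          # a single digit misread: 9 -> 4
print(round(cer(truth, ocr), 4))    # 0.0556
print(round(wer(truth, ocr), 4))    # 0.3333
```

The example shows why the metrics diverge: one wrong digit is a small CER, a third of the words under WER, and a complete miss under exact-match field accuracy for the total.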

Hold out document types not seen in training: thermal receipts, stamped approvals, low-contrast watermarks. Report confidence-calibrated rejection rates — a system that flags 8% of pages for human review but hits 99.5% on auto-processed fields often beats a lower-CER model that never abstains.
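
The abstention trade-off can be measured directly from a labeled evaluation set. The sketch below assumes each prediction is a `(confidence, is_correct)` pair and reports, for a given threshold, the share sent to review and the accuracy of what is auto-processed; the function name and the toy data are illustrative only.

```python
def review_tradeoff(predictions, threshold):
    """Return (review_rate, auto_accuracy) for a confidence threshold.

    predictions: list of (confidence, is_correct) pairs from a labeled eval set.
    auto_accuracy is None when nothing clears the threshold.
    """
    auto = [ok for confidence, ok in predictions if confidence >= threshold]
    review_rate = 1 - len(auto) / len(predictions)
    auto_accuracy = sum(auto) / len(auto) if auto else None
    return review_rate, auto_accuracy


# Toy evaluation set: three confident correct fields, two uncertain ones.
eval_set = [(0.99, True), (0.95, True), (0.90, True), (0.60, False), (0.50, True)]
for threshold in (0.0, 0.85):
    rate, accuracy = review_tradeoff(eval_set, threshold)
    print(threshold, round(rate, 2), accuracy)
```

Sweeping the threshold over a held-out set gives the curve from which a review budget and a target auto-processed accuracy can be chosen together.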

Worked example: Harbor Supply invoice digitization

Harbor Supply receives 12,000 supplier PDFs and phone photos monthly. Accounts payable needs line items in their ERP within 24 hours; scanned archives must be searchable for audits. The team already stores raw files in S3; OCR is the bridge to structured data.

Pipeline design

  1. Ingest classification — born-digital PDFs route to text extraction (no OCR); images and scanned PDFs route to the OCR stack.
  2. Preprocess — auto-rotate via EXIF, deskew, mild denoise; vendor-specific presets for thermal vs laser scans.
  3. Layout + detection — layout model segments header, line-item table, and totals block; DBNet-style detector catches skewed tables on phone photos.
  4. Recognition — TrOCR fine-tuned on 4,000 Harbor-labeled invoices for SKUs and currency formats; fallback to cloud API on confidence below 0.85.
  5. Validation — regex on invoice numbers; the sum of line items must match the stated total within a tolerance of 0.01; duplicate detection via a hash of vendor ID + date + total.
  6. Human review queue — lowest-confidence fields surfaced in a 90-second review UI; corrections feed weekly fine-tune batches.
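
The validation step (step 5) could be sketched as follows. This is not Harbor Supply's actual code: the function names, the string inputs, the normalization rules, and the choice of SHA-256 are assumptions for illustration. `Decimal` is used so that the 0.01 tolerance is not blurred by binary floating-point rounding.

```python
import hashlib
from decimal import Decimal


def totals_match(line_amounts, stated_total, tolerance=Decimal("0.01")):
    """True if the line items sum to the stated total within the tolerance."""
    computed = sum((Decimal(a) for a in line_amounts), Decimal("0"))
    return abs(computed - Decimal(stated_total)) <= tolerance


def duplicate_key(vendor_id, invoice_date, total):
    """Stable hash over vendor ID + date + total for duplicate detection."""
    # Normalize case, whitespace, and decimal places so trivial variants collide.
    normalized = f"{vendor_id.strip().upper()}|{invoice_date}|{Decimal(total):.2f}"
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


lines = ["199.99", "1049.01"]
print(totals_match(lines, "1249.00"))   # True
print(totals_match(lines, "1244.00"))   # False: likely a misread digit in the total
print(duplicate_key("hs-77 ", "2024-03-01", "1249.0")
      == duplicate_key("HS-77", "2024-03-01", "1249.00"))   # True
```

A failed total check does not say which number is wrong, only that the document should not be auto-posted, so it pairs naturally with the review queue in step 6.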

Results

Field-level accuracy on auto-routed documents rose from 91% (Tesseract-only baseline) to 97.8% (neural stack + validation). Median processing time dropped from 45 seconds per page (manual entry) to 1.2 seconds of OCR, with a 0.3% human review touch rate. CER on held-out thermal receipts improved from 4.2% to 1.1% after targeted fine-tuning; the dominant failure mode had been decimal-point confusion on faded print.
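
The reported figures can also be restated as relative error reductions, which makes the field-level and character-level gains easier to compare. The snippet below is only arithmetic on the numbers above; the helper name is illustrative.

```python
def relative_error_reduction(error_before, error_after):
    """Fraction of the original error that was eliminated."""
    return (error_before - error_after) / error_before


# Field-level error is 100% minus field-level accuracy: 9.0% before, 2.2% after.
print(round(relative_error_reduction(9.0, 2.2), 3))   # 0.756
# CER on held-out thermal receipts: 4.2% before, 1.1% after.
print(round(relative_error_reduction(4.2, 1.1), 3))   # 0.738
```

Both changes remove roughly three quarters of the original errors.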

Approach decision table

| Scenario | Recommended approach | Why |
| --- | --- | --- |
| Clean born-digital PDF | Native text extraction (pdftotext, PyMuPDF) | Lossless, instant; OCR adds error without benefit |
| High-volume scanned invoices | Layout model + fine-tuned recognizer + field validation | Field accuracy and tables matter more than raw CER |
| Offline or air-gapped deployment | Tesseract or PaddleOCR on-prem | No cloud egress; predictable per-page cost |
| Scene text (signs, labels) | Robust polygon detector + CRNN/TrOCR | Document pipelines fail on perspective and clutter |
| Mixed multilingual forms | Multilingual detector + per-script recognizers or cloud Document AI | Single-language models misread CJK and diacritics |
| One-off varied forms, low volume | Multimodal LLM with JSON schema prompt | Layout flexibility; validate outputs strictly |
| RAG over document archive | OCR + layout-aware chunking into vector index | Reading order and headings preserve retrieval quality |

Common pitfalls

  • OCR on extractable PDF text — introduces errors into otherwise perfect strings; branch on PDF text layer presence first.
  • Ignoring reading order — multi-column pages become nonsense for downstream search; invest in layout analysis early.
  • Optimizing CER on clean scans only — production traffic is phone photos and faxes; eval sets must mirror reality.
  • No confidence thresholds — finance and legal need abstention and audit trails, not forced guesses on blurry digits.
  • Skipping checksum validation — invoice totals and tax IDs have structural constraints; cheap rule-based checks (regex, check digits, total reconciliation) beat bigger models for catching OCR slips.
  • Table extraction as plain text — CSV-shaped data pasted into paragraphs breaks ERP imports; use table-specific models or rules.
  • PII in cloud OCR without review — contracts and medical scans may violate policy; classify document sensitivity before routing.
  • No human feedback loop — correction logs are free labeled data; weekly fine-tune beats annual big-bang retraining.
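
To illustrate the reading-order pitfall, the sketch below orders detected boxes column by column instead of strictly top to bottom. It assumes boxes arrive as `(x_left, y_top, text)` tuples and that columns are separated by a horizontal gap of at least `column_gap` pixels; both are simplifications, and real pages with tables, headers spanning columns, or skew need a proper layout model.

```python
def reading_order(boxes, column_gap=50):
    """Return box texts ordered column by column, top to bottom within a column.

    boxes: list of (x_left, y_top, text) tuples from a text detector.
    """
    columns = []
    for box in sorted(boxes, key=lambda b: b[0]):
        # Start a new column when the left edge jumps by at least column_gap.
        if columns and box[0] - columns[-1][-1][0] < column_gap:
            columns[-1].append(box)
        else:
            columns.append([box])
    ordered = []
    for column in columns:                      # columns are already left to right
        ordered.extend(sorted(column, key=lambda b: b[1]))
    return [text for _, _, text in ordered]


boxes = [(40, 100, "A1"), (400, 100, "B1"), (42, 200, "A2"), (398, 200, "B2")]
print(reading_order(boxes))                                       # ['A1', 'A2', 'B1', 'B2']
print([t for _, _, t in sorted(boxes, key=lambda b: (b[1], b[0]))])  # ['A1', 'B1', 'A2', 'B2']
```

The second line shows the naive top-to-bottom sort, which interleaves the two columns and produces the kind of merged text that breaks downstream search.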

Practitioner checklist

  • Classify inputs: digital text vs scan vs photo vs handwriting before choosing a stack.
  • Build preprocessing presets per document class (thermal, A4, mobile capture).
  • Separate detection, recognition, and layout parsing so each stage can be upgraded.
  • Measure CER, WER, and field-level accuracy on held-out document types.
  • Implement confidence thresholds and a human review queue for critical fields.
  • Validate structured outputs with domain rules (totals, dates, ID formats).
  • Log corrections for continuous fine-tuning and regression monitoring.
  • Document data residency and PII handling for cloud vs on-prem routes.
  • Test reading order on multi-column and table-heavy pages before RAG indexing.
  • Benchmark end-to-end latency and cost per page, not isolated recognition speed.

Key takeaways

  • OCR chains localization, transcription, and often layout parsing — not a single monolithic model.
  • Born-digital PDFs should use text extraction; reserve OCR for pixels without a text layer.
  • CER and field accuracy tell different stories; finance workflows need the latter.
  • Confidence-based human review beats chasing zero CER on impossible scans.
  • OCR is frequently the ingest layer for search and RAG — reading order determines retrieval quality.
