Named entity recognition explained

Named entity recognition (NER) is the task of locating and classifying mentions of real-world entities in text: people, organizations, locations, dates, product names, medical codes, cryptocurrency tickers, and domain-specific labels you define yourself. A support ticket that says "Jane Doe at Acme Corp in Berlin reported CVE-2024-1234 on March 3" becomes structured fields your CRM, search index, or compliance pipeline can act on. NER sits at the foundation of information extraction: it turns unstructured prose into typed spans that power RAG metadata filters, knowledge graphs, alert routing, and PII redaction. This guide covers tagging schemes like BIO, classical vs transformer models, when LLM structured extraction beats fine-tuned taggers, entity linking to canonical IDs, evaluation metrics, and a production checklist.

What NER produces

The output of a NER system is a list of spans, each with a start offset, end offset, surface text, and entity type. For the sentence "Apple announced iOS 19 in Cupertino on Tuesday," a general-domain model might return the following (character offsets, end exclusive):

  • ORG: "Apple" (0–5)
  • PRODUCT: "iOS 19" (16–22)
  • GPE (geo-political entity): "Cupertino" (26–35)
  • DATE: "Tuesday" (39–46)

Offsets matter. Downstream systems use them to highlight text in a UI, replace spans with placeholders for privacy, or attach structured fields to the exact tokens a user wrote. A model that finds the right string at the wrong position breaks highlight overlays and diff-based redaction.
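
As a minimal sketch of why offsets matter, the Python below stores the spans from the example sentence, checks that each offset pair reproduces its surface text, and redacts by position. The Span class and the helper names are illustrative, not taken from any particular library, and the redaction step assumes spans do not overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    text: str   # surface string as it appears in the source
    label: str  # entity type


def check_offsets(source: str, spans: list[Span]) -> list[Span]:
    """Return the spans whose offsets do not reproduce their surface text."""
    return [s for s in spans if source[s.start:s.end] != s.text]


def redact(source: str, spans: list[Span]) -> str:
    """Replace each span with a [LABEL] placeholder.

    Works right to left so earlier offsets stay valid while the
    string changes length. Assumes spans do not overlap.
    """
    out = source
    for s in sorted(spans, key=lambda s: s.start, reverse=True):
        out = out[:s.start] + f"[{s.label}]" + out[s.end:]
    return out


sentence = "Apple announced iOS 19 in Cupertino on Tuesday"
spans = [
    Span(0, 5, "Apple", "ORG"),
    Span(16, 22, "iOS 19", "PRODUCT"),
    Span(26, 35, "Cupertino", "GPE"),
    Span(39, 46, "Tuesday", "DATE"),
]
bad = check_offsets(sentence, spans)  # empty when every offset is right
redacted = redact(sentence, spans)
print(redacted)  # [ORG] announced [PRODUCT] in [GPE] on [DATE]
```

A span shifted by a single character, such as Span(1, 6, "Apple", "ORG"), fails check_offsets, which is the kind of error that silently breaks highlight overlays.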

NER is not the same as topic classification (one label per document) or sentiment analysis (one polarity score). It is sequence labeling: every token (or subword) receives a tag, and contiguous runs of entity tags form spans. That framing connects NER to part-of-speech tagging and slot filling in dialogue systems.

Tagging schemes: IO, BIO, and BIOES

Because entities can be multi-token ("New York City", "European Central Bank"), taggers need a scheme that marks boundaries. The most common is BIO (Beginning, Inside, Outside):

  • O — token is outside any entity.
  • B-PER — beginning of a person entity.
  • I-PER — inside continuation of the same person entity.

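BIO prevents two adjacent entities of the same type from merging. The simpler IO scheme (I-PER or O, with no B- prefix) copes with "John Smith met Mary Jones" because the O on "met" separates the two names, but it fails when entities touch: in "Tell Mary John called," IO tags both names I-PER and decodes them as one span, "Mary John." BIO tags that sentence O Tell, B-PER Mary, B-PER John, O called, so the second B- marks where a new entity begins. In full, "John Smith met Mary Jones" is tagged B-PER John, I-PER Smith, O met, B-PER Mary, I-PER Jones.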
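
To make the decoding step concrete, here is a small Python sketch that turns a BIO tag sequence into token-level spans. The function name is illustrative, and treating a stray I- tag as the start of a new span is one lenient choice among several; some decoders discard such tags instead.

```python
def bio_to_spans(tags: list[str]) -> list[tuple[int, int, str]]:
    """Decode BIO tags into (start_token, end_token_exclusive, type) spans."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        # An I- tag continues the open span only if the type matches.
        if prefix == "I" and current == label:
            continue
        # Anything else closes the open span, if there is one.
        if current is not None:
            spans.append((start, i, current))
            start, current = None, None
        # B- opens a new span; a stray I- is treated like B- here.
        if prefix in ("B", "I"):
            start, current = i, label
    if current is not None:
        spans.append((start, len(tags), current))
    return spans


tokens = ["John", "Smith", "met", "Mary", "Jones"]
tags = ["B-PER", "I-PER", "O", "B-PER", "I-PER"]
for start, end, label in bio_to_spans(tags):
    print(label, " ".join(tokens[start:end]))
```

Two directly adjacent entities, tagged B-PER B-PER, come out as two one-token spans because the second B- closes the first span before opening a new one.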

BIOES variants

BIOES adds E- (end) and S- (single-token entity) tags. Some transformer models train slightly faster with BIOES because single-token entities get an explicit S- label instead of a lone B- with no following I-. At inference time, BIO and BIOES decode to the same span list — pick whichever your training framework defaults to and stay consistent.
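
The relationship between the two schemes can be sketched as a conversion function. This Python example assumes its input is well-formed BIO (every I- tag follows a B- or I- of the same type); the function name is illustrative.

```python
def bio_to_bioes(tags: list[str]) -> list[str]:
    """Convert a well-formed BIO sequence to BIOES."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = nxt == f"I-{label}"  # does the entity go on?
        if prefix == "B":
            # A B- with no continuation is a single-token entity.
            out.append(f"B-{label}" if continues else f"S-{label}")
        else:
            # An I- with no continuation is the last token of the entity.
            out.append(f"I-{label}" if continues else f"E-{label}")
    return out


print(bio_to_bioes(["B-PER", "I-PER", "O", "B-LOC"]))
# ['B-PER', 'E-PER', 'O', 'S-LOC']
```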

Schema design for your domain

Off-the-shelf models ship with coarse types (PER, ORG, LOC, MISC in CoNLL-2003; 18 types in OntoNotes). Production systems almost always extend the schema: ACCOUNT_ID, TICKER, CONTRACT_ADDRESS, DIAGNOSIS_CODE. Keep types mutually exclusive per span, document boundary cases ("Amazon" the company vs the river), and write annotation guidelines before labeling — ambiguous schemas destroy inter-annotator agreement.

Model families: from CRFs to transformers

Classical pipeline: spaCy and CRF taggers

Traditional NER stacks run tokenization, part-of-speech tagging, and a linear-chain conditional random field (CRF) over hand-crafted features (word shape, prefix/suffix, gazetteer lookups). spaCy ships fast statistical and CNN/transition-based models per language; en_core_web_trf swaps in a transformer backbone for higher accuracy at higher latency.

Classical models excel when you need sub-10 ms inference on CPU, predictable memory, and no GPU fleet. They struggle on social text, typos, and emerging entities (new token listings, product codenames) unless you retrain or maintain gazetteers.

Transformer encoders: BERT, RoBERTa, DeBERTa

Fine-tuning a pretrained transformer encoder on labeled NER data is the modern default. Each subword token gets a contextual embedding; a linear classification head predicts BIO tags. Models like dslim/bert-base-NER and domain-specific clinical/legal checkpoints reach strong F1 on standard benchmarks.

Subword tokenization introduces an alignment step: WordPiece tokens may split "Playing" into Play + ##ing. Inference code maps subword predictions back to word boundaries — usually first-subword-wins or voting heuristics. Getting alignment wrong is a common source of off-by-one span errors.
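
A first-subword-wins alignment can be sketched in a few lines of Python. The input format mirrors what the word_ids() method of Hugging Face fast tokenizers returns (None for special tokens, otherwise the index of the word each subword came from), but the function works on plain lists so it runs without any library. The subword split in the example is invented for illustration and is not real tokenizer output.

```python
from typing import Optional


def first_subword_labels(
    word_ids: list[Optional[int]], subword_tags: list[str]
) -> list[str]:
    """Collapse subword-level tags to one tag per word (first subword wins)."""
    word_tags: list[str] = []
    seen: set[int] = set()
    for wid, tag in zip(word_ids, subword_tags):
        # Skip special tokens and every subword after the first of a word.
        if wid is None or wid in seen:
            continue
        seen.add(wid)
        word_tags.append(tag)
    return word_tags


# Words: ["Visit", "Cupertino"]
# Hypothetical subwords: [CLS] Visit Cup ##ert ##ino [SEP]
word_ids = [None, 0, 1, 1, 1, None]
subword_tags = ["O", "O", "B-LOC", "I-LOC", "O", "O"]
print(first_subword_labels(word_ids, subword_tags))  # ['O', 'B-LOC']
```

Note that the noisy prediction on the last subword ("##ino" tagged O) is ignored, which is the point of the heuristic: the word-level label depends only on the first piece.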

LLM zero-shot and few-shot extraction

Large language models can extract entities via prompting: "Return JSON with keys people, organizations, dates found in this text." With JSON schema constraints, you get flexible schemas without training data — useful for prototypes and rare entity types.

Trade-offs are real. LLM extraction costs more per document, latency is higher, and models may hallucinate entities not present in the source text. For high-volume, well-defined schemas (invoice fields, SEC filing tickers), a fine-tuned 100M-parameter encoder usually beats a 70B LLM on cost, speed, and hallucination rate. Use LLMs when schema churn is high or labeled data is scarce; use dedicated NER when volume and precision requirements are strict.
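
One inexpensive guard against hallucinated entities is to check that every string the LLM returns actually occurs in the source, and to recover offsets at the same time. The Python sketch below assumes the model replied with a JSON object mapping type names to lists of strings, as in the prompt above; the reply shown is hand-written for illustration, not real model output. It uses exact, first-occurrence matching, so repeated or paraphrased mentions would need extra handling.

```python
import json


def ground_entities(source: str, llm_json: str) -> tuple[list[dict], list[dict]]:
    """Split LLM-proposed entities into grounded (found in source) and rejected."""
    grounded, rejected = [], []
    for type_name, values in json.loads(llm_json).items():
        for value in values:
            record = {"type": type_name, "text": value}
            # Empty strings would "match" at offset 0, so reject them outright.
            start = source.find(value) if value else -1
            if start == -1:
                rejected.append(record)
            else:
                grounded.append({**record, "start": start, "end": start + len(value)})
    return grounded, rejected


source = "Jane Doe at Acme Corp in Berlin reported CVE-2024-1234 on March 3"
# Hand-written stand-in for a model reply; "Globex" is not in the source.
reply = '{"people": ["Jane Doe"], "organizations": ["Acme Corp", "Globex"], "dates": ["March 3"]}'

grounded, rejected = ground_entities(source, reply)
print([g["text"] for g in grounded])  # ['Jane Doe', 'Acme Corp', 'March 3']
print([r["text"] for r in rejected])  # ['Globex']
```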

Training data and annotation discipline

NER quality is bounded by label quality. Before training:

  • Write a style guide with positive and negative examples per entity type.
  • Measure inter-annotator agreement (Cohen's kappa or span-level F1 between two labelers). Below ~0.8 on a pilot set, fix guidelines before scaling.
  • Include hard cases: nested entities, ambiguous acronyms, partial mentions ("the Fed" vs "Federal Reserve").
  • Match training distribution to production: if users paste JSON and markdown, annotate messy formatting — not just clean newswire.
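
For the agreement check, a token-level Cohen's kappa can be computed directly from two annotators' tag sequences. This is a minimal Python sketch with an illustrative function name and toy labels. Because most tokens are O, token-level kappa tends to look better than span-level agreement, so it is worth reporting both.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators labeling the same tokens."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of tokens with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


annotator_a = ["B-PER", "I-PER", "O", "O", "B-ORG", "O"]
annotator_b = ["B-PER", "I-PER", "O", "O", "O", "O"]
kappa = cohens_kappa(annotator_a, annotator_b)
print(round(kappa, 3))  # 0.727
```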

Data volume guidelines (rule of thumb): 1,000–5,000 labeled sentences per entity type for transformer fine-tuning in a narrow domain; more if types overlap heavily or text is noisy. Active learning — model flags uncertain spans for human review — stretches annotation budget further than random sampling.

Handling imbalanced types

Rare types (e.g. CVE_ID) get ignored unless you oversample sentences containing them, use class-weighted loss, or train a two-stage detector: binary "contains CVE?" filter, then span tagger. Metrics like macro-F1 surface per-type weakness that micro-F1 hides.
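
As one way to derive weights for a class-weighted loss, the Python sketch below uses the common inverse-frequency heuristic: weight = total tags / (number of distinct tags × count of this tag). The function name and the toy data are illustrative, and the resulting numbers are a starting point to tune rather than a recommended setting.

```python
from collections import Counter


def balanced_tag_weights(tag_sequences: list[list[str]]) -> dict[str, float]:
    """Inverse-frequency weight per tag, so rare tags count more in the loss."""
    counts = Counter(tag for seq in tag_sequences for tag in seq)
    total = sum(counts.values())
    return {tag: total / (len(counts) * c) for tag, c in counts.items()}


training_tags = [
    ["O", "O", "O", "B-CVE_ID"],
    ["O", "O", "O", "O"],
]
weights = balanced_tag_weights(training_tags)
print(weights["B-CVE_ID"])  # 4.0
```

The resulting dictionary would typically be mapped onto the label index order of the training framework and passed to its weighted cross-entropy loss.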

Entity linking and normalization

Recognizing "Apple" as ORG is step one. Entity linking (a.k.a. entity resolution) maps the span to a canonical ID: Wikidata Q312 (Apple Inc.), a CRM account UUID, or a Solana mint address. Linking disambiguates homographs and enables joins across documents.

Common patterns:

  • Gazetteer lookup — hash map from alias lists to IDs; fast for tickers and country names.
  • Candidate generation + ranking — retrieve top-k Wikipedia entities by surface form similarity, rerank with context embeddings.
  • Embedding nearest neighbor — embed the mention together with its context ("Apple announced earnings…") and retrieve the closest entries from an index of embedded entity descriptions.

Linking errors compound silently. Log confidence scores and route low-confidence spans to human review or "unlinked entity" buckets rather than forcing a wrong ID into a knowledge graph.
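
The gazetteer pattern with an explicit unlinked bucket can be sketched as follows. The alias table is a toy: the Wikidata ID is the one mentioned above, while the CRM account ID, the threshold of 0.8, and the function name are made up for illustration.

```python
UNLINKED = "UNLINKED"

# Toy alias table mapping normalized surface forms to canonical IDs.
GAZETTEER = {
    "apple": "wikidata:Q312",
    "apple inc.": "wikidata:Q312",
    "acme corp": "crm:acct-0001",  # hypothetical CRM account ID
}


def link(surface: str, confidence: float, threshold: float = 0.8) -> str:
    """Return a canonical ID, or UNLINKED when unsure or unknown.

    `confidence` is whatever score the upstream recognizer attaches
    to the span; the threshold here is an arbitrary example value.
    """
    if confidence < threshold:
        return UNLINKED  # route to human review instead of guessing
    return GAZETTEER.get(surface.strip().lower(), UNLINKED)


print(link("Apple", 0.95))   # wikidata:Q312
print(link("Apple", 0.40))   # UNLINKED
print(link("Globex", 0.99))  # UNLINKED
```

A plain lookup like this cannot tell Apple the company from apple the fruit; that is where the candidate ranking and embedding approaches earn their extra cost.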

Evaluation: span-level precision, recall, and F1

NER evaluation is span-based, not token-based (though token F1 is sometimes reported). A predicted span is correct only if both boundaries and entity type match gold labels exactly. A partial overlap is penalized twice: the predicted span counts as a false positive and the gold span it failed to match counts as a false negative.
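
Exact-match scoring is easy to sketch with sets of (start, end, type) tuples. The Python below computes per-type F1 plus micro and macro averages; the function names are illustrative, and in practice an established evaluation library is usually a safer choice than hand-rolled scoring.

```python
def f1_score(tp: int, n_pred: int, n_gold: int) -> float:
    """F1 from true positives, predicted count, and gold count."""
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(gold: set, pred: set) -> dict:
    """Exact-match span evaluation: a span counts only if start, end, and type all match."""
    types = sorted({t for _, _, t in gold | pred})
    per_type = {}
    for t in types:
        g = {s for s in gold if s[2] == t}
        p = {s for s in pred if s[2] == t}
        per_type[t] = f1_score(len(g & p), len(p), len(g))
    micro = f1_score(len(gold & pred), len(pred), len(gold))
    macro = sum(per_type.values()) / len(per_type) if per_type else 0.0
    return {"per_type": per_type, "micro_f1": micro, "macro_f1": macro}


gold = {(0, 5, "ORG"), (16, 22, "PRODUCT"), (26, 35, "GPE"), (39, 46, "DATE")}
# The PRODUCT prediction stops three characters short: a partial overlap.
pred = {(0, 5, "ORG"), (16, 19, "PRODUCT"), (26, 35, "GPE"), (39, 46, "DATE")}
report = evaluate(gold, pred)
print(report["per_type"]["PRODUCT"])  # 0.0
print(report["micro_f1"])             # 0.75
```

The truncated PRODUCT span scores zero for its type even though it covers part of the right text, which is the strictness described above.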

Report per-type precision, recall, and F1 plus micro and macro averages. On imbalanced corpora, macro-F1 reveals weak rare types. For production monitoring, track:

  • Drift in entity type distribution (a sudden spike in ORG may signal a model regression or a change in the incoming text).
  • Human audit sample on low-confidence predictions.
  • Downstream task metrics — does better NER actually improve search recall or ticket routing accuracy?
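
The first monitoring item can be approximated by comparing the share of each entity type in a recent window against a baseline. The Python sketch below uses total variation distance as the drift score; that metric, the function names, and the toy counts are assumptions for illustration, and any alert threshold would need tuning on real traffic.

```python
from collections import Counter


def type_shares(labels: list[str]) -> dict[str, float]:
    """Fraction of predicted entities belonging to each type."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}


def total_variation(baseline: dict[str, float], current: dict[str, float]) -> float:
    """Half the summed absolute difference in shares (0 = identical, 1 = disjoint)."""
    keys = set(baseline) | set(current)
    return 0.5 * sum(abs(baseline.get(k, 0.0) - current.get(k, 0.0)) for k in keys)


baseline = type_shares(["ORG", "ORG", "PER", "PER"])
this_week = type_shares(["ORG", "ORG", "ORG", "PER"])
drift = total_variation(baseline, this_week)
print(drift)  # 0.25
```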

Standard public benchmarks (CoNLL-2003, OntoNotes, WNUT) are useful for model shopping but rarely predict in-domain performance. Always hold out a private test set drawn from your production text.

NER in modern AI pipelines

RAG and search metadata

Extracting entities at index time lets you attach structured filters: "documents mentioning ORG:Acme after DATE:2025-01-01." Combined with vector search, entity tags improve precision for analyst and legal workflows without replacing semantic retrieval.
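
A metadata filter of this kind might look like the Python sketch below, applied before or after vector search. The document records, field names, and IDs are invented for illustration; a real system would usually push the filter into the search engine or vector database rather than filtering in application code.

```python
from datetime import date

# Toy index records: entity tags were attached at index time.
docs = [
    {"id": "doc-1", "orgs": {"Acme"}, "published": date(2025, 3, 3)},
    {"id": "doc-2", "orgs": {"Acme"}, "published": date(2024, 11, 20)},
    {"id": "doc-3", "orgs": {"Globex"}, "published": date(2025, 2, 1)},
]


def filter_docs(records: list[dict], org: str, after: date) -> list[str]:
    """IDs of documents that mention `org` and were published after `after`."""
    return [
        d["id"]
        for d in records
        if org in d["orgs"] and d["published"] > after
    ]


print(filter_docs(docs, "Acme", date(2025, 1, 1)))  # ['doc-1']
```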

PII detection and redaction

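NER models trained on types such as PER, EMAIL, PHONE, and SSN gate data before it enters logs or third-party LLM APIs. Regex handles fixed formats such as email addresses well, but it misses PII that has no fixed pattern, such as a name or street address in free text ("ask for Maria at the Elm Street office"). Combining NER with rules produces fewer false negatives than either approach alone.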
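
Combining rules with a model can be as simple as taking the union of both span lists and letting rule matches win on overlap. In the Python sketch below, the email pattern is deliberately simplified and the model span is a hand-written stand-in for real NER output; the function names are illustrative.

```python
import re

# Simplified email pattern for illustration; real address syntax is broader.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def regex_spans(text: str) -> list[tuple[int, int, str]]:
    """Rule-based spans as (start, end_exclusive, type)."""
    return [(m.start(), m.end(), "EMAIL") for m in EMAIL_RE.finditer(text)]


def merge_spans(rule_spans, model_spans):
    """Union of both lists; a model span overlapping any rule span is dropped."""
    kept = list(rule_spans)
    for s in model_spans:
        if all(s[1] <= r[0] or s[0] >= r[1] for r in rule_spans):
            kept.append(s)
    return sorted(kept)


text = "Contact Jane Doe at jane.doe@example.com"
model_spans = [(8, 16, "PER")]  # stand-in for NER model output
merged = merge_spans(regex_spans(text), model_spans)
print(merged)  # [(8, 16, 'PER'), (20, 40, 'EMAIL')]
```

Preferring the rule span on overlap is a design choice that suits fixed formats; for types where the model is more reliable, the priority could be reversed.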

Agents and tool routing

When an AI agent parses "Schedule a call with Dr. Patel about the Q3 Boston audit," NER-derived slots populate calendar API parameters. Pair extraction with validation — a misparsed date should fail closed, not create a wrong meeting.
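
Failing closed on a date slot might look like the following Python sketch. It assumes an upstream step has already normalized the extracted date to ISO format (YYYY-MM-DD); the function name and the rule that meetings cannot be scheduled in the past are illustrative assumptions.

```python
from datetime import date, datetime
from typing import Optional


def parse_meeting_date(raw: str, today: date) -> Optional[date]:
    """Return a validated date, or None so the caller can refuse to act."""
    try:
        parsed = datetime.strptime(raw.strip(), "%Y-%m-%d").date()
    except ValueError:
        return None  # unparseable or impossible date: fail closed
    # Reject dates in the past rather than silently booking them.
    return parsed if parsed >= today else None


today = date(2025, 9, 1)
print(parse_meeting_date("2025-09-15", today))  # 2025-09-15
print(parse_meeting_date("2025-02-30", today))  # None
```

When the function returns None, the agent should ask the user to confirm the date instead of calling the calendar API.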

Production checklist

  1. Define entity types and annotation guidelines before labeling; pilot 200 sentences for agreement.
  2. Choose BIO or BIOES and keep training/inference decoding consistent.
  3. Benchmark spaCy/statistical, fine-tuned transformer, and LLM extraction on a private test set — optimize cost vs F1, not leaderboard scores.
  4. Handle subword alignment carefully; add unit tests for multi-token and single-token entities.
  5. Log spans with offsets, types, and confidence; never trust unvalidated LLM JSON in compliance paths.
  6. Add gazetteers for closed vocabularies (tickers, airport codes) as a post-processing or feature layer.
  7. Implement entity linking with explicit "unknown" when confidence is below threshold.
  8. Monitor per-type F1 on a weekly labeled sample; retrain when macro-F1 drops or new product vocabulary appears.
  9. Version models and schemas; breaking type renames require backfill jobs on stored extractions.
  10. Document nested-entity policy (flat vs layered spans) — mixed approaches confuse annotators and models.

Key takeaways

  • NER is span labeling — boundaries and types both matter for downstream use.
  • BIO tagging is the standard way to mark multi-token entities without merge errors.
  • Fine-tuned encoders win on volume and cost; LLMs win on schema flexibility when data is scarce.
  • Entity linking turns surface strings into actionable IDs — recognition alone is rarely enough.
  • Evaluate on your text — public benchmark F1 does not guarantee production performance.
