Guide

LLM multilingual RAG explained

Harbor Support’s help center served 14% of tickets in Spanish and 6% in Portuguese, but the knowledge base was 92% English PDFs. The first fix — translate every user query to English, run monolingual RAG, translate the answer back — worked on simple refund questions and failed on policy nuance: “devolución” mapped to “return” when the relevant clause used “chargeback.” Retrieval recall on a 200-question multilingual audit set was 41%. Rebuilding with language detection, cross-lingual embeddings, and locale-tagged chunks raised recall to 78% while cutting double-translation latency from 4.2 s to 1.8 s.

Multilingual RAG is not “RAG plus Google Translate.” It is a pipeline that decides which language the user speaks, which language your documents are in, whether retrieval should bridge languages with aligned embedding spaces or explicit query translation, and which language the generator must answer in. This guide covers detection and routing, cross-lingual vs translate-then-retrieve strategies, document tagging and filtering, chunking for CJK and RTL scripts, evaluation across locales, the Harbor Support refactor, a technique decision table versus naive translation wrappers, pitfalls, and a production checklist.

What multilingual RAG solves

Standard RAG assumes query language matches corpus language. Real products break that assumption immediately: global SaaS docs stay English for years while users ask in dozens of languages; acquisitions import German PDFs into an English portal; support macros exist in three locales but the canonical policy lives in one. Without explicit multilingual design, teams either block non-English users or pipe everything through machine translation and inherit translation errors as retrieval errors.

Multilingual RAG separates three problems:

Understanding — detect or accept the user’s query language.
Retrieval — find evidence whether it is stored in the same language or not.
Generation — produce an answer in the user’s locale (or a chosen output language) grounded in retrieved text.

Confusing these steps is the main failure mode. Translating the final answer without fixing retrieval still returns wrong clauses; retrieving in English but generating in Spanish without instructing the model to quote carefully still produces hallucinations.

Language detection and routing

Start with fast, cheap language identification on the user message. Libraries like langdetect, FastText lid models, or cloud APIs return ISO 639-1 codes in milliseconds. For short queries (“refund?”), combine character n-grams with browser Accept-Language and account locale as weak priors — never override a confident detection with locale alone.

Route into one of three buckets:

Monolingual hit — corpus contains sufficient chunks in the query language; run same-language retrieval.
Cross-lingual bridge — corpus is another language; use cross-lingual embeddings or translate-query path.
Unsupported — no corpus coverage and model weak in that locale; return an honest fallback, not English pretending to be fluent.

Store doc_language metadata at ingest time (from file headers, OCR language ID, or manual tags). At query time, prefer same-language chunks when scores tie — users trust answers that cite local policy PDFs over translated excerpts.

Cross-lingual embeddings vs query translation

Two dominant architectures retrieve across languages:

Cross-lingual embedding models

Models such as multilingual E5 (mE5), Cohere embed-multilingual, BGE-M3, and LaBSE map text from many languages into a shared vector space. A Spanish query can neighbor an English chunk if they are semantically similar. Pair with hybrid search (BM25 + vectors) so rare proper nouns and SKU codes still match literally.

Strengths: one index, no per-query translation cost, handles mixed-language documents. Weaknesses: alignment quality varies by language pair; low-resource languages underperform; legal terms with no cross-lingual training pairs miss.

Translate-query-then-retrieve

Detect query language, machine-translate the query to the corpus primary language (usually English), embed or BM25 search, optionally translate retrieved snippets before generation. Strengths: works with monolingual English embedding models you already run in production. Weaknesses: translation errors propagate; double latency; idioms and regional variants degrade recall.

Production systems often hybridize: cross-lingual embeddings for recall, translate-query as a second retrieval pass, merge with reciprocal rank fusion. See embedding model selection for latency and dimension tradeoffs.

Chunking, tokenization and script-specific pitfalls

Multilingual corpora break naive chunking assumptions tuned on English prose:

CJK scripts — no whitespace boundaries; use character- or sentence-segmentation libraries (e.g. Jieba, Sudachi) before token-budget splits. Undersized chunks fragment kanji compounds; oversized chunks dilute embeddings.
German and Finnish compounds — long tokens inflate subword counts; reduce chunk token targets or split on headings.
RTL languages (Arabic, Hebrew) — store logical Unicode order; render bidirectionally in UI only. Metadata fields stay LTR.
Mixed-language pages — FAQ pages with English questions and localized answers need per-block language tags, not one tag per file.

At generation time, pass retrieved snippets in their source language with explicit instructions: “Answer in Spanish; quoted policy text may remain English; cite source IDs.” Forcing the model to silently translate evidence increases citation drift.

Answer locale control

Retrieval language and answer language are independent knobs. Common patterns:

Mirror user locale — Spanish question, Spanish answer, English sources cited with optional gloss.
Corporate lingua franca — internal analysts query in any language; answers always English for audit logs.
Parallel corpus — retrieve canonical English chunk, then fetch pre-approved localized paragraph by paragraph_id instead of live translation.

Encode the chosen locale in the system prompt and in structured output schemas. If you use structured outputs, add an answer_language field validated against an allowlist. Log locale mismatches (user es, answer en) as quality incidents.

Evaluation across languages

Monolingual golden sets lie. Build per-locale question sets with native-speaker labels, plus cross-lingual pairs: same question in English and Spanish should retrieve overlapping chunk IDs. Track RAG metrics separately by language: context recall often collapses for Portuguese before Spanish because corpus coverage differs.

LLM-as-judge evaluators must judge in the answer language or you risk approving fluent but wrong translations. Report faithfulness and answer relevance per locale; a 90% English score with 55% Thai score means you are not production-ready globally.

Harbor Support international docs refactor

Harbor Support replaced translate-everything with a four-stage pipeline:

Ingest — language ID per section; doc_language and locale_priority metadata on every chunk.
Index — mE5-large embeddings; Spanish and Portuguese macro libraries indexed alongside English PDFs.
Retrieve — same-language boost (+0.15 score); parallel English translate-query pass for legal PDFs; RRF merge top 12.
Generate — GPT-4o with “answer in {user_locale}; cite chunk_id; do not invent translated policy text”; fallback to pre-approved localized snippets when available.

Results on the 200-question audit: cross-lingual recall 41% → 78%; median latency 4.2 s → 1.8 s; human-rated answer quality 3.1 → 4.0 on a 5-point scale. Chargeback-vs-refund confusion dropped from 23 misfires to 4 after hybrid retrieval surfaced the English chargeback clause for Spanish queries.

Technique decision table

Scenario	Recommended approach	Avoid
Corpus mostly one language; few user locales	Translate-query + monolingual embeddings	Re-embedding entire corpus in mE5 without measuring gain
Many languages; mixed-language docs	Cross-lingual embeddings + per-block language tags	Single `lang=en` tag on whole PDF
Regulated text; approved translations exist	Parallel corpus by paragraph_id	Live MT of legal answers at inference time
SKU-heavy support (model numbers)	Hybrid BM25 + cross-lingual vectors	Embedding-only search
Low-resource language (e.g. Swahili)	Honest unsupported fallback + English option	Pretending fluency with thin corpus
Latency-sensitive chat	Same-language cache; one retrieval pass	Query translate + snippet translate + answer

Common pitfalls

Translating only the answer — retrieval still misses cross-lingual evidence.
One global chunk size — CJK and English need different token budgets.
Ignoring locale variants — es-MX and es-ES terminology differs; tag or test both.
English-only eval sets — ship multilingual metrics before marketing “global support.”
Silent translation of quotes — breaks audit trails and citation fidelity.
Language detection on code snippets — detect on natural-language portion only.
Over-trusting cross-lingual embeddings on legal terms — add keyword filters for defined terms.
No fallback copy — low-confidence detection should ask the user to confirm language.

Production checklist

Tag every chunk with doc_language (and script if mixed) at ingest.
Run language detection on queries; log confidence scores.
Choose cross-lingual embeddings, translate-query, or hybrid; document the decision.
Apply same-language score boost when ties occur.
Set answer locale in system prompt and structured output validation.
Build golden QA pairs in each supported user language.
Measure context recall and faithfulness per locale weekly.
Provide unsupported-language fallback with explicit user choice.
Cache frequent cross-lingual queries after human review.

Key takeaways

Multilingual RAG coordinates detection, cross-lingual retrieval, and answer locale — not just translation wrappers.
Cross-lingual embeddings and translate-query retrieval are complementary; hybrid fusion beats either alone on mixed corpora.
Chunking and metadata must respect script and per-block language boundaries.
Evaluate faithfulness per locale; strong English RAG can hide weak global coverage.
Harbor Support cut latency and raised recall by indexing locale tags, mE5 retrieval, and controlled answer-language generation.