LLM FLARE active retrieval explained
Harbor Engineering’s internal onboarding assistant indexed 2,400 runbooks, service READMEs, and architecture decision records. New hires asked procedural questions: “How do I deploy the payments service to staging with the current Helm chart?” Single-shot RAG retrieved a generic deployment page but missed the chart’s pinned dependency table buried in a separate values.yaml doc. The model guessed version pins and skipped required secrets. On 120 setup probes, incomplete or wrong-version answers appeared in 37% of responses even when the correct passages existed elsewhere in the corpus.
The team deployed FLARE (Forward-Looking Active REtrieval): the generator writes a partial answer, monitors token-level confidence, and when uncertainty spikes it pauses, drafts a forward-looking retrieval query from what it has written plus what it still needs, fetches fresh chunks, injects them, and continues. Incomplete-answer rate fell to 11%; answer faithfulness on the golden set rose from 58% to 84%. This guide covers uncertainty triggers, query formulation, context injection patterns, the Harbor Engineering refactor, a technique decision table versus multi-hop RAG and single-shot retrieval, pitfalls, and a production checklist.
Why retrieving before generation is not always enough
Classic RAG retrieves once from the user query, then generates. That works when the query fully specifies what facts the answer needs. It fails when the answer unfolds: step two depends on a version number mentioned only after step one is drafted, or a policy exception appears mid-paragraph. The model either hallucinates the missing fact or stays vague.
Pre-retrieval expansions like HyDE and multi-query paraphrase widen the first search but still cannot know which facts will be needed after partial synthesis. Multi-hop pipelines plan sub-questions upfront; FLARE discovers retrieval needs during generation when the model itself signals ignorance via low token probability.
FLARE sits between lightweight single-shot RAG and heavyweight agentic RAG: retrieval is triggered by the generator’s own uncertainty, not by a separate planner with open-ended tool loops. Cost stays more predictable than with full ReAct agents, while FLARE fixes one specific failure mode: answers that go off the rails halfway through.
How FLARE works: uncertainty, pause, retrieve, resume
The FLARE paper (Jiang et al., 2023) formalizes active retrieval augmented generation. Production implementations share four stages:
1. Draft generation with confidence monitoring
The LLM generates tokens autoregressively. After each token (or each sentence boundary for efficiency), the runtime inspects logprobs or an equivalent confidence signal. When the probability of the chosen token falls below a threshold τ, or when entropy across the top-k candidates exceeds a ceiling, generation pauses at a safe boundary (sentence end, clause comma, or paragraph break — never mid-word).
2. Forward-looking query formulation
The system does not re-issue the original user query. It constructs a lookahead query from: (a) the user question, (b) the partial answer drafted so far, and (c) optional masked placeholders for missing facts. A small prompt or the same LLM in a structured mode outputs a search string like “payments-service Helm chart staging image tag 2026-Q2 values.yaml”, tuned to what the draft still needs.
3. Retrieval and deduplication
The new query hits the same hybrid index (BM25 + vectors) used at turn start. Retrieved chunks are deduplicated against passages already in context via chunk ID sets. Overlap with prior retrievals is common; dedup prevents context bloat. Rank fusion and a cross-encoder reranker (see LLM reranking) keep only top-2–3 new passages per FLARE iteration.
4. Context injection and resume
New passages append to a working retrieval buffer (not always the full chat history). Generation resumes from the pause point with updated context. Implementations cap FLARE iterations per answer (typically 2–4) to bound latency and cost. If confidence stays low after max iterations, abstain or escalate rather than guess.
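The four stages above can be sketched as one loop. This is a minimal illustration under stated assumptions, not the Harbor Engineering code or the paper's reference implementation: `Chunk`, `FlareResult`, `build_forward_query`, and `flare_answer` are hypothetical names, `generate_sentence` and `retrieve` are caller-supplied stand-ins for the LLM and the hybrid index, and the sketch assumes a low-confidence sentence is discarded and regenerated after retrieval.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set, Tuple


@dataclass
class Chunk:
    chunk_id: str
    text: str


@dataclass
class FlareResult:
    answer: Optional[str]  # None means the loop abstained
    iterations: int
    queries: List[str] = field(default_factory=list)


def build_forward_query(question: str, draft: str, missing: str) -> str:
    """Combine the user question, the tail of the draft, and a missing-fact hint."""
    tail = " ".join(draft.split()[-12:])  # only the most recent words of the draft
    return " ".join(part for part in (question, tail, missing) if part)


# generate_sentence(question, draft, buffer) returns
# (sentence, mean_logprob, missing_hint), or None when the answer is complete.
Generator = Callable[[str, str, List[Chunk]], Optional[Tuple[str, float, str]]]


def flare_answer(
    question: str,
    initial_chunks: List[Chunk],
    generate_sentence: Generator,
    retrieve: Callable[[str], List[Chunk]],
    tau: float = -1.4,
    max_iterations: int = 3,
    chunks_per_iteration: int = 2,
) -> FlareResult:
    buffer = list(initial_chunks)
    seen: Set[str] = {c.chunk_id for c in buffer}
    draft = ""
    iterations = 0
    queries: List[str] = []

    while True:
        step = generate_sentence(question, draft, buffer)
        if step is None:  # generator signalled the end of the answer
            return FlareResult(draft.strip(), iterations, queries)
        sentence, mean_logprob, missing = step
        if mean_logprob >= tau:  # confident: accept the sentence
            draft = f"{draft} {sentence}".strip()
            continue
        if iterations >= max_iterations:  # still unsure after the cap: abstain
            return FlareResult(None, iterations, queries)
        # Low confidence: drop the sentence, retrieve, then regenerate it.
        iterations += 1
        query = build_forward_query(question, draft, missing)
        queries.append(query)
        fresh = [c for c in retrieve(query) if c.chunk_id not in seen]
        fresh = fresh[:chunks_per_iteration]
        seen.update(c.chunk_id for c in fresh)
        buffer.extend(fresh)
```

The iteration cap bounds only the retrieval path; a production loop would likely also cap the total number of sentences so a generator that never signals completion cannot run forever.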
Uncertainty triggers: what to measure in production
Token probability alone is noisy. Calibrate triggers on a labeled dev set where you know which sentences required extra retrieval:
- Min-token logprob threshold: pause when any token in a sentence scores below τ (e.g. −2.0 nats). Simple but fires on rare proper nouns; whitelist entity tokens or use entity linking first.
- Mean sentence logprob: a smoother signal; pause when the average across a completed sentence drops below τ.
- Entropy spike: high entropy means the model is split between several continuations, which often correlates with missing facts.
- Learned critic head: a small classifier on hidden states predicting “needs retrieval”, trained on traces where humans marked unsupported spans. Lower false-positive rate than raw logprobs.
- Structured self-check: after each sentence, a JSON call: {"needs_fact": true, "missing": "helm image tag"}. Higher latency than logprobs but easier to debug.
Harbor Engineering started with mean sentence logprob (τ = −1.4) plus a blocklist for internal service codenames that always triggered false pauses. They added a learned critic after two weeks when false triggers inflated p95 latency by 40% on FAQ-style questions that needed no mid-generation search.
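The three logprob-based triggers listed above can be sketched in a few lines. This is an illustrative sketch, not Harbor Engineering's code: the function names, the entropy ceiling of 1.0 nat, and the codename "kestrel" in the usage example are assumptions, and the inputs are assumed to be natural-log token probabilities as returned by an API that exposes logprobs.

```python
import math
from typing import Dict, Iterable, List, Sequence


def _scored(logprobs: Sequence[float], tokens: Sequence[str],
            blocklist: Iterable[str]) -> List[float]:
    """Drop logprobs of blocklisted tokens (e.g. internal service codenames)."""
    skip = {t.lower() for t in blocklist}
    return [lp for lp, tok in zip(logprobs, tokens)
            if tok.strip().lower() not in skip]


def min_token_trigger(logprobs: Sequence[float], tokens: Sequence[str],
                      tau: float = -2.0, blocklist: Iterable[str] = ()) -> bool:
    """Fire when any remaining token scores below tau."""
    scored = _scored(logprobs, tokens, blocklist)
    return bool(scored) and min(scored) < tau


def mean_sentence_trigger(logprobs: Sequence[float], tokens: Sequence[str],
                          tau: float = -1.4, blocklist: Iterable[str] = ()) -> bool:
    """Fire when the average logprob of a completed sentence drops below tau."""
    scored = _scored(logprobs, tokens, blocklist)
    return bool(scored) and sum(scored) / len(scored) < tau


def entropy(top_logprobs: Dict[str, float]) -> float:
    """Shannon entropy in nats over the top-k candidates, renormalised to sum to 1."""
    probs = [math.exp(lp) for lp in top_logprobs.values()]
    total = sum(probs)
    return -sum((p / total) * math.log(p / total) for p in probs if p > 0)


def entropy_trigger(per_token_top_logprobs: Sequence[Dict[str, float]],
                    ceiling: float = 1.0) -> bool:
    """Fire when any token position is split across several continuations."""
    return any(entropy(t) > ceiling for t in per_token_top_logprobs)


# Usage: one rare codename token drags both signals below their thresholds
# until it is blocklisted.
tokens = ["Deploy", " with", " kestrel", " chart"]
logprobs = [-0.2, -0.1, -6.0, -0.3]
fires_raw = mean_sentence_trigger(logprobs, tokens)
fires_blocked = mean_sentence_trigger(logprobs, tokens, blocklist=["kestrel"])
```

Thresholds here are placeholders; as the section says, they should come from a labeled dev set rather than from defaults.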
Harbor Engineering refactor (worked example)
Before FLARE: hybrid index, query expansion with two paraphrases, top-6 fusion, cross-encoder rerank to top-4, single-shot GPT-4-class synthesis. Onboarding probe results:
- Answer faithfulness (SME rubric): 58%
- Incomplete or wrong-version answers: 37%
- Retrieval recall@20 on golden queries: 71% (failure was often second-hop facts, not first-hop)
- p95 end-to-end latency: 2.4 s
After FLARE (initial top-4 unchanged, max 3 FLARE iterations, mean logprob trigger, forward-query via structured prompt, +2 chunks per iteration, abstain if still low confidence):
- Faithfulness: 58% → 84%
- Incomplete/wrong-version: 37% → 11%
- FLARE triggered on 42% of queries (mostly procedural multi-step questions)
- Mean FLARE iterations when triggered: 1.6
- p95 latency: 2.4 s → 4.1 s (+1.7 s; acceptable for internal tool)
- Generator + retrieval cost: +94% tokens on triggered queries; +38% blended average
Multi-hop decomposition had been tried: sub-questions were hard to enumerate for open-ended setup flows. FLARE let the model discover “I need the image tag” only after drafting the Helm install step — a dependency single-shot retrieval could not predict.
Technique decision table
| Approach | Best when | Weak when |
|---|---|---|
| Single-shot RAG + rerank | Answers are one fact or one paragraph; query specifies all needs | Procedural answers that unfold across documents |
| Query expansion / HyDE | Recall is low on first hop; vocabulary mismatch | Second-hop facts unknown until draft exists |
| Multi-hop / decomposed RAG | Bridge entities are predictable; QA benchmarks with known hops | Open-ended workflows; hard to pre-plan sub-questions |
| FLARE (active retrieval) | Generation reveals missing facts; procedural and explanatory answers | Strict sub-second latency; API lacks logprobs |
| Agentic RAG (ReAct) | Tools, SQL, calculators, unpredictable multi-step plans | Cost/latency unpredictable; overkill for doc QA |
| Self-RAG reflection tokens | Need sentence-level grounding critique and abstain loops | Main failure is missing facts mid-draft, not unsupported sentences |
FLARE pairs well with a strong first retrieval pass: initial top-k handles obvious context; FLARE fills gaps the draft exposes. Running FLARE without any upfront retrieval wastes iterations rediscovering chunks the user query already implied.
Implementation patterns
Streaming UX
Pause generation visibly: show “Looking up…” when FLARE fires so users do not think the stream stalled. Buffer tokens during retrieval; resume streaming the same answer rather than restarting from scratch when possible.
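One way to model the pause is an event stream that interleaves answer text with status updates. The sketch below is hypothetical: `stream_with_pauses`, the `(kind, payload)` event shape, and the `segments` input (text paired with a flag saying retrieval must run before that text is emitted) are assumptions for illustration, not part of any particular streaming framework.

```python
from typing import Callable, Iterable, Iterator, Tuple

Event = Tuple[str, str]  # (kind, payload); kind is "token" or "status"


def stream_with_pauses(
    segments: Iterable[Tuple[str, bool]],
    on_retrieve: Callable[[], None],
    status_text: str = "Looking up…",
) -> Iterator[Event]:
    """Yield one continuous answer, with a status event around each retrieval pause."""
    for text, needs_retrieval in segments:
        if needs_retrieval:
            yield ("status", status_text)  # tell the UI why the stream paused
            on_retrieve()                  # blocking retrieval; text waits here
            yield ("status", "")           # clear the indicator before resuming
        yield ("token", text)


# Usage: the second segment only streams after retrieval has run.
calls = []
events = list(stream_with_pauses(
    [("Run helm upgrade. ", False), ("Use the pinned tag.", True)],
    on_retrieve=lambda: calls.append("retrieved"),
))
answer = "".join(payload for kind, payload in events if kind == "token")
```

Because the token events form a single uninterrupted answer, the client never has to restart rendering when a retrieval happens mid-stream.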
Context window budget
Each FLARE iteration adds chunks. Use a sliding retrieval buffer: keep initial top-k plus only the last N FLARE passages, or compress older passages with extractive compression before the next resume. Monitor lost-in-the-middle placement — sandwich new facts near the pause point, not at buffer tail.
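A sliding retrieval buffer of this kind might look like the following. It is a sketch only: `RetrievalBuffer` and its parameters are hypothetical names, passages are plain strings for brevity (a real system would likely key on chunk IDs), and where the rendered passages are placed in the prompt is left to the caller.

```python
from collections import deque
from typing import Deque, List


class RetrievalBuffer:
    """Keep the initial top-k pinned and only the last N FLARE passages."""

    def __init__(self, initial: List[str], max_flare_passages: int = 4) -> None:
        self.initial = list(initial)
        # A bounded deque silently drops its oldest item when it is full.
        self.flare: Deque[str] = deque(maxlen=max_flare_passages)

    def add(self, passages: List[str]) -> None:
        """Append new FLARE passages, skipping any already in the buffer."""
        for passage in passages:
            if passage not in self.initial and passage not in self.flare:
                self.flare.append(passage)

    def render(self) -> List[str]:
        """Initial passages first, then surviving FLARE passages, oldest to newest."""
        return self.initial + list(self.flare)


# Usage: with room for two FLARE passages, "c" is evicted once "e" arrives,
# and the duplicate "a" is ignored.
buf = RetrievalBuffer(["a", "b"], max_flare_passages=2)
buf.add(["c", "d"])
buf.add(["a", "e"])
```

Eviction by age is the simplest policy; compressing older passages instead of dropping them, as the paragraph above suggests, would replace the bounded deque with a summarization step.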
API constraints
FLARE requires logprobs or a parallel confidence signal. Some hosted APIs hide logprobs on certain models; in that case, fall back to a sentence-level structured self-check, or to a small local critic model on hidden states if your serving stack exposes them. Without any confidence signal, use fixed sentence-boundary retrieval (retrieve after every N sentences) as a crude approximation; it costs more and is noisier.
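The two fallbacks can be sketched as small helpers. Both function names are hypothetical, the JSON shape follows the {"needs_fact": ..., "missing": ...} example used earlier in this guide, and treating malformed self-check output as "no retrieval requested" is a design assumption that some teams may prefer to invert.

```python
import json
from typing import Optional


def parse_self_check(raw: str) -> Optional[str]:
    """Return the missing-fact hint when the self-check asks for retrieval, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # assumption: malformed output means no retrieval requested
    if not isinstance(data, dict) or data.get("needs_fact") is not True:
        return None
    missing = data.get("missing")
    if isinstance(missing, str) and missing.strip():
        return missing.strip()
    return None


def should_retrieve_fixed(sentences_completed: int, every_n: int = 2) -> bool:
    """Crude fallback: retrieve after every N completed sentences."""
    return sentences_completed > 0 and sentences_completed % every_n == 0


# Usage: the hint feeds directly into the forward-looking query.
hint = parse_self_check('{"needs_fact": true, "missing": "helm image tag"}')
```

The self-check path costs an extra model call per sentence, while the fixed-interval path costs a retrieval every N sentences whether or not one is needed.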
Common pitfalls
- Triggering on every rare token — product names and version strings have low logprob but are retrievable from the first hop; tune thresholds on a dev set, don’t copy paper defaults.
- Re-querying the user question — forward-looking queries must include the partial draft; otherwise FLARE duplicates first-hop results.
- Unbounded iterations — cap at 2–4 retrievals; infinite loops burn budget and still hallucinate.
- No chunk deduplication — repeated passages push out useful context and inflate prefill cost.
- Mid-word pauses — break only on sentence or clause boundaries to avoid garbled resume points.
- Skipping first-hop retrieval — FLARE supplements; it does not replace an initial retrieve from the user query.
- No abstain path — if confidence stays low after max iterations, say “cannot verify in docs” instead of shipping guesses.
Production checklist
- Keep a strong single-shot first retrieval; add FLARE only for unfolding answers.
- Calibrate uncertainty threshold on sentences labeled needs-retrieval vs not.
- Implement forward-query prompt with user Q + partial draft + missing-slot hint.
- Deduplicate by chunk ID across initial and FLARE retrievals.
- Cap FLARE iterations (2–4) and total added chunks per answer.
- Log trigger reason, forward query, chunks added, and iteration count per turn.
- Measure faithfulness and incomplete-answer rate before/after on the same probe set.
- Track p50/p95 latency split: FLARE-triggered vs non-triggered queries.
- Expose pause state in streaming UI so users understand retrieval delays.
- Define abstain copy when max iterations exhaust without confidence recovery.
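The logging and latency items in the checklist can be covered by one small per-turn record. This is a sketch with hypothetical names (`FlareTrace`, `percentile`, `latency_split`); it uses a nearest-rank percentile, which is one of several common definitions, and the latencies in the usage example are made-up illustration values, not measurements.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FlareTrace:
    trigger_reason: str  # e.g. "mean_logprob", or "none" when FLARE never fired
    forward_queries: List[str] = field(default_factory=list)
    chunks_added: int = 0
    iterations: int = 0
    latency_s: float = 0.0

    @property
    def triggered(self) -> bool:
        return self.iterations > 0


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in (0, 100] and values must be non-empty."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_split(traces: List[FlareTrace]) -> Dict[str, Dict[str, float]]:
    """Report p50/p95 separately for FLARE-triggered and non-triggered turns."""
    groups: Dict[str, List[float]] = {"triggered": [], "non_triggered": []}
    for t in traces:
        groups["triggered" if t.triggered else "non_triggered"].append(t.latency_s)
    return {
        name: {"p50": percentile(vals, 50), "p95": percentile(vals, 95)}
        for name, vals in groups.items() if vals
    }


# Usage with illustrative latencies only.
traces = [
    FlareTrace("none", latency_s=1.0),
    FlareTrace("none", latency_s=2.0),
    FlareTrace("mean_logprob", ["helm image tag"], 2, 1, 4.0),
    FlareTrace("mean_logprob", ["helm image tag"], 2, 2, 5.0),
]
split = latency_split(traces)
```

Keeping the two groups separate makes it visible when a latency regression comes from FLARE firing more often rather than from retrieval itself getting slower.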
Key takeaways
- FLARE retrieves during generation when the model signals uncertainty — not only before the first token.
- Forward-looking queries use the partial draft to search for facts the answer still needs.
- Best for procedural and unfolding answers where second-hop needs are unpredictable upfront.
- Harbor Engineering cut incomplete setup answers from 37% to 11% at a cost of +1.7 s p95 latency (2.4 s → 4.1 s).
- Cap iterations, dedupe chunks, and pair FLARE with a solid first-hop retrieval rather than using it as a replacement.
Related reading
- RAG multi-hop retrieval explained — planned sub-questions and bridge entities
- Agentic RAG explained — open-ended tool loops and ReAct orchestration
- Query expansion for retrieval — HyDE and multi-query first-hop recall
- RAG evaluation explained — faithfulness metrics and golden QA sets