Guide
LLM codebase RAG explained
Harbor Engineering’s internal code assistant answered “Where do we
validate JWT expiry before issuing refresh tokens?” with a chunk from
README.md and two unrelated middleware files. The real logic lived
in auth/token_service.rs, called from a handler three imports
away. Document-style RAG — fixed 512-token windows over raw files —
split the validation function across four chunks and never surfaced the call
graph edge that mattered. On a 150-question eval set drawn from on-call
tickets, correct-file recall@5 was 54% and answer accuracy was 41%.
After rebuilding the index with AST-aware chunks, symbol
metadata, and one-hop dependency expansion at query time, recall@5 rose to
87% and accuracy to 73% without changing the embedding model.
Codebase RAG retrieves source units — functions, classes, modules, config blocks — instead of prose paragraphs. Code has syntax, scopes, imports, and identifiers that fixed-size text splitting destroys. Production pipelines parse with tree-sitter or language servers, attach symbol tables, combine BM25 on identifiers with dense embeddings on docstrings and bodies, and expand context along import/call edges before synthesis. This guide covers why repo RAG differs from document RAG, chunking and metadata design, hybrid retrieval, graph expansion, the Harbor Engineering refactor, a technique decision table, pitfalls, and a checklist — complementing chunking strategy, hybrid search, and agentic RAG for multi-step repo navigation.
Why code is not a document corpus
Naive RAG treats a repository like a folder of text files. That fails for predictable reasons:
- Syntax boundaries — splitting mid-function produces chunks that do not compile mentally; the model sees half an if block and invents the rest.
- Identifier sparsity — dense embeddings underweight exact names like validate_refresh_token; BM25 or ripgrep-style lexical search often wins on symbol lookup.
- Cross-file dependencies — the answer may be a 12-line helper in lib.rs while the question references a route in api/handlers.rs.
- Generated and vendored code — node_modules, target/, protobuf output, and lockfiles pollute indexes unless excluded at ingest.
- Version drift — stale chunks from another branch are worse than no retrieval; tie chunks to commit SHA and path.
Codebase RAG is retrieval engineering plus lightweight program analysis, not a bigger embedding model.
AST-aware chunking
Parse each file with a grammar-aware parser (tree-sitter is the common choice: one grammar per language, incremental re-parse on edit). Emit chunks at natural boundaries:
- Primary unit — function, method, struct, enum, or class (language-dependent).
- Overflow policy — if a function exceeds your token budget (~300–800 tokens for many code models), split on inner blocks (loops, match arms) while duplicating the signature in each child chunk as a header.
- Module-level context — prepend imports, module doc comment, and enclosing type signature so standalone chunks remain readable.
- Tests — index test functions separately with metadata kind: test; they often document intent better than production code.
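To make the boundary rules above concrete, here is a minimal sketch that uses Python's built-in ast module as a stand-in for tree-sitter. That substitution is an assumption made so the example runs without extra dependencies, and it only handles Python source. The names Chunk and chunk_module are hypothetical. It emits one chunk per top-level function or class, prepends the module's imports as a header, and tags test functions with kind "test".

```python
import ast
from dataclasses import dataclass


@dataclass
class Chunk:
    symbol: str
    kind: str        # "function", "class", or "test"
    start_line: int  # 1-indexed, inclusive
    end_line: int
    text: str        # import header followed by the unit's source


def chunk_module(source: str) -> list[Chunk]:
    """Emit one chunk per top-level function or class in a Python module."""
    tree = ast.parse(source)
    lines = source.splitlines()

    # Collect import statements once so every chunk can carry them as a header.
    header = "\n".join(
        ast.get_source_segment(source, node)
        for node in tree.body
        if isinstance(node, (ast.Import, ast.ImportFrom))
    )

    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            kind = "test" if node.name.startswith("test_") else "function"
        elif isinstance(node, ast.ClassDef):
            kind = "class"
        else:
            continue
        # Start at the first decorator, if any, so it stays with its unit.
        start = min([node.lineno] + [d.lineno for d in node.decorator_list])
        body = "\n".join(lines[start - 1 : node.end_lineno])
        text = f"{header}\n\n{body}" if header else body
        chunks.append(Chunk(node.name, kind, start, node.end_lineno, text))
    return chunks


# Hypothetical sample module used only for illustration.
SAMPLE = '''import time

def is_expired(exp: float) -> bool:
    return exp <= time.time()

def test_is_expired():
    assert is_expired(0.0)
'''

for chunk in chunk_module(SAMPLE):
    print(chunk.kind, chunk.symbol, chunk.start_line, chunk.end_line)
```

A tree-sitter version would follow the same shape: walk the top-level nodes, pick the node types that count as units for each language, and slice the source by the node's line range. The overflow policy for very long functions is not covered here.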
Store stable chunk IDs:
{repo}:{commit}:{path}#{symbol}:{start_line}-{end_line}.
Reindex jobs diff by path and symbol hash rather than re-embedding the
whole repo on every push.
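The ID format and the diff-by-hash idea can be sketched as follows. This is an illustration under assumptions: the helper names (chunk_id, symbol_hash, diff_symbols) are hypothetical, SHA-256 is one reasonable choice of hash, and the symbols issue_token and revoke_token are made up for the example.

```python
import hashlib


def chunk_id(repo: str, commit: str, path: str, symbol: str,
             start_line: int, end_line: int) -> str:
    """Format an ID as {repo}:{commit}:{path}#{symbol}:{start_line}-{end_line}."""
    return f"{repo}:{commit}:{path}#{symbol}:{start_line}-{end_line}"


def symbol_hash(text: str) -> str:
    """Hash the chunk text so unchanged symbols can be skipped on reindex."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def diff_symbols(old: dict, new: dict) -> tuple[set, set]:
    """Compare {(path, symbol): hash} maps from two commits.

    Returns (to_embed, to_delete): keys that are new or changed, and keys
    that no longer exist in the newer commit.
    """
    to_embed = {key for key, digest in new.items() if old.get(key) != digest}
    to_delete = set(old) - set(new)
    return to_embed, to_delete


path = "auth/token_service.rs"
old = {
    (path, "validate_refresh_token"): symbol_hash("fn a() {}"),
    (path, "issue_token"): symbol_hash("fn b() {}"),
}
new = {
    (path, "validate_refresh_token"): symbol_hash("fn a() { check(); }"),
    (path, "issue_token"): symbol_hash("fn b() {}"),
    (path, "revoke_token"): symbol_hash("fn c() {}"),
}
to_embed, to_delete = diff_symbols(old, new)
print(sorted(to_embed), sorted(to_delete))
```

Only the changed and newly added symbols need re-embedding; the untouched issue_token chunk keeps its existing vector.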
Symbol tables and metadata payloads
Every chunk carries structured metadata for pre-filtering and display:
- language, path, repo, commit_sha, branch (optional)
- symbol_name, symbol_kind (function, class, trait, macro)
- exports — public vs private
- imports — resolved module paths where the indexer can compute them
- callers/callees — static call graph edges from tree-sitter queries or LSP cross-references
- last_modified, author (from git blame, if policy allows)
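One way to represent this payload and use it for pre-filtering is sketched below. The class name ChunkMetadata, the function prefilter, and the sample values ("harbor", "abc123", TokenStore, refresh) are hypothetical; a real vector store would apply the same conditions through its own filter syntax.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase
from typing import Optional


@dataclass
class ChunkMetadata:
    language: str
    path: str
    repo: str
    commit_sha: str
    symbol_name: str
    symbol_kind: str                 # function, class, trait, macro
    exported: bool = False           # public vs private
    imports: list[str] = field(default_factory=list)
    callers: list[str] = field(default_factory=list)
    callees: list[str] = field(default_factory=list)
    branch: Optional[str] = None     # optional


def prefilter(chunks: list[ChunkMetadata], path_glob: str = "*",
              symbol_kind: Optional[str] = None) -> list[ChunkMetadata]:
    """Keep chunks whose path matches the glob and, if given, the symbol kind."""
    return [
        c for c in chunks
        if fnmatchcase(c.path, path_glob)
        and (symbol_kind is None or c.symbol_kind == symbol_kind)
    ]


# Placeholder repo name and commit SHA, for illustration only.
index = [
    ChunkMetadata("rust", "auth/token_service.rs", "harbor", "abc123",
                  "validate_refresh_token", "function", exported=True),
    ChunkMetadata("rust", "auth/token_service.rs", "harbor", "abc123",
                  "TokenStore", "trait"),
    ChunkMetadata("rust", "api/handlers.rs", "harbor", "abc123",
                  "refresh", "function"),
]

candidates = prefilter(index, path_glob="auth/*", symbol_kind="function")
print([c.symbol_name for c in candidates])
```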
Filters like path:auth/* AND symbol_kind:function cut the search
space before the approximate nearest neighbor (ANN) search runs, mirroring
vector metadata filtering
patterns. Never index secrets: run a pre-ingest scanner for API keys,
.env patterns, and PEM blocks; block or redact matches.
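A pre-ingest scanner can start as a handful of regular expressions. The sketch below is illustrative only: the three patterns are assumptions chosen for the example, they will produce false positives (any line assigning to a name containing "token", for instance), and dedicated secret scanners ship much larger rule sets. The functions scan and redact are hypothetical names.

```python
import re

# Illustrative patterns only; not a complete rule set.
SECRET_PATTERNS = {
    # Whole PEM private key block, header through footer.
    "pem_block": re.compile(
        r"(?s)-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----"
    ),
    # AWS access key IDs: "AKIA" followed by 16 uppercase letters or digits.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # .env-style assignments to names containing SECRET, TOKEN, PASSWORD, API_KEY.
    "env_assignment": re.compile(
        r"(?im)^[ \t]*[A-Z0-9_]*(?:SECRET|TOKEN|PASSWORD|API_KEY)[A-Z0-9_]*[ \t]*=[ \t]*\S+"
    ),
}


def scan(text: str) -> list[str]:
    """Return the names of all patterns that match the text."""
    return [name for name, pattern in SECRET_PATTERNS.items() if pattern.search(text)]


def redact(text: str) -> str:
    """Replace every match with a placeholder so the chunk can still be indexed."""
    for name, pattern in SECRET_PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{name}]", text)
    return text


env_file = "DB_PASSWORD=hunter2\nfn main() {}\n"
print(scan(env_file))
print(redact(env_file))
```

Whether to block the whole file or redact and continue is a policy decision; blocking is the safer default when a match lands in a file that should never have been committed.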
Hybrid retrieval: identifiers plus semantics
A practical query pipeline runs three channels and fuses results:
- Lexical — BM25 or ripgrep index on symbol names, paths, and comments. Strong for “find RefreshTokenService” queries.
- Dense — bi-encoder embeddings on signature + docstring + body. Use a code-tuned model (e.g. voyage-code, jina-embeddings-v2-base-code) or a general model with instruction prefix “Represent this code for retrieval.”
- Path heuristics — boost chunks whose path tokens overlap query terms (auth, jwt, middleware).
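The lexical channel depends heavily on how identifiers are tokenized. The sketch below splits snake_case and CamelCase names into parts and scores documents with a compact BM25 implementation. It is a teaching sketch, not a replacement for a search engine: the class name BM25, the tokenize helper, the parameter defaults (k1 = 1.5, b = 0.75), and the three sample documents are assumptions for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Split on non-alphanumerics, then split CamelCase parts; lowercase all."""
    tokens = []
    for word in re.findall(r"[A-Za-z0-9]+", text):
        parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", word)
        tokens.extend(p.lower() for p in parts)
    return tokens


class BM25:
    def __init__(self, docs: dict[str, str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.tf = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
        self.length = {doc_id: sum(tf.values()) for doc_id, tf in self.tf.items()}
        self.avg_len = sum(self.length.values()) / max(len(docs), 1)
        # Document frequency: number of documents containing each term.
        self.df = Counter(term for tf in self.tf.values() for term in tf)

    def score(self, query: str, doc_id: str) -> float:
        tf, n = self.tf[doc_id], len(self.tf)
        total = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - self.df[term] + 0.5) / (self.df[term] + 0.5))
            norm = tf[term] + self.k1 * (
                1 - self.b + self.b * self.length[doc_id] / self.avg_len
            )
            total += idf * tf[term] * (self.k1 + 1) / norm
        return total

    def search(self, query: str, k: int = 5) -> list[str]:
        scored = [(self.score(query, doc_id), doc_id) for doc_id in self.tf]
        ranked = sorted(scored, key=lambda pair: -pair[0])
        return [doc_id for score, doc_id in ranked if score > 0][:k]


# Each document is path + symbol name + comment text (hypothetical content).
docs = {
    "auth/token_service.rs#RefreshTokenService":
        "auth/token_service.rs RefreshTokenService issues and validates refresh tokens",
    "api/handlers.rs#login": "api/handlers.rs login handles the login route",
    "README.md": "README.md project overview and setup",
}
index = BM25(docs)
print(index.search("find RefreshTokenService"))
```

Splitting RefreshTokenService into refresh, token, and service lets a query match the symbol even when the user types only part of the name.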
Merge with reciprocal rank fusion (RRF) or a lightweight cross-encoder reranker on the top 30 candidates. Cap final context at what your synthesis model tolerates — often 4–8 functions plus one hop of callees. See embedding fundamentals for model trade-offs.
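Reciprocal rank fusion needs only the rank positions from each channel, not their raw scores. A minimal sketch follows; the function name rrf_fuse and the sample chunk names are hypothetical, and k = 60 is the constant commonly used with RRF rather than a value taken from any particular deployment.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60, top_n: int = 30) -> list[str]:
    """Fuse ranked lists: each item scores sum(1 / (k + rank)) over the lists."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk in enumerate(ranking, start=1):
            scores[chunk] += 1.0 / (k + rank)
    # Sort by descending score; break ties by name for deterministic output.
    return sorted(scores, key=lambda chunk: (-scores[chunk], chunk))[:top_n]


lexical = ["token_service", "handlers", "readme"]
dense = ["middleware", "token_service", "handlers"]
path_boost = ["token_service", "middleware"]

print(rrf_fuse([lexical, dense, path_boost]))
```

Because token_service ranks first or second in all three channels, it leads the fused list, while readme, seen only once at rank three, falls to the end. The fused top 30 can then go to a cross-encoder reranker if latency allows.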
Dependency expansion at query time
Top-k vector hits alone miss callers and callees. After initial retrieval:
- Callee expansion — for each hit function, attach definitions of symbols it calls within the same repo (one hop default, two hops for small services).
- Caller expansion — when the question is “who invokes X?”, reverse edges from the static graph beat semantic search.
- Interface files — always include related .proto, OpenAPI specs, or trait definitions when a handler implements them.
- Deduplication — expansion produces overlapping chunks; apply chunk deduplication before packing context.
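Callee expansion, caller lookup, and deduplication can be sketched as graph operations over a precomputed call graph. The example assumes the graph is a plain dict mapping each symbol to the symbols it calls; the function names and the symbols in the sample graph (handlers::refresh, issue_token, decode_claims) are hypothetical.

```python
from collections import deque


def expand_callees(hits: list[str], call_graph: dict[str, list[str]],
                   hops: int = 1, max_chunks: int = 12) -> list[str]:
    """Breadth-first expansion along callee edges, deduplicated and capped."""
    seen, ordered, queue = set(), [], deque()
    for hit in hits:
        if hit not in seen:
            seen.add(hit)
            ordered.append(hit)
            queue.append((hit, 0))
    while queue and len(ordered) < max_chunks:
        symbol, depth = queue.popleft()
        if depth == hops:
            continue  # do not follow edges past the hop limit
        for callee in call_graph.get(symbol, []):
            if callee not in seen and len(ordered) < max_chunks:
                seen.add(callee)
                ordered.append(callee)
                queue.append((callee, depth + 1))
    return ordered[:max_chunks]


def reverse_graph(call_graph: dict[str, list[str]]) -> dict[str, list[str]]:
    """Invert callee edges to answer 'who invokes X?' questions."""
    callers: dict[str, list[str]] = {}
    for caller, callees in call_graph.items():
        for callee in callees:
            callers.setdefault(callee, []).append(caller)
    return callers


call_graph = {
    "handlers::refresh": [
        "token_service::validate_refresh_token",
        "token_service::issue_token",
    ],
    "token_service::validate_refresh_token": ["token_service::decode_claims"],
    "token_service::issue_token": ["token_service::decode_claims"],
}

print(expand_callees(["handlers::refresh"], call_graph, hops=1))
print(expand_callees(["handlers::refresh"], call_graph, hops=2))
print(reverse_graph(call_graph)["token_service::decode_claims"])
```

With two hops, decode_claims is reachable through both callees but appears once in the result, which is the deduplication step the list above calls for.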
For repos too large for one-shot context, delegate expansion to an agentic loop: search → read symbol → follow import → summarize → repeat until a budget cap.
Harbor Engineering refactor (worked example)
Harbor’s monorepo (~420k LOC, Rust API + TypeScript admin UI) powered
an on-call Slack bot. Baseline: 512-token fixed chunks, OpenAI
text-embedding-3-small, single vector index, no graph.
| Metric | Before | After |
|---|---|---|
| Correct-file recall@5 | 54% | 87% |
| Answer accuracy (human graded) | 41% | 73% |
| Median retrieval latency | 180 ms | 240 ms |
| Indexed chunks (main branch) | 38,400 | 11,200 |
| Context tokens per answer (p50) | 6,100 | 3,400 |
Changes: tree-sitter Rust/TS grammars; chunk = function or method;
ripgrep-backed BM25 side index; RRF fusion; one-hop callee expansion;
.gitignore respected at ingest; reindex on merge to
main only. Fewer, denser chunks reduced noise and token
spend despite slightly higher latency from expansion.
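The trade-off in the table is easier to judge as relative change. The short calculation below uses only the before and after figures reported above; the helper name relative_change is hypothetical.

```python
def relative_change(old: float, new: float) -> float:
    """Fractional change from old to new, e.g. -0.25 means a 25% reduction."""
    return (new - old) / old


before = {"latency_ms": 180, "chunks": 38_400, "context_tokens": 6_100}
after = {"latency_ms": 240, "chunks": 11_200, "context_tokens": 3_400}

for metric in before:
    print(f"{metric}: {relative_change(before[metric], after[metric]):+.1%}")

# Recall and accuracy are already percentages, so report percentage points.
print("recall@5 gain (points):", 87 - 54)
print("accuracy gain (points):", 73 - 41)
```

Roughly 71% fewer chunks and 44% fewer context tokens per answer were bought with about a one-third increase in median retrieval latency, alongside gains of 33 points in recall@5 and 32 points in accuracy.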
Technique decision table
| Approach | Best when | Weak when |
|---|---|---|
| Fixed-size file chunks | Prototypes, <50 files, docs-heavy repos | Large polyglot monorepos, symbol lookup questions |
| AST function chunks + metadata | Most production code assistants | Heavy macro/metaprogramming where AST lies |
| Lexical-only (ripgrep/BM25) | Exact symbol name known, IDE-style jump-to-def | Behavioral “how does X work” questions |
| Dense-only vector | Comment-rich code, conceptual similarity | Rare identifiers, config keys, error codes |
| Graph expansion | Call-chain and data-flow questions | Dynamic dispatch, reflection, plugin systems |
| Agentic multi-step search | Cross-service traces, refactors spanning dozens of files | Low-latency chat, strict cost caps |
| Full-repo prompt (no RAG) | Tiny libraries <32k tokens | Any real monorepo |
Common pitfalls
- Indexing build artifacts — doubles index size and surfaces gibberish; honor .gitignore and custom deny lists.
- Stale branch chunks — mixing main and feature-branch symbols confuses answers; scope indexes per branch or default to merged main only.
- Split signatures from bodies — never embed a function body without its signature line; models hallucinate parameters.
- Failing to exclude binary and generated files — minified JS and SQL migrations create junk hits.
- Secret leakage — RAG over private repos still risks exfiltration via prompts; enforce per-team ACLs through metadata filters.
- Over-expanding context — ten imported files drown the answer; cap expansion and rank by edge weight.
- Skipping eval on real tickets — synthetic coding puzzles mislead; harvest questions from Slack, GitHub issues, and PR comments.
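The artifact and generated-file pitfalls above are usually handled by a path filter that runs before parsing. Here is a minimal sketch of a custom deny list; the specific directory names, suffixes, and the function should_index are assumptions for illustration. It does not parse .gitignore itself, so in practice it would run on top of a file listing that already respects ignore rules.

```python
from pathlib import PurePosixPath

# Hypothetical deny list; tune per repository.
DENY_DIRS = {"node_modules", "target", "dist", "build", ".git"}
DENY_NAMES = {"package-lock.json", "Cargo.lock", "yarn.lock"}
DENY_SUFFIXES = (".min.js", ".lock", ".png", ".jpg", ".wasm")


def should_index(path: str) -> bool:
    """Return False for vendored, generated, binary, or lockfile paths."""
    p = PurePosixPath(path)
    # Check directory components only, so a file named build.rs is still allowed.
    if any(part in DENY_DIRS for part in p.parts[:-1]):
        return False
    if p.name in DENY_NAMES:
        return False
    return not p.name.endswith(DENY_SUFFIXES)


paths = [
    "auth/token_service.rs",
    "target/debug/build.rs",
    "admin/node_modules/react/index.js",
    "admin/static/app.min.js",
    "Cargo.lock",
    "build.rs",
]
print([p for p in paths if should_index(p)])
```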
Production checklist
- Define supported languages and tree-sitter grammars; fail closed on unknown extensions.
- Chunk at function/class boundaries with signature headers on splits.
- Attach symbol metadata and commit SHA to every vector payload.
- Run secret scanning and license checks before ingest.
- Build BM25 + dense indexes; fuse with RRF or rerank top 30.
- Implement one-hop caller/callee expansion with dedup before synthesis.
- Reindex incrementally on push to default branch; expire stale branch indexes with a TTL.
- Measure recall@k on file path and human answer accuracy weekly.
- Log retrieved paths and commits for debugging wrong answers.
- Document escalation to agentic search when single-shot retrieval fails.
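For the measurement item in the checklist, correct-file recall@k can be computed from logged retrievals. The sketch below assumes one gold file path per question and counts a question as a hit when that path appears among the top k retrieved paths; that definition, the function name, and the sample data are assumptions, since the exact metric definition is not spelled out here.

```python
def correct_file_recall_at_k(results: list[tuple[list[str], str]], k: int = 5) -> float:
    """Fraction of questions whose gold file appears in the top-k retrieved paths.

    results: one (retrieved_paths_in_rank_order, gold_path) pair per question.
    """
    if not results:
        return 0.0
    hits = sum(1 for retrieved, gold in results if gold in retrieved[:k])
    return hits / len(results)


# Hypothetical eval rows harvested from tickets.
eval_rows = [
    (["README.md", "auth/token_service.rs"], "auth/token_service.rs"),   # hit
    (["api/middleware.rs", "README.md"], "auth/token_service.rs"),       # miss
    (["api/handlers.rs"], "api/handlers.rs"),                            # hit
]
print(round(correct_file_recall_at_k(eval_rows, k=5), 3))
```

Tracking this number per week alongside human-graded answer accuracy makes regressions visible after index or model changes.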
Key takeaways
- Source code needs syntax-aware chunks and symbol metadata — fixed token windows over raw files break functions and miss imports.
- Hybrid lexical plus dense retrieval outperforms either alone on real developer questions that mix behavior and identifier names.
- One-hop dependency expansion along call and import edges fixes cross-file answers that vector search alone cannot reach.
- Harbor Engineering raised correct-file recall@5 from 54% to 87% with AST chunks, BM25 fusion, and callee expansion — fewer chunks, lower token spend.
- Treat repos as versioned, ACL-scoped indexes: respect .gitignore, pin commit SHA, and never ingest secrets or build output.
Related reading
- RAG chunking strategies explained — fixed, semantic, and parent-child patterns adapted for prose corpora
- Hybrid search explained — BM25 plus dense fusion mechanics shared with repo indexes
- LLM embeddings explained — bi-encoder choice and similarity behavior for code passages
- Agentic RAG explained — multi-step retrieval when single-shot codebase search is insufficient