
LLM codebase RAG explained

Harbor Engineering’s internal code assistant answered “Where do we validate JWT expiry before issuing refresh tokens?” with a chunk from README.md and two unrelated middleware files. The real logic lived in auth/token_service.rs, called from a handler three imports away. Document-style RAG — fixed 512-token windows over raw files — split the validation function across four chunks and never surfaced the call graph edge that mattered. On a 150-question eval set drawn from on-call tickets, correct-file recall@5 was 54% and answer accuracy was 41%. After rebuilding the index with AST-aware chunks, symbol metadata, and one-hop dependency expansion at query time, recall@5 rose to 87% and accuracy to 73% without changing the embedding model.

Codebase RAG retrieves source units — functions, classes, modules, config blocks — instead of prose paragraphs. Code has syntax, scopes, imports, and identifiers that fixed-size text splitting destroys. Production pipelines parse with tree-sitter or language servers, attach symbol tables, combine BM25 on identifiers with dense embeddings on docstrings and bodies, and expand context along import/call edges before synthesis. This guide covers why repo RAG differs from document RAG, chunking and metadata design, hybrid retrieval, graph expansion, the Harbor Engineering refactor, a technique decision table, pitfalls, and a checklist — complementing chunking strategy, hybrid search, and agentic RAG for multi-step repo navigation.

Why code is not a document corpus

Naive RAG treats a repository like a folder of text files. That fails for predictable reasons:

  • Syntax boundaries — splitting mid-function produces chunks that do not parse, even mentally; the model sees half an if block and invents the rest.
  • Identifier sparsity — dense embeddings underweight exact names like validate_refresh_token; BM25 or ripgrep-style lexical search often wins on symbol lookup.
  • Cross-file dependencies — the answer may be a 12-line helper in lib.rs while the question references a route in api/handlers.rs.
  • Generated and vendored code — node_modules, target/, protobuf output, and lockfiles pollute indexes unless excluded at ingest.
  • Version drift — stale chunks from another branch are worse than no retrieval; tie chunks to commit SHA and path.

Codebase RAG is retrieval engineering plus lightweight program analysis, not a bigger embedding model.

AST-aware chunking

Parse each file with a grammar-aware parser (tree-sitter is the common choice: one grammar per language, incremental re-parse on edit). Emit chunks at natural boundaries:

  • Primary unit — function, method, struct, enum, or class (language-dependent).
  • Overflow policy — if a function exceeds your token budget (~300–800 tokens for many code models), split on inner blocks (loops, match arms) while duplicating the signature in each child chunk as a header.
  • Module-level context — prepend imports, module doc comment, and enclosing type signature so standalone chunks remain readable.
  • Tests — index test functions separately with metadata kind: test; they often document intent better than production code.
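
The boundary rules above can be sketched in a few lines. This is a minimal illustration, not a production chunker: it uses Python's standard-library ast module as a stand-in for tree-sitter so it runs without dependencies, it handles Python source only, and it skips the overflow policy. The Chunk type, chunk_source, and the sample module are hypothetical names invented for the example.

```python
import ast
from dataclasses import dataclass


@dataclass
class Chunk:
    symbol: str       # e.g. "TokenService.refresh"
    kind: str         # "function" or "test"
    start_line: int
    end_line: int
    text: str         # context header + source of the unit


def chunk_source(source: str) -> list[Chunk]:
    """Emit one chunk per top-level function and per method."""
    tree = ast.parse(source)
    lines = source.splitlines()
    # Module-level imports become a context header on every chunk.
    imports = [ast.get_source_segment(source, n) for n in tree.body
               if isinstance(n, (ast.Import, ast.ImportFrom))]
    funcs = (ast.FunctionDef, ast.AsyncFunctionDef)
    chunks: list[Chunk] = []

    def emit(node, symbol, header):
        kind = "test" if node.name.startswith("test_") else "function"
        body = lines[node.lineno - 1:node.end_lineno]
        chunks.append(Chunk(symbol, kind, node.lineno, node.end_lineno,
                            "\n".join(header + body)))

    for node in tree.body:
        if isinstance(node, funcs):
            emit(node, node.name, imports)
        elif isinstance(node, ast.ClassDef):
            # Methods carry the enclosing class line so they read standalone.
            class_line = lines[node.lineno - 1]
            for child in node.body:
                if isinstance(child, funcs):
                    emit(child, f"{node.name}.{child.name}",
                         imports + [class_line])
    return chunks


SAMPLE = '''import time

def is_expired(exp):
    return exp < time.time()

class TokenService:
    def refresh(self, token):
        return token

def test_is_expired():
    assert is_expired(0)
'''

for c in chunk_source(SAMPLE):
    print(c.symbol, c.kind, f"{c.start_line}-{c.end_line}")
```

With tree-sitter the traversal would walk grammar-specific node types instead of ast classes, but the shape is the same: find the unit, slice its line range, and prepend the header.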

Store stable chunk IDs: {repo}:{commit}:{path}#{symbol}:{start_line}-{end_line}. Reindex jobs diff by path and symbol hash rather than re-embedding the whole repo on every push.
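
As a rough sketch of the ID format and the diff step, assuming the indexer keeps a map from (path, symbol) to a content hash for the previous and current commit. The helper names, the SHA-256 choice, and the whitespace normalization are assumptions made for illustration, and the repo name, commit, and line numbers in the usage lines are made up.

```python
import hashlib


def chunk_id(repo, commit, path, symbol, start_line, end_line):
    """Format an ID as {repo}:{commit}:{path}#{symbol}:{start_line}-{end_line}."""
    return f"{repo}:{commit}:{path}#{symbol}:{start_line}-{end_line}"


def symbol_hash(text: str) -> str:
    """Hash chunk text, ignoring trailing whitespace so such edits do not re-embed."""
    normalized = "\n".join(line.rstrip() for line in text.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def diff_index(old: dict, new: dict):
    """old and new map (path, symbol) -> symbol_hash.

    Returns (keys to embed, keys to delete); unchanged symbols appear in neither.
    """
    to_embed = sorted(k for k, h in new.items() if old.get(k) != h)
    to_delete = sorted(k for k in old if k not in new)
    return to_embed, to_delete


print(chunk_id("harbor", "abc123", "auth/token_service.rs",
               "validate_refresh_token", 40, 72))

old = {("a.rs", "f"): symbol_hash("fn f() {}"),
       ("a.rs", "g"): symbol_hash("fn g() {}")}
new = {("a.rs", "f"): symbol_hash("fn f() {}  "),   # whitespace-only change
       ("a.rs", "h"): symbol_hash("fn h() {}")}     # new symbol, g removed
print(diff_index(old, new))
```

Because the ID embeds the commit and line range, it changes whenever a function moves; keying the diff on path and symbol hash is what keeps unchanged functions from being re-embedded.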

Symbol tables and metadata payloads

Every chunk carries structured metadata for pre-filtering and display:

  • language, path, repo, commit_sha, branch (optional)
  • symbol_name, symbol_kind (function, class, trait, macro)
  • exports — public vs private
  • imports — resolved module paths where the indexer can compute them
  • callers / callees — static call graph edges from tree-sitter queries or LSP cross-references
  • last_modified, author (from git blame, if policy allows)

Filters like path:auth/* AND symbol_kind:function cut search space before ANN, mirroring vector metadata filtering patterns. Never index secrets: run a pre-ingest scanner for API keys, .env patterns, and PEM blocks; block or redact matches.
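
A minimal sketch of both ideas in that paragraph, under stated assumptions: ChunkMeta carries only a subset of the fields listed above, prefilter stands in for whatever filter syntax the vector store exposes, and the three secret patterns are illustrative rather than a complete scanner. A dedicated secret-scanning tool would cover far more cases.

```python
import re
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass
class ChunkMeta:
    path: str
    language: str
    symbol_name: str
    symbol_kind: str
    commit_sha: str


def prefilter(chunks, path_glob="*", symbol_kind=None):
    """Narrow candidates by path glob and symbol kind before vector search."""
    return [c for c in chunks
            if fnmatchcase(c.path, path_glob)
            and (symbol_kind is None or c.symbol_kind == symbol_kind)]


# Illustrative patterns only: PEM private keys, AWS access key IDs,
# and key/secret/password assignments with a value of 8+ characters.
SECRET_PATTERNS = [
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    re.compile(r"AKIA[0-9A-Z]{16}"),
    re.compile(r"(?i)(api[_-]?key|secret|password)\s*[:=]\s*['\"]?[^\s'\"]{8,}"),
]


def contains_secret(text: str) -> bool:
    """True if any pattern matches; the caller blocks or redacts the chunk."""
    return any(p.search(text) for p in SECRET_PATTERNS)


chunks = [
    ChunkMeta("auth/token_service.rs", "rust", "validate_refresh_token", "function", "abc123"),
    ChunkMeta("auth/token_service.rs", "rust", "TokenService", "struct", "abc123"),
    ChunkMeta("api/handlers.rs", "rust", "refresh", "function", "abc123"),
]
print([c.symbol_name for c in prefilter(chunks, "auth/*", "function")])
print(contains_secret('API_KEY = "sk_live_0123456789abcdef"'))
```

Note that fnmatch-style globs let * match path separators, so auth/* also matches files in nested directories under auth/.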

Hybrid retrieval: identifiers plus semantics

A practical query pipeline runs three channels and fuses results:

  1. Lexical — a BM25 index or a ripgrep scan over symbol names, paths, and comments. Strong for “find RefreshTokenService” queries.
  2. Dense — bi-encoder embeddings on signature + docstring + body. Use a code-tuned model (e.g. voyage-code, jina-embeddings-v2-base-code) or a general model with instruction prefix “Represent this code for retrieval.”
  3. Path heuristics — boost chunks whose path tokens overlap query terms (auth, jwt, middleware).
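
To make the lexical channel concrete, here is a small sketch of identifier-aware tokenization feeding a textbook Okapi BM25 scorer. It is an illustration under assumptions, not a description of any particular search library: the tokenize rules, the BM25 class, the k1 and b defaults, and the three sample documents are all chosen for the example.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Split on non-alphanumerics, then on camelCase and acronym boundaries."""
    tokens = []
    for piece in re.findall(r"[A-Za-z0-9]+", text):
        tokens += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", piece)
    return [t.lower() for t in tokens]


class BM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.tfs = [Counter(tokenize(d)) for d in docs]
        self.lengths = [sum(tf.values()) for tf in self.tfs]
        self.avg_len = sum(self.lengths) / len(docs)
        n = len(docs)
        df = Counter(term for tf in self.tfs for term in tf)
        self.idf = {t: math.log(1 + (n - c + 0.5) / (c + 0.5))
                    for t, c in df.items()}

    def score(self, query: str, i: int) -> float:
        tf = self.tfs[i]
        norm = 1 - self.b + self.b * self.lengths[i] / self.avg_len
        return sum(self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + self.k1 * norm)
                   for t in tokenize(query) if t in tf)

    def rank(self, query: str) -> list[int]:
        """Indices of documents with a nonzero score, best first."""
        scored = [(self.score(query, i), i) for i in range(len(self.tfs))]
        return [i for s, i in sorted(scored, key=lambda x: -x[0]) if s > 0]


docs = [
    "auth/token_service.rs validate_refresh_token",
    "api/handlers.rs refreshHandler",
    "README.md project overview",
]
print(tokenize("RefreshTokenService"))
print(BM25(docs).rank("RefreshTokenService"))
```

Splitting identifiers is what lets a query for RefreshTokenService match validate_refresh_token; without it the two share no tokens at all.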

Merge with reciprocal rank fusion (RRF) or a lightweight cross-encoder reranker on the top 30 candidates. Cap final context at what your synthesis model tolerates — often 4–8 functions plus one hop of callees. See embedding fundamentals for model trade-offs.
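
Reciprocal rank fusion is short enough to show in full. Each channel contributes 1 / (k + rank) for every chunk it returns, and the sums decide the final order. The sketch below assumes each channel hands back an ordered list of chunk IDs; k = 60 is the constant commonly used for RRF, and the function name, the alphabetical tie-break, and the placeholder IDs are choices made for this example.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=30):
    """Fuse ranked lists of chunk IDs with reciprocal rank fusion.

    rankings: one list per channel, best hit first.
    Returns up to top_n chunk IDs ordered by fused score.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Ties are broken alphabetically so the output is deterministic.
    return sorted(scores, key=lambda c: (-scores[c], c))[:top_n]


lexical = ["a", "b", "c"]
dense = ["b", "d", "a"]
path_boost = ["b", "a"]
print(rrf_fuse([lexical, dense, path_boost]))
```

RRF only looks at ranks, so BM25 scores and cosine similarities never need to be put on a common scale. That is the main reason to try it before reaching for a learned reranker.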

Dependency expansion at query time

Top-k vector hits alone miss callers and callees. After initial retrieval:

  • Callee expansion — for each hit function, attach definitions of symbols it calls within the same repo (one hop default, two hops for small services).
  • Caller expansion — when the question is “who invokes X?”, reverse edges from the static graph beat semantic search.
  • Interface files — always include related .proto, OpenAPI specs, or trait definitions when a handler implements them.
  • Deduplication — expansion duplicates overlap; apply chunk deduplication before packing context.
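
A sketch of hop-limited expansion with deduplication, assuming the indexer has already produced a static call graph as a dict from symbol to the symbols it calls. The function names, the max_chunks cap, and the symbol names in the sample graph are hypothetical.

```python
def expand_callees(hits, callees, hops=1, max_chunks=8):
    """Breadth-first expansion along callee edges.

    hits: retrieved symbols, best first. callees: symbol -> list of callees.
    Duplicates are dropped and the original retrieval order is kept in front.
    """
    ordered = list(dict.fromkeys(hits))   # dedupe while preserving order
    seen = set(ordered)
    frontier = list(ordered)
    for _ in range(hops):
        next_frontier = []
        for symbol in frontier:
            for callee in callees.get(symbol, []):
                if callee not in seen:
                    seen.add(callee)
                    next_frontier.append(callee)
        ordered.extend(next_frontier)
        frontier = next_frontier
    return ordered[:max_chunks]


def invert(callees):
    """Build the reverse graph used for 'who invokes X?' questions."""
    callers = {}
    for caller, targets in callees.items():
        for target in targets:
            callers.setdefault(target, []).append(caller)
    return callers


graph = {
    "handlers::refresh": ["token_service::issue_refresh_token"],
    "token_service::issue_refresh_token": ["token_service::validate_expiry",
                                           "clock::now"],
}
print(expand_callees(["handlers::refresh", "handlers::refresh"], graph))
print(expand_callees(["handlers::refresh"], graph, hops=2))
print(invert(graph)["clock::now"])
```

Putting the cap after expansion means retrieved hits always survive truncation and expanded callees are the first to be dropped.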

For repos too large for one-shot context, delegate expansion to an agentic loop: search → read symbol → follow import → summarize → repeat until a budget cap.

Harbor Engineering refactor (worked example)

Harbor’s monorepo (~420k LOC, Rust API + TypeScript admin UI) powered an on-call Slack bot. Baseline: 512-token fixed chunks, OpenAI text-embedding-3-small, single vector index, no graph.

| Metric | Before | After |
| --- | --- | --- |
| Correct-file recall@5 | 54% | 87% |
| Answer accuracy (human graded) | 41% | 73% |
| Median retrieval latency | 180 ms | 240 ms |
| Indexed chunks (main branch) | 38,400 | 11,200 |
| Context tokens per answer (p50) | 6,100 | 3,400 |

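Changes: tree-sitter Rust/TS grammars; chunk = function or method; ripgrep-backed BM25 side index; RRF fusion; one-hop callee expansion; .gitignore respected at ingest; reindex on merge to main only. The index shrank by about 71% (38,400 to 11,200 chunks) and median context per answer fell by about 44% (6,100 to 3,400 tokens), so fewer, denser chunks cut both noise and token spend. The cost was 60 ms of extra median retrieval latency from expansion.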

Technique decision table

| Approach | Best when | Weak when |
| --- | --- | --- |
| Fixed-size file chunks | Prototypes, <50 files, docs-heavy repos | Large polyglot monorepos, symbol lookup questions |
| AST function chunks + metadata | Most production code assistants | Heavy macro/metaprogramming where AST lies |
| Lexical-only (ripgrep/BM25) | Exact symbol name known, IDE-style jump-to-def | Behavioral “how does X work” questions |
| Dense-only vector | Comment-rich code, conceptual similarity | Rare identifiers, config keys, error codes |
| Graph expansion | Call-chain and data-flow questions | Dynamic dispatch, reflection, plugin systems |
| Agentic multi-step search | Cross-service traces, refactors spanning dozens of files | Low-latency chat, strict cost caps |
| Full-repo prompt (no RAG) | Tiny libraries <32k tokens | Any real monorepo |

Common pitfalls

  • Indexing build artifacts — doubles index size and surfaces gibberish; honor .gitignore and custom deny lists.
  • Stale branch chunks — mixing main and feature-branch symbols confuses answers; scope indexes per branch or default to merged main only.
  • Split signatures from bodies — never embed a function body without its signature line; models hallucinate parameters.
  • Failing to exclude binary and generated files — minified JS and SQL migrations create junk hits.
  • Secret leakage — RAG over private repos still risks exfiltration via prompts; enforce per-team ACLs through metadata filters on every query.
  • Over-expanding context — ten imported files drown the answer; cap expansion and rank by edge weight.
  • Skipping eval on real tickets — synthetic coding puzzles mislead; harvest questions from Slack, GitHub issues, and PR comments.
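
The first pitfall is usually handled with a deny list applied before any file is parsed. The sketch below is one possible shape, with assumptions stated plainly: the directory names and glob patterns are illustrative defaults rather than a recommended set, should_index is a hypothetical helper, and the code does not read .gitignore itself. A real ingest job would more likely ask git for the list of tracked files and apply the deny list on top.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Illustrative deny lists; tune per repository.
DENY_DIRS = {"node_modules", "target", "dist", "build", ".git"}
DENY_GLOBS = ["*.min.js", "*.lock", "package-lock.json", "*.pb.go", "*_pb2.py"]


def should_index(path: str) -> bool:
    """Reject files under denied directories or matching denied name globs."""
    p = PurePosixPath(path)
    # Only directory components are checked, so a file named build.rs passes.
    if any(part in DENY_DIRS for part in p.parts[:-1]):
        return False
    return not any(fnmatchcase(p.name, glob) for glob in DENY_GLOBS)


for path in ["auth/token_service.rs", "admin/node_modules/react/index.js",
             "target/debug/build.rs", "Cargo.lock", "build.rs"]:
    print(path, should_index(path))
```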

Production checklist

  • Define supported languages and tree-sitter grammars; fail closed on unknown extensions.
  • Chunk at function/class boundaries with signature headers on splits.
  • Attach symbol metadata and commit SHA to every vector payload.
  • Run secret scanning and license checks before ingest.
  • Build BM25 + dense indexes; fuse with RRF or rerank top 30.
  • Implement one-hop caller/callee expansion with dedup before synthesis.
  • Reindex incrementally on push to default branch; TTL stale branches.
  • Measure recall@k on file path and human answer accuracy weekly.
  • Log retrieved paths and commits for debugging wrong answers.
  • Document escalation to agentic search when single-shot retrieval fails.
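
For the weekly measurement item, correct-file recall@k reduces to a few lines once each eval question has one known gold file. This sketch assumes exactly that, with each eval row being a pair of retrieved paths (best first) and the gold path. The function name and sample rows are made up, and questions with several acceptable files would need a set-based variant.

```python
def correct_file_recall(eval_rows, k=5):
    """Fraction of questions whose gold file appears in the top-k retrieved paths.

    eval_rows: list of (retrieved_paths, gold_path), retrieved best first.
    """
    hits = sum(1 for retrieved, gold in eval_rows if gold in retrieved[:k])
    return hits / len(eval_rows)


rows = [
    # Gold file at rank 3: counts as a hit for k=5.
    (["README.md", "api/middleware.rs", "auth/token_service.rs"],
     "auth/token_service.rs"),
    # Gold file never retrieved: miss.
    (["README.md", "api/middleware.rs"], "auth/token_service.rs"),
    # Gold file at rank 6: miss for k=5, hit for k=6.
    (["a.rs", "b.rs", "c.rs", "d.rs", "e.rs", "auth/token_service.rs"],
     "auth/token_service.rs"),
    # Gold file at rank 1: hit.
    (["lib.rs"], "lib.rs"),
]
print(correct_file_recall(rows, k=5))
print(correct_file_recall(rows, k=6))
```

Logging the retrieved paths next to each score, as the checklist suggests, makes it possible to tell a ranking miss (gold file at rank 6) from a retrieval miss (gold file absent).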

Key takeaways

  • Source code needs syntax-aware chunks and symbol metadata — fixed token windows over raw files break functions and miss imports.
  • Hybrid lexical plus dense retrieval outperforms either alone on real developer questions that mix behavior and identifier names.
  • One-hop dependency expansion along call and import edges fixes cross-file answers that vector search alone cannot reach.
  • Harbor Engineering raised correct-file recall@5 from 54% to 87% with AST chunks, BM25 fusion, and callee expansion — fewer chunks, lower token spend.
  • Treat repos as versioned, ACL-scoped indexes: respect .gitignore, pin commit SHA, and never ingest secrets or build output.
