LLM codebase RAG explained

Harbor Engineering’s internal code assistant answered “Where do we validate JWT expiry before issuing refresh tokens?” with a chunk from README.md and two unrelated middleware files. The real logic lived in auth/token_service.rs, called from a handler three imports away. Document-style RAG — fixed 512-token windows over raw files — split the validation function across four chunks and never surfaced the call graph edge that mattered. On a 150-question eval set drawn from on-call tickets, correct-file recall@5 was 54% and answer accuracy was 41%. After rebuilding the index with AST-aware chunks, symbol metadata, and one-hop dependency expansion at query time, recall@5 rose to 87% and accuracy to 73% without changing the embedding model.

Codebase RAG retrieves source units (functions, classes, modules, config blocks) instead of prose paragraphs. Code has syntax, scopes, imports, and identifiers that fixed-size text splitting destroys. Production pipelines parse with tree-sitter or language servers, attach symbol tables, combine BM25 on identifiers with dense embeddings on docstrings and bodies, and expand context along import/call edges before synthesis. This guide covers why repo RAG differs from document RAG, chunking and metadata design, hybrid retrieval, graph expansion, the Harbor Engineering refactor, a technique decision table, pitfalls, and a checklist. It complements the material on chunking strategy, hybrid search, and agentic RAG for multi-step repo navigation.

Why code is not a document corpus

Naive RAG treats a repository like a folder of text files. That fails for predictable reasons:

Syntax boundaries — splitting mid-function produces chunks that cannot be read as complete units; the model sees half an if block and invents the rest.
Identifier sparsity — dense embeddings underweight exact names like validate_refresh_token; BM25 or ripgrep-style lexical search often wins on symbol lookup.
Cross-file dependencies — the answer may be a 12-line helper in lib.rs while the question references a route in api/handlers.rs.
Generated and vendored code — node_modules, target/, protobuf output, and lockfiles pollute indexes unless excluded at ingest.
Version drift — stale chunks from another branch are worse than no retrieval; tie chunks to commit SHA and path.

Codebase RAG is retrieval engineering plus lightweight program analysis, not a bigger embedding model.
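
The exclusion rule for generated and vendored code in the list above could be sketched as a small ingest filter. This is a minimal illustration, not a complete policy: the names `DENY_DIRS`, `DENY_GLOBS`, and `should_index` are hypothetical, and the deny lists are assumptions to tune per repository (a real pipeline would also honor `.gitignore`).

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Hypothetical deny rules; extend or trim these per repository.
DENY_DIRS = {"node_modules", "target", "dist", "build", ".git", "vendor"}
DENY_GLOBS = ["*.lock", "package-lock.json", "*.min.js", "*_pb2.py", "*.pb.go"]


def should_index(path: str) -> bool:
    """Return False for vendored, generated, or lockfile paths."""
    p = PurePosixPath(path)
    # Reject the file if any parent directory is on the deny list.
    if any(part in DENY_DIRS for part in p.parts[:-1]):
        return False
    # Reject the file if its name matches a generated-file pattern.
    return not any(fnmatchcase(p.name, pattern) for pattern in DENY_GLOBS)
```

Running the filter before parsing keeps build output and lockfiles out of both the lexical and the dense index.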

AST-aware chunking

Parse each file with a grammar-aware parser (tree-sitter is the common choice: one grammar per language, incremental re-parse on edit). Emit chunks at natural boundaries:

Primary unit — function, method, struct, enum, or class (language-dependent).
Overflow policy — if a function exceeds your token budget (~300–800 tokens for many code models), split on inner blocks (loops, match arms) while duplicating the signature in each child chunk as a header.
Module-level context — prepend imports, module doc comment, and enclosing type signature so standalone chunks remain readable.
Tests — index test functions separately with metadata kind: test; they often document intent better than production code.
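
The boundary rules above can be illustrated with Python's standard `ast` module standing in for tree-sitter. This is a sketch under that assumption: it handles Python source only, emits one chunk per top-level function or class, prepends the module's imports as context, and tags `test_` functions. The function name `chunk_python_source` and the chunk dictionary keys are hypothetical, and the overflow policy for oversized functions is left out.

```python
import ast


def chunk_python_source(source: str, path: str) -> list[dict]:
    """Emit one chunk per top-level function or class in a Python file."""
    tree = ast.parse(source)
    lines = source.splitlines()

    # Module-level context: collect import statements to prepend to each chunk.
    imports = [
        ast.get_source_segment(source, node)
        for node in tree.body
        if isinstance(node, (ast.Import, ast.ImportFrom))
    ]
    header = "\n".join(imports)

    chunks = []
    for node in tree.body:
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            continue
        # Start at the first decorator, if any, so it stays with its definition.
        start = min([node.lineno] + [d.lineno for d in node.decorator_list])
        body = "\n".join(lines[start - 1 : node.end_lineno])
        if node.name.startswith("test_"):
            kind = "test"
        elif isinstance(node, ast.ClassDef):
            kind = "class"
        else:
            kind = "function"
        chunks.append(
            {
                "path": path,
                "symbol": node.name,
                "kind": kind,
                "start_line": start,
                "end_line": node.end_lineno,
                "text": f"{header}\n\n{body}" if header else body,
            }
        )
    return chunks
```

A tree-sitter version would follow the same shape, with a per-language query selecting the node types that count as primary units.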

Store stable chunk IDs: {repo}:{commit}:{path}#{symbol}:{start_line}-{end_line}. Reindex jobs diff by path and symbol hash rather than re-embedding the whole repo on every push.
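
A minimal sketch of the ID format and the hash-based diff described above. The helper names (`chunk_id`, `symbol_hash`, `diff_index`) and the choice of SHA-256 over chunk text are assumptions for illustration; the repo name and commit in the usage note are made up.

```python
import hashlib


def chunk_id(repo: str, commit: str, path: str, symbol: str,
             start_line: int, end_line: int) -> str:
    """Build an ID in the {repo}:{commit}:{path}#{symbol}:{start}-{end} format."""
    return f"{repo}:{commit}:{path}#{symbol}:{start_line}-{end_line}"


def symbol_hash(text: str) -> str:
    """Content hash used to detect whether a symbol's source changed."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def diff_index(old: dict, new: dict) -> tuple[list, list]:
    """Compare two {(path, symbol): hash} maps.

    Returns (keys to re-embed, keys to delete from the index).
    """
    to_embed = [key for key, digest in new.items() if old.get(key) != digest]
    to_delete = [key for key in old if key not in new]
    return to_embed, to_delete
```

For example, `chunk_id("harbor", "abc123", "auth/token_service.rs", "validate_refresh_token", 10, 42)` yields `harbor:abc123:auth/token_service.rs#validate_refresh_token:10-42`. Only keys returned in `to_embed` need a new embedding call on each push.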

Symbol tables and metadata payloads

Every chunk carries structured metadata for pre-filtering and display:

language, path, repo, commit_sha, branch (optional)
symbol_name, symbol_kind (function, class, trait, macro)
exports — public vs private
imports — resolved module paths where the indexer can compute them
callers / callees — static call graph edges from tree-sitter queries or LSP cross-references
last_modified, author (from git blame, if policy allows)

Filters like path:auth/* AND symbol_kind:function cut search space before ANN, mirroring vector metadata filtering patterns. Never index secrets: run a pre-ingest scanner for API keys, .env patterns, and PEM blocks; block or redact matches.

Hybrid retrieval: identifiers plus semantics

A practical query pipeline runs three channels and fuses results:

Lexical — a BM25 index or ripgrep-style search over symbol names, paths, and comments. Strong for “find RefreshTokenService” queries.
Dense — bi-encoder embeddings on signature + docstring + body. Use a code-tuned model (e.g. voyage-code, jina-embeddings-v2-base-code) or a general model with instruction prefix “Represent this code for retrieval.”
Path heuristics — boost chunks whose path tokens overlap query terms (auth, jwt, middleware).
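
The lexical channel depends on how identifiers are tokenized. The sketch below splits snake_case and camelCase names before scoring with a small in-memory BM25, so a query for RefreshTokenService can match on refresh, token, and service. The `tokenize` function and `BM25` class are hypothetical stand-ins for a real search engine, and k1 = 1.5, b = 0.75 are conventional defaults rather than values from the source.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Split text into lowercase tokens at snake_case and camelCase boundaries."""
    tokens = []
    for word in re.findall(r"[A-Za-z0-9_]+", text):
        for part in word.split("_"):
            pieces = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", part)
            tokens.extend(piece.lower() for piece in pieces)
    return tokens


class BM25:
    def __init__(self, docs: list[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.n = len(docs)
        self.tfs = [Counter(tokenize(d)) for d in docs]
        self.lengths = [sum(tf.values()) for tf in self.tfs]
        self.avg_len = sum(self.lengths) / max(self.n, 1)
        # Document frequency: number of docs containing each term.
        self.df = Counter(term for tf in self.tfs for term in tf)

    def score(self, query: str, i: int) -> float:
        tf, length = self.tfs[i], self.lengths[i]
        total = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (self.n - self.df[term] + 0.5) / (self.df[term] + 0.5))
            norm = tf[term] + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            total += idf * tf[term] * (self.k1 + 1) / norm
        return total

    def rank(self, query: str, top_k: int = 5) -> list[int]:
        """Return indexes of matching docs, best first; zero-score docs are dropped."""
        scored = [(self.score(query, i), i) for i in range(self.n)]
        scored.sort(key=lambda pair: -pair[0])
        return [i for s, i in scored if s > 0][:top_k]
```

Without the identifier split, `RefreshTokenService` is a single rare token and only matches documents that spell it exactly the same way.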

Merge with reciprocal rank fusion (RRF) or a lightweight cross-encoder reranker on the top 30 candidates. Cap final context at what your synthesis model tolerates — often 4–8 functions plus one hop of callees. See embedding fundamentals for model trade-offs.
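
Reciprocal rank fusion needs only the rank positions from each channel, not comparable scores. A minimal sketch follows; the function name is hypothetical, `k = 60` is a commonly used constant rather than a value given in the source, and `top_n = 30` mirrors the reranker cutoff mentioned above.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60,
                           top_n: int = 30) -> list[str]:
    """Fuse ranked lists of chunk IDs: score(c) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk in enumerate(ranking, start=1):
            scores[chunk] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so output is deterministic.
    return sorted(scores, key=lambda c: (-scores[c], c))[:top_n]
```

With lexical results `["a", "b", "c"]`, dense results `["b", "d", "a"]`, and path-heuristic results `["b"]`, the fused order is `["b", "a", "d", "c"]`: chunks that appear in several channels rise above chunks that rank well in only one.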

Dependency expansion at query time

Top-k vector hits alone miss callers and callees. After initial retrieval:

Callee expansion — for each hit function, attach definitions of symbols it calls within the same repo (one hop default, two hops for small services).
Caller expansion — when the question is “who invokes X?”, reverse edges from the static graph beat semantic search.
Interface files — always include related .proto, OpenAPI specs, or trait definitions when a handler implements them.
Deduplication — expansion pulls in overlapping and repeated chunks; apply chunk deduplication before packing context.
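
Callee expansion with deduplication can be sketched as a bounded breadth-first walk over the static call graph. The `callees` map is assumed to come from the indexer's metadata; the function name, the `max_chunks` cap, and the symbol names in the usage note are hypothetical.

```python
def expand_callees(hits: list[str], callees: dict[str, list[str]],
                   hops: int = 1, max_chunks: int = 12) -> list[str]:
    """Return hit symbols followed by their callees, deduplicated and capped.

    Initial hits keep their retrieval order; callees are appended hop by hop.
    """
    ordered: list[str] = []
    seen: set[str] = set()
    frontier = list(hits)
    for _ in range(hops + 1):  # depth 0 is the hits themselves
        next_frontier: list[str] = []
        for symbol in frontier:
            if symbol in seen:
                continue  # dedup: each symbol is packed at most once
            if len(ordered) >= max_chunks:
                return ordered
            seen.add(symbol)
            ordered.append(symbol)
            next_frontier.extend(callees.get(symbol, []))
        frontier = next_frontier
    return ordered
```

Given a graph where `issue_refresh` calls `validate_refresh_token` and `load_user`, and the initial hits are `issue_refresh` and `load_user`, one hop returns `issue_refresh`, `load_user`, `validate_refresh_token`, with `load_user` included only once. Caller expansion is the same walk over the reversed edge map.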

For repos too large for one-shot context, delegate expansion to an agentic loop: search → read symbol → follow import → summarize → repeat until a budget cap.

Harbor Engineering refactor (worked example)

Harbor’s monorepo (~420k LOC, Rust API + TypeScript admin UI) powered an on-call Slack bot. Baseline: 512-token fixed chunks, OpenAI text-embedding-3-small, single vector index, no graph.

Metric	Before	After
Correct-file recall@5	54%	87%
Answer accuracy (human graded)	41%	73%
Median retrieval latency	180 ms	240 ms
Indexed chunks (main branch)	38,400	11,200
Context tokens per answer (p50)	6,100	3,400

Changes: tree-sitter Rust/TS grammars; chunk = function or method; ripgrep-backed BM25 side index; RRF fusion; one-hop callee expansion; .gitignore respected at ingest; reindex on merge to main only. Fewer, denser chunks reduced noise and token spend, at the cost of 60 ms more median retrieval latency (180 ms to 240 ms) from expansion.
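
The relative size of those changes follows directly from the table. The short calculation below uses only the table's figures; the variable and function names are illustrative.

```python
# Figures copied from the before/after table above.
before = {"chunks": 38_400, "context_tokens_p50": 6_100, "latency_ms": 180}
after = {"chunks": 11_200, "context_tokens_p50": 3_400, "latency_ms": 240}


def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent, rounded to one decimal."""
    return round((new - old) / old * 100, 1)


changes = {name: pct_change(before[name], after[name]) for name in before}
recall_gain_points = 87 - 54    # correct-file recall@5, percentage points
accuracy_gain_points = 73 - 41  # answer accuracy, percentage points
```

The index shrank by about 70.8%, median context tokens fell by about 44.3%, and median retrieval latency rose by about 33.3%, alongside gains of 33 points in recall@5 and 32 points in answer accuracy.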

Technique decision table

Approach	Best when	Weak when
Fixed-size file chunks	Prototypes, <50 files, docs-heavy repos	Large polyglot monorepos, symbol lookup questions
AST function chunks + metadata	Most production code assistants	Heavy macro/metaprogramming where the parsed AST differs from the expanded code
Lexical-only (ripgrep/BM25)	Exact symbol name known, IDE-style jump-to-def	Behavioral “how does X work” questions
Dense-only vector	Comment-rich code, conceptual similarity	Rare identifiers, config keys, error codes
Graph expansion	Call-chain and data-flow questions	Dynamic dispatch, reflection, plugin systems
Agentic multi-step search	Cross-service traces, refactors spanning dozens of files	Low-latency chat, strict cost caps
Full-repo prompt (no RAG)	Tiny libraries <32k tokens	Any real monorepo

Common pitfalls

Indexing build artifacts — doubles index size and surfaces gibberish; honor .gitignore and custom deny lists.
Stale branch chunks — mixing main and feature-branch symbols confuses answers; scope indexes per branch or default to merged main only.
Split signatures from bodies — never embed a function body without its signature line; models hallucinate parameters.
Failing to exclude binary and generated files — minified JS and SQL migrations create junk hits.
Secret leakage — RAG over private repos still risks exfiltration through prompts; enforce per-team ACLs as metadata filters at query time.
Over-expanding context — ten imported files drown the answer; cap expansion and rank by edge weight.
Skipping eval on real tickets — synthetic coding puzzles mislead; harvest questions from Slack, GitHub issues, and PR comments.

Production checklist

Define supported languages and tree-sitter grammars; fail closed on unknown extensions.
Chunk at function/class boundaries with signature headers on splits.
Attach symbol metadata and commit SHA to every vector payload.
Run secret scanning and license checks before ingest.
Build BM25 + dense indexes; fuse with RRF or rerank top 30.
Implement one-hop caller/callee expansion with dedup before synthesis.
Reindex incrementally on push to default branch; expire stale branch indexes with a TTL.
Measure recall@k on file path and human answer accuracy weekly.
Log retrieved paths and commits for debugging wrong answers.
Document escalation to agentic search when single-shot retrieval fails.
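
The recall@k item in the checklist above reduces to a few lines once each eval question is paired with the file that holds the correct answer. This is a sketch under that assumption (one gold path per question); the function name and input shapes are hypothetical.

```python
def recall_at_k(retrieved: list[list[str]], expected: list[str], k: int = 5) -> float:
    """Fraction of questions whose gold file path appears in the top-k results.

    retrieved[i] is the ranked list of file paths returned for question i;
    expected[i] is the single correct path for that question.
    """
    if not expected:
        return 0.0
    hits = sum(
        1 for paths, gold in zip(retrieved, expected) if gold in paths[:k]
    )
    return hits / len(expected)
```

Logging the retrieved paths per question alongside this score makes it straightforward to see which misses come from chunking, which from ranking, and which from missing expansion.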

Key takeaways

Source code needs syntax-aware chunks and symbol metadata — fixed token windows over raw files break functions and miss imports.
Hybrid lexical plus dense retrieval outperforms either alone on real developer questions that mix behavior and identifier names.
One-hop dependency expansion along call and import edges fixes cross-file answers that vector search alone cannot reach.
Harbor Engineering raised correct-file recall@5 from 54% to 87% with AST chunks, BM25 fusion, and callee expansion — fewer chunks, lower token spend.
Treat repos as versioned, ACL-scoped indexes: respect .gitignore, pin commit SHA, and never ingest secrets or build output.