
LLM embedding fine-tuning explained

Harbor Support's RAG pipeline indexed 12,000 help articles with a general-purpose embedding model. Recall@10 on a held-out query set sat at 71%, acceptable for generic FAQs but catastrophic on product-specific language: “ACH reversal window,” “provisional credit SLA,” and internal SKU codes retrieved the wrong articles half the time. Swapping to a larger off-the-shelf model lifted recall to 78% at 3x the embedding cost. Fine-tuning the same 110M base encoder on 4,200 labeled query–passage pairs from support tickets pushed recall@10 to 89% with no change in inference latency.

That is embedding fine-tuning: adapting a bi-encoder's vector space so that cosine similarity in your domain correlates with relevance, not just general semantic nearness. It is narrower than fine-tuning the generator and cheaper than running a cross-encoder on every candidate.

This guide covers when embedding adaptation beats alternatives, contrastive training objectives, data collection and hard-negative mining, LoRA and full fine-tune tradeoffs, evaluation design, the Harbor Support retrieval refactor, a technique decision table, pitfalls, and a production checklist.

What embedding fine-tuning changes (and what it does not)

A bi-encoder maps queries and documents independently into a shared vector space. Fine-tuning adjusts the encoder weights so positive query–document pairs sit closer than negatives. You are not teaching the model new facts — you are reshaping the geometry of retrieval.

| Layer | Fine-tuning changes | Does not change |
|---|---|---|
| Encoder weights | Domain vocabulary, acronym alignment, query phrasing patterns | Corpus content (still need fresh indexing) |
| Vector index | Must re-embed all documents after any encoder update | Chunking strategy (orthogonal decision) |
| Downstream LLM | Better passages in context → fewer hallucinations | Generation style, tool use, safety policy |
| Latency / cost | Same as base model if size unchanged | Does not replace reranking for top-3 precision |
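
To make the "geometry" point concrete, here is a minimal sketch of how a bi-encoder ranks documents once the vectors exist. The three-dimensional vectors and document ids are made-up stand-ins for real encoder outputs. Fine-tuning changes the vectors the encoder produces; the ranking step itself stays the same.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_documents(query_vec, doc_vecs):
    """Return document ids sorted by descending cosine similarity to the query."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]

# Toy vectors standing in for encoder outputs (illustrative numbers only).
docs = {
    "ach_reversal": [0.9, 0.1, 0.0],
    "return_label": [0.1, 0.9, 0.1],
}
query = [0.8, 0.2, 0.1]
print(rank_documents(query, docs))
```

Because queries and documents are embedded independently, document vectors can be computed once and stored, which is why an encoder update forces a full re-embed.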

Training objective taxonomy

Most production embedding fine-tunes use contrastive learning. Pick the objective that matches your label quality and batch size.

Contrastive pair types

| Pair type | Structure | Best when | Weakness |
|---|---|---|---|
| Positive query–passage | User query + gold article chunk | Click logs, agent resolutions, human labels | Positives may be noisy (user clicked wrong doc) |
| Triplet (anchor, positive, negative) | Query + relevant + irrelevant passage | Hard negatives available from retrieval misses | Easy negatives teach little; curriculum matters |
| In-batch negatives | Other positives in batch act as negatives | Large batch sizes (64–512 pairs) | False negatives when batch contains similar topics |
| Symmetric (E5-style) | Both sides prefixed “query:” / “passage:” | Mixed query and document retrieval | Requires consistent prefix at inference |

Loss functions

  • InfoNCE / softmax contrastive — standard for in-batch negatives; temperature hyperparameter controls hardness.
  • Triplet margin loss — explicit push-away from hard negatives; sensitive to margin choice.
  • Multiple negatives ranking (MNRL) — popular in sentence-transformers; combines in-batch and mined negatives efficiently.

Harbor Support trained with MNRL on batches of 128, using a temperature of 0.05 and mixed-precision on a single A10. Three epochs over 4,200 pairs took under two hours; gains plateaued by epoch 2 on validation MRR.
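
The in-batch contrastive objective can be sketched in a few lines of plain Python. This is an illustration of the InfoNCE form under the assumption that vectors are already L2-normalized; it is not Harbor's training code. In sentence-transformers, MultipleNegativesRankingLoss exposes the same knob as a scale parameter, which is the inverse of the temperature (a temperature of 0.05 corresponds to a scale of 20).

```python
import math

def info_nce_loss(query_vecs, passage_vecs, temperature=0.05):
    """Mean in-batch contrastive loss; passage i is the positive for query i.

    Every other passage in the batch acts as a negative. Vectors are assumed
    to be L2-normalized, so the dot product equals cosine similarity.
    """
    losses = []
    for i, q in enumerate(query_vecs):
        logits = [sum(a * b for a, b in zip(q, p)) / temperature for p in passage_vecs]
        # Numerically stable log-sum-exp over all passages in the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
        # Cross-entropy with the matching passage as the target class.
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # each query matches its own passage
swapped = [[0.0, 1.0], [1.0, 0.0]]   # each query matches the other passage
print(info_nce_loss(queries, aligned), info_nce_loss(queries, swapped))
```

A lower temperature sharpens the softmax, so near-miss negatives contribute more to the gradient. This is also where false negatives in a batch do the most damage.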

Data collection and hard-negative mining

Embedding fine-tuning lives or dies on training data. A few hundred high-quality pairs beat tens of thousands of synthetic paraphrases.

Positive pair sources

  • Search click logs — query + clicked document, filtered by dwell time and no immediate re-search.
  • Support ticket resolutions — customer question + article the agent actually used.
  • Human relevance labels — annotators score query–passage pairs on a 0–3 scale; keep only 2–3 for positives.
  • Synthetic query generation — LLM writes questions from passages; useful for bootstrapping but validate on real queries.
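
As an illustration of the click-log source, the sketch below keeps a click only when the user stayed on the document and did not search again right away. The field names (dwell_seconds, seconds_to_next_search) and both thresholds are hypothetical; real values depend on your logging schema and traffic.

```python
def select_click_positives(click_events, min_dwell_seconds=30, research_window_seconds=60):
    """Turn raw click events into (query, doc_id) positives.

    Field names and default thresholds are assumptions for illustration.
    """
    positives = []
    for event in click_events:
        long_enough = event["dwell_seconds"] >= min_dwell_seconds
        next_search = event.get("seconds_to_next_search")  # None means no follow-up search
        no_quick_research = next_search is None or next_search > research_window_seconds
        if long_enough and no_quick_research:
            positives.append((event["query"], event["doc_id"]))
    return positives

events = [
    {"query": "ach reversal window", "doc_id": "kb-101", "dwell_seconds": 45, "seconds_to_next_search": None},
    {"query": "provisional credit", "doc_id": "kb-202", "dwell_seconds": 5, "seconds_to_next_search": None},
    {"query": "refund policy", "doc_id": "kb-303", "dwell_seconds": 90, "seconds_to_next_search": 10},
]
print(select_click_positives(events))
```

Pairs that pass a filter like this are still noisy, so a human spot check on a sample is worth the time before training.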

Hard negative mining pipeline

  1. Embed the corpus with the current (base or checkpoint) encoder.
  2. For each training query, retrieve top-20 passages by cosine similarity.
  3. Label or heuristically filter: passages that rank high but are not relevant become hard negatives.
  4. Re-train; repeat mining each epoch (iterative hard-negative curriculum).

Easy negatives (“completely unrelated topic”) inflate offline metrics without improving production recall. Harbor's biggest lift came from mining negatives that the base model ranked 2–5 but humans marked irrelevant — near-miss confusions like “refund policy” vs “return shipping label.”
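
The selection step of the mining pipeline can be sketched as follows, assuming you already have each query's ranked retrieval results and a set of known-relevant ids. The function name and the max_negatives cap are illustrative. The output is a list of candidates: a passage that lacks a relevance label is not necessarily irrelevant, so it should still go through the labeling or heuristic filter described above.

```python
def mine_hard_negatives(ranked_doc_ids, relevant_ids, top_k=20, max_negatives=4):
    """Pick the highest-ranked non-relevant documents from one query's results.

    ranked_doc_ids: doc ids ordered by the current encoder's similarity, best first.
    relevant_ids:   set of doc ids labeled relevant for this query.
    """
    negatives = []
    for doc_id in ranked_doc_ids[:top_k]:
        if doc_id not in relevant_ids:
            negatives.append(doc_id)
        if len(negatives) == max_negatives:
            break
    return negatives

ranked = ["refund_policy", "return_shipping_label", "ach_window", "generic_faq"]
print(mine_hard_negatives(ranked, {"refund_policy"}, max_negatives=2))
```

Walking the ranking from the top means the near-miss confusions are picked before anything from the easy tail.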

Full fine-tune vs LoRA adapters

| Approach | VRAM / time | Typical gain | When to choose |
|---|---|---|---|
| Full fine-tune | Highest; all layers update | Maximum domain shift | Small base model (<400M), large labeled set (>5k pairs) |
| LoRA on attention layers | ~30–40% of full FT memory | 80–95% of full FT recall gain | Default for 100M–1B encoders; multiple domain adapters |
| Last-layer only | Minimal | Small vocabulary tweaks | Quick experiment; rarely enough alone |

LoRA rank 16–32 on the query and value projections was sufficient for Harbor Support. They version adapters per product line (payments vs shipping) and hot-swap them at query time instead of maintaining separate full models.
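
A back-of-the-envelope count shows why adapters are cheap to version and swap. The sketch assumes a BERT-base-sized encoder (hidden size 768, 12 layers) with square query and value projections; these dimensions are assumptions, not Harbor's published configuration.

```python
def lora_trainable_params(hidden_size, rank, num_layers, projections_per_layer=2):
    """Trainable parameters added by LoRA on square attention projections.

    Each adapted projection gets two low-rank matrices:
    A (rank x hidden_size) and B (hidden_size x rank).
    """
    per_projection = 2 * hidden_size * rank
    return per_projection * projections_per_layer * num_layers

def full_projection_params(hidden_size, num_layers, projections_per_layer=2):
    """Weights in the same projections if updated directly (biases ignored)."""
    return hidden_size * hidden_size * projections_per_layer * num_layers

# Assumed dimensions: hidden size 768, 12 layers, rank 16 on query and value.
lora = lora_trainable_params(hidden_size=768, rank=16, num_layers=12)
full = full_projection_params(hidden_size=768, num_layers=12)
print(lora, full, lora / full)
```

Under these assumptions the adapter trains about 0.59M parameters, roughly 4% of the weights in the projections it adapts, so each product-line adapter is a small artifact on top of one shared base.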

Evaluation: metrics that predict production quality

Offline retrieval metrics must mirror your production cutoffs. End-to-end answer quality belongs to RAG evaluation; this section focuses on the retrieval layer.

  • Recall@k — fraction of queries where a relevant passage appears in top-k. Match k to your RAG context window (usually k=5–20).
  • MRR (mean reciprocal rank) — rewards getting the best passage to rank 1; sensitive to single-gold-label setups.
  • nDCG@k — handles graded relevance (partially useful passages).
  • Hit rate on tail queries — slice eval by query frequency; head-query gains can mask tail failures.
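
The metrics above are short enough to implement directly, which helps when you need to slice them by query frequency. This is a minimal sketch using the common log2-discount form of nDCG with linear gains; the function names are illustrative, and recall@k follows the hit-style definition used in the list (at least one relevant passage in the top k).

```python
import math

def recall_at_k(ranked, relevant, k):
    """1.0 if any relevant id appears in the top k, else 0.0."""
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant id, or 0.0 if none was retrieved."""
    for position, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / position
    return 0.0

def ndcg_at_k(ranked, gains, k):
    """gains maps doc id -> graded relevance (e.g. 0-3); missing ids count as 0."""
    dcg = sum(gains.get(doc, 0) / math.log2(i + 1)
              for i, doc in enumerate(ranked[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

def mean(values):
    return sum(values) / len(values)

# Two toy queries: (ranked results, set of relevant ids).
runs = [(["a", "b", "c"], {"b"}), (["d", "e", "f"], {"z"})]
print(mean([recall_at_k(r, rel, 2) for r, rel in runs]))
print(mean([reciprocal_rank(r, rel) for r, rel in runs]))
```

Computing the per-query values first and averaging afterwards makes it easy to report the same metric separately for head and tail queries.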

Hold out queries from a later time period than the training data (temporal split) to catch vocabulary drift. Harbor added a “confusion set” of 80 acronym-heavy queries that never appeared in training; the adapter fine-tune improved this slice from 52% to 84% recall@10, while head queries moved from 91% to 93%.
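
A temporal split is simple to sketch: train on examples dated before a cutoff and evaluate on examples from the cutoff onward, so the eval set contains phrasing the encoder has not seen. The record layout (a dict with a date field) and the cutoff date are assumptions for illustration.

```python
from datetime import date

def temporal_split(examples, cutoff):
    """Train on examples dated before the cutoff; hold out the cutoff date and later."""
    train = [ex for ex in examples if ex["date"] < cutoff]
    held_out = [ex for ex in examples if ex["date"] >= cutoff]
    return train, held_out

examples = [
    {"query": "refund policy", "date": date(2024, 1, 5)},
    {"query": "ach reversal window", "date": date(2024, 3, 20)},
    {"query": "provisional credit sla", "date": date(2024, 6, 1)},
]
train, held_out = temporal_split(examples, cutoff=date(2024, 4, 1))
print(len(train), len(held_out))
```

A random split would mix late and early queries in both sets, which hides the drift the eval is meant to expose.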

Harbor Support retrieval refactor (worked example)

Before fine-tuning, the pipeline was: chunk articles at 512 tokens, embed with bge-small-en-v1.5, store in a vector database, retrieve top-10, pass top-5 to the generator. Failure mode: the correct article existed but ranked 14th on jargon-heavy queries, outside the top-10 cutoff.

  1. Label 4,200 pairs from six months of tickets where agents linked a KB article.
  2. Mine hard negatives from base-model top-20 per query; two annotators agreed on irrelevance.
  3. LoRA fine-tune bge-small for 2 epochs; early stop on validation MRR.
  4. Re-embed entire corpus (12k chunks, ~8 minutes on one GPU).
  5. A/B test for two weeks: fine-tuned index vs baseline, same generator and prompts.

Results: recall@10 +18 points offline; ticket deflection +11% online; no change in p95 embedding latency. They kept a cross-encoder reranker on top-20 for final top-5 selection — fine-tuning and reranking are complementary, not either/or.
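
The two-stage arrangement can be sketched as a pair of sorts. Both scoring functions are passed in as plain callables and are hypothetical stand-ins for the fine-tuned bi-encoder and the cross-encoder; the defaults mirror the top-20 and top-5 cutoffs described above.

```python
def retrieve_then_rerank(query, candidates, bi_encoder_score, reranker_score,
                         retrieve_k=20, final_k=5):
    """Stage 1: cheap bi-encoder shortlist. Stage 2: reranker orders the shortlist."""
    shortlist = sorted(candidates,
                       key=lambda doc: bi_encoder_score(query, doc),
                       reverse=True)[:retrieve_k]
    return sorted(shortlist,
                  key=lambda doc: reranker_score(query, doc),
                  reverse=True)[:final_k]

# Toy scores: the reranker prefers "d", but "d" never reaches the shortlist.
bi_scores = {"a": 4, "b": 3, "c": 2, "d": 1}
rerank_scores = {"a": 1, "b": 5, "c": 9, "d": 100}
result = retrieve_then_rerank(
    "q", ["a", "b", "c", "d"],
    bi_encoder_score=lambda q, d: bi_scores[d],
    reranker_score=lambda q, d: rerank_scores[d],
    retrieve_k=3, final_k=2,
)
print(result)
```

The toy data shows why the two are complementary: the reranker can only reorder what the bi-encoder retrieves, so a document missed in stage 1 is lost regardless of how well the reranker would have scored it.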

Technique decision table: embedding FT vs alternatives

| Problem signal | Try first | Escalate to embedding FT when | Wrong move |
|---|---|---|---|
| Low recall, generic queries work | Better chunking, hybrid BM25 | Domain jargon misses persist after hybrid | Jump to largest embedding model without labels |
| Right doc in top-50, wrong top-5 | Cross-encoder reranker | Reranker latency or cost too high at scale | Fine-tune generator before fixing retrieval |
| Acronym / SKU confusion | Synonym map, query expansion | >500 labeled pairs with consistent patterns | Prompt-only fixes for retrieval geometry |
| Multi-product single index | Metadata filters, per-product indexes | Per-domain LoRA adapters + shared base | One global FT on mixed irrelevant negatives |
| <200 labeled pairs | Few-shot query rewriting, larger off-the-shelf model | Collect more labels before training | Overfit a full fine-tune on tiny data |

Common pitfalls

  • Training on passages, retrieving chunks — positives must match the exact granularity stored in the index.
  • Skipping corpus re-embed — old vectors in the index make fine-tuning invisible to users.
  • Prefix mismatch — E5 and similar models require “query:” / “passage:” prefixes at train and inference.
  • Leaking test queries into mining — hard negatives mined from eval set inflate metrics.
  • Only easy negatives — random corpus samples do not teach fine discrimination.
  • Optimizing recall@100 — production uses k=5–10; tune metrics to match.
  • Ignoring query distribution shift — synthetic training queries that do not match real user phrasing.
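
Two of these pitfalls are cheap to guard against in code. The sketch below routes every text through one prefixing helper, so training and inference cannot diverge, and checks for eval queries that leaked into the mining set. The "query: " and "passage: " strings follow the E5 convention; the helper names are illustrative.

```python
QUERY_PREFIX = "query: "
PASSAGE_PREFIX = "passage: "

def with_prefix(text, prefix):
    """Strip whitespace and add the prefix unless it is already there."""
    text = text.strip()
    return text if text.startswith(prefix) else prefix + text

def prepare_query(text):
    return with_prefix(text, QUERY_PREFIX)

def prepare_passage(text):
    return with_prefix(text, PASSAGE_PREFIX)

def leaked_queries(mining_queries, eval_queries):
    """Return eval queries that also appear in the mining set (case and spacing ignored)."""
    def normalize(query):
        return " ".join(query.lower().split())
    mined = {normalize(q) for q in mining_queries}
    return [q for q in eval_queries if normalize(q) in mined]

print(prepare_query("  ACH reversal window "))
print(leaked_queries(["ACH  Reversal window"], ["ach reversal window", "sku 991"]))
```

Calling the same prepare_query and prepare_passage functions from the training script, the indexing job, and the serving path is the simplest way to keep prefixes consistent.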

Production checklist

  • Baseline recall@k and MRR on a frozen eval set before any training.
  • Collect at least 500–1,000 human-verified query–passage positives (more for full FT).
  • Run hard-negative mining from current retriever; include near-miss confusions.
  • Match training chunk boundaries to production index chunks.
  • Start with LoRA; compare against full FT only if LoRA plateaus.
  • Version encoder weights and adapter checkpoints in a model registry.
  • Re-embed full corpus atomically (blue/green index swap).
  • A/B test online deflection or answer correctness, not just offline MRR.
  • Monitor embedding drift at least quarterly, and whenever the corpus or query mix changes.
  • Pair fine-tuned bi-encoder with reranker if top-3 precision still matters.
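
The blue/green swap from the checklist can be sketched as an alias that readers resolve on every query. The class and index names are hypothetical; many vector databases offer an alias or collection-swap mechanism that plays this role, so treat this as a sketch of the control flow, not a specific product's API.

```python
class IndexAlias:
    """Points readers at one index name at a time; all names are hypothetical."""

    def __init__(self, live):
        self.live = live
        self.previous = None

    def swap_to(self, new_index, expected_count, actual_count):
        """Promote new_index only if it holds every expected vector."""
        if actual_count != expected_count:
            raise ValueError(
                f"{new_index} has {actual_count} vectors, expected {expected_count}"
            )
        # Single assignment: readers see either the old index or the new one,
        # never a mix of vectors from two encoder versions.
        self.previous, self.live = self.live, new_index

    def rollback(self):
        """Point readers back at the index that was live before the last swap."""
        if self.previous is None:
            raise RuntimeError("no previous index to roll back to")
        self.live, self.previous = self.previous, self.live

alias = IndexAlias("kb_base_encoder")
alias.swap_to("kb_finetuned_v1", expected_count=12000, actual_count=12000)
print(alias.live)
```

Keeping the old index around until the A/B test finishes makes rollback a pointer change, not another re-embed.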

Key takeaways

  • Embedding fine-tuning reshapes retrieval geometry; it does not add knowledge to the corpus.
  • Hard negatives from your retriever's near-misses drive most of the gain over easy random negatives.
  • LoRA adapters deliver most of full fine-tune quality at lower cost and enable per-domain swapping.
  • Harbor Support gained +18 points of recall@10 and +11% deflection by fine-tuning on ticket-linked pairs, not by scaling model size.
  • Always re-embed the corpus and evaluate with temporal splits that match real query drift.
