Guide

LLM synthetic data generation explained

Fine-tuning and alignment need labeled examples — often thousands of them. Human annotation is slow, expensive, and hard to scale across edge cases. LLM synthetic data generation uses a capable teacher model (or the same model in a loop) to produce instruction-response pairs, preference rankings, classification labels, and tool-use traces that bootstrap smaller specialist models. Done well, synthetic data cuts time-to-train from months to days; done poorly, it encodes teacher hallucinations, collapses diversity, and poisons evaluation sets. This guide covers what synthetic data is good for, the main generation pipelines (self-instruct, evol-instruct, distillation), quality filters and deduplication, contamination and bias risks, how synthetic sets pair with fine-tuning and DPO, a Harbor Support ticket-classifier worked example, an approach decision table, common pitfalls, and a production checklist — alongside our data augmentation and LLM evaluation guides.

What synthetic data is (and is not)

Synthetic training data is machine-generated labeled text: a prompt, an expected completion, a chosen/rejected pair, or a structured JSON label. The labeler is typically an LLM prompted with seed examples, rubrics, or retrieval context — not a human, though humans often audit a sample.

Synthetic data is not a substitute for production logs. Real user queries carry typos, adversarial intent, domain jargon, and failure modes no seed prompt imagines. The winning pattern is hybrid: synthetic data for breadth and cold-start, human labels and live traffic for grounding and regression tests.

Common dataset types

Supervised fine-tuning (SFT) — {instruction, input, output} tuples teaching format, tone, or task behavior.
Preference pairs — prompt plus chosen and rejected completions for RLHF or DPO alignment.
Classification labels — text plus category, sentiment, or severity score for smaller encoder or adapter models.
Tool-use traces — multi-turn sequences showing function calls, arguments, and final answers for agent fine-tuning.
RAG QA pairs — question, retrieved context chunk, and grounded answer for retrieval-tuned pipelines.

Generation pipelines

Most production pipelines combine a seed set (50–500 human-written examples), a generator (frontier or open-weight teacher), and a filter stack that rejects low-quality rows before they reach training.

Self-instruct

The model reads seed tasks and invents new instructions in the same style. Each invented instruction is passed back to the model for a completion. Self-instruct (Wang et al., 2022) bootstrapped Alpaca-style datasets from a handful of human demos. Key levers: temperature for diversity, explicit bans on duplicate verbs/topics in the meta-prompt, and batch size per topic so one category does not dominate.

Evol-instruct and complexity scaling

Evol-instruct mutates existing instructions — add constraints, combine subtasks, require multi-step reasoning — then generates answers to the harder variants. Useful when you need hard negatives or complex tool chains without paying annotators per level. Cap evolution depth: beyond 3–4 rounds, instructions often become incoherent or impossible to verify.

Teacher distillation

Run a large teacher on real or synthetic prompts; store completions as targets for a smaller student. Distillation captures style and reasoning patterns cheaper than RL at scale. Pair with rejection sampling: generate k completions, keep only those passing programmatic checks or a verifier model — the same economics as knowledge distillation but at the data layer.

Back-translation and paraphrase

For classification and retrieval, paraphrase inputs to multiply labeled rows while preserving labels. LLM paraphrase is stronger than rule-based augmentation but can drift label boundaries — always re-score with a separate classifier before accepting.

Quality control and filtering

Unfiltered synthetic corpora look large and impressive until you train on them. Budget more engineering on filters than on generation.

Automated gates

Format validation — JSON schema, regex, max length, required fields.
Deduplication — MinHash or embedding cosine similarity; drop near-duplicates above 0.92 similarity.
Perplexity filter — discard completions the base model finds implausible (too easy or gibberish).
LLM-as-judge — rubric-scored relevance, factuality, toxicity; reject below threshold. Calibrate judges on human-labeled slice.
Programmatic verifiers — unit tests for code, calculator for math, retrieval overlap for RAG faithfulness.

Human spot checks

Review 1–5% stratified by topic and judge score. Track inter-annotator agreement between humans and the judge model; drift here predicts training failure. If humans disagree with the judge more than 15% on accepted rows, fix the rubric before scaling.

Diversity metrics

Monitor embedding cluster count, vocabulary richness, and per-class balance. A 50k-row set that collapses to twelve paraphrase templates will overfit tone, not task. Inject topic quotas and periodic re-seeding from production logs.

Risks: contamination, bias, and mode collapse

Benchmark contamination happens when synthetic prompts overlap eval sets or when teachers memorized benchmark answers. Hold out all public benchmarks; never use test questions as evolution seeds. Run n-gram and embedding overlap checks against eval corpora before training.

Teacher bias propagates: hedging tone, Western idioms, unsafe shortcuts. Mitigate with multi-teacher voting, locale-specific seed sets, and toxicity classifiers tuned on your policy, not generic APIs alone.

Mode collapse appears when the student is trained only on teacher outputs and then used to generate more data — each loop narrows vocabulary and reasoning paths. Limit recursive self-training to one or two rounds; refresh seeds from humans or fresh teacher snapshots.

Hallucinated labels are lethal for classification: the teacher may guess “refund” when the ticket is “billing inquiry.” Prefer human-labeled anchors per class and synthetic fill only around verified exemplars.

Worked example: Harbor Support ticket classifier

Harbor Support routes 40k monthly tickets into Billing, Shipping, Account, and Escalation. Human labeling covered 800 tickets; the team needed 8k rows to fine-tune a small classifier before peak season.

Seed — 80 human-labeled tickets per class, including messy real typos and partial order IDs.
Generate — GPT-4 class prompted: “Write a customer email that would be labeled {class}; vary length, emotion, and language.” 200 drafts per class at temperature 0.9.
Label — Same teacher labels each draft with chain-of-thought hidden; store only the category field.
Filter — Drop rows where a separate DeBERTa verifier disagrees with the teacher label; drop duplicates via embedding similarity; reject any row mentioning banned PII patterns.
Mix — Final set: 60% synthetic, 40% human (upsampled human rows). Train LoRA on a 7B base for three epochs.
Eval — Holdout 500 fully human tickets; macro-F1 0.91 vs 0.84 SFT-on-seeds-only. Escalation recall improved most (+9 points).

Production guardrail: any ticket with payment dispute keywords bypasses the classifier straight to Escalation regardless of model score — synthetic data never overrides hard rules.

Approach decision table

Scenario	Recommended approach	Why
Cold-start intent classifier, <500 human labels	Self-instruct + verifier filter + human audit	Breadth fast; verifier catches label noise
Chat tone / style alignment	Teacher distillation from frontier model + DPO pairs	Style transfers well; preferences cheap to rank synthetically
High-stakes medical/legal answers	Human-primary; synthetic only for paraphrase	Hallucinated labels carry liability
Code generation SFT	Evol-instruct + unit-test rejection sampling	Executable tests filter invalid completions
RAG grounding	Generate Q from docs; answers must cite chunk IDs	Constrains teacher to retrieved evidence
Agent tool use	Replay logs + synthetic edge cases with schema validation	Real traces anchor; synthetic fills rare tools

Common pitfalls

Scale before quality — 100k unfiltered rows train a confident wrong model.
Single teacher — one model’s blind spots become your product’s blind spots.
Eval on synthetic — reporting accuracy on generated test sets inflates metrics; always keep human holdouts.
Ignoring production drift — synthetic sets go stale; schedule quarterly refresh from live logs.
Skipping dedup — near-duplicate instructions cause memorization, not generalization.
Recursive loops without anchors — student-generated data degrades within 2–3 iterations.
Copyright and PII leakage — teachers may regurgitate training data; scan for secrets and licensed text.

Production checklist

Documented seed set with provenance (human vs imported).
Generation prompts versioned; temperature and model ID logged per batch.
Filter pipeline with rejection-rate dashboard per stage.
Human audit sample (1–5%) with score agreement tracked weekly.
Deduplication threshold tuned; cluster diversity metric on dashboard.
Benchmark contamination check against all eval corpora.
Class/topic balance report; rebalancing before train.
Mix ratio synthetic/human documented; ablation on human-only baseline.
Holdout eval set 100% human-labeled, never shown to generator.
Post-train regression on live shadow traffic before full rollout.
Refresh cadence tied to product change log (new SKUs, policies, locales).
Hard-rule overrides for safety-critical routes regardless of model score.

Key takeaways

Synthetic data accelerates fine-tuning and alignment when filters are strict and humans anchor the tail.
Self-instruct and evol-instruct provide breadth; distillation and rejection sampling provide depth.
LLM-as-judge scales quality control but must be calibrated against human labels.
Contamination, mode collapse, and hallucinated labels are the main failure modes — design pipelines to detect them early.
Hybrid datasets (synthetic + real logs + human audit) outperform either alone on production metrics.