
Instruction tuning and SFT explained

Harbor Support's first-line chatbot ran on a 7B base model with a 2,400-token system prompt describing tone, refund policy, and escalation rules. It followed instructions inconsistently: some sessions opened with three paragraphs of boilerplate, others skipped mandatory fraud warnings. Median handle time was 4.2 minutes and 31% of chats escalated to humans. The refactor: curate 18,000 multi-turn transcripts from resolved tickets, run supervised fine-tuning (SFT) with LoRA adapters on a Mistral-7B-Instruct base, bake policy into demonstrations instead of prompts, and keep the live system prompt under 400 tokens. Escalation rate fell to 12%; CSAT rose 0.4 points. SFT did not make the model safer or more factual by itself — it taught the shape of a good support answer.

Instruction tuning is the family of techniques that train a language model to respond usefully to user commands rather than merely continue web text. The workhorse method is SFT: standard causal language modeling on curated instruction–response pairs, with loss computed only on assistant tokens. Popularized by InstructGPT and adopted across Llama-2-Chat, Mistral-Instruct, and enterprise fine-tunes, SFT is almost always the first stage of the RLHF alignment pipeline and the reference checkpoint for later DPO or PPO stages. This guide explains pretrain vs SFT vs preference learning, chat template design, label masking, dataset curation and quality filters, hyperparameter choices, evaluation beyond perplexity, the Harbor Support refactor, a technique decision table, common pitfalls, and a production checklist alongside our synthetic data guide and tokenization guide.

Where SFT sits in the alignment stack

Modern chat models pass through a layered post-training stack. Each layer optimizes a different objective; skipping or reordering stages usually hurts quality.

  1. Pretraining — next-token prediction on trillions of tokens. Produces broad knowledge and fluent text, not reliable instruction following.
  2. Supervised fine-tuning (SFT) — imitation learning on high-quality demonstrations: user prompts paired with ideal assistant replies. Teaches format, tone, task decomposition, and domain vocabulary.
  3. Preference optimization — RLHF, DPO, or constitutional AI on ranked completions. Refines style, helpfulness, and refusal behavior beyond what demonstrations alone encode.
  4. Optional capability stages — tool-use fine-tunes, reasoning RL (GRPO), or domain continued pretraining for code and math.

SFT is not optional for most product chatbots: prompting a raw base model into consistent JSON, bullet lists, and policy-compliant refusals requires enormous context and still drifts. Baking those patterns into weights via SFT shrinks prompts, stabilizes latency, and gives later alignment stages a sane starting policy. The trade-off is catastrophic forgetting if you over-train on a narrow domain — a 7B model fine-tuned only on legal FAQs may lose general coding ability unless you mix general instruction data.

Dataset design: what to put in SFT examples

SFT quality is dominated by data, not learning-rate sweeps. Each training example is typically a multi-turn conversation serialized through a chat template (ChatML, Llama-3, Mistral, etc.) that marks roles with special tokens: <|user|>, <|assistant|>, optional <|system|>.

Core fields per example

  • System message — short persona and hard constraints. Keep stable across the dataset; long per-example system prompts dilute learning.
  • User turn(s) — realistic prompts including typos, follow-ups, and ambiguous requests your product actually sees.
  • Assistant turn(s) — the gold response: correct facts, desired length, citations, tool calls, or refusal templates.
  • Optional tool traces — for agentic models, include function-call JSON and observation rounds so the model learns the call loop.
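
As a rough illustration, the fields above might be laid out as one training record and serialized like this. The role markers mirror the generic tokens mentioned earlier; the `<|end|>` marker, the `render` helper, and the sample ticket text are assumptions made for this sketch, not any specific model's template.

```python
# Hypothetical role markers; a real project would use the base model's own template.
ROLE_TOKENS = {"system": "<|system|>", "user": "<|user|>", "assistant": "<|assistant|>"}
END = "<|end|>"  # assumed end-of-turn marker

example = {
    "messages": [
        {"role": "system", "content": "You are a support agent. Never promise instant refunds."},
        {"role": "user", "content": "hi my order never arived, can i get my money back??"},
        {"role": "assistant", "content": "Sorry about that. Could you share your order ID so I can check your refund options?"},
    ]
}

def render(messages):
    """Serialize a conversation into one training string, one turn per line."""
    return "\n".join(f"{ROLE_TOKENS[m['role']]}{m['content']}{END}" for m in messages)

text = render(example["messages"])
```

Keeping the record as structured messages and rendering late makes it easier to switch templates without touching the underlying data.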

Sources include human-written demonstrations, expert edits of model drafts, exported support logs (PII-scrubbed), and synthetic data from a stronger teacher via self-instruct pipelines. Harbor Support mixed 60% human-edited ticket resolutions with 40% synthetic edge cases (chargebacks, partial refunds, account lockouts) that had passed an LLM-as-judge filter scoring policy adherence.

Quality filters worth running

  • Deduplication by MinHash or embedding cosine — near-duplicate prompts cause memorization and hurt generalization.
  • Length and language detection — drop empty, non-target-language, or truncated assistant replies.
  • Factuality spot checks on domain-critical fields (prices, dates, legal text).
  • Toxicity and PII scanners before training — SFT amplifies whatever you include.
  • Reward-model or teacher-model scoring to keep only top-quartile completions when scaling synthetic data.
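
The deduplication filter can be sketched with exact Jaccard similarity over word trigrams, which is the quantity MinHash approximates at scale. The function names and the 0.7 threshold are illustrative assumptions; a large corpus would need MinHash with locality-sensitive hashing rather than this quadratic loop.

```python
def shingles(text, n=3):
    """Set of lowercase word n-grams used as the comparison unit."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def drop_near_duplicates(prompts, threshold=0.7):
    """Keep the first prompt of each near-duplicate group (O(n^2), small sets only)."""
    kept, kept_shingles = [], []
    for prompt in prompts:
        s = shingles(prompt)
        if all(jaccard(s, k) < threshold for k in kept_shingles):
            kept.append(prompt)
            kept_shingles.append(s)
    return kept
```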

Training mechanics: label masking and loss

SFT uses the same causal cross-entropy loss as pretraining, but masks labels so loss is computed only on assistant tokens (and sometimes tool-output tokens). User and system tokens receive label = -100 in PyTorch/Hugging Face trainers so they contribute to the forward pass for context but not to the gradient.

For a single-turn example, only the assistant's reply tokens accumulate loss. For multi-turn dialogs, practitioners differ: train on all assistant turns (recommended for support bots and agents) or only the final turn (common in some open datasets to save compute). Training on every assistant turn teaches intermediate clarification questions and partial tool results.
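
A minimal sketch of the masking step, using toy token ids in place of a real tokenizer. `build_labels` and its `final_turn_only` flag are hypothetical names; the -100 value is the one described above.

```python
IGNORE_INDEX = -100  # label value the loss skips

def build_labels(turns, final_turn_only=False):
    """turns: list of (role, token_ids). Returns (input_ids, labels) of equal length."""
    last_assistant = max(i for i, (role, _) in enumerate(turns) if role == "assistant")
    input_ids, labels = [], []
    for i, (role, ids) in enumerate(turns):
        input_ids.extend(ids)
        train_here = role == "assistant" and (not final_turn_only or i == last_assistant)
        labels.extend(ids if train_here else [IGNORE_INDEX] * len(ids))
    return input_ids, labels

# Toy token ids standing in for a tokenized two-round dialog.
dialog = [("system", [1, 2]), ("user", [3, 4, 5]), ("assistant", [6, 7]),
          ("user", [8]), ("assistant", [9, 10, 11])]
ids, all_turns = build_labels(dialog)
_, last_only = build_labels(dialog, final_turn_only=True)
```

Labels here line up position by position with the inputs; Hugging Face causal LM heads typically apply the one-token shift internally, so no manual offset is needed in that setting.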

Chat templates must match inference

The template used at training time must byte-match deployment. A model SFT'd with Llama-3 headers but served with a raw Alpaca template will ignore instructions and leak special tokens. Store the template string in your model card and enforce it in the inference server (vLLM, TGI, llama.cpp).
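
One inexpensive way to enforce parity is to fingerprint the template string at training time and compare it when the server starts. This is a sketch under assumptions: the helper names and both template strings are hypothetical, and how the serving stack exposes its template will vary.

```python
import hashlib

def template_fingerprint(template: str) -> str:
    """SHA-256 over the exact UTF-8 bytes, so whitespace differences are detected."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()

def assert_template_parity(train_template: str, serve_template: str) -> None:
    """Raise if the serving template is not byte-identical to the training one."""
    if template_fingerprint(train_template) != template_fingerprint(serve_template):
        raise ValueError("chat template differs between training and serving")

# Hypothetical templates; the serving copy has a stray trailing newline.
TRAIN = "<|user|>{prompt}<|end|>\n<|assistant|>"
SERVE = "<|user|>{prompt}<|end|>\n<|assistant|>\n"
```

Storing the fingerprint next to the template in the model card gives the inference server something concrete to check at startup.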

Sequence packing

Padding short examples wastes GPU memory. Packing concatenates multiple conversations into one sequence up to max_seq_len, with attention masks preventing cross-example attention. FlashAttention-2 and TRL support packed SFT; verify label masks reset at pack boundaries. Harbor Support packed to 4,096 tokens with 12% throughput gain on A10G nodes.
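
The packing idea can be sketched in plain Python as follows. This is a simplified greedy packer, not the algorithm any particular library uses; `pack`, `may_attend`, and the dictionary layout are assumptions for illustration.

```python
IGNORE_INDEX = -100  # label value the loss skips

def pack(examples, max_seq_len):
    """Greedy in-order packing of (input_ids, labels) pairs.

    Each pack records position_ids (restarting at 0 per example) and seq_ids
    (which example a token came from) so attention can be kept block-diagonal.
    """
    packs, cur, next_seq = [], None, 0
    for ids, labels in examples:
        if not ids:
            continue
        ids = ids[:max_seq_len]  # truncate oversize examples
        # An example's first token has no in-example context, so never train on it.
        labels = [IGNORE_INDEX] + labels[1:max_seq_len]
        if cur is None or len(cur["input_ids"]) + len(ids) > max_seq_len:
            cur = {"input_ids": [], "labels": [], "position_ids": [], "seq_ids": []}
            packs.append(cur)
            next_seq = 0
        cur["input_ids"] += ids
        cur["labels"] += labels
        cur["position_ids"] += list(range(len(ids)))
        cur["seq_ids"] += [next_seq] * len(ids)
        next_seq += 1
    return packs

def may_attend(packed, q, k):
    """Causal attention restricted to tokens from the same original example."""
    return k <= q and packed["seq_ids"][q] == packed["seq_ids"][k]
```

The `seq_ids` list carries the same information a block-diagonal attention mask encodes; checking it on a few packs is a quick way to confirm that no example can see its neighbor.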

Hyperparameters that usually matter

  • Learning rate — 1e-5 to 2e-5 for full fine-tune; 1e-4 to 3e-4 for LoRA on 7B–13B models. Cosine decay with 3–10% warmup is standard.
  • Epochs — 1–3 passes; more epochs on small specialized sets overfit fast. Watch held-out loss and task win-rate.
  • Batch size — effective batch of 64–512 sequences, usually reached via gradient accumulation.
  • Rank / alpha (LoRA) — rank 16–64 on q_proj, v_proj, and often k_proj, o_proj; alpha typically 2× rank.
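
The schedule and batch-size items above reduce to a few lines of arithmetic. The sketch below assumes linear warmup followed by cosine decay to zero; `lr_at` and `effective_batch` are illustrative helpers, and the defaults are example values taken from the ranges listed.

```python
import math

def lr_at(step, total_steps, peak_lr=2e-4, warmup_frac=0.05, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

def effective_batch(per_device_batch, grad_accum_steps, num_devices):
    """Number of sequences contributing to each optimizer step."""
    return per_device_batch * grad_accum_steps * num_devices
```

For instance, a per-device batch of 4 with 8 accumulation steps on 4 GPUs gives an effective batch of 128 sequences.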

Harbor Support SFT refactor (worked example)

The team started from mistralai/Mistral-7B-Instruct-v0.2 rather than the base checkpoint because the instruct variant already understood chat roles, reducing required demonstration volume. They exported 24,000 anonymized ticket threads, filtered to 18,200 after PII redaction and policy-violation removal, and formatted with the model's native template.

Training used QLoRA (4-bit NF4 base weights, rank-32 adapters on all attention and MLP projections), one epoch, a learning rate of 2e-4, a max sequence length of 4,096, and an 8% held-out eval split stratified by issue type. Checkpoints were selected by a combined score: a human rubric on 200 blind prompts (tone, policy, correctness) plus automated JSON-schema validity for structured refund offers.
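
To get a feel for how small the trainable footprint of such a setup is, the adapter weight count can be worked out by hand. The layer shapes below are assumptions taken from the public Mistral-7B configuration (hidden size 4096, key/value width 1024, MLP width 14336, 32 layers), not figures reported by the Harbor team.

```python
# Assumed Mistral-7B shapes: hidden 4096, KV width 1024 (8 KV heads x 128),
# MLP width 14336, 32 transformer layers.
HIDDEN, KV, MLP, LAYERS = 4096, 1024, 14336, 32

# (in_features, out_features) of each adapted projection in one layer.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN), "k_proj": (HIDDEN, KV), "v_proj": (HIDDEN, KV),
    "o_proj": (HIDDEN, HIDDEN), "gate_proj": (HIDDEN, MLP), "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_params(rank, targets=PROJECTIONS, layers=LAYERS):
    """Each adapter adds A (rank x in) and B (out x rank): rank * (in + out) weights."""
    per_layer = sum(rank * (fan_in + fan_out) for fan_in, fan_out in targets.values())
    return per_layer * layers

trainable = lora_params(32)  # 83,886,080 adapter weights under these assumptions
```

Under these assumptions, rank 32 on every projection trains roughly 84 million weights, a little over 1% of the base model, and halving the rank halves that count.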

Post-SFT, a light DPO pass on 6,000 preference pairs (human rankings of two candidate replies) further reduced verbosity. The key lesson: SFT fixed structure (always mention the order ID field, never promise instant refunds); DPO fixed preference (shorter empathetic openers). Skipping SFT and jumping straight to DPO on the base model failed: preference data could not teach JSON refund templates the model had never seen.

Technique decision table

Approach | Best when | Limitations
Prompt engineering only | Prototypes, low traffic, behavior changes weekly | Long prompts, high latency/cost, inconsistent formatting
SFT (full or LoRA) | Stable task format, domain tone, tool-call patterns, 1k–100k demos | Does not fix factual errors; can overfit narrow data
RAG | Fresh facts from docs, citations required | Retrieval quality bound; does not teach conversational style alone
DPO / RLHF | Pairwise preferences, tone polish, after SFT reference exists | Needs preference data; weak at teaching new formats from scratch
Continued pretraining | Massive domain corpus (code, medicine), vocabulary shift | Expensive; may degrade instruction following without follow-up SFT

Compose layers: SFT for behavior shape, RAG for facts, hybrid retrieval for scale, DPO for preference alignment. Prompt engineering remains the fastest A/B lever on top of a fine-tuned checkpoint.

Evaluation: beyond training loss

Falling train loss with rising eval loss signals overfitting, but loss alone says little about product quality. Production teams also track task-specific metrics:

  • Human or LLM-as-judge win-rate against the previous checkpoint on a frozen prompt set.
  • Schema validity rate for JSON, function calls, and regex extractions.
  • Policy violation rate on red-team and compliance prompts.
  • Regression suites for general capabilities (MMLU subset, coding snippets) to catch catastrophic forgetting.
  • Online metrics — escalation rate, CSAT, time-to-resolution after shadow deployment.
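
Two of the offline metrics above are simple to compute once outputs and judge verdicts are collected. This sketch assumes a hypothetical refund-offer schema with three required fields and a judge that returns 'new', 'old', or 'tie' per prompt; neither comes from the Harbor setup.

```python
import json

REQUIRED_FIELDS = {"order_id", "refund_amount", "currency"}  # hypothetical schema

def is_valid_offer(raw: str) -> bool:
    """True if raw parses as a JSON object containing every required field."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and REQUIRED_FIELDS.issubset(obj)

def schema_validity_rate(outputs):
    """Fraction of model outputs that satisfy the schema check."""
    return sum(is_valid_offer(o) for o in outputs) / len(outputs)

def win_rate(verdicts):
    """verdicts: 'new', 'old', or 'tie' per prompt; ties count as half a win."""
    score = sum(1.0 if v == "new" else 0.5 if v == "tie" else 0.0 for v in verdicts)
    return score / len(verdicts)
```

Counting ties as half a win is one common convention; reporting ties separately is equally reasonable as long as the choice stays fixed across checkpoints.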

Perplexity on held-out demonstrations correlates weakly with user satisfaction. A model can achieve low perplexity by memorizing boilerplate while failing on rare refund edge cases. Stratify eval sets by how often each issue type occurs (for example, by frequency decile) so that rare cases are scored separately from common ones.
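
A small sketch of why stratification matters: an aggregate pass rate can look healthy while a rare issue type fails half the time. Grouping by issue type stands in here for frequency deciles, and the results are made-up numbers for illustration only.

```python
from collections import defaultdict

def pass_rate_by_issue(results):
    """results: list of (issue_type, passed). Rows are (issue, count, rate), rarest first."""
    buckets = defaultdict(list)
    for issue, passed in results:
        buckets[issue].append(passed)
    rows = [(issue, len(v), sum(v) / len(v)) for issue, v in buckets.items()]
    return sorted(rows, key=lambda row: row[1])

def micro_and_macro(results):
    """Micro averages over prompts; macro averages over issue types."""
    rows = pass_rate_by_issue(results)
    micro = sum(passed for _, passed in results) / len(results)
    macro = sum(rate for _, _, rate in rows) / len(rows)
    return micro, macro

# Made-up results: the common issue always passes, the rare one fails half the time.
results = [("shipping_status", True)] * 18 + [("chargeback", True), ("chargeback", False)]
micro, macro = micro_and_macro(results)
```

Here the micro average is 0.95 while the macro average is 0.75, which is the gap a single aggregate number would hide.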

Common pitfalls

  • Training on user tokens — Including loss on user prompts teaches the model to imitate customers, not assistants.
  • Template mismatch — Different special tokens at train vs serve is the most common silent SFT failure mode.
  • All synthetic, no human audit — Teacher-model errors compound; always human-review high-risk domains.
  • Too many epochs on small data — Memorizes ticket IDs and example names; rote responses on novel inputs.
  • Ignoring contamination — Benchmark questions in training data inflate offline scores; dedupe against eval sets.
  • Skipping mixed general data — Narrow SFT collapses general reasoning; blend 10–30% broad instruction data (for example Alpaca or filtered ShareGPT) when specializing.
  • Confusing SFT with safety — Demonstrations teach refusals only if refusals appear in data; add explicit refusal examples and follow with preference learning.
  • Wrong tokenizer assumptions — Multilingual or code-heavy text needs tokenizer-aware length limits; byte-level BPE splits identifiers awkwardly.
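
The contamination pitfall can be checked with a simple word n-gram overlap test against the eval set. This is a sketch: the 8-word window and the function names are assumptions, and punctuation is left attached to words, so lightly reworded copies can slip through.

```python
def ngrams(text, n=8):
    """Set of lowercase word n-grams; empty if the text has fewer than n words."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(train_text, eval_ngrams, n=8):
    """True if the training text shares any word n-gram with the eval set."""
    return not ngrams(train_text, n).isdisjoint(eval_ngrams)

def decontaminate(train_texts, eval_texts, n=8):
    """Drop training texts that overlap with any eval text."""
    eval_ngrams = set().union(*(ngrams(t, n) for t in eval_texts))
    return [t for t in train_texts if not is_contaminated(t, eval_ngrams, n)]
```

Running the check in both directions (training data against benchmarks and against the frozen in-house prompt set) covers both kinds of leakage.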

Production checklist

  • Define target behaviors (format, tone, tools, refusals) before collecting data.
  • Choose base checkpoint (raw vs instruct) based on demonstration budget.
  • Normalize conversations through the exact chat template used at inference.
  • Mask loss on user and system tokens; decide multi-turn assistant masking policy.
  • Run deduplication, PII scrubbing, and quality scoring on all sources.
  • Hold out stratified eval set; never tune on it.
  • Start with LoRA/QLoRA; full fine-tune only when adapters plateau.
  • Log checkpoints with data hash, hyperparameters, and eval rubric scores.
  • Regression-test general capabilities after domain SFT.
  • Plan DPO or RLHF stage if pairwise preference data exists post-SFT.
  • Version control system prompts separately; shrink prompts as behaviors move into weights.
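
The stratified hold-out item in the checklist might look like the following sketch. The `stratified_split` helper, its 8% default, and the rule of holding out at least one example per stratum are illustrative assumptions.

```python
import random
from collections import defaultdict

def stratified_split(examples, key, eval_frac=0.08, seed=0):
    """Hold out eval_frac of each stratum (at least one example when it has 2+)."""
    groups = defaultdict(list)
    for ex in examples:
        groups[key(ex)].append(ex)
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, held_out = [], []
    for _, items in sorted(groups.items(), key=lambda kv: kv[0]):
        items = items[:]
        rng.shuffle(items)
        n_eval = max(1, round(len(items) * eval_frac)) if len(items) > 1 else 0
        held_out += items[:n_eval]
        train += items[n_eval:]
    return train, held_out
```

A call such as `stratified_split(tickets, key=lambda t: t["issue"])` keeps every issue type represented in the eval set, including the rare ones.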

Key takeaways

  • SFT teaches instruction-following by imitating curated assistant demonstrations with masked causal loss.
  • It is the standard first alignment stage and the reference policy for DPO and RLHF.
  • Chat template parity between training and serving is non-negotiable.
  • Harbor Support cut escalations from 31% to 12% by SFT-ing policy-shaped demos before a light DPO pass.
  • Evaluate with task rubrics and online metrics, not training loss alone.
