
LLM bias and fairness explained

A hiring assistant ranks two identical resumes differently because one candidate’s name correlates with a demographic group underrepresented in the training corpus. A medical chatbot describes pain differently depending on gendered pronouns in the prompt. A support router escalates tickets written in certain dialects more often, not because the issue is harder, but because the classifier confuses informal phrasing with hostility. These are not edge cases; they are predictable outcomes when LLM bias and fairness are left to “the model vendor handled it.” This guide separates representational harm from allocative unfairness; traces how bias enters through pre-training, alignment, RAG, and prompts; explains fairness metrics and counterfactual evaluation; maps mitigation across the stack; walks through a Harbor Support ticket-routing fairness audit; compares strategies in a decision table; lists common pitfalls; and ends with a production checklist. For privacy risks that intersect with fairness, see our LLM data privacy guide; for output safety rails, see LLM guardrails; for measurement, see LLM evaluation and benchmarking.

Two kinds of harm: representational vs allocative

Fairness work fails when teams optimize the wrong failure mode. Split the problem before choosing metrics.

Representational harm

The model describes groups in stereotyped, erasing, or demeaning ways: defaulting doctors to “he,” associating certain ethnicities with crime in open-ended completions, or refusing neutral tasks for dialects labeled “non-standard.” Harm is visible in generated text even when no decision is made. Mitigation targets training data balance, instruction tuning, and post-generation filters.

Allocative harm

The model allocates resources unequally: loan approvals, support priority, content moderation strikes, hiring shortlists. Harm is measured in disparate error rates or outcomes across protected or proxy attributes. Mitigation targets threshold tuning, human review for edge cases, and sometimes abstaining from automated decisions entirely.

A single product can cause both: a résumé screener that stereotypes language (representational) and ranks candidates unequally (allocative). Your evaluation plan must cover each path independently — passing a toxicity benchmark does not prove equal false-negative rates on routing.

Where bias enters the LLM stack

Foundation models inherit the web’s skew: overrepresentation of English, Western institutions, and dominant demographics; underrepresentation of indigenous languages, disability perspectives, and Global South contexts. Scaling data does not automatically fix skew; it can amplify majority patterns.

Pre-training and fine-tuning

Co-occurrence statistics become stereotypes (“nurse” ↔ “she”). Supervised fine-tuning on imbalanced human demonstrations can encode annotator bias. RLHF reward models trained mostly on majority-group raters can penalize valid dialect or cultural framing. Mitigation: stratified sampling, diverse annotator pools, bias-aware loss weighting, and held-out slice evaluation before deployment.
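
As one illustration of bias-aware loss weighting, the sketch below computes inverse-frequency weights so that each slice contributes the same total weight to a training loss. It assumes every example carries a hypothetical slice label (here a tone tag); the function name and data layout are illustrative and not tied to any particular training framework.

```python
from collections import Counter

def inverse_frequency_weights(slice_labels):
    """Return one weight per example so each slice contributes equal total weight."""
    counts = Counter(slice_labels)
    total = len(slice_labels)
    n_slices = len(counts)
    # weight = total / (n_slices * slice_count); the weights sum to `total`.
    return [total / (n_slices * counts[label]) for label in slice_labels]

# Illustrative data: 8 formal-tone examples and 2 informal-tone examples.
labels = ["formal"] * 8 + ["informal"] * 2
weights = inverse_frequency_weights(labels)
print(weights[0], weights[-1])  # 0.625 2.5
```

Reweighting only rebalances how much each slice counts; it does not add information about underrepresented slices, so held-out slice evaluation is still needed.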

RAG and retrieval

If your knowledge base over-indexes policy docs written for one region, answers silently exclude others. Embedding models can rank semantically equivalent queries differently by dialect. See RAG fundamentals for architecture; fairness requires auditing retrieved chunks per demographic slice, not just answer fluency.
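
A per-slice retrieval audit can be sketched as recall@k grouped by slice. The record layout (`slice`, `retrieved`, `relevant`) and the slice names are assumptions for illustration; a real audit would draw them from a labeled retrieval set.

```python
from collections import defaultdict

def recall_at_k_by_slice(queries, k=5):
    """queries: dicts with 'slice', 'retrieved' (ranked ids), 'relevant' (set of ids)."""
    per_slice = defaultdict(list)
    for q in queries:
        if not q["relevant"]:
            continue  # recall is undefined when nothing is relevant
        top_k = set(q["retrieved"][:k])
        per_slice[q["slice"]].append(len(top_k & q["relevant"]) / len(q["relevant"]))
    # Mean recall@k per slice.
    return {name: sum(vals) / len(vals) for name, vals in per_slice.items()}

# Illustrative queries for two hypothetical regional slices.
queries = [
    {"slice": "region_a", "retrieved": ["a", "b", "c"], "relevant": {"a", "b"}},
    {"slice": "region_b", "retrieved": ["x", "y", "z"], "relevant": {"z", "q"}},
]
recall = recall_at_k_by_slice(queries, k=2)
```

A large gap between slices at the same k suggests the corpus or the embedding model serves one slice better, even if answers read fluently for both.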

Prompts, tools, and orchestration

System prompts that ask the model to “detect suspicious behavior” without rubrics import cultural assumptions. Tool calls that pass zip codes into risk scorers recreate redlining through proxies. Agent loops amplify small biases across steps. Document every attribute the pipeline touches — explicit or inferred.

Fairness metrics that survive a legal review

No single metric is universally “fair.” Pick metrics tied to your harm model and jurisdiction, then report trade-offs honestly.

Group fairness definitions

Demographic parity — positive outcome rate equal across groups: P(positive | A=a) = P(positive | A=b). Simple to audit; can conflict with base-rate differences in legitimate predictors.
Equalized odds — equal true-positive and false-positive rates across groups. Strong when false positives and false negatives carry asymmetric cost (e.g. wrongful fraud flag vs missed fraud).
Calibration within groups — among applicants scored 70%, 70% should succeed in each group. Important for risk scores; can be incompatible with equalized odds when base rates differ.
Individual fairness — similar individuals receive similar outcomes. Operationalized via Lipschitz constraints or metric learning; hard at LLM scale but useful for embedding-based routing.
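
The group definitions above reduce to per-group rates computed from a confusion matrix. The following is a minimal sketch, assuming binary labels and a list of `(group, y_true, y_pred)` tuples; the helper names are illustrative.

```python
def rate(numer, denom):
    return numer / denom if denom else float("nan")

def group_rates(records):
    """records: iterable of (group, y_true, y_pred) with binary labels."""
    stats = {}
    for group, y_true, y_pred in records:
        s = stats.setdefault(group, {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
        # 't'/'f' marks whether the prediction was correct; 'p'/'n' marks the prediction.
        key = ("t" if y_true == y_pred else "f") + ("p" if y_pred else "n")
        s[key] += 1
    out = {}
    for group, s in stats.items():
        n = sum(s.values())
        out[group] = {
            "positive_rate": rate(s["tp"] + s["fp"], n),  # demographic parity
            "tpr": rate(s["tp"], s["tp"] + s["fn"]),      # equalized odds, part 1
            "fpr": rate(s["fp"], s["fp"] + s["tn"]),      # equalized odds, part 2
        }
    return out

def gap(rates, metric):
    values = [r[metric] for r in rates.values()]
    return max(values) - min(values)

records = [
    ("a", 1, 1), ("a", 1, 0), ("a", 0, 1), ("a", 0, 0),
    ("b", 1, 1), ("b", 1, 1), ("b", 0, 0), ("b", 0, 0),
]
rates = group_rates(records)
print(gap(rates, "positive_rate"), gap(rates, "tpr"), gap(rates, "fpr"))  # 0.0 0.5 0.5
```

In this toy data both groups receive positive outcomes at the same rate, so demographic parity holds, while the true-positive and false-positive rates differ by 0.5, so equalized odds fails. That is why the metrics need to be reported together.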

Impossibility and context

When base rates differ and you use imperfect predictors, demographic parity, equalized odds, and calibration cannot all hold simultaneously (the fairness impossibility results). Product and legal stakeholders must choose which error to prioritize and document that choice. In US employment and credit contexts, disparate impact analysis (and the comparable indirect-discrimination analysis under EU law) often focuses on outcome rates and business necessity or objective justification, not model logits.
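
A small numeric sketch makes the conflict concrete. It fixes the same true-positive and false-positive rates for two groups (so equalized odds holds) and derives the positive predictive value and selection rate from Bayes' rule. Equal PPV is used here as a binary stand-in for calibration, and all the numbers are made up for illustration.

```python
def ppv(tpr, fpr, base_rate):
    """Positive predictive value implied by error rates and prevalence (Bayes' rule)."""
    true_pos = tpr * base_rate
    false_pos = fpr * (1 - base_rate)
    return true_pos / (true_pos + false_pos)

def selection_rate(tpr, fpr, base_rate):
    """Share of the group that receives a positive prediction."""
    return tpr * base_rate + fpr * (1 - base_rate)

tpr, fpr = 0.8, 0.1  # identical for both groups, so equalized odds is satisfied
ppv_a = ppv(tpr, fpr, base_rate=0.5)
ppv_b = ppv(tpr, fpr, base_rate=0.2)
sel_a = selection_rate(tpr, fpr, base_rate=0.5)
sel_b = selection_rate(tpr, fpr, base_rate=0.2)
print(round(ppv_a, 3), round(ppv_b, 3))  # 0.889 0.667
print(round(sel_a, 3), round(sel_b, 3))  # 0.45 0.24
```

With identical error rates, a positive prediction is right about 89% of the time in one group and about 67% in the other, and the selection rates differ as well. Fixing one of these gaps moves the other two.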

Evaluation: counterfactuals, slices, and red teams

Aggregate accuracy hides subgroup failure. Build evaluation that stresses the attributes you cannot legally use at inference but must still protect.

Counterfactual and templated prompts

Hold the task constant; swap only protected or proxy attributes: names, pronouns, geography, dialect markers. Measure differences in completion toxicity, sentiment, recommendation strength, and tool calls. Benchmarks such as BBQ, BOLD, and WinoBias provide templates; production systems need domain-specific counterfactual suites that mirror real user phrasing.
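
A templated counterfactual suite can be sketched in a few lines. The template, the slot values, and `stub_score` are all hypothetical; `stub_score` stands in for a real model call and is deliberately biased so that the delta is visible.

```python
from itertools import combinations

def counterfactual_pairs(template, slot, values, **fixed):
    """Yield (value_a, value_b, prompt_a, prompt_b); prompts differ only in `slot`."""
    prompts = {v: template.format(**{slot: v}, **fixed) for v in values}
    for a, b in combinations(values, 2):
        yield a, b, prompts[a], prompts[b]

def max_score_delta(pairs, score_fn):
    """Largest absolute score difference across counterfactual pairs."""
    return max(abs(score_fn(pa) - score_fn(pb)) for _, _, pa, pb in pairs)

def stub_score(prompt):
    # Hypothetical stand-in for a model call (e.g. an urgency score in [0, 1]).
    return 0.9 if "husband" in prompt else 0.6

TEMPLATE = "My {relation} has had {symptom} for two days. How urgent is this?"
pairs = list(counterfactual_pairs(TEMPLATE, "relation", ["husband", "wife"],
                                  symptom="chest pain"))
delta = max_score_delta(pairs, stub_score)
```

In practice the scoring function would wrap the production model and the same pairs would be rerun on every model release, so that deltas can be compared across versions.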

Slice metrics and continuous monitoring

Tag production traffic with coarse demographics only where consent and policy allow; otherwise use proxy slices (language variety, account tenure, product tier). Report precision, recall, escalation rate, and CSAT per slice. Alert when drift exceeds thresholds — fairness regressions often arrive via a new fine-tune, not a code deploy.
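
A drift alert over slice metrics might look like the sketch below. The slice names, the metric (recall per slice), and the 0.05 tolerance are assumptions chosen for illustration; real thresholds should come from the documented harm model.

```python
def slice_alerts(baseline, current, max_drop=0.05):
    """Return slices whose metric fell more than `max_drop` below the baseline."""
    alerts = {}
    for slice_name, base_value in baseline.items():
        value = current.get(slice_name)
        if value is None:
            alerts[slice_name] = "missing from current window"
        elif base_value - value > max_drop:
            alerts[slice_name] = f"dropped {base_value - value:.3f}"
    return alerts

# Illustrative recall per proxy slice before and after a new fine-tune.
baseline = {"tier_free": 0.90, "tier_pro": 0.92}
current = {"tier_free": 0.80, "tier_pro": 0.91}
alerts = slice_alerts(baseline, current)
print(alerts)  # {'tier_free': 'dropped 0.100'}
```

Comparing each slice against its own baseline, rather than against the global average, keeps a regression in a small slice from being masked by stable performance elsewhere.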

Human review and LLM-as-judge (with caution)

Frozen human-labeled audit sets anchor automated evals. LLM judges scale counterfactual runs but carry their own biases; calibrate judges against humans per LLM-as-judge best practices and never treat judge scores as legal compliance proof.
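
One way to calibrate a judge against humans is to compute agreement per slice rather than overall. The sketch below implements Cohen's kappa directly so it has no dependencies; the slice names and labels are illustrative.

```python
from collections import Counter

def cohens_kappa(human, judge):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(h_counts[c] * j_counts[c] for c in h_counts) / (n * n)
    if expected == 1:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

human = {"slice_a": [1, 1, 0, 0], "slice_b": [1, 0, 1, 0]}
judge = {"slice_a": [1, 1, 0, 1], "slice_b": [1, 0, 1, 0]}
kappa_by_slice = {s: cohens_kappa(human[s], judge[s]) for s in human}
print(kappa_by_slice)  # {'slice_a': 0.5, 'slice_b': 1.0}
```

A judge that agrees well with humans on one slice and poorly on another will distort any counterfactual comparison between those slices, so low per-slice kappa is a reason to fall back to human labels there.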

Mitigation across data, alignment, RAG, and guardrails

Fairness is a systems property. A debiased base model plus a biased retrieval index still harms users.

Data curation: upsample underrepresented domains; deduplicate stereotype-heavy sources; audit annotation guidelines for coded language.
Alignment: RLHF/DPO with diverse preference data; rejection sampling on counterfactual failures; constitutional rules that forbid differential treatment on protected attributes.
RAG: multilingual and multi-region corpora; per-slice recall@k; block retrieval of documents with known biased policy language unless flagged for human review.
Inference guardrails: classifiers on inputs and outputs for slurs, differential-refusal patterns, and policy violations; equalized threshold tuning per slice where allocative decisions occur.
Human in the loop: mandatory review when confidence is low or a slice historically underperforms; right to appeal for high-stakes outcomes.
Abstain and defer: when fairness constraints cannot be met, route to human agents instead of guessing.
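
The human-in-the-loop and abstain items can be combined into a single routing rule. This is a minimal sketch, assuming per-slice thresholds have already been validated; the function name, the outcome labels, and the 0.05 abstain band are illustrative assumptions.

```python
def route(score, slice_name, thresholds, abstain_band=0.05):
    """Return 'escalate', 'auto_reply', or 'human_review' for one ticket."""
    threshold = thresholds.get(slice_name)
    if threshold is None:
        return "human_review"  # no validated threshold for this slice
    if abs(score - threshold) < abstain_band:
        return "human_review"  # too close to the boundary to decide automatically
    return "escalate" if score > threshold else "auto_reply"

thresholds = {"slice_a": 0.5, "slice_b": 0.6}
decisions = [
    route(0.90, "slice_a", thresholds),
    route(0.52, "slice_a", thresholds),
    route(0.20, "slice_b", thresholds),
    route(0.90, "slice_c", thresholds),
]
print(decisions)  # ['escalate', 'human_review', 'auto_reply', 'human_review']
```

Deferring on unknown slices is a conservative default: it trades reviewer time for not applying a threshold that was never validated on that population.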

Worked example: Harbor Support ticket routing fairness

Harbor Support routes incoming tickets to Tier 1 auto-reply, Tier 2 specialist, or urgent escalation using an LLM classifier over subject + first message. After launch, slice metrics show Tier 2 assignment running 18% higher for messages detected as containing African American English (AAE) features, even though ground-truth severity labels on a labeled audit set are equal across slices.

Diagnosis

Counterfactual suite: same billing dispute, AAE vs Standard American English phrasing — escalation score delta 0.31 on average.
Training data skew: 72% of “urgent” examples in SFT set used formal corporate tone; informal phrasing spuriously correlated with anger labels.
Retrieval: policy snippets on “abusive language” retrieved more often for AAE queries because chunk embeddings conflated direct tone with hostility.

Remediation shipped

Rebalanced SFT with matched-severity examples across dialect templates.
Added dialect-aware data augmentation; dropped features explicitly encoding sociolect from the routing model input (name removed; message text only).
Equalized-odds threshold adjustment on validation slice: raised auto-reply threshold for slices with historical FPR imbalance.
RAG filter: deprioritize “conduct” chunks unless toxicity classifier fires above 0.85.
Weekly slice dashboard + monthly human audit of 200 stratified tickets.
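
The RAG filter in the remediation list could be sketched as a rerank step. The chunk layout, the `conduct` tag, and the 0.5 penalty factor are assumptions for illustration; only the 0.85 toxicity gate comes from the remediation list above.

```python
def rerank(chunks, toxicity_score, toxicity_gate=0.85, penalty=0.5):
    """chunks: dicts with 'id', 'tags', 'score'. Returns ids ordered by adjusted score."""
    adjusted = []
    for chunk in chunks:
        score = chunk["score"]
        # Down-weight conduct-policy chunks unless the toxicity classifier fires above the gate.
        if "conduct" in chunk["tags"] and toxicity_score <= toxicity_gate:
            score *= penalty  # illustrative penalty factor
        adjusted.append((score, chunk["id"]))
    return [cid for _, cid in sorted(adjusted, key=lambda t: t[0], reverse=True)]

chunks = [
    {"id": "conduct-1", "tags": {"conduct"}, "score": 0.9},
    {"id": "billing-1", "tags": {"billing"}, "score": 0.6},
]
low_toxicity_order = rerank(chunks, toxicity_score=0.20)
high_toxicity_order = rerank(chunks, toxicity_score=0.95)
```

Deprioritizing rather than removing the conduct chunks keeps them available when a message is genuinely abusive, which is the case the toxicity gate is meant to detect.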

Post-fix: escalation rate gap reduced from 18% to 3% (within pre-defined tolerance); overall CSAT unchanged. Documented impossibility trade-off: slight increase in missed urgent flags on formal tone to achieve parity on false escalations.
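
The threshold adjustment behind this result can be illustrated with a per-slice search for a target false-positive rate. The sketch covers only the FPR half of equalized odds, and the scores, slice names, and 0.2 target are made up; it is not Harbor Support's actual procedure.

```python
def threshold_for_target_fpr(negative_scores, target_fpr):
    """Return a threshold t so that the share of negatives scoring above t is at most target_fpr."""
    ordered = sorted(negative_scores, reverse=True)
    allowed = int(target_fpr * len(ordered))  # negatives permitted above the threshold
    if allowed >= len(ordered):
        return float("-inf")  # every negative may be flagged
    return ordered[allowed]

def fpr_at(negative_scores, threshold):
    return sum(s > threshold for s in negative_scores) / len(negative_scores)

# Escalation scores for validation tickets whose true label is "not urgent".
slice_negatives = {
    "slice_a": [0.1, 0.2, 0.3, 0.4, 0.9],
    "slice_b": [0.3, 0.5, 0.6, 0.7, 0.95],
}
thresholds = {s: threshold_for_target_fpr(v, 0.2) for s, v in slice_negatives.items()}
fprs = {s: fpr_at(v, thresholds[s]) for s, v in slice_negatives.items()}
print(thresholds)  # {'slice_a': 0.4, 'slice_b': 0.7}
```

Matching false-positive rates this way shifts the true-positive rate in each slice too, which is where a trade-off like the one documented above shows up. The resulting thresholds should be checked against recall on truly urgent tickets before shipping.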

Mitigation strategy decision table

Symptom	Likely layer	First lever	Verify with
Stereotyped open-ended completions	Pre-train / SFT	Counterfactual fine-tune; output guardrails	BOLD-style templates; human toxicity review
Unequal approval or routing rates	Classifier head / thresholds	Equalized odds threshold tuning; abstain path	Slice confusion matrices; disparate impact ratio
Region-specific wrong answers	RAG corpus	Expand corpus; per-locale retrieval eval	Recall@k by geography slice
Differential refusals (“I can’t help with that”)	Alignment / system prompt	Rewrite refusals; balanced RLHF preferences	Counterfactual refusal rate parity
Judge or metric blind spots	Evaluation	Diverse human audit set; swap judge models	Inter-rater agreement by slice
Regression after model upgrade	Deployment	Gate on fairness CI; shadow mode	Pre-prod counterfactual suite diff
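
The disparate impact ratio named in the table is the lowest group selection rate divided by the highest. The sketch below flags ratios under 0.8, following the four-fifths rule of thumb from US employment-selection guidance; it is a screening heuristic, not a legal determination, and the rates are illustrative.

```python
def disparate_impact_ratio(selection_rates):
    """Lowest group selection rate divided by the highest."""
    highest = max(selection_rates.values())
    if highest == 0:
        return float("nan")  # nobody was selected in any group
    return min(selection_rates.values()) / highest

selection_rates = {"group_a": 0.30, "group_b": 0.21}
ratio = disparate_impact_ratio(selection_rates)
needs_review = ratio < 0.8  # four-fifths heuristic
```

Here the ratio is roughly 0.7, so the outcome gap would be flagged for closer review with slice confusion matrices.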

Common pitfalls

Fairness washing — publishing aggregate safety scores without slice breakdowns; regulators and users read the gaps.
Proxy blindness — removing race from inputs while zip code, school name, and writing style remain strong proxies.
Metric shopping — reporting only demographic parity when equalized odds fails (or vice versa) without disclosure.
One-shot debiasing — a single fine-tune pass that fixes benchmarks but erodes under production drift.
Annotator homogeneity — RLHF raters from one region encoding politeness norms as quality.
Ignoring representational harm in allocative apps — even “neutral” scores paired with stereotyped explanations erode trust.
Over-automation — deploying LLM decisions in credit, housing, or health without human appeal where law or ethics require it.
Confusing privacy with fairness — redacting PII does not remove dialect or behavioral proxies; see privacy architecture separately.

Production checklist

Document harm model: representational, allocative, or both; tie to product decisions.
Build domain counterfactual eval set before launch; version with model releases.
Choose fairness metrics with legal/product sign-off; record impossibility trade-offs.
Audit training, RAG, and SFT data for dialect and demographic skew.
Implement slice metrics in observability; set alert thresholds.
Tune thresholds on validation slices, not global accuracy alone.
Define human escalation when parity constraints fail or confidence is low.
Run fairness regression in CI for routing/classification models.
Publish internal model cards: intended use, known limitations, slice performance.
Schedule quarterly human audits stratified by language variety and outcome.
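
Drawing a stratified audit sample can be sketched with the standard library. The ticket fields and the per-stratum count are assumptions for illustration; the fixed seed keeps the audit sample reproducible.

```python
import random
from collections import defaultdict

def stratified_sample(tickets, key, per_stratum, seed=0):
    """Sample up to `per_stratum` tickets from each stratum defined by key(ticket)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for ticket in tickets:
        strata[key(ticket)].append(ticket)
    sample = []
    for name in sorted(strata):  # sorted for a deterministic stratum order
        items = strata[name]
        sample.extend(rng.sample(items, min(per_stratum, len(items))))
    return sample

# Illustrative tickets: 90 from one language variety, 10 from another.
tickets = [{"id": i, "variety": "v1" if i < 90 else "v2"} for i in range(100)]
audit = stratified_sample(tickets, key=lambda t: t["variety"], per_stratum=5)
```

Sampling a fixed count per stratum over-represents small slices relative to their traffic share, which is the point: a uniform random sample of these tickets would contain very few from the smaller variety.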

Key takeaways

Representational harm lives in generated text; allocative harm lives in unequal outcomes — measure both.
Bias enters at every layer: pre-training, alignment, RAG, prompts, and tools.
Fairness metrics conflict by design; choose and document trade-offs.
Counterfactual evaluation and slice monitoring catch failures aggregate accuracy hides.
Mitigation is continuous: gate deploys, monitor drift, keep humans in the loop for high-stakes decisions.