LLM bias and fairness explained

A résumé-screening assistant ranks two otherwise identical candidates differently because one name sounds stereotypically male and the other female. A medical chatbot uses language implying lower pain tolerance when the patient’s described accent changes but the symptoms stay the same. A customer-support router sends angry messages written in African American English to “abuse” queues more often than messages with the same tone written in Standard American English. These are not edge cases. They are predictable outcomes when large language models (LLMs) trained on skewed corpora make decisions that affect people differently by demographic proxy.

Bias in LLMs is not one bug; it is a stack of correlated failures: who appears in training data, which associations get reinforced during alignment, how prompts frame tasks, and whether production systems measure disparate impact at all.

This guide covers where bias enters the pipeline; types of harm, from representational stereotypes to allocative unfairness; fairness metrics and counterfactual evaluation; mitigation across pre-training data, fine-tuning, RAG, and guardrails; a Harbor Support ticket-routing fairness worked example; a mitigation decision table; common pitfalls; and a production checklist. For alignment training see RLHF explained; for runtime safety rails see LLM guardrails and constitutional AI.

Where bias enters the LLM stack

Fairness work starts by mapping when skew becomes behavior. Four layers dominate production incidents:

  • Pre-training corpus — web text over-represents English, Western viewpoints, and dominant demographic groups. Rare names, dialects, and professions appear as thin tails the model fills with stereotypes.
  • Supervised fine-tuning (SFT) — demonstration datasets reflect annotator demographics and task framing. A “helpful assistant” persona can over-refuse topics affecting marginalized groups if safety examples are blunt.
  • Preference alignment (RLHF, DPO, RLAIF) — reward models learn what raters prefer, not what is fair. Majority raters may penalize direct discussion of discrimination, pushing models toward euphemism or silence.
  • Application layer — prompts, RAG corpora, tool policies, and routing thresholds turn statistical bias into user-visible harm. A 2% difference in toxicity scores across dialects can become a far larger difference in escalations (40%, for example) when the threshold sits in the tail of the score distribution, where few messages qualify.
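
The threshold effect in the last bullet can be illustrated with a small calculation. This is a minimal sketch that assumes toxicity scores in each group are roughly normally distributed; the means, the standard deviation, and the 0.72 threshold are illustrative numbers, not measurements from any real system.

```python
import math

def escalation_rate(mean: float, std: float, threshold: float) -> float:
    """Share of messages scoring above the threshold, assuming Normal(mean, std) scores."""
    z = (threshold - mean) / std
    # Upper-tail probability of a standard normal, via the error function.
    return 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))

# Assumed, illustrative numbers: group B's mean score is 2% higher than group A's.
rate_a = escalation_rate(mean=0.50, std=0.10, threshold=0.72)
rate_b = escalation_rate(mean=0.51, std=0.10, threshold=0.72)

# Relative difference in escalation rates between the two groups.
relative_gap = rate_b / rate_a - 1.0
print(f"A: {rate_a:.4f}  B: {rate_b:.4f}  relative gap: {relative_gap:.0%}")
```

Under these assumptions the relative gap in escalations is more than ten times the 2% shift in mean score, and it grows as the threshold moves further into the tail.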

Bias vs variance (do not conflate)

Machine-learning textbooks use “bias” for underfitting error. Fairness practitioners use it for systematic disadvantage toward protected or proxy-protected groups. This guide uses the fairness sense. The statistical bias-variance tradeoff is a separate concept; it becomes relevant here only when a team mislabels a fairness regression as generic overfitting.

Types of harm: representation, stereotyping, and allocation

Not every biased completion is equally harmful. Taxonomies help prioritize evals and mitigations:

Representational harm

Models depict groups inaccurately or in limited roles — associating certain nationalities with crime, genders with domestic roles, or disabilities with incompetence. Harm is reputational and cumulative even when no single user is denied a service.

Stereotype reinforcement

Prompts like “The nurse walked in and…” elicit gender-skewed continuations. Counterfactual template tests swap protected attributes (name, pronoun, ethnicity cue) and measure log-prob or generation differences. Large gaps signal stereotype leakage.
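
A counterfactual template test can be sketched as follows. The template, the pronoun list, and `toy_logprob` are hypothetical stand-ins: in practice `logprob_fn` would call whatever model API returns a sentence log-probability, and the toy scorer here is deliberately skewed so the gap is visible.

```python
from typing import Callable

TEMPLATE = "The {role} said that {pronoun} would be late."
PRONOUNS = ["he", "she", "they"]   # protected-attribute cue being swapped
ROLES = ["nurse", "engineer"]      # task content held constant per comparison

def counterfactual_gaps(logprob_fn: Callable[[str], float]) -> dict[str, float]:
    """For each role, the largest log-prob difference across pronoun swaps."""
    gaps = {}
    for role in ROLES:
        scores = [logprob_fn(TEMPLATE.format(role=role, pronoun=p)) for p in PRONOUNS]
        gaps[role] = max(scores) - min(scores)
    return gaps

def toy_logprob(sentence: str) -> float:
    """Stand-in for a real model call; hard-codes a stereotype for the demo."""
    if "nurse" in sentence and " she " in sentence:
        return -8.0
    if "engineer" in sentence and " he " in sentence:
        return -8.5
    return -10.0

gaps = counterfactual_gaps(toy_logprob)
print(gaps)
```

A gap near zero for a role suggests the model treats the swapped pronouns interchangeably in that template; what counts as "large" is a tolerance each team has to set per task.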

Allocative harm

Model outputs change access to resources: loan advice tone, hiring scores, moderation strikes, insurance triage suggestions, or support priority. Allocative harm demands group metrics, not only aggregate accuracy — a model can be 95% accurate overall while failing one subgroup systematically.
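
The gap between aggregate and subgroup accuracy is easy to reproduce on synthetic data. The sketch below uses made-up counts chosen so the overall figure lands on 95%; the group names and record layout are assumptions for illustration.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """records: iterable of (group, y_true, y_pred) tuples."""
    correct, total = defaultdict(int), defaultdict(int)
    for group, y_true, y_pred in records:
        total[group] += 1
        correct[group] += int(y_true == y_pred)
    return {g: correct[g] / total[g] for g in total}

# Synthetic data: 900 majority rows (891 correct), 100 minority rows (59 correct).
records = (
    [("majority", 1, 1)] * 891 + [("majority", 1, 0)] * 9
    + [("minority", 1, 1)] * 59 + [("minority", 1, 0)] * 41
)

overall = sum(t == p for _, t, p in records) / len(records)
per_group = accuracy_by_group(records)
print(overall, per_group)
```

The overall number hides a 40-point spread between the two groups, which is why allocative decisions need per-group reporting.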

Intersectional effects

Fairness along gender and race separately can hide harm at intersections (e.g., Black women). Slice evals across cross-products where sample size permits; bootstrap confidence intervals when counts are small.
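
For small intersectional slices, a percentile bootstrap gives a rough sense of how uncertain a rate is. This is a minimal standard-library sketch; the slice size, resample count, and seed are arbitrary choices for illustration.

```python
import random

def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of 0/1 outcomes in one slice."""
    rng = random.Random(seed)  # fixed seed so reruns are reproducible
    n = len(outcomes)
    # Resample the slice with replacement and record each resample's mean.
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical slice: 30 tickets, 6 of them escalated (point estimate 0.2).
small_slice = [1] * 6 + [0] * 24
lo, hi = bootstrap_ci(small_slice)
print(f"escalation rate 0.20, approx. 95% interval [{lo:.2f}, {hi:.2f}]")
```

When the interval is this wide, a difference between two slices may not be distinguishable from noise, which argues for collecting more data before drawing conclusions in either direction.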

Fairness metrics that survive production

No single metric captures fairness. Pick metrics tied to your decision type:

Demographic parity (statistical parity)

Positive outcome rate should be similar across groups: P(ŷ=1|A=a) ≈ P(ŷ=1|A=b). Simple to audit but ignores legitimate base-rate differences and can conflict with accuracy when prevalence differs.
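
As a sketch, the demographic parity gap is the spread in positive-prediction rates across groups. The function names and the toy data below are illustrative, not taken from a specific library.

```python
def positive_rates(groups, preds):
    """P(y_hat = 1 | A = a) for each group a."""
    counts, positives = {}, {}
    for g, p in zip(groups, preds):
        counts[g] = counts.get(g, 0) + 1
        positives[g] = positives.get(g, 0) + p
    return {g: positives[g] / counts[g] for g in counts}

def demographic_parity_gap(groups, preds):
    """Largest difference in positive rate between any two groups."""
    rates = positive_rates(groups, preds)
    return max(rates.values()) - min(rates.values())

groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
preds  = [1,   1,   0,   0,   1,   0,   0,   0]
dp_gap = demographic_parity_gap(groups, preds)
print(positive_rates(groups, preds), dp_gap)
```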

Equalized odds

Equal true-positive and false-positive rates across groups. Better for classification (fraud, toxicity, escalation) when mistakes in either direction carry cost.
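
A minimal sketch of the equalized-odds check computes TPR and FPR per group and reports the largest gap in each. The slice names and labels are toy data, and the code assumes every group contains at least one positive and one negative example.

```python
def tpr_fpr(y_true, y_pred):
    """True-positive rate and false-positive rate for one group."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    pos = sum(t == 1 for t in y_true)
    neg = sum(t == 0 for t in y_true)
    return tp / pos, fp / neg  # assumes pos > 0 and neg > 0

def equalized_odds_gaps(data):
    """data maps group -> (y_true, y_pred); returns max TPR gap and max FPR gap."""
    rates = {g: tpr_fpr(t, p) for g, (t, p) in data.items()}
    tprs = [r[0] for r in rates.values()]
    fprs = [r[1] for r in rates.values()]
    return {"tpr_gap": max(tprs) - min(tprs), "fpr_gap": max(fprs) - min(fprs)}

data = {
    "slice_1": ([1, 1, 0, 0, 0, 0], [1, 1, 0, 0, 0, 1]),  # TPR 1.0, FPR 0.25
    "slice_2": ([1, 1, 0, 0, 0, 0], [1, 0, 1, 1, 0, 0]),  # TPR 0.5, FPR 0.5
}
eo_gaps = equalized_odds_gaps(data)
print(eo_gaps)
```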

Calibration within groups

When scores represent probability (risk, confidence), calibration curves should align per group: among users scored 0.8, ~80% should truly belong to the positive class regardless of group.
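
Per-group calibration can be checked by binning scores and comparing the mean score in each bin with the observed positive rate. The sketch below uses equal-width bins and synthetic data; the bin count and group names are assumptions.

```python
def calibration_table(scores, labels, n_bins=5):
    """Rows of (mean score, observed positive rate, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[idx].append((s, y))
    table = []
    for rows in bins:
        if rows:
            mean_score = sum(s for s, _ in rows) / len(rows)
            observed = sum(y for _, y in rows) / len(rows)
            table.append((round(mean_score, 2), round(observed, 2), len(rows)))
    return table

# Synthetic example: both groups are scored 0.8, but outcomes differ.
table_x = calibration_table([0.8] * 10, [1] * 8 + [0] * 2)  # well calibrated
table_y = calibration_table([0.8] * 10, [1] * 5 + [0] * 5)  # over-confident
print(table_x, table_y)
```

In this toy case a score of 0.8 means roughly 80% for one group and 50% for the other, so the same threshold would carry different risk for each.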

Counterfactual fairness (LLM-specific)

Hold task content constant; vary only protected-attribute cues in names, pronouns, dialect markers, or locale. Measure change in label, score, toxicity flag, or refusal rate. This is the workhorse for generative systems without formal classifiers.
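
One way to summarize such a test is a flip rate: the share of counterfactual pairs where the decision changes. In this sketch `toy_decide` is a hypothetical stand-in for the real model or router, written to overreact to one dialect marker so the metric has something to detect; the example pairs are invented.

```python
from typing import Callable, Iterable, Tuple

def flip_rate(pairs: Iterable[Tuple[str, str]], decide: Callable[[str], str]) -> float:
    """Share of counterfactual pairs whose two variants get different decisions."""
    pairs = list(pairs)
    flips = sum(decide(a) != decide(b) for a, b in pairs)
    return flips / len(pairs)

def toy_decide(text: str) -> str:
    """Stand-in for a model call; deliberately keys on a dialect marker."""
    return "escalate" if "ain't" in text else "auto_reply"

pairs = [
    # Same complaint, different dialect markers.
    ("I been waiting all day and this ain't working.",
     "I have been waiting all day and this isn't working."),
    # Same request, different name cue.
    ("Hi, this is DeShawn. Please reset my password.",
     "Hi, this is Connor. Please reset my password."),
]
rate = flip_rate(pairs, toy_decide)
print(rate)
```

The same function works for labels, refusal flags, or bucketed scores, as long as `decide` returns something comparable with `!=`.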

Document which groups you can measure ethically and legally. Many jurisdictions restrict collecting sensitive attributes; proxy audits via synthetic personas are common but must be labeled as such in reports.

Evaluation: benchmarks, slices, and red teams

Aggregate benchmarks hide disparate impact. Build a three-layer eval program:

  • Public bias suites — BBQ, BOLD, WinoBias-style pronoun resolution, and stereotype questionnaires provide regression baselines. Track deltas per model version.
  • Application-specific counterfactual sets — 200–500 templated prompts mirroring real tasks (support tone, medical intake, code review) with attribute swaps. Store expected invariance tolerances per task.
  • Adversarial red teaming — LLM red teaming probes jailbreaks that elicit slurs or discriminatory reasoning chains. Fairness failures often appear only under stress prompts.

Human review design

Raters need rubrics anchored on equal treatment, not politeness alone. Rotate raters across demographic slices; measure inter-rater disagreement separately per slice. Use LLM-as-judge only with bias-mitigated judges (swap-order pairwise, anchor examples) and human calibration on sensitive slices.
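
Swap-order pairwise judging can be sketched as below: the judge sees each pair in both orders, and a verdict counts only if it is consistent. The `Judge` signature and the two toy judges are assumptions for illustration; a real judge would wrap an LLM call with anchor examples.

```python
from typing import Callable, Optional

# A judge takes (first_response, second_response) and returns "first" or "second".
Judge = Callable[[str, str], str]

def swap_order_verdict(judge: Judge, resp_x: str, resp_y: str) -> Optional[str]:
    """Return "x" or "y" if the judge agrees with itself in both orders, else None."""
    x_wins_forward = judge(resp_x, resp_y) == "first"
    x_wins_backward = judge(resp_y, resp_x) == "second"
    if x_wins_forward and x_wins_backward:
        return "x"
    if not x_wins_forward and not x_wins_backward:
        return "y"
    return None  # verdict depended on position; route to human review

def always_first(a: str, b: str) -> str:
    """Toy judge with pure position bias."""
    return "first"

def prefers_longer(a: str, b: str) -> str:
    """Toy judge that is consistent across orderings."""
    return "first" if len(a) > len(b) else "second"

print(swap_order_verdict(always_first, "short", "a longer reply"))
print(swap_order_verdict(prefers_longer, "short", "a longer reply"))
```

Tracking the share of `None` verdicts per sensitive slice gives a rough signal of where the judge is least reliable and where human calibration is most needed.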

Mitigation strategies across the lifecycle

Data-centric fixes

Upsample underrepresented dialects and domains; deduplicate toxic slur clusters that overweight hate associations; add counter-stereotypical examples in SFT (e.g., male nurse, female engineer in neutral contexts). Data fixes are slow but durable.

Alignment and constitution layers

Constitutional AI principles can explicitly require equal respect, refusal of discriminatory requests, and neutral tone across names and dialects. Pair with DPO preference pairs that reward fair completions over stereotypical ones.

RAG and retrieval fairness

Retrieval corpora may omit policies written for non-English speakers or regional regulations. Audit chunk coverage by locale; add metadata filters so support bots retrieve jurisdiction-appropriate answers. Poisoned or outdated HR docs are a common allocative failure mode.
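
A coverage audit and a locale filter can be sketched over chunk metadata. The `locale` field, the chunk IDs, and the minimum-count rule are hypothetical; real corpora will carry whatever metadata the ingestion pipeline attaches.

```python
from collections import Counter

def coverage_gaps(chunks, required_locales, min_chunks=1):
    """Locales with fewer than min_chunks chunks, mapped to their actual count."""
    counts = Counter(c["locale"] for c in chunks)
    return {
        loc: counts.get(loc, 0)
        for loc in required_locales
        if counts.get(loc, 0) < min_chunks
    }

def filter_by_locale(chunks, locale):
    """Keep only chunks tagged for the user's locale before ranking."""
    return [c for c in chunks if c["locale"] == locale]

# Hypothetical corpus metadata.
chunks = [
    {"id": "refund-us", "locale": "en-US"},
    {"id": "refund-uk", "locale": "en-GB"},
    {"id": "privacy-us", "locale": "en-US"},
]
corpus_gaps = coverage_gaps(chunks, ["en-US", "en-GB", "es-MX"], min_chunks=2)
print(corpus_gaps)
print([c["id"] for c in filter_by_locale(chunks, "en-GB")])
```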

Guardrails and post-processing

Classifier sidecars can block slurs and flag disparate escalation scores. Guardrails should log slice metrics, not only block rates. Avoid hard-coded name blocklists — they create false positives on legitimate cultural names.

Product and policy choices

The highest-leverage mitigation is sometimes not automating a high-stakes decision. Human review for adverse actions, transparent appeals, and user-visible confidence scores reduce harm when models remain imperfect.

Worked example: Harbor Support ticket routing fairness

Harbor Support routes incoming tickets to auto-reply, human tier-1, or trust-and-safety escalation using an LLM scorer for urgency, toxicity, and intent. Early rollout showed escalation rates 3.2× higher for messages containing African American English (AAE) lexical markers than for Standard American English (SAE) paraphrases of the same complaint.

Diagnosis

  • Toxicity model trained on social media over-flagged AAE as hostile.
  • Prompt included “professional tone” without dialect guidance.
  • The 0.72 toxicity threshold was tuned on an aggregate dev set that was 91% SAE.

Interventions

  • Built 400-pair counterfactual eval (SAE/AAE) from real ticket templates; target <5% escalation rate delta.
  • Fine-tuned toxicity head with dialect-balanced labels; the false-positive-rate (FPR) gap between AAE and SAE, tracked as part of equalized odds, narrowed from 18% to 4%.
  • Added constitution clause: “Do not treat dialect as hostility; score intent and threats only.”
  • Raised escalation threshold only after slice calibration; auto-reply band widened for neutral frustration.
  • Weekly dashboard: escalation rate, auto-reply satisfaction, and counterfactual delta per locale slice.

Result: overall escalation down 11% (fewer false positives), AAE/SAE gap under 2%, human tier-1 load stable. Documented residual risk in model card for legal review.

Mitigation decision table

| Symptom | Likely layer | First move |
| --- | --- | --- |
| Stereotypical role assignments in open generation | Pre-training + SFT | Counterfactual template eval; counter-stereotypical SFT examples |
| Different refusal rates by group on benign topics | Alignment / safety tuning | Audit safety preference data; constitutional principles for equitable refusal |
| Wrong policy answers by region or language | RAG corpus | Locale metadata, coverage audit, human translation QA on policy docs |
| Classifier false positives on dialect or names | Application scorer | Dialect-balanced training; equalized odds tuning; raise threshold with slice metrics |
| High-stakes ranking varies by proxy attribute | Product design | Remove automated ranking or require human sign-off; log counterfactual audits |

Common pitfalls

  • Fairness washing — publishing a BBQ score without application-specific slice evals.
  • Single-axis metrics — gender-only dashboards that miss race or disability interactions.
  • Erasure via over-refusal — blocking all mention of demographic topics harms users seeking legitimate help.
  • Synthetic persona overclaim — counterfactual names do not capture lived experience; complement with participatory review.
  • Threshold tuning on majority slice — global ROC optima hide minority FPR spikes.
  • Ignoring feedback loops — escalated users get worse service, generating angrier follow-ups that confirm bias.
  • Legal collection mistakes — storing sensitive attributes without consent blocks compliant fairness measurement.
  • One-shot debias prompt — “Be unbiased” in system prompt rarely moves allocative metrics measurably.

Production checklist

  • Classify each model output as representational, stereotyping, or allocative risk.
  • Build 200+ counterfactual prompts from real task templates with attribute swaps.
  • Define acceptable delta thresholds per metric before launch (document exceptions).
  • Run public bias suites plus app-specific slices on every model promotion.
  • Calibrate LLM judges with swap-order pairwise and anchor rubrics on sensitive slices.
  • Audit RAG corpora for locale, language, and policy coverage gaps.
  • Dashboard escalation, refusal, and satisfaction rates by slice weekly.
  • Publish an internal model card: training data limits, eval gaps, residual risks.
  • Establish human appeal path for any adverse automated action.
  • Re-run fairness evals after prompt, threshold, or retrieval changes — not only weight updates.
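
Several checklist items can be tied together as a promotion gate that compares measured deltas with tolerances agreed before launch. This is a sketch under stated assumptions: the metric names and most limits are hypothetical, with only the 5% escalation delta echoing the target from the worked example.

```python
# Hypothetical tolerances, agreed before launch and stored with the eval set.
TOLERANCES = {
    "escalation_rate_delta": 0.05,
    "refusal_rate_delta": 0.03,
    "fpr_gap": 0.05,
}

def fairness_gate(measured, tolerances=TOLERANCES):
    """Return metrics that are missing or outside tolerance; empty dict means pass."""
    failures = {}
    for metric, limit in tolerances.items():
        value = measured.get(metric)
        # A metric that was never measured is treated as a failure, not a pass.
        if value is None or abs(value) > limit:
            failures[metric] = value
    return failures

measured = {"escalation_rate_delta": 0.02, "fpr_gap": 0.07}
failures = fairness_gate(measured)
print(failures)
```

Running the same gate after prompt, threshold, or retrieval changes keeps those edits under the same scrutiny as weight updates.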

Key takeaways

  • Bias is stack-shaped — data, alignment, retrieval, and thresholds compound.
  • Counterfactual evals are the generative-model equivalent of subgroup metrics.
  • Allocative harm requires equalized odds thinking, not accuracy alone.
  • Mitigation mixes data, alignment, guardrails, and product policy — no silver prompt.
  • Measure continuously; fairness regressions ship silently without slice dashboards.
