Explainable AI explained

A credit model rejects an application. A fraud detector flags a transaction. A medical classifier suggests a diagnosis. In each case someone will ask why — a customer, a regulator, an appeals reviewer, or the engineer debugging a spike in false positives. Explainable AI (XAI) is the set of methods that translate model outputs into human-readable reasons: which features pushed the score up or down, for this specific input or across the whole dataset. It sits between opaque deep learning stacks and the accountability requirements of production ML. This guide covers global vs local explanations, permutation and tree-based importance, SHAP Shapley values, LIME surrogate models, vision-specific Grad-CAM, when explanations mislead, regulatory context, and a checklist for shipping interpretability alongside gradient boosting and neural models in the real world.

Why interpretability matters in production

Not every model needs a paragraph of reasoning attached to every prediction. A recommendation ranker behind a "You might also like" carousel can ship with aggregate monitoring alone. But high-stakes domains — lending, insurance, hiring, healthcare triage, content moderation with appeals — carry legal, ethical, and operational pressure to justify individual decisions.

Interpretability serves four distinct audiences:

End users and customers — "Why was I denied?" requires a plain-language summary, not a 50-feature SHAP waterfall.
Compliance and risk — auditors want evidence that protected attributes (race, gender, age proxies) are not driving outcomes through correlated features.
Engineers and data scientists — debugging requires knowing whether the model latched onto a spurious shortcut (hospital ID predicting mortality because sicker patients went to certain hospitals).
Product and policy — stakeholders decide whether to override, retrain, or retire a model based on whether its reasoning aligns with business rules.

XAI is complementary to, not a substitute for, good feature engineering and rigorous evaluation metrics. A model with perfect accuracy on a biased dataset can produce explanations that look convincing while faithfully describing biased logic. Explanations audit the model's logic; they do not prove the data was fair.

Global vs local explanations

The first fork in any XAI project is scope:

Global explanations

Answer "what does this model care about overall?" Examples: a ranked bar chart of average feature importance across 100,000 loan applications, or a partial dependence plot showing how predicted default risk rises as debt-to-income increases. Global views guide feature selection, communicate model behavior to executives, and catch features that should never dominate (e.g., ZIP code as a race proxy).
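As a rough illustration of the partial dependence idea, the sketch below forces one feature to each value on a grid for every row and averages the predictions. The function name partial_dependence, the toy_predict model, and the feature names dti and delinquencies are hypothetical and chosen only for this example.

```python
from statistics import mean

def partial_dependence(predict, rows, feature, grid):
    """Average prediction when `feature` is forced to each grid value."""
    curve = []
    for value in grid:
        # Overwrite the feature for every row; other columns stay as observed.
        forced = [{**row, feature: value} for row in rows]
        curve.append(mean(predict(r) for r in forced))
    return curve

# Hypothetical toy risk model: risk rises with debt-to-income (dti).
def toy_predict(row):
    return min(1.0, 0.1 + 0.8 * row["dti"] + 0.05 * row["delinquencies"])

rows = [
    {"dti": 0.20, "delinquencies": 0},
    {"dti": 0.50, "delinquencies": 2},
    {"dti": 0.35, "delinquencies": 1},
]
grid = [0.1, 0.3, 0.5]
pd_curve = partial_dependence(toy_predict, rows, "dti", grid)
for value, avg in zip(grid, pd_curve):
    print(f"dti={value:.1f} -> average predicted risk {avg:.2f}")
```

Because every row is forced to the same grid value, the curve can include combinations that never occur in real data, which is worth keeping in mind when features are strongly correlated.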

Local explanations

Answer "why this prediction for this row?" A SHAP force plot for applicant #48291 might show: +0.12 from high utilization, +0.08 from recent delinquency, -0.05 from long account tenure. Local explanations power adverse-action notices, customer support lookups, and appeals workflows. They must be stable (similar inputs produce similar explanations) and faithful (the explanation reflects what the model actually computed, not a post-hoc story).

Most production systems need both: global monitoring dashboards for drift and fairness, plus per-decision artifacts stored alongside each prediction for audit retention.

Intrinsic vs post-hoc methods

Intrinsically interpretable models wear their logic on the surface: linear regression coefficients, logistic regression weights, shallow decision trees, GAMs (generalized additive models), and rule lists. You can read the formula. The trade-off is capacity — a linear model may underfit complex patterns that XGBoost captures easily.

Post-hoc explainers wrap any black-box model: neural nets, large ensembles, LLM classifiers. They approximate or probe the model after training. SHAP, LIME, and Grad-CAM are post-hoc. The model stays unchanged; the explainer is a separate analysis layer. This is the common path when accuracy demands complexity but regulations demand reasons.

Feature importance: permutation and tree-based

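The simplest global method is permutation importance: shuffle one feature column while holding the others fixed, measure how much a validation metric (AUC, RMSE) degrades, and repeat over several shuffles or folds to average out noise. Features whose shuffle hurts performance most are "important." It is model-agnostic and easy to implement, but slow on wide datasets and misleading when features are correlated: shuffling feature A alone creates unrealistic rows, and if a correlated feature B carries the same signal the model can lean on B, so A looks less important than it is. The score for A also folds in any interaction effects involving A, so it cannot tell you whether A matters on its own or only when B is high.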
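A minimal sketch of the shuffle-and-measure loop follows, using only the standard library. The toy_model, the mean squared error metric, and the parameter names n_repeats and seed are assumptions for illustration; a real project would more likely use a library implementation on a held-out validation set.

```python
import random

def mse(predict, X, y):
    """Mean squared error of `predict` over rows X with targets y."""
    return sum((predict(row) - t) ** 2 for row, t in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Average increase in MSE when each column is shuffled on its own."""
    rng = random.Random(seed)
    baseline = mse(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild rows with only column j replaced by its shuffled values.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(predict, X_perm, y) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

# Hypothetical model: depends strongly on feature 0, weakly on 1, not at all on 2.
def toy_model(row):
    return 3.0 * row[0] + 0.5 * row[1]

data_rng = random.Random(42)
X = [[data_rng.random() for _ in range(3)] for _ in range(200)]
y = [toy_model(row) for row in X]

importances = permutation_importance(toy_model, X, y)
print(importances)
```

In this toy setup the unused third feature gets an importance of exactly zero, because shuffling it never changes a prediction.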

Tree ensembles expose split-based importance natively: count how often a feature is used to split nodes, weighted by improvement in impurity or gain. XGBoost and LightGBM report feature_importances_ out of the box. Fast and useful for engineering triage, but biased toward high-cardinality features and correlated groups — if zip code and latitude both split often, importance is split arbitrarily between them.

Use permutation importance when you need model-agnostic global rankings; use split importance for quick iteration during training; graduate to SHAP when stakeholders need theoretically grounded attribution.

SHAP: Shapley values for ML

SHAP (SHapley Additive exPlanations) assigns each feature a contribution score for a given prediction, grounded in cooperative game theory. The Shapley value answers: "how much does adding feature X change the prediction, averaged over all possible orderings in which features could be added, starting from the average prediction?" Properties that make SHAP popular:

Local accuracy: SHAP values sum to the difference between the prediction and a baseline (the expected model output).
Consistency: if a model changes so that feature X's marginal contribution increases or stays the same regardless of the other features, X's attribution does not decrease.
Missingness: absent features get zero attribution.
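To make the game-theoretic definition concrete, here is a small sketch that computes exact Shapley values by brute force over every feature ordering. It assumes an "absent" feature is represented by a single baseline value, which is a simplification: SHAP implementations typically average over a background dataset instead. The toy_model and the function name shapley_values are hypothetical, and this enumeration is only feasible for a handful of features.

```python
from itertools import permutations

def shapley_values(predict, x, baseline):
    """Exact Shapley values: average marginal contribution over all orderings.

    Assumption: a feature that is 'absent' takes its value from `baseline`.
    """
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        prev = predict(current)
        for i in order:
            current[i] = x[i]          # feature i joins the coalition
            new = predict(current)
            phi[i] += new - prev       # its marginal contribution in this ordering
            prev = new
    return [total / len(orderings) for total in phi]

# Hypothetical model with an interaction between features 0 and 1;
# feature 2 has no effect on the output.
def toy_model(v):
    return 2.0 * v[0] + v[1] + v[0] * v[1] + 0.0 * v[2]

x = [1.0, 2.0, 5.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)

# Local accuracy: attributions sum to prediction minus baseline prediction.
gap = toy_model(x) - toy_model(baseline)
print(sum(phi), gap)
```

The interaction term is split evenly between the two features that take part in it, and the feature the model ignores receives zero, while the attributions still add up to the gap between the prediction and the baseline.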

Implementation variants matter for speed:

TreeSHAP: exact and fast for tree ensembles (XGBoost, LightGBM, random forests).
KernelSHAP: model-agnostic, samples coalitions of features; slower and approximate.
DeepSHAP: tailored to neural networks; builds on DeepLIFT-style backpropagation of contributions, so it is approximate.

Typical workflow: build the explainer on a representative background sample (100–1,000 rows) that defines the baseline, store shap_values per prediction in your inference pipeline, render waterfall or force plots for appeals, and aggregate mean |SHAP| for global importance bar charts. Watch for correlated features: SHAP may spread credit arbitrarily between collinear inputs unless you group them or use hierarchical clustering on SHAP correlations.
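The aggregation step can be sketched without any SHAP library, assuming per-row attributions have already been computed and stored. The feature names and numbers below are made up for illustration, and mean_abs_shap is a hypothetical helper name.

```python
def mean_abs_shap(shap_rows, feature_names):
    """Rank features by mean absolute attribution across stored rows."""
    n = len(shap_rows)
    totals = [0.0] * len(feature_names)
    for row in shap_rows:
        for j, value in enumerate(row):
            totals[j] += abs(value)
    ranking = [(name, total / n) for name, total in zip(feature_names, totals)]
    return sorted(ranking, key=lambda pair: pair[1], reverse=True)

# Illustrative per-prediction SHAP values (one list per scored row).
feature_names = ["utilization", "recent_delinquency", "account_tenure"]
shap_rows = [
    [0.12, 0.08, -0.05],
    [0.02, 0.00, -0.20],
    [-0.10, 0.04, -0.05],
]

global_ranking = mean_abs_shap(shap_rows, feature_names)
for name, score in global_ranking:
    print(f"{name}: {score:.3f}")
```

Taking the absolute value before averaging matters: a feature that pushes some rows up and others down would otherwise cancel out and look unimportant.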

LIME: local interpretable surrogates

LIME (Local Interpretable Model-agnostic Explanations) takes a different approach: for one instance, perturb the input (flip words, mask image patches, jitter numeric features), query the black-box model on thousands of neighbors, and fit a simple linear model weighted by proximity to the original point. The linear coefficients become the "explanation."
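The perturb, query, and fit loop can be sketched for numeric features as below. This is a simplified illustration under stated assumptions (Gaussian jitter, an exponential proximity kernel, plain weighted least squares) and not the lime package's implementation, which differs in details such as sampling and regularization. The names lime_tabular, solve, black_box, scale, and kernel_width are hypothetical.

```python
import math
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def lime_tabular(predict, x, n_samples=500, scale=0.5, kernel_width=0.75, seed=0):
    """Fit a proximity-weighted linear surrogate around the point x."""
    rng = random.Random(seed)
    d = len(x)
    samples, targets, weights = [], [], []
    for _ in range(n_samples):
        z = [xi + rng.gauss(0.0, scale) for xi in x]      # perturbed neighbor
        dist2 = sum((zi - xi) ** 2 for zi, xi in zip(z, x))
        samples.append([1.0] + z)                          # leading 1.0 = intercept
        targets.append(predict(z))                         # query the black box
        weights.append(math.exp(-dist2 / kernel_width ** 2))
    # Weighted normal equations: (Z^T W Z) beta = Z^T W y
    A = [[sum(w * s[i] * s[j] for s, w in zip(samples, weights))
          for j in range(d + 1)] for i in range(d + 1)]
    b = [sum(w * s[i] * t for s, t, w in zip(samples, targets, weights))
         for i in range(d + 1)]
    beta = solve(A, b)
    return beta[0], beta[1:]

# Hypothetical black box: curved in feature 0, linear in feature 1.
def black_box(z):
    return z[0] ** 2 + 3.0 * z[1]

intercept, coefs = lime_tabular(black_box, [1.0, 0.0])
print(coefs)  # local slopes near the point (1.0, 0.0)
```

Changing seed, scale, or kernel_width shifts the fitted coefficients, which is a small-scale view of the instability discussed next.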

LIME is intuitive and works on text, images, and tabular data. Weaknesses:

Instability — different random seeds or perturbation ranges can flip sign on borderline features.
Fidelity gap — the local linear surrogate may not match the true decision boundary in high-curvature regions.
Discrete perturbations — for text, removing a word changes semantics; for tabular, unrealistic combinations may lie far from the training manifold.

In practice, SHAP has largely superseded LIME for tabular production use because of better theoretical guarantees and TreeSHAP performance. LIME remains useful for quick prototyping, teaching, and certain NLP/image demos where perturbation is natural.

Vision and language-specific explainers

Grad-CAM and saliency maps

For convolutional image classifiers, Grad-CAM (Gradient-weighted Class Activation Mapping) highlights which spatial regions influenced the predicted class by backpropagating gradients to the final convolutional feature maps. Heatmaps overlay on the input image — useful for radiology QA ("did the model look at the lesion or the scanner artifact?"). Extensions like Grad-CAM++ improve localization on multiple instances per image.
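The core Grad-CAM arithmetic is small enough to sketch on nested lists: average each channel's gradients to get a channel weight, take the weighted sum of the activation maps, then apply ReLU. The sketch assumes the activations and gradients of the final convolutional layer have already been extracted from a framework; the tiny 2x2 arrays are made up, and the final scaling to the range 0 to 1 is a common display convention rather than part of the definition.

```python
def grad_cam(activations, gradients):
    """Grad-CAM heatmap from [channel][row][col] activations and gradients.

    `gradients` holds d(class score)/d(activation) for the same layer.
    """
    height = len(activations[0])
    width = len(activations[0][0])
    # Channel weights: global average pooling of the gradients.
    alphas = [sum(sum(row) for row in g) / (height * width) for g in gradients]
    cam = [[0.0] * width for _ in range(height)]
    for alpha, fmap in zip(alphas, activations):
        for i in range(height):
            for j in range(width):
                cam[i][j] += alpha * fmap[i][j]
    # ReLU keeps only regions that support the class.
    cam = [[max(0.0, v) for v in row] for row in cam]
    # Optional scaling to [0, 1] for overlaying on the image.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam

# Illustrative 2-channel, 2x2 feature maps.
activations = [
    [[1.0, 0.0], [0.0, 0.0]],   # channel 0 fires top-left
    [[0.0, 0.0], [0.0, 1.0]],   # channel 1 fires bottom-right
]
gradients = [
    [[1.0, 1.0], [1.0, 1.0]],       # channel 0 supports the class
    [[-1.0, -1.0], [-1.0, -1.0]],   # channel 1 argues against it
]

heatmap = grad_cam(activations, gradients)
print(heatmap)
```

In practice the resulting low-resolution map is upsampled to the input size before it is overlaid on the image.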

Attention visualization

Transformer models expose attention weight matrices. Visualizing which tokens attend to which can suggest what an LLM "focused on" — but attention weights are not guaranteed to equal feature importance (research shows attention and gradient-based attributions can disagree). Treat attention maps as hypotheses, not ground truth.

Token attribution for NLP

Integrated gradients, SHAP on embeddings, and LIME with word removal assign per-token scores for classification or toxicity detection. Essential for content moderation appeals where a user challenges which phrase triggered a flag.
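As one illustration of per-token scoring, the sketch below applies integrated gradients to a toy logistic classifier over token-presence features. The weights, bias, tokens, and all-zeros baseline are invented for the example; real NLP models would attribute over embeddings using a framework's autograd rather than a hand-written gradient.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def integrated_gradients(weights, bias, x, baseline, steps=200):
    """Integrated gradients for F(x) = sigmoid(w . x + b).

    Approximates the path integral from baseline to x with a midpoint sum.
    """
    grad_sums = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        p = sigmoid(sum(w * v for w, v in zip(weights, point)) + bias)
        for i, w in enumerate(weights):
            grad_sums[i] += p * (1.0 - p) * w   # dF/dx_i at this point
    return [(xi - b) * g / steps for xi, b, g in zip(x, baseline, grad_sums)]

# Hypothetical toxicity classifier over three token-presence features.
tokens = ["you", "are", "awful"]
weights = [0.1, 0.0, 2.5]
bias = -1.0
x = [1.0, 1.0, 1.0]          # all three tokens present
baseline = [0.0, 0.0, 0.0]   # assumed baseline: no tokens present

attributions = integrated_gradients(weights, bias, x, baseline)
for token, score in zip(tokens, attributions):
    print(f"{token}: {score:.4f}")

# Completeness: attributions should sum to F(x) - F(baseline).
score_x = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
score_base = sigmoid(sum(w * v for w, v in zip(weights, baseline)) + bias)
print(sum(attributions), score_x - score_base)
```

The choice of baseline is a modeling decision: a different baseline (for example, a padding token) can change which tokens receive credit.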

When explanations lie — and how to catch it

Explanations are models of models. They can mislead:

Spurious correlations — a pneumonia detector may highlight "portable chest X-ray" markings because training data correlated device type with severity, not because the device caused the diagnosis.
Adversarial fragility — tiny input changes can flip predictions while explanations stay superficially similar.
Fairness proxies — zip code, school name, or browser language may encode protected classes; low direct attribution on "race" does not mean the model is fair.
Explanation cherry-picking — showing only favorable local plots while global SHAP reveals a problematic dominant feature.

Mitigations: run global and subgroup analyses (slice SHAP by demographic cohort), compare multiple explainers and flag disagreement, use proper holdout sets so explanations are not tuned on test data, and pair XAI with drift monitoring so importance rankings are re-checked as distributions shift.
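One way to operationalize "compare multiple explainers and flag disagreement" is to compare their top-k feature sets and check for sign flips. The sketch below is a hypothetical heuristic, not a standard metric: the function names, the Jaccard threshold min_overlap, and the example attributions are all assumptions.

```python
def top_k(attributions, k):
    """Names of the k features with the largest absolute attribution."""
    ranked = sorted(attributions, key=lambda name: abs(attributions[name]), reverse=True)
    return set(ranked[:k])

def explainers_disagree(attr_a, attr_b, k=3, min_overlap=0.6):
    """Flag a prediction when two explainers tell different stories.

    Returns (flagged, jaccard_overlap, features_with_opposite_sign).
    """
    a, b = top_k(attr_a, k), top_k(attr_b, k)
    jaccard = len(a & b) / len(a | b)
    sign_flips = sorted(name for name in a & b if attr_a[name] * attr_b[name] < 0)
    flagged = jaccard < min_overlap or bool(sign_flips)
    return flagged, jaccard, sign_flips

# Illustrative attributions for the same row from two explainers.
shap_attr = {"utilization": 0.12, "delinquency": 0.08, "tenure": -0.05, "zip_code": 0.01}
lime_attr = {"utilization": 0.10, "delinquency": -0.02, "tenure": -0.06, "zip_code": 0.09}

flagged, overlap, flips = explainers_disagree(shap_attr, lime_attr)
print(flagged, overlap, flips)
```

Here the two explainers share only two of their top three features, so the row is flagged for review even though no shared feature changes sign.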

Shipping XAI in an inference pipeline

Explanations belong in the same artifact store as predictions:

At training time — compute global SHAP on validation data; document top features and subgroup slices; reject models where forbidden features rank high.
At inference time — for high-stakes rows, compute TreeSHAP or precomputed lookup tables; budget latency (TreeSHAP on 50 trees × 20 features is milliseconds; KernelSHAP on wide nets is not).
Storage — persist JSON blobs: {prediction, baseline, shap_values, top_k_features, model_version, explainer_version} with retention matching regulatory requirements.
Human layer — map raw feature names to customer-facing copy ("high credit utilization" not util_ratio_90d); never expose internal proxy features that reveal sensitive attributes.
Monitoring — track distribution of top attributions over time; alert when a feature's mean |SHAP| jumps — often a leading indicator of concept drift.
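A per-decision artifact with the fields listed under Storage might be assembled as in the sketch below. The function name, the version strings, and the additivity check before persisting are assumptions added for illustration; the check simply refuses to store attributions that do not sum to the prediction minus the baseline.

```python
import json

def build_explanation_artifact(prediction, baseline, shap_values,
                               model_version, explainer_version,
                               top_k=3, tol=1e-6):
    """Assemble a JSON-serializable explanation record for one prediction."""
    # Sanity check (assumed policy): attributions must be additive.
    if abs(baseline + sum(shap_values.values()) - prediction) > tol:
        raise ValueError("SHAP values do not sum to prediction - baseline")
    ranked = sorted(shap_values, key=lambda name: abs(shap_values[name]), reverse=True)
    return {
        "prediction": prediction,
        "baseline": baseline,
        "shap_values": shap_values,
        "top_k_features": ranked[:top_k],
        "model_version": model_version,
        "explainer_version": explainer_version,
    }

artifact = build_explanation_artifact(
    prediction=0.45,
    baseline=0.30,
    shap_values={"utilization": 0.12, "recent_delinquency": 0.08, "account_tenure": -0.05},
    model_version="credit-model-v7",       # hypothetical version ID
    explainer_version="treeshap-config-3", # hypothetical version ID
    top_k=2,
)
blob = json.dumps(artifact, sort_keys=True)
print(blob)
```

Storing the baseline and both version IDs alongside the values is what lets an auditor reproduce the explanation later, after the model or explainer has changed.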

Production checklist

Classify each model endpoint by stakes — XAI required, optional, or dashboard-only.
Choose a global method (permutation importance, mean |SHAP|, partial dependence) and a local method (TreeSHAP, KernelSHAP) per model family.
Fix a background dataset for baselines; version it with the model.
Run subgroup fairness slices on top global and local attributions.
Validate explanations on known counterfactual cases (flip one feature, check attribution sign).
Store per-decision explanation artifacts with model and explainer version IDs.
Build customer-facing adverse-action templates from only the top-k attributions that pushed the score toward the adverse outcome.
Monitor mean |SHAP| per feature weekly; tie alerts to retraining triggers.
Document known limitations (correlated features, surrogate fidelity) for auditors.
Re-run explanation audits after every material retrain or feature change.
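The weekly attribution monitoring item can be sketched as a comparison between a reference window and the current window of mean |SHAP| per feature. The relative threshold of 50%, the function name, and the numbers are assumptions for illustration; a production alert would be tuned to the model's normal week-to-week variation.

```python
def attribution_drift_alerts(reference, current, rel_threshold=0.5):
    """Return features whose mean |SHAP| moved by more than rel_threshold.

    `reference` and `current` map feature name -> mean |SHAP| for a window.
    """
    alerts = []
    for feature, ref in reference.items():
        cur = current.get(feature, 0.0)
        if ref == 0.0:
            changed = cur > 0.0   # feature went from unused to used
        else:
            changed = abs(cur - ref) / ref > rel_threshold
        if changed:
            alerts.append(feature)
    return alerts

# Illustrative weekly aggregates.
reference_week = {"utilization": 0.08, "tenure": 0.10, "zip_code": 0.01}
current_week = {"utilization": 0.09, "tenure": 0.04, "zip_code": 0.05}

alerts = attribution_drift_alerts(reference_week, current_week)
print(alerts)
```

A jump in a proxy-like feature such as the hypothetical zip_code above is the kind of change that would warrant a fairness re-audit as well as a retraining review.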

Key takeaways

Global tells you what the model uses; local tells you why this row — production needs both.
SHAP is the default for tabular post-hoc XAI — TreeSHAP is fast and theoretically grounded; use KernelSHAP when the model is not tree-based.
LIME is a teaching and prototyping tool — instability limits regulatory-grade use without careful validation.
Explanations do not prove fairness — audit subgroups, proxies, and data lineage alongside attributions.
Treat explainers as versioned pipeline components — store artifacts, monitor attribution drift, and re-audit after retrains.