Guide

LLM agent dead letter queue and poison message handling systems explained

Harbor Compliance sold automated KYC review: when a customer uploaded a passport scan, a webhook enqueued an agent job to extract fields, check sanctions lists, and open a case in their CRM. One integrator shipped a batch of PDFs with a corrupted XMP metadata block that crashed the document parser on every attempt. The worker retried with exponential backoff but never classified the failure as permanent. That single poison job consumed retry slots and visibility timeouts until 890 legitimate onboarding jobs sat stalled behind it for six hours. Mean time to first agent action for healthy uploads rose from 12 seconds to 41 minutes; support tickets spiked 38%. After Harbor added agent-aware dead letter queues (DLQs), failure taxonomy, and governed replay, poison-induced backlog incidents fell to 2% of queue outages (down from 61%).

A dead letter queue is where jobs land after repeated failure so they cannot block live traffic — but agent workloads are not ordinary messages. Runs may have partially executed write tools, consumed thousands of tokens, and left durable checkpoints. This guide covers why generic DLQ patterns are insufficient, how to classify agent failures, what metadata to capture, safe operator replay, integration with retry policy and tracing, the Harbor refactor, a technique decision table, pitfalls, and a production checklist.

Why agent poison messages are worse than CRUD poison pills

In a typical microservice, a poison message might fail JSON parsing ten times and waste a few milliseconds per attempt. Agent jobs differ:

Cost — each retry may invoke the model, run sandboxes, and call paid APIs for 30–120 seconds.
Partial side effects — the agent may have created a CRM case or sent a Slack alert before crashing on step four.
Durable state — checkpoints and WAL entries survive retries; blind replay can skip or duplicate steps.
Tenant blast radius — one poison job tied to a shared worker pool can delay every customer in the queue.
Ambiguous errors — a 503 from a tool looks transient but may be a permanent misconfiguration if the API key scope is wrong.

Agent platforms need DLQs that carry run context, not just the raw queue payload. Operators must answer: what failed, what already succeeded, and is replay safe?

Failure taxonomy: retry, fail fast, or DLQ

Before configuring max receive counts, classify errors at the worker boundary:

Transient (retry with backoff)

Model provider 429/503 with retry-after headers
Network timeouts to known-good endpoints
Database deadlock or serialization failure
Worker OOM kill mid-run (replay from last checkpoint)

Permanent (fail fast to DLQ)

Schema validation: missing required fields in normalized event envelope
Auth errors: 401/403 from tools that will not self-heal
Unsupported file type or corrupt document after two parse attempts
Policy blocks: agent graph rejects action before any write tool

Poison (immediate DLQ + circuit)

Same error signature N times in a row with zero progress in checkpoint index
Infinite tool loop detected by step budget middleware
Deterministic crash (e.g. UnicodeDecodeError) on identical input hash
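
The poison signals above can be approximated with a small streak tracker. The sketch below is illustrative only: FailureStreak, record_failure, and is_poison are hypothetical names, the signature string is assumed to combine an error code with an input hash, and the default threshold of 3 stands in for N.

```python
from dataclasses import dataclass


@dataclass
class FailureStreak:
    """Consecutive identical failures for one job (hypothetical helper)."""
    signature: str = ""
    checkpoint_step: int = -1
    count: int = 0


def record_failure(streak: FailureStreak, signature: str, checkpoint_step: int) -> int:
    """Update the streak and return the number of consecutive no-progress repeats."""
    same_error = signature == streak.signature
    no_progress = checkpoint_step <= streak.checkpoint_step
    if same_error and no_progress:
        streak.count += 1
    else:
        # A different error or an advanced checkpoint starts a new streak.
        streak.signature = signature
        streak.count = 1
    streak.checkpoint_step = max(streak.checkpoint_step, checkpoint_step)
    return streak.count


def is_poison(streak: FailureStreak, threshold: int = 3) -> bool:
    """True once the same error has repeated `threshold` times without progress."""
    return streak.count >= threshold
```

Resetting the streak whenever the checkpoint advances keeps slow but progressing runs out of the poison lane.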

Encode this classification in worker middleware hooks, in line with your hook pipeline design. Return structured errors such as { "kind": "permanent", "code": "DOC_PARSE_CORRUPT", "retryable": false } so the queue layer does not have to guess.
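
A minimal sketch of such a classifier follows. The exception classes ToolHTTPError and CorruptDocumentError, and every code other than DOC_PARSE_CORRUPT, are assumptions made for illustration. Treating unknown errors as transient is also a design choice of this sketch, not something the taxonomy prescribes.

```python
from typing import TypedDict


class ClassifiedError(TypedDict):
    kind: str        # "transient" or "permanent"
    code: str
    retryable: bool


class ToolHTTPError(Exception):
    """Hypothetical exception raised by a tool client on a non-2xx response."""

    def __init__(self, status: int) -> None:
        super().__init__(f"tool returned HTTP {status}")
        self.status = status


class CorruptDocumentError(Exception):
    """Hypothetical exception raised by the document parser."""


def classify(exc: Exception) -> ClassifiedError:
    """Map an exception to the structured error shape the queue layer consumes."""
    if isinstance(exc, CorruptDocumentError):
        return {"kind": "permanent", "code": "DOC_PARSE_CORRUPT", "retryable": False}
    if isinstance(exc, ToolHTTPError):
        if exc.status in (401, 403):
            return {"kind": "permanent", "code": "TOOL_AUTH", "retryable": False}
        if exc.status in (429, 503):
            return {"kind": "transient", "code": "TOOL_UNAVAILABLE", "retryable": True}
    if isinstance(exc, TimeoutError):
        return {"kind": "transient", "code": "NETWORK_TIMEOUT", "retryable": True}
    # Unknown errors stay retryable here; run-level budgets bound the cost.
    return {"kind": "transient", "code": "UNCLASSIFIED", "retryable": True}
```

Poison detection is deliberately left out of this function, since it depends on history across attempts rather than on a single exception.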

Agent DLQ record: what to capture

When a job moves to DLQ, persist an operator-facing record beyond the broker's default dead-letter envelope:

{
  "dlq_id": "dlq_7c91…",
  "job_id": "job_44fa…",
  "run_id": "run_9b02…",
  "tenant_id": "tnt_8f2a…",
  "source": "webhook:kyc.uploaded",
  "attempt_count": 6,
  "last_error": { "kind": "permanent", "code": "DOC_PARSE_CORRUPT" },
  "checkpoint_step": 2,
  "side_effects_applied": ["crm.case_created:case_881"],
  "token_spend_usd": 0.47,
  "trace_url": "https://…/traces/run_9b02",
  "payload_snapshot_ref": "s3://…/encrypted",
  "moved_at": "2026-06-12T14:22:01Z"
}

Link job_id → run_id → trace_id so on-call engineers jump from a DLQ row to the exact tool span that failed. Store encrypted payload snapshots for replay, but redact PII in the DLQ UI per tenant policy.
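
One way to model the record and its redacted operator view is sketched below. The field names mirror the sample record above; DLQRecord, operator_view, the hidden_fields parameter, and the "[redacted]" marker are hypothetical, and a real tenant policy would likely be richer than a set of field names.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


def _utc_now() -> str:
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


@dataclass
class DLQRecord:
    dlq_id: str
    job_id: str
    run_id: str
    tenant_id: str
    source: str
    attempt_count: int
    last_error: dict
    checkpoint_step: int
    side_effects_applied: list[str]
    token_spend_usd: float
    trace_url: str
    payload_snapshot_ref: str
    moved_at: str = field(default_factory=_utc_now)


def operator_view(record: DLQRecord, hidden_fields: frozenset[str]) -> dict:
    """Return the record as shown in the DLQ UI, masking fields the tenant policy hides."""
    view = asdict(record)
    for name in hidden_fields:
        if name in view:
            view[name] = "[redacted]"
    return view
```

The stored record stays complete for replay; only the view handed to the UI is masked.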

Retry budgets and max receive counts

Generic queues use maxReceiveCount (SQS) or x-death headers (RabbitMQ). Agent platforms should add run-level budgets:

Attempt cap — e.g. 5 worker leases per job_id; increment only when checkpoint index does not advance.
Token budget — stop retrying after $X spend on the same job unless operator approves.
Time window — if first failure was 24 hours ago and error is unchanged, DLQ instead of retry 47.
Global poison circuit — if >10% of jobs in a tenant queue hit the same error code in 5 minutes, pause dequeue and page on-call (likely bad deploy or integration change).

Align with rate limits: retries should not bypass tenant concurrency caps or flood a failing vendor.
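
The attempt, token, and time budgets can be combined into a single decision function. This is a sketch under stated assumptions: JobState, RetryBudget, next_action, and count_attempt are hypothetical names, and the 2.00 USD token limit is a placeholder for the "$X" above. The operator-approval override is not modeled.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class JobState:
    stalled_attempts: int        # attempts that did not advance the checkpoint
    token_spend_usd: float
    first_failure_at: datetime
    error_unchanged: bool        # latest error code equals the first one


@dataclass(frozen=True)
class RetryBudget:
    max_stalled_attempts: int = 5
    max_token_spend_usd: float = 2.00          # placeholder value
    max_failure_age: timedelta = timedelta(hours=24)


def count_attempt(stalled_attempts: int, step_before: int, step_after: int) -> int:
    """Increment the counter only when the checkpoint index did not advance."""
    return stalled_attempts if step_after > step_before else stalled_attempts + 1


def next_action(job: JobState, budget: RetryBudget, now: datetime) -> str:
    """Return "retry" or "dlq" for a job whose last failure was classified as transient."""
    if job.stalled_attempts >= budget.max_stalled_attempts:
        return "dlq"
    if job.token_spend_usd >= budget.max_token_spend_usd:
        return "dlq"
    if job.error_unchanged and now - job.first_failure_at >= budget.max_failure_age:
        return "dlq"
    return "retry"
```

Permanent and poison failures would bypass this function and go straight to the DLQ; it only arbitrates retries that the classifier has already allowed.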

Partial failure and the side-effect ledger

The hardest DLQ question: “Can we replay from the start?” If step two created a CRM case, a naive replay may create a duplicate. Harbor's fix:

Side-effect ledger keyed by (job_id, action_type, idempotency_key) — same pattern as webhook ingress.
Checkpoint-aware replay — operator chooses “resume from step 3” vs “new run with fresh idempotency scope.”
Compensating actions — for irreversible tools, document rollback or human cleanup before DLQ replay; see saga rollbacks.

The DLQ UI must show side_effects_applied prominently. Replays started without reading that list caused the poison incidents that still account for 2% of Harbor's queue outages.
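
The ledger idea can be sketched as a wrapper that runs a write tool at most once per key. SideEffectLedger and its methods are hypothetical, and the in-memory dict is a stand-in for a durable table. A production version would presumably need an atomic insert (for example a unique constraint) to close the gap between the lookup and the write.

```python
from typing import Callable


class SideEffectLedger:
    """In-memory stand-in for a durable side-effect ledger (hypothetical)."""

    def __init__(self) -> None:
        self._results: dict[tuple[str, str, str], str] = {}

    def run_once(self, job_id: str, action_type: str, idempotency_key: str,
                 action: Callable[[], str]) -> str:
        """Run `action` unless this key was already applied; return its result."""
        key = (job_id, action_type, idempotency_key)
        if key in self._results:
            # Already applied: return the recorded result instead of repeating the write.
            return self._results[key]
        result = action()
        self._results[key] = result
        return result

    def applied(self, job_id: str) -> list[str]:
        """List entries in the "action:result" form used by side_effects_applied."""
        return [f"{a}:{r}" for (j, a, _), r in self._results.items() if j == job_id]
```

A checkpoint resume that reuses the same job_id and idempotency key would hit the recorded entry and skip the duplicate CRM case.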

Operator replay workflows

Safe replay (default)

Create a new job_id with a lineage pointer to the DLQ row. Copy the normalized payload and apply the fixed parser version or corrected credentials. Write tools receive new idempotency keys; the ledger, looked up through the lineage pointer, prevents duplicates where the original job's side effects overlap.
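
A safe replay might be assembled as follows. This is only a sketch: safe_replay_job and the replay_of_*, idempotency_scope, and known_side_effects fields are hypothetical names, and the job_id format is invented for the example.

```python
import uuid


def safe_replay_job(dlq_row: dict, payload: dict, parser_version: str) -> dict:
    """Build a new job that points back at the DLQ row it replays."""
    new_job_id = f"job_{uuid.uuid4().hex[:12]}"
    return {
        "job_id": new_job_id,
        # Lineage pointers let operators and the ledger find the original run.
        "replay_of_dlq_id": dlq_row["dlq_id"],
        "replay_of_job_id": dlq_row["job_id"],
        "tenant_id": dlq_row["tenant_id"],
        "payload": dict(payload),               # copy, so the snapshot stays untouched
        "parser_version": parser_version,
        "idempotency_scope": new_job_id,        # fresh keys for write tools
        "known_side_effects": list(dlq_row["side_effects_applied"]),
    }
```

Carrying known_side_effects forward gives write tools a way to check what the original job already did before acting again.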

Resume from checkpoint

When failure was transient after partial progress (tool timeout on step 5), resume from checkpoint with the same run_id only if lease semantics guarantee no concurrent worker holds the run.

Discard with audit

Some poison jobs should never replay (test data, malicious payload). Mark discarded with mandatory reason; notify tenant if their integration sent bad events.

Bulk redrive guardrails

“Replay all 200 DLQ rows” buttons are dangerous after a deploy fix. Require: error code filter, max batch size, canary replay of 5 jobs with metric watch, then bulk. Log every replay to the audit trail.
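
These guardrails can be sketched as a two-phase redrive. plan_redrive, redrive, and the replay callback are hypothetical, and the "metric watch" is reduced here to a canary success ratio. A real system would likely watch queue and error metrics for a period before continuing.

```python
from typing import Callable, Iterable


def plan_redrive(rows: Iterable[dict], error_code: str, max_batch: int,
                 canary_size: int = 5) -> tuple[list[dict], list[dict]]:
    """Filter by error code, cap the batch, and split off a canary slice."""
    matching = [r for r in rows if r["last_error"]["code"] == error_code]
    capped = matching[:max_batch]
    return capped[:canary_size], capped[canary_size:]


def redrive(rows: Iterable[dict], error_code: str, max_batch: int,
            replay: Callable[[dict], bool],
            min_canary_success: float = 1.0) -> list[dict]:
    """Replay the canary first; continue only if it meets the success bar.

    Returns one audit entry per replay attempt.
    """
    canary, rest = plan_redrive(rows, error_code, max_batch)
    audit: list[dict] = []
    succeeded = 0
    for row in canary:
        ok = replay(row)
        succeeded += ok
        audit.append({"dlq_id": row["dlq_id"], "phase": "canary", "success": ok})
    if not canary or succeeded / len(canary) < min_canary_success:
        return audit  # stop: nothing matched or the canary failed
    for row in rest:
        audit.append({"dlq_id": row["dlq_id"], "phase": "bulk", "success": replay(row)})
    return audit
```

The returned list is meant to be written to the audit trail, so a halted redrive still leaves a record of which canary jobs ran.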

Harbor Compliance refactor walkthrough

Harbor's remediation after the six-hour backlog:

Parser hardening: corrupt XMP now returns a permanent DOC_PARSE_CORRUPT error on the first attempt instead of a stack trace that feeds a retry loop.
Separate DLQ per job type — KYC, sanctions, and webhook backfills isolated; poison in one lane cannot block others.
Stuck-job detector — cron flags jobs with attempt_count > 3 and checkpoint_step == 0.
DLQ dashboard — side effects, token spend, trace link, one-click safe replay.
Tenant notification webhook — when >5 jobs DLQ in an hour for one tenant, POST to their ops endpoint with sample error codes.
Poison circuit breaker — pause tenant dequeue when error rate spikes; auto-resume after cooldown or manual ack.

P95 queue wait for healthy KYC jobs returned to 14 seconds. Support tickets tied to “stuck onboarding” dropped 34% quarter over quarter.
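
The poison circuit breaker from the list above could look roughly like this. PoisonCircuit is a hypothetical class; the 5-minute window and 10% threshold follow the retry-budget section, while the 10-minute cooldown and the min_jobs floor (which avoids tripping on tiny samples) are assumptions of this sketch.

```python
from collections import deque
from typing import Optional


class PoisonCircuit:
    """Pauses dequeue for one tenant when a single error code dominates recent jobs."""

    def __init__(self, window_s: float = 300.0, threshold: float = 0.10,
                 cooldown_s: float = 600.0, min_jobs: int = 20) -> None:
        self.window_s = window_s
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.min_jobs = min_jobs
        self._events: deque = deque()   # (timestamp, error_code or None)
        self._paused_until = 0.0

    def record(self, now: float, error_code: Optional[str]) -> None:
        """Record a finished job; error_code is None on success."""
        self._events.append((now, error_code))
        # Drop events that fell out of the sliding window.
        while self._events and self._events[0][0] <= now - self.window_s:
            self._events.popleft()
        if len(self._events) < self.min_jobs:
            return
        counts: dict = {}
        for _, code in self._events:
            if code is not None:
                counts[code] = counts.get(code, 0) + 1
        if counts and max(counts.values()) / len(self._events) > self.threshold:
            self._paused_until = now + self.cooldown_s

    def allow_dequeue(self, now: float) -> bool:
        """False while the circuit is open; resumes automatically after the cooldown."""
        return now >= self._paused_until

    def manual_ack(self) -> None:
        """Operator override that closes the circuit immediately."""
        self._paused_until = 0.0
```

Paging on-call when the circuit opens is omitted; it would hook in at the point where _paused_until is set.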

Technique decision table

Approach	Poison isolation	Replay safety	When to use
Infinite retry with backoff only	None	High duplicate risk	Never in production agent queues
Broker DLQ (max receive count)	Good	Low without ledger	Read-only agent tasks, no write tools
Agent-aware DLQ + failure taxonomy	Very good	Medium	Default for webhook-triggered agents
DLQ + side-effect ledger + checkpoint resume	Very good	High	Multi-step agents with CRM, email, payments
Per-tenant DLQ + circuit breaker	Excellent	High	Multi-tenant SaaS with bursty integrators

Start with classified errors and a rich DLQ record. Add per-tenant circuits when a single customer can enqueue thousands of events per hour.

Common pitfalls

DLQ without operator UI — jobs vanish into a topic nobody monitors; backlog looks healthy while customers wait.
Retrying permanent errors — burning tokens on a 401 that will never succeed.
Replaying from step zero after partial writes — duplicate CRM cases and duplicate charges.
Shared DLQ across tenants — complicates access control and replay scoping.
Missing trace links — engineers grep logs for hours instead of opening the failing span.
Bulk redrive after deploy without canary — fixed parser replays 500 jobs that already partially succeeded.
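DLQ retention too short: compliance needs a 90-day failure audit trail, but broker retention is far shorter (SQS, for example, defaults to 4 days and caps at 14), so evidence is lost unless DLQ records are persisted elsewhere.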

Production checklist

Worker middleware classifies errors into transient, permanent, and poison with structured codes.
Max attempts and token budgets enforced per job_id; checkpoint progress gates retry counting.
DLQ records include job_id, run_id, tenant_id, checkpoint_step, side_effects_applied, trace_url.
Side-effect ledger checked before any write tool; replays use fresh or scoped idempotency keys.
Operator UI supports safe replay, checkpoint resume, and discard with mandatory audit reason.
Per-job-type or per-tenant DLQs prevent cross-lane poison blocking.
Poison circuit breaker pauses dequeue when error rate spikes; alerts on-call.
Bulk redrive requires filtered canary batch and metric watch.
DLQ retention meets compliance; PII redacted in UI, encrypted snapshots for replay.
Metrics: dlq_depth, poison_rate, retry_token_spend, time_to_dlq, replay_success_rate.

Key takeaways

Agent poison jobs are expensive — each retry can cost dollars and partial writes.
Classify failures explicitly — do not let the broker retry permanent errors.
DLQ records need run context — checkpoint step and side effects determine replay safety.
Isolate lanes — one bad PDF must not stall 890 onboarding jobs.
Harbor cut poison-induced incidents from 61% to 2% of queue outages with taxonomy, rich DLQ records, and governed replay, not by disabling retries entirely.