Guide
LLM agent webhook and async job queue systems explained
Harbor Integrations sold “event-driven support agents”: when a Zendesk ticket updated, a webhook fired and an LLM agent drafted a reply, checked refund policy, and sometimes called Stripe. The first version handled webhooks synchronously inside the HTTP request, with no queue and no idempotency store. During a Zendesk outage recovery, the vendor replayed three days of ticket events. Harbor's handlers ran each event twice; in the four-hour window before an engineer killed the deployment, agents sent 1,400 duplicate customer emails and 41% of side-effect tool calls were duplicates. After moving to signed ingress, deduplicated job queues, and durable run records, duplicate side effects fell to 3% (almost entirely third-party retries outside the dedup window).
Webhook ingress is how external systems trigger agent work. Async job queues decouple “we received an event” from “we executed a multi-minute agent run with tools.” Together they define reliability, fairness, and tenant safety for event-driven agents. This guide covers verification, idempotency, queue topology, worker tenant binding, scheduling, integration with rate limits and cancellation, the Harbor refactor, a technique decision table, pitfalls, and a production checklist.
Why synchronous webhook handlers fail for agents
A CRUD webhook handler can often finish in milliseconds: validate signature, upsert a row, return 200. Agent runs are different:
- Duration — tool loops routinely take 30 seconds to several minutes; most SaaS webhooks time out at 10–30 seconds.
- Retries — providers retry on slow or ambiguous responses; without dedup, retries become duplicate runs.
- Bursts — Monday-morning ticket floods or GitHub push storms can exceed orchestrator concurrency.
- Partial failure — the agent may send email (irreversible) then crash before responding 200; the provider retries and sends again.
- Human-in-the-loop — some events should enqueue a draft, not auto-execute writes.
The correct pattern is ack fast, work async: verify the webhook, persist an idempotent job, return 2xx immediately, process on workers with full tracing and checkpointing.
Webhook ingress: verify, normalize, enqueue
Signature verification
Every ingress route must validate provider signatures (HMAC-SHA256, Ed25519, or vendor-specific schemes) using a per-tenant secret from your secret broker. Reject before parsing large bodies. Rotate secrets with dual-key acceptance windows.
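As a minimal sketch of the check described above, assuming the provider sends a hex-encoded HMAC-SHA256 of the raw body (the header format, the `verify_signature` name, and the sample secrets are illustrative, not taken from any vendor's documentation), dual-key acceptance can be expressed as a loop over the currently accepted secrets:

```python
import hashlib
import hmac


def verify_signature(body: bytes, signature_hex: str, secrets: list[bytes]) -> bool:
    """Return True if the hex HMAC-SHA256 of `body` matches under any accepted secret.

    `secrets` holds the current key and, during a rotation window, the previous key.
    """
    provided = signature_hex.encode("utf-8")
    for secret in secrets:
        expected = hmac.new(secret, body, hashlib.sha256).hexdigest().encode("utf-8")
        # compare_digest is a constant-time comparison, which avoids timing leaks.
        if hmac.compare_digest(expected, provided):
            return True
    return False


body = b'{"ticket_id": 99102}'
old_key, new_key = b"old-secret", b"new-secret"
sig = hmac.new(old_key, body, hashlib.sha256).hexdigest()

print(verify_signature(body, sig, [new_key, old_key]))  # inside the rotation window
print(verify_signature(body, sig, [new_key]))           # after the old key is retired
```

The signature is computed over the raw bytes, so this check can run before any JSON parsing.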
Event normalization
Map vendor payloads to an internal envelope:
{
  "event_id": "evt_zd_88421",
  "source": "zendesk",
  "tenant_id": "tnt_8f2a…",
  "event_type": "ticket.updated",
  "occurred_at": "2026-06-12T09:14:22Z",
  "payload_hash": "sha256:…",
  "dedup_key": "zendesk:ticket:99102:rev:44"
}
Store raw payload bytes (encrypted) for replay debugging, but agents should consume normalized fields so tool code does not depend on Zendesk JSON shape.
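One way to picture the mapping is the sketch below. The input shape (`id`, `type`, `occurred_at`, `ticket.id`, `ticket.revision`) is a made-up stand-in for a vendor payload, not Zendesk's real schema, and `normalize_zendesk` is a hypothetical helper name. Only the output envelope follows the fields shown above.

```python
import hashlib
import json


def normalize_zendesk(raw_body: bytes, tenant_id: str) -> dict:
    """Map a hypothetical Zendesk-style payload to the internal envelope."""
    data = json.loads(raw_body)
    ticket = data["ticket"]
    return {
        "event_id": f"evt_zd_{data['id']}",
        "source": "zendesk",
        # tenant_id comes from authenticated routing, never from the payload.
        "tenant_id": tenant_id,
        "event_type": data["type"],
        "occurred_at": data["occurred_at"],
        # Hash the raw bytes so the stored payload can be matched later.
        "payload_hash": "sha256:" + hashlib.sha256(raw_body).hexdigest(),
        "dedup_key": f"zendesk:ticket:{ticket['id']}:rev:{ticket['revision']}",
    }


raw = json.dumps({
    "id": 88421,
    "type": "ticket.updated",
    "occurred_at": "2026-06-12T09:14:22Z",
    "ticket": {"id": 99102, "revision": 44},
}).encode("utf-8")

envelope = normalize_zendesk(raw, tenant_id="tnt_example")
print(envelope["dedup_key"])
```

Downstream tool code then reads only envelope fields, so a vendor schema change is absorbed in one function.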
Fast ACK contract
HTTP handler responsibilities only: auth, dedup check, insert job row, publish to queue, return 200/202. Never call the model inside the request thread. If enqueue fails, return 5xx so the provider retries; this is safe only once the idempotency row guarantees that a retry cannot enqueue the same event twice.
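The contract can be sketched with an in-memory SQLite database standing in for the production store. The table layout, the `outbox` table, and the `handle_webhook` name are assumptions made for illustration; signature verification is assumed to have already passed, and a separate relay process (not shown) would publish outbox rows to the queue.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE jobs (
    job_id    INTEGER PRIMARY KEY AUTOINCREMENT,
    tenant_id TEXT NOT NULL,
    dedup_key TEXT NOT NULL,
    status    TEXT NOT NULL DEFAULT 'pending',
    UNIQUE (tenant_id, dedup_key)
);
CREATE TABLE outbox (
    job_id    INTEGER NOT NULL,
    published INTEGER NOT NULL DEFAULT 0
);
""")


def handle_webhook(tenant_id: str, dedup_key: str) -> tuple[int, int]:
    """Return (http_status, job_id) without doing any agent work."""
    try:
        with conn:  # one transaction: the job row and outbox row commit together
            cur = conn.execute(
                "INSERT INTO jobs (tenant_id, dedup_key) VALUES (?, ?)",
                (tenant_id, dedup_key),
            )
            job_id = cur.lastrowid
            conn.execute("INSERT INTO outbox (job_id) VALUES (?)", (job_id,))
        return 202, job_id
    except sqlite3.IntegrityError:
        # Duplicate delivery: acknowledge with the existing job, enqueue nothing.
        row = conn.execute(
            "SELECT job_id FROM jobs WHERE tenant_id = ? AND dedup_key = ?",
            (tenant_id, dedup_key),
        ).fetchone()
        return 200, row[0]


print(handle_webhook("tnt_a", "zendesk:ticket:99102:rev:44"))  # first delivery
print(handle_webhook("tnt_a", "zendesk:ticket:99102:rev:44"))  # provider retry
```

Because the unique constraint does the dedup, two concurrent deliveries of the same event cannot both insert a job, whichever request wins the race.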
Idempotency and deduplication
Idempotency is the difference between Harbor's 41% duplicate tool calls and 3%. Implement it at three layers:
- Ingress dedup — unique constraint on (tenant_id, dedup_key) or (tenant_id, event_id). Second delivery returns 200 with the existing job_id.
- Run-level idempotency — write tools accept Idempotency-Key headers derived from job_id per retry guidance.
- Side-effect ledger — before sending email or charging cards, check a ledger: “already executed action X for job Y.”
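The ledger layer can be illustrated with a small in-memory sketch. `SideEffectLedger`, `run_once`, and `idempotency_key` are hypothetical names, and a real ledger would be a durable table written transactionally rather than a dict. In production the ledger write and the external call are not atomic, so the downstream idempotency key is still needed for the crash window between them.

```python
from typing import Callable


def idempotency_key(job_id: str, action: str) -> str:
    """Stable key for a write tool call; identical across retries of one job."""
    return f"{job_id}:{action}"


class SideEffectLedger:
    """In-memory stand-in for a durable ledger keyed by (job_id, action)."""

    def __init__(self) -> None:
        self._done: dict[str, str] = {}

    def run_once(self, job_id: str, action: str, effect: Callable[[], str]) -> str:
        key = idempotency_key(job_id, action)
        if key in self._done:
            # Retry path: return the recorded result and perform nothing.
            return self._done[key]
        result = effect()
        self._done[key] = result
        return result


sent: list[str] = []


def send_email() -> str:
    sent.append("refund confirmation")
    return "msg_1"


ledger = SideEffectLedger()
ledger.run_once("job_1", "send_email", send_email)
ledger.run_once("job_1", "send_email", send_email)  # simulated worker retry
print(len(sent))
```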
Choose dedup keys carefully. ticket_id alone collapses distinct updates; ticket_id + revision or vendor event_id is safer. Document the TTL: some providers replay after 72 hours, which is why Harbor extended dedup retention from 24 hours to 7 days.
Job queue topology for agent workloads
Generic task queues work, but agent jobs have unique needs:
Priority and fairness
Separate queues or weighted priorities: interactive user chat > webhook automations > batch backfills. Within webhooks, apply per-tenant fair queuing so one customer's GitHub monorepo cannot starve others.
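Per-tenant fair queuing can take several forms; the sketch below uses plain round-robin across tenants, which is one simple option and not necessarily what any particular queue product implements. `FairQueue` is a hypothetical in-memory class used only to show the ordering.

```python
from collections import OrderedDict, deque
from typing import Optional


class FairQueue:
    """Round-robin across tenants so one tenant's burst cannot starve the rest."""

    def __init__(self) -> None:
        self._queues: "OrderedDict[str, deque]" = OrderedDict()

    def put(self, tenant_id: str, job_id: str) -> None:
        self._queues.setdefault(tenant_id, deque()).append(job_id)

    def get(self) -> Optional[str]:
        if not self._queues:
            return None
        # Serve the tenant at the front of the rotation.
        tenant_id, queue = next(iter(self._queues.items()))
        job_id = queue.popleft()
        del self._queues[tenant_id]
        if queue:
            # Tenant still has work: move it to the back of the rotation.
            self._queues[tenant_id] = queue
        return job_id


fq = FairQueue()
for job in ("a1", "a2", "a3"):
    fq.put("tnt_a", job)
fq.put("tnt_b", "b1")

order = [fq.get() for _ in range(4)]
print(order)
```

Tenant B's single job is served second even though tenant A enqueued three jobs first.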
Concurrency caps
Limit concurrent runs per tenant, per integration, and globally. A spike in issue.opened events should queue, not spawn 500 sandboxes.
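A rough sketch of the admission check, assuming a single process (a real deployment would keep these counters in a shared store such as the job database). `ConcurrencyCaps` and its limits are illustrative, and the per-integration cap is omitted for brevity.

```python
from collections import Counter


class ConcurrencyCaps:
    """Admit a run only if both the tenant cap and the global cap allow it."""

    def __init__(self, per_tenant: int, global_cap: int) -> None:
        self.per_tenant = per_tenant
        self.global_cap = global_cap
        self.running: Counter = Counter()

    def try_acquire(self, tenant_id: str) -> bool:
        if sum(self.running.values()) >= self.global_cap:
            return False  # leave the job queued; do not spawn a sandbox
        if self.running[tenant_id] >= self.per_tenant:
            return False
        self.running[tenant_id] += 1
        return True

    def release(self, tenant_id: str) -> None:
        if self.running[tenant_id] > 0:
            self.running[tenant_id] -= 1


caps = ConcurrencyCaps(per_tenant=2, global_cap=3)
results = [caps.try_acquire(t) for t in ("tnt_a", "tnt_a", "tnt_a", "tnt_b", "tnt_c")]
print(results)
```

A job that fails `try_acquire` stays in the queue and is retried when another run calls `release`.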
Poison messages and DLQ
After N failures with exponential backoff, move jobs to a dead-letter queue with the last error, a payload snapshot, and a link to traces. Operators need a “replay DLQ job” button that creates a new job, with a fresh idempotency scope when side effects were already partially applied.
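The retry-then-dead-letter policy might look like the following sketch. The base delay, cap, and `max_attempts` values are placeholders, the job is a plain dict for illustration, and production systems usually add random jitter to the delay.

```python
def backoff_seconds(attempt: int, base: float = 2.0, cap: float = 300.0) -> float:
    """Delay before retry number `attempt` (1-based), doubling up to `cap`."""
    return min(cap, base * (2 ** (attempt - 1)))


def on_failure(job: dict, error: str, max_attempts: int = 5) -> dict:
    """Return an updated copy of the job after one failed attempt."""
    updated = {**job, "attempts": job["attempts"] + 1, "last_error": error}
    if updated["attempts"] >= max_attempts:
        # Park the job for operators; keep the error for the DLQ dashboard.
        updated["status"] = "dead_letter"
        updated["retry_in"] = None
    else:
        updated["status"] = "pending"
        updated["retry_in"] = backoff_seconds(updated["attempts"])
    return updated


job = {"job_id": "job_1", "attempts": 0, "status": "running"}
job = on_failure(job, "tool timeout")
print(job["status"], job["retry_in"])
```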
Delayed and scheduled jobs
Some automations should wait: “if no human reply in 4 hours, agent follows up.” Use a scheduler tier (cron, delayed SQS, or time-wheel) that enqueues standard jobs at run_at. Persist scheduled jobs in the same database as checkpoints so deploys do not lose timers.
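As an illustration of database-backed timers, the sketch below keeps scheduled jobs in a table and promotes the due ones on each scheduler tick. SQLite stands in for the production database, and the `scheduled_jobs` schema and function names are assumptions. Timestamps are passed in explicitly so the behavior is easy to follow.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE scheduled_jobs ("
    " id INTEGER PRIMARY KEY, tenant_id TEXT NOT NULL, kind TEXT NOT NULL,"
    " run_at REAL NOT NULL, enqueued INTEGER NOT NULL DEFAULT 0)"
)


def schedule(tenant_id: str, kind: str, run_at: float) -> int:
    """Persist a timer; it survives restarts because it lives in the database."""
    with db:
        cur = db.execute(
            "INSERT INTO scheduled_jobs (tenant_id, kind, run_at) VALUES (?, ?, ?)",
            (tenant_id, kind, run_at),
        )
    return cur.lastrowid


def promote_due(now: float) -> list[int]:
    """Mark every due, not-yet-enqueued timer and return its id for enqueueing."""
    with db:
        rows = db.execute(
            "SELECT id FROM scheduled_jobs WHERE enqueued = 0 AND run_at <= ?"
            " ORDER BY run_at",
            (now,),
        ).fetchall()
        ids = [row[0] for row in rows]
        db.executemany(
            "UPDATE scheduled_jobs SET enqueued = 1 WHERE id = ?",
            [(i,) for i in ids],
        )
    return ids
```

A scheduler loop would call `promote_due(time.time())` every few seconds and hand each returned id to the normal ingress enqueue path, so delayed jobs get the same dedup and tenant checks as webhook jobs.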
Tenant binding on async workers
The highest-risk bug class after dedup failures is processing a job without correct tenant scope — the same failure mode as in multi-tenant isolation. Rules:
- tenant_id lives on the job row, set at ingress from the webhook routing table (URL path, subdomain, or signed claim) — never from unauthenticated payload fields alone.
- Workers rehydrate tenant context before any model call, tool dispatch, or memory read.
- Queue messages are opaque IDs pointing to DB rows, not full payloads with embeddable tenant overrides.
- Cross-tenant integration tests enqueue jobs for two tenants with colliding external IDs; assert zero cross-reads.
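The rules above can be exercised with a toy example in which two tenants share an external ticket ID and the payload carries a misleading `org` field. `JOBS`, `TICKETS`, `TenantContext`, and `run_job` are invented for this sketch; the point is only that the worker scopes reads by the job row's tenant_id.

```python
from dataclasses import dataclass

# Job rows as written at ingress; the queue message would carry only the job_id.
JOBS = {
    "job_1": {"tenant_id": "tnt_a", "payload": {"org": "tnt_b", "ticket_id": 99102}},
}

# Two tenants with a colliding external ticket ID.
TICKETS = {
    ("tnt_a", 99102): "tenant A ticket",
    ("tnt_b", 99102): "tenant B ticket",
}


@dataclass(frozen=True)
class TenantContext:
    tenant_id: str

    def read_ticket(self, ticket_id: int) -> str:
        # Every read is scoped by the bound tenant; no cross-tenant lookup exists.
        return TICKETS[(self.tenant_id, ticket_id)]


def run_job(job_id: str) -> str:
    job = JOBS[job_id]
    # Rehydrate from the job row; the payload's "org" field is deliberately ignored.
    ctx = TenantContext(tenant_id=job["tenant_id"])
    return ctx.read_ticket(job["payload"]["ticket_id"])


print(run_job("job_1"))
```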
Worker execution loop
A typical worker cycle:
- Lease job with visibility timeout (extend it with heartbeats during long runs).
- Create run_id; load tenant context and integration credentials.
- Execute agent graph with middleware hooks for policy and redaction.
- Checkpoint state after each tool per durable-state design.
- Mark job succeeded or failed; release lease.
- Optionally POST to a callback URL if the integration registered one.
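The lease and heartbeat steps of this cycle can be sketched with an in-memory table and an explicit clock. `LeaseTable` is a hypothetical stand-in for whatever the queue or database provides (for example, a visibility timeout on a managed queue), and the 30-second timeout is arbitrary.

```python
class LeaseTable:
    """Tracks which worker holds each job and when that lease expires."""

    def __init__(self, visibility_timeout: float) -> None:
        self.visibility_timeout = visibility_timeout
        self._leases: dict[str, tuple[str, float]] = {}  # job_id -> (worker, expiry)

    def acquire(self, job_id: str, worker_id: str, now: float) -> bool:
        holder = self._leases.get(job_id)
        if holder is not None and holder[1] > now:
            return False  # another worker still holds a live lease
        self._leases[job_id] = (worker_id, now + self.visibility_timeout)
        return True

    def heartbeat(self, job_id: str, worker_id: str, now: float) -> bool:
        holder = self._leases.get(job_id)
        if holder is None or holder[0] != worker_id or holder[1] <= now:
            return False  # lease lost: the worker should stop before side effects
        self._leases[job_id] = (worker_id, now + self.visibility_timeout)
        return True

    def release(self, job_id: str, worker_id: str) -> None:
        holder = self._leases.get(job_id)
        if holder is not None and holder[0] == worker_id:
            del self._leases[job_id]


leases = LeaseTable(visibility_timeout=30.0)
print(leases.acquire("job_1", "worker_1", now=0.0))
print(leases.heartbeat("job_1", "worker_1", now=25.0))  # extends expiry to t=55
print(leases.acquire("job_1", "worker_2", now=40.0))    # still leased by worker_1
```

Without the heartbeat at t=25, the lease would have expired at t=30 and a second worker could have picked up the same job mid-run.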
Integrate cancellation: if the ticket was solved while the job was queued, a lightweight pre-flight check should no-op the run before it sends customer email.
For user-visible progress, workers can publish events to SSE channels keyed by run_id even though the trigger was a webhook.
Harbor Integrations refactor walkthrough
Harbor's remediation sprint:
- Ingress service — dedicated pods; only verify, dedup, enqueue; p99 < 80 ms.
- Postgres job table — unique on (tenant_id, dedup_key); status machine pending → running → succeeded|failed|cancelled.
- SQS + per-tenant rate tokens — dequeue only when tenant concurrency budget allows.
- Side-effect ledger — email and Stripe tools check ledger before execute; retries become no-ops.
- 7-day dedup window — covers vendor replay storms.
- DLQ dashboard — support replays with one click and mandatory incident note.
Mean time from ticket event to first agent draft increased by 4 seconds (queue wait) but customer-facing duplicate emails dropped by 93%. Zendesk webhook timeout errors went from hundreds per day to zero.
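The job status machine from the refactor (pending → running → succeeded|failed|cancelled) can be enforced with a small transition table. This is a sketch of the idea, not Harbor's code; the `TRANSITIONS` and `transition` names are assumptions, and only the transitions named in the list above are allowed.

```python
TRANSITIONS: dict[str, set[str]] = {
    "pending": {"running"},
    "running": {"succeeded", "failed", "cancelled"},
    # Terminal states have no outgoing transitions.
    "succeeded": set(),
    "failed": set(),
    "cancelled": set(),
}


def transition(current: str, new: str) -> str:
    """Return `new` if the move is legal, otherwise raise ValueError."""
    if new not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {new}")
    return new


status = transition("pending", "running")
status = transition(status, "succeeded")
print(status)
```

Rejecting moves out of a terminal state means a late, duplicate worker cannot flip a finished job back to running.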
Technique decision table
| Approach | Latency to start work | Duplicate risk | When to use |
|---|---|---|---|
| Synchronous HTTP handler calls agent inline | Lowest | Very high | Prototypes only; no write tools |
| Enqueue on ingress + single shared worker pool | Low (+queue wait) | Low with dedup | Default for SaaS webhook agents |
| Per-tenant queues + fair scheduling | Low | Low | Multi-tenant platforms with bursty integrators |
| Event bus (Kafka/NATS) + stream processors | Medium | Medium (needs consumer idempotency) | High volume, many event types, analytics fan-out |
| Workflow engine (Temporal, Step Functions) | Medium | Very low | Long-running, multi-day agent workflows with timers |
Most teams should start with a relational job table plus a managed queue. Adopt Temporal when you have many delayed steps, human approvals, and compensating transactions across days.
Common pitfalls
- 200 before enqueue commits — crash between ACK and persist loses events; use transactional outbox pattern.
- Weak dedup keys — over-broad keys collapse distinct events and cause missed automations; over-specific keys let duplicates through.
- Calling write tools before checkpoint — retry after crash duplicates side effects; ledger first.
- Unbounded webhook body parsing — large GitHub payloads can OOM ingress; stream-parse or size-cap.
- Missing tenant on worker dequeue — trusting payload org fields without routing-table lookup.
- No visibility timeout extension — long runs requeued mid-flight, doubling work.
- Ignoring provider replay headers — some vendors send X-Request-Id; store it even if you have custom dedup keys.
Production checklist
- Webhook ingress verifies signatures with per-tenant secrets; rejects unsigned requests.
- Handler returns 2xx only after idempotent job row + queue publish succeed (outbox).
- Unique constraint on (tenant_id, dedup_key) or vendor event_id.
- Workers lease jobs with heartbeats; visibility timeout extends during long runs.
- tenant_id set at ingress from authenticated routing; rehydrated on every worker step.
- Write tools use idempotency keys tied to job_id; side-effect ledger enforced.
- Per-tenant and global concurrency caps integrated with rate-limit middleware.
- DLQ with operator replay; scheduled/delayed jobs survive deploys.
- Pre-flight cancellation checks stale events before irreversible tools.
- Traces link event_id → job_id → run_id for end-to-end debugging.
Key takeaways
- Never run agents inside webhook HTTP threads — ack fast, queue work, return 2xx.
- Idempotency is non-negotiable — providers will retry; your design must welcome retries safely.
- Jobs carry tenant scope — async boundaries are where multi-tenant leaks happen.
- Side-effect ledgers beat hope — assume workers crash mid-run.
- Harbor cut duplicate side effects from 41% to 3% with dedup keys, outbox enqueue, and ledgers — not by disabling webhooks.
Related reading
- LLM agent durable state, checkpointing and run persistence explained — survive worker crashes mid-run
- LLM agent retry, backoff and transient failure recovery explained — safe retries for queued jobs
- LLM agent multi-tenancy and tenant isolation systems explained — tenant context on async workers
- LLM agent middleware hook pipeline explained — ingress and worker policy hooks