Guide
LLM agent scheduled cron job trigger systems explained
Harbor Finance wired a month-end close agent to run at
0 2 1 * * in America/New_York — reconcile ledgers,
draft variance commentary, and post accrual journals through approved
write tools. The first production month looked fine. November broke:
daylight saving time ended on a Sunday, the scheduler fired the job
twice in the ambiguous 1:00–2:00 AM hour, and a mid-run deploy
restarted the worker process, which re-enqueued the same tick because
no overlap lock existed. Finance closed the books with
duplicate accrual entries on 19% of subsidiary ledgers
before anyone traced the bug to cron semantics, not model hallucination.
Scheduled cron triggers are how agents run without a human in the loop — nightly reports, hourly health scans, compliance attestations, and data-pipeline maintenance. They look simpler than webhook ingress but fail in subtler ways: timezones, missed ticks, overlapping long runs, and duplicate enqueue on process restart. This guide covers schedule registries, cron vs interval triggers, timezone and DST policy, missed-window catch-up rules, per-tick idempotency and overlap guards, integration with durable checkpoints, the Harbor Finance refactor, a technique decision table, pitfalls, and a production checklist alongside dead letter handling for poisoned scheduled jobs.
When cron beats webhooks and when it does not
Cron triggers answer: “run this agent at predictable wall-clock times.” Webhooks answer: “run when an external system pushes an event.” Most production platforms need both.
- Cron fits — batch reconciliation, scheduled digests, SLA scans, cache warming, model evaluation sweeps, and any job whose inputs are time-bounded (“all transactions since last successful tick”).
- Webhooks fit — ticket created, PR opened, payment succeeded. Latency-sensitive, event-shaped payloads.
- Hybrid — cron enqueues a sweep; individual records still arrive via webhook and dedupe against the same idempotency ledger.
The mistake Harbor made was treating cron as “just call the agent” without the same durability guarantees as webhook ingress: signature verification becomes schedule-version verification; dedup becomes per-tick idempotency keys; the queue topology is identical.
Schedule registry: versioned, tenant-scoped definitions
Store schedules as data, not hard-coded crontab lines scattered across services. A minimal registry row:
{
"schedule_id": "harbor-close-monthly",
"agent_id": "ledger-close-v3",
"cron": "0 2 1 * *",
"timezone": "America/New_York",
"enabled": true,
"version": 7,
"missed_policy": "skip",
"overlap_policy": "forbid",
"tenant_id": "harbor-finance-prod"
}
Every deploy that changes prompt, tools, or cron expression bumps
version. Workers embed
schedule_id + version + tick_timestamp in the
idempotency key so a restarted process cannot double-fire the same
logical run. Multi-tenant schedulers must bind
tenant_id on the worker before any tool executes —
see
tenant isolation
for credential scoping.
Support three trigger shapes in one registry:
- Cron expression — calendar semantics with timezone (use a library that understands IANA zones, not fixed UTC offsets).
- Fixed interval — every N minutes from last successful completion (not from enqueue) for jobs where wall-clock alignment matters less than spacing.
- One-shot — run at
run_attimestamp for delayed human approvals or maintenance windows.
Timezone, DST, and the ambiguous hour
Cron bugs cluster around daylight saving transitions. Rules Harbor now enforces:
- Never store schedules as raw UTC offsets — always IANA timezone strings.
- Spring forward (missing hour) — skip ticks
that fall in nonexistent local times; log
SKIPPED_DST_GAP. - Fall back (repeated hour) — fire at most once per logical tick; key idempotency on UTC instant, not local clock label. Harbor’s duplicate close used local hour as key and fired twice.
- Document tenant preference — some finance teams want “2 AM local always”; others want “09:00 UTC globally.” Make the choice explicit in the registry UI.
Run a quarterly test suite that simulates DST edges for every active timezone in the registry. This is cheaper than a month-end restatement.
Missed-window policy: catch-up vs skip
Schedulers pause during outages. When the worker returns, the cron library may offer to fire all missed ticks. For agent jobs with write tools, that is dangerous.
| Policy | Behavior | Use when |
|---|---|---|
| Skip | Missed ticks are logged and dropped; next fire is the next valid cron match. | Month-end close, daily digest, any job where duplicate logical periods cause harm. |
| Coalesce | At most one catch-up run with
backfill_from=last_success_tick in the payload. |
Metrics aggregation, read-only scans, idempotent sweeps. |
| Replay each | Enqueue one job per missed tick in order. | Rare; only when each interval is independent and writes are keyed by interval ID. |
Harbor Finance moved month-end close to skip with manual “run now” for operators. Hourly read-only anomaly scans use coalesce so a 20-minute outage does not leave a gap in coverage.
Overlap guards and long-running agents
Agent runs can exceed the cron period — a 90-minute close job on an hourly schedule. Without overlap policy, ticks stack until the queue saturates.
- forbid — if a run for
schedule_idis active, drop or defer the new tick; emit metriccron_overlap_skipped. - allow_parallel — only for read-only agents with no shared mutable state; still cap concurrency per schedule.
- cancel_previous — almost never appropriate for agents mid-tool-chain; prefer forbid.
Persist run state in checkpoints so a deploy mid-run resumes instead of restarting from scratch — Harbor’s second duplicate came from a fresh enqueue on boot rather than resume.
Per-tick idempotency and the side-effect ledger
Every scheduled enqueue should carry:
idempotency_key = hash(schedule_id, version, tick_utc_instant)
payload = { "period_start": "...", "period_end": "...", "trigger": "cron" }
The worker checks a side-effect ledger before executing write tools.
If idempotency_key already reached
COMPLETED, return cached summary. If
IN_PROGRESS beyond lease TTL, escalate to
DLQ triage
instead of starting a parallel run.
Pair with retry policy: cron jobs should not blindly retry write tools on ambiguous timeout. The next cron tick is not a safe retry if the period boundary overlaps.
Harbor Finance refactor
Engineering shipped four changes after the duplicate close incident:
- UTC-based idempotency keys for all ticks, eliminating DST double-fire.
- Overlap lock with 4-hour lease on month-end close; mid-run deploys call resume, not re-enqueue.
- Missed policy = skip for all write-capable schedules; coalesce only on read-only scans.
- Schedule version in audit trail tied to compliance logs so auditors see which prompt/tool manifest ran for each period.
Duplicate journal rate on scheduled closes fell from 19% to 0.6% (remaining cases were manual reruns without new idempotency keys). Mean time to detect scheduler bugs dropped because overlap and skip metrics dashboarded alongside agent success rate.
Technique decision table
| Approach | Strengths | Weaknesses | Best for |
|---|---|---|---|
| In-process cron library | Simple, low ops | Dies with process; DST and overlap bugs | Dev/staging only |
| Dedicated scheduler + queue (recommended) | Durable enqueue, horizontal workers, uniform with webhooks | Requires registry and idempotency discipline | Production agents with write tools |
| External orchestrator (Airflow, Temporal) | Rich DAG, backfill UI, mature observability | Heavier; agent-specific payload mapping | Multi-step pipelines mixing agents and ETL |
| Webhook-only (no cron) | Event-exact timing | No periodic sweeps; gaps if events drop | Purely reactive workflows |
Common pitfalls
- Treating cron as fire-and-forget HTTP — scheduled agents need the same queue, dedup, and DLQ path as webhooks.
- Local-time idempotency keys — DST fall-back duplicates every job that keys on hour-of-day.
- Catch-up storms after outage — replaying 48 hourly ticks floods APIs and duplicates writes.
- No overlap guard on long jobs — stack depth grows until workers OOM or rate limits trip.
- Restart without resume — process boot re-enqueues active runs unless lease + checkpoint exist.
- Silent timezone defaults — UTC vs tenant local causes reports to close on the wrong business day.
- Cron for sub-minute latency — use webhooks or streaming; cron libraries are not real-time triggers.
Production checklist
- Store schedules in a versioned registry with tenant binding.
- Use IANA timezones; test DST spring and fall edges quarterly.
- Define missed-window policy per schedule (skip / coalesce / replay).
- Set overlap policy; forbid parallel runs on write-capable agents.
- Generate idempotency keys from schedule_id + version + UTC tick.
- Enqueue through the same durable queue as webhook ingress.
- Resume from checkpoint on deploy; never blind re-enqueue.
- Attach period boundaries (start/end) to scheduled payloads.
- Dashboard overlap skips, missed ticks, and duplicate-key rejects.
- Route exhausted scheduled runs to DLQ with tick metadata for replay.
Key takeaways
- Cron triggers need the same durability as webhooks — registry, queue, idempotency, DLQ.
- DST and overlap are the top duplicate-run sources — key on UTC instants and forbid overlapping writes.
- Missed-window policy must be explicit — skip for period-bound closes; coalesce for read sweeps.
- Harbor Finance cut duplicate closes from 19% to 0.6% with UTC keys, overlap locks, and skip-on-miss.
- Checkpoint resume beats restart-enqueue on deploy for long scheduled agent runs.
Related reading
- LLM agent webhook and async job queue systems explained — event ingress, dedup and durable triggers
- LLM agent durable state and checkpointing explained — resume long runs after deploy
- LLM agent retry and backoff explained — safe retries without duplicate side effects
- LLM agent dead letter queue explained — triage failed scheduled jobs