LLM computer use and browser automation agents explained
Harbor Insurance’s internal claims team spent forty minutes per complex auto claim copying fields between a 2009 policy admin portal, a PDF adjuster worksheet, and a modern CRM. The vendor offered no API; screen-scrape RPA broke every time CSS class names changed. An early ReAct agent with a single fetch_html tool hallucinated dropdown values it never saw. Human reviewers rejected 38% of agent-submitted claim packets for mismatched VIN digits or wrong coverage tier, errors tied to stale DOM snapshots and missing visual context on tabbed layouts.
Computer-use agents close the loop: the model observes a live browser or desktop surface, proposes low-level actions (click, type, scroll), executes them in a sandboxed runtime, and re-observes until the task completes or escalates to a human. This guide covers vision-first vs DOM-first architectures, action and observation schemas, set-of-mark grounding, credential isolation, step budgets and durable checkpoints, the Harbor Insurance refactor, a technique decision table vs traditional APIs and RPA, pitfalls, and a production checklist.
What “computer use” means in production
Computer use is not one product feature — it is an observation–action loop where the LLM acts as a policy over UI state. Each turn typically includes:
- Observe — screenshot, accessibility tree slice, or hybrid.
- Plan — next action with coordinates, element ref, or semantic target.
- Act — runtime executes click/type/scroll in an isolated session.
- Verify — diff observation against expected postcondition.
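The four phases above can be sketched as a small driver loop. This is a minimal illustration, not a reference implementation: `Action`, `run_loop`, and the four callables are hypothetical names, and a real runtime would back `observe` and `act` with a browser session rather than plain functions.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Action:
    kind: str                      # e.g. "click", "type", "scroll", or "done"
    target: Optional[str] = None   # element ref or SoM index (hypothetical field)
    text: Optional[str] = None     # payload for "type" actions


def run_loop(
    observe: Callable[[], dict],
    plan: Callable[[dict], Action],
    act: Callable[[Action], None],
    verify: Callable[[dict, Action, dict], bool],
    max_steps: int = 25,
) -> str:
    """Run observe -> plan -> act -> verify until done, failure, or budget."""
    for _ in range(max_steps):
        before = observe()              # Observe: current UI state
        action = plan(before)           # Plan: the model proposes the next action
        if action.kind == "done":
            return "completed"
        act(action)                     # Act: the runtime executes it
        after = observe()               # Re-observe for the postcondition check
        if not verify(before, action, after):
            return "escalate"           # Verify failed: hand off to a human
    return "budget_exhausted"
```

Keeping the four phases as injected callables means the same loop can be driven by stubs in tests and by a real browser runtime in production.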
Browser automation is the most common deployment shape: Playwright or Puppeteer behind a tool server, often paired with a vision-capable model. Full desktop control (mouse/keyboard across arbitrary apps) shares the same loop but raises security and reproducibility costs. Most teams start with single-origin browser sandboxes before expanding scope.
Vision-first vs DOM-first observation
The observation channel defines reliability and cost. Production systems usually implement both and route by page class.
Vision-first (screenshot + coordinates)
The model receives a PNG (often resized) and returns normalized
(x, y) clicks or bounding boxes. Strengths: works on
canvas UIs, PDF viewers, and legacy portals with opaque DOM. Weaknesses:
high token cost per step, sensitivity to theme/zoom changes, and
coordinate drift on responsive layouts. Mitigate with
set-of-mark (SoM) overlays: numbered labels on
interactive regions so the model outputs click(7) instead
of raw pixels.
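As a rough sketch of the bookkeeping behind set-of-mark grounding, the snippet below numbers interactive regions and turns a `click(7)`-style index back into pixel coordinates. Drawing the labels onto the screenshot is left out; `Box`, `assign_marks`, and `resolve_click` are hypothetical names, and the reading-order numbering is an assumption rather than a fixed convention.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Box:
    x: int        # left edge in pixels
    y: int        # top edge in pixels
    width: int
    height: int


def assign_marks(boxes: List[Box]) -> Dict[int, Box]:
    """Number regions in reading order: top to bottom, then left to right."""
    ordered = sorted(boxes, key=lambda b: (b.y, b.x))
    return {index: box for index, box in enumerate(ordered, start=1)}


def resolve_click(marks: Dict[int, Box], index: int) -> Tuple[int, int]:
    """Translate a mark index from the model into the centre of its region."""
    if index not in marks:
        raise KeyError(f"no mark numbered {index} on this screenshot")
    box = marks[index]
    return (box.x + box.width // 2, box.y + box.height // 2)
```

Because the runtime owns the index-to-pixel mapping, a change in zoom or layout only requires recomputing the marks for the new screenshot; the model's output format stays the same.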
DOM-first (accessibility tree and selectors)
The runtime exposes a pruned accessibility tree or a list of
{ref, role, name, value} nodes. Actions target
ref IDs. Strengths: deterministic, cheap in tokens, easy
to log and replay. Weaknesses: breaks on shadow DOM quirks, custom
widgets without ARIA, and cross-origin iframes. Harbor Insurance
defaulted to DOM on stable form pages and fell back to vision on the
policy tab strip where refs shifted between releases.
Hybrid routing heuristic
If interactive_node_count < 80 and the page URL matches
an allowlisted origin, use DOM. If observation diff after an action
is empty twice in a row, escalate to screenshot + SoM for that step.
Log which channel succeeded for future routing tables.
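The heuristic can be written as a small routing function. This is a sketch under stated assumptions: the `https` scheme on the allowlisted origin, the function and constant names, and the choice to fall back to vision whenever the DOM conditions are not met are all illustrative, since the text only specifies when to use DOM and when to escalate.

```python
from urllib.parse import urlsplit

# Assumption: the allowlisted origin uses https; the host is from the Harbor example.
ALLOWLISTED_ORIGINS = {"https://claims.harbor-ins.internal"}
DOM_NODE_LIMIT = 80          # threshold from the heuristic above
EMPTY_DIFF_ESCALATION = 2    # "empty twice in a row"


def origin_of(url: str) -> str:
    """Return scheme://host[:port] for a URL."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


def choose_channel(url: str, interactive_node_count: int, empty_diff_streak: int) -> str:
    """Pick the observation channel for the next step."""
    # Two consecutive actions with no observable change: escalate to screenshot + SoM.
    if empty_diff_streak >= EMPTY_DIFF_ESCALATION:
        return "vision_som"
    if interactive_node_count < DOM_NODE_LIMIT and origin_of(url) in ALLOWLISTED_ORIGINS:
        return "dom"
    # Assumed default: anything else goes to vision.
    return "vision_som"
```

Logging the returned channel alongside the step outcome gives the data needed to build the routing tables mentioned above.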
Action spaces and tool design
Treat browser control as a small, typed tool API — not free-form shell. A practical minimal set:
- navigate(url): only allowlisted hosts.
- click(target): ref ID or SoM index.
- type(target, text, submit?): clears field first.
- scroll(direction, amount): or scroll-into-view on target.
- select_option(target, value): prefer value over visible label.
- wait_for(selector | text, timeout_ms): explicit sync.
- screenshot(): returns image handle, not base64 in chat history.
- extract_table(selector): structured read for verification.
Cap concurrent tabs at one unless the task spec requires comparison. Reject actions outside the viewport unless preceded by scroll_into_view. Return structured observations via tool error envelopes when elements are missing; an action must never fail as a silent no-op.
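One way to keep the vocabulary closed and failures explicit is to validate every call against a fixed table and return a structured envelope. The sketch below is illustrative only: the `ToolResult` fields, the error codes, and the `known_refs` argument are assumptions, not a standard schema.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Set

# Action names and parameters mirror the minimal set listed above.
ACTION_PARAMS: Dict[str, Set[str]] = {
    "navigate": {"url"},
    "click": {"target"},
    "type": {"target", "text", "submit"},
    "scroll": {"direction", "amount"},
    "select_option": {"target", "value"},
    "wait_for": {"selector", "text", "timeout_ms"},
    "screenshot": set(),
    "extract_table": {"selector"},
}


@dataclass
class ToolResult:
    ok: bool
    action: str
    error_code: Optional[str] = None           # hypothetical codes, e.g. "element_not_found"
    message: Optional[str] = None
    observation: Optional[Dict[str, Any]] = None


def validate_action(name: str, args: Dict[str, Any]) -> Optional[ToolResult]:
    """Return an error envelope for a malformed call, or None if it is well formed."""
    if name not in ACTION_PARAMS:
        return ToolResult(False, name, "unknown_action", f"{name!r} is not in the action vocabulary")
    extra = set(args) - ACTION_PARAMS[name]
    if extra:
        return ToolResult(False, name, "unexpected_argument", f"unexpected arguments: {sorted(extra)}")
    return None


def click(target: str, known_refs: Set[str]) -> ToolResult:
    """Stub click: a missing element yields an envelope rather than a silent no-op."""
    if target not in known_refs:
        return ToolResult(
            False, "click", "element_not_found",
            f"no element with ref {target!r}",
            {"available_refs": sorted(known_refs)},
        )
    return ToolResult(True, "click", observation={"clicked": target})
```

Returning the available refs in the failure envelope gives the model something concrete to re-plan against on the next turn.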
Sandboxing, credentials, and blast radius
A computer-use agent with a real browser session is a remote code execution surface. Non-negotiable controls:
- Network egress allowlist — block arbitrary navigation and file downloads.
- Ephemeral profiles — fresh browser context per run; no persistent cookies across tenants.
- Credential injection — vault fills login forms; secrets never enter the model context.
- Download quarantine — scan and discard unless task explicitly needs a file artifact.
- Human gates on submit/pay/delete — pause before irreversible actions.
- Session recording — HAR + screenshot sequence for audit, with PII redaction.
Run browsers in disposable VMs or containers with no access to internal subnets beyond the target app’s reverse proxy. Pair with PII redaction on traces before they reach shared logging pipelines.
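Two of these controls are easy to illustrate in a few lines: a host allowlist check in front of navigation, and credential injection through placeholders that the runtime resolves at execution time. The `{{vault:name}}` placeholder syntax, the function names, and the plain dict standing in for a vault client are all assumptions made for this sketch.

```python
import re
from typing import FrozenSet, Mapping
from urllib.parse import urlsplit

# Host taken from the Harbor example; a real deployment would load this from config.
ALLOWED_HOSTS: FrozenSet[str] = frozenset({"claims.harbor-ins.internal"})

# Hypothetical placeholder syntax the model may emit instead of a real secret.
SECRET_REF = re.compile(r"\{\{vault:([a-z0-9_]+)\}\}")


def is_navigation_allowed(url: str, allowed_hosts: FrozenSet[str] = ALLOWED_HOSTS) -> bool:
    """Allow only https URLs whose exact hostname is on the allowlist."""
    parts = urlsplit(url)
    return parts.scheme == "https" and parts.hostname in allowed_hosts


def resolve_secrets(text: str, vault: Mapping[str, str]) -> str:
    """Swap placeholders for real values just before the runtime types them.

    Raises KeyError for an unknown secret name instead of typing the raw placeholder.
    """
    return SECRET_REF.sub(lambda match: vault[match.group(1)], text)
```

The model's transcript and the action log would carry only the placeholder string; the resolved value exists solely inside the runtime call that fills the field. Comparing the parsed hostname exactly, rather than substring-matching the URL, avoids lookalike hosts such as an allowlisted name used as a subdomain prefix.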
Harbor Insurance refactor
Harbor replaced brittle RPA macros with a human-in-the-loop computer-use agent for “standard auto” claims (single vehicle, no litigation flags). Architecture:
- Playwright in a gVisor sandbox; egress limited to claims.harbor-ins.internal and the CRM API origin.
- DOM-first form fill; SoM vision fallback on tab navigation.
- Checkpoint after each major section (policy, damage, coverage).
- Adjuster approval required before CRM submit; agent pre-fills only.
- Golden replay tests on recorded HTML fixtures every deploy.
Outcomes after six weeks on a 120-claim pilot: median manual handoff time 40 min → 18 min; first-pass field accuracy 62% → 91%; RPA maintenance tickets 11/month → 1/month. Model fees added $0.34 per claim, while adjuster labor fell by $12 per claim at Harbor’s blended rate.
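For readers who want to check the arithmetic behind these figures, the short calculation below derives the relative changes and the net per-claim saving from the reported numbers only. The variable and function names are illustrative.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative drop from `before` to `after`, in percent."""
    return (before - after) / before * 100


# Figures reported for the pilot above.
handoff_cut = percent_reduction(40, 18)    # median handoff minutes
ticket_cut = percent_reduction(11, 1)      # RPA maintenance tickets per month
accuracy_gain_points = 91 - 62             # first-pass field accuracy, percentage points
net_saving_per_claim = 12.00 - 0.34        # labor saved minus added model fees, USD

print(f"handoff time reduced by {handoff_cut:.0f}%")
print(f"maintenance tickets reduced by {ticket_cut:.0f}%")
print(f"accuracy up {accuracy_gain_points} points")
print(f"net saving per claim: ${net_saving_per_claim:.2f}")
```

The handoff reduction works out to about 55%, and the added model cost is small relative to the labor saved on each claim.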
Technique decision table
| Approach | Best when | Avoid when |
|---|---|---|
| REST/GraphQL API integration | Vendor exposes stable, documented endpoints | Never — always prefer APIs if available |
| DOM browser agent | Standard HTML forms, internal admin tools, moderate step count | Heavy canvas/WebGL UIs, offline desktop apps |
| Vision + SoM agent | Legacy layouts, PDF portals, irregular widgets | High-frequency micro-steps (use DOM or API) |
| Traditional RPA (selector macros) | Frozen UI, zero LLM budget, deterministic replay only | Frequent CSS/DOM changes, judgment-heavy branching |
| Full desktop computer use | Cross-app workflows (Excel + ERP + email) | Untrusted input, internet-facing agents, mobile targets |
| Subagent per page | Multi-site workflows with different tool policies | Single-form tasks (overhead dominates) |
Reliability patterns
Step budgets and stall detection
Cap steps per subtask (Harbor: 25 for policy lookup, 40 for full packet). If the same URL and screenshot hash repeat three times, abort with stuck_loop and surface to a human. Track termination predicates separately from model stop reasons.
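A step budget and loop detector can live in one small object that the runtime consults after every action. This sketch reads "repeat three times" as three consecutive identical observations, which is an assumption; `StallDetector` and the `step_budget_exhausted` reason are hypothetical names, while `stuck_loop` comes from the text.

```python
import hashlib
from collections import deque
from typing import Deque, Optional, Tuple


class StallDetector:
    """Tracks the step budget and repeated (URL, screenshot hash) observations."""

    def __init__(self, max_steps: int, repeat_limit: int = 3) -> None:
        self.max_steps = max_steps
        self.repeat_limit = repeat_limit
        self.steps = 0
        # Only the last `repeat_limit` observations matter for loop detection.
        self._recent: Deque[Tuple[str, str]] = deque(maxlen=repeat_limit)

    def record(self, url: str, screenshot_bytes: bytes) -> Optional[str]:
        """Record one step; return a termination reason, or None to continue."""
        self.steps += 1
        digest = hashlib.sha256(screenshot_bytes).hexdigest()
        self._recent.append((url, digest))
        if len(self._recent) == self.repeat_limit and len(set(self._recent)) == 1:
            return "stuck_loop"
        if self.steps >= self.max_steps:
            return "step_budget_exhausted"
        return None
```

Because the detector returns its own reason codes, the runtime can log them independently of whatever stop reason the model reports. An exact byte hash changes on any pixel difference, so pages with animations or blinking cursors may need a coarser comparison.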
Verification before handoff
After fill operations, run a read-back pass: extract critical fields from the DOM and compare to the source worksheet JSON. Mismatches trigger one self-correction attempt, then escalation.
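The read-back pass reduces to a field-by-field comparison plus a single retry. In this sketch the field names, the whitespace and case normalization, and the `read_fields` and `correct` callables are assumptions; a real system would choose normalization per field, since some values are case-sensitive.

```python
from typing import Callable, Dict, Iterable, Mapping, Tuple

Mismatches = Dict[str, Tuple[str, str]]


def normalize(value: object) -> str:
    """Collapse whitespace and ignore case (an assumed normalization)."""
    return " ".join(str(value).split()).upper()


def find_mismatches(
    source: Mapping[str, object],
    read_back: Mapping[str, object],
    critical_fields: Iterable[str],
) -> Mismatches:
    """Map each differing field to its (expected, actual) normalized values."""
    mismatches: Mismatches = {}
    for field in critical_fields:
        expected = normalize(source.get(field, ""))
        actual = normalize(read_back.get(field, ""))
        if expected != actual:
            mismatches[field] = (expected, actual)
    return mismatches


def verify_with_retry(
    source: Mapping[str, object],
    read_fields: Callable[[], Mapping[str, object]],
    correct: Callable[[Mismatches], None],
    critical_fields: Iterable[str],
) -> str:
    """Allow one self-correction attempt, then escalate."""
    fields = list(critical_fields)
    mismatches = find_mismatches(source, read_fields(), fields)
    if not mismatches:
        return "ok"
    correct(mismatches)                      # single self-correction attempt
    if not find_mismatches(source, read_fields(), fields):
        return "ok_after_correction"
    return "escalate"
```

Passing the mismatch map to `correct` lets the agent re-fill only the fields that failed instead of repeating the whole form.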
Evaluation and regression
Build a frozen fixture suite: saved HTML + expected action traces. Score task success, field accuracy, and steps-to-complete in CI. See agent evaluation for trajectory metrics beyond final answer correctness.
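A CI job could aggregate per-fixture results into the three scores named above. The record shape and metric names here are hypothetical, and computing the step median over successful runs only is an assumption about how steps-to-complete should be reported.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List, Optional


@dataclass
class RunResult:
    fixture: str          # name of the saved HTML fixture
    succeeded: bool       # task reached its expected end state
    fields_correct: int
    fields_total: int
    steps: int


def score(results: List[RunResult]) -> Dict[str, Optional[float]]:
    """Aggregate task success, field accuracy, and steps-to-complete."""
    if not results:
        raise ValueError("no fixture results to score")
    successes = [r for r in results if r.succeeded]
    total_fields = sum(r.fields_total for r in results)
    return {
        "task_success_rate": len(successes) / len(results),
        "field_accuracy": (
            sum(r.fields_correct for r in results) / total_fields if total_fields else None
        ),
        # Steps are only meaningful for runs that actually finished the task.
        "median_steps_to_complete": (
            float(median([r.steps for r in successes])) if successes else None
        ),
    }
```

Failing the build when any score drops below the previous release's value turns the fixture suite into a regression gate rather than a dashboard.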
Common pitfalls
- Dumping full screenshots into chat history — blows context; store externally and pass handles.
- No allowlist on navigate — one prompt injection leads to exfiltration.
- Trusting visible button text — “Submit” vs “Submit draft”; verify URL and postcondition.
- Ignoring loading states — click before network idle causes mis-clicks; always wait.
- Cross-tenant session reuse — catastrophic privacy failure; isolate contexts.
- Vision-only on dense tables — OCR errors on policy numbers; use DOM extract.
- Skipping human approval on commit actions — agents should draft, not silently publish.
Designer and engineer checklist
- Confirm no API or bulk export exists before investing in UI agents.
- Define allowlisted origins and forbidden action classes (pay, delete, email send).
- Implement DOM-first with vision/SoM fallback and log routing decisions.
- Keep action vocabulary small; return structured observations on failure.
- Inject credentials from vault; never log secrets or full screenshots with PII.
- Checkpoint after each logical section; support resume on timeout.
- Require human approval before irreversible submits.
- Cap steps; detect screenshot/DOM hash loops.
- Run read-back verification on critical fields before handoff.
- Maintain HTML fixture replay tests in CI; track steps-to-success.
Key takeaways
- Computer use is an observe–act loop, not a one-shot screenshot prompt.
- DOM-first saves cost; vision + SoM handles legacy UI — hybrid routing wins.
- Sandboxing and allowlists are mandatory — treat the browser as untrusted code execution.
- Harbor cut median handoff time 55% (40 → 18 min) and raised first-pass accuracy 62% → 91% on a no-API portal.
- Prefer APIs when they exist; agents fill the integration gap and do not replace good engineering.
Related reading
- Sandbox execution — isolating tool and browser runtimes
- ReAct agent loop — the control loop inside each automation session
- Durable agent execution — resuming long form-fill workflows
- Human-in-the-loop — approval gates before irreversible UI actions