LLM computer use and browser automation agents explained

Harbor Insurance’s internal claims team spent forty minutes per complex auto claim copying fields between a 2009 policy admin portal, a PDF adjuster worksheet, and a modern CRM. The vendor offered no API; screen-scrape RPA broke every time CSS class names changed. An early ReAct agent with a single fetch_html tool hallucinated dropdown values it never saw. Human reviewers rejected 38% of agent-submitted claim packets for mismatched VIN digits or wrong coverage tier — errors tied to stale DOM snapshots and missing visual context on tabbed layouts.

Computer-use agents close the loop: the model observes a live browser or desktop surface, proposes low-level actions (click, type, scroll), executes them in a sandboxed runtime, and re-observes until the task completes or escalates to a human. This guide covers vision-first vs DOM-first architectures, action and observation schemas, set-of-mark grounding, credential isolation, step budgets and durable checkpoints, the Harbor Insurance refactor, a technique decision table vs traditional APIs and RPA, pitfalls, and a production checklist.

What “computer use” means in production

Computer use is not one product feature — it is an observation–action loop where the LLM acts as a policy over UI state. Each turn typically includes:

Observe — screenshot, accessibility tree slice, or hybrid.
Plan — next action with coordinates, element ref, or semantic target.
Act — runtime executes click/type/scroll in an isolated session.
Verify — diff observation against expected postcondition.
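The four steps above can be sketched as a small driver loop. This is a minimal illustration, not a reference implementation: the names Action, LoopResult, and run_loop, the callback signatures, and the default step cap are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Action:
    name: str           # e.g. "click", "type", "scroll"
    target: str = ""    # element ref or set-of-mark index
    text: str = ""


@dataclass
class LoopResult:
    status: str         # "done", "escalate", or "step_budget_exceeded"
    steps: int          # number of actions executed


def run_loop(
    observe: Callable[[], dict],
    plan: Callable[[dict], Optional[Action]],
    act: Callable[[Action], None],
    verify: Callable[[dict, Action, dict], bool],
    max_steps: int = 25,
) -> LoopResult:
    """Drive observe -> plan -> act -> verify until the planner stops."""
    for step in range(max_steps):
        before = observe()
        action = plan(before)
        if action is None:
            # The planner signals that the task is complete.
            return LoopResult("done", step)
        act(action)
        after = observe()
        if not verify(before, action, after):
            # Postcondition failed: hand off to a human rather than guess.
            return LoopResult("escalate", step + 1)
    return LoopResult("step_budget_exceeded", max_steps)
```

In a real deployment, plan would wrap the model call and act would wrap the sandboxed browser runtime; keeping them as plain callables makes the loop testable without either.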

Browser automation is the most common deployment shape: Playwright or Puppeteer behind a tool server, often paired with a vision-capable model. Full desktop control (mouse/keyboard across arbitrary apps) shares the same loop but raises security and reproducibility costs. Most teams start with single-origin browser sandboxes before expanding scope.

Vision-first vs DOM-first observation

The observation channel defines reliability and cost. Production systems usually implement both and route by page class.

Vision-first (screenshot + coordinates)

The model receives a PNG (often resized) and returns normalized (x, y) clicks or bounding boxes. Strengths: works on canvas UIs, PDF viewers, and legacy portals with opaque DOM. Weaknesses: high token cost per step, sensitivity to theme/zoom changes, and coordinate drift on responsive layouts. Mitigate with set-of-mark (SoM) overlays: numbered labels on interactive regions so the model outputs click(7) instead of raw pixels.
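As a rough sketch of set-of-mark grounding, the runtime can number the interactive regions and later translate the model's click(7) back into a pixel position. The Box type, the reading-order numbering, and the function names are assumptions for illustration; drawing the numbered labels onto the screenshot is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    x: float       # left edge in CSS pixels
    y: float       # top edge in CSS pixels
    width: float
    height: float


def assign_marks(boxes: list[Box]) -> dict[int, Box]:
    """Number boxes in reading order (top to bottom, then left to right)."""
    ordered = sorted(boxes, key=lambda b: (b.y, b.x))
    return {index: box for index, box in enumerate(ordered, start=1)}


def resolve_click(marks: dict[int, Box], index: int) -> tuple[float, float]:
    """Turn a model's click(index) into the centre point of the marked box."""
    if index not in marks:
        raise KeyError(f"unknown mark {index}; valid marks are {sorted(marks)}")
    box = marks[index]
    return (box.x + box.width / 2, box.y + box.height / 2)
```

Because the model only ever emits an integer, coordinate drift from resizing or zoom is absorbed by the runtime, which recomputes the boxes on every observation.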

DOM-first (accessibility tree and selectors)

The runtime exposes a pruned accessibility tree or a list of {ref, role, name, value} nodes. Actions target ref IDs. Strengths: deterministic, cheap in tokens, easy to log and replay. Weaknesses: breaks on shadow DOM quirks, custom widgets without ARIA, and cross-origin iframes. Harbor Insurance defaulted to DOM on stable form pages and fell back to vision on the policy tab strip where refs shifted between releases.
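A pruned node list of the kind described above might be produced along these lines. The input shape (nested dicts with role, name, value, and children keys), the set of interactive roles, and the e1, e2, ... ref scheme are assumptions for the sketch, not the format of any particular browser library.

```python
from typing import Optional

# Roles treated as interactive in this sketch; a real runtime would tune this set.
INTERACTIVE_ROLES = {"button", "link", "textbox", "combobox", "checkbox", "radio", "tab"}


def flatten_interactive(node: dict, out: Optional[list] = None) -> list:
    """Depth-first walk that keeps only interactive nodes and assigns refs."""
    if out is None:
        out = []
    if node.get("role") in INTERACTIVE_ROLES:
        out.append({
            "ref": f"e{len(out) + 1}",       # positional ref: e1, e2, ...
            "role": node["role"],
            "name": node.get("name", ""),
            "value": node.get("value", ""),
        })
    for child in node.get("children", []):
        flatten_interactive(child, out)
    return out
```

Positional refs like these are cheap but change whenever the page structure changes, which is one way refs can shift between releases as they did on Harbor's tab strip.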

Hybrid routing heuristic

If interactive_node_count < 80 and the page URL matches an allowlisted origin, use DOM. If the observation diff after an action is empty twice in a row, escalate to screenshot + SoM for that step. Log which channel succeeded so the results can feed future routing tables.
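The heuristic can be written down as a small pure function plus a success log. This is a sketch under assumptions: the function and class names are hypothetical, the origin check is passed in as a boolean, and pages that fail the DOM test are assumed to fall through to vision + SoM, which the heuristic itself does not spell out.

```python
from collections import Counter
from typing import Optional

DOM_NODE_LIMIT = 80        # threshold from the heuristic
EMPTY_DIFF_LIMIT = 2       # "empty twice in a row"


def choose_channel(interactive_node_count: int,
                   origin_allowlisted: bool,
                   empty_diff_streak: int) -> str:
    """Return "dom" or "vision_som" for the next step."""
    if empty_diff_streak >= EMPTY_DIFF_LIMIT:
        # Actions are having no visible effect: escalate this step.
        return "vision_som"
    if interactive_node_count < DOM_NODE_LIMIT and origin_allowlisted:
        return "dom"
    return "vision_som"    # assumed default when the DOM test fails


class RoutingLog:
    """Counts which channel succeeded per page class."""

    def __init__(self) -> None:
        self.successes: Counter = Counter()

    def record(self, page_class: str, channel: str, succeeded: bool) -> None:
        if succeeded:
            self.successes[(page_class, channel)] += 1

    def preferred(self, page_class: str) -> Optional[str]:
        counts = {ch: n for (pc, ch), n in self.successes.items() if pc == page_class}
        return max(counts, key=counts.get) if counts else None
```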

Action spaces and tool design

Treat browser control as a small, typed tool API — not free-form shell. A practical minimal set:

navigate(url) — only allowlisted hosts.
click(target) — ref ID or SoM index.
type(target, text, submit?) — clears field first.
scroll(direction, amount) — or scroll-into-view on target.
select_option(target, value) — prefer value over visible label.
wait_for(selector | text, timeout_ms) — explicit sync.
screenshot() — returns image handle, not base64 in chat history.
extract_table(selector) — structured read for verification.

Cap concurrent tabs at one unless the task spec requires comparison. Reject actions on targets outside the viewport unless they are preceded by scroll_into_view. When an element is missing, return a structured error envelope as the observation; an action must never be a silent no-op.
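One way to picture the error-envelope rule is a click tool that always returns a structured result. The ToolResult shape, the error codes, and the dict standing in for the page are all assumptions for this sketch; a real tool server would call the browser runtime instead.

```python
from dataclasses import dataclass, field


@dataclass
class ToolResult:
    ok: bool
    error_code: str = ""       # empty string when ok is True
    message: str = ""
    data: dict = field(default_factory=dict)


def click(page: dict, target: str) -> ToolResult:
    """Click a ref. `page` is a stand-in mapping of ref -> {"in_viewport": bool}."""
    element = page.get(target)
    if element is None:
        # Missing element: report it, and list valid refs so the model can recover.
        return ToolResult(False, "element_not_found",
                          f"no element with ref {target!r}",
                          {"known_refs": sorted(page)})
    if not element.get("in_viewport", False):
        return ToolResult(False, "outside_viewport",
                          "call scroll_into_view on this target first")
    return ToolResult(True, data={"clicked": target})
```

Returning the known refs alongside the error gives the model something concrete to act on in its next turn, instead of retrying the same missing target.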

Sandboxing, credentials, and blast radius

A computer-use agent with a real browser session is a remote code execution surface. Non-negotiable controls:

Network egress allowlist — block arbitrary navigation and file downloads.
Ephemeral profiles — fresh browser context per run; no persistent cookies across tenants.
Credential injection — vault fills login forms; secrets never enter the model context.
Download quarantine — scan and discard unless task explicitly needs a file artifact.
Human gates on submit/pay/delete — pause before irreversible actions.
Session recording — HAR + screenshot sequence for audit, with PII redaction.

Run browsers in disposable VMs or containers with no access to internal subnets beyond the target app’s reverse proxy. Pair with PII redaction on traces before they reach shared logging pipelines.
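Three of the controls above (navigation allowlist, credential injection, and human gates) reduce to small checks in the tool server. The sketch below is illustrative only: the https requirement, the {{secret:NAME}} placeholder syntax, the action-class names, and the function names are assumptions, and a network-level egress policy is still needed underneath.

```python
import re
from urllib.parse import urlsplit

ALLOWED_HOSTS = frozenset({"claims.harbor-ins.internal"})
GATED_ACTION_CLASSES = frozenset({"submit", "pay", "delete"})
_PLACEHOLDER = re.compile(r"\{\{secret:([A-Za-z0-9_]+)\}\}")


def navigation_allowed(url: str) -> bool:
    """Allow only https URLs whose exact hostname is on the allowlist."""
    parts = urlsplit(url)
    return parts.scheme == "https" and parts.hostname in ALLOWED_HOSTS


def requires_human_approval(action_class: str) -> bool:
    """Irreversible action classes pause for a human before executing."""
    return action_class in GATED_ACTION_CLASSES


def inject_secrets(text: str, vault: dict) -> str:
    """Replace placeholders at execution time; the model only sees the placeholder.

    Raises KeyError if the placeholder names a secret the vault does not hold.
    """
    return _PLACEHOLDER.sub(lambda match: vault[match.group(1)], text)
```

Matching on the parsed hostname rather than a substring of the URL matters here: a lookalike such as claims.harbor-ins.internal.evil.example would pass a naive "contains" check but fails this one.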

Harbor Insurance refactor

Harbor replaced brittle RPA macros with a human-in-the-loop computer-use agent for “standard auto” claims (single vehicle, no litigation flags). Architecture:

Playwright in a gVisor sandbox; egress limited to claims.harbor-ins.internal and the CRM API origin.
DOM-first form fill; SoM vision fallback on tab navigation.
Checkpoint after each major section (policy, damage, coverage).
Adjuster approval required before CRM submit; agent pre-fills only.
Golden replay tests on recorded HTML fixtures every deploy.
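The per-section checkpoints in the list above could be persisted roughly as follows. This is a sketch under assumptions: the JSON file layout, the function names, and the use of a local file are hypothetical, and Harbor's actual storage is not described here.

```python
import json
import os
import tempfile

SECTIONS = ["policy", "damage", "coverage"]   # the sections named above


def save_checkpoint(path: str, claim_id: str, completed: list, fields: dict) -> None:
    """Write the checkpoint atomically so a crash never leaves a half-written file."""
    payload = {"claim_id": claim_id, "completed": completed, "fields": fields}
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(payload, handle)
    os.replace(tmp_path, path)   # atomic rename within the same directory


def next_section(path: str):
    """Return the first section not yet checkpointed, or None when all are done."""
    if not os.path.exists(path):
        return SECTIONS[0]
    with open(path, encoding="utf-8") as handle:
        completed = json.load(handle)["completed"]
    for section in SECTIONS:
        if section not in completed:
            return section
    return None
```

On resume after a timeout, the agent calls next_section and restarts from that section instead of refilling the whole packet.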

Outcomes after six weeks on a 120-claim pilot: median manual handoff time fell from 40 min to 18 min; first-pass field accuracy rose from 62% to 91%; RPA maintenance tickets dropped from 11/month to 1/month. Model fees added $0.34 per claim, against $12 saved in adjuster labor at Harbor’s blended rate.
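For reference, the headline figures work out as follows, using only the numbers reported above and assuming no per-claim costs beyond model fees:

```latex
\[
\frac{40 - 18}{40} = 0.55 \quad \text{(55\% less median handoff time)}
\]
\[
91 - 62 = 29 \quad \text{(percentage-point gain in first-pass field accuracy)}
\]
\[
\$12.00 - \$0.34 = \$11.66 \quad \text{(net saving per claim)}
\]
```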

Technique decision table

Approach	Best when	Avoid when
REST/GraphQL API integration	Vendor exposes stable, documented endpoints	Never: prefer an API whenever one is available
DOM browser agent	Standard HTML forms, internal admin tools, moderate step count	Heavy canvas/WebGL UIs, offline desktop apps
Vision + SoM agent	Legacy layouts, PDF portals, irregular widgets	High-frequency micro-steps (use DOM or API)
Traditional RPA (selector macros)	Frozen UI, zero LLM budget, deterministic replay only	Frequent CSS/DOM changes, judgment-heavy branching
Full desktop computer use	Cross-app workflows (Excel + ERP + email)	Untrusted input, internet-facing agents, mobile targets
Subagent per page	Multi-site workflows with different tool policies	Single-form tasks (overhead dominates)

Reliability patterns

Step budgets and stall detection

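Cap steps per subtask (Harbor: 25 for policy lookup, 40 for full packet). If the same URL and screenshot hash repeat three times, abort with stuck_loop and surface to a human. Track the runtime’s termination predicates (task complete, budget exhausted, stuck_loop) separately from the model’s own stop reason, because a model can stop generating without the task being finished.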
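A step budget and loop check can live in one small object that the runtime consults after every observation. The sketch below assumes "repeat three times" means three consecutive identical (URL, screenshot hash) pairs; the class name and the step_budget_exceeded reason string are hypothetical, while stuck_loop follows the text.

```python
import hashlib
from collections import deque
from typing import Optional


class StallDetector:
    def __init__(self, max_steps: int, repeat_limit: int = 3) -> None:
        self.max_steps = max_steps
        self.repeat_limit = repeat_limit
        self.steps = 0
        # Only the most recent `repeat_limit` observations are kept.
        self._recent: deque = deque(maxlen=repeat_limit)

    def record(self, url: str, screenshot: bytes) -> Optional[str]:
        """Record one step; return a termination reason, or None to continue."""
        self.steps += 1
        digest = hashlib.sha256(screenshot).hexdigest()
        self._recent.append((url, digest))
        window_full = len(self._recent) == self.repeat_limit
        if window_full and len(set(self._recent)) == 1:
            return "stuck_loop"
        if self.steps >= self.max_steps:
            return "step_budget_exceeded"
        return None
```

A caller might construct StallDetector(max_steps=25) for policy lookup and StallDetector(max_steps=40) for a full packet, matching the Harbor budgets. Exact hashing only catches pixel-identical screenshots; pages with clocks or animations would need a tolerance-based comparison instead.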

Verification before handoff

After fill operations, run a read-back pass: extract critical fields from the DOM and compare to the source worksheet JSON. Mismatches trigger one self-correction attempt, then escalation.
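The read-back pass amounts to a field-by-field comparison with a single retry. In this sketch the field names, the whitespace and case normalization, and the callback signatures are assumptions; read_fields stands in for a DOM extraction and correct for one agent self-correction attempt.

```python
from typing import Callable

CRITICAL_FIELDS = ("vin", "coverage_tier")   # hypothetical field names


def normalize(value: str) -> str:
    """Ignore whitespace and case so '1hg cm' and '1HGCM' compare equal."""
    return "".join(value.split()).upper()


def read_back_mismatches(expected: dict, observed: dict,
                         fields=CRITICAL_FIELDS) -> dict:
    """Map each mismatched field to its (expected, observed) pair."""
    mismatches = {}
    for name in fields:
        want = normalize(str(expected.get(name, "")))
        got = normalize(str(observed.get(name, "")))
        if want != got:
            mismatches[name] = (expected.get(name), observed.get(name))
    return mismatches


def verify_with_retry(expected: dict,
                      read_fields: Callable[[], dict],
                      correct: Callable[[dict], None],
                      max_corrections: int = 1) -> str:
    """Allow one self-correction attempt by default, then escalate."""
    for attempt in range(max_corrections + 1):
        mismatches = read_back_mismatches(expected, read_fields())
        if not mismatches:
            return "verified"
        if attempt < max_corrections:
            correct(mismatches)
    return "escalate"
```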

Evaluation and regression

Build a frozen fixture suite: saved HTML + expected action traces. Score task success, field accuracy, and steps-to-complete in CI. See agent evaluation for trajectory metrics beyond final answer correctness.
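The three CI scores named above could be aggregated from per-fixture results along these lines. The RunResult fields, the use of a median for steps-to-complete, and the choice to count steps only on successful runs are assumptions for the sketch.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class RunResult:
    fixture: str
    succeeded: bool
    fields_correct: int
    fields_total: int
    steps: int


def score(runs: list) -> dict:
    """Aggregate task success, field accuracy, and steps-to-complete."""
    if not runs:
        raise ValueError("no fixture runs to score")
    successes = [run for run in runs if run.succeeded]
    total_fields = sum(run.fields_total for run in runs)
    return {
        "task_success": len(successes) / len(runs),
        "field_accuracy": sum(run.fields_correct for run in runs) / total_fields,
        # Steps are only meaningful for runs that finished the task.
        "median_steps_to_complete": (
            median(run.steps for run in successes) if successes else None
        ),
    }
```

A CI job can then fail the build when any score drops below a threshold agreed for the fixture suite.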

Common pitfalls

Dumping full screenshots into chat history — blows context; store externally and pass handles.
No allowlist on navigate — one prompt injection leads to exfiltration.
Trusting visible button text — “Submit” vs “Submit draft”; verify URL and postcondition.
Ignoring loading states — click before network idle causes mis-clicks; always wait.
Cross-tenant session reuse — catastrophic privacy failure; isolate contexts.
Vision-only on dense tables — OCR errors on policy numbers; use DOM extract.
Skipping human approval on commit actions — agents should draft, not silently publish.

Designer and engineer checklist

Confirm no API or bulk export exists before investing in UI agents.
Define allowlisted origins and forbidden action classes (pay, delete, email send).
Implement DOM-first with vision/SoM fallback and log routing decisions.
Keep action vocabulary small; return structured observations on failure.
Inject credentials from vault; never log secrets or full screenshots with PII.
Checkpoint after each logical section; support resume on timeout.
Require human approval before irreversible submits.
Cap steps; detect screenshot/DOM hash loops.
Run read-back verification on critical fields before handoff.
Maintain HTML fixture replay tests in CI; track steps-to-success.

Key takeaways

Computer use is an observe–act loop, not a one-shot screenshot prompt.
DOM-first saves cost; vision + SoM handles legacy UI — hybrid routing wins.
Sandboxing and allowlists are mandatory — treat the browser as untrusted code execution.
Harbor cut median handoff time 55% (40 min to 18 min) and raised first-pass accuracy 62% → 91% on a no-API portal.
Prefer APIs when they exist; agents fill the integration gap and do not replace good engineering.