
LLM agent sandbox code execution systems explained

Harbor Analytics shipped a data-science agent that could write and run Python to clean CSVs, plot trends, and join warehouse tables. The first version executed code in the same container as the API — with access to environment variables, the host filesystem, and outbound network. A user prompt that asked the model to “debug the connection string” caused generated code to print os.environ into the tool observation. Security review found 19% of production runs in the first month touched sensitive paths or env vars. Two incidents escalated to full credential rotation. Platform engineering rebuilt execution behind a dedicated sandbox tier before the product could scale beyond internal teams.

Sandbox code execution for LLM agents means running model-authored programs in an isolated runtime with explicit resource ceilings, no ambient secrets, and controlled egress — then returning only sanitized stdout, stderr, and structured artifacts to the agent loop. This guide covers isolation tiers, policy surfaces, output capture integration with observation budgets and approval gates, the Harbor Analytics refactor, a technique decision table, pitfalls, and a production checklist.

Why agents need a separate execution plane

Code-interpreter tools are not “just another HTTP call.” The model emits arbitrary logic that runs with the privileges of whatever process hosts it. Without a boundary, a single successful prompt injection or hallucinated import can:

  • Read API keys from environment variables or mounted secrets.
  • Scan internal networks (169.254.169.254 metadata, Redis, Postgres on localhost).
  • Exfiltrate user uploads or neighboring tenants’ files on shared disks.
  • Consume unbounded CPU/RAM and stall the entire worker fleet.

Treat execution as a separate trust domain from the orchestrator. The orchestrator holds long-lived credentials; the sandbox receives only scoped, short-lived tokens for explicitly allowed operations (see credential injection). Assume every run is compromised: even benign user intent can produce destructive code when the model misjudges scope.
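
As a minimal sketch of "no ambient secrets", the snippet below builds the child's environment from an explicit allowlist instead of inheriting the orchestrator's. The names SAFE_ENV_KEYS, build_child_env, and run_untrusted are hypothetical, and a bare subprocess is not an isolation boundary; the example only illustrates the environment handling.

```python
import os
import subprocess
import sys

# Hypothetical allowlist: the only variables the child is permitted to see.
SAFE_ENV_KEYS = ("PATH", "LANG")


def build_child_env(parent_env):
    """Return a new environment containing only allowlisted variables."""
    return {k: parent_env[k] for k in SAFE_ENV_KEYS if k in parent_env}


def run_untrusted(code, parent_env=None):
    """Run `code` in a child interpreter without the parent's environment."""
    env = build_child_env(os.environ if parent_env is None else parent_env)
    return subprocess.run(
        # -I is CPython's isolated mode: ignores PYTHON* variables and user site.
        [sys.executable, "-I", "-c", code],
        env=env,  # passing env explicitly disables the default inheritance
        capture_output=True,
        text=True,
        timeout=10,
    )
```

Without the env argument, subprocess.run copies the parent's entire environment into the child, which is the same class of leak as the os.environ incident described at the top of this guide.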

Isolation tiers: from process jails to microVMs

Pick isolation depth from blast radius, startup latency, and cost per invocation. Most production stacks use different tiers for different tool classes.

Lightweight: subprocess + seccomp + namespaces

Fork a child with dropped capabilities, no_new_privs set, a read-only root, and a tmpfs workspace. Cold start is fast (tens of ms), but the child still shares the host kernel, so every syscall the filter permits remains attack surface. Suitable only for trusted tenants or heavily restricted DSLs, not general Python.

Container + gVisor / Kata

OCI image per language runtime. gVisor intercepts syscalls in a user-space kernel, while Kata Containers runs each container inside a lightweight VM; both give stronger kernel isolation than vanilla Docker. Startup is typically 200–800 ms with a warm image cache. Good default for multi-tenant SaaS agents at moderate scale.

MicroVM (Firecracker, Cloud Hypervisor)

One lightweight VM per execution with minimal device model. Strongest isolation among common options; cold start 300 ms–2 s depending on snapshot pooling. Prefer for untrusted code, crypto-adjacent workloads, or when compliance demands hardware-enforced boundaries.

WASM runtimes (Wasmtime, Wasmer)

Compile or interpret sandboxed bytecode with linear memory and no ambient I/O unless host functions grant it. Excellent for small transforms and formula evaluation; limited for arbitrary pandas/numpy stacks unless you ship large WASI bundles.

Pool warm sandboxes to hide cold-start latency: maintain N idle microVMs or containers, assign on tool invoke, destroy or hard-reset after each run. Never reuse a dirty filesystem across tenants without verified wipe.
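
The pooling rule can be sketched as a small class. This assumes a hypothetical FakeSandbox handle standing in for a real microVM or container; the point is only that a used sandbox is destroyed and replaced, never returned to the idle set.

```python
import itertools
from collections import deque


class FakeSandbox:
    """Stand-in for a microVM or container handle (hypothetical)."""

    _ids = itertools.count(1)

    def __init__(self):
        self.id = next(self._ids)
        self.destroyed = False

    def destroy(self):
        self.destroyed = True


class WarmPool:
    """Keeps up to `size` idle sandboxes; each sandbox serves one run only."""

    def __init__(self, size, factory=FakeSandbox):
        self._size = size
        self._factory = factory
        self._idle = deque(factory() for _ in range(size))

    def acquire(self):
        # Fall back to a cold boot when the pool is drained.
        return self._idle.popleft() if self._idle else self._factory()

    def release(self, sandbox):
        # Never recycle a used sandbox: destroy it and boot a fresh replacement.
        sandbox.destroy()
        if len(self._idle) < self._size:
            self._idle.append(self._factory())

    def idle_count(self):
        return len(self._idle)
```

A real pool would refill asynchronously so that release does not block on a boot; the synchronous refill here keeps the sketch short.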

Policy surfaces: network, filesystem, and syscalls

Isolation without policy still leaks data. Enforce defaults at the infrastructure layer, not in generated code.

Network egress

Default deny-all outbound. Allowlist only domains the tool needs (e.g. pypi.org if pip install is enabled, or none if dependencies are pre-baked). Block link-local, RFC1918, and cloud metadata IPs. Log every allowed connection with run ID for forensics.
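
The decision logic an egress proxy might apply can be sketched with the standard ipaddress module. The allowlist contents and the function names are assumptions for illustration; actual enforcement belongs in the firewall or proxy, not in code running inside the sandbox.

```python
import ipaddress

# Hypothetical allowlist; leave empty when dependencies are pre-baked.
ALLOWED_HOSTS = frozenset({"pypi.org", "files.pythonhosted.org"})

BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",  # RFC 1918 private ranges
        "169.254.0.0/16",  # link-local, includes the 169.254.169.254 metadata IP
        "127.0.0.0/8",     # loopback
        "::1/128", "fe80::/10", "fc00::/7",  # IPv6 loopback, link-local, unique local
    )
]


def ip_is_blocked(ip_text):
    """True if the address falls in a private, link-local, or loopback range."""
    ip = ipaddress.ip_address(ip_text)
    # Unwrap IPv4-mapped IPv6 (::ffff:a.b.c.d) so it cannot bypass the v4 rules.
    if ip.version == 6 and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    return any(ip in net for net in BLOCKED_NETWORKS)


def egress_allowed(host, resolved_ip):
    """Default deny: the host must be allowlisted AND resolve to a public IP."""
    return host.lower().rstrip(".") in ALLOWED_HOSTS and not ip_is_blocked(resolved_ip)
```

Checking the resolved IP as well as the hostname matters because an allowlisted name can be made to resolve to an internal address.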

Filesystem

Mount a fresh ephemeral workspace per run. Inject user files read-only at known paths; write outputs to a quota-bounded directory. Hide /proc secrets, deny mount operations, and block access to host paths even via ../../../ tricks.
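
For orchestrator-side helpers that accept paths from the model (for example when staging or collecting files), a containment check along these lines rejects traversal attempts. This is a sketch with a hypothetical function name; it complements mount-level isolation and does not replace it.

```python
from pathlib import Path


def resolve_in_workspace(workspace, requested):
    """Resolve `requested` under `workspace`, rejecting anything that escapes it."""
    root = Path(workspace).resolve()
    # resolve() collapses ".." segments and follows symlinks before the check.
    # An absolute `requested` replaces root entirely and is rejected below.
    candidate = (root / requested).resolve()
    try:
        candidate.relative_to(root)
    except ValueError:
        raise PermissionError(f"path escapes workspace: {requested!r}") from None
    return candidate
```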

Syscalls and capabilities

Use seccomp profiles or runtime-specific syscall filters. Drop CAP_SYS_ADMIN, CAP_NET_RAW, and CAP_SYS_PTRACE, and filter the ptrace syscall. Block execve if your threat model forbids subprocess spawning from user code.

Language/runtime hardening

Pin interpreter versions, disable dangerous modules where possible (subprocess, raw sockets), and pre-install dependencies so production runs never need live pip install from the open internet.

Resource limits and kill semantics

Models write infinite loops. Sandboxes must enforce hard ceilings and return structured timeout errors the agent can reason about (see tool error envelopes).

  • Wall-clock timeout — 30–120 s typical for interactive agents; shorter for calculator-style tools.
  • CPU quota — cgroup limit so one run cannot saturate a host.
  • Memory cap — OOM kill the sandbox, not the orchestrator; return MemoryLimitExceeded observation.
  • Output size cap — truncate stdout beyond N KB and spill large artifacts to object storage with signed URLs.
  • Concurrency per tenant — queue excess runs; integrate with rate limiting.

On timeout or OOM, destroy the sandbox instance. Partial stdout from a killed run should not be trusted as complete — mark observations partial: true so the planner can retry with a smaller dataset or different approach.
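
The limits and the partial flag can be sketched with POSIX rlimits and a wall-clock timeout. This is an assumption-laden illustration: production tiers would normally use cgroups and VM-level limits, the error names and MAX_STDOUT_BYTES are hypothetical, and with RLIMIT_AS an over-limit allocation shows up as a MemoryError in the child rather than as an OOM kill.

```python
import resource
import subprocess
import sys
import time

MAX_STDOUT_BYTES = 64 * 1024  # hypothetical output cap


def _apply_rlimits(memory_bytes, cpu_seconds):
    """Build a hook that runs in the child just before exec (POSIX only)."""
    def hook():
        resource.setrlimit(resource.RLIMIT_AS, (memory_bytes, memory_bytes))
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
    return hook


def run_with_limits(code, timeout_s=30.0, memory_bytes=512 * 1024 * 1024, cpu_seconds=30):
    """Run `code` under limits and return a structured result dict."""
    started = time.monotonic()
    proc = subprocess.Popen(
        [sys.executable, "-I", "-c", code],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        env={},  # no inherited environment
        preexec_fn=_apply_rlimits(memory_bytes, cpu_seconds),
    )
    error = None
    try:
        # Note: communicate() buffers output in memory; a real service would
        # stream and stop reading once the cap is hit.
        out, err = proc.communicate(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        proc.kill()
        out, err = proc.communicate()  # collect whatever was written before the kill
        error = "WallClockTimeout"
    if error is None and b"MemoryError" in err:
        error = "MemoryLimitExceeded"
    truncated = len(out) > MAX_STDOUT_BYTES
    return {
        "exit_code": proc.returncode,
        "error": error,
        "stdout": out[:MAX_STDOUT_BYTES].decode("utf-8", errors="replace"),
        "stderr": err[:MAX_STDOUT_BYTES].decode("utf-8", errors="replace"),
        "duration_ms": int((time.monotonic() - started) * 1000),
        # Killed or truncated output must not be treated as complete.
        "partial": error is not None or truncated,
    }
```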

Capturing and returning results to the agent loop

The orchestrator’s contract with the model is structured observations, not raw terminal dumps.

  1. Capture streams — stdout, stderr, exit code, duration_ms, memory_peak_mb.
  2. Artifact manifest — list files produced (plots, CSVs) with MIME type, size, and storage URI.
  3. Redaction pass — scan output for JWTs, AWS keys, connection strings; replace with [REDACTED] and alert if matches exceed threshold.
  4. Compression — apply observation budgets before the next model turn; summarize huge tables inline.
  5. Checkpoint side effects — record execution ID in the WAL if durable runs must survive worker restarts.

Never pass sandbox filesystem paths that resolve on the host into the model context. Use opaque artifact IDs the download tool understands.
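
A redaction pass and an observation builder might look like the sketch below. The three patterns (AWS access key ID shape, JWT-like three-part token, URL with embedded credentials) are illustrative and far from exhaustive, and the field names are assumptions rather than a fixed schema.

```python
import re
import uuid

SECRET_PATTERNS = [
    re.compile(r"\bAKIA[0-9A-Z]{16}\b"),              # AWS access key ID shape
    re.compile(r"\beyJ[\w-]+\.[\w-]+\.[\w-]+"),       # JWT-like three-part token
    re.compile(r"\b[a-z][a-z0-9+]*://[^\s:/@]+:[^\s@]+@\S+"),  # URL with user:password
]


def redact(text):
    """Replace secret-looking spans with [REDACTED]; return (text, match_count)."""
    hits = 0
    for pattern in SECRET_PATTERNS:
        text, n = pattern.subn("[REDACTED]", text)
        hits += n
    return text, hits


def build_observation(stdout, stderr, exit_code, duration_ms, artifacts):
    """Assemble the structured observation handed back to the agent loop."""
    clean_out, out_hits = redact(stdout)
    clean_err, err_hits = redact(stderr)
    manifest = [
        {
            # Opaque ID only: the sandbox path in a["path"] is deliberately dropped.
            "artifact_id": uuid.uuid4().hex,
            "mime_type": a["mime_type"],
            "size_bytes": a["size_bytes"],
        }
        for a in artifacts
    ]
    return {
        "stdout": clean_out,
        "stderr": clean_err,
        "exit_code": exit_code,
        "duration_ms": duration_ms,
        "artifacts": manifest,
        "redactions": out_hits + err_hits,  # alert when this exceeds a threshold
    }
```

The orchestrator would keep its own mapping from artifact_id to storage URI so the download tool can resolve IDs without the model ever seeing a path.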

Human gates and high-risk operations

Some operations should never run without explicit approval even inside a sandbox: installing new packages, enabling network egress, reading user-uploaded archives, or executing more than K lines of generated code per run. Wire these to approval gates and guardrail layers that inspect the code string before launch. Static analysis (AST deny lists for eval, __import__, and open() on paths under /etc) catches obvious exfil patterns; it is not a substitute for kernel-level isolation.
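
A pre-launch AST check can be sketched with the standard ast module. The deny sets are assumptions (exec and the extra path prefixes are additions for illustration), and a check like this is easy to evade through aliasing or getattr, which is why it only supplements isolation.

```python
import ast

DENIED_CALLS = {"eval", "exec", "__import__"}        # illustrative deny list
DENIED_PATH_PREFIXES = ("/etc", "/proc", "/root")    # illustrative prefixes


def find_violations(source):
    """Return human-readable findings for denied calls, ordered by line."""
    findings = []
    tree = ast.parse(source)  # raises SyntaxError on unparseable code
    for node in ast.walk(tree):
        # Only direct calls by bare name are detected, e.g. eval(...), open(...).
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)):
            continue
        name = node.func.id
        if name in DENIED_CALLS:
            findings.append((node.lineno, f"call to {name}()"))
        elif name == "open" and node.args:
            first = node.args[0]
            if (
                isinstance(first, ast.Constant)
                and isinstance(first.value, str)
                and first.value.startswith(DENIED_PATH_PREFIXES)
            ):
                findings.append((node.lineno, f"open() on {first.value}"))
    return [f"line {lineno}: {msg}" for lineno, msg in sorted(findings)]
```

An approval gate could block the launch or route it to a human whenever the returned list is non-empty.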

Harbor Analytics refactor walkthrough

Harbor replaced in-container exec() with a SandboxExecutionService:

  1. Executor pool — Firecracker microVMs with 512 MB RAM, 60 s wall clock, deny-all egress by default.
  2. Staging broker — user CSVs copied into VM workspace via one-way virtio-fs; no reverse mount.
  3. Secret broker — warehouse queries run via scoped HTTP proxy outside the VM; sandbox never sees DB passwords.
  4. Output sanitizer — regex + entropy scanner on stdout; blocks observations containing high-confidence secrets.
  5. Telemetry — per-run spans for boot, execute, redact, teardown linked to agent traces.
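
The source does not describe how the entropy scanner in step 4 works. As a generic illustration of the idea, and not Harbor's implementation, the sketch below flags long tokens whose Shannon entropy exceeds a threshold; the token pattern and the 4.0 bits-per-character threshold are assumptions that would need tuning against false positives.

```python
import math
import re
from collections import Counter

# Candidate tokens: 20+ characters from a base64/hex-like alphabet (assumption).
TOKEN_RE = re.compile(r"[A-Za-z0-9+/_\-]{20,}")


def shannon_entropy(token):
    """Bits per character of the token's empirical character distribution."""
    counts = Counter(token)
    total = len(token)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def high_entropy_tokens(text, threshold=4.0):
    """Return long tokens that look random enough to be keys or passwords."""
    return [t for t in TOKEN_RE.findall(text) if shannon_entropy(t) >= threshold]
```

Combining a scan like this with format-specific regexes reduces reliance on either signal alone.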

Results after ten weeks: runs touching sensitive host paths or env vars fell from 19% to 0.3% (the remainder were false positives from sanitizer tuning), median execution latency increased by only 380 ms thanks to VM pooling, and security sign-off unlocked external customer onboarding. Incident-driven credential rotations dropped to zero.

Technique decision table

Scenario | Prefer | Avoid
Multi-tenant SaaS code interpreter | MicroVM or gVisor + deny egress | Host subprocess with env inherit
Simple math / JSON transforms | WASM or restricted DSL | Full Python per call
Low-latency internal-only tools | Warm containers + strict seccomp | Shared worker interpreter
Needs pip install at runtime | Pre-baked images + internal mirror | Open internet from sandbox
Produces large plots/files | Artifact store + manifest observation | Base64 in stdout
Regulated data | Single-tenant VMs + encryption at rest | Shared tmpfs across customers
Long-running notebooks | Session VMs with idle TTL + checkpoint | Unbounded persistent shells

Common pitfalls

  • Secrets in orchestrator env — inherited by child process even in “sandboxed” containers.
  • Reused sandboxes without reset — cross-run file and memory leaks between tenants.
  • Egress allowlist too broad: *.amazonaws.com becomes an exfil tunnel.
  • Trusting partial stdout — model acts on truncated or mid-crash output.
  • No output redaction — user sees secrets the model echoed from generated code.
  • Sandbox-only security — skipping pre-exec code review and approval for write paths.
  • Unbounded artifacts — disk fill from agent-generated multi-GB dumps.

Production checklist

  • Run untrusted code only in a dedicated microVM/container tier, never on the API host.
  • Default deny network egress; allowlist per tool with audit logs.
  • Ephemeral workspace per run; no cross-tenant filesystem reuse.
  • Enforce wall-clock, CPU, memory, and stdout size limits.
  • Inject secrets via external brokers; zero env vars in sandbox.
  • Redact stdout/stderr for tokens, keys, and connection strings.
  • Return structured observations with exit code, duration, artifact manifest.
  • Pool warm instances; hard-destroy after each execution.
  • Gate pip install, egress, and large code blobs behind human approval.
  • Load-test fork bombs and infinite loops; verify orchestrator stays healthy.

Key takeaways

  • Code execution is a trust boundary — treat every model-generated program as hostile.
  • Isolation tier matches blast radius — microVMs for multi-tenant; WASM for tiny transforms.
  • Policy beats prompt engineering — deny egress and secrets at the runtime, not in instructions.
  • Observations must be sanitized before the next model turn.
  • Harbor Analytics cut sensitive-path touches from 19% to 0.3% with microVM sandboxes and a secret broker.
