LLM agent sandbox code execution systems explained
Harbor Analytics shipped a data-science agent that could write and run
Python to clean CSVs, plot trends, and join warehouse tables. The first
version executed code in the same container as the API — with
access to environment variables, the host filesystem, and outbound
network. A user prompt that asked the model to “debug the
connection string” caused generated code to print
os.environ into the tool observation. Security review found
19% of production runs in the first month touched
sensitive paths or env vars. Two incidents escalated to full credential
rotation. Platform engineering rebuilt execution behind a dedicated
sandbox tier before the product could scale beyond internal teams.
Sandbox code execution for LLM agents means running model-authored programs in an isolated runtime with explicit resource ceilings, no ambient secrets, and controlled egress — then returning only sanitized stdout, stderr, and structured artifacts to the agent loop. This guide covers isolation tiers, policy surfaces, output capture integration with observation budgets and approval gates, the Harbor Analytics refactor, a technique decision table, pitfalls, and a production checklist.
Why agents need a separate execution plane
Code-interpreter tools are not “just another HTTP call.” The model emits arbitrary logic that runs with the privileges of whatever process hosts it. Without a boundary, a single successful prompt injection or hallucinated import can:
- Read API keys from environment variables or mounted secrets.
- Scan internal networks (169.254.169.254 metadata, Redis, Postgres on localhost).
- Exfiltrate user uploads or neighboring tenants’ files on shared disks.
- Consume unbounded CPU/RAM and stall the entire worker fleet.
Treat execution as a separate trust domain from the orchestrator. The orchestrator holds long-lived credentials; the sandbox receives only scoped, short-lived tokens for explicitly allowed operations (see credential injection). Treat every run as already compromised: even benign user intent can produce destructive code when the model misjudges scope.
Isolation tiers: from process jails to microVMs
Pick isolation depth from blast radius, startup latency, and cost per invocation. Most production stacks use different tiers for different tool classes.
Lightweight: subprocess + seccomp + namespaces
Fork a child with dropped capabilities, no_new_privileges,
read-only root, and a tmpfs workspace. Fast cold start (tens of ms)
but kernel syscall surface remains. Suitable only for trusted tenants
or heavily restricted DSLs, not general Python.
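As a rough illustration of the lightweight tier, the sketch below launches a child interpreter with an empty environment, a throwaway working directory, and POSIX rlimits. The function name and the limit values are assumptions chosen for the example. It does not apply seccomp or namespaces, so it shows only the "no ambient secrets" and resource-cap parts of this tier, not a real isolation boundary.

```python
import resource
import subprocess
import sys
import tempfile


def _apply_limits() -> None:
    # Runs in the child just before exec (POSIX only).
    resource.setrlimit(resource.RLIMIT_CPU, (5, 5))              # CPU seconds
    resource.setrlimit(resource.RLIMIT_AS, (1024 ** 3, 1024 ** 3))  # address space, 1 GiB


def run_restricted(code: str, timeout_s: float = 10.0) -> subprocess.CompletedProcess:
    """Run `code` in a child interpreter that inherits no environment variables."""
    with tempfile.TemporaryDirectory() as workspace:
        return subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores PYTHON* vars
            env={},                   # nothing inherited from the orchestrator
            cwd=workspace,            # scratch directory removed after the run
            preexec_fn=_apply_limits,  # not safe in multi-threaded parents
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
```

A caller would inspect `returncode`, `stdout`, and `stderr` on the result. Because the kernel syscall surface is still fully exposed, this pattern fits the "trusted tenants only" caveat above.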
Container + gVisor / Kata
OCI image per language runtime. gVisor intercepts syscalls in a user-space kernel, while Kata Containers runs each container inside a lightweight VM; both give stronger kernel isolation than vanilla Docker. Startup is typically 200–800 ms with a warm image cache. A good default for multi-tenant SaaS agents at moderate scale.
MicroVM (Firecracker, Cloud Hypervisor)
One lightweight VM per execution with minimal device model. Strongest isolation among common options; cold start 300 ms–2 s depending on snapshot pooling. Prefer for untrusted code, crypto-adjacent workloads, or when compliance demands hardware-enforced boundaries.
WASM runtimes (Wasmtime, Wasmer)
Compile or interpret sandboxed bytecode with linear memory and no
ambient I/O unless host functions grant it. Excellent for small
transforms and formula evaluation; limited for arbitrary
pandas/numpy stacks unless you ship large
WASI bundles.
Pool warm sandboxes to hide cold-start latency: maintain N idle microVMs or containers, assign on tool invoke, destroy or hard-reset after each run. Never reuse a dirty filesystem across tenants without verified wipe.
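The pooling rule above (assign on invoke, destroy after each run, backfill with a fresh instance) can be sketched in a few lines. `Sandbox` here is a hypothetical stand-in for a microVM or container handle, and a real pool would boot instances asynchronously and handle failures.

```python
from collections import deque
from dataclasses import dataclass, field
from itertools import count

_ids = count(1)


@dataclass
class Sandbox:
    """Stand-in for a microVM or container handle (hypothetical)."""
    sandbox_id: int = field(default_factory=lambda: next(_ids))
    destroyed: bool = False

    def destroy(self) -> None:
        self.destroyed = True


class WarmPool:
    def __init__(self, size: int) -> None:
        self._size = size
        self._idle = deque(Sandbox() for _ in range(size))

    def acquire(self) -> Sandbox:
        # Hand out a pre-booted instance if one is idle, else boot on demand.
        return self._idle.popleft() if self._idle else Sandbox()

    def release(self, sandbox: Sandbox) -> None:
        # A used instance never re-enters the pool: destroy it and backfill.
        sandbox.destroy()
        if len(self._idle) < self._size:
            self._idle.append(Sandbox())
```

The key design choice is that `release` never recycles the instance it receives, which removes the cross-tenant reuse risk at the cost of one boot per run.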
Policy surfaces: network, filesystem, and syscalls
Isolation without policy still leaks data. Enforce defaults at the infrastructure layer, not in generated code.
Network egress
Default deny-all outbound. Allowlist only domains the
tool needs (e.g. pypi.org if pip install is enabled, or
none if dependencies are pre-baked). Block link-local, RFC1918, and
cloud metadata IPs. Log every allowed connection with run ID for
forensics.
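A minimal sketch of the deny-by-default decision, assuming an egress proxy that sees both the requested hostname and the IP it resolved to. The allowlist contents and function names are illustrative. Real enforcement belongs in the network layer (firewall rules or a proxy), with a check like this as the policy logic behind it.

```python
import ipaddress

ALLOWED_HOSTS = {"pypi.org", "files.pythonhosted.org"}  # illustrative allowlist


def is_blocked_ip(ip: str) -> bool:
    addr = ipaddress.ip_address(ip)
    return (
        addr.is_private        # RFC1918 and other private ranges
        or addr.is_link_local  # 169.254.0.0/16, where cloud metadata lives
        or addr.is_loopback
        or addr.is_multicast
        or addr.is_unspecified
    )


def egress_allowed(host: str, resolved_ip: str) -> bool:
    """Deny by default; allow only allowlisted hosts that resolve to public IPs."""
    return host.lower() in ALLOWED_HOSTS and not is_blocked_ip(resolved_ip)
```

Checking the resolved address as well as the hostname matters because an allowlisted name can be pointed at an internal IP.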
Filesystem
Mount a fresh ephemeral workspace per run. Inject user files read-only
at known paths; write outputs to a quota-bounded directory. Hide
/proc secrets, deny mount operations, and block access to
host paths even via ../../../ tricks.
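On the orchestrator side, any path the agent supplies when staging or collecting files can be resolved and confined before use. The helper below is a sketch with a hypothetical name; it complements, and does not replace, mount-level isolation inside the sandbox.

```python
from pathlib import Path


def resolve_in_workspace(workspace: Path, requested: str) -> Path:
    """Return the resolved path, or raise if it escapes the workspace."""
    root = workspace.resolve()
    # resolve() collapses ../ segments and follows symlinks before the check.
    candidate = (root / requested).resolve()
    if candidate != root and root not in candidate.parents:
        raise PermissionError(f"path escapes workspace: {requested!r}")
    return candidate
```

Both relative traversal (`../../../etc/passwd`) and absolute paths (`/etc/passwd`) are rejected, because joining an absolute path onto `root` discards `root` entirely.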
Syscalls and capabilities
Use seccomp profiles or runtime-specific syscall filters. Drop
CAP_SYS_ADMIN, CAP_NET_RAW, and CAP_SYS_PTRACE, and filter the ptrace
syscall itself. Disable execve chains if your threat model forbids
subprocess spawning from user code.
Language/runtime hardening
Pin interpreter versions, disable dangerous modules where possible
(subprocess, raw sockets), and pre-install dependencies
so production runs never need live pip install from the
open internet.
Resource limits and kill semantics
Models write infinite loops. Sandboxes must enforce hard ceilings and return structured timeout errors the agent can reason about (see tool error envelopes).
- Wall-clock timeout — 30–120 s typical for interactive agents; shorter for calculator-style tools.
- CPU quota — cgroup limit so one run cannot saturate a host.
- Memory cap — OOM kill the sandbox, not the orchestrator; return MemoryLimitExceeded observation.
- Output size cap — truncate stdout beyond N KB and spill large artifacts to object storage with signed URLs.
- Concurrency per tenant — queue excess runs; integrate with rate limiting.
On timeout or OOM, destroy the sandbox instance. Partial stdout from a
killed run should not be trusted as complete — mark observations
partial: true so the planner can retry with a smaller
dataset or different approach.
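The kill semantics above can be illustrated with a wall-clock timeout that turns into a structured observation. This is a sketch only: the error name `WallClockTimeout`, the field names, and the 8 KB cap are assumptions, and a child process stands in for the sandbox instance.

```python
import subprocess
import sys
import time

MAX_STDOUT_BYTES = 8 * 1024  # illustrative output cap


def run_with_timeout(code: str, timeout_s: float) -> dict:
    """Run code in a child interpreter and return a structured observation."""
    start = time.monotonic()
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            timeout=timeout_s,
        )
        stdout, exit_code, error, partial = proc.stdout, proc.returncode, None, False
    except subprocess.TimeoutExpired as exc:
        # subprocess.run kills the child on timeout; captured output is incomplete.
        stdout, exit_code, error, partial = exc.stdout or b"", None, "WallClockTimeout", True
    truncated = len(stdout) > MAX_STDOUT_BYTES
    return {
        "stdout": stdout[:MAX_STDOUT_BYTES].decode("utf-8", errors="replace"),
        "exit_code": exit_code,
        "error": error,
        "partial": partial or truncated,  # killed or truncated output is never complete
        "duration_ms": int((time.monotonic() - start) * 1000),
    }
```

Marking truncated output as partial as well as killed output gives the planner one flag to check before acting on the result.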
Capturing and returning results to the agent loop
The orchestrator’s contract with the model is structured observations, not raw terminal dumps.
- Capture streams — stdout, stderr, exit code, duration_ms, memory_peak_mb.
- Artifact manifest — list files produced (plots, CSVs) with MIME type, size, and storage URI.
- Redaction pass — scan output for JWTs, AWS keys, connection strings; replace with [REDACTED] and alert if matches exceed threshold.
- Compression — apply observation budgets before the next model turn; summarize huge tables inline.
- Checkpoint side effects — record execution ID in the WAL if durable runs must survive worker restarts.
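The redaction pass in the list above might look like the following sketch. The three patterns are illustrative approximations of an AWS access key ID, a JWT, and a URL-style connection string with embedded credentials. A production scanner would need a broader, tested rule set.

```python
import re

# Illustrative patterns only; not an exhaustive secret detector.
SECRET_PATTERNS = [
    re.compile(r"\bAKIA[0-9A-Z]{16}\b"),                            # AWS access key ID shape
    re.compile(r"\beyJ[\w-]+\.[\w-]+\.[\w-]+"),                     # JWT-like three-part token
    re.compile(r"\b[a-z][a-z0-9+.-]*://[^\s:/@]+:[^\s@]+@[^\s]+"),  # scheme://user:pass@host
]


def redact(text: str) -> tuple[str, int]:
    """Replace secret-looking substrings with [REDACTED]; return text and match count."""
    total = 0
    for pattern in SECRET_PATTERNS:
        text, n = pattern.subn("[REDACTED]", text)
        total += n
    return text, total
```

The returned match count is what an alerting rule would compare against its threshold.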
Never pass sandbox filesystem paths that resolve on the host into the model context. Use opaque artifact IDs the download tool understands.
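One way to keep host paths out of the model context is to build the manifest with random IDs and keep the ID-to-path mapping on the orchestrator side. The sketch below assumes the run's outputs have already been copied to a directory the orchestrator controls; the function and field names are hypothetical.

```python
import mimetypes
import uuid
from pathlib import Path


def build_manifest(output_dir: Path) -> tuple[list[dict], dict[str, Path]]:
    """Return (model-facing manifest, private artifact_id -> path index)."""
    manifest, index = [], {}
    for path in sorted(p for p in output_dir.rglob("*") if p.is_file()):
        artifact_id = f"art_{uuid.uuid4().hex}"
        mime, _ = mimetypes.guess_type(path.name)
        manifest.append({
            "artifact_id": artifact_id,
            "name": path.name,  # file name only, never the directory
            "mime_type": mime or "application/octet-stream",
            "size_bytes": path.stat().st_size,
        })
        index[artifact_id] = path  # never serialized into the observation
    return manifest, index
```

Only `manifest` goes into the observation; the download tool would look up `index` when the agent asks for an artifact by ID.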
Human gates and high-risk operations
Some operations should never run without explicit approval even inside
a sandbox: installing new packages, enabling network egress, reading
user-uploaded archives, or executing more than K lines of generated
code per run. Wire these to approval gates and guardrail layers
that inspect the code string before launch. Static analysis
(AST deny lists for eval, __import__,
open('/etc) catches obvious exfil patterns; it is not a
substitute for kernel-level isolation.
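A pre-launch AST deny list of the kind described above can be sketched with the standard `ast` module. The denied names and modules are illustrative assumptions, and this style of check is easy to evade (for example through `getattr` or string building), which is why it only supplements isolation.

```python
import ast

DENIED_CALLS = {"eval", "exec", "__import__", "compile"}  # illustrative
DENIED_MODULES = {"subprocess", "socket", "ctypes"}       # illustrative


def find_violations(source: str) -> list[str]:
    """Return findings; an empty list means nothing obvious was found."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            name = node.func.id
            first = node.args[0] if node.args else None
            if name in DENIED_CALLS:
                findings.append(f"line {node.lineno}: call to {name}()")
            elif (name == "open" and isinstance(first, ast.Constant)
                  and isinstance(first.value, str) and first.value.startswith("/etc")):
                findings.append(f"line {node.lineno}: open() on {first.value}")
        elif isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in DENIED_MODULES:
                    findings.append(f"line {node.lineno}: import {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] in DENIED_MODULES:
                findings.append(f"line {node.lineno}: from {node.module} import ...")
    return findings
```

A non-empty result would route the run to an approval gate rather than reject it outright, since deny lists also produce false positives.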
Harbor Analytics refactor walkthrough
Harbor replaced in-container exec() with a
SandboxExecutionService:
- Executor pool — Firecracker microVMs with 512 MB RAM, 60 s wall clock, deny-all egress by default.
- Staging broker — user CSVs copied into VM workspace via one-way virtio-fs; no reverse mount.
- Secret broker — warehouse queries run via scoped HTTP proxy outside the VM; sandbox never sees DB passwords.
- Output sanitizer — regex + entropy scanner on stdout; blocks observations containing high-confidence secrets.
- Telemetry — per-run spans for boot, execute, redact, teardown linked to agent traces.
Results after ten weeks: runs touching sensitive host paths fell from 19% to 0.3% (the remainder were false positives from sanitizer tuning), median execution latency increased by only 380 ms thanks to VM pooling, and security sign-off unlocked external customer onboarding. Repeat incident-driven credential rotations dropped to zero.
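The source does not describe how Harbor's entropy scanner works. As a generic illustration of the idea, the sketch below estimates Shannon entropy per character and flags long, high-entropy tokens. The length and threshold values are assumptions that would need tuning, which fits the false positives mentioned above.

```python
import math
from collections import Counter


def shannon_entropy(token: str) -> float:
    """Bits of entropy per character, estimated from character frequencies."""
    if not token:
        return 0.0
    n = len(token)
    return -sum((c / n) * math.log2(c / n) for c in Counter(token).values())


def looks_like_secret(token: str, min_len: int = 24, threshold: float = 4.5) -> bool:
    # Thresholds are illustrative assumptions, not values from the Harbor system.
    return len(token) >= min_len and shannon_entropy(token) >= threshold
```

Frequency-based entropy is a weak signal on short strings, so it is usually paired with pattern matching rather than used alone.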
Technique decision table
| Scenario | Prefer | Avoid |
|---|---|---|
| Multi-tenant SaaS code interpreter | MicroVM or gVisor + deny egress | Host subprocess with env inherit |
| Simple math / JSON transforms | WASM or restricted DSL | Full Python per call |
| Low-latency internal-only tools | Warm containers + strict seccomp | Shared worker interpreter |
| Needs pip install at runtime | Pre-baked images + internal mirror | Open internet from sandbox |
| Produces large plots/files | Artifact store + manifest observation | Base64 in stdout |
| Regulated data | Single-tenant VMs + encryption at rest | Shared tmpfs across customers |
| Long-running notebooks | Session VMs with idle TTL + checkpoint | Unbounded persistent shells |
Common pitfalls
- Secrets in orchestrator env — inherited by child process even in “sandboxed” containers.
- Reused sandboxes without reset — cross-run file and memory leaks between tenants.
- Egress allowlist too broad — *.amazonaws.com becomes exfil tunnel.
- Trusting partial stdout — model acts on truncated or mid-crash output.
- No output redaction — user sees secrets the model echoed from generated code.
- Sandbox-only security — skipping pre-exec code review and approval for write paths.
- Unbounded artifacts — disk fill from agent-generated multi-GB dumps.
Production checklist
- Run untrusted code only in a dedicated microVM/container tier, never on the API host.
- Default deny network egress; allowlist per tool with audit logs.
- Ephemeral workspace per run; no cross-tenant filesystem reuse.
- Enforce wall-clock, CPU, memory, and stdout size limits.
- Inject secrets via external brokers; zero env vars in sandbox.
- Redact stdout/stderr for tokens, keys, and connection strings.
- Return structured observations with exit code, duration, artifact manifest.
- Pool warm instances; hard-destroy after each execution.
- Gate pip install, egress, and large code blobs behind human approval.
- Load-test fork bombs and infinite loops; verify orchestrator stays healthy.
Key takeaways
- Code execution is a trust boundary — treat every model-generated program as hostile.
- Isolation tier matches blast radius — microVMs for multi-tenant; WASM for tiny transforms.
- Policy beats prompt engineering — deny egress and secrets at the runtime, not in instructions.
- Observations must be sanitized before the next model turn.
- Harbor Analytics cut sensitive-path touches from 19% to 0.3% with microVM sandboxes and a secret broker.
Related reading
- LLM agent secrets and credential injection explained — vault brokers and least-privilege tool access
- LLM agent permission scoping and tool approval gates explained — policy before dangerous tools run
- LLM tool error handling explained — structured timeout and OOM observations
- LLM agent guardrails and output validation explained — pre-exec code inspection layers