Guide

LLM sandbox execution explained

Harbor Analytics' portfolio stress-test copilot was built on program-aided language (PAL): the model writes Python, the runtime executes it, and numeric results flow back into the answer. During a Tuesday morning demo, the model generated a harmless-looking loop over bond holdings, then imported os, read os.environ, and printed database connection strings into the chat transcript. The code ran in the same process as the API server. Nothing in the prompt said “do not read environment variables”; the model simply found a path that worked.

Sandbox execution is the infrastructure layer that runs model-generated code in an isolated environment with bounded CPU, memory, time, and network access. It is not optional for any production feature that executes LLM output as code, whether PAL math, data transforms, chart generation, or agent tool scripts. After Harbor moved execution into ephemeral containers with deny-by-default egress and a curated import allowlist, credential leaks dropped to zero and runaway loops stopped taking down the API tier. This guide covers sandbox taxonomy, resource governors, filesystem and network policy, output capture, the Harbor Analytics refactor, a decision table comparing sandboxing techniques with in-process eval and function calling, common pitfalls, and a production checklist.

Why sandboxing is non-negotiable

LLM-generated code is untrusted input in the same threat class as user-uploaded files or arbitrary HTTP payloads. Models hallucinate imports, follow adversarial instructions embedded in retrieved documents, and occasionally produce infinite loops or memory bombs. Running that code in your application process gives it the same privileges as your server: environment secrets, local filesystem, internal network routes, and sibling customer data.

Production teams adopt sandboxes for four reasons:

  • Confidentiality — block reads of secrets, metadata services, and neighbor tenant data.
  • Integrity — prevent writes to production databases or config files even when the model calls open() with a plausible path.
  • Availability — cap CPU and memory so one bad snippet cannot starve the API tier; pair with agent loop termination for orchestration-level limits.
  • Compliance — demonstrate isolation boundaries for SOC 2 and customer security questionnaires.

Sandboxing does not replace prompt-injection defenses or output validation; it contains damage when those layers fail.

Sandbox taxonomy

Choose isolation depth based on language surface, latency budget, and threat model. Common patterns:

In-process interpreters with restricted builtins

Lightweight wrappers (e.g. RestrictedPython, custom AST walkers) strip dangerous builtins and block imports. Fast to spin up but brittle: new escape gadgets appear, and stdlib modules often expose indirect filesystem or network access. Acceptable only for demos or strictly numeric PAL with no I/O.

OS-level containers (Docker, gVisor, Firecracker microVMs)

Each execution gets a fresh container or microVM with read-only root filesystem, tmpfs workspace, cgroup CPU and memory limits, and optional seccomp profiles. Cold-start latency (hundreds of milliseconds to seconds) is the main cost; warm pools and snapshot-based runners amortize it. This is the default for multi-tenant SaaS code interpreters.
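As a rough illustration of the properties just listed, the sketch below assembles a `docker run` command line in Python. The image name `pal-runner:latest`, the `/workspace` scratch path, and the limit values are hypothetical placeholders, and the `runsc` runtime applies only on hosts where gVisor is installed. The function only builds the argument list; it does not start a container.

```python
from typing import List, Optional


def build_docker_argv(
    image: str = "pal-runner:latest",   # hypothetical image name
    memory: str = "512m",
    cpus: str = "1.0",
    pids_limit: int = 64,
    runtime: Optional[str] = None,      # e.g. "runsc" when gVisor is installed
) -> List[str]:
    """Assemble a `docker run` argv for one throwaway execution."""
    argv = [
        "docker", "run",
        "--rm",                               # remove the container on exit
        "-i",                                 # keep stdin open so code can be piped in
        "--network", "none",                  # no network beyond loopback
        "--read-only",                        # read-only root filesystem
        "--tmpfs", "/workspace:rw,size=64m",  # fixed-size scratch space
        "--workdir", "/workspace",
        "--memory", memory,                   # cgroup memory limit
        "--cpus", cpus,                       # cgroup CPU quota
        "--pids-limit", str(pids_limit),      # cap the number of processes
        "--cap-drop", "ALL",                  # drop all Linux capabilities
        "--security-opt", "no-new-privileges",
        "--user", "65534:65534",              # unprivileged uid:gid
    ]
    if runtime:
        argv += ["--runtime", runtime]
    # Inside the container, Python reads the program from stdin ("-")
    # in isolated mode ("-I").
    argv += [image, "python", "-I", "-"]
    return argv
```

An orchestrator could hand this list to `subprocess.run(argv, input=code, timeout=...)`. Seccomp profiles and warm pools are left out to keep the sketch short.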

WebAssembly (WASM) runtimes

Compile or interpret guest code inside a WASM VM with explicit host imports. Strong memory isolation and fast startup; limited stdlib unless you bundle one. Good for JavaScript, Rust, or Python subsets (Pyodide) when you control the guest toolchain.

Remote managed sandboxes (E2B, Modal, cloud code-runner APIs)

Outsource isolation to a vendor that maintains hardened images and scaling. Trade operational burden for vendor lock-in, data residency review, and per-minute billing. Often the fastest path when you are not ready to operate container pools yourself.

Language-specific jails (nsjail, bubblewrap, isolate)

Single-binary wrappers around unshare, namespaces, and rlimits. Lower overhead than full Docker for competitive-programming-style runners; you still maintain images and syscall policies.

Resource governors and kill switches

Every sandbox session needs hard envelopes enforced by the orchestrator, not by trusting the guest language runtime:

  • Wall-clock timeout — typically 5–30 seconds for interactive PAL; longer for batch analytics with explicit user consent.
  • CPU quota — cgroup cpu.max or equivalent; prevents tight loops from pinning a core.
  • Memory cap — OOM-kill the sandbox before the host thrashes; 256 MB–1 GB is common for pandas workloads.
  • Output size limit — truncate stdout/stderr beyond N kilobytes before it floods the agent context.
  • Process count — block fork bombs via pids.max or seccomp.
  • Filesystem quota — tmpfs with fixed size; no persistent volumes unless explicitly mounted read-only.

Log the kill reason (timeout, OOM, policy violation) together with the code hash and session ID to support tuning and incident response.
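The envelopes above can be sketched with the standard library alone. This is a minimal illustration, assuming a Linux host: it uses POSIX rlimits and a parent-side timeout in place of the cgroup controls a real runner would use, and it is not an isolation boundary by itself. The function name `run_snippet`, the returned field names, and the limit values are all hypothetical.

```python
import resource
import subprocess
import sys

MAX_OUTPUT_BYTES = 8 * 1024  # hypothetical cap on captured stdout/stderr


def _limit_child(cpu_seconds, memory_bytes):
    """Build a hook that applies rlimits in the child just before exec."""
    def apply():
        for limit, value in ((resource.RLIMIT_CPU, cpu_seconds),
                             (resource.RLIMIT_AS, memory_bytes)):
            _, hard = resource.getrlimit(limit)
            if hard != resource.RLIM_INFINITY:
                value = min(value, hard)  # cannot raise above the hard limit
            resource.setrlimit(limit, (value, value))
    return apply


def _clip(data):
    """Truncate captured bytes to the cap and decode them leniently."""
    data = data or b""
    return data[:MAX_OUTPUT_BYTES].decode("utf-8", "replace")


def run_snippet(code, timeout_s=5.0, cpu_seconds=5, memory_bytes=1 << 30):
    """Run code in a child interpreter; report output and why it stopped."""
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],
            capture_output=True,
            timeout=timeout_s,   # wall-clock limit, enforced by the parent
            env={},              # child inherits no environment variables
            preexec_fn=_limit_child(cpu_seconds, memory_bytes),
        )
    except subprocess.TimeoutExpired as exc:
        # subprocess.run has already killed the child at this point.
        return {"kill_reason": "timeout", "exit_code": None,
                "stdout": _clip(exc.stdout), "stderr": _clip(exc.stderr)}
    # A negative return code means the child was terminated by a signal,
    # for example after exhausting its CPU-time limit.
    reason = "signal" if proc.returncode < 0 else None
    return {"kill_reason": reason, "exit_code": proc.returncode,
            "stdout": _clip(proc.stdout), "stderr": _clip(proc.stderr)}
```

The key design point is that the parent process owns the clock and the output cap, so a misbehaving guest cannot opt out of either. Note that `preexec_fn` is documented as unsafe in multithreaded parents, which is one more reason to run this logic in a dedicated runner service.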

Filesystem and network policy

Default-deny is the only sustainable posture:

  • Filesystem — writable scratch directory only; no access to /etc or host mounts.
  • Network — block all egress by default, and always block the cloud metadata endpoint (169.254.169.254); open egress only when the use case requires package installs (prefer pre-baked images) or allowlisted HTTPS endpoints, routed through a sidecar proxy that strips cookies and internal headers.
  • Import allowlist — pre-install numpy, pandas, sympy in the image; reject import os, subprocess, socket, ctypes at static analysis time before execution.
  • Secrets — never inject API keys into the sandbox environment; pass only the minimal data the snippet needs as stdin or a read-only JSON file.

Static analysis (AST walk) catches obvious imports; runtime seccomp catches syscalls the analyzer missed. Use both.
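A minimal sketch of the static half, using Python's `ast` module, might look like the following. The allowlist mirrors the packages named above; the `BANNED_NAMES` set and the function name `find_violations` are illustrative assumptions. A check like this is easy to bypass with indirect lookups, which is why it is only a pre-filter in front of runtime enforcement.

```python
import ast

ALLOWED_IMPORTS = {"numpy", "pandas", "sympy"}
BANNED_NAMES = {"__import__", "eval", "exec"}  # illustrative, not exhaustive


def find_violations(source: str) -> list:
    """Return human-readable policy violations found in the source."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]

    violations = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Relative imports make no sense in a single-file snippet.
            modules = ["<relative import>"] if node.level else [node.module]
        elif isinstance(node, ast.Name) and node.id in BANNED_NAMES:
            violations.append(f"line {node.lineno}: use of '{node.id}' is not allowed")
            continue
        else:
            continue
        for module in modules:
            # Compare only the top-level package, so pandas.io.sql maps to pandas.
            if module.split(".")[0] not in ALLOWED_IMPORTS:
                violations.append(
                    f"line {node.lineno}: import of '{module}' is not allowlisted"
                )
    return violations
```

An empty list means the snippet may proceed to the sandbox; anything else can be returned to the model as a policy-violation observation without spending a container.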

Harbor Analytics PAL refactor (worked example)

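Harbor's stress-test copilot previously called exec() in the FastAPI worker with a 10-second signal.alarm timeout. This had four problems: generated code shared the worker's process memory, could read its environment variables, and faced no network block; and because Python runs signal handlers only between bytecode instructions, a C extension hung in native code could keep the SIGALRM handler from ever running.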

Refactor steps:

  1. Image hardening — Python 3.12 slim plus numpy, pandas, scipy only; no pip at runtime.
  2. Runner service — sidecar pool of gVisor-backed containers; one execution per container; destroy after each run.
  3. Pre-flight AST gate — reject imports outside allowlist and any __dunder__ attribute access on modules.
  4. Data injection — portfolio CSV written to /tmp/input.csv inside sandbox; no DB connection strings anywhere in the guest environment.
  5. Result envelope — capture stdout, stderr, exit code, and peak memory; return structured JSON to the ReAct loop as a tool observation.
  6. Observability — metric sandbox_kill_reason tagged by reason; alert on OOM rate > 2% of sessions.
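Steps 5 and 6 can be sketched as a small data structure plus a rate check. The field names, the `ok` flag, and the function names here are hypothetical and not taken from Harbor's implementation; only the 2% OOM threshold comes from the steps above.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass
from typing import Iterable, Optional


@dataclass
class SandboxResult:
    stdout: str
    stderr: str
    exit_code: Optional[int]            # None when the run was killed before exiting
    peak_memory_kb: Optional[int]
    kill_reason: Optional[str] = None   # "timeout", "oom", "policy_violation", or None


def to_observation(result: SandboxResult) -> str:
    """Serialize a result as the JSON string handed back to the agent loop."""
    payload = asdict(result)
    # A single boolean lets the agent branch without parsing stderr.
    payload["ok"] = result.exit_code == 0 and result.kill_reason is None
    return json.dumps(payload, sort_keys=True)


def oom_rate(kill_reasons: Iterable[Optional[str]]) -> float:
    """Fraction of sessions whose kill reason was "oom"."""
    reasons = list(kill_reasons)
    if not reasons:
        return 0.0
    return Counter(reasons)["oom"] / len(reasons)
```

For example, 3 OOM kills in 100 sessions gives a rate of 0.03, which is above the 0.02 alert threshold.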

Outcome: median execution latency rose by 180 ms (with a warm container pool), but the security review passed without exceptions and API-tier OOM incidents went to zero.

Technique decision table

Approach | Best for | Latency | Isolation strength | Ops burden
---------|----------|---------|--------------------|-----------
In-process eval / RestrictedPython | Local demos, trusted operators only | < 10 ms | Weak | Low
WASM (Pyodide, QuickJS) | Browser-side or edge math; limited stdlib | 50–200 ms | Strong memory; host imports define ceiling | Medium
Container per run (Docker/gVisor) | Multi-tenant SaaS PAL and data agents | 200 ms–2 s cold; < 300 ms warm | Strong | High
Managed sandbox API | Fast MVP, variable load | Vendor-dependent | Strong (vendor SLO) | Low (vendor)
No code execution (pure function calling) | Fixed operations with typed APIs | Lowest | Strongest (no arbitrary code) | Low

Prefer function calling when operations are enumerable; reach for sandboxes when the model must compose novel data transforms or statistical code you cannot pre-declare as tools.
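One way to picture this preference is a router that serves enumerable operations from a registry of typed functions and falls back to the sandbox only when the model supplies free-form code. The request shape, the registry, and both example operations are hypothetical.

```python
from typing import Any, Callable, Dict, Sequence


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def max_drawdown(prices: Sequence[float]) -> float:
    """Largest peak-to-trough decline, as a fraction of the peak."""
    peak, worst = prices[0], 0.0
    for price in prices:
        peak = max(peak, price)
        worst = max(worst, (peak - price) / peak)
    return worst


# Operations that can be declared ahead of time never need a sandbox.
TYPED_TOOLS: Dict[str, Callable[..., float]] = {
    "mean": mean,
    "max_drawdown": max_drawdown,
}


def route(request: Dict[str, Any]) -> Dict[str, Any]:
    """Dispatch to a typed tool when possible, else mark for sandboxing."""
    op = request.get("op")
    if op in TYPED_TOOLS:
        result = TYPED_TOOLS[op](**request.get("args", {}))
        return {"path": "function_call", "result": result}
    if "code" in request:
        # Novel code: hand off to the sandbox runner (not shown here).
        return {"path": "sandbox", "code": request["code"]}
    raise ValueError("request names no known operation and carries no code")
```

A request such as `{"op": "max_drawdown", "args": {"prices": [100, 120, 90, 110]}}` never touches a sandbox, while a request carrying a `code` field is the only kind that does.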

Common pitfalls

  • Trusting prompt rules alone — “Only use pandas” does not stop import os; enforce in AST and seccomp.
  • Reusing containers without reset — filesystem and global state leak between users; one container per tenant session minimum.
  • Mounting the Docker socket — guest code can escape to the host; use a dedicated runner service.
  • Allowing pip install at runtime — opens network egress and supply-chain risk; bake dependencies into images.
  • Returning raw tracebacks to users — may expose paths and library versions; sanitize and log server-side.
  • No concurrency cap — an agent loop can fork hundreds of sandboxes; queue with backpressure.
  • Ignoring side channels — timing and memory usage can leak information between sandboxes co-hosted under weak isolation; use microVMs for high-sensitivity workloads.
  • Sandbox without output validation — executed code can still produce malicious strings that become indirect injections in the next turn.
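For the concurrency pitfall in the list above, a rough sketch of a cap with backpressure using `asyncio` follows. The class name `SandboxGate` and its parameters are hypothetical, and a production queue would likely add per-tenant fairness and wait timeouts.

```python
import asyncio


class SandboxGate:
    """Cap concurrent sandboxes and reject work when the wait queue is full."""

    def __init__(self, max_running: int, max_waiting: int):
        self._slots = asyncio.Semaphore(max_running)
        self._max_waiting = max_waiting
        self._waiting = 0

    async def run(self, job):
        # Backpressure: fail fast instead of queueing without bound.
        if self._slots.locked() and self._waiting >= self._max_waiting:
            raise RuntimeError("sandbox queue full; retry later")
        self._waiting += 1
        try:
            await self._slots.acquire()
        finally:
            self._waiting -= 1
        try:
            return await job()   # job is a zero-argument coroutine function
        finally:
            self._slots.release()
```

With `max_running=2` and `max_waiting=1`, five simultaneous requests result in two running, one queued, and two rejected immediately, so an agent loop that spawns sandboxes in a burst gets an error it can handle rather than exhausting the host.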

Production checklist

  • Never execute model-generated code in the API process on multi-tenant paths.
  • Enforce wall-clock, CPU, memory, and output-size limits per execution.
  • Run static import analysis before launching the sandbox.
  • Use default-deny network egress; allowlist only when required.
  • Provide data via injected files or stdin, not live database handles.
  • Destroy or fully reset isolation boundary after each run.
  • Pool warm sandboxes to meet interactive latency SLOs.
  • Log kill reason, duration, and memory peak for every execution.
  • Rate-limit sandbox spawns per user and per agent session.
  • Sanitize stdout before inserting into LLM context or user-visible chat.
  • Red-team with escape attempts (env read, fork bomb, metadata curl) in CI.
  • Document data residency for managed sandbox vendors in your security packet.
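The sanitization item in the checklist can be illustrated with a small filter. This is a sketch under the assumption that the main concerns are terminal escape sequences, stray control characters, and absolute paths in tracebacks; it does nothing about instructions embedded in otherwise clean text, which still need separate output validation. The function name is hypothetical.

```python
import posixpath
import re

# CSI escape sequences such as color codes: ESC [ params intermediates final.
ANSI_ESCAPE = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")
# Control characters other than tab, newline, and carriage return.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")
# The file reference format used in Python tracebacks.
TRACEBACK_PATH = re.compile(r'File "([^"]+)"')


def sanitize_output(text: str) -> str:
    """Strip escapes and control characters; reduce traceback paths to basenames."""
    text = ANSI_ESCAPE.sub("", text)
    text = CONTROL_CHARS.sub("", text)
    return TRACEBACK_PATH.sub(
        lambda m: 'File "' + posixpath.basename(m.group(1)) + '"', text
    )
```

The full, unsanitized output would still be logged server-side for debugging; only the filtered version goes into the chat or the model context.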

Key takeaways

  • Model-generated code is untrusted input — treat it like running a user-uploaded binary, not like executing your own scripts.
  • Container or microVM isolation with default-deny network and import allowlists is the production baseline for PAL and code-interpreter agents.
  • Harbor Analytics eliminated credential leaks by moving PAL execution off the API tier into ephemeral gVisor containers with structured result envelopes.
  • Prefer typed function calling when operations are fixed; sandboxes earn their complexity when the model must compose novel analytics code.
  • Sandboxing contains blast radius — pair it with prompt-injection defenses, tool error handling, and loop termination for defense in depth.
