LLM sandbox execution explained
Harbor Analytics' portfolio stress-test copilot was built on
program-aided language (PAL):
the model writes Python, the runtime executes it, and numeric results flow back into
the answer. During a Tuesday morning demo, the model generated a harmless-looking loop
over bond holdings, then imported os, read
os.environ, and printed database connection strings into the chat
transcript. The code ran in the same process as the API server. Nothing in the
prompt said “do not read environment variables”; the model simply found
a path that worked.
Sandbox execution is the infrastructure layer that runs
model-generated code in an isolated environment with bounded CPU, memory, time, and
network access. It is not optional for any production feature that executes LLM
output as code — whether PAL math, data transforms, chart generation, or
agent tool scripts. After Harbor moved execution into ephemeral containers with
deny-by-default egress and a curated import allowlist, credential leaks dropped to
zero and runaway loops stopped taking down the API tier. This guide covers sandbox
taxonomy, resource governors, filesystem and network policy, output capture,
the Harbor Analytics refactor, a decision table comparing sandbox techniques with
in-process eval, pitfalls, and a production checklist.
Why sandboxing is non-negotiable
LLM-generated code is untrusted input in the same threat class as user-uploaded files or arbitrary HTTP payloads. Models hallucinate imports, follow adversarial instructions embedded in retrieved documents, and occasionally produce infinite loops or memory bombs. Running that code in your application process gives it the same privileges as your server: environment secrets, local filesystem, internal network routes, and sibling customer data.
Production teams adopt sandboxes for four reasons:
- Confidentiality — block reads of secrets, metadata services, and neighbor tenant data.
- Integrity — prevent writes to production databases or config files even when the model calls open() with a plausible path.
- Availability — cap CPU and memory so one bad snippet cannot starve the API tier; pair with agent loop termination for orchestration-level limits.
- Compliance — demonstrate isolation boundaries for SOC 2 and customer security questionnaires.
Sandboxing does not replace prompt-injection defenses or output validation; it contains damage when those layers fail.
Sandbox taxonomy
Choose isolation depth based on language surface, latency budget, and threat model. Common patterns:
In-process interpreters with restricted builtins
Lightweight wrappers (e.g. RestrictedPython, custom AST walkers) strip dangerous builtins and block imports. Fast to spin up but brittle: new escape gadgets appear, and stdlib modules often expose indirect filesystem or network access. Acceptable only for demos or strictly numeric PAL with no I/O.
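To make the brittleness concrete, here is a minimal sketch of a naive in-process "sandbox" that only empties the builtins. The function name run_with_empty_builtins is hypothetical, and this is not how RestrictedPython works internally; it only illustrates why stripping builtins alone leaves a large surface.

```python
def run_with_empty_builtins(source: str) -> dict:
    """Execute source with no builtins, as a naive in-process 'sandbox'."""
    scope = {"__builtins__": {}}
    exec(source, scope)
    return scope


# A direct import is blocked, because the import statement needs __import__.
try:
    run_with_empty_builtins("import os")
    import_blocked = False
except ImportError:
    import_blocked = True

# Attribute access on a plain literal still reaches every class loaded in the
# host process, which is the starting point for many known escape gadgets.
scope = run_with_empty_builtins(
    "classes = ().__class__.__base__.__subclasses__()"
)
reachable_classes = len(scope["classes"])

print(f"import blocked: {import_blocked}")
print(f"classes still reachable: {reachable_classes}")
```

The import is refused, yet the guest code can still enumerate the host's classes, so the wrapper has to block attribute access as well, and every new gadget needs a new rule.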
OS-level containers (Docker, gVisor, Firecracker microVMs)
Each execution gets a fresh container or microVM with read-only root filesystem, tmpfs workspace, cgroup CPU and memory limits, and optional seccomp profiles. Cold-start latency (hundreds of milliseconds to seconds) is the main cost; warm pools and snapshot-based runners amortize it. This is the default for multi-tenant SaaS code interpreters.
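As a rough illustration of those container settings, the sketch below builds a docker run command line in Python without executing it. The image name, the /data mount targets, and the specific limits are assumptions chosen for the example; the flags themselves are standard Docker CLI options, and the optional runtime value (for example runsc for gVisor) only works if that runtime is installed on the host.

```python
from typing import Optional


def build_docker_argv(
    image: str,
    input_csv: str,
    snippet_path: str,
    runtime: Optional[str] = None,
) -> list[str]:
    """Return a docker run argv for one throwaway execution (sketch only)."""
    argv = ["docker", "run", "--rm"]
    if runtime:
        argv += ["--runtime", runtime]  # e.g. "runsc" when gVisor is installed
    argv += [
        "--network", "none",                           # no egress at all
        "--read-only",                                 # read-only root filesystem
        "--tmpfs", "/tmp:rw,noexec,nosuid,size=64m",   # bounded scratch space
        "--memory", "512m",                            # cgroup memory cap
        "--cpus", "1",                                 # cgroup CPU quota
        "--pids-limit", "64",                          # fork-bomb guard
        "--cap-drop", "ALL",
        "--security-opt", "no-new-privileges",
        "--mount", f"type=bind,source={input_csv},target=/data/input.csv,readonly",
        "--mount", f"type=bind,source={snippet_path},target=/data/snippet.py,readonly",
        image,
        "python", "-I", "/data/snippet.py",
    ]
    return argv


argv = build_docker_argv("pal-runner:latest", "/srv/run1/in.csv", "/srv/run1/snippet.py")
print(" ".join(argv))
```

Building the argv as a list, rather than a shell string, keeps paths from being reinterpreted by a shell when the runner passes it to a process launcher.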
WebAssembly (WASM) runtimes
Compile or interpret guest code inside a WASM VM with explicit host imports. Strong memory isolation and fast startup; limited stdlib unless you bundle one. Good for JavaScript, Rust, or Python subsets (Pyodide) when you control the guest toolchain.
Remote managed sandboxes (E2B, Modal, cloud code-runner APIs)
Outsource isolation to a vendor that maintains hardened images and scaling. Trade operational burden for vendor lock-in, data residency review, and per-minute billing. Often the fastest path when you are not ready to operate container pools yourself.
Language-specific jails (nsjail, bubblewrap, isolate)
Single-binary wrappers around unshare, namespaces, and rlimits.
Lower overhead than full Docker for competitive-programming-style runners; you still
maintain images and syscall policies.
Resource governors and kill switches
Every sandbox session needs hard envelopes enforced by the orchestrator, not by trusting the guest language runtime:
- Wall-clock timeout — typically 5–30 seconds for interactive PAL; longer for batch analytics with explicit user consent.
- CPU quota — cgroup cpu.max or equivalent; prevents tight loops from pinning a core.
- Memory cap — OOM-kill the sandbox before the host thrashes; 256 MB–1 GB is common for pandas workloads.
- Output size limit — truncate stdout/stderr beyond N kilobytes before it floods the agent context.
- Process count — block fork bombs via pids.max or seccomp.
- Filesystem quota — tmpfs with fixed size; no persistent volumes unless explicitly mounted read-only.
Log kill reason (timeout, OOM, policy violation) with the code hash and session ID for tuning and incident response.
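The envelopes above can be sketched with the standard library alone, assuming a Linux host. This is a simplified stand-in for cgroup enforcement, not a replacement for it: it uses kernel rlimits plus a parent-side wall-clock timeout, and the function name run_with_limits, the 4096-byte output cap, and the record fields are all hypothetical.

```python
import hashlib
import resource
import subprocess
import sys

MAX_OUTPUT_BYTES = 4096  # assumed output cap; tune per product


def _cap(kind: int, value: int) -> None:
    """Set soft and hard limits to value, never above the inherited hard limit."""
    _, hard = resource.getrlimit(kind)
    if hard != resource.RLIM_INFINITY:
        value = min(value, hard)
    resource.setrlimit(kind, (value, value))


def _apply_rlimits() -> None:
    # Runs in the child between fork and exec, so the kernel enforces the caps.
    _cap(resource.RLIMIT_CPU, 5)                  # CPU seconds
    _cap(resource.RLIMIT_AS, 512 * 1024 * 1024)   # address space in bytes


def run_with_limits(code: str, wall_seconds: float = 5.0) -> dict:
    """Run code in a child interpreter and return a loggable record."""
    record = {"code_hash": hashlib.sha256(code.encode()).hexdigest()[:12]}
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],
            capture_output=True,
            timeout=wall_seconds,        # the parent kills the child on expiry
            preexec_fn=_apply_rlimits,
            env={},                      # nothing inherited from the server
        )
    except subprocess.TimeoutExpired:
        record.update(kill_reason="timeout", exit_code=None, stdout="")
        return record
    # A negative return code means the child died from that signal number.
    kill_reason = f"signal_{-proc.returncode}" if proc.returncode < 0 else None
    record.update(
        kill_reason=kill_reason,
        exit_code=proc.returncode,
        stdout=proc.stdout[:MAX_OUTPUT_BYTES].decode("utf-8", "replace"),
    )
    return record
```

Note that capture_output still buffers the full output in the parent before truncating, so a production runner would stream and cut off at the cap instead. The returned record carries the code hash and kill reason, which is the minimum needed for the logging described above; a session ID would be added by the caller.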
Filesystem and network policy
Default-deny is the only sustainable posture:
- Filesystem — writable scratch directory only; no access to /etc or host mounts.
- Network — block all egress, including the cloud metadata endpoint (169.254.169.254), unless the use case requires package install (prefer pre-baked images) or allowlisted HTTPS endpoints via a sidecar proxy that strips cookies and internal headers.
- Import allowlist — pre-install numpy, pandas, sympy in the image; reject imports of os, subprocess, socket, and ctypes at static analysis time before execution.
- Secrets — never inject API keys into the sandbox environment; pass only the minimal data the snippet needs as stdin or a read-only JSON file.
Static analysis (AST walk) catches obvious imports; runtime seccomp catches syscalls the analyzer missed. Use both.
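A minimal sketch of such an AST gate, using Python's ast module, might look like the following. The allowlist mirrors the three packages named above; the function name find_violations and the message format are hypothetical.

```python
import ast

ALLOWED_IMPORTS = {"numpy", "pandas", "sympy"}


def find_violations(source: str) -> list[str]:
    """Return human-readable policy violations; an empty list means pass."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]

    violations = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Relative imports (level > 0) are never allowed in a snippet.
            modules = ["<relative>"] if node.level else [node.module]
        else:
            modules = []
        for module in modules:
            # Compare the top-level package, so numpy.linalg passes via numpy.
            if module.split(".")[0] not in ALLOWED_IMPORTS:
                violations.append(
                    f"line {node.lineno}: import '{module}' not allowlisted"
                )
        if (
            isinstance(node, ast.Attribute)
            and node.attr.startswith("__")
            and node.attr.endswith("__")
        ):
            violations.append(f"line {node.lineno}: dunder attribute '{node.attr}'")
    return violations


print(find_violations("import os\nprint(os.environ)"))
print(find_violations("import pandas as pd"))
```

This gate does not catch every route to a forbidden module (for example, a call to a builtin that imports by name), which is why the runtime syscall filter remains necessary.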
Harbor Analytics PAL refactor (worked example)
Harbor's stress-test copilot previously called exec() in the
FastAPI worker with a 10-second signal.alarm timeout. Problems:
shared process memory, visible environment variables, no network block, and a timeout
that a hung C extension could ignore, because Python runs its SIGALRM handler only
between bytecode instructions.
Refactor steps:
- Image hardening — Python 3.12 slim plus numpy, pandas, scipy only; no pip at runtime.
- Runner service — sidecar pool of gVisor-backed containers; one execution per container; destroy after each run.
- Pre-flight AST gate — reject imports outside allowlist and any __dunder__ attribute access on modules.
- Data injection — portfolio CSV written to /tmp/input.csv inside sandbox; no DB connection strings anywhere in the guest environment.
- Result envelope — capture stdout, stderr, exit code, and peak memory; return structured JSON to the ReAct loop as a tool observation.
- Observability — metric sandbox_kill_reason tagged by reason; alert on OOM rate > 2% of sessions.
Outcome: median execution latency rose by 180 ms with the container warm pool in place, but security review passed without exceptions and API-tier OOM incidents went to zero.
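The result envelope step can be sketched as a small dataclass serialized to JSON. Harbor's actual schema is not given in this guide, so the field names, the SandboxResult and to_observation names, and the 2000-character cap are assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class SandboxResult:
    exit_code: Optional[int]      # None when the run was killed before exiting
    stdout: str
    stderr: str
    peak_memory_bytes: int
    kill_reason: Optional[str] = None   # e.g. "timeout", "oom", "policy"


def to_observation(result: SandboxResult, max_chars: int = 2000) -> str:
    """Serialize a run into the JSON string handed back as a tool observation."""
    payload = asdict(result)
    for stream in ("stdout", "stderr"):
        text = payload[stream]
        # Record truncation explicitly so the model knows output was cut.
        payload[f"{stream}_truncated"] = len(text) > max_chars
        payload[stream] = text[:max_chars]
    payload["ok"] = result.exit_code == 0 and result.kill_reason is None
    return json.dumps(payload, sort_keys=True)


observation = to_observation(
    SandboxResult(exit_code=0, stdout="0.0731\n", stderr="", peak_memory_bytes=48_000_000)
)
print(observation)
```

Keeping an explicit ok flag and kill_reason in the envelope lets the agent loop branch on structured fields rather than parsing stderr text.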
Technique decision table
| Approach | Best for | Latency | Isolation strength | Ops burden |
|---|---|---|---|---|
| In-process eval / RestrictedPython | Local demos, trusted operators only | < 10 ms | Weak | Low |
| WASM (Pyodide, QuickJS) | Browser-side or edge math; limited stdlib | 50–200 ms | Strong memory; host imports define ceiling | Medium |
| Container per run (Docker/gVisor) | Multi-tenant SaaS PAL and data agents | 200 ms–2 s cold; < 300 ms warm | Strong | High |
| Managed sandbox API | Fast MVP, variable load | Vendor-dependent | Strong (vendor SLO) | Low (vendor) |
| No code execution (pure function calling) | Fixed operations with typed APIs | Lowest | Strongest (no arbitrary code) | Low |
Prefer function calling when operations are enumerable; reach for sandboxes when the model must compose novel data transforms or statistical code you cannot pre-declare as tools.
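For contrast with sandboxed execution, here is a minimal sketch of the function-calling alternative: the model picks a tool name and arguments, and the server runs only pre-declared functions. The two tools and the dispatch helper are hypothetical examples, not part of any specific framework.

```python
from typing import Callable


def portfolio_value(prices: list[float], quantities: list[float]) -> float:
    """Sum of price times quantity across holdings."""
    if len(prices) != len(quantities):
        raise ValueError("prices and quantities must have the same length")
    return sum(p * q for p, q in zip(prices, quantities))


def apply_price_shock(prices: list[float], shock_pct: float) -> list[float]:
    """Scale every price by (1 + shock_pct / 100)."""
    return [p * (1 + shock_pct / 100) for p in prices]


# The registry is the whole attack surface: nothing outside it can run.
TOOLS: dict[str, Callable] = {
    "portfolio_value": portfolio_value,
    "apply_price_shock": apply_price_shock,
}


def dispatch(name: str, arguments: dict):
    """Run a model-requested tool call, rejecting anything not registered."""
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**arguments)


print(dispatch("portfolio_value", {"prices": [100.0, 50.0], "quantities": [2, 4]}))
```

A production dispatcher would also validate argument types against a schema before calling the function; that step is omitted here for brevity.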
Common pitfalls
- Trusting prompt rules alone — “Only use pandas” does not stop import os; enforce in AST and seccomp.
- Reusing containers without reset — filesystem and global state leak between users; one container per tenant session minimum.
- Mounting the Docker socket — guest code can escape to the host; use a dedicated runner service.
- Allowing pip install at runtime — opens network egress and supply-chain risk; bake dependencies into images.
- Returning raw tracebacks to users — may expose paths and library versions; sanitize and log server-side.
- No concurrency cap — an agent loop can fork hundreds of sandboxes; queue with backpressure.
- Ignoring side channels — timing and memory usage can leak information across co-hosted sandboxes on weak isolation; use microVMs for high-sensitivity workloads.
- Sandbox without output validation — executed code can still produce malicious strings that become indirect injections in the next turn.
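The concurrency pitfall in the list above can be addressed with a simple gate in front of the runner. The sketch below uses asyncio.Semaphore so that excess requests wait instead of spawning more sandboxes; the SandboxGate class and the fake job are hypothetical, and a real deployment would likely also bound how long callers may wait.

```python
import asyncio


class SandboxGate:
    """Allow at most max_concurrent sandbox runs; extra callers wait in line."""

    def __init__(self, max_concurrent: int):
        self._sem = asyncio.Semaphore(max_concurrent)
        self.running = 0
        self.peak = 0   # highest concurrency observed, useful as a metric

    async def run(self, job):
        async with self._sem:
            self.running += 1
            self.peak = max(self.peak, self.running)
            try:
                return await job()
            finally:
                self.running -= 1


async def demo() -> tuple[int, int]:
    # Create the gate inside the running loop so it binds to the right loop.
    gate = SandboxGate(max_concurrent=3)

    async def fake_sandbox_run():
        await asyncio.sleep(0.01)   # stands in for a real sandbox execution
        return 1

    results = await asyncio.gather(*(gate.run(fake_sandbox_run) for _ in range(10)))
    return gate.peak, sum(results)


peak, completed = asyncio.run(demo())
print(f"peak concurrency: {peak}, completed runs: {completed}")
```

All ten requests complete, but never more than three run at once, which is the backpressure behavior the pitfall calls for.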
Production checklist
- Never execute model-generated code in the API process on multi-tenant paths.
- Enforce wall-clock, CPU, memory, and output-size limits per execution.
- Run static import analysis before launching the sandbox.
- Use default-deny network egress; allowlist only when required.
- Provide data via injected files or stdin, not live database handles.
- Destroy or fully reset isolation boundary after each run.
- Pool warm sandboxes to meet interactive latency SLOs.
- Log kill reason, duration, and memory peak for every execution.
- Rate-limit sandbox spawns per user and per agent session.
- Sanitize stdout before inserting into LLM context or user-visible chat.
- Red-team with escape attempts (env read, fork bomb, metadata curl) in CI.
- Document data residency for managed sandbox vendors in your security packet.
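For the stdout sanitization item in the checklist above, a minimal sketch could strip terminal escape sequences and control characters, then truncate with an explicit marker. The function name and the default limit are assumptions, and this step does not by itself detect injected instructions in the text; it only normalizes and bounds it.

```python
import re

# ANSI CSI sequences: ESC [ parameters, intermediates, final byte.
_ANSI_CSI = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")
# Control characters other than tab and newline.
_CONTROL = re.compile(r"[\x00-\x08\x0b-\x1f\x7f]")


def sanitize_for_context(text: str, max_chars: int = 4000) -> str:
    """Clean sandbox output before it reaches the LLM context or the chat UI."""
    text = _ANSI_CSI.sub("", text)
    text = _CONTROL.sub("", text)
    if len(text) > max_chars:
        dropped = len(text) - max_chars
        text = text[:max_chars] + f"\n[truncated {dropped} chars]"
    return text


print(sanitize_for_context("\x1b[31mred\x1b[0m\x00 ok"))
```

The truncation marker states how much was dropped, so neither the model nor the user mistakes a cut-off result for a complete one.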
Key takeaways
- Model-generated code is untrusted input — treat it like running a user-uploaded binary, not like executing your own scripts.
- Container or microVM isolation with default-deny network and import allowlists is the production baseline for PAL and code-interpreter agents.
- Harbor Analytics eliminated credential leaks by moving PAL execution off the API tier into ephemeral gVisor containers with structured result envelopes.
- Prefer typed function calling when operations are fixed; sandboxes earn their complexity when the model must compose novel analytics code.
- Sandboxing contains blast radius — pair it with prompt-injection defenses, tool error handling, and loop termination for defense in depth.
Related reading
- LLM program-aided language explained — PAL pattern, sympy helpers, and when code beats chain-of-thought
- Prompt injection explained — direct and indirect attacks on LLM apps
- LLM tool error handling explained — structured observations and recovery after failed executions
- LLM guardrails explained — input filters, output policies, and safe agents