Guide
LLM program-aided language explained
Harbor Analytics’ risk desk asked an internal copilot: “If the 10-year yield rises 75 bp and credit spreads widen 40 bp, what happens to our $420M balanced sleeve’s one-year 95% CVaR and sector weights?” A plain chain-of-thought answer hallucinated a CVaR of −8.2% and mis-summed sector exposures by $14M. Switching to program-aided language (PAL) changed the workflow: the model wrote a 38-line Python script that loaded the sleeve CSV, applied duration/convexity shocks per bucket, ran a Monte Carlo with 10,000 paths, and printed {"cvar_95": -0.0614, "top_sector_shift_bps": {...}}. A sandbox executed the code in 1.4 seconds; the model read stdout and composed the narrative answer. CVaR matched the desk’s internal model within 3 bp. Accuracy improved because arithmetic and simulation left the token stream and ran in a deterministic interpreter, not because the base model got smarter overnight.
Program-aided language is a reasoning pattern where the LLM generates executable code (usually Python) that an external runtime evaluates. The model plans the solution, delegates numeric and logical work to code, and interprets the execution output for the user. PAL predates modern function calling but solves an overlapping problem: reliable computation without brittle mental math in autoregressive text. This guide covers the PAL loop, prompt templates, sandbox design, libraries (NumPy, pandas, sympy), the Harbor Analytics refactor, a technique decision table versus CoT and tool-use agents, pitfalls, and a production checklist.
What program-aided language is
In standard prompting, the model predicts every token itself, including digits and operators. Multi-step arithmetic compounds error: a single transposed digit in step three poisons the final answer, and the model cannot go back and revise scratch work it has already emitted. PAL sidesteps this by treating the LLM as a program synthesizer, not a calculator.
The canonical PAL loop (Gao et al., 2022) has three stages:
- Natural-language understanding — parse the user question into variables, constraints, and the computation needed.
- Code generation — emit a self-contained script that assigns inputs, performs operations, and prints or returns a structured result.
- Answer formation — feed execution output back to the model (or template-fill) to produce a human-readable response citing the computed values.
Unlike open-ended agent tool loops, classic PAL is often a single-shot code write + execute pattern focused on math, date logic, combinatorics, and tabular transforms. Modern systems merge PAL with function calling: the “run_python” tool is one of several available actions.
The PAL execution loop
Production PAL pipelines share a common skeleton:
User question
→ LLM (system: "Write Python, print JSON result")
→ Extract code block
→ Sandbox execute (timeout, memory cap)
→ stdout / stderr / exit code
→ LLM (optional): interpret output → final answer
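The skeleton above can be sketched as a small orchestration function. This is a minimal sketch under stated assumptions, not a reference implementation: run_pal, extract_code, and ExecResult are hypothetical names, generate stands in for whatever LLM client you use, and execute stands in for your sandbox runner. The optional interpretation turn is left out.

```python
import json
import re
from dataclasses import dataclass
from typing import Callable, Optional

FENCE = "`" * 3  # built indirectly so this sketch can itself live inside a fence
CODE_BLOCK = re.compile(FENCE + r"(?:python)?\s*\n(.*?)" + FENCE, re.DOTALL)


@dataclass
class ExecResult:
    """Hypothetical container for what the sandbox hands back."""
    stdout: str
    stderr: str
    exit_code: int


def extract_code(reply: str) -> Optional[str]:
    """Return the body of the first fenced code block in the reply, or None."""
    match = CODE_BLOCK.search(reply)
    return match.group(1).strip() if match else None


def run_pal(
    question: str,
    generate: Callable[[str], str],
    execute: Callable[[str], ExecResult],
) -> dict:
    """One pass through the loop: generate, extract, execute, parse."""
    code = extract_code(generate(question))
    if code is None:
        raise ValueError("model reply contained no fenced code block")
    result = execute(code)  # assumed to run inside the sandbox, never in-process
    if result.exit_code != 0:
        raise RuntimeError("sandbox run failed: " + result.stderr.strip())
    lines = result.stdout.strip().splitlines()
    if not lines:
        raise RuntimeError("sandbox run produced no output")
    return json.loads(lines[-1])  # contract: one JSON object on the last line
```

Passing generate and execute as callables keeps the orchestration testable with fakes and leaves the choice of model client and sandbox open.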
Prompt structure
Effective PAL prompts specify:
- Language and version: “Python 3.11, standard library plus numpy and pandas available.”
- Output contract: “Print exactly one JSON object on the last line” or “Set variable answer to a float.”
- No side effects: “Do not read files outside /data/, do not network, do not subprocess.”
- Show work in comments: aids debugging when execution fails.
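One way to keep these four constraints from drifting is to assemble the system prompt from configuration. The helper below is a hypothetical sketch; build_pal_system_prompt and its parameters are illustrative names, and the wording of each line is only one possible phrasing.

```python
from typing import Sequence


def build_pal_system_prompt(
    python_version: str,
    packages: Sequence[str],
    data_dir: str,
    result_keys: Sequence[str],
) -> str:
    """Assemble a system prompt that states each constraint explicitly."""
    keys = ", ".join(f'"{k}"' for k in result_keys)
    lines = [
        # Language and version, plus what can be imported.
        f"Write Python {python_version}. Available packages: "
        f"the standard library, {', '.join(packages)}.",
        # Output contract.
        f"Print exactly one JSON object with keys {keys} as the last line of stdout.",
        # No side effects.
        f"Do not read files outside {data_dir}, open network connections, "
        "or start subprocesses.",
        # Show work in comments.
        "Explain each step in a short comment.",
    ]
    return "\n".join(lines)
```

For example, build_pal_system_prompt("3.11", ["numpy", "pandas"], "/data/", ["final_price"]) yields a four-line prompt that matches the sandbox image as long as both are generated from the same settings.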
Example user turn for a word problem:
A store marks up wholesale cost by 22%, then applies a 15% discount.
Wholesale is $47.50. What is the final price?
Write Python. Print {"final_price": <float>} as the last line.
Expected generated code:
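import json

wholesale = 47.50
marked = wholesale * 1.22        # apply the 22% markup
final = marked * (1 - 0.15)      # apply the 15% discount
# json.dumps emits valid JSON (double quotes); printing a dict directly would not
print(json.dumps({"final_price": round(final, 2)}))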
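On the receiving side, the output contract can be enforced with a strict parser instead of trusting whatever the script printed. A minimal sketch, assuming the contract is "one JSON object of finite numbers on the last non-empty line"; parse_pal_output and required_keys are hypothetical names.

```python
import json
import math
from typing import Iterable


def parse_pal_output(stdout: str, required_keys: Iterable[str]) -> dict:
    """Parse the last non-empty stdout line as a JSON object of finite numbers."""
    lines = [ln for ln in stdout.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty stdout")
    try:
        payload = json.loads(lines[-1])
    except json.JSONDecodeError as exc:
        raise ValueError(f"last line is not valid JSON: {exc}") from exc
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object")
    for key in required_keys:
        if key not in payload:
            raise ValueError(f"missing key: {key}")
        value = payload[key]
        # bool is a subclass of int, so exclude it explicitly
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not math.isfinite(value):
            raise ValueError(f"{key} is not a finite number")
    return payload
```

Earlier debug prints are tolerated because only the last line is parsed. A Python dict repr with single quotes is rejected, and so is NaN, which json.loads would otherwise accept by default.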
When to re-prompt vs self-correct
If execution raises an exception or stdout is empty, append the traceback to the conversation and ask the model to fix the script — similar to self-refine but with executable feedback instead of a prose critic. Cap retries at two or three; persistent failure should fall back to human review or a simpler heuristic.
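The repair loop described above might look like the following sketch. The names are hypothetical: generate takes the running message list and returns a script, and execute returns a (stdout, stderr, exit_code) tuple from the sandbox. The default cap of three attempts follows the guidance above.

```python
from typing import Callable, List, Tuple

ExecOutput = Tuple[str, str, int]  # (stdout, stderr, exit_code)


def generate_with_repair(
    question: str,
    generate: Callable[[List[str]], str],
    execute: Callable[[str], ExecOutput],
    max_attempts: int = 3,
) -> Tuple[str, int]:
    """Return (stdout, attempts_used), feeding failures back to the model."""
    messages = [question]
    for attempt in range(1, max_attempts + 1):
        code = generate(messages)
        stdout, stderr, exit_code = execute(code)
        if exit_code == 0 and stdout.strip():
            return stdout, attempt
        # Executable feedback: the real traceback, not a prose critique.
        feedback = stderr.strip() or "The script produced no output."
        messages.append(code)
        messages.append("Execution failed. Fix the script.\n" + feedback)
    # Persistent failure is surfaced to the caller rather than guessed around.
    raise RuntimeError(f"PAL failed after {max_attempts} attempts")
```

The caller decides what the RuntimeError means in practice, for example routing to human review or a simpler heuristic.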
Sandbox design and security
PAL is only safe if code runs in an isolated environment. Never eval() model output in your application process.
Isolation layers
- Process sandbox: Docker, gVisor, or Firecracker VM per execution; destroy container after timeout.
- Resource limits: CPU quota, 256–512 MB RAM, 5–30 s wall clock; kill on exceed.
- Filesystem: read-only base image; mount user data read-only at a fixed path; no write except /tmp.
- Network: disabled by default. If a tool needs HTTP, whitelist domains in a separate audited tool, not arbitrary requests.get in PAL code.
- Import allowlist: block os, subprocess, socket, ctypes; permit math, datetime, json, numpy, pandas, sympy as needed.
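The import allowlist can be checked statically before the script ever reaches the sandbox. The sketch below uses the standard ast module; disallowed_imports is a hypothetical helper, and the allowlist simply mirrors the modules named above. It is a pre-filter only: it does not catch __import__(...) or other dynamic imports, so isolation still has to come from the sandbox itself.

```python
import ast
from typing import Iterable, List

ALLOWED_IMPORTS = {"math", "datetime", "json", "numpy", "pandas", "sympy"}


def disallowed_imports(source: str, allowed: Iterable[str] = ALLOWED_IMPORTS) -> List[str]:
    """Return sorted top-level module names imported outside the allowlist."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import a.b as c" counts as importing top-level package "a"
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            if node.level > 0 or node.module is None:
                found.add("<relative import>")
            else:
                found.add(node.module.split(".")[0])
    return sorted(found - set(allowed))
```

A non-empty result can be returned to the model as repair feedback without spending a sandbox run on a script that would be rejected anyway.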
Determinism and audit
Log the full prompt, generated source, stdout, stderr, and runtime for every execution. Financial and compliance use cases require reproducible reruns: pin library versions in the sandbox image and seed random number generators explicitly in generated code when Monte Carlo is involved.
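A per-execution audit record could be as simple as one JSON line. This is a sketch under assumptions: audit_record and its field names are hypothetical, and image_tag is assumed to identify the pinned sandbox image so a rerun can use the same library versions.

```python
import hashlib
import json
from datetime import datetime, timezone


def audit_record(prompt: str, source: str, stdout: str, stderr: str,
                 runtime_s: float, image_tag: str) -> dict:
    """Bundle everything needed to reproduce and review one PAL execution."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "image_tag": image_tag,  # assumed label of the pinned sandbox image
        # The hash lets reviewers confirm which exact script produced the output.
        "source_sha256": hashlib.sha256(source.encode("utf-8")).hexdigest(),
        "prompt": prompt,
        "source": source,
        "stdout": stdout,
        "stderr": stderr,
        "runtime_s": round(runtime_s, 3),
    }


def to_log_line(record: dict) -> str:
    """Serialize with sorted keys so log lines diff cleanly."""
    return json.dumps(record, sort_keys=True)
```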
Libraries and problem types
Match the sandbox stack to the question domain:
- Arithmetic and algebra: plain Python or decimal.Decimal for currency; sympy for symbolic manipulation.
- Tables and time series: pandas for CSV joins, rolling windows, groupby aggregations.
- Statistics and simulation: numpy vectorization; an explicit seed for reproducible draws (random.seed() for the standard library, numpy.random.default_rng(seed) for NumPy).
- Date and calendar logic: datetime, dateutil; never trust the model to count leap years in prose.
- Combinatorics and search: small brute-force loops are fine; cap iteration counts in the prompt for NP-style problems.
Pre-load common helper modules in the sandbox image so generated imports succeed consistently. Document available packages in the system prompt to reduce import errors.
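To illustrate the seeding advice for simulation, here is a standard-library sketch of a 95% CVaR estimate. The normal return model and its parameters are placeholder assumptions chosen for illustration, not Harbor's model, and simulate_cvar is a hypothetical name.

```python
import random


def simulate_cvar(seed: int, n_paths: int = 10_000, mean: float = 0.04,
                  stdev: float = 0.10, level: float = 0.95) -> float:
    """Estimate CVaR as the mean of the worst (1 - level) share of returns."""
    rng = random.Random(seed)  # private generator; global state is untouched
    returns = sorted(rng.gauss(mean, stdev) for _ in range(n_paths))
    tail_size = max(1, round(n_paths * (1 - level)))
    tail = returns[:tail_size]  # lowest returns first after sorting
    return sum(tail) / len(tail)
```

Calling simulate_cvar with the same seed returns the same estimate on every run, which is what makes an audited rerun comparable to the original.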
Harbor Analytics refactor
Before PAL, Harbor’s internal “Ask Risk” bot used CoT for ad hoc portfolio what-if questions. Analysts reported ~30% numeric disagreement with Excel for multi-leg shock scenarios. The refactor:
- Classify intent: route questions containing numbers, percentages, “what if”, CVaR, or sector weights to the PAL path; pure policy Q&A stays on RAG.
- Inject data context: mount the latest sleeve snapshot CSV read-only; prompt includes schema (sector, mv, duration, convexity).
- Generate + execute: model writes shock script; sandbox returns JSON metrics.
- Validate bounds: post-processor rejects CVaR > 0 or sector weights that do not sum to 1.0; triggers one repair attempt.
- Narrate with citations: final turn must quote printed JSON fields; UI shows “Computed via Python” expandable source.
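The bounds check in the list above could be sketched as follows. validate_risk_metrics is a hypothetical name, and the tolerance on the weight sum is an assumption, since the refactor description does not state one.

```python
import math
from typing import Dict, List


def validate_risk_metrics(cvar_95: float, sector_weights: Dict[str, float],
                          tol: float = 1e-6) -> List[str]:
    """Return a list of violations; an empty list means the metrics pass."""
    problems = []
    # CVaR is reported as a negative return, so a positive value is suspect.
    if not math.isfinite(cvar_95) or cvar_95 > 0:
        problems.append(f"cvar_95 out of bounds: {cvar_95}")
    total = sum(sector_weights.values())
    if not math.isclose(total, 1.0, abs_tol=tol):
        problems.append(f"sector weights sum to {total}, expected 1.0")
    return problems
```

Returning messages instead of raising makes it easy to hand the violations to the model as the prompt for the single repair attempt.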
Result: numeric disagreement dropped below 2% on a 50-question golden set; p95 latency rose 800 ms (sandbox cold start) — acceptable for internal risk chat, not for sub-100 ms autocomplete. The team kept CoT for explanatory prose and PAL only for quant paths.
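The split between quant and prose paths depends on the intent classifier. A deliberately crude keyword version is sketched below; QUANT_PATTERN and route are hypothetical names, the trigger list mirrors the one described for the refactor, and a production router would likely need something more robust than a regex.

```python
import re

# Triggers: any digit, a percent sign, "what if", CVaR, or sector weight(s).
QUANT_PATTERN = re.compile(
    r"\d|%|\bwhat[\s-]if\b|\bcvar\b|\bsector weights?\b",
    re.IGNORECASE,
)


def route(question: str) -> str:
    """Return "pal" for quantitative questions and "rag" for everything else."""
    return "pal" if QUANT_PATTERN.search(question) else "rag"
```

Matching on any digit over-routes by design: a policy question that cites a section number would land on the PAL path, which costs latency but does not risk a guessed number.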
Technique decision table
| Scenario | Prefer PAL | Prefer alternative |
|---|---|---|
| Multi-step arithmetic, percentages, unit conversion | Yes — deterministic interpreter | CoT fails on compound operations |
| Aggregations over uploaded CSV / JSON | Yes — pandas in sandbox | RAG alone cannot sum columns |
| Calling your order API or calendar | No | Function calling with typed handlers |
| Subjective summary of a long document | No | RAG or context compression |
| Must return strict JSON schema to downstream | Partial — code prints JSON, then validate | Grammar-constrained decoding or structured outputs |
| Open-ended multi-tool agent (search + code + email) | PAL as one tool inside loop | Full agent orchestration (LangGraph, etc.) |
| Latency < 200 ms, simple two-number add | No — sandbox overhead dominates | Client-side calc or tiny typed function |
Common pitfalls
- Running unsandboxed code: model-generated os.system is a remote-code-execution vulnerability; always isolate.
- Trusting stdout format: parse JSON with a strict parser; do not regex-scrape free-form prints.
- Floating-point money: prompt for Decimal or integer cents; binary float surprises on $0.1 + $0.2.
- Stale sandbox data: state the snapshot timestamp in the prompt so the model does not assume live prices.
- Over-using PAL: routing every question through Python adds cost and latency; classify first.
- Silent fallback to CoT: when sandbox fails, do not guess; show error state to the user.
- Unbounded Monte Carlo: model may write 10M iterations; cap in prompt and kill by CPU time.
- Leaking secrets into prompts: never embed API keys for the model to paste into generated code.
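The floating-point money pitfall is easy to demonstrate with the markup-and-discount word problem from earlier in this guide, redone with decimal.Decimal. This is a sketch; final_price is a hypothetical helper, and half-up rounding to cents is an assumption about the pricing policy.

```python
from decimal import Decimal, ROUND_HALF_UP


def final_price(wholesale: str, markup_pct: str, discount_pct: str) -> Decimal:
    """Apply a markup then a discount using exact decimal arithmetic."""
    cost = Decimal(wholesale)  # built from strings, so no binary rounding
    marked = cost * (1 + Decimal(markup_pct) / 100)
    discounted = marked * (1 - Decimal(discount_pct) / 100)
    return discounted.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


print(final_price("47.50", "22", "15"))        # 49.26
print(Decimal("0.1") + Decimal("0.2"))          # 0.3
print(0.1 + 0.2 == 0.3)                         # False with binary floats
```

Passing amounts as strings matters: Decimal(47.50) would inherit whatever binary representation the float already had.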
Production checklist
- Define routing rules: which intents trigger PAL vs RAG vs function calling.
- Document allowed imports, data mount paths, and output format in the system prompt.
- Deploy sandbox with network off, resource limits, and ephemeral containers.
- Extract code from fenced blocks; reject multi-file or binary payloads.
- Parse and validate structured stdout before the narrative final turn.
- Implement retry-on-traceback with a max attempt count.
- Log source, stdout, stderr, and runtime for audit and debugging.
- Build a golden set of numeric questions with expected outputs; regression-test weekly.
- Display provenance in UI (“Answer computed by executed Python”).
- Monitor sandbox escape attempts, timeout rate, and cost per PAL request.
Key takeaways
- PAL lets LLMs write code that a sandbox executes — offloading arithmetic and logic to a deterministic runtime.
- Harbor Analytics cut numeric errors on portfolio what-ifs by routing quant questions through Python instead of chain-of-thought.
- Security is non-negotiable: isolated containers, import allowlists, no network, strict timeouts.
- Use PAL for math, tables, and simulation; use function calling for live APIs and RAG for document QA.
- Validate stdout structurally and show computation provenance — never let the model silently revert to guessing.
Related reading
- LLM chain-of-thought explained — step-by-step reasoning in tokens
- LLM function calling explained — typed tool schemas and multi-turn loops
- AI agents and tool use explained — orchestrating multiple actions
- LLM self-refine explained — critique loops with executable validators