LLM program-aided language explained

Harbor Analytics’ risk desk asked an internal copilot: “If the 10-year yield rises 75 bp and credit spreads widen 40 bp, what happens to our $420M balanced sleeve’s one-year 95% CVaR and sector weights?” A plain chain-of-thought answer hallucinated a CVaR of −8.2% and mis-summed sector exposures by $14M. Switching to program-aided language (PAL) changed the workflow: the model wrote a 38-line Python script that loaded the sleeve CSV, applied duration/convexity shocks per bucket, ran a Monte Carlo with 10,000 paths, and printed {"cvar_95": -0.0614, "top_sector_shift_bps": {...}}. A sandbox executed the code in 1.4 seconds; the model read stdout and composed the narrative answer. CVaR matched the desk’s internal model within 3 bp. Accuracy improved because arithmetic and simulation left the token stream and ran in a deterministic interpreter — not because the base model got smarter overnight.

Program-aided language is a reasoning pattern where the LLM generates executable code (usually Python) that an external runtime evaluates. The model plans the solution, delegates numeric and logical work to code, and interprets the execution output for the user. PAL predates modern function calling but solves an overlapping problem: reliable computation without brittle mental math in autoregressive text. This guide covers the PAL loop, prompt templates, sandbox design, libraries (NumPy, pandas, sympy), the Harbor Analytics refactor, a technique decision table versus CoT and tool-use agents, pitfalls, and a production checklist.

What program-aided language is

In standard prompting, the model predicts the next token including digits and operators. Multi-step arithmetic compounds error: a single transposed digit in step three poisons the final answer, and the model cannot go back and revise scratch work it has already emitted. PAL sidesteps this by treating the LLM as a program synthesizer, not a calculator.

The canonical PAL loop (Gao et al., 2022) has three stages:

  1. Natural-language understanding — parse the user question into variables, constraints, and the computation needed.
  2. Code generation — emit a self-contained script that assigns inputs, performs operations, and prints or returns a structured result.
  3. Answer formation — feed execution output back to the model (or template-fill) to produce a human-readable response citing the computed values.

Unlike open-ended agent tool loops, classic PAL is often a single-shot code write + execute pattern focused on math, date logic, combinatorics, and tabular transforms. Modern systems merge PAL with function calling: the “run_python” tool is one of several available actions.

The PAL execution loop

Production PAL pipelines share a common skeleton:

User question
    → LLM (system: "Write Python, print JSON result")
    → Extract code block
    → Sandbox execute (timeout, memory cap)
    → stdout / stderr / exit code
    → LLM (optional): interpret output → final answer
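
The skeleton above can be sketched in a few functions. This is a minimal illustration, not a production pipeline: the names extract_code, run_untrusted, and parse_result are hypothetical, a canned string stands in for the first LLM call, and a child Python process stands in for the sandbox (a child process alone provides no isolation).

```python
import json
import re
import subprocess
import sys

# Matches the first fenced block; the fence is written as `{3} so this
# example never contains a literal fence of its own.
CODE_BLOCK = re.compile(r"`{3}(?:python)?[ \t]*\n(.*?)`{3}", re.DOTALL)


def extract_code(reply: str) -> str:
    """Return the body of the first fenced code block in the model reply."""
    match = CODE_BLOCK.search(reply)
    if match is None:
        raise ValueError("no fenced code block in model reply")
    return match.group(1)


def run_untrusted(source: str, timeout_s: float = 10.0) -> subprocess.CompletedProcess:
    """Run the script in a child interpreter with a wall-clock timeout.

    NOTE: a child process is NOT a sandbox. It only stands in for the
    container or microVM that would execute the code in production.
    """
    return subprocess.run(
        [sys.executable, "-I", "-c", source],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )


def parse_result(stdout: str) -> dict:
    """Parse the last non-empty stdout line as a JSON object."""
    lines = [line for line in stdout.splitlines() if line.strip()]
    if not lines:
        raise ValueError("script printed nothing")
    result = json.loads(lines[-1])
    if not isinstance(result, dict):
        raise ValueError("last line is not a JSON object")
    return result


# A canned reply stands in for the first LLM call.
fence = "`" * 3
reply = (
    "Here is the script:\n"
    f"{fence}python\n"
    "import json\n"
    "print(json.dumps({'final_price': round(47.50 * 1.22 * 0.85, 2)}))\n"
    f"{fence}\n"
)

proc = run_untrusted(extract_code(reply))
result = parse_result(proc.stdout) if proc.returncode == 0 else None
print(proc.returncode, result)
```

Keeping extraction, execution, and parsing as separate steps means each failure mode (no code block, non-zero exit, malformed output) can be logged and handled on its own.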

Prompt structure

Effective PAL prompts specify:

  • Language and version — “Python 3.11, standard library plus numpy and pandas available.”
  • Output contract — “Print exactly one JSON object on the last line” or “Set variable answer to a float.”
  • No side effects — “Do not read files outside /data/, do not network, do not subprocess.”
  • Show work in comments — aids debugging when execution fails.
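
One way to keep these four points consistent across requests is to build the system prompt from configuration instead of hand-editing it. The sketch below is illustrative only; build_system_prompt and its parameters are hypothetical names, and the wording of each line is an assumption based on the bullets above.

```python
def build_system_prompt(python_version: str, packages: list[str], data_dir: str) -> str:
    """Assemble a PAL system prompt covering the four points in the list above."""
    return "\n".join([
        # Language and version
        f"Write Python {python_version}. Available: standard library, {', '.join(packages)}.",
        # Output contract
        "Print exactly one JSON object on the last line of stdout.",
        # No side effects
        f"Do not read files outside {data_dir}, do not use the network, do not spawn subprocesses.",
        # Show work in comments
        "Explain each step in a short comment.",
    ])


prompt = build_system_prompt("3.11", ["numpy", "pandas"], "/data/")
print(prompt)
```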

Example user turn for a word problem:

A store marks up wholesale cost by 22%, then applies a 15% discount.
Wholesale is $47.50. What is the final price?
Write Python. Print {"final_price": <float>} as the last line.

Expected generated code:

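import json

wholesale = 47.50
marked = wholesale * 1.22        # apply the 22% markup
final = marked * (1 - 0.15)      # apply the 15% discount
print(json.dumps({"final_price": round(final, 2)}))  # valid JSON on the last line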

When to re-prompt vs self-correct

If execution raises an exception or stdout is empty, append the traceback to the conversation and ask the model to fix the script — similar to self-refine but with executable feedback instead of a prose critic. Cap retries at two or three; persistent failure should fall back to human review or a simpler heuristic.
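
A capped repair loop along these lines might look like the sketch below. It is a simplified illustration under stated assumptions: fake_model is a hypothetical stand-in for the LLM call, the child process stands in for the sandbox, and the feedback message wording is made up.

```python
import subprocess
import sys
from typing import Callable, Optional


def run_script(source: str, timeout_s: float = 10.0) -> tuple[bool, str]:
    """Return (ok, output). The child process stands in for a real sandbox."""
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", source],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, "timeout"
    if proc.returncode != 0 or not proc.stdout.strip():
        return False, proc.stderr or "empty stdout"
    return True, proc.stdout


def solve_with_repair(
    question: str,
    generate: Callable[[list[str]], str],
    max_attempts: int = 3,
) -> tuple[Optional[str], int]:
    """Ask for a script, run it, and feed failures back until the cap is hit."""
    messages = [question]
    for attempt in range(1, max_attempts + 1):
        ok, output = run_script(generate(messages))
        if ok:
            return output, attempt
        # Executable feedback: the traceback itself becomes the critique.
        messages.append(f"Your script failed:\n{output}\nFix it and resend the full script.")
    return None, max_attempts  # caller falls back to human review


def fake_model(messages: list[str]) -> str:
    """Hypothetical LLM stand-in: buggy on the first try, fixed after feedback."""
    if len(messages) == 1:
        return "print(1 / 0)"
    return "print(21 * 2)"


output, attempts = solve_with_repair("What is 21 * 2?", fake_model)
print(output, attempts)
```

Returning None after the cap, instead of a best guess, leaves the fallback decision to the caller.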

Sandbox design and security

PAL is only safe if code runs in an isolated environment. Never eval() model output in your application process.

Isolation layers

  • Process sandbox — Docker, gVisor, or Firecracker VM per execution; destroy container after timeout.
  • Resource limits — CPU quota, 256–512 MB RAM, 5–30 s wall clock; kill on exceed.
  • Filesystem — read-only base image; mount user data read-only at a fixed path; no write except /tmp.
  • Network — disabled by default. If a tool needs HTTP, whitelist domains in a separate audited tool, not arbitrary requests.get in PAL code.
  • Import allowlist — block os, subprocess, socket, ctypes; permit math, datetime, json, numpy, pandas, sympy as needed.
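
As an illustration of the import allowlist layer, the sketch below statically scans generated source with the standard ast module before it reaches the sandbox. The function name disallowed_imports is hypothetical. A static scan like this can be bypassed (for example through __import__ or importlib), so it is only a cheap first filter in front of process isolation, never a replacement for it.

```python
import ast

# Mirrors the permitted modules named in the list above.
ALLOWED_MODULES = {"math", "datetime", "json", "numpy", "pandas", "sympy"}


def disallowed_imports(source: str) -> set[str]:
    """Return top-level module names imported by source that are not allowlisted."""
    bad = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Relative imports make no sense in a single-file script; flag them.
            names = [node.module or ""] if node.level == 0 else ["<relative>"]
        else:
            continue
        for name in names:
            root = name.split(".")[0]  # "numpy.linalg" -> "numpy"
            if root not in ALLOWED_MODULES:
                bad.add(root)
    return bad


safe = "import math\nfrom datetime import date\nprint(math.sqrt(2))"
unsafe = "import os, json\nfrom subprocess import run\nimport numpy.linalg"

print(disallowed_imports(safe))
print(sorted(disallowed_imports(unsafe)))
```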

Determinism and audit

Log the full prompt, generated source, stdout, stderr, and runtime for every execution. Financial and compliance use cases require reproducible reruns: pin library versions in the sandbox image and seed random number generators explicitly in generated code when Monte Carlo is involved.
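
To make the seeding point concrete, here is a toy simulation whose result depends only on its seed. It is a sketch under loud assumptions: the normal return model (mean 4%, standard deviation 10%) is invented for illustration, simulate_cvar is a hypothetical name, and the standard library's random.Random is used to keep the example dependency-free. With NumPy, numpy.random.default_rng(seed) plays the same role.

```python
import random


def simulate_cvar(seed: int, n_paths: int = 10_000, alpha: float = 0.95) -> float:
    """Toy CVaR: mean of the worst (1 - alpha) share of simulated returns.

    The return distribution here is made up for illustration only.
    """
    rng = random.Random(seed)  # private generator, no hidden global state
    returns = sorted(rng.gauss(0.04, 0.10) for _ in range(n_paths))
    tail = returns[: max(1, round(n_paths * (1 - alpha)))]
    return sum(tail) / len(tail)


first = simulate_cvar(seed=7)
rerun = simulate_cvar(seed=7)
print(first == rerun)  # same seed, same result
```

Logging the seed alongside the generated source is what lets an auditor rerun the script and get the same number.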

Libraries and problem types

Match the sandbox stack to the question domain:

  • Arithmetic and algebra — plain Python or decimal.Decimal for currency; sympy for symbolic manipulation.
  • Tables and time series — pandas for CSV joins, rolling windows, groupby aggregations.
  • Statistics and simulation — numpy vectorization; an explicitly seeded generator such as numpy.random.default_rng(seed) for reproducible draws (the standard library's random.seed() does not seed NumPy).
  • Date and calendar logic — datetime, dateutil; never trust the model to count leap years in prose.
  • Combinatorics and search — small brute-force loops are fine; cap iteration counts in the prompt for NP-style problems.

Pre-load common helper modules in the sandbox image so generated imports succeed consistently. Document available packages in the system prompt to reduce import errors.
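
As a small example of the date and calendar case, the snippet below lets the standard library handle leap years. The helper name days_between and the chosen dates are illustrative.

```python
import calendar
from datetime import date


def days_between(start: date, end: date) -> int:
    """Number of days from start to end, leap years handled by datetime."""
    return (end - start).days


print(days_between(date(2024, 2, 1), date(2024, 3, 1)))  # 29 (2024 is a leap year)
print(days_between(date(2023, 2, 1), date(2023, 3, 1)))  # 28
print(calendar.isleap(2024), calendar.isleap(2100))      # True False
```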

Harbor Analytics refactor

Before PAL, Harbor’s internal “Ask Risk” bot used CoT for ad hoc portfolio what-if questions. Analysts reported ~30% numeric disagreement with Excel for multi-leg shock scenarios. The refactor:

  1. Classify intent — route questions containing numbers, percentages, “what if”, CVaR, or sector weights to the PAL path; pure policy Q&A stays on RAG.
  2. Inject data context — mount the latest sleeve snapshot CSV read-only; prompt includes schema (sector, mv, duration, convexity).
  3. Generate + execute — model writes shock script; sandbox returns JSON metrics.
  4. Validate bounds — post-processor rejects CVaR > 0 or sector weights that do not sum to 1.0 within a tolerance; a rejection triggers one repair attempt.
  5. Narrate with citations — final turn must quote printed JSON fields; UI shows “Computed via Python” expandable source.
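
The bounds check in step 4 could be sketched as below. This is not Harbor's actual post-processor: validate_metrics, the sector_weights field, the sector names, and the 1e-6 tolerance are all assumptions made for illustration. Only cvar_95 appears in the printed JSON described earlier.

```python
import math


def validate_metrics(metrics: dict, weight_tol: float = 1e-6) -> list[str]:
    """Return a list of problems; an empty list means the metrics pass."""
    errors = []

    cvar = metrics.get("cvar_95")
    if isinstance(cvar, bool) or not isinstance(cvar, (int, float)) or not math.isfinite(cvar):
        errors.append("cvar_95 is missing or not a finite number")
    elif cvar > 0:
        errors.append(f"cvar_95 should not be positive, got {cvar}")

    weights = metrics.get("sector_weights")  # hypothetical field name
    if not isinstance(weights, dict) or not weights:
        errors.append("sector_weights is missing or empty")
    else:
        total = sum(weights.values())
        if not math.isclose(total, 1.0, rel_tol=0.0, abs_tol=weight_tol):
            errors.append(f"sector weights sum to {total:.6f}, expected 1.0")

    return errors


# Made-up sample payloads.
good = {"cvar_95": -0.0614, "sector_weights": {"tech": 0.40, "energy": 0.35, "health": 0.25}}
bad = {"cvar_95": 0.03, "sector_weights": {"tech": 0.40, "energy": 0.35}}

print(validate_metrics(good))
print(validate_metrics(bad))
```

Returning the error messages, instead of a bare boolean, gives the repair attempt something concrete to append to the conversation.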

Result: numeric disagreement dropped below 2% on a 50-question golden set; p95 latency rose 800 ms (sandbox cold start) — acceptable for internal risk chat, not for sub-100 ms autocomplete. The team kept CoT for explanatory prose and PAL only for quant paths.

Technique decision table

Scenario | Prefer PAL | Prefer alternative
Multi-step arithmetic, percentages, unit conversion | Yes — deterministic interpreter | CoT fails on compound operations
Aggregations over uploaded CSV / JSON | Yes — pandas in sandbox | RAG alone cannot sum columns
Calling your order API or calendar | No | Function calling with typed handlers
Subjective summary of a long document | No | RAG or context compression
Must return strict JSON schema to downstream | Partial — code prints JSON, then validate | Grammar-constrained decoding or structured outputs
Open-ended multi-tool agent (search + code + email) | PAL as one tool inside loop | Full agent orchestration (LangGraph, etc.)
Latency < 200 ms, simple two-number add | No — sandbox overhead dominates | Client-side calc or tiny typed function

Common pitfalls

  • Running unsandboxed code — model-generated os.system is a remote-code-execution vulnerability; always isolate.
  • Trusting stdout format — parse JSON with a strict parser; do not regex-scrape free-form prints.
  • Floating-point money — prompt for Decimal or integer cents; in binary floating point, 0.1 + 0.2 does not equal 0.3 exactly.
  • Stale sandbox data — include the snapshot timestamp in the prompt so the model does not assume live prices.
  • Over-using PAL — routing every question through Python adds cost and latency; classify first.
  • Silent fallback to CoT — when sandbox fails, do not guess; show error state to the user.
  • Unbounded Monte Carlo — model may write 10M iterations; cap in prompt and kill by CPU time.
  • Leaking secrets into prompts — never embed API keys for the model to paste into generated code.
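
The floating-point money pitfall is easy to demonstrate. The sketch below reuses the markup-and-discount problem from earlier in this guide with decimal.Decimal. The helper name final_price and the choice of ROUND_HALF_UP are assumptions; a real desk would follow its own rounding policy.

```python
from decimal import Decimal, ROUND_HALF_UP

# Binary floats cannot represent 0.1, 0.2, or 0.3 exactly.
print(0.1 + 0.2 == 0.3)                                      # False
print(Decimal("0.10") + Decimal("0.20") == Decimal("0.30"))  # True


def final_price(wholesale: str, markup: str, discount: str) -> Decimal:
    """Apply a markup, then a discount, and round to whole cents."""
    price = Decimal(wholesale) * (1 + Decimal(markup)) * (1 - Decimal(discount))
    return price.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


print(final_price("47.50", "0.22", "0.15"))  # 49.26
```

Passing amounts as strings matters: Decimal(0.1) would inherit the float's representation error, while Decimal("0.1") is exact.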

Production checklist

  • Define routing rules: which intents trigger PAL vs RAG vs function calling.
  • Document allowed imports, data mount paths, and output format in the system prompt.
  • Deploy sandbox with network off, resource limits, and ephemeral containers.
  • Extract code from fenced blocks; reject multi-file or binary payloads.
  • Parse and validate structured stdout before the narrative final turn.
  • Implement retry-on-traceback with a max attempt count.
  • Log source, stdout, stderr, and runtime for audit and debugging.
  • Build a golden set of numeric questions with expected outputs; regression-test weekly.
  • Display provenance in UI (“Answer computed by executed Python”).
  • Monitor sandbox escape attempts, timeout rate, and cost per PAL request.
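
The golden-set item in the checklist can be automated with a small harness. The sketch below is one possible shape, not a prescribed one: run_golden_set, the case format, the relative tolerance, and fake_pipeline (a hypothetical stand-in for the full PAL pipeline) are all assumptions.

```python
import math
from typing import Callable


def run_golden_set(
    cases: list[dict],
    answer: Callable[[str], dict],
    rel_tol: float = 1e-4,
) -> list[tuple]:
    """Compare pipeline output with expected numbers; return the mismatches."""
    failures = []
    for case in cases:
        got = answer(case["question"])
        for key, want in case["expected"].items():
            value = got.get(key)
            if not isinstance(value, (int, float)) or not math.isclose(value, want, rel_tol=rel_tol):
                failures.append((case["question"], key, want, value))
    return failures


# Made-up golden cases reusing the word problem from earlier in the guide.
cases = [
    {"question": "22% markup then 15% discount on $47.50?", "expected": {"final_price": 49.26}},
    {"question": "What is 21 * 2?", "expected": {"product": 42}},
]


def fake_pipeline(question: str) -> dict:
    """Hypothetical stand-in for prompt -> code -> sandbox -> parsed JSON."""
    if "markup" in question:
        return {"final_price": round(47.50 * 1.22 * 0.85, 2)}
    return {"product": 41}  # deliberately wrong to show a failure


failures = run_golden_set(cases, fake_pipeline)
print(failures)
```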

Key takeaways

  • PAL lets LLMs write code that a sandbox executes — offloading arithmetic and logic to a deterministic runtime.
  • Harbor Analytics cut numeric errors on portfolio what-ifs by routing quant questions through Python instead of chain-of-thought.
  • Security is non-negotiable: isolated containers, import allowlists, no network, strict timeouts.
  • Use PAL for math, tables, and simulation; use function calling for live APIs and RAG for document QA.
  • Validate stdout structurally and show computation provenance — never let the model silently revert to guessing.
