
Context engineering explained

Prompt engineering asks how to phrase an instruction. Context engineering asks a harder question: what information should the model see, in what order, and how much of it fits inside the budget? Every production chatbot, RAG pipeline, and autonomous agent is really a context assembler: it stitches system policies, retrieved documents, tool definitions, user history, and fresh user input into one token sequence the model attends over. Get the assembly wrong and the best prompt wording cannot save you: critical facts sit in the middle where models lose attention, stale conversation history drowns out fresh retrieval, or a 200-line tool schema eats half the window. This guide covers context layers, token budgeting and compression trade-offs, placement heuristics, agent-specific concerns, and how context engineering differs from prompt engineering, RAG, fine-tuning, and agent memory. It then walks through a worked example (the Harbor Support agent context stack), an approach decision table, common pitfalls, and a practitioner checklist. Read it alongside our agent memory guide.

What context engineering is

Context engineering is the discipline of designing the full input package an LLM receives at inference time — not just the final user message. In API terms, it spans everything that becomes tokens before generation: system messages, developer instructions, retrieved passages, structured metadata, prior turns, tool results, and safety preambles.

The goal is reliable behavior under constraints. Models have finite context windows; latency and cost scale with token count; and attention is uneven across long prompts. Context engineering treats the window as scarce real estate and allocates it deliberately rather than dumping every available string into the prompt and hoping attention finds the right needle.

How it differs from adjacent practices

  • Prompt engineering — wording, tone, output format instructions, chain-of-thought triggers. Context engineering decides which blocks exist; prompt engineering polishes how each block reads.
  • RAG — retrieval is one source of context. Context engineering covers retrieval plus placement, deduplication, citation formatting, and what to drop when retrieval returns too much.
  • Fine-tuning — bakes knowledge into weights. Context engineering keeps knowledge external and updatable without retraining; the trade-off is window limits and assembly complexity.
  • Agent memory — long-horizon storage across sessions. Context engineering defines what memory gets promoted into the active window each turn.

The standard context stack

Most production agents follow a layered stack. Order matters because models often weight early and late positions more heavily than the middle — the lost-in-the-middle phenomenon documented in retrieval research.

Recommended layer order (top of the prompt to bottom)

  1. System / developer policy — role, safety rules, output schema, refusal boundaries. Keep stable across turns so prompt caching can reuse the prefix.
  2. Tool and function definitions — JSON schemas, allowed actions, parameter constraints. Trim descriptions; split rarely-used tools behind a router model if the schema exceeds ~2k tokens.
  3. Retrieved knowledge — RAG chunks, knowledge-base articles, policy excerpts. Tag each chunk with source ID and recency for citation and debugging.
  4. Episodic / session memory — summarized prior turns, user preferences extracted earlier, open ticket state. Prefer structured bullet summaries over raw chat logs.
  5. Few-shot demonstrations — when using in-context learning, place examples after policies and retrieval so the model knows constraints before pattern-matching.
  6. Current user message — the freshest signal; placing it last mirrors natural dialogue and keeps the model focused on the latest intent.

When the stack overflows the budget, drop in reverse priority: trim oldest chat turns first, then reduce retrieved chunk count, then shorten tool docs — never silently truncate the system policy block.
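
The layer order and drop policy above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: `count_tokens` is a whitespace stand-in for the model's real tokenizer, `assemble` and its parameters are hypothetical names, the memory and few-shot layers are left out for brevity, and removing the last tool doc stands in for the "shorten tool docs" step. Chunks and tool docs are assumed to arrive most-important-first.

```python
def count_tokens(text: str) -> int:
    # Assumption: whitespace split as a placeholder for the model tokenizer.
    return len(text.split())


def assemble(system, tools, chunks, turns, user_msg, budget):
    """Return (prompt, dropped): layers in stack order, trimmed in reverse priority."""
    tools, chunks, turns = list(tools), list(chunks), list(turns)
    dropped = []

    def render():
        # Policy first, then tools, retrieval, history, and the user message last.
        parts = [system, *tools, *chunks, *turns, user_msg]
        return "\n\n".join(p for p in parts if p)

    while count_tokens(render()) > budget:
        if turns:
            dropped.append(("turn", turns.pop(0)))    # oldest chat turn first
        elif chunks:
            dropped.append(("chunk", chunks.pop()))   # then the lowest-ranked chunk
        elif tools:
            dropped.append(("tool", tools.pop()))     # then the least important tool doc
        else:
            # The system policy is never truncated silently: fail loudly instead.
            raise ValueError("system policy and user message exceed the budget")
    return render(), dropped
```

Returning the `dropped` list alongside the prompt is a deliberate choice in this sketch: it gives the caller something to log whenever the budget guard fires.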

Token budgeting and compression

Context engineering is budgeting. Before each inference call, estimate tokens per layer (use the model tokenizer, not character counts) and reserve headroom for the model’s reply — typically 20–40% of the window for multi-step agents.
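
As a rough sketch of that arithmetic, the helper below reserves a share of the window for the reply and reports per-layer counts against what is left. The function names are hypothetical, and `count_tokens` is again a whitespace placeholder that should be swapped for the model's own tokenizer before the numbers mean anything.

```python
def count_tokens(text: str) -> int:
    # Assumption: whitespace split as a placeholder for the model tokenizer.
    return len(text.split())


def input_budget(window: int, reply_fraction: float) -> int:
    """Tokens left for input after reserving a share of the window for the reply."""
    if not 0.0 <= reply_fraction < 1.0:
        raise ValueError("reply_fraction must be in [0, 1)")
    return window - round(window * reply_fraction)


def layer_report(layers: dict, window: int, reply_fraction: float) -> dict:
    """Count tokens per layer and check the total against the input budget."""
    counts = {name: count_tokens(text) for name, text in layers.items()}
    total = sum(counts.values())
    budget = input_budget(window, reply_fraction)
    return {"per_layer": counts, "total": total, "budget": budget, "fits": total <= budget}
```

For example, `input_budget(128_000, 0.25)` leaves 96,000 tokens for input on a 128k window.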

Compression techniques

  • Recursive summarization — compress older turns into a rolling summary stored in episodic memory; inject only the summary plus the last N verbatim turns.
  • Chunk selection — rerank retrieved passages with a cross-encoder (see the reranking guide) instead of stuffing top-k vector hits.
  • Structured extraction — replace a 2,000-token PDF excerpt with a 200-token JSON fact table when the task needs fields, not prose.
  • Tool result pruning — agents that call APIs should pass distilled results forward (IDs, status, key fields) rather than full JSON payloads on every subsequent turn.
  • Semantic caching — reuse cached retrieval results for repeated or near-identical queries instead of re-running retrieval (see semantic caching).
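
The recursive summarization item can be illustrated with a small helper that folds everything except the last N turns into a rolling summary. This is a sketch only: `compact_history` is a hypothetical name, and `stub_summarize` is a placeholder for what would normally be an LLM summarization call.

```python
from typing import Callable


def compact_history(
    turns: list,
    summary: str,
    keep_last: int,
    summarize: Callable[[str, list], str],
) -> tuple:
    """Fold all but the last keep_last turns into the rolling summary."""
    cut = max(len(turns) - keep_last, 0)
    older, recent = turns[:cut], turns[cut:]
    if older:
        summary = summarize(summary, older)
    # Store both return values: `recent` replaces the turn list so that
    # already-summarized turns are not summarized again next time.
    return summary, recent


def stub_summarize(summary: str, older: list) -> str:
    # Hypothetical stand-in for a model call: keeps the first five words per turn.
    clipped = [" ".join(turn.split()[:5]) for turn in older]
    return " / ".join(part for part in [summary, *clipped] if part)
```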

Compression always trades fidelity for space. Log what was dropped and measure answer quality when you tighten budgets — observability should capture per-layer token counts.
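
One way to make "log what was dropped" concrete is to have each compression step return what it removed. The sketch below applies that idea to tool result pruning; `distill_tool_result` and the example field names are hypothetical, not part of any particular API.

```python
def distill_tool_result(payload: dict, keep: tuple) -> tuple:
    """Keep only the listed fields and report which ones were dropped."""
    kept = {key: payload[key] for key in keep if key in payload}
    dropped = sorted(key for key in payload if key not in kept)
    return kept, dropped


# Example: forward only the ID and status, and record the rest for the trace log.
raw = {"order_id": "A1", "status": "shipped", "raw_events": [1, 2, 3], "debug": {}}
kept, dropped = distill_tool_result(raw, keep=("order_id", "status"))
```

The `dropped` list is what would go into the trace, so a later quality regression can be tied back to a specific field that stopped reaching the model.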

Placement heuristics that actually matter

Research and production experience converge on a few reliable heuristics:

  • Put must-follow rules at the top — safety policies, legal disclaimers, and output format requirements belong in the system block, stated once there and not copied into other layers, where drifting versions can contradict each other.
  • Put the question near the bottom — the current user query and any task-specific instructions should appear immediately before the model generates, reducing distraction from middle clutter.
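  • Duplicate critical facts at both ends for long RAG — if you must include a 4k-token policy doc, add a one-line recap of its key constraints just before the doc and again after retrieval, immediately before the user message, so the constraints sit in the high-attention zones at either end.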
  • Separate sources visually — use clear delimiters (--- Source: KB-1042 ---) so the model can attribute claims and you can debug retrieval mistakes.
  • Avoid contradictory layers — if the system prompt says “never guess account balances” but retrieved chunks contain outdated balance text, the model may still repeat the outdated number. Refresh or filter retrieval when structured data exists.
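  • Stable prefixes for caching — keep system + tools byte-identical across requests to maximize prefix cache hits on providers that support it. Static few-shots can join the cached prefix only if they sit ahead of per-request layers such as retrieval; anything placed after changing content is not part of a reusable prefix.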
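
Three of these heuristics (source delimiters, question at the bottom, and a byte-stable prefix) can be sketched together. The function names are hypothetical, and the SHA-256 fingerprint is only an assumed way to detect accidental prefix drift in logs; it is not something a provider's cache requires.

```python
import hashlib


def format_chunk(source_id: str, text: str) -> str:
    # Delimiter style taken from the heuristic above: --- Source: KB-1042 ---
    return f"--- Source: {source_id} ---\n{text.strip()}"


def build_prompt(prefix: str, chunks: list, question: str) -> str:
    """Stable prefix first, delimited sources in the middle, question last."""
    sources = "\n\n".join(format_chunk(sid, text) for sid, text in chunks)
    return "\n\n".join([prefix, sources, question])


def prefix_fingerprint(prefix: str) -> str:
    # Any byte-level change (even trailing whitespace) yields a new fingerprint.
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()
```

Logging the fingerprint per request makes it easy to spot when a template change has quietly altered the prefix and cache hits are likely to fall.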

Context engineering for agents vs single-shot chat

Single-turn Q&A apps assemble context once. Agents reassemble every loop iteration — and each tool call appends new tokens. Agent context engineering adds:

  • Per-step budgets — cap tool-result history; archive steps older than K into a running “mission summary.”
  • State outside the window — ticket IDs, file paths, and database keys live in a sidecar store; only human-readable summaries enter the prompt.
  • Tool routing context — a lightweight classifier or small model chooses which tool subset to inject, avoiding 15-tool schemas when the user only asked for a refund status check.
  • Verification context — when using test-time compute, each verifier pass needs a trimmed context with only claims to check, not the full agent trace.
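
The per-step budget idea can be sketched as follows. It assumes each step is a small dict with `tool` and `outcome` keys, which is a hypothetical shape chosen for illustration; a real agent would likely summarize archived steps with a model rather than one line per step.

```python
def archive_old_steps(steps: list, mission_summary: str, keep_last: int) -> tuple:
    """Keep the last keep_last tool steps verbatim; fold older ones into the summary."""
    cut = max(len(steps) - keep_last, 0)
    archived, recent = steps[:cut], steps[cut:]
    # One terse line per archived step; full payloads stay in the sidecar store.
    lines = [f"- {step['tool']}: {step['outcome']}" for step in archived]
    summary = "\n".join(part for part in [mission_summary, *lines] if part)
    return summary, recent
```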

Multi-agent systems (see the orchestration guide) compound the problem: handoffs should pass structured handoff packets, not entire sub-agent transcripts.
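
A handoff packet might look something like the dataclass below. The field names are assumptions made for illustration; the point is only that the receiving agent gets a compact, structured record with references to external state and no transcript.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class HandoffPacket:
    task: str                                            # what the next agent should do
    findings: list                                       # conclusions reached so far
    open_questions: list = field(default_factory=list)   # what is still unresolved
    state_refs: dict = field(default_factory=dict)       # IDs into the sidecar store

    def render(self) -> str:
        """Serialize deterministically so identical packets produce identical text."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)
```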

Worked example: Harbor Support agent context stack

Harbor Support routes billing and shipping tickets for a fictional logistics company. Each agent turn assembles context as follows (128k window model, target 12k input tokens, 4k reserved for output and tool loops):

  1. System block (1,100 tokens, cached) — role (“Harbor Support tier-1 agent”), refusal rules (no legal advice, no raw credit-card storage), output JSON schema for ticket updates, escalation triggers.
  2. Tool schemas (900 tokens, cached subset) — only lookup_order, issue_refund, and escalate_ticket for billing intents; shipping tools omitted until intent classifier scores > 0.7 for logistics.
  3. Retrieved KB (2,400 tokens) — top 4 chunks after hybrid search + rerank on the user’s order ID and issue keywords; each chunk prefixed with [KB-####] and published date.
  4. CRM snapshot (350 tokens) — structured JSON rendered as bullets: customer tier, open tickets, last refund date. Pulled from API, not from stale RAG.
  5. Session summary (450 tokens) — rolling summary of prior turns in this chat; last 2 user/assistant turns kept verbatim.
  6. Current message (variable) — user text plus any uploaded attachment captions.
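
The tool-schema rule in step 2 could be expressed roughly like this. The three billing tool names and the 0.7 threshold come from the list above; `track_shipment` is a hypothetical shipping tool, and `intent_scores` is assumed to be the output of whatever intent classifier is in use.

```python
BILLING_TOOLS = ("lookup_order", "issue_refund", "escalate_ticket")
SHIPPING_TOOLS = ("track_shipment",)  # hypothetical name; the example does not list shipping tools
LOGISTICS_THRESHOLD = 0.7


def select_tools(intent_scores: dict) -> tuple:
    """Always inject billing tools; add shipping tools only above the threshold."""
    tools = list(BILLING_TOOLS)
    if intent_scores.get("logistics", 0.0) > LOGISTICS_THRESHOLD:
        tools.extend(SHIPPING_TOOLS)
    return tuple(tools)
```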

Overflow policy: if retrieval plus history exceeds 5k tokens, drop the oldest KB chunk first, then compress the session summary. Never drop the CRM snapshot in refund flows. Quality check: sample 50 traces per week in LangSmith with per-layer token tags, and flag tickets where the model cited a KB chunk that the budget guard had dropped.
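
A budget guard for this policy might be sketched as below. Several things here are assumptions: "oldest" is read as earliest published date, only one chunk is dropped before the summary is compressed, and `count_tokens` and `compress` are caller-supplied placeholders for the real tokenizer and summarizer. The CRM snapshot is never passed in, so this guard cannot drop it.

```python
from datetime import date

RETRIEVAL_PLUS_HISTORY_LIMIT = 5_000  # threshold from the overflow policy


def apply_overflow_policy(kb_chunks, session_summary, count_tokens, compress,
                          limit=RETRIEVAL_PLUS_HISTORY_LIMIT):
    """Return (chunks, summary, dropped_ids) after applying the overflow policy."""
    chunks = list(kb_chunks)
    dropped_ids = []

    def used() -> int:
        return sum(count_tokens(c["text"]) for c in chunks) + count_tokens(session_summary)

    if used() > limit and chunks:
        # Assumption: "oldest" means the earliest published date.
        oldest = min(chunks, key=lambda c: c["published"])
        chunks.remove(oldest)
        dropped_ids.append(oldest["id"])
    if used() > limit:
        session_summary = compress(session_summary)
    return chunks, session_summary, dropped_ids


def cited_dropped(answer: str, dropped_ids: list) -> list:
    """Flag dropped chunks that the answer still cites via its [KB-####] tag."""
    return [chunk_id for chunk_id in dropped_ids if f"[{chunk_id}]" in answer]
```

Pairing the guard with `cited_dropped` mirrors the quality check: a non-empty result means the model cited a source it was never shown on that turn.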

Approach decision table

Problem | Context strategy | Why
Model ignores retrieved docs | Rerank + fewer chunks + query at bottom | Reduces middle noise; puts task in high-attention zone.
Long multi-turn chat degrades | Rolling summary + last N verbatim turns | Preserves recent detail without unbounded history.
Tool schema too large | Intent-based tool routing | Inject only relevant 2–4 tools per turn.
Stale factual answers | Structured CRM/API snapshot over RAG prose | Live data beats embedded documentation for volatile fields.
High cost per request | Stable cached prefix + semantic cache | System/tools/few-shots reused; identical queries skip retrieval.
Needs worked examples | ICL block after policies, before user msg | Model learns format within constraint envelope.
Cross-session personalization | Vector memory retrieve top-3 facts into episodic layer | Long-term prefs without full transcript replay.
Regulated attribution | Source-tagged chunks + cite-or-abstain rule in system | Traceable answers for compliance review.

Common pitfalls

  • Treating RAG as dump-and-pray — retrieving 20 chunks makes it likely that middle-position facts get ignored; quality beats quantity.
  • Duplicating instructions — the same rule in system, developer, and user messages creates conflict when versions drift.
  • Raw chat logs forever — unbounded history pushes retrieval out of the window and increases cost linearly.
  • Full tool JSON in every turn — agents accumulate megabyte traces; distill between steps.
  • Ignoring tokenizer reality — markdown tables and code blocks tokenize differently than prose; budget with real counts.
  • No observability per layer — without token tags you cannot tell whether failures are retrieval, placement, or wording.
  • Contradictory context sources — KB says 30-day returns, system says 14-day; the model may follow either one, or blend them, unpredictably.
  • Over-relying on huge windows — 200k context does not fix attention skew; assembly discipline still wins on accuracy and cost.

Practitioner checklist

  • Document the layer order and overflow drop policy for each product surface.
  • Measure tokens per layer on a representative trace set (p50 and p95).
  • Reserve explicit output headroom; never fill 100% of the window on input.
  • Tag every retrieved chunk with source ID and timestamp.
  • Keep system + static tools in a cacheable prefix; version it in git.
  • Implement rolling session summaries after turn 6–8.
  • Rerank retrieval before injection; default to 3–6 chunks, not 20.
  • Route tools by intent when the schema exceeds ~2k tokens.
  • Log dropped layers when budget guards fire; alert if drop rate spikes.
  • A/B test placement changes (query at bottom vs top) on grounded QA sets.
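
The "measure tokens per layer" item can be sketched with a nearest-rank percentile over logged traces. It assumes each trace is a dict mapping layer name to token count, which is a hypothetical log shape; layers missing from a trace are counted as zero.

```python
def nearest_rank(values: list, pct: int) -> int:
    """Nearest-rank percentile for a non-empty list and an integer pct in 1..100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceil of pct/100 * n
    return ordered[rank - 1]


def layer_percentiles(traces: list) -> dict:
    """Return p50 and p95 token counts for every layer seen in the traces."""
    layer_names = sorted({name for trace in traces for name in trace})
    stats = {}
    for name in layer_names:
        counts = [trace.get(name, 0) for trace in traces]
        stats[name] = {"p50": nearest_rank(counts, 50), "p95": nearest_rank(counts, 95)}
    return stats
```

Comparing p95 against the per-layer budget is usually more informative than the average, since it is the long-tail requests that trigger the overflow policy.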

Key takeaways

  • Context engineering designs the full token package — not just prompt wording.
  • Layer order matters: policy and tools early, retrieval and memory mid-stack, current query last.
  • Token budgeting is mandatory; compression trades fidelity and must be measured.
  • Agents need per-step pruning and external state; context is reassembled every loop.
  • Pair assembly discipline with observability so you know which layer failed when answers go wrong.

Related reading