Guide

Prompt engineering explained: writing effective LLM prompts

Prompt engineering is the craft of writing instructions, examples, and constraints so a large language model (LLM) produces useful, consistent output. It is not magic wording — it is interface design for a probabilistic system. A well-structured prompt can make a smaller, cheaper model outperform a larger one on a narrow task. A vague prompt wastes tokens, hallucinates facts, and breaks downstream parsers. This guide covers the core techniques — zero-shot and few-shot prompting, chain-of-thought, system prompts, structured output, temperature tuning, and how to evaluate prompts before you ship them to users.

What prompt engineering is (and is not)

Every LLM call is a conditional text completion: the model reads your prompt and predicts the most likely continuation. Prompt engineering shapes that continuation by giving the model role, task, format, and boundaries. It is not a substitute for retrieval when facts change daily (use RAG for grounded answers), and it is not security (see prompt injection defenses separately). Think of prompts as API contracts: they define inputs, expected outputs, and failure modes.

Production teams version prompts like code. When a model upgrade shifts behavior, you re-run eval suites — not guess from one chat session. The techniques below stack: start with a clear instruction, add examples if accuracy is low, add reasoning steps for math and logic, then constrain output format for parsers.

Core prompt structure

Most reliable prompts separate concerns into four blocks. You do not always need all four, but skipping them without reason invites drift:

Role and context — who the model is and what domain applies ("You are a technical editor reviewing API docs").
Task — one primary verb and object ("Summarize the following release notes in three bullet points").
Constraints — length, tone, things to avoid, citation rules ("Do not invent CVE numbers; say 'unknown' if absent").
Output format — plain prose, Markdown, JSON schema, or a numbered list the app can regex-parse.

Put the task near the end of the instruction block, immediately before the user content. Models pay more attention to recent tokens; burying the actual ask under paragraphs of background reduces compliance. For long documents, use delimiters — triple quotes, XML tags, or markdown headings — so the model knows where instructions end and data begins.

Zero-shot prompting

Zero-shot means you describe the task with no worked examples. Modern instruction-tuned models handle many zero-shot tasks well: classification, translation, tone adjustment, simple extraction. Zero-shot works when the task matches pretraining distribution — "translate English to Spanish" needs no demo.

Zero-shot fails when the output shape is unusual or the label set is proprietary. If you need sentiment labels like escalate, monitor, close instead of positive/negative, the model has no prior anchor. Either define each label in one sentence or switch to few-shot examples.

When zero-shot is enough

Single-step transformations with obvious success criteria.
Tasks where wrong format is cheap to detect and retry.
Exploratory chat where users tolerate variation.

Few-shot prompting

Few-shot prompts include one to five input/output pairs before the real input. Examples teach label semantics, formatting, and edge-case handling faster than paragraphs of rules. Order matters: put the most representative examples first; include at least one tricky case (empty input, ambiguous phrasing) if your production traffic has them.

Each example costs tokens. Five long examples can consume a significant fraction of your context window, leaving less room for user content. Prefer concise pairs; use dynamic example selection (retrieve the k most similar solved cases from a library) when tasks vary widely. Avoid contradictions — if example 2 uses JSON and example 4 uses prose, the model will flip a coin.

Chain-of-thought and step-by-step reasoning

Chain-of-thought (CoT) asks the model to show intermediate reasoning before the final answer. Phrases like "think step by step" or explicit numbered steps improve accuracy on math, multi-hop logic, and policy checks — at the cost of longer outputs and higher latency. For user-facing apps, you can hide the scratchpad: instruct the model to reason inside <thinking> tags and return only the conclusion outside them.

CoT is not free reliability. Models can produce plausible-sounding chains that still conclude incorrectly ("confident reasoning, wrong answer"). For high-stakes decisions, combine CoT with tool use (calculator, code execution, database lookup) or a second verification pass. In agent architectures, the planner step often uses CoT while tool calls ground facts.

System prompts vs user messages

Chat APIs split prompts into system, user, and sometimes assistant roles. The system message sets persistent behavior: persona, safety boundaries, default format, and tools available. User messages carry the variable task. Keep system prompts stable and short; churning them per request makes A/B testing harder.

Good system prompts are specific and testable: "Answer in under 120 words" beats "be concise." State what to do when information is missing ("say you don't know") rather than only listing prohibitions. If you inject retrieved documents into the user turn, tell the system message how to treat them: cite by title, prefer retrieved text over parametric knowledge, refuse when context is empty.

Structured output and JSON mode

Apps that call functions, write to databases, or render UI need predictable structure. Options include:

JSON schema in the prompt — describe fields, types, and enums; ask for JSON only, no preamble.
Native structured output / JSON mode — API features that constrain decoding to valid JSON (reduces but does not eliminate schema drift).
Tool / function calling — model emits a typed function invocation your runtime executes.

Always validate with a parser (JSON.parse, Zod, Pydantic) and retry with a repair prompt on failure ("your last response was invalid JSON; fix keys X and Y"). Never trust the model to respect numeric ranges or enum sets without server-side validation. For nested objects, show a minimal valid example in the prompt — smaller models mimic shape better than they infer schema from prose descriptions alone.

Temperature, top-p, and determinism

Temperature scales randomness in token sampling. Low temperature (0–0.3) favors the highest-probability tokens — use for extraction, classification, code generation, and anything parsed mechanically. Higher temperature (0.7–1.0) increases variety — use for brainstorming, marketing copy, and creative drafts. Top-p (nucleus sampling) limits choices to a cumulative probability mass; it interacts with temperature and is often set once per task type.

For reproducible evals, fix temperature to 0 and record the model version string. "Same prompt, different day" can shift if the provider silently updates weights. Pin model IDs in config and re-run golden tests after upgrades. When you need diverse samples (e.g., generate five taglines), raise temperature but keep structure fixed via schema.

Evaluation habits that scale

Prompt engineering without measurement becomes folklore. Build a small eval set — 20–200 real inputs with human-reviewed expected outputs or rubric scores. Track pass rate, format validity, latency, and token cost per task. Regression-test prompts in CI the same way you test code: a drop in exact-match accuracy blocks deploy.

LLM-as-judge — a second model scores answers against criteria; cheap for subjective quality, risky if uncorrelated with human judgment.
Golden snapshots — store expected JSON for stable tasks; diff on change.
Failure buckets — tag errors (format, hallucination, refusal, tone) to guide the next prompt iteration.

When accuracy plateaus, consider fine-tuning for a fixed format or domain dialect — but exhaust prompt and retrieval fixes first; they are cheaper and reversible.

Common mistakes

Combining five tasks in one prompt — split into a pipeline of single-purpose calls.
Negative-only constraints ("don't hallucinate, don't be verbose") without positive instructions.
Examples that disagree with written rules — models follow examples over prose.
Assuming the model remembers earlier chat turns beyond the context window.
Shipping prompts edited only in a playground, never version-controlled.
Using high temperature for structured extraction, then blaming the parser.
Omitting "if unknown, say so" — models fill gaps with confident fiction.

Starter checklist

One clear task verb; constraints and output format explicit.
Delimiters around untrusted user content.
Few-shot examples if labels or format are non-standard.
Temperature 0 for machine-readable outputs; validate with a schema.
System prompt versioned in git; model ID pinned in config.
Eval set with pass-rate threshold before production changes.
RAG or tools for factual queries; prompts alone do not guarantee truth.