
LLM reasoning models explained

Harbor Analytics’ overnight fraud-scoring pipeline flagged 3.2% of card transactions for manual review. Analysts loved the precision on obvious fraud rings, but multi-hop cases — “merchant changed MCC last week, buyer IP matches a prior chargeback cluster, but velocity is below the 90th percentile” — stumped a standard GPT-4o call even with chain-of-thought prompting. The team routed only the ambiguous 18% of flagged tickets through OpenAI’s o1-preview reasoning model. False-positive rate on that slice dropped from 41% to 19%, and analysts recovered 6.4 analyst-hours per night, at roughly 8× token cost versus the base model. The gain was not “smarter weights” alone — o1 spent hundreds of hidden thinking tokens deliberating before emitting a structured verdict.

Reasoning models (OpenAI o1/o3, DeepSeek-R1, QwQ, and similar) are LLMs trained and served to spend extra inference compute on internal deliberation, often a hidden chain of thought the user never sees, before producing a final answer. Unlike bolting “think step by step” onto a chat model at prompt time, reasoning-native models learn via reinforcement learning on verifiable outcomes (math checks, unit tests, compiler feedback) to use long scratchpad trajectories productively. This guide covers the training stack, thinking budgets and API parameters, the Harbor Analytics refactor, a decision table comparing reasoning models with alternatives such as test-time compute overlays and program-aided language, common pitfalls, and a production checklist.

What makes a model “reasoning-native”

Three properties distinguish reasoning models from general chat LLMs with CoT prompts:

  1. Extended internal trajectory: the model generates a long hidden reasoning trace (thinking tokens) before the visible completion. APIs expose this as a separate reasoning output block or bill it as reasoning tokens distinct from completion tokens.
  2. RL on verifiable rewards (RLVR): post-training uses reinforcement learning where the reward is checkable: exact match on math answers, pass/fail on code tests, JSON schema validity, or domain-specific validators. A trajectory that ends in a wrong answer receives low reward even if its prose reads as plausible.
  3. Inference-time scaling baked in: the model is optimized to benefit from more thinking budget. Reducing max thinking tokens hurts hard tasks disproportionately, and increasing the budget helps up to a plateau. Chat models, by contrast, often spend extra tokens on repetition rather than progress.

DeepSeek-R1 demonstrated that a strong open-weights reasoning model can emerge from cold-start SFT on chain-of-thought data, followed by large-scale RL with rule-based graders on math and code. OpenAI’s o-series keeps architecture details private but exhibits the same user-visible pattern: latency rises, reasoning-token counts spike on hard prompts, and math, logic, and planning benchmarks jump relative to comparable chat models.

Training pipeline: from CoT data to RLVR

Production reasoning models typically pass through four stages:

1. Base pretraining

Standard next-token prediction on broad corpora. No special sauce yet — the base must already possess world knowledge and syntax.

2. Supervised fine-tuning on long reasoning traces

Curated or distilled chain-of-thought examples teach the model to emit structured deliberation: restate the problem, plan sub-steps, carry intermediate values, self-check before answering. Quality beats quantity; 10k excellent traces can outperform 1M noisy scrapes.

3. Reinforcement learning with verifiable rewards

The model samples many completions per prompt; a grader assigns reward (+1 correct, 0 wrong, partial credit optional). Policy-gradient methods (PPO, GRPO, or variants) push the model toward trajectories that consistently reach correct final answers. Process reward models (PRMs) can score each intermediate step rather than only the final answer, which reduces reward hacking where the model lands on a correct number through a lucky guess.
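The sampling-and-grading loop can be illustrated with a small sketch. It assumes an exact-match grader on the last line of each completion and the group-normalized advantage commonly associated with GRPO-style training; the function names and sample strings are hypothetical, and a real trainer would feed these advantages into a policy-gradient update that is not shown here.

```python
from statistics import mean, pstdev


def exact_match_reward(completion: str, reference: str) -> float:
    """Verifiable reward: 1.0 if the completion's last line equals the reference, else 0.0."""
    lines = completion.strip().splitlines()
    final = lines[-1].strip() if lines else ""
    return 1.0 if final == reference.strip() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one prompt's sample group (mean 0, roughly unit scale)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical samples for one prompt whose reference answer is "4".
samples = ["2x = 8, so x = 4\n4", "2x = 8, so x = 5\n5", "just guessing\n4", "no idea"]
rewards = [exact_match_reward(s, "4") for s in samples]
advantages = group_relative_advantages(rewards)
```

Note that the third sample earns full reward without any real derivation. An outcome-only grader cannot tell it apart from the first sample, which is the gap that step-level scoring is meant to narrow.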

4. Alignment and safety pass

Reasoning models can hide harmful planning in thinking tokens. Vendors apply refusal training, thinking-token monitoring, and output filters before release. Self-hosted R1-class weights need the same governance you apply to frontier chat models.

Serving reasoning models: budgets, latency, and APIs

Operating reasoning models in production differs from standard chat endpoints:

  • Thinking budget — cap reasoning tokens (e.g., max_completion_tokens split between reasoning and answer on OpenAI o1). Too low: model rushes and accuracy drops on hard tasks. Too high: latency and cost balloon with diminishing returns.
  • Hidden vs visible CoT — most hosted APIs hide thinking content for safety and competitive reasons. You pay for reasoning tokens but cannot debug step-by-step unless the provider exposes summaries. Self-hosted R1 can stream <think> blocks — useful for audit, risky for leaking secrets.
  • Structured outputs — reasoning models pair well with JSON schema constraints on the final answer only; do not schema-bind thinking tokens or you fight the deliberation format.
  • Parallelism limits — thinking is inherently serial; batch throughput per GPU is lower than chat models of similar size. Queue hard reasoning jobs separately from low-latency chat.
  • Temperature — vendors often fix temperature at 1 for o-series; tuning may be unavailable. Plan eval around provider defaults.
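For self-hosted models that emit their deliberation inline, one way to keep the trace out of downstream consumers is to split it from the answer before anything is logged or parsed. The sketch below assumes the output wraps thinking in literal <think>...</think> tags, as described above; the function name is hypothetical, and it does not handle a trace that was truncated before its closing tag.

```python
import re

# Matches each complete <think>...</think> block, including newlines inside it.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(raw: str) -> tuple[str, str]:
    """Return (thinking, answer) from raw model output with optional <think> blocks."""
    thinking = "\n".join(block.strip() for block in THINK_RE.findall(raw))
    answer = THINK_RE.sub("", raw).strip()
    return thinking, answer


raw = '<think>velocity is low, but IP matches a prior cluster</think>\n{"verdict": "review"}'
thinking, answer = split_thinking(raw)
```

Only `answer` would then go to the JSON parser and application logs, while `thinking` can be routed to a restricted audit store or dropped.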

Cost model: if a chat call costs $0.003 and o1 costs $0.024 on the same prompt, routing only the 15% of requests that fail a cheap classifier preserves blended unit economics. Log reasoning-token counts per task type to learn where budget pays off.
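The arithmetic behind that claim can be checked with a few lines. This sketch assumes the classifier's own cost is negligible and that a routed request pays only the reasoning price; if the chat call also runs on routed requests, add its cost to the routed branch.

```python
def blended_cost(chat_cost: float, reasoning_cost: float, routed_fraction: float) -> float:
    """Average cost per request when only `routed_fraction` goes to the reasoning model."""
    if not 0.0 <= routed_fraction <= 1.0:
        raise ValueError("routed_fraction must be between 0 and 1")
    return (1.0 - routed_fraction) * chat_cost + routed_fraction * reasoning_cost


# Figures from the paragraph above: $0.003 chat, $0.024 reasoning, 15% routed.
cost = blended_cost(chat_cost=0.003, reasoning_cost=0.024, routed_fraction=0.15)
```

Under these assumptions the blended figure works out to about $0.00615 per request: roughly twice the chat-only price, and about a quarter of what sending everything to the reasoning model would cost.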

Harbor Analytics refactor (worked example)

Problem: Fraud review queue mixed rule hits (high precision) with model-escalated edge cases. GPT-4o with explicit CoT prompt reached 76% agreement with senior analysts on the edge slice; disagreements clustered on multi-condition logic and arithmetic on rolling windows.

Architecture:

  1. Tier 0: deterministic rules (block/allow) handle 82% of volume.
  2. Tier 1: GPT-4o mini scores risk 0–100 with JSON output; scores below the ambiguous band are auto-released.
  3. Tier 2: only scores in the ambiguous band 45–72 (0.45–0.72 of the 0–100 scale) route to o1-preview with reasoning_effort=high and a schema: { verdict, confidence, factor_breakdown[], recommended_action }.
  4. Tier 3: human analyst if o1 confidence < 0.7 or regulatory flag present.
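The routing decisions in this cascade can be sketched as two small functions. This is an illustration rather than Harbor's actual code: it assumes the ambiguous band is read on Tier 1's 0–100 scale as 45 to 72, and it assumes scores above the band go straight to an analyst, which the description above does not specify. All names are hypothetical.

```python
from dataclasses import dataclass

AMBIGUOUS_LOW, AMBIGUOUS_HIGH = 45, 72  # assumed: band expressed on the 0-100 risk scale
MIN_CONFIDENCE = 0.7                    # Tier 3 threshold on the reasoning model's confidence


@dataclass
class Tier2Result:
    verdict: str
    confidence: float


def route_after_tier1(score: float) -> str:
    """Decide what happens to a transaction once Tier 1 has scored it."""
    if score < AMBIGUOUS_LOW:
        return "auto_release"
    if score <= AMBIGUOUS_HIGH:
        return "tier2_reasoning"
    return "tier3_human"  # assumption: above-band handling is not stated in the guide


def route_after_tier2(result: Tier2Result, regulatory_flag: bool) -> str:
    """Escalate to a human when confidence is low or a regulatory flag is present."""
    if regulatory_flag or result.confidence < MIN_CONFIDENCE:
        return "tier3_human"
    return "accept_verdict"
```

Keeping the thresholds as named constants makes it easier to log which rule fired and to re-tune the band when the golden set changes.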

Prompt design: Tier 2 prompts embed tabular features (30-day velocity, MCC history, device graph distance) as markdown tables — reasoning models excel when facts are explicit. The model must cite which factors drove the verdict in factor_breakdown; analysts use that for audit, not the hidden trace.
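Rendering features as a markdown table is straightforward to automate. The helper below is a minimal sketch; the feature names and the prompt wording are invented for illustration and are not Harbor's actual schema.

```python
def features_to_markdown(features: dict[str, object]) -> str:
    """Render a flat feature dict as a two-column markdown table."""
    lines = ["| feature | value |", "| --- | --- |"]
    for name, value in features.items():
        lines.append(f"| {name} | {value} |")
    return "\n".join(lines)


# Hypothetical feature names standing in for velocity, MCC history, and device graph distance.
table = features_to_markdown({
    "velocity_30d_txn_count": 14,
    "mcc_changed_last_7d": True,
    "device_graph_distance": 2,
})
prompt = "Assess this transaction using only the facts below.\n\n" + table
```

Because the same dict feeds both the prompt and any later validation of factor_breakdown, the model's cited factors can be compared against exactly what it was shown.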

Results (30-day A/B): Tier-2 false positives 41% → 19%; tier-2 false negatives 8% → 6%; p95 tier-2 latency 38s (within 60s SLA); blended cost per transaction +$0.0004 versus all-GPT-4o. Analysts reported higher trust because factor breakdowns matched their mental checklists.

Technique decision table

| Technique | Prefer when | Avoid when |
| --- | --- | --- |
| Reasoning model (o1, R1) | Multi-step logic, math, code, planning; verifiable answer; latency 10–60s OK | Creative writing, open-ended chat, sub-second UX, cost-sensitive bulk |
| CoT prompting on chat model | Moderate reasoning; need visible steps; provider has no reasoning tier | Chat model still fails after CoT + self-consistency on eval set |
| Test-time compute overlays | Control sampling yourself; mix verifiers and search; any base model | Want vendor-managed deliberation; team lacks eval harness for N-sample methods |
| Program-aided language (PAL) | Numeric or symbolic tasks executable in Python | Reasoning is qualitative policy synthesis, not calculable |
| Self-consistency | Same chat model, extractable final answer, budget for 5–10 samples | Answers are long-form prose; samples too correlated |
| Mixture of agents | Need diverse perspectives synthesized; regulated citation merging | Task is pure logic with one correct score; cost of 3+ full models excessive |
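Of the alternatives in the table, self-consistency is the simplest to prototype before committing to a reasoning tier. The sketch below assumes the final answers have already been extracted from N sampled completions; the sampling itself is not shown, and the function name is hypothetical.

```python
from collections import Counter


def majority_vote(answers: list[str]) -> tuple[str, float]:
    """Return the most common normalized answer and its share of the samples."""
    if not answers:
        raise ValueError("need at least one sample")
    counts = Counter(a.strip().lower() for a in answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)


# Five hypothetical extracted verdicts from the same chat model at nonzero temperature.
answer, agreement = majority_vote(["Fraud", "fraud ", "legit", "fraud", "legit"])
```

The agreement share doubles as a cheap uncertainty signal: a low value suggests the case belongs in the slice worth escalating.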

Common pitfalls

  • Routing everything to o1: bulk chat and retrieval QA rarely justify 8× cost; use a cascade classifier trained on your failure cases.
  • Assuming visible CoT equals hidden thinking: prompts that worked on GPT-4o may be redundant or harmful on reasoning models; A/B test them with the legacy CoT boilerplate removed.
  • No thinking budget tuning: the default maximum may be wasteful or too tight; sweep reasoning tokens on a golden set.
  • Ignoring unverifiable domains: RLVR-trained models shine where graders exist; open-ended legal opinions still need RAG and human review.
  • Latency surprise: serial thinking blows p99; set user expectations or use an async job pattern for tier 2.
  • Leakage via thinking blocks: self-hosted models that stream <think> may expose PII from context; redact or strip before logs.
  • Over-trusting factor JSON: models can hallucinate plausible breakdown fields; cross-check against input features programmatically.
  • Skipping regression when vendor updates: o1-preview to o1 GA changed token economics and behavior; pin versions and re-eval quarterly.
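The cross-check against input features can be a short programmatic pass over the model's output. This sketch assumes each factor_breakdown entry is an object with `feature` and `value` keys, which is a guess at a schema the guide does not spell out; adapt the key names to whatever your schema actually defines.

```python
def check_factor_breakdown(factors: list[dict], features: dict) -> list[str]:
    """Return a list of problems; an empty list means every cited factor matches the input."""
    problems = []
    for factor in factors:
        name = factor.get("feature")
        if name not in features:
            problems.append(f"unknown feature: {name}")
        elif factor.get("value") != features[name]:
            problems.append(
                f"value mismatch for {name}: cited {factor.get('value')}, input {features[name]}"
            )
    return problems


# Hypothetical input features and a model response that cites one invented and one wrong factor.
features = {"velocity_30d_txn_count": 14, "mcc_changed_last_7d": True}
factors = [
    {"feature": "velocity_30d_txn_count", "value": 14},
    {"feature": "ip_cluster_match", "value": True},
    {"feature": "mcc_changed_last_7d", "value": False},
]
problems = check_factor_breakdown(factors, features)
```

A non-empty result is a reasonable trigger for sending the ticket to the human queue rather than trusting the verdict.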

Production checklist

  • Build golden eval set with verifiable labels for your hardest task slice.
  • Benchmark chat model + CoT + self-consistency before adopting reasoning tier.
  • Train or rule-build a router that sends only ambiguous cases to reasoning model.
  • Define JSON or enum schema for final answers; leave thinking tokens unconstrained by the schema.
  • Set reasoning token budget per task type; log cost and latency per tier.
  • Implement timeout fallback to chat model or human queue.
  • Validate factor breakdowns against input data where possible.
  • Monitor false positive/negative rates separately per tier, not blended.
  • Document which data classes may enter reasoning APIs (PII, HIPAA).
  • Re-run eval when provider ships new reasoning model version.
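For the budget item in the checklist, one possible approach is to sweep a handful of budgets on the golden set and keep the smallest one that stays close to the best accuracy. In the sketch below, `accuracy_at` is a hypothetical callable standing in for your eval harness, and the accuracy numbers are made up purely to show the selection logic.

```python
from typing import Callable, Sequence


def pick_budget(
    budgets: Sequence[int],
    accuracy_at: Callable[[int], float],
    tolerance: float = 0.01,
) -> int:
    """Smallest budget whose golden-set accuracy is within `tolerance` of the best observed."""
    if not budgets:
        raise ValueError("need at least one budget to sweep")
    scores = {b: accuracy_at(b) for b in sorted(budgets)}
    best = max(scores.values())
    # The best-scoring budget always qualifies, so this search cannot come up empty.
    return next(b for b, s in scores.items() if s >= best - tolerance)


# Illustrative, invented measurements: reasoning-token budget -> golden-set accuracy.
measured = {1024: 0.71, 4096: 0.83, 8192: 0.86, 16384: 0.865}
chosen = pick_budget(list(measured), measured.__getitem__)
```

With these invented numbers the sweep settles on 8192 tokens, since doubling the budget again buys only half a point of accuracy. Repeat the sweep per task type, because the plateau tends to sit in different places for different workloads.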

Key takeaways

  • Reasoning models allocate hidden thinking tokens and are trained with RL on verifiable rewards — not just longer CoT prompts.
  • Harbor Analytics cut tier-2 fraud false positives from 41% to 19% by routing only ambiguous scores to o1-preview.
  • Use cascade routing: cheap chat for easy cases, reasoning model for the slice your eval proves is failing.
  • Tune thinking budget and measure reasoning-token spend; defaults are rarely optimal for your SLA.
  • Pair reasoning models with structured final outputs and programmatic checks — not blind trust in eloquent traces.
