Ollama fundamentals explained

Cloud LLM APIs are convenient until a ticket batch hits your rate limit at 2 a.m., or legal asks where customer prompts are stored. Ollama is an open-source runtime that downloads quantized model weights, serves them on localhost (or a private server), and exposes an OpenAI-compatible HTTP API so existing SDKs mostly work unchanged. It sits at the intersection of small language models, quantization, and on-prem privacy — not a training framework like PyTorch, but the fastest path from “we need a local model” to a running endpoint. This guide covers installation, the model library, Modelfiles, inference parameters, API integration, GPU and RAM planning, a Harbor Support on-prem triage router worked example, a tooling decision table, common pitfalls, and a practitioner checklist.

What Ollama is (and is not)

Ollama is a local LLM inference server and CLI. You run ollama pull llama3.2, then ollama run llama3.2 for an interactive chat, or point your application at http://localhost:11434 for programmatic access. Under the hood it uses llama.cpp-style GGUF weights, automatic GPU offloading when CUDA/Metal is available, and a small daemon that keeps models loaded in memory for low-latency repeat requests.

It is not a model trainer, a vector database, or a hosted MLOps platform. Fine-tuning still happens elsewhere; you import the resulting GGUF or use a community model from the Ollama library. Complex agent graphs belong in LangGraph or CrewAI — Ollama is the model endpoint those frameworks call. Reach for Ollama when latency to a public API is unacceptable, data cannot leave your network, or you want predictable per-token cost on hardware you already own.

Core concepts

  • Model — a named artifact (e.g. llama3.2:3b) bundling weights, default template, and parameters.
  • Tag — variant suffix after : — size (7b, 70b), quant level (q4_K_M), or instruct vs base.
  • Modelfile — Dockerfile-like recipe: FROM base, SYSTEM prompt, PARAMETER defaults, TEMPLATE for chat formatting.
  • Daemon — background service on port 11434 handling load/unload and concurrent requests.
  • Context window — max tokens in prompt + completion; larger windows consume more KV cache RAM.
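
To make the model/tag split concrete, here is a minimal sketch that separates a reference such as llama3.2:3b into its parts. The helper name parse_model_ref is hypothetical and not part of Ollama; it assumes an untagged name resolves to the latest tag.

```python
from typing import Tuple


def parse_model_ref(ref: str) -> Tuple[str, str]:
    """Split 'name:tag' into (name, tag), defaulting the tag to 'latest'."""
    name, sep, tag = ref.rpartition(":")
    # No colon, or the colon belongs to a host:port prefix (tag contains '/'):
    # treat the whole string as the name.
    if not sep or "/" in tag:
        return ref, "latest"
    return name, tag


print(parse_model_ref("llama3.2:3b"))
print(parse_model_ref("llama3.2"))
```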

Installation and first run

Install from ollama.com (macOS app, Windows installer, or Linux curl script). On Linux servers without a desktop, the script installs the daemon and CLI; enable it with systemctl enable ollama if your package does not do so automatically. Verify with ollama --version and curl http://localhost:11434 — a running daemon returns Ollama is running.
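
The same health check can be scripted for deploy pipelines. This is a sketch using only the Python standard library; the function names are illustrative, and the base URL assumes the default local port.

```python
import urllib.error
import urllib.request


def is_healthy_response(status: int, body: str) -> bool:
    """A running daemon answers the root URL with 'Ollama is running'."""
    return status == 200 and "Ollama is running" in body


def check_daemon(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if the daemon answers; False on refusal, timeout, or error."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return is_healthy_response(resp.status, body)
    except (urllib.error.URLError, OSError):
        return False
```

Calling check_daemon() before routing traffic gives a cheap readiness gate without loading any model.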

Essential CLI commands

  • ollama pull <model> — download weights from the library (resumable).
  • ollama list — show local models and sizes on disk.
  • ollama run <model> — interactive REPL; /bye exits.
  • ollama ps — models currently loaded in VRAM/RAM.
  • ollama rm <model> — delete local copy to reclaim disk.
  • ollama show <model> --modelfile — inspect or copy a Modelfile template.

First pull of a 7B Q4 model is roughly 4–5 GB. Plan disk for several variants during evaluation, then prune. On Apple Silicon, Metal acceleration is automatic; on NVIDIA Linux, install matching drivers — Ollama bundles CUDA where needed but cannot fix a missing kernel module.
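
Disk planning is easier with numbers. The sketch below reads the daemon's model listing (the /api/tags endpoint) and ranks local models by size. The helper names are hypothetical, and it assumes each entry carries a name and a size in bytes.

```python
import json
import urllib.request
from typing import List, Tuple


def fetch_tags(base_url: str = "http://localhost:11434", timeout: float = 5.0) -> dict:
    """Fetch the list of locally stored models from the daemon."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


def disk_usage_gb(tags_payload: dict) -> List[Tuple[str, float]]:
    """Return (model name, size in GB) pairs, largest first."""
    rows = [(m["name"], m["size"] / 1e9) for m in tags_payload.get("models", [])]
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Printing disk_usage_gb(fetch_tags()) after an evaluation round shows which variants are worth removing with ollama rm.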

Choosing models and quantization levels

Tags in the Ollama library encode model size and quantization. llama3.2:3b fits consumer laptops; qwen2.5:14b needs more VRAM but handles structured extraction better. Quantization trades precision for memory, the same trade-off that governs INT4 and GGUF deployments in production: Q4_K_M is the usual default, and Q8_0 is worth the extra memory when you have headroom and notice quality regressions on numerically sensitive tasks.

VRAM and RAM heuristics

  • 7B Q4 — ~5 GB VRAM at 4k context; runs on 8 GB GPUs with tight margins.
  • 13B Q4 — ~8–10 GB; comfortable on 12 GB cards.
  • 70B Q4 — ~40 GB+; often needs multi-GPU or CPU offload (slow).
  • Context scaling — growing context from 4k to 32k (an 8× increase) can add several GB of KV cache.
  • CPU fallback — works for dev; production batch jobs need GPU or expect 10–50× slower tokens/sec.
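
The heuristics above follow from simple arithmetic: weights cost roughly parameters × bits per weight / 8, and the KV cache grows linearly with context. The sketch below is a back-of-envelope estimate, not a measurement. The layer count, KV head count, and head dimension are assumptions modeled on a Llama-3-8B-like architecture with an fp16 cache, and the 4.5 bits per weight figure is an assumed average for a Q4-class quant.

```python
GIB = 1024 ** 3


def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the quantized weights."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """KV cache size: keys and values (the factor 2) per layer, head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_tokens


# Assumed shape: 32 layers, 8 KV heads, head_dim 128 (hypothetical 8B-class model).
weights = weight_bytes(8e9, 4.5)
for ctx in (4096, 32768):
    kv = kv_cache_bytes(32, 8, 128, ctx)
    print(f"ctx={ctx}: weights {weights / GIB:.1f} GiB + KV cache {kv / GIB:.1f} GiB")
```

Under these assumptions the KV cache grows from 0.5 GiB at 4k context to 4 GiB at 32k, which is where the "several GB" figure comes from. Real usage adds runtime overhead on top.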

Use ollama ps while serving traffic to see which model stays resident. Ollama unloads idle models after OLLAMA_KEEP_ALIVE (default five minutes) — tune this for bursty vs steady load.
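
Besides the environment variable, the API accepts a keep_alive field on individual requests. The sketch below builds two payloads: a normal chat request that keeps the model resident for 30 minutes, and a preload request with no messages. The function names and the 30m value are illustrative assumptions.

```python
def chat_payload(model: str, user_text: str, keep_alive="30m") -> dict:
    """Chat request that asks the daemon to keep the model loaded afterwards.

    keep_alive takes a duration string such as "30m"; 0 unloads right after
    the reply and -1 keeps the model loaded indefinitely.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "stream": False,
        "keep_alive": keep_alive,
    }


def preload_payload(model: str) -> dict:
    """Request with no messages: loads the model without generating anything."""
    return {"model": model, "keep_alive": -1}
```

Sending the preload payload at the start of business hours is one way to avoid a cold-start delay on the first ticket.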

Modelfiles: customize without retraining

A Modelfile layers behavior on top of a base weight. Example for a support triage assistant:

FROM llama3.2:3b
SYSTEM """You classify support tickets into billing, technical, or sales.
Reply with JSON only: {"category": "...", "confidence": 0.0-1.0}"""
PARAMETER temperature 0.1
PARAMETER num_ctx 4096

Build with ollama create harbor-triage -f Modelfile, then ollama run harbor-triage. Version Modelfiles in git: they, not the binary weights, are your prompt contract. A change to SYSTEM or TEMPLATE takes effect as soon as you rerun ollama create; pointing FROM at a new base also means downloading that base's weights.
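
If Modelfiles are generated from deployment config rather than written by hand, the build step can be scripted. This is a sketch only: render_modelfile and create_model are hypothetical helpers, and the rendering assumes the system prompt contains no triple quotes.

```python
import subprocess
import tempfile
from pathlib import Path


def render_modelfile(base: str, system: str, params: dict) -> str:
    """Render Modelfile text with a triple-quoted (multi-line safe) SYSTEM block."""
    lines = [f"FROM {base}", f'SYSTEM """{system}"""']
    lines += [f"PARAMETER {key} {value}" for key, value in params.items()]
    return "\n".join(lines) + "\n"


def create_model(name: str, modelfile_text: str) -> None:
    """Write the Modelfile to a temp dir and run `ollama create <name> -f <path>`."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "Modelfile"
        path.write_text(modelfile_text, encoding="utf-8")
        subprocess.run(["ollama", "create", name, "-f", str(path)], check=True)


text = render_modelfile(
    "llama3.2:3b",
    "You classify support tickets into billing, technical, or sales.",
    {"temperature": 0.1, "num_ctx": 4096},
)
print(text)
```

Committing the rendered text alongside the config keeps the prompt contract reviewable in pull requests.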

Template and stop tokens

Each model family (Llama 3, Mistral, Gemma) expects its own chat template. An incorrect template causes the model to ignore system instructions or leak training artifacts. Start from ollama show <base> --modelfile and edit incrementally. Set PARAMETER stop when generations run past the closing delimiter you rely on for JSON parsing.
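
Even with a correct template and stop tokens, it is safer to parse defensively. The sketch below pulls the first JSON object out of a reply and validates it against the triage schema used in this guide. The function name and the strictness of the checks are assumptions.

```python
import json
from typing import Optional

ALLOWED_CATEGORIES = {"billing", "technical", "sales"}


def parse_triage(raw: str) -> Optional[dict]:
    """Extract and validate the first JSON object in a model reply, else None."""
    start = raw.find("{")
    if start == -1:
        return None
    try:
        # raw_decode stops at the end of the first object and ignores trailing text.
        obj, _ = json.JSONDecoder().raw_decode(raw[start:])
    except json.JSONDecodeError:
        return None
    category = obj.get("category")
    confidence = obj.get("confidence")
    if category not in ALLOWED_CATEGORIES:
        return None
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    if not 0.0 <= confidence <= 1.0:
        return None
    return {"category": category, "confidence": float(confidence)}
```

Returning None rather than raising lets the caller treat malformed output as a low-confidence result.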

HTTP API and SDK integration

Ollama exposes a native REST API alongside OpenAI-compatible endpoints, so many apps need only a base URL change. A native chat request looks like this:

POST http://localhost:11434/api/chat
{
  "model": "harbor-triage",
  "messages": [{"role": "user", "content": "Invoice #8821 charged twice"}],
  "stream": false
}
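
The request above can be sent from Python with the standard library alone. This is a minimal sketch; the helper names are illustrative, and it assumes the non-streaming reply carries the text under message.content.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, user_text: str) -> urllib.request.Request:
    """Build a non-streaming POST to the native /api/chat endpoint."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "stream": False,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, user_text: str, timeout: float = 120.0) -> str:
    """Send the request and return the assistant's reply text."""
    req = build_chat_request(base_url, model, user_text)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        reply = json.load(resp)
    return reply["message"]["content"]
```

With the daemon running, chat("http://localhost:11434", "harbor-triage", "Invoice #8821 charged twice") returns the raw model output as a string.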

For OpenAI SDK compatibility, use /v1/chat/completions with base_url="http://localhost:11434/v1" and any placeholder API key. Set stream: true for token-by-token UX; any reverse proxy in front of the daemon must disable response buffering, or clients receive the whole reply in one burst. Beyond chat, the embeddings endpoint (/api/embed in current releases, /api/embeddings in older ones) serves RAG pipelines using nomic-embed-text or similar embedding models pulled the same way.
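
To show what streaming looks like on the wire without depending on an SDK, the sketch below builds a streaming request for the OpenAI-compatible endpoint and decodes the server-sent event lines. It assumes the OpenAI chunk format (data: lines carrying choices[0].delta.content, terminated by data: [DONE]); the helper names and the placeholder key are illustrative.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def build_stream_request(model: str, user_text: str,
                         base_url: str = "http://localhost:11434/v1") -> urllib.request.Request:
    """Streaming POST to /v1/chat/completions with a placeholder API key."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "stream": True,
    }).encode("utf-8")
    headers = {"Content-Type": "application/json", "Authorization": "Bearer placeholder"}
    return urllib.request.Request(f"{base_url}/chat/completions", data=body,
                                  headers=headers, method="POST")


def iter_sse_deltas(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style SSE lines until [DONE]."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        choices = json.loads(data).get("choices") or []
        if not choices:
            continue
        fragment = choices[0].get("delta", {}).get("content")
        if fragment:
            yield fragment
```

Because an HTTP response object iterates line by line, the loop `for piece in iter_sse_deltas(resp): print(piece, end="", flush=True)` inside a `with urllib.request.urlopen(req) as resp:` block prints tokens as they arrive.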

Production networking

  • Keep the default 127.0.0.1 binding; never expose 11434 to the public internet without auth.
  • Put nginx or Caddy in front with mTLS or API keys for internal services.
  • Set request timeouts above your worst-case generation (long summaries need 120s+).
  • Run one Ollama instance per GPU box; scale horizontally with a load balancer, not multiple daemons fighting for one GPU.
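
In production a real load balancer does the horizontal scaling, but the idea fits in a few lines. The sketch below is a client-side round-robin picker that skips backends marked down. The class and the host names are hypothetical.

```python
import itertools
from typing import Iterable


class RoundRobinBackends:
    """Rotate across one-daemon-per-GPU-box backends, skipping unhealthy ones."""

    def __init__(self, backends: Iterable[str]):
        self._backends = list(backends)
        if not self._backends:
            raise ValueError("at least one backend is required")
        self._cycle = itertools.cycle(self._backends)
        self._down = set()

    def mark_down(self, backend: str) -> None:
        self._down.add(backend)

    def mark_up(self, backend: str) -> None:
        self._down.discard(backend)

    def pick(self) -> str:
        # Try each backend at most once per call.
        for _ in range(len(self._backends)):
            candidate = next(self._cycle)
            if candidate not in self._down:
                return candidate
        raise RuntimeError("no healthy Ollama backends")


pool = RoundRobinBackends(["http://gpu-a:11434", "http://gpu-b:11434"])
print(pool.pick(), pool.pick())
```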

Harbor Support on-prem triage router (worked example)

Harbor Support handled 12,000 tickets/month through a cloud classifier at $0.002/request. Compliance required PII to stay in-region. The team deployed Ollama on a single L4 GPU (24 GB) instance:

  1. Model — custom Modelfile on qwen2.5:7b with JSON-only system prompt and temperature 0.
  2. Routing — FastAPI middleware calls Ollama first; any result with confidence < 0.85 escalates to the cloud frontier model.
  3. RAG — billing policy snippets retrieved from Postgres pgvector; injected into user message, not system prompt, to simplify cache.
  4. SLA — p95 local latency 420 ms vs 1.8 s cloud; 78% of tickets never left the VPC.
  5. Ops — weekly ollama pull for patch tags; Modelfile version pinned in git; ollama ps alert if model unloads during business hours.
  6. Cost — GPU reserved instance at $380/mo vs ~$290/mo API spend at prior volume; local cost more at first, and the break-even improved as volume grew.

The critical design choice was cascade routing: local model handles easy classification; hard cases still get frontier quality. Running only local would have increased misroutes on ambiguous multilingual tickets.
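
The cascade itself is a small amount of logic. The sketch below assumes each classifier is a callable that returns a dict with category and confidence; the function name, the tier field, and the fallback to cloud when the local call fails are illustrative assumptions, while the 0.85 threshold comes from the example above.

```python
from typing import Callable

Classifier = Callable[[str], dict]


def route_ticket(text: str, local: Classifier, cloud: Classifier,
                 threshold: float = 0.85) -> dict:
    """Try the local model first; escalate when confidence is below threshold."""
    try:
        result = local(text)
    except Exception:
        result = None  # assumption: treat local failures as low confidence
    if result is not None and result.get("confidence", 0.0) >= threshold:
        return {**result, "tier": "local"}
    return {**cloud(text), "tier": "cloud"}
```

Recording the tier on every result makes it straightforward to track what share of tickets stays local.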

Tooling decision table

Need                                 | Ollama            | vLLM / TGI         | LM Studio | Cloud API
-------------------------------------+-------------------+--------------------+-----------+--------------
Fastest laptop dev setup             | Excellent         | Heavy              | Excellent | Instant
OpenAI-compatible local API          | Yes               | Yes                | Yes       | Native
High-throughput multi-tenant serving | Moderate          | Excellent          | Poor      | Excellent
Custom Modelfile / prompt packaging  | Native            | Config files       | UI-driven | N/A
Data never leaves network            | Yes               | Yes                | Yes       | No
Embedding + chat same daemon         | Yes               | Separate           | Limited   | Separate SKUs
Frontier model quality               | Small/medium only | Depends on weights | Same      | Best

Choose Ollama for developer velocity and small-team on-prem inference. Graduate to vLLM when you need continuous batching across many concurrent users on A100/H100 clusters. Keep cloud APIs for frontier reasoning and as the escalation tier in cascades.

Common pitfalls

  • Undersized GPU — model loads then OOMs mid-request when context grows; monitor ollama ps and cap num_ctx.
  • Wrong chat template — model ignores system prompt; always derive Modelfile from ollama show --modelfile.
  • Exposed port 11434 — unauthenticated inference and model pull; firewall aggressively.
  • No cascade for hard inputs — 7B local models misclassify edge cases; pair with cloud or larger local model.
  • Stale weights — security and quality patches ship as new tags; automate weekly pull + smoke test.
  • Ignoring disk growth — eval pulls accumulate; review ollama list monthly and ollama rm what you no longer need.
  • Streaming through buffering proxies — clients see full delay then burst; disable nginx proxy_buffering on SSE.
  • Expecting fine-tune in Ollama — import custom GGUF or use external training; Ollama is inference-only.
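
For the stale-weights pitfall, a smoke test after each pull can be as simple as replaying labeled tickets and checking accuracy. This is a sketch: the function names, the sample tickets, and the 0.9 pass mark are all hypothetical, and classify stands in for whatever function calls your model.

```python
from typing import Callable, List, Tuple

Case = Tuple[str, str]  # (ticket text, expected category)


def accuracy(classify: Callable[[str], dict], cases: List[Case]) -> float:
    """Fraction of cases where the predicted category matches the label."""
    if not cases:
        raise ValueError("need at least one labeled case")
    hits = sum(1 for text, expected in cases
               if classify(text).get("category") == expected)
    return hits / len(cases)


def smoke_test(classify: Callable[[str], dict], cases: List[Case],
               min_accuracy: float = 0.9) -> bool:
    """Pass only if the updated model still meets the accuracy floor."""
    return accuracy(classify, cases) >= min_accuracy


# Hypothetical labeled set; a real one would come from reviewed tickets.
CASES = [
    ("Invoice charged twice", "billing"),
    ("App crashes on login", "technical"),
    ("Need a quote for 50 seats", "sales"),
]
```

Gating the deployment on smoke_test keeps a regressed tag from reaching production unnoticed.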

Practitioner checklist

  • Install Ollama on target OS; verify daemon health on port 11434.
  • Estimate VRAM from model size, quant level, and max context.
  • Pull a small instruct model first (llama3.2:3b or qwen2.5:7b).
  • Author Modelfile with SYSTEM, PARAMETER, and correct TEMPLATE.
  • Run ollama create and smoke-test JSON or classification output.
  • Integrate via /api/chat or OpenAI-compatible /v1 client.
  • Put reverse proxy auth in front; never expose daemon publicly.
  • Configure OLLAMA_KEEP_ALIVE for your traffic pattern.
  • Add cascade routing to cloud or larger model for low-confidence cases.
  • Version Modelfiles in git; pin model tags in deployment config.
  • Schedule pull + regression tests when upstream tags update.
  • Monitor tokens/sec, VRAM, and p95 latency under real prompts.
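
Two of the metrics in the last item can be computed from data you already have. The sketch below assumes the native API's final response includes eval_count (generated tokens) and eval_duration (nanoseconds), and uses a nearest-rank p95 over latencies you record yourself. The function names are illustrative.

```python
from typing import Sequence


def tokens_per_second(response: dict) -> float:
    """Generation speed from the final response's eval_count / eval_duration (ns)."""
    seconds = response["eval_duration"] / 1e9
    return response["eval_count"] / seconds


def p95(latencies_ms: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of recorded request latencies."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]
```

Measuring with real prompts matters because throughput and latency both shift with prompt length and context size.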

Key takeaways

  • Ollama is the fastest path to local LLM inference with an OpenAI-shaped API.
  • Modelfiles package system prompts and parameters without retraining weights.
  • VRAM planning must include context length and concurrent models, not just parameter count.
  • Production setups use auth, cascade routing, and tagged update discipline.
  • Pair Ollama with RAG and cloud escalation — local is a tier, not always the whole answer.
