Guide
Ollama fundamentals explained
Cloud LLM APIs are convenient until a ticket batch hits your rate limit at 2 a.m., or legal asks where customer prompts are stored. Ollama is an open-source runtime that downloads quantized model weights, serves them on localhost (or a private server), and exposes an OpenAI-compatible HTTP API so existing SDKs mostly work unchanged. It sits at the intersection of small language models, quantization, and on-prem privacy — not a training framework like PyTorch, but the fastest path from “we need a local model” to a running endpoint. This guide covers installation, the model library, Modelfiles, inference parameters, API integration, GPU and RAM planning, a Harbor Support on-prem triage router worked example, a tooling decision table, common pitfalls, and a practitioner checklist.
What Ollama is (and is not)
Ollama is a local LLM inference server and CLI. You run
ollama pull llama3.2, then ollama run llama3.2 for an interactive
chat, or point your application at http://localhost:11434 for programmatic
access. Under the hood it uses llama.cpp-style GGUF weights, automatic GPU
offloading when CUDA/Metal is available, and a small daemon that keeps models loaded in
memory for low-latency repeat requests.
It is not a model trainer, a vector database, or a hosted MLOps platform. Fine-tuning still happens elsewhere; you import the resulting GGUF or use a community model from the Ollama library. Complex agent graphs belong in LangGraph or CrewAI — Ollama is the model endpoint those frameworks call. Reach for Ollama when latency to a public API is unacceptable, data cannot leave your network, or you want predictable per-token cost on hardware you already own.
Core concepts
- Model: a named artifact (e.g. llama3.2:3b) bundling weights, default template, and parameters.
- Tag: variant suffix after the colon (:), encoding size (7b, 70b), quant level (q4_K_M), or instruct vs base.
- Modelfile: Dockerfile-like recipe with a FROM base, SYSTEM prompt, PARAMETER defaults, and TEMPLATE for chat formatting.
- Daemon: background service on port 11434 handling load/unload and concurrent requests.
- Context window: max tokens in prompt + completion; larger windows consume more KV cache RAM.
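To make the model/tag split concrete, here is a minimal sketch of how a reference such as llama3.2:3b breaks into a name and a tag. The helper name split_model_ref is hypothetical, and the second example tag is illustrative only; the fallback to "latest" mirrors what Ollama does when no tag is given.

```python
def split_model_ref(ref: str) -> tuple[str, str]:
    """Split 'name:tag' into (name, tag); a missing tag falls back to 'latest'."""
    # Only the last path segment can carry a tag, so a namespace prefix
    # such as 'library/' is kept as part of the name.
    head, slash, last = ref.rpartition("/")
    name, _, tag = last.partition(":")
    return head + slash + name, tag or "latest"


if __name__ == "__main__":
    for ref in ("llama3.2:3b", "qwen2.5:7b-instruct-q4_K_M", "llama3.2"):
        print(ref, "->", split_model_ref(ref))
```

Pinning the full name:tag pair in deployment config, rather than relying on the implicit latest tag, keeps rollouts reproducible.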
Installation and first run
Install from ollama.com
(macOS app, Windows installer, or Linux curl script). On Linux servers without a desktop,
the script installs the daemon and CLI; enable it with systemctl enable ollama
if your package does not do so automatically. Verify with ollama --version and
curl http://localhost:11434 — a running daemon returns Ollama is running.
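The same health check can be scripted for deploy pipelines. The sketch below uses only the Python standard library; the function names and the injectable fetch parameter are hypothetical conveniences (the latter lets the check be exercised without a live daemon), and the expected body text is the "Ollama is running" string mentioned above.

```python
import urllib.request


def _http_get(url: str, timeout: float) -> str:
    """Fetch a URL and return the body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def is_ollama_running(base_url: str = "http://localhost:11434",
                      timeout: float = 2.0, fetch=_http_get) -> bool:
    """Return True if the daemon answers with its plain-text health banner."""
    try:
        body = fetch(base_url, timeout)
    except OSError:
        # URLError, connection refused, and timeouts are all OSError subclasses.
        return False
    return "Ollama is running" in body


if __name__ == "__main__":
    print("daemon up:", is_ollama_running())
```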
Essential CLI commands
- ollama pull <model>: download weights from the library (resumable).
- ollama list: show local models and sizes on disk.
- ollama run <model>: interactive REPL; /bye exits.
- ollama ps: models currently loaded in VRAM/RAM.
- ollama rm <model>: delete local copy to reclaim disk.
- ollama show <model> --modelfile: inspect or copy a Modelfile template.
First pull of a 7B Q4 model is roughly 4–5 GB. Plan disk for several variants during evaluation, then prune. On Apple Silicon, Metal acceleration is automatic; on NVIDIA Linux, install matching drivers — Ollama bundles CUDA where needed but cannot fix a missing kernel module.
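For disk planning, the daemon's /api/tags endpoint returns the same inventory as ollama list in JSON form. The sketch below summarizes such a response; disk_usage is a hypothetical helper, and the sample sizes are made-up illustrative values, not measurements.

```python
def disk_usage(tags_response: dict) -> tuple[list[tuple[str, float]], float]:
    """Return ([(model_name, size_gb), ...] largest first, total_gb)."""
    models = tags_response.get("models", [])
    # The API reports sizes in bytes; convert to decimal gigabytes.
    rows = sorted(((m["name"], m["size"] / 1e9) for m in models),
                  key=lambda row: row[1], reverse=True)
    return rows, sum(gb for _, gb in rows)


if __name__ == "__main__":
    # Hypothetical payload shaped like a GET /api/tags response.
    sample = {"models": [
        {"name": "llama3.2:3b", "size": 2_000_000_000},
        {"name": "qwen2.5:7b", "size": 4_700_000_000},
    ]}
    rows, total = disk_usage(sample)
    for name, gb in rows:
        print(f"{name:<16} {gb:5.1f} GB")
    print(f"total            {total:5.1f} GB")
```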
Choosing models and quantization levels
The Ollama library tags encode size and quantization. A llama3.2:3b fits
consumer laptops; qwen2.5:14b needs more VRAM but handles structured extraction
better. Quantization trades precision for memory, echoing the INT4 and GGUF production trade-offs: Q4_K_M is the usual default; choose Q8_0 when you have headroom and notice quality regressions on numerically sensitive tasks.
VRAM and RAM heuristics
- 7B Q4: ~5 GB VRAM at 4k context; runs on 8 GB GPUs with tight margins.
- 13B Q4: ~8–10 GB; comfortable on 12 GB cards.
- 70B Q4: ~40 GB+; often needs multi-GPU or CPU offload (slow).
- Context scaling: growing context from 4k to 32k (an 8× increase) can add several GB of KV cache.
- CPU fallback: works for dev; production batch jobs need GPU or expect 10–50× slower tokens/sec.
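The heuristics above follow from two terms: weight memory (parameters × bits per weight) and KV cache (which grows linearly with context). Below is a back-of-envelope sketch; the architecture numbers (32 layers, 8 KV heads, head dimension 128, 16-bit cache) and the 4.5 bits-per-weight figure are assumptions chosen for illustration, not values read from any specific model, and real usage adds runtime overhead on top.

```python
GIB = 1024 ** 3


def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the quantized weights."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: int = 2) -> int:
    """K and V each store n_kv_heads * head_dim values per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


if __name__ == "__main__":
    cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)  # assumed architecture
    weights = weight_bytes(7e9, 4.5)                     # assumed ~4.5 bits/weight
    for n_ctx in (4096, 32768):
        kv = kv_cache_bytes(n_ctx=n_ctx, **cfg)
        print(f"ctx={n_ctx:>6}: weights {weights / GIB:.1f} GiB "
              f"+ KV {kv / GIB:.1f} GiB = {(weights + kv) / GIB:.1f} GiB")
```

Under these assumptions the KV cache, not the weights, is what changes when num_ctx is raised, which is why capping context is the first lever on a tight GPU.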
Use ollama ps while serving traffic to see which model stays resident. Ollama
unloads idle models after OLLAMA_KEEP_ALIVE (default five minutes) — tune
this for bursty vs steady load.
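Besides the environment variable, residency can be controlled per request through the keep_alive field of the API. The sketch below only builds the JSON bodies; the helper names are hypothetical, and the behavior described in the comments (an empty-prompt request to /api/generate loads the model, keep_alive of 0 unloads it) reflects my reading of the Ollama API documentation and should be confirmed against your installed version.

```python
import json


def preload_request(model: str, keep_alive: str = "30m") -> dict:
    """Body for POST /api/generate that loads a model without generating."""
    # No prompt is sent; keep_alive sets how long the model stays resident.
    return {"model": model, "keep_alive": keep_alive}


def unload_request(model: str) -> dict:
    """Body for POST /api/generate that asks the daemon to unload a model."""
    return {"model": model, "keep_alive": 0}


if __name__ == "__main__":
    print(json.dumps(preload_request("harbor-triage")))
    print(json.dumps(unload_request("harbor-triage")))
```

A longer keep_alive suits steady daytime traffic; bursty batch jobs can preload before the batch and unload afterwards to free VRAM for other models.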
Modelfiles: customize without retraining
A Modelfile layers behavior on top of a base weight. Example for a support triage assistant:
FROM llama3.2:3b
SYSTEM """
You classify support tickets into billing, technical, or sales.
Reply with JSON only: {"category": "...", "confidence": 0.0-1.0}
"""
PARAMETER temperature 0.1
PARAMETER num_ctx 4096
Build with ollama create harbor-triage -f Modelfile, then
ollama run harbor-triage. Version Modelfiles in git — they are your
prompt contract, not the binary weights. Changing SYSTEM or TEMPLATE
is instant; swapping FROM to a new base requires re-pulling weights.
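If several assistants share one base model, generating Modelfiles from code keeps the prompt contract reviewable in git. This is a minimal sketch with a hypothetical render_modelfile helper; it wraps the system prompt in triple quotes so it can span lines, and it assumes the prompt itself contains no triple-quote sequence.

```python
def render_modelfile(base: str, system: str, parameters: dict) -> str:
    """Render a Modelfile with FROM, a multi-line SYSTEM block, and PARAMETER lines."""
    lines = [f"FROM {base}", f'SYSTEM """{system}"""']
    lines += [f"PARAMETER {key} {value}" for key, value in parameters.items()]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = render_modelfile(
        base="llama3.2:3b",
        system=("You classify support tickets into billing, technical, or sales.\n"
                'Reply with JSON only: {"category": "...", "confidence": 0.0-1.0}'),
        parameters={"temperature": 0.1, "num_ctx": 4096},
    )
    print(text)  # write this to a file named Modelfile, then run ollama create
```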
Template and stop tokens
Each model family expects a chat template (Llama-3, Mistral, Gemma). Incorrect templates
cause the model to ignore system instructions or leak training artifacts. Start from
ollama show <base> --modelfile and edit incrementally. Set
PARAMETER stop when generations run past the closing tag you need for JSON parsing.
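Even with a correct template and stop tokens, small models occasionally wrap JSON in extra prose, so the caller should parse defensively. The sketch below is one possible validator for the triage contract described earlier; parse_triage and the ALLOWED set are hypothetical names, and the policy of returning None on any violation is an assumption about how you would want failures handled.

```python
import json

ALLOWED = {"billing", "technical", "sales"}


def parse_triage(raw: str):
    """Return (category, confidence) or None if the reply violates the contract."""
    # Take the outermost {...} span so stray text around the JSON is tolerated.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        obj = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    category, confidence = obj.get("category"), obj.get("confidence")
    if not isinstance(category, str) or category not in ALLOWED:
        return None
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    if not 0.0 <= confidence <= 1.0:
        return None
    return category, float(confidence)


if __name__ == "__main__":
    print(parse_triage('Sure! {"category": "billing", "confidence": 0.93}'))
    print(parse_triage("I think this is about billing."))
```

Treating a None result as low confidence lets the same escalation path handle both uncertain and malformed replies.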
HTTP API and SDK integration
Ollama exposes REST endpoints compatible enough with OpenAI clients that many apps need only a base URL change:
POST http://localhost:11434/api/chat
{
  "model": "harbor-triage",
  "messages": [{"role": "user", "content": "Invoice #8821 charged twice"}],
  "stream": false
}
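The request above can be issued from Python without any third-party client. The following sketch uses only the standard library; the function names are hypothetical, the 120-second timeout is an arbitrary illustrative default, and the response shape (a message object with a content field) is the non-streaming /api/chat format.

```python
import json
import urllib.request


def build_chat_request(model: str, user_text: str,
                       base_url: str = "http://localhost:11434") -> urllib.request.Request:
    """Build a non-streaming POST to the native /api/chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "stream": False,
    }
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body: bytes) -> str:
    """Pull the assistant text out of a non-streaming /api/chat response."""
    return json.loads(response_body)["message"]["content"]


def chat(model: str, user_text: str, timeout: float = 120.0) -> str:
    """Send one user message and return the assistant reply (needs a running daemon)."""
    with urllib.request.urlopen(build_chat_request(model, user_text), timeout=timeout) as resp:
        return extract_reply(resp.read())


if __name__ == "__main__":
    print(chat("harbor-triage", "Invoice #8821 charged twice"))
```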
For OpenAI SDK compatibility, use /v1/chat/completions with
base_url="http://localhost:11434/v1" and any placeholder API key. Enable
stream: true for token-by-token UX; your reverse proxy must disable buffering
on SSE. Non-chat endpoints include /api/embeddings for RAG pipelines using nomic-embed-text or similar embedding models, pulled the same way.
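When streaming from the native /api/chat endpoint, each line of the response body is a standalone JSON object carrying a fragment of the reply, with done set to true on the last one (the OpenAI-compatible /v1 route uses SSE framing instead, which this sketch does not handle). collect_stream is a hypothetical helper; it accepts any iterable of byte lines, such as the response object returned by urllib.request.urlopen.

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[bytes]) -> str:
    """Concatenate content fragments from a newline-delimited JSON chat stream."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank keep-alive lines
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


if __name__ == "__main__":
    # Hand-written chunks standing in for a live response body.
    fake_stream = [
        b'{"message": {"role": "assistant", "content": "bill"}, "done": false}\n',
        b'{"message": {"role": "assistant", "content": "ing"}, "done": false}\n',
        b'{"message": {"role": "assistant", "content": ""}, "done": true}\n',
    ]
    print(collect_stream(fake_stream))
```

In a UI you would render each fragment as it arrives rather than joining at the end; the loop structure stays the same.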
Production networking
- Bind to 127.0.0.1 by default; never expose 11434 to the public internet without auth.
- Put nginx or Caddy in front with mTLS or API keys for internal services.
- Set request timeouts above your worst-case generation (long summaries need 120s+).
- Run one Ollama instance per GPU box; scale horizontally with a load balancer, not multiple daemons fighting for one GPU.
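A deployment script can refuse to start when the daemon's bind address is not loopback-only. The sketch below checks a value in the style of the OLLAMA_HOST environment variable; is_loopback_bind is a hypothetical helper, and the parsing covers only the common forms (bare host, host:port, URL, bracketed IPv6), so treat it as a guard rail rather than a full parser.

```python
import ipaddress


def is_loopback_bind(ollama_host: str) -> bool:
    """Return True only if the bind address resolves to a loopback interface."""
    host = ollama_host.strip()
    if "://" in host:                      # strip a scheme such as http://
        host = host.split("://", 1)[1]
    host = host.split("/", 1)[0]           # drop any path
    if host.startswith("["):               # bracketed IPv6, e.g. [::1]:11434
        host = host[1:].split("]", 1)[0]
    elif host.count(":") == 1:             # host:port
        host = host.rsplit(":", 1)[0]
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False                       # unknown names are treated as unsafe


if __name__ == "__main__":
    for value in ("127.0.0.1:11434", "0.0.0.0:11434", "[::1]:11434"):
        print(value, "->", is_loopback_bind(value))
```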
Harbor Support on-prem triage router (worked example)
Harbor Support handled 12,000 tickets/month through a cloud classifier at $0.002/request. Compliance required PII to stay in-region. The team deployed Ollama on a single L4 GPU (24 GB) instance:
- Model: custom Modelfile on qwen2.5:7b with JSON-only system prompt and temperature 0.
- Routing: FastAPI middleware calls Ollama first; confidence < 0.85 escalates to the cloud frontier model (see model routing).
- RAG: billing policy snippets retrieved from Postgres pgvector; injected into user message, not system prompt, to simplify cache.
- SLA: p95 local latency 420 ms vs 1.8s cloud; 78% of tickets never left the VPC.
- Ops: weekly ollama pull for patch tags; Modelfile version pinned in git; ollama ps alert if model unloads during business hours.
- Cost: GPU reserved instance $380/mo vs ~$290/mo API spend at prior volume; break-even improved as volume grew.
The critical design choice was cascade routing: local model handles easy classification; hard cases still get frontier quality. Running only local would have increased misroutes on ambiguous multilingual tickets.
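The cascade can be expressed in a few lines. This is a sketch of the routing rule only, not Harbor Support's actual middleware: route_ticket and the two classifier callables are hypothetical, and the stubs in the usage section stand in for real calls to Ollama and the cloud model. The 0.85 threshold is the one stated above.

```python
from typing import Callable, Tuple

# A classifier takes ticket text and returns (category, confidence).
Classifier = Callable[[str], Tuple[str, float]]


def route_ticket(text: str, local: Classifier, cloud: Classifier,
                 threshold: float = 0.85) -> Tuple[str, float, str]:
    """Try the local model first; escalate to the cloud tier below the threshold."""
    category, confidence = local(text)
    if confidence < threshold:
        category, confidence = cloud(text)
        return category, confidence, "cloud"
    return category, confidence, "local"


if __name__ == "__main__":
    # Stub classifiers with fixed answers, for illustration only.
    easy_local = lambda text: ("billing", 0.97)
    unsure_local = lambda text: ("sales", 0.55)
    cloud = lambda text: ("technical", 0.92)
    print(route_ticket("Invoice #8821 charged twice", easy_local, cloud))
    print(route_ticket("ambiguous multilingual ticket", unsure_local, cloud))
```

Returning the tier alongside the label makes it easy to track what share of tickets stays local, the figure reported in the SLA bullet.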
Tooling decision table
| Need | Ollama | vLLM / TGI | LM Studio | Cloud API |
|---|---|---|---|---|
| Fastest laptop dev setup | Excellent | Heavy | Excellent | Instant |
| OpenAI-compatible local API | Yes | Yes | Yes | Native |
| High-throughput multi-tenant serving | Moderate | Excellent | Poor | Excellent |
| Custom Modelfile / prompt packaging | Native | Config files | UI-driven | N/A |
| Data never leaves network | Yes | Yes | Yes | No |
| Embedding + chat same daemon | Yes | Separate | Limited | Separate SKUs |
| Frontier model quality | Small/medium only | Depends on weights | Same | Best |
Choose Ollama for developer velocity and small-team on-prem inference. Graduate to vLLM when you need continuous batching across many concurrent users on A100/H100 clusters. Keep cloud APIs for frontier reasoning and as the escalation tier in cascades.
Common pitfalls
- Undersized GPU: model loads then OOMs mid-request when context grows; monitor ollama ps and cap num_ctx.
- Wrong chat template: model ignores system prompt; always derive Modelfile from ollama show --modelfile.
- Exposed port 11434: unauthenticated inference and model pull; firewall aggressively.
- No cascade for hard inputs: 7B local models misclassify edge cases; pair with cloud or larger local model.
- Stale weights: security and quality patches ship as new tags; automate weekly pull + smoke test.
- Ignoring disk growth: eval pulls accumulate; review ollama list monthly and clean up.
- Streaming through buffering proxies: clients see full delay then burst; disable nginx proxy_buffering on SSE.
- Expecting fine-tune in Ollama: import custom GGUF or use external training; Ollama is inference-only.
Practitioner checklist
- Install Ollama on target OS; verify daemon health on port 11434.
- Estimate VRAM from model size, quant level, and max context.
- Pull a small instruct model first (llama3.2:3b or qwen2.5:7b).
- Author Modelfile with SYSTEM, PARAMETER, and correct TEMPLATE.
- Run ollama create and smoke-test JSON or classification output.
- Integrate via /api/chat or OpenAI-compatible /v1 client.
- Put reverse proxy auth in front; never expose daemon publicly.
- Configure OLLAMA_KEEP_ALIVE for your traffic pattern.
- Add cascade routing to cloud or larger model for low-confidence cases.
- Version Modelfiles in git; pin model tags in deployment config.
- Schedule pull + regression tests when upstream tags update.
- Monitor tokens/sec, VRAM, and p95 latency under real prompts.
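For the monitoring item, two numbers are cheap to compute from data you already have. The sketch below derives tokens/sec from the eval_count and eval_duration fields that Ollama includes in a completed response (duration is in nanoseconds) and computes a nearest-rank p95 over collected latencies; the function names and the sample values are hypothetical.

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from a completed response's timing fields."""
    seconds = response["eval_duration"] / 1e9  # nanoseconds to seconds
    return response["eval_count"] / seconds


def p95(latencies_ms: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


if __name__ == "__main__":
    # Illustrative values only, not measurements.
    print(tokens_per_second({"eval_count": 120, "eval_duration": 2_000_000_000}))
    print(p95([310, 295, 420, 380, 305, 290, 1250, 330, 315, 300]))
```

Measure with real production prompts: short synthetic prompts understate both prompt-processing time and KV cache pressure.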
Key takeaways
- Ollama is the fastest path to local LLM inference with an OpenAI-shaped API.
- Modelfiles package system prompts and parameters without retraining weights.
- VRAM planning must include context length and concurrent models, not just parameter count.
- Production setups use auth, cascade routing, and tagged update discipline.
- Pair Ollama with RAG and cloud escalation — local is a tier, not always the whole answer.
Related reading
- LLM quantization explained — what Q4 and GGUF mean for quality and memory
- Small language models explained — which compact models Ollama hosts well
- RAG explained — grounding local models with retrieved context
- LLM model routing explained — cascade local and cloud tiers by confidence