Prometheus monitoring explained
Prometheus is the open-source time-series database and monitoring system that became the default metrics layer for Kubernetes, microservices, and most cloud-native stacks. Unlike log aggregators that ingest events after the fact, Prometheus pulls metrics from your services on a schedule, stores them as labeled samples, and exposes a powerful query language (PromQL) for dashboards and alerts. Whether you are instrumenting a Node.js API, scraping Linux host stats with the node exporter, or wiring burn-rate alerts to SLO error budgets, Prometheus is usually in the path. This guide covers the pull model and scrape config, the four metric types, labels and cardinality traps, PromQL patterns for rates and percentiles, recording rules and Alertmanager routing, a Harbor Fleet multi-service worked example, a tooling decision table, common pitfalls, and a production checklist. For the broader observability picture, see observability explained; for request paths across services, pair metrics with distributed tracing.
How Prometheus fits the observability stack
Prometheus owns the metrics pillar — numeric measurements aggregated over time: request rates, error ratios, queue depth, memory usage. It does not replace structured logs or distributed traces; it complements them. A typical incident workflow: a Prometheus alert fires on elevated 5xx rate, you open Grafana to slice by endpoint and deployment version, then pivot to traces or logs with a shared correlation ID to find the failing dependency.
The core components:
- Prometheus server — scrapes targets, stores samples locally (or remotely), evaluates rules, sends alerts.
- Exporters — sidecar processes that expose metrics in Prometheus format (node_exporter for Linux, postgres_exporter for Postgres, blackbox_exporter for probes).
- Alertmanager — deduplicates, groups, routes, and silences alerts to PagerDuty, Slack, or email.
- Grafana (or similar) — visualizes PromQL queries; not part of Prometheus but almost always paired with it.
Pull vs push
Prometheus scrapes an HTTP /metrics endpoint on each target every
scrape_interval (often 15–60 seconds). Pulling means Prometheus
controls discovery and learns within one scrape interval that a target is down:
a failed scrape sets that target's up series to 0, so missing scrapes are
visible. Push-based systems (StatsD, some SaaS agents) can work behind NAT
but hide target health from the collector. For short-lived batch jobs, use the
Pushgateway sparingly: it is not a generic message queue and
can misrepresent job state if misused.
Metric types and naming conventions
Every sample is a named metric plus a set of labels (key-value
pairs) and a floating-point value at a timestamp. Prometheus defines four
metric types; client libraries expose them with _total,
_bucket, _sum, and _count suffix conventions:
- Counter: monotonically increasing (total HTTP requests, bytes sent). Use rate() or increase() in PromQL; never graph raw counters on long windows without wrapping.
- Gauge: can go up or down (memory in use, queue length, temperature). Graph directly or use deriv() for trends.
- Histogram: observations bucketed by upper bounds (request latency). Enables percentile estimates via histogram_quantile() without storing every event.
- Summary: client-side quantiles over a sliding window. Prefer histograms in most services; summaries are harder to aggregate across replicas.
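To make the histogram bullet concrete, here is a minimal Python sketch of cumulative buckets and of the linear interpolation that a percentile estimate relies on. It is a simplified illustration, not the Prometheus implementation: the class name `Histogram` and the function `estimate_quantile` are hypothetical, and the real histogram_quantile() handles more edge cases.

```python
import bisect
import math


class Histogram:
    """Toy cumulative histogram, loosely modelled on the Prometheus data model."""

    def __init__(self, upper_bounds):
        # Finite bucket bounds, sorted; a +Inf bucket is always appended.
        self.bounds = sorted(upper_bounds) + [math.inf]
        self.counts = [0] * len(self.bounds)  # per-bucket, not yet cumulative
        self.total = 0.0  # plays the role of the _sum series
        self.n = 0        # plays the role of the _count series

    def observe(self, value):
        # bisect_left finds the first bound >= value, so "le" is inclusive.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value
        self.n += 1

    def cumulative(self):
        """Return [(le, cumulative_count)], the shape of the _bucket series."""
        out, running = [], 0
        for bound, count in zip(self.bounds, self.counts):
            running += count
            out.append((bound, running))
        return out


def estimate_quantile(q, buckets):
    """Estimate quantile q by interpolating linearly inside one bucket."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # No upper limit to interpolate toward: fall back to the
                # highest finite bound.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan
```

The estimate can never be more precise than the bucket layout, which is why bucket bounds are usually chosen around the latency thresholds you intend to alert on.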
Name metrics with a single unit and domain prefix: http_requests_total,
process_resident_memory_bytes. Follow the
official naming guidelines
so dashboards and alerts port across teams. Expose a /metrics handler
from your app (prometheus/client_golang, prom-client for Node, etc.) or rely on
exporters for software you do not control.
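As a rough picture of what a /metrics handler returns, the sketch below renders one counter in the Prometheus text exposition format using only the Python standard library. The helper names `render_counter` and `escape_label_value` are hypothetical; in a real service a client library would generate this output for you.

```python
def escape_label_value(value):
    # The text format escapes backslash, double quote, and newline in label values.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name, help_text, samples):
    """Render one counter family.

    samples maps a tuple of (label_name, label_value) pairs to a number.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{k}="{escape_label_value(v)}"' for k, v in labels)
        # Omit the braces entirely when a sample has no labels.
        suffix = f"{{{label_str}}}" if label_str else ""
        lines.append(f"{name}{suffix} {value}")
    return "\n".join(lines) + "\n"
```

Serving this string with a text/plain content type from any HTTP handler is enough for a scrape to succeed, although a client library also takes care of concurrency and the other metric types.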
Labels, cardinality, and scrape configuration
Labels are how you slice metrics — method="GET",
status="500", service="payments". They are also how
Prometheus indexes data; every unique label combination creates a new
time series. High-cardinality labels (user IDs, order IDs,
unbounded path segments) can explode memory use and slow queries. Rule of thumb:
keep total active series per job in the low millions at most, and alert on
growth in prometheus_tsdb_head_series.
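The arithmetic behind a cardinality explosion is a simple product, sketched here in Python. The function name `worst_case_series` and the label counts are made-up illustrations; real series counts are usually lower than this upper bound because not every combination occurs.

```python
from math import prod


def worst_case_series(distinct_values):
    """Upper bound on series for one metric.

    distinct_values maps each label name to its number of distinct values;
    the bound is the product of those counts.
    """
    return prod(distinct_values.values())


# Hypothetical label sets for http_requests_total.
bounded = {"method": 5, "status": 10, "route": 20}
with_user_id = {**bounded, "user_id": 50_000}
```

With these assumed numbers, the bounded label set tops out at 1,000 series, while adding a user_id label raises the ceiling to 50 million for the same metric.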
Scrape config lives in prometheus.yml or, when you run the Prometheus Operator on Kubernetes, in ServiceMonitor CRDs. Key fields:
- scrape_interval and scrape_timeout: balance freshness vs load.
- metrics_path: default /metrics; some apps use /actuator/prometheus.
- relabel_configs: rewrite target labels before the scrape (for example, set environment); use metric_relabel_configs to drop noisy labels or series before ingest.
- kubernetes_sd_configs or file_sd_configs: dynamic target discovery instead of static IP lists.
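For file-based discovery, Prometheus reads target groups from JSON or YAML files listed under file_sd_configs. The Python sketch below generates such a JSON document from an inventory; the function name `render_file_sd`, the host names, and the label values are assumptions for illustration only.

```python
import json


def render_file_sd(hosts, port, labels):
    """Build a file_sd document: a JSON list of target groups.

    Each group has a "targets" list of host:port strings and a "labels"
    mapping that is attached to every target in the group.
    """
    group = {
        "targets": [f"{host}:{port}" for host in hosts],
        "labels": dict(labels),
    }
    return json.dumps([group], indent=2, sort_keys=True)


# Hypothetical inventory: two hosts exposing node_exporter on its usual port.
document = render_file_sd(
    ["app-1.internal", "app-2.internal"], 9100, {"environment": "prod"}
)
```

Writing the result to a path that file_sd_configs points at lets a deploy script or cron job manage targets without editing prometheus.yml itself.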
For external uptime checks, point blackbox_exporter at URLs and scrape its metrics. For databases and queues, run dedicated exporters on the same network segment — never expose database ports to the public internet just for metrics.
PromQL essentials
PromQL (Prometheus Query Language) selects and aggregates time
series. Instant vectors return one value per series at a point in time; range
vectors cover a window (e.g. [5m]) for functions like rate().
Patterns you will use weekly:
- Request rate: sum(rate(http_requests_total{job="api"}[5m])) by (method)
- Error ratio: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
- p99 latency: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
- CPU per process: rate(process_cpu_seconds_total[5m]) from client-library process metrics; for per-pod CPU use cAdvisor's container_cpu_usage_seconds_total, and for hosts use node_exporter's node_cpu_seconds_total.
Use the offset modifier to compare week-over-week (for example, offset 1w), and the @ modifier to pin a query to a fixed evaluation time.
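To show why rate() is safe on counters that restart, here is a simplified Python sketch of the idea: sum the increases between consecutive samples and treat any drop as a reset to zero. The names `counter_increase` and `simple_rate` are hypothetical, and the real rate() additionally extrapolates toward the edges of the range window, which this sketch leaves out.

```python
def counter_increase(samples):
    """Total increase across samples, tolerating counter resets.

    samples is a list of (timestamp_seconds, counter_value), oldest first.
    """
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            # A drop means the process restarted and counted up from zero,
            # so the whole current value is new increase.
            total += cur
    return total


def simple_rate(samples):
    """Per-second average increase over the sampled span, or None if < 2 samples."""
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / elapsed


# Hypothetical 15-second scrapes with a restart between t=30 and t=45.
scrapes = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
```

Graphing the raw values in `scrapes` would show a cliff at the restart, while the rate stays steady at roughly 1.67 requests per second.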
Recording rules precompute expensive expressions (e.g. per-service
error rates) on a schedule so dashboards and alerts stay fast. Store rules in
separate files and validate with promtool check rules in CI.
Alerting with Alertmanager
Prometheus alerting rules evaluate PromQL on an interval; firing alerts go to Alertmanager, not directly to humans. Alertmanager handles:
- Grouping — one notification per incident cluster, not one per pod.
- Inhibition — suppress “high latency” when “service down” already fired.
- Routing trees — severity, team, and environment labels route to different on-call rotations.
- Silences and maintenance windows — scheduled deploys should not page the whole org.
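Grouping is easiest to picture as bucketing alerts by a chosen subset of labels, so that many per-pod alerts collapse into one notification. The Python sketch below illustrates that idea only; `group_alerts` and the alert dictionaries are hypothetical, and Alertmanager's real behaviour also involves timing settings that are not modelled here.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the labels named in group_by.

    Alerts missing a label are grouped under an empty string for that label.
    """
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical firing alerts: two pods of one service plus a disk alert.
firing = [
    {"labels": {"alertname": "HighErrorRate", "service": "api", "pod": "api-1"}},
    {"labels": {"alertname": "HighErrorRate", "service": "api", "pod": "api-2"}},
    {"labels": {"alertname": "DiskFull", "service": "worker", "pod": "worker-1"}},
]
notifications = group_alerts(firing, ("alertname", "service"))
```

Leaving pod out of the grouping key is what turns two pod-level alerts into a single notification for the api service.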
Write alerts on user-visible symptoms (SLO burn rate, elevated 5xx) rather than causes (single pod CPU spike). Pair with liveness and readiness probes in Kubernetes: liveness probes restart unhealthy containers and readiness probes take them out of rotation, while Prometheus tells you the fleet is still failing after restarts. Multi-window burn-rate alerts (the pattern from Google's Site Reliability Workbook) catch both fast and slow SLO violations without alert fatigue.
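The burn-rate arithmetic can be sketched in a few lines of Python. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target), and a multi-window check requires both a long and a short window to exceed the threshold. The function names are hypothetical, and the default threshold of 14.4 is the commonly cited example for a 1-hour window against a 30-day SLO (2% of the budget spent in one hour); treat it as an assumption to tune, not a fixed rule.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' errors are being spent."""
    budget = 1.0 - slo_target
    return error_ratio / budget


def should_page(long_window_ratio, short_window_ratio, slo_target, threshold=14.4):
    """Page only if both windows burn faster than the threshold.

    The long window shows the problem is significant; the short window
    shows it is still happening, which avoids paging on a recovered blip.
    """
    return (
        burn_rate(long_window_ratio, slo_target) > threshold
        and burn_rate(short_window_ratio, slo_target) > threshold
    )
```

For a 99.9% SLO, a sustained 2% error ratio is a burn rate of about 20, so `should_page(0.02, 0.02, 0.999)` is true, while a short-window ratio that has already dropped to 0.05% suppresses the page.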
Worked example: Harbor Fleet observability rollout
Harbor Fleet runs three services on a single VPS behind nginx: a public API, a background worker, and a static marketing site. The team wants RED metrics (rate, errors, duration) per service without a full Kubernetes migration.
- Instrument the API: add Prometheus middleware with a counter http_requests_total{method,route,status} and a histogram http_request_duration_seconds with buckets tuned to their 200 ms SLO (0.05, 0.1, 0.25, 0.5, 1, 2.5).
- Scrape node_exporter on :9100 for disk, memory, and CPU; disk alerts matter because SQLite backups land on the same volume.
- Blackbox probe: synthetic GET to /health every 30 s from outside the VPS; catches nginx misconfig that in-process metrics miss.
- Recording rules: job:http_requests:rate5m and job:http_errors:ratio5m pre-aggregated for Grafana rows.
- Alerts: page if error ratio > 2% for 10 m; ticket if p99 > 500 ms for 30 m; warn on disk > 85% for 1 h.
- Runbook links: each alert annotation includes a link to structured logs filtered by trace_id and the deploy version label.
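The "for 10 m" part of these alerts means the condition must hold continuously across rule evaluations before the alert fires. The Python sketch below models that pending-then-firing behaviour under simplifying assumptions; `first_fire_time` and the sample data are hypothetical, and evaluation details in Prometheus itself are more involved.

```python
def first_fire_time(samples, threshold, for_seconds):
    """Return the timestamp at which the alert would first fire, or None.

    samples is a list of (timestamp_seconds, value), one per rule
    evaluation, oldest first. The alert is pending while value > threshold
    and fires once that has held for at least for_seconds.
    """
    pending_since = None
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts
            if ts - pending_since >= for_seconds:
                return ts
        else:
            # Any evaluation below the threshold resets the pending timer.
            pending_since = None
    return None


# Hypothetical error ratios evaluated every 60 s, with one healthy blip at t=360.
evaluations = (
    [(t, 0.03) for t in range(0, 301, 60)]
    + [(360, 0.01)]
    + [(t, 0.03) for t in range(420, 1081, 60)]
)
```

With a 2% threshold and a 600-second hold, the blip at t=360 resets the timer, so the page goes out at t=1020 rather than t=600.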
After two weeks they discover worker queue depth (a gauge) predicts API latency better than CPU — they add a Grafana row and demote CPU alerts to informational. That feedback loop is why metrics belong in the same repo culture as application code.
Tooling decision table
| Need | Reach for | Why |
|---|---|---|
| Self-hosted metrics on K8s or VPS | Prometheus + Grafana | Mature ecosystem, PromQL, huge exporter library, no per-host SaaS bill. |
| Managed metrics with long retention | Grafana Cloud, AWS AMP, Google Managed Prometheus | Ops offload; still PromQL-compatible in most offerings. |
| Full-stack APM with auto-instrumentation | Datadog, New Relic | Faster onboarding, higher cost, vendor lock-in on query DSL. |
| Ephemeral batch job metrics | Pushgateway (carefully) or native push via remote write | Pull model assumes long-lived targets; batch needs explicit design. |
| Log-only debugging | Loki or ELK — not Prometheus alone | Metrics show symptoms; logs carry stack traces and context. |
Common pitfalls
- Cardinality explosion: putting user_id or raw URL paths on HTTP metrics; use bounded route templates instead.
- Graphing raw counters: always apply rate() or increase() over a window several times the scrape interval (four times is a common rule of thumb), so each window holds enough samples.
- Careless aggregation across replicas: aggregate explicitly, for example sum by (job) (rate(...)), and take rate() per series before sum(); otherwise counter resets on replicas restarting during a rollout distort the total.
- Alerting on causes, not symptoms: paging on “one pod restarted” while the user-facing SLO is green trains on-call to ignore pages.
- No retention plan: default 15-day local retention may be fine for ops; compliance may need remote write to a long-term store backed by object storage.
- Scraping through auth-less public endpoints: metrics can leak internal topology; protect with network policy or mTLS.
- Pushgateway as a cron buffer: stale metrics from finished jobs linger and mislead dashboards.
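For the cardinality pitfall, one way to keep the route label bounded is to collapse variable path segments before recording the metric. The Python sketch below is a fallback heuristic under stated assumptions: `route_template` is a hypothetical helper, and it only recognises numeric and UUID segments. When your web framework exposes the matched route pattern, prefer that instead.

```python
import re

# Matches a canonical 8-4-4-4-12 hexadecimal UUID.
_UUID = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)


def route_template(path):
    """Replace numeric and UUID path segments with a fixed ':id' placeholder."""
    # Drop the query string, then split the path into segments.
    segments = path.split("?", 1)[0].strip("/").split("/")
    normalised = [
        ":id" if segment.isdigit() or _UUID.match(segment) else segment
        for segment in segments
    ]
    return "/" + "/".join(normalised)
```

A request to /orders/12345/items/7?expand=1 is then recorded under /orders/:id/items/:id, so the number of route values stays tied to the number of endpoints rather than the number of orders.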
Production checklist
- Define RED (or USE) metrics per service before writing alert rules.
- Standardize metric and label names across teams; document allowed label keys.
- Run promtool check config and promtool check rules in CI.
- Set scrape intervals appropriate to SLO windows (sub-minute for user-facing APIs).
- Pre-aggregate hot queries with recording rules; keep dashboard PromQL simple.
- Route alerts through Alertmanager with grouping, inhibition, and runbook URLs.
- Monitor Prometheus itself: TSDB head series, scrape failures, rule evaluation lag.
- Align burn-rate alerts with documented SLOs and error-budget policies.
- Pair metrics with traces/logs via exemplars or a request ID shared across logs and traces; keep trace_id out of metric labels, where it would be unbounded cardinality.
- Test alert paths quarterly; a silent PagerDuty integration is worse than no alert.
Key takeaways
- Prometheus pulls metrics from HTTP endpoints on a schedule and stores labeled time series optimized for operational queries.
- Use counters for totals, gauges for point-in-time values, and histograms for latency percentiles — avoid high-cardinality labels.
- PromQL rate(), ratios, and histogram_quantile() power most dashboards and SLO alerts.
- Alertmanager groups and routes symptoms; recording rules keep queries fast at scale.
- Metrics are one pillar — combine with logs, traces, and health probes for complete observability.
Related reading
- Observability explained — metrics, logs, and traces as three pillars
- SLOs and error budgets explained — burn-rate alerting on top of Prometheus
- Kubernetes fundamentals explained — ServiceMonitor patterns and pod metrics
- Structured logging explained — correlate logs with metric spikes