Prometheus monitoring explained

Prometheus is the open-source time-series database and monitoring system that became the default metrics layer for Kubernetes, microservices, and most cloud-native stacks. Unlike log aggregators that ingest events after the fact, Prometheus pulls metrics from your services on a schedule, stores them as labeled samples, and exposes a powerful query language (PromQL) for dashboards and alerts. Whether you are instrumenting a Node.js API, scraping Linux host stats with the node exporter, or wiring burn-rate alerts to SLO error budgets, Prometheus is usually in the path. This guide covers the pull model and scrape config, the four metric types, labels and cardinality traps, PromQL patterns for rates and percentiles, recording rules and Alertmanager routing, a Harbor Fleet multi-service worked example, a tooling decision table, common pitfalls, and a production checklist. For the broader observability picture, see observability explained; for request paths across services, pair metrics with distributed tracing.

How Prometheus fits the observability stack

Prometheus owns the metrics pillar — numeric measurements aggregated over time: request rates, error ratios, queue depth, memory usage. It does not replace structured logs or distributed traces; it complements them. A typical incident workflow: a Prometheus alert fires on elevated 5xx rate, you open Grafana to slice by endpoint and deployment version, then pivot to traces or logs with a shared correlation ID to find the failing dependency.

The core components:

Prometheus server: scrapes targets, stores samples locally (or remotely), evaluates rules, sends alerts.
Exporters: companion processes that translate another system's stats into Prometheus format (node_exporter for Linux, postgres_exporter for Postgres, blackbox_exporter for probes). They can run as sidecars, but many run as standalone host daemons.
Alertmanager: deduplicates, groups, routes, and silences alerts to PagerDuty, Slack, or email.
Grafana (or similar): visualizes PromQL queries; not part of Prometheus but almost always paired with it.

Pull vs push

Prometheus scrapes an HTTP /metrics endpoint on each target every scrape_interval (often 15–60 seconds). Because Prometheus controls discovery and initiates every request, a failed scrape is itself a signal: the synthetic up series for that target drops to 0 at the next scrape. Push-based systems (StatsD, some SaaS agents) can work behind NAT, but the collector cannot easily tell a dead target from a quiet one. For short-lived batch jobs, use the Pushgateway sparingly: it is not a generic message queue, and it keeps serving the last pushed values until they are deleted, which can misrepresent job state.
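To make the target-health point concrete, here is a minimal Python sketch of a pull loop. The target names and the fake_fetch stand-in are hypothetical, and no real HTTP is involved; it only illustrates how a failed scrape becomes a data point in the spirit of the up series.

```python
from typing import Callable, Dict, Iterable


def scrape_all(targets: Iterable[str], fetch: Callable[[str], str]) -> Dict[str, int]:
    """Return an up-style health map: 1 if the scrape succeeded, 0 if it failed."""
    up: Dict[str, int] = {}
    for target in targets:
        try:
            fetch(target)  # a real scraper would GET http://<target>/metrics
            up[target] = 1
        except OSError:
            up[target] = 0  # the failure itself is recorded, not silently lost
    return up


def fake_fetch(target: str) -> str:
    """Hypothetical stand-in for an HTTP GET; one target is unreachable."""
    if target == "worker:9102":
        raise ConnectionRefusedError("connection refused")
    return "http_requests_total 42\n"


health = scrape_all(["api:9101", "worker:9102"], fake_fetch)
print(health)
```

In a push model the collector would simply stop hearing from worker:9102, with nothing to distinguish an outage from an idle service.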

Metric types and naming conventions

Every sample is a named metric plus a set of labels (key-value pairs) and a floating-point value at a timestamp. Prometheus defines four metric types; client libraries expose them with suffix conventions such as _total for counters and _bucket, _sum, and _count for histograms:

Counter: monotonically increasing (total HTTP requests, bytes sent). Wrap it in rate() or increase() in PromQL; a raw counter graph is just a rising line that drops to zero on every process restart.
Gauge: can go up or down (memory in use, queue length, temperature). Graph directly or use deriv() for trends.
Histogram: observations bucketed by upper bounds (request latency). Enables percentile estimates via histogram_quantile() without storing every event.
Summary: client-side quantiles over a sliding window. Prefer histograms in most services, because precomputed quantiles cannot be meaningfully aggregated across replicas.

Name metrics with a single unit and domain prefix: http_requests_total, process_resident_memory_bytes. Follow the official naming guidelines so dashboards and alerts port across teams. Expose a /metrics handler from your app (prometheus/client_golang, prom-client for Node, etc.) or rely on exporters for software you do not control.
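The text a /metrics handler returns is plain and line-oriented. The sketch below renders a counter and a histogram by hand to show the naming suffixes and the cumulative bucket layout; in practice a client library does this for you. The helper names (label_str, counter_lines, histogram_lines) and the sample values are made up for illustration, and label-value escaping is deliberately omitted.

```python
from typing import Dict, List, Sequence


def label_str(labels: Dict[str, str]) -> str:
    """Render {k="v",...}; real clients also escape quotes and backslashes."""
    if not labels:
        return ""
    return "{" + ",".join(f'{k}="{v}"' for k, v in sorted(labels.items())) + "}"


def counter_lines(name: str, help_text: str, value: float, labels: Dict[str, str]) -> List[str]:
    return [
        f"# HELP {name} {help_text}",
        f"# TYPE {name} counter",
        f"{name}{label_str(labels)} {value}",
    ]


def histogram_lines(name: str, bounds: Sequence[float], observations: Sequence[float]) -> List[str]:
    """Buckets are cumulative: each le bucket counts every observation <= its bound."""
    lines = [f"# TYPE {name} histogram"]
    for bound in bounds:
        count = sum(1 for obs in observations if obs <= bound)
        lines.append(f'{name}_bucket{{le="{bound}"}} {count}')
    lines.append(f'{name}_bucket{{le="+Inf"}} {len(observations)}')
    lines.append(f"{name}_sum {sum(observations)}")
    lines.append(f"{name}_count {len(observations)}")
    return lines


text = "\n".join(
    counter_lines("http_requests_total", "Total HTTP requests.", 42,
                  {"method": "GET", "status": "200"})
    + histogram_lines("http_request_duration_seconds",
                      [0.05, 0.1, 0.25, 0.5], [0.03, 0.12, 0.4])
)
print(text)
```

Note that the 0.25 bucket counts both the 0.03 and 0.12 observations; that cumulative layout is what histogram_quantile() relies on.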

Labels, cardinality, and scrape configuration

Labels are how you slice metrics — method="GET", status="500", service="payments". They are also how Prometheus indexes data; every unique label combination creates a new time series. High-cardinality labels (user IDs, order IDs, unbounded path segments) can explode memory use and slow queries. Rule of thumb: keep total active series per job under low millions; alert on prometheus_tsdb_head_series growth.
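A quick back-of-the-envelope calculation shows why one unbounded label matters. The sketch below multiplies the number of distinct values per label to get a worst-case series count for a single metric; the label names and counts are invented for illustration, and real series counts are usually lower because not every combination occurs.

```python
from math import prod
from typing import Dict


def max_series(distinct_values: Dict[str, int]) -> int:
    """Worst-case number of time series for one metric name."""
    return prod(distinct_values.values())


bounded = {"method": 5, "route": 30, "status": 6}
with_user_id = {**bounded, "user_id": 50_000}

print(max_series(bounded))       # 900
print(max_series(with_user_id))  # 45000000
```

Adding a single user_id label turns a metric with hundreds of series into one with tens of millions.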

Scrape config lives in prometheus.yml or Kubernetes ServiceMonitor CRDs. Key fields:

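scrape_interval and scrape_timeout: balance freshness vs load; the timeout must not exceed the interval.
metrics_path: default /metrics; some apps use /actuator/prometheus.
relabel_configs: rewrite target labels before the scrape (set environment, filter discovered targets); metric_relabel_configs does the same for scraped samples before ingest (drop noisy labels or whole metrics).
kubernetes_sd_configs or file_sd_configs: dynamic target discovery instead of static IP lists.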
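Relabeling is easier to reason about once you see it as a small rule pipeline applied to a label set. The following Python sketch imitates four relabel actions (keep, drop, labeldrop, and a simplified replace with a literal replacement). It is an approximation under stated assumptions: Python's re module stands in for the RE2 syntax Prometheus uses, capture-group expansion is not implemented, and the rule dictionaries only mirror the shape of the YAML.

```python
import re
from typing import Dict, List, Optional


def apply_relabel(labels: Dict[str, str], rules: List[dict]) -> Optional[Dict[str, str]]:
    """Return the rewritten label set, or None if a rule drops it entirely."""
    labels = dict(labels)
    for rule in rules:
        action = rule["action"]
        pattern = re.compile(rule.get("regex", "(.*)"))
        if action == "labeldrop":
            # Remove every label whose *name* fully matches the regex.
            labels = {k: v for k, v in labels.items() if not pattern.fullmatch(k)}
            continue
        # Source label values are joined with ";" and matched as a whole string.
        value = ";".join(labels.get(name, "") for name in rule.get("source_labels", []))
        matched = pattern.fullmatch(value) is not None
        if action == "keep" and not matched:
            return None
        if action == "drop" and matched:
            return None
        if action == "replace" and matched:
            labels[rule["target_label"]] = rule["replacement"]  # literal only
    return labels


rules = [
    {"action": "drop", "source_labels": ["job"], "regex": "test-.*"},
    {"action": "labeldrop", "regex": "debug_.*"},
    {"action": "replace", "target_label": "environment", "replacement": "prod"},
]

print(apply_relabel({"job": "api", "debug_id": "abc"}, rules))
print(apply_relabel({"job": "test-runner"}, rules))
```

The first label set loses its debug_id label and gains environment="prod"; the second is dropped outright.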

For external uptime checks, point blackbox_exporter at URLs and scrape its metrics. For databases and queues, run dedicated exporters on the same network segment — never expose database ports to the public internet just for metrics.

PromQL essentials

PromQL (Prometheus Query Language) selects and aggregates time series. Instant vectors return one value per series at a point in time; range vectors cover a window (e.g. [5m]) for functions like rate().
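As a rough mental model of what rate() does with a range vector, the sketch below computes a per-second increase over a window of counter samples and treats any decrease as a counter reset. It is a simplification: the real function also extrapolates toward the window boundaries, which is omitted here, and the sample data is invented.

```python
from typing import List, Optional, Tuple


def simple_rate(samples: List[Tuple[float, float]]) -> Optional[float]:
    """samples: (timestamp_seconds, counter_value) pairs, oldest first."""
    if len(samples) < 2:
        return None  # a rate needs at least two samples in the window
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        if value < prev:
            increase += value  # counter reset: assume it restarted from zero
        else:
            increase += value - prev
        prev = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# 15 s scrapes; the process restarts between t=30 and t=45.
window = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
print(simple_rate(window))
```

The reset at t=45 contributes 10 rather than -150, so the result stays a sensible 100 requests over 60 seconds.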

Patterns you will use weekly:

Request rate: sum(rate(http_requests_total{job="api"}[5m])) by (method)
Error ratio: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
p99 latency: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
CPU per process: rate(process_cpu_seconds_total[5m]). For per-pod CPU on Kubernetes, use the cAdvisor metric instead: sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)
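The percentile pattern above is an estimate, not an exact value, and a small reimplementation makes the reason visible. This Python sketch follows the classic-histogram approach as commonly described: find the bucket containing the target rank, then interpolate linearly inside it. The bucket counts are invented, and edge cases the real function handles (negative bounds, missing +Inf bucket) are left out.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """buckets: (upper_bound, cumulative_count) pairs; the last bound is +Inf."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # cannot interpolate into +Inf
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


buckets = [(0.05, 40), (0.1, 100), (0.25, 160), (0.5, 190),
           (1.0, 197), (2.5, 200), (math.inf, 200)]
print(histogram_quantile(0.5, buckets))
print(histogram_quantile(0.99, buckets))
```

Here the p99 lands inside the wide 1 s to 2.5 s bucket, so the estimate depends entirely on the linear assumption; tighter buckets near the values you care about give tighter estimates.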

Use the offset modifier (for example, offset 1w) to compare week-over-week, and the @ modifier to pin evaluation to a fixed timestamp. Recording rules precompute expensive expressions (e.g. per-service error rates) on a schedule so dashboards and alerts stay fast. Store rules in separate files and validate with promtool check rules in CI.

Alerting with Alertmanager

Prometheus alerting rules evaluate PromQL on an interval; firing alerts go to Alertmanager, not directly to humans. Alertmanager handles:

Grouping — one notification per incident cluster, not one per pod.
Inhibition — suppress “high latency” when “service down” already fired.
Routing trees — severity, team, and environment labels route to different on-call rotations.
Silences and maintenance windows — scheduled deploys should not page the whole org.
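Grouping and inhibition from the list above can be sketched in a few lines of Python. This is a toy model, not Alertmanager's implementation: alerts are plain dictionaries, and the alert names, label names, and function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Alert = Dict[str, str]


def inhibit(alerts: List[Alert], source: str, target: str, equal: List[str]) -> List[Alert]:
    """Drop target alerts when a source alert with the same `equal` labels is firing."""
    sources = [a for a in alerts if a["alertname"] == source]

    def muted(alert: Alert) -> bool:
        return alert["alertname"] == target and any(
            all(alert.get(label) == s.get(label) for label in equal) for s in sources
        )

    return [a for a in alerts if not muted(a)]


def group_alerts(alerts: List[Alert], group_by: List[str]) -> Dict[Tuple[str, ...], List[Alert]]:
    """One notification per distinct combination of the group_by labels."""
    groups: Dict[Tuple[str, ...], List[Alert]] = defaultdict(list)
    for alert in alerts:
        groups[tuple(alert.get(label, "") for label in group_by)].append(alert)
    return dict(groups)


alerts = [
    {"alertname": "ServiceDown", "service": "api"},
    {"alertname": "HighLatency", "service": "api", "pod": "api-1"},
    {"alertname": "HighLatency", "service": "api", "pod": "api-2"},
    {"alertname": "HighLatency", "service": "worker", "pod": "worker-1"},
]

remaining = inhibit(alerts, source="ServiceDown", target="HighLatency", equal=["service"])
notifications = group_alerts(remaining, group_by=["alertname", "service"])
print(len(alerts), "alerts ->", len(notifications), "notifications")
```

Four firing alerts collapse into two notifications: the api latency alerts are inhibited by the outage alert for the same service, and the rest are grouped by alert name and service.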

Write alerts on user-visible symptoms (SLO burn rate, elevated 5xx) rather than causes (single pod CPU spike). Pair with liveness and readiness probes in Kubernetes: probes restart unhealthy pods or pull them out of rotation, while Prometheus tells you the fleet is still failing after restarts. Multi-window burn-rate alerts (the pattern from the Google SRE Workbook) catch both fast and slow SLO violations without alert fatigue.
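The arithmetic behind a burn-rate alert is short enough to show directly. In this sketch, burn rate is the observed error ratio divided by the error budget (1 minus the SLO), and a page requires both a long and a short window to exceed the threshold. The 99.9% SLO, the 30-day period, and the 14.4 threshold are illustrative values commonly quoted for this pattern, not settings taken from this guide.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are being spent."""
    return error_ratio / (1.0 - slo)


def budget_consumed(rate: float, window_hours: float, period_hours: float = 30 * 24) -> float:
    """Fraction of the period's error budget used if `rate` holds for the window."""
    return rate * window_hours / period_hours


def should_page(long_ratio: float, short_ratio: float, slo: float, threshold: float = 14.4) -> bool:
    """Fire only if both windows burn fast: the short window confirms it is still happening."""
    return burn_rate(long_ratio, slo) > threshold and burn_rate(short_ratio, slo) > threshold


SLO = 0.999
print(budget_consumed(14.4, window_hours=1))                   # about 0.02
print(should_page(long_ratio=0.02, short_ratio=0.03, slo=SLO))    # ongoing incident
print(should_page(long_ratio=0.02, short_ratio=0.0005, slo=SLO))  # already recovered
```

The second window is what keeps the alert from paging for an incident that ended several minutes ago.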

Worked example: Harbor Fleet observability rollout

Harbor Fleet runs three services on a single VPS behind nginx: a public API, a background worker, and a static marketing site. The team wants RED metrics (rate, errors, duration) per service without a full Kubernetes migration.

Instrument the API: add Prometheus middleware with a counter http_requests_total{method,route,status} and a histogram http_request_duration_seconds whose buckets are tuned to their 200 ms SLO (0.05, 0.1, 0.2, 0.25, 0.5, 1, 2.5), including a boundary at 0.2 so the threshold itself can be measured.
Scrape node_exporter on :9100 for disk, memory, and CPU. Disk alerts matter because SQLite backups land on the same volume.
Blackbox probe: synthetic GET to /health every 30 s from outside the VPS; catches nginx misconfig that in-process metrics miss.
Recording rules: job:http_requests:rate5m and job:http_errors:ratio5m pre-aggregated for Grafana rows.
Alerts: page if error ratio > 2% for 10 m; ticket if p99 > 500 ms for 30 m; warn on disk > 85% for 1 h.
Runbook links: each alert annotation includes a link to structured logs filtered by trace_id and the deploy version label.
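The "for 10 m" part of the paging rule above is what separates a blip from an incident. The sketch below models that behavior in Python: the condition must hold on every evaluation for the full duration before the alert fires, and any healthy evaluation resets the clock. The one-minute evaluation interval and the error-ratio series are assumptions made for the example.

```python
from typing import List, Tuple


def firing_times(evaluations: List[Tuple[int, float]], threshold: float, for_seconds: int) -> List[int]:
    """Return the timestamps at which the alert is firing (not merely pending)."""
    active_since = None
    fired = []
    for ts, value in evaluations:
        if value > threshold:
            if active_since is None:
                active_since = ts  # alert becomes pending
            if ts - active_since >= for_seconds:
                fired.append(ts)   # held long enough: firing
        else:
            active_since = None    # condition cleared: reset the clock
    return fired


# Error ratio evaluated every 60 s: a 3-minute spike, a recovery, then a sustained problem.
ratios = [0.01] * 2 + [0.05] * 3 + [0.01] * 2 + [0.04] * 12
evaluations = [(i * 60, ratio) for i, ratio in enumerate(ratios)]

fired = firing_times(evaluations, threshold=0.02, for_seconds=600)
print(fired)
```

The three-minute spike never pages; the sustained problem starting at t=420 pages once it has lasted ten minutes.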

After two weeks they discover worker queue depth (a gauge) predicts API latency better than CPU — they add a Grafana row and demote CPU alerts to informational. That feedback loop is why metrics belong in the same repo culture as application code.

Tooling decision table

Need	Reach for	Why
Self-hosted metrics on K8s or VPS	Prometheus + Grafana	Mature ecosystem, PromQL, huge exporter library, no per-host SaaS bill.
Managed metrics with long retention	Grafana Cloud, AWS AMP, Google Managed Prometheus	Ops offload; still PromQL-compatible in most offerings.
Full-stack APM with auto-instrumentation	Datadog, New Relic	Faster onboarding, higher cost, vendor lock-in on query DSL.
Ephemeral batch job metrics	Pushgateway (carefully) or native push via remote write	Pull model assumes long-lived targets; batch needs explicit design.
Log-only debugging	Loki or ELK — not Prometheus alone	Metrics show symptoms; logs carry stack traces and context.

Common pitfalls

Cardinality explosion: putting user_id or raw URL paths on HTTP metrics; use bounded route templates instead.
Graphing raw counters: always apply rate() or increase(), over a range at least a few times the scrape interval (4x is a common rule of thumb) so every window holds enough samples.
Double-counting on aggregates: sum() is right for per-replica counters, but summing a gauge that every replica reports about shared state (queue depth, database size) multiplies it by the replica count, and the multiplier shifts during rollouts. Use max or avg by (job) for shared-state gauges.
Alerting on causes, not symptoms: paging on “one pod restarted” while user-facing SLO is green trains on-call to ignore pages.
No retention plan: default 15-day local retention may be fine for ops; compliance may need remote write to object storage.
Scraping through auth-less public endpoints: metrics can leak internal topology; protect with network policy or mTLS.
Pushgateway as a cron buffer: stale metrics from finished jobs linger and mislead dashboards.
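For the retention pitfall, a rough sizing formula helps decide whether local disk is enough. The sketch multiplies retention time by ingested samples per second and by bytes per sample. The 2 bytes per sample figure is an assumed ballpark for compressed samples, and the series count and scrape interval are invented; measure your own ingestion rate before trusting the result.

```python
def samples_per_second(active_series: int, scrape_interval_s: float) -> float:
    """Each active series yields one sample per scrape."""
    return active_series / scrape_interval_s


def disk_bytes(retention_days: float, ingest_rate: float, bytes_per_sample: float = 2.0) -> float:
    """Approximate on-disk size: retention seconds * samples/s * bytes/sample."""
    return retention_days * 86_400 * ingest_rate * bytes_per_sample


rate = samples_per_second(active_series=100_000, scrape_interval_s=15)
size = disk_bytes(retention_days=15, ingest_rate=rate)
print(f"{size / 1e9:.2f} GB")
```

Doubling the series count or halving the scrape interval doubles the estimate, which ties retention planning back to the cardinality advice earlier in this guide.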

Production checklist

Define RED (or USE) metrics per service before writing alert rules.
Standardize metric and label names across teams; document allowed label keys.
Run promtool check config and promtool check rules in CI.
Set scrape intervals appropriate to SLO windows — sub-minute for user-facing APIs.
Pre-aggregate hot queries with recording rules; keep dashboard PromQL simple.
Route alerts through Alertmanager with grouping, inhibition, and runbook URLs.
Monitor Prometheus itself: TSDB head series, scrape failures, rule evaluation lag.
Align burn-rate alerts with documented SLOs and error-budget policies.
Pair metrics with traces/logs via exemplars or a shared request ID in log fields; keep trace_id out of metric labels, where it is unbounded.
Test alert paths quarterly — a silent PagerDuty integration is worse than no alert.

Key takeaways

Prometheus pulls metrics from HTTP endpoints on a schedule and stores labeled time series optimized for operational queries.
Use counters for totals, gauges for point-in-time values, and histograms for latency percentiles — avoid high-cardinality labels.
PromQL rate(), ratios, and histogram_quantile() power most dashboards and SLO alerts.
Alertmanager groups and routes symptoms; recording rules keep queries fast at scale.
Metrics are one pillar — combine with logs, traces, and health probes for complete observability.