Grafana explained

Your Prometheus stack is scraping metrics, Alertmanager is wired to Slack, and on-call still opens five browser tabs every time a burn-rate alert fires. Grafana is the open-source visualization and alerting platform that turns time-series data from Prometheus, logs from Loki, traces from Tempo, and dozens of other backends into dashboards humans can read in seconds. It does not replace your metrics database — it sits on top, querying data sources on demand and rendering panels, variables, and annotations that make incidents debuggable. This guide covers Grafana’s architecture, dashboard design patterns (RED and USE), data source configuration, variables and templating, unified alerting, dashboard-as-code provisioning, a Harbor Fleet SRE board worked example, a tooling decision table, common pitfalls, and a production checklist alongside our observability overview and SLO error-budget guide.

What Grafana is — and what it is not

Grafana is a query and visualization engine with a web UI, role-based access control, and (since Grafana 8) a built-in alerting scheduler. You connect one or more data sources (Prometheus, InfluxDB, PostgreSQL, Elasticsearch, CloudWatch, and many more), then build dashboards composed of panels (time series, stat, gauge, table, heatmap, logs). Each panel runs a query against a data source and maps the result to a visual.

Grafana is not a metrics store. It does not scrape endpoints or retain samples itself; long-term storage comes from a backend such as Grafana Mimir or the hosted stack in Grafana Cloud. It is also not a log indexer, so pair it with Loki or Elasticsearch for log panels. Think of Grafana as the control room glass: Prometheus (or your vendor APM) is the instrument panel wiring; Grafana is the layout that tells operators where to look first.

Core objects

  • Organization and folders — multi-tenant boundaries; folders group dashboards by team or environment.
  • Data source — connection config (URL, auth, default query settings) shared across dashboards.
  • Dashboard — JSON document of rows, panels, variables, and time range defaults.
  • Panel — one visualization with queries, transforms, thresholds, and unit formatting.
  • Alert rule — PromQL or SQL condition evaluated on a schedule; routes to contact points (Slack, PagerDuty, webhook).

Connecting Prometheus and other data sources

The most common setup: Grafana reads from a local or remote Prometheus server. Add a data source with the Prometheus base URL (e.g. http://prometheus:9090), set Scrape interval to match your Prometheus config (15 s or 60 s), and enable HTTP method POST for heavy range queries on large dashboards.

Best practices for data source hygiene:

  • One Prometheus per environment — separate data sources for prod and staging; never mix them on one dashboard without a variable guard.
  • Default min step — set to scrape interval so Grafana does not over-sample and slow Prometheus.
  • Exemplars and traces — when Prometheus stores trace IDs on histogram buckets, enable exemplars and link to Tempo or Jaeger for click-through from latency spikes.
  • Mixed data sources — a single dashboard can query Prometheus for rates and PostgreSQL for business KPIs; use transforms to join on time or label.
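As a rough illustration of the settings above, the following Python sketch builds one Prometheus data source payload per environment, in the shape accepted by Grafana's data source HTTP API (`POST /api/datasources`). The `prometheus-<env>` naming scheme and the staging URL are assumptions made for the example; `httpMethod` and `timeInterval` are the JSON fields behind the HTTP method and Scrape interval settings.

```python
import json


def prometheus_datasource(env: str, url: str, scrape_interval: str = "15s") -> dict:
    """Build a data source payload for one environment."""
    return {
        "name": f"prometheus-{env}",  # hypothetical naming scheme
        "type": "prometheus",
        "access": "proxy",  # queries go through the Grafana backend
        "url": url,
        "jsonData": {
            "httpMethod": "POST",             # POST for heavy range queries
            "timeInterval": scrape_interval,  # should match scrape_interval
        },
    }


# One data source per environment, never a shared one.
sources = [
    prometheus_datasource("prod", "http://prometheus:9090"),
    # The staging URL is an assumption for illustration only.
    prometheus_datasource("staging", "http://prometheus-staging:9090", "60s"),
]
print(json.dumps(sources, indent=2))
```

Keeping the payloads in a small generator like this makes it harder for prod and staging settings to drift apart by hand.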

For logs, add Loki with LogQL; for traces, Tempo or Jaeger. Grafana’s Explore view is ad-hoc query mode — use it during incidents, then promote stable queries into dashboard panels.

Dashboard design: RED, USE, and hierarchy

Good dashboards answer one question per row. The RED method (Rate, Errors, Duration) suits request-driven services; USE (Utilization, Saturation, Errors) suits infrastructure like CPU, disk, and network. Start with a summary row — big stat panels for current error ratio, p99 latency, and requests per second — then drill-down rows by dependency, region, or deployment version.

Panel types that earn their pixels

  • Time series — default for rates and latency; use stacked mode sparingly (hard to read past three series).
  • Stat — single number with sparkline; ideal for SLO compliance percentage.
  • Gauge — utilization with thresholds (green/yellow/red); cap at 100% for CPU, not for unbounded queue depth.
  • Heatmap — latency bucket distribution over time; surfaces tail shifts better than a lone p99 line.
  • Table — top-N slow endpoints or highest error contributors; add data links to runbooks.

Avoid chart junk: more than twelve series on one panel, 3D effects, and duplicate panels that show the same PromQL with different colors. Keep every row on the dashboard time picker rather than per-panel time overrides, use $__interval or $__rate_interval so query resolution follows the selected range, and set sensible defaults: last 1 h for APIs, last 24 h for batch pipelines. Annotate deploys with Grafana annotations (API or CI webhook) so latency jumps correlate with release markers.
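A deploy annotation is a small JSON body sent to Grafana's annotations endpoint (`POST /api/annotations`). The sketch below only builds that body; the tag names, message text, and the `fleet-sre` dashboard UID are assumptions for illustration, and the actual HTTP call with an API token is left to your CI tooling.

```python
import json
import time


def deploy_annotation(service, version, dashboard_uid=None, when=None):
    """Build the JSON body for a deploy marker."""
    ts = time.time() if when is None else when
    body = {
        "time": int(ts * 1000),      # Grafana expects epoch milliseconds
        "tags": ["deploy", service],  # hypothetical tag convention
        "text": f"Deployed {service} {version}",
    }
    if dashboard_uid is not None:
        # Without a dashboard UID the annotation is organization-wide.
        body["dashboardUID"] = dashboard_uid
    return body


example = deploy_annotation("api", "v1.4.2", dashboard_uid="fleet-sre",
                            when=1700000000)
print(json.dumps(example, indent=2))
```

Filtering an annotation query on the `deploy` tag then draws the markers on whichever panels enable it.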

Variables, templating, and reusable dashboards

Hard-coding job="api-prod" in every panel creates forked dashboards per service. Dashboard variables parameterize queries instead:

  • Query variable — label_values(up, job) populates a service dropdown from Prometheus.
  • Custom variable — fixed environment list (prod,staging).
  • Chained variables — region filters instance list: label_values(up{region="$region"}, instance).
  • Multi-select and Include All — compare shards or aggregate with sum by (job) when All is selected.
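In dashboard JSON, the variable types listed above live under `templating.list`. The following sketch generates such a list in Python. The field names follow the dashboard JSON model as commonly exported, but they can differ between Grafana versions, so treat this as an approximation and compare it against an export from your own instance. The `region` variable is an assumption added so that the chained `instance` query has something to reference.

```python
import json


def query_variable(name, query, multi=False, include_all=False):
    """A variable whose options come from a data source query."""
    return {
        "name": name,
        "type": "query",
        "query": query,
        "refresh": 1,  # re-run the query on dashboard load
        "multi": multi,
        "includeAll": include_all,
    }


def custom_variable(name, values):
    """A variable with a fixed, comma-separated option list."""
    return {"name": name, "type": "custom", "query": ",".join(values)}


templating = {
    "list": [
        custom_variable("environment", ["prod", "staging"]),
        query_variable("service", "label_values(up, job)",
                       multi=True, include_all=True),
        query_variable("region", "label_values(up, region)"),
        # Chained: the instance list depends on the selected region.
        query_variable("instance",
                       'label_values(up{region="$region"}, instance)'),
    ]
}
print(json.dumps(templating, indent=2))
```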

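Use the ${var:regex} format when a multi-value variable is interpolated into a regex matcher (=~), so that special characters in the selected values are escaped. Note that Grafana-managed alert rules do not support dashboard template variables in their queries, so alert and recording rules need explicit label matchers. Name variables consistently across the org (service, environment, cluster) so on-call muscle memory transfers between teams. For golden-signal templates, publish one canonical RED dashboard in a Git repo and provision it to every cluster rather than letting each team clone and drift.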

PromQL in Grafana: queries that stay fast

Grafana sends range queries to Prometheus with a calculated step. Heavy dashboards can overload Prometheus during incidents — exactly when you need them most. Mitigations:

  • Prefer recording rules in Prometheus for expensive expressions; panels query job:http_requests:rate5m instead of raw counters.
  • Set panel Min interval to at least the scrape interval.
  • Use Instant queries for stat panels (current value) instead of range queries over the full dashboard window.
  • Limit topk() in table panels — top 10 slow routes, not top 1000.
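To see why Min interval matters, it helps to approximate how the step of a range query is derived. The sketch below is a simplification: it divides the time range by the panel's maximum data points and never goes below the minimum interval, whereas Grafana additionally snaps the result to rounded values. The function names and the default of 1000 data points are assumptions for illustration; the parameter names `query`, `start`, `end`, and `step` are those of the Prometheus `/api/v1/query_range` endpoint.

```python
import math
from urllib.parse import urlencode


def query_step(range_seconds, max_data_points, min_interval_seconds):
    """Approximate step: range / points, floored at the min interval."""
    raw = range_seconds / max_data_points
    return max(math.ceil(raw), min_interval_seconds)


def range_query_params(expr, start, end, max_data_points=1000, min_interval=15):
    """Parameters for a Prometheus /api/v1/query_range request."""
    step = query_step(end - start, max_data_points, min_interval)
    return {"query": expr, "start": start, "end": end, "step": step}


def samples_per_series(params):
    """How many points Prometheus evaluates per series for this query."""
    return (params["end"] - params["start"]) // params["step"] + 1


one_hour = range_query_params("job:http_requests:rate5m", 0, 3600)
one_week = range_query_params("job:http_requests:rate5m", 0, 7 * 86400)
print(urlencode(one_hour), samples_per_series(one_hour))
print(urlencode(one_week), samples_per_series(one_week))
```

With a 15 s minimum interval, the one-hour window is clamped to a 15 s step instead of the 4 s the raw division would give, which is the over-sampling the Min interval setting is meant to prevent.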

Legend templates ({{method}} {{status}}) and unit overrides (seconds (s), percent (0–100)) belong in dashboard JSON so every viewer sees consistent formatting. Use Transformations to filter, rename, and join query results without pushing complexity into PromQL when SQL-style joins are clearer.

Alerting: Grafana unified alerts vs Alertmanager

Teams run alerts in two places: Prometheus rules → Alertmanager, and Grafana managed alerts. Pick one primary path to avoid duplicate pages.

  • Alertmanager — best when all rules are PromQL on Prometheus data; mature grouping, inhibition, and silences.
  • Grafana unified alerting — multi-data-source conditions (Prometheus + Loki + SQL), visual rule builder, contact points in Grafana UI.

Whichever you choose, alert on symptoms tied to SLOs — error budget burn, elevated 5xx ratio — not every threshold twitch. Grafana alert rules support pending periods (fire only if condition holds for N minutes) and no-data handling (alert when metrics stop, which often means scrape failure). Link annotations to runbooks and health-check docs so the first responder knows whether to restart pods or roll back a deploy.
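The pending period can be pictured as a small state machine. The sketch below is a simplified model, not Grafana's implementation: it ignores no-data and error states and assumes evenly spaced evaluations. It is only meant to show why a single threshold twitch does not page anyone.

```python
def alert_states(breaches, eval_interval_s, pending_s):
    """Map per-evaluation condition results to alert states.

    breaches: list of booleans, one per evaluation (True = condition met).
    Returns "Normal", "Pending", or "Alerting" for each evaluation.
    """
    states = []
    held_for = None  # seconds the condition has held without a break
    for breached in breaches:
        if not breached:
            held_for = None  # any healthy evaluation resets the timer
            states.append("Normal")
            continue
        held_for = 0 if held_for is None else held_for + eval_interval_s
        states.append("Alerting" if held_for >= pending_s else "Pending")
    return states


# Evaluate every 60 s with a 180 s pending period.
timeline = alert_states(
    [False, True, True, False, True, True, True, True],
    eval_interval_s=60,
    pending_s=180,
)
print(timeline)
```

The short blip in the second and third evaluations never leaves Pending; only the sustained breach at the end reaches Alerting.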

Dashboard-as-code and provisioning

ClickOps dashboards diverge. Treat dashboard JSON like application code:

  • Export dashboards to Git via the Grafana API, or manage them with the grafana_dashboard resource in the grafana/grafana Terraform provider.
  • Mount provisioning YAML in /etc/grafana/provisioning/dashboards/ pointing at a ConfigMap or volume of JSON files.
  • Use Jsonnet with the Grafonnet library to generate per-service variants from one template.
  • Review dashboard changes in PRs; diff the JSON or run a linter such as grafana/dashboard-linter in CI.
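The "one template, many variants" idea does not depend on Jsonnet. As a minimal sketch, the Python below plays the same role: one function produces the dashboard JSON for any service, and the output is serialized deterministically so that Git diffs stay small. The `red-<service>` UID scheme, the titles, and the single panel are assumptions for illustration; a real template would carry the full RED rows.

```python
import json


def red_dashboard(service: str) -> dict:
    """Build a minimal RED dashboard for one service."""
    selector = f'job="{service}"'
    return {
        "uid": f"red-{service}",  # hypothetical UID scheme
        "title": f"{service} / RED",
        "tags": ["red", "generated"],
        "panels": [
            {
                "id": 1,
                "type": "stat",
                "title": "Requests per second",
                "targets": [
                    {"expr": f"sum(rate(http_requests_total{{{selector}}}[5m]))"}
                ],
            }
        ],
    }


def render(service: str) -> str:
    """Serialize with sorted keys so repeated runs produce identical files."""
    return json.dumps(red_dashboard(service), indent=2, sort_keys=True) + "\n"


# One JSON document per service, ready to write into the provisioned folder.
rendered = {f"{name}.json": render(name) for name in ["api", "worker", "site"]}
print(rendered["api.json"])
```

Writing each entry of `rendered` to the directory referenced by the provisioning YAML, and committing the result, gives reviewers a plain JSON diff for every template change.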

Inject data source URLs and credentials through environment variables or a secrets manager rather than committing them in plaintext. In Kubernetes, run Grafana as a Deployment with a persistent volume for SQLite (small teams) or an external PostgreSQL database for HA. Sync LDAP or OAuth SSO so dashboard permissions mirror your org structure.

Worked example: Harbor Fleet SRE board

Harbor Fleet already runs Prometheus on a VPS behind nginx. They add Grafana in Docker on the same host, provision a Prometheus data source, and ship one dashboard repo.

  1. Variables — service from label_values(up{job=~"api|worker|site"}, job); interval custom (5m,15m,1h).
  2. Row 1 — Golden signals — stat panels: RPS (sum(rate(http_requests_total{job="$service"}[$interval]))), error ratio, p99 latency from histogram quantile.
  3. Row 2 — Dependencies — time series of worker queue depth vs API latency on dual axis; reveals backlog-driven slowdowns.
  4. Row 3 — Node health — USE panels from node_exporter: CPU utilization, disk saturation, network errors.
  5. Row 4 — Logs — Loki panel filtered {job="$service"} |= "error" with JSON parsed fields; linked from spike annotations.
  6. Annotations — CI posts deploy events via Grafana API; vertical markers on latency row.
  7. Alerts — Grafana rule: error ratio > 2% for 10 m routes to Slack #fleet-oncall; Prometheus keeps burn-rate rules in Alertmanager to avoid duplicate pages.
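The golden-signal queries in Row 1 can be written out as follows. Only the RPS expression appears in the steps above. The error ratio and p99 expressions are sketches that assume a `status` label on `http_requests_total` and a histogram named `http_request_duration_seconds`; Harbor Fleet's actual metric and label names may differ.

```python
def rps(service="$service", window="$interval"):
    """Requests per second, as used in the Row 1 stat panel."""
    return f'sum(rate(http_requests_total{{job="{service}"}}[{window}]))'


def error_ratio(service="$service", window="$interval"):
    """Share of 5xx responses; assumes a `status` label (hypothetical)."""
    errors = (
        f'sum(rate(http_requests_total{{job="{service}",status=~"5.."}}'
        f"[{window}]))"
    )
    return f"{errors} / {rps(service, window)}"


def p99_latency(service="$service", window="$interval"):
    """p99 from histogram buckets; the metric name is an assumption."""
    return (
        "histogram_quantile(0.99, sum by (le) (rate("
        f'http_request_duration_seconds_bucket{{job="{service}"}}[{window}])))'
    )


for build in (rps, error_ratio, p99_latency):
    print(build())
```

Leaving `$service` and `$interval` as the defaults yields the templated panel queries. Passing concrete values such as `"api"` and `"5m"` yields expressions suitable for an alert rule, where dashboard variables are not available.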

After one incident, they add a drill-down link from the error stat panel to a pre-filtered Explore view with the same labels. Mean time to triage drops because the dashboard tells a story top-to-bottom instead of dumping raw metrics.

Tooling decision table

Need | Reach for | Why
Open-source metrics + dashboards on your VPS or K8s | Prometheus + Grafana | Industry default, PromQL, huge community dashboards on grafana.com.
Managed Grafana with long retention and SSO | Grafana Cloud | Ops offload; hosted Mimir/Loki/Tempo stack with unified billing.
Log-centric operations with metrics secondary | Grafana + Loki, or Kibana + Elasticsearch | Loki pairs natively with Grafana; Kibana wins if Elasticsearch is already your primary store.
Full SaaS APM with auto-instrumentation | Datadog, New Relic, Dynatrace | Faster onboarding, integrated alerts, higher cost and query lock-in.
Business BI on warehouse data | Metabase, Looker, Superset | SQL-first analytics for product and finance; not a substitute for ops dashboards.
Embedded analytics in a customer-facing product | Custom charts or embedded Grafana/IFrame with auth | Ops dashboards are too noisy for end users; scope permissions carefully.

Common pitfalls

  • Dashboard sprawl — hundreds of unmaintained clones; enforce one template per service type.
  • Over-querying Prometheus — dashboards with 40 panels each firing range queries; use recording rules and reduce refresh rate on TV boards.
  • Dual alerting paths — same condition in Alertmanager and Grafana; pick one owner per signal.
  • Variables that break “All” — regex not escaped; test multi-select with Include All option.
  • Wrong units — latency shown as milliseconds when Prometheus exports seconds; always set unit overrides.
  • No on-call runbook links — pretty graphs without “what to do next” waste minutes during outages.
  • Public Grafana without auth — dashboards expose internal topology; enforce SSO or VPN.
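The "All" pitfall above usually comes down to how several selected values are turned into one regular expression. The sketch below approximates the regex variable format in Python: each value is escaped and the values are joined into an alternation. It uses `re.escape`, which escapes more characters than Grafana does, and it does not reproduce the extra escaping the Prometheus data source applies for PromQL string literals, so read it as an illustration of the idea rather than of Grafana's exact output. The service names are made up for the example.

```python
import re


def regex_format(values):
    """Join selected values into one escaped regex alternation."""
    escaped = [re.escape(v) for v in values]
    if len(escaped) == 1:
        return escaped[0]
    return "(" + "|".join(escaped) + ")"


selected = ["api.internal", "worker"]  # hypothetical job names
pattern = regex_format(selected)
print(pattern)

# Escaped: the dot only matches a literal dot.
print(bool(re.fullmatch(pattern, "api.internal")))  # exact name matches
print(bool(re.fullmatch(pattern, "apiXinternal")))  # look-alike does not

# Unescaped: the dot matches any character, so look-alikes slip through.
naive = "(" + "|".join(selected) + ")"
print(bool(re.fullmatch(naive, "apiXinternal")))
```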

Production checklist

  • Define one golden RED (or USE) dashboard template per service class.
  • Provision data sources and dashboards from Git; no manual-only production edits.
  • Align Grafana scrape/min interval with Prometheus scrape_interval.
  • Pre-aggregate hot PromQL in recording rules; keep panel queries simple.
  • Standardize variables: environment, service, cluster.
  • Annotate deploys and config changes automatically from CI.
  • Document whether Alertmanager or Grafana owns each alert route.
  • Enable SSO and folder-level RBAC; audit viewer vs editor roles.
  • Test contact points quarterly — broken Slack webhooks fail silently.
  • Review dashboard load on Prometheus during game days; throttle refresh on wall displays.

Key takeaways

  • Grafana visualizes metrics and logs from many backends; it does not store time series itself.
  • Design dashboards around RED or USE with a summary row and drill-down hierarchy.
  • Variables and provisioning keep dashboards DRY and reviewable in Git.
  • Pair Grafana with Prometheus recording rules to keep queries fast under incident load.
  • Choose one primary alerting path and link every panel to actionable runbooks.