Guide

ClickHouse explained

Your product dashboard runs SELECT date, COUNT(*) FROM events GROUP BY 1 against PostgreSQL. At 200 million rows it takes 45 seconds and locks the replica. You add read replicas, then caching, then pre-aggregated tables maintained by cron — each layer adds lag and operational debt. ClickHouse is an open-source, column-oriented OLAP database designed for exactly this workload: append-heavy event streams, sub-second aggregations over billions of rows, and high-cardinality GROUP BY queries that crush row stores. Originally built at Yandex for web analytics, ClickHouse powers observability backends, product analytics, ad-tech reporting, and financial tick data worldwide. This guide covers columnar storage and the MergeTree engine family, primary key vs ORDER BY design, partitioning and TTL, materialized views, replication and sharding, ingest patterns, a Harbor Fleet clickstream worked example, a database decision table, common pitfalls, and a production checklist alongside our DuckDB guide, data warehouse overview, and ETL/ELT pipelines guide.

What ClickHouse is — and what it is not

ClickHouse is a distributed columnar database optimized for analytical queries (OLAP), not transactional row updates (OLTP). Data is stored by column, compressed with codecs tuned per type (LZ4, ZSTD, Delta, Gorilla for timestamps), and scanned in vectorized batches across CPU cores. Inserts append immutable parts; background merges compact small parts into larger ones — there is no in-place row update like PostgreSQL's heap page.

ClickHouse is not a replacement for PostgreSQL as your system of record. Use it as an analytics sink: events, metrics, logs, and denormalized reporting tables fed by CDC, Kafka, or batch ELT jobs. Mutations (ALTER TABLE ... UPDATE) exist but are asynchronous and expensive — design schemas for append and aggregate, not frequent row edits.

Core concepts

  • Table engine — storage and behavior contract; MergeTree family is the default for analytics.
  • Part — immutable on-disk chunk created per insert batch; merged in the background.
  • Primary key / ORDER BY — sparse index over sorted data; determines data layout and prune efficiency (not a uniqueness constraint).
  • Partition — logical slice (often by month) enabling partition-level drops and targeted merges.
  • Granule — default 8192-row block indexed by the primary key; min/max statistics skip irrelevant granules.
  • Materialized view — trigger that writes transformed rows to a target table on each insert.
  • Replication — ZooKeeper or ClickHouse Keeper coordinates identical copies via ReplicatedMergeTree.

MergeTree engines: the foundation

MergeTree is the workhorse. On insert, ClickHouse sorts rows by the ORDER BY expression, writes a new part, and updates the sparse primary-key index. Queries with predicates matching the sort key prefix skip granules whose min/max bounds exclude the filter range — the same zone-map idea as DuckDB and Parquet, but managed continuously as data arrives.

CREATE TABLE events (
  event_time DateTime,
  user_id UInt64,
  event_type LowCardinality(String),
  properties String
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_time)
ORDER BY (event_type, event_time, user_id)
TTL event_time + INTERVAL 90 DAY;

Specialized variants solve common patterns without application-side dedup logic:

  • ReplacingMergeTree — keeps the latest row per sort key during merges (eventual dedup; use FINAL or argMax in queries until merge completes).
  • SummingMergeTree — sums numeric columns for matching keys during merges; ideal for pre-aggregated counters.
  • AggregatingMergeTree — stores intermediate aggregate states (AggregateFunction) for rollups at query time.
  • CollapsingMergeTree / VersionedCollapsingMergeTree — sign column (+1/-1) or version for change-data streams.

Designing ORDER BY for your queries

Put the highest-selectivity filter columns first in ORDER BY, then time. If dashboards always filter WHERE event_type = 'purchase' and group by day, leading with event_type beats leading with user_id. Low-cardinality dimensions benefit from the LowCardinality(String) type — dictionary encoding shrinks storage and speeds GROUP BY. Test with EXPLAIN indexes = 1 to see how many granules a query reads.

Partitioning, TTL, and storage management

PARTITION BY toYYYYMM(event_time) creates one partition per calendar month. Benefits: drop old data with ALTER TABLE DROP PARTITION instead of expensive DELETE scans; restrict merges to hot partitions; align backups with business retention. Avoid ultra-granular partitions (per day on low-volume tables) — each partition carries metadata overhead and too many small parts slow merges.

TTL rules move rows to cold storage (TO VOLUME) or delete them when timestamps expire — automating retention without cron jobs. Combine TTL with tiered storage: NVMe for last 7 days, S3-backed volume for 90-day archive. Compression codecs per column further cut disk: ZSTD on JSON strings, Delta on monotonic IDs, DoubleDelta on timestamps.

Merge tuning

Background merges are ClickHouse's compaction loop. Watch system.parts for part count — hundreds of tiny parts per partition mean merge backlog and slower queries. Batch inserts (10k–100k rows per block) and avoid single-row inserts from application code. The max_parts_in_total and parts_to_throw_insert settings protect against runaway part explosion.

Materialized views and real-time rollups

Materialized views in ClickHouse are insert triggers, not stored query results like PostgreSQL. When rows land in a source table, the MV's SELECT transforms and inserts into a target table — enabling real-time dashboards without scanning raw events every time.

CREATE TABLE events_daily (
  day Date,
  event_type LowCardinality(String),
  event_count UInt64
)
ENGINE = SummingMergeTree()
ORDER BY (day, event_type);

CREATE MATERIALIZED VIEW events_daily_mv TO events_daily AS
SELECT toDate(event_time) AS day,
       event_type,
       count() AS event_count
FROM events
GROUP BY day, event_type;

New inserts into events automatically populate events_daily. For funnel and session metrics, chain MVs or use AggregatingMergeTree with uniqState / quantileState for approximate distinct counts at scale. Backfill historical data with INSERT INTO events_daily SELECT ... after creating the MV — MVs do not retroactively process existing rows.

Ingest, replication, and distributed queries

Ingest patterns

  • Batch INSERT — preferred; use FORMAT JSONEachRow, Native, or Parquet over HTTP/Native protocol.
  • Kafka engine table — ClickHouse consumes a topic directly; MV transforms into MergeTree.
  • S3 / GCS table functionsINSERT INTO t SELECT * FROM s3(...) for lake ingestion, similar to DuckDB's file queries but cluster-scale.
  • clickhouse-client--max_insert_block_size and async inserts for high-throughput pipelines.

Use async_insert when many small writers hit one node — the server buffers and flushes larger parts. For exactly-once semantics from Kafka, pair with idempotent consumers and ReplacingMergeTree keyed on event ID.

Replication and sharding

ReplicatedMergeTree keeps N copies in sync via a coordination service (ClickHouse Keeper — a ZooKeeper-compatible embedded consensus layer). Sharding spreads data across nodes with a Distributed table that routes inserts and fans out SELECTs. The typical cluster layout: 3 shards × 2 replicas for both throughput and fault tolerance. Distributed joins are expensive — prefer denormalized wide tables or pre-joined materialized views for dashboard queries.

Worked example: Harbor Fleet clickstream analytics

Harbor Fleet's web storefront emits JSON events to Kafka: (event_time, session_id, user_id, page, action, revenue). The SRE team needs a Grafana dashboard with page views per minute, conversion rate, and 90-day retention without hammering PostgreSQL.

CREATE TABLE clickstream (
  event_time DateTime,
  session_id UUID,
  user_id UInt64,
  page LowCardinality(String),
  action LowCardinality(String),
  revenue Decimal(10,2)
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/clickstream', '{replica}')
PARTITION BY toYYYYMM(event_time)
ORDER BY (page, event_time)
TTL event_time + INTERVAL 90 DAY;

-- Kafka ingest (simplified)
CREATE TABLE clickstream_kafka ENGINE = Kafka
  SETTINGS kafka_broker_list = 'kafka:9092',
           kafka_topic_list = 'harbor.clicks',
           kafka_format = 'JSONEachRow';

CREATE MATERIALIZED VIEW clickstream_kafka_mv TO clickstream AS
  SELECT * FROM clickstream_kafka;

-- Dashboard query: conversion rate last hour
SELECT page,
       countIf(action = 'view') AS views,
       countIf(action = 'purchase') AS purchases,
       purchases / views AS conversion
FROM clickstream
WHERE event_time >= now() - INTERVAL 1 HOUR
GROUP BY page
ORDER BY views DESC
LIMIT 20;

At 50k events per second, three ClickHouse nodes sustain sub-200ms dashboard queries because only the page and event_time columns are read for this aggregation. Revenue totals use a separate SummingMergeTree MV keyed by day. Alerts hook into Grafana via the ClickHouse data source; slow-query logging feeds Prometheus metrics on merge lag and insert throughput.

Decision table: ClickHouse vs alternatives

Need Reach for Why
Billions of events, real-time dashboards, cluster scale ClickHouse Columnar MergeTree, MV rollups, replication, proven at web-analytics scale
Local ad hoc analytics on Parquet/CSV files DuckDB Embedded, no cluster ops, ideal for laptop and CI
Managed multi-tenant warehouse, SQL BI at enterprise scale Snowflake / BigQuery Separation of storage/compute, governance, less ops — higher cost
Transactional CRUD, foreign keys, row-level locking PostgreSQL OLTP-native; use as source, not analytics sink at billion-row scale
Full-text search and log exploration with relevance scoring Elasticsearch Inverted indexes and BM25; aggregations slower than ClickHouse at extreme cardinality
Time-series metrics with Prometheus ecosystem Prometheus / VictoriaMetrics Pull model and PromQL; ClickHouse wins for rich dimensional event analytics

Query optimization habits

ClickHouse rewards the same discipline as our SQL optimization guide, with OLAP-specific twists:

  • Filter on partition key and ORDER BY prefix columns first.
  • Avoid SELECT * — read only columns you aggregate.
  • Prefer uniqCombined or uniqHLL12 over exact COUNT(DISTINCT) on high-cardinality keys.
  • Use PREWHERE for selective predicates — ClickHouse evaluates them before reading other columns.
  • Sample large tables with SAMPLE 0.1 for exploratory queries.
  • Check system.query_log for bytes read and mark whether indexes were used.

Common pitfalls

  • Single-row inserts — creates part explosion and merge storms; batch or use async_insert.
  • Wrong ORDER BY — leading with high-cardinality user_id when queries filter by event_type forces full scans.
  • Too many partitions — daily partitions on low-volume tables bloat metadata and slow startup.
  • Expecting immediate dedup — ReplacingMergeTree deduplicates at merge time; query with FINAL or argMax only when needed (FINAL is expensive).
  • Mutations for frequent updates — rewrites parts asynchronously; use CollapsingMergeTree or external OLTP instead.
  • Distributed JOIN across shards — shuffles massive datasets; denormalize or pre-aggregate.
  • Ignoring replication lag — monitor system.replicas for queue backlog after node failures.
  • No TTL or partition drops — disk fills; merges cannot keep up with ingest on full volumes.

Production checklist

  • Choose MergeTree variant matching dedup/rollup needs; document ORDER BY rationale against top 10 queries.
  • Partition by month (or coarser) on time-series tables; set TTL aligned with compliance retention.
  • Deploy ReplicatedMergeTree with Keeper quorum; test failover and replica catch-up.
  • Ingest via Kafka engine or batched inserts (10k+ rows); enable async_insert for small writers.
  • Create materialized views for dashboard aggregates; backfill historical rollups explicitly.
  • Configure per-column codecs; tier cold data to object storage if supported.
  • Monitor parts count, merge mutations, disk usage, and query bytes read in system tables.
  • Restrict mutations; route OLTP writes to PostgreSQL, analytics reads to ClickHouse.
  • Wire Grafana or BI tools to read replicas; cap concurrent heavy queries with quotas.
  • Document schema migrations (ADD COLUMN with defaults is cheap; type changes are not).

Key takeaways

  • ClickHouse is a columnar OLAP engine for append-heavy analytics — not a PostgreSQL replacement for transactions.
  • MergeTree ORDER BY and partitioning determine query speed; design them from real dashboard filters, not defaults.
  • Materialized views deliver real-time rollups by transforming inserts into pre-aggregated tables.
  • Batch ingest, monitor part counts, and use TTL — operational health is merge health.
  • Pair ClickHouse with DuckDB for local exploration and PostgreSQL for system-of-record; each tier has a clear role.

Related reading