Guide
ClickHouse explained
Your product dashboard runs SELECT date, COUNT(*) FROM events GROUP BY 1
against PostgreSQL. At 200 million rows it takes 45 seconds and locks the
replica. You add read replicas, then caching, then pre-aggregated tables
maintained by cron — each layer adds lag and operational debt. ClickHouse
is an open-source, column-oriented OLAP database designed for exactly this
workload: append-heavy event streams, sub-second aggregations over billions of
rows, and high-cardinality GROUP BY queries that crush row stores. Originally
built at Yandex for web analytics, ClickHouse powers observability backends,
product analytics, ad-tech reporting, and financial tick data worldwide. This
guide covers columnar storage and the MergeTree engine family, primary key vs
ORDER BY design, partitioning and TTL, materialized views,
replication and sharding, ingest patterns, a Harbor Fleet clickstream worked
example, a database decision table, common pitfalls, and a production
checklist alongside our
DuckDB guide,
data warehouse overview,
and
ETL/ELT pipelines guide.
What ClickHouse is — and what it is not
ClickHouse is a distributed columnar database optimized for analytical queries (OLAP), not transactional row updates (OLTP). Data is stored by column, compressed with codecs tuned per type (LZ4, ZSTD, Delta, Gorilla for timestamps), and scanned in vectorized batches across CPU cores. Inserts append immutable parts; background merges compact small parts into larger ones — there is no in-place row update like PostgreSQL's heap page.
ClickHouse is not a replacement for
PostgreSQL
as your system of record. Use it as an analytics sink: events, metrics, logs,
and denormalized reporting tables fed by
CDC,
Kafka, or batch
ELT jobs.
Mutations (ALTER TABLE ... UPDATE) exist but are asynchronous and
expensive — design schemas for append and aggregate, not frequent row edits.
Core concepts
- Table engine — storage and behavior contract;
MergeTreefamily is the default for analytics. - Part — immutable on-disk chunk created per insert batch; merged in the background.
- Primary key / ORDER BY — sparse index over sorted data; determines data layout and prune efficiency (not a uniqueness constraint).
- Partition — logical slice (often by month) enabling partition-level drops and targeted merges.
- Granule — default 8192-row block indexed by the primary key; min/max statistics skip irrelevant granules.
- Materialized view — trigger that writes transformed rows to a target table on each insert.
- Replication — ZooKeeper or ClickHouse Keeper coordinates identical copies via
ReplicatedMergeTree.
MergeTree engines: the foundation
MergeTree is the workhorse. On insert, ClickHouse sorts rows by
the ORDER BY expression, writes a new part, and updates the sparse
primary-key index. Queries with predicates matching the sort key prefix skip
granules whose min/max bounds exclude the filter range — the same zone-map
idea as
DuckDB
and Parquet, but managed continuously as data arrives.
CREATE TABLE events (
event_time DateTime,
user_id UInt64,
event_type LowCardinality(String),
properties String
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_time)
ORDER BY (event_type, event_time, user_id)
TTL event_time + INTERVAL 90 DAY;
Specialized variants solve common patterns without application-side dedup logic:
- ReplacingMergeTree — keeps the latest row per sort key during merges (eventual dedup; use
FINALor argMax in queries until merge completes). - SummingMergeTree — sums numeric columns for matching keys during merges; ideal for pre-aggregated counters.
- AggregatingMergeTree — stores intermediate aggregate states (
AggregateFunction) for rollups at query time. - CollapsingMergeTree / VersionedCollapsingMergeTree — sign column (+1/-1) or version for change-data streams.
Designing ORDER BY for your queries
Put the highest-selectivity filter columns first in
ORDER BY, then time. If dashboards always filter
WHERE event_type = 'purchase' and group by day, leading with
event_type beats leading with user_id. Low-cardinality
dimensions benefit from the LowCardinality(String) type — dictionary
encoding shrinks storage and speeds GROUP BY. Test with
EXPLAIN indexes = 1 to see how many granules a query reads.
Partitioning, TTL, and storage management
PARTITION BY toYYYYMM(event_time) creates one partition per calendar
month. Benefits: drop old data with ALTER TABLE DROP PARTITION
instead of expensive DELETE scans; restrict merges to hot partitions; align
backups with business retention. Avoid ultra-granular partitions (per day on
low-volume tables) — each partition carries metadata overhead and too many
small parts slow merges.
TTL rules move rows to cold storage (TO VOLUME) or
delete them when timestamps expire — automating retention without cron jobs.
Combine TTL with tiered storage: NVMe for last 7 days, S3-backed volume for
90-day archive. Compression codecs per column further cut disk: ZSTD on JSON
strings, Delta on monotonic IDs, DoubleDelta on timestamps.
Merge tuning
Background merges are ClickHouse's compaction loop. Watch
system.parts for part count — hundreds of tiny parts per partition
mean merge backlog and slower queries. Batch inserts (10k–100k rows per block)
and avoid single-row inserts from application code. The
max_parts_in_total and parts_to_throw_insert settings
protect against runaway part explosion.
Materialized views and real-time rollups
Materialized views in ClickHouse are insert triggers, not stored
query results like PostgreSQL. When rows land in a source table, the MV's
SELECT transforms and inserts into a target table — enabling
real-time dashboards without scanning raw events every time.
CREATE TABLE events_daily (
day Date,
event_type LowCardinality(String),
event_count UInt64
)
ENGINE = SummingMergeTree()
ORDER BY (day, event_type);
CREATE MATERIALIZED VIEW events_daily_mv TO events_daily AS
SELECT toDate(event_time) AS day,
event_type,
count() AS event_count
FROM events
GROUP BY day, event_type;
New inserts into events automatically populate
events_daily. For funnel and session metrics, chain MVs or use
AggregatingMergeTree with uniqState /
quantileState for approximate distinct counts at scale. Backfill
historical data with INSERT INTO events_daily SELECT ... after
creating the MV — MVs do not retroactively process existing rows.
Ingest, replication, and distributed queries
Ingest patterns
- Batch INSERT — preferred; use FORMAT JSONEachRow, Native, or Parquet over HTTP/Native protocol.
- Kafka engine table — ClickHouse consumes a topic directly; MV transforms into MergeTree.
- S3 / GCS table functions —
INSERT INTO t SELECT * FROM s3(...)for lake ingestion, similar to DuckDB's file queries but cluster-scale. - clickhouse-client —
--max_insert_block_sizeand async inserts for high-throughput pipelines.
Use async_insert when many small writers hit one node — the server
buffers and flushes larger parts. For exactly-once semantics from Kafka, pair
with idempotent consumers and ReplacingMergeTree keyed on event ID.
Replication and sharding
ReplicatedMergeTree keeps N copies in sync via a coordination service
(ClickHouse Keeper — a ZooKeeper-compatible embedded consensus layer). Sharding
spreads data across nodes with a Distributed table that routes
inserts and fans out SELECTs. The typical cluster layout: 3 shards × 2 replicas
for both throughput and fault tolerance. Distributed joins are expensive — prefer
denormalized wide tables or pre-joined materialized views for dashboard queries.
Worked example: Harbor Fleet clickstream analytics
Harbor Fleet's web storefront emits JSON events to Kafka:
(event_time, session_id, user_id, page, action, revenue). The SRE
team needs a Grafana dashboard with page views per minute, conversion rate, and
90-day retention without hammering PostgreSQL.
CREATE TABLE clickstream (
event_time DateTime,
session_id UUID,
user_id UInt64,
page LowCardinality(String),
action LowCardinality(String),
revenue Decimal(10,2)
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/clickstream', '{replica}')
PARTITION BY toYYYYMM(event_time)
ORDER BY (page, event_time)
TTL event_time + INTERVAL 90 DAY;
-- Kafka ingest (simplified)
CREATE TABLE clickstream_kafka ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092',
kafka_topic_list = 'harbor.clicks',
kafka_format = 'JSONEachRow';
CREATE MATERIALIZED VIEW clickstream_kafka_mv TO clickstream AS
SELECT * FROM clickstream_kafka;
-- Dashboard query: conversion rate last hour
SELECT page,
countIf(action = 'view') AS views,
countIf(action = 'purchase') AS purchases,
purchases / views AS conversion
FROM clickstream
WHERE event_time >= now() - INTERVAL 1 HOUR
GROUP BY page
ORDER BY views DESC
LIMIT 20;
At 50k events per second, three ClickHouse nodes sustain sub-200ms dashboard
queries because only the page and event_time columns
are read for this aggregation. Revenue totals use a separate SummingMergeTree MV
keyed by day. Alerts hook into
Grafana
via the ClickHouse data source; slow-query logging feeds
Prometheus
metrics on merge lag and insert throughput.
Decision table: ClickHouse vs alternatives
| Need | Reach for | Why |
|---|---|---|
| Billions of events, real-time dashboards, cluster scale | ClickHouse | Columnar MergeTree, MV rollups, replication, proven at web-analytics scale |
| Local ad hoc analytics on Parquet/CSV files | DuckDB | Embedded, no cluster ops, ideal for laptop and CI |
| Managed multi-tenant warehouse, SQL BI at enterprise scale | Snowflake / BigQuery | Separation of storage/compute, governance, less ops — higher cost |
| Transactional CRUD, foreign keys, row-level locking | PostgreSQL | OLTP-native; use as source, not analytics sink at billion-row scale |
| Full-text search and log exploration with relevance scoring | Elasticsearch | Inverted indexes and BM25; aggregations slower than ClickHouse at extreme cardinality |
| Time-series metrics with Prometheus ecosystem | Prometheus / VictoriaMetrics | Pull model and PromQL; ClickHouse wins for rich dimensional event analytics |
Query optimization habits
ClickHouse rewards the same discipline as our SQL optimization guide, with OLAP-specific twists:
- Filter on partition key and
ORDER BYprefix columns first. - Avoid
SELECT *— read only columns you aggregate. - Prefer
uniqCombinedoruniqHLL12over exactCOUNT(DISTINCT)on high-cardinality keys. - Use
PREWHEREfor selective predicates — ClickHouse evaluates them before reading other columns. - Sample large tables with
SAMPLE 0.1for exploratory queries. - Check
system.query_logfor bytes read and mark whether indexes were used.
Common pitfalls
- Single-row inserts — creates part explosion and merge storms; batch or use async_insert.
- Wrong ORDER BY — leading with high-cardinality
user_idwhen queries filter byevent_typeforces full scans. - Too many partitions — daily partitions on low-volume tables bloat metadata and slow startup.
- Expecting immediate dedup — ReplacingMergeTree deduplicates at merge time; query with
FINALor argMax only when needed (FINAL is expensive). - Mutations for frequent updates — rewrites parts asynchronously; use CollapsingMergeTree or external OLTP instead.
- Distributed JOIN across shards — shuffles massive datasets; denormalize or pre-aggregate.
- Ignoring replication lag — monitor
system.replicasfor queue backlog after node failures. - No TTL or partition drops — disk fills; merges cannot keep up with ingest on full volumes.
Production checklist
- Choose MergeTree variant matching dedup/rollup needs; document ORDER BY rationale against top 10 queries.
- Partition by month (or coarser) on time-series tables; set TTL aligned with compliance retention.
- Deploy ReplicatedMergeTree with Keeper quorum; test failover and replica catch-up.
- Ingest via Kafka engine or batched inserts (10k+ rows); enable async_insert for small writers.
- Create materialized views for dashboard aggregates; backfill historical rollups explicitly.
- Configure per-column codecs; tier cold data to object storage if supported.
- Monitor parts count, merge mutations, disk usage, and query bytes read in system tables.
- Restrict mutations; route OLTP writes to PostgreSQL, analytics reads to ClickHouse.
- Wire Grafana or BI tools to read replicas; cap concurrent heavy queries with quotas.
- Document schema migrations (ADD COLUMN with defaults is cheap; type changes are not).
Key takeaways
- ClickHouse is a columnar OLAP engine for append-heavy analytics — not a PostgreSQL replacement for transactions.
- MergeTree ORDER BY and partitioning determine query speed; design them from real dashboard filters, not defaults.
- Materialized views deliver real-time rollups by transforming inserts into pre-aggregated tables.
- Batch ingest, monitor part counts, and use TTL — operational health is merge health.
- Pair ClickHouse with DuckDB for local exploration and PostgreSQL for system-of-record; each tier has a clear role.
Related reading
- DuckDB explained — embedded OLAP for local Parquet analytics before promoting to ClickHouse scale
- Data warehouses and lakehouses explained — where ClickHouse fits in the modern analytics stack
- ETL and ELT data pipelines explained — feeding ClickHouse from lakes, CDC, and orchestrators
- Apache Kafka explained — streaming ingest into ClickHouse Kafka engine tables