Guide
Apache Flink explained
A card swipe in Tokyo should update a fraud score before the receipt prints — not after tomorrow's batch job finishes. Apache Flink is the open-source stream processor built for that latency class: it processes events record-by-record, maintains keyed state across millions of accounts, windows by when the transaction actually happened, and recovers from node failures without double-charging or dropping alerts. Born at TU Berlin and now governed by the Apache Software Foundation, Flink powers real-time analytics at Alibaba, Uber, Netflix, and mid-size teams running on Kubernetes or managed Confluent Cloud. This guide covers Flink's JobManager and TaskManager architecture, the DataStream and Table APIs, event time and watermarks, checkpoints and exactly-once semantics, Kafka connectors, a Harbor Fleet fraud-detection worked example, an engine decision table, common pitfalls, and a production checklist alongside our stream processing overview, Kafka guide, and Spark guide.
What Flink is — and what it is not
Flink is a distributed stream processing engine where the primary abstraction is an unbounded sequence of records. Batch jobs are a special case: Flink reads a bounded dataset as a finite stream and applies the same operators. That unification means one API, one optimizer, and one runtime for both nightly backfills and sub-second alerting.
Flink is not a message broker like Kafka — it consumes from brokers and writes to sinks; it does not durably store your event log. Nor is it a batch warehouse engine like Spark optimized for scanning terabytes of Parquet with Catalyst SQL — though Flink SQL can read lake tables for hybrid pipelines. Flink wins when stateful, low-latency logic must run continuously: sessionization, stream-stream joins with time bounds, complex event processing, and incremental materialized views.
Core concepts
- Stream — unbounded (or bounded) sequence of events with timestamps and optional keys.
- Operator — map, filter, keyBy, window, join — transforms one or more streams.
- Keyed state — per-key counters, lists, maps stored in RocksDB or on-heap memory.
- Checkpoint — consistent snapshot of operator state and Kafka offsets for fault recovery.
- Watermark — signal that events with timestamp < t are unlikely to arrive late.
- Parallelism — number of concurrent subtasks per operator; maps to TaskManager slots.
- Savepoint — user-triggered checkpoint for version upgrades and job migrations.
Architecture: JobManager, TaskManagers, and slots
A Flink cluster has one JobManager (high availability setups run three with ZooKeeper or Kubernetes HA) and one or more TaskManagers. The JobManager accepts job submissions, builds the execution graph, schedules tasks, coordinates checkpoints, and exposes the web UI. TaskManagers are workers: each exposes a fixed number of slots — typically one slot per CPU core — and runs operator subtasks in those slots.
When you set global parallelism to 48, Flink creates 48 parallel instances of each
parallelizable operator. A keyBy(user_id) hashes keys to subtasks so
all events for the same user land on the same subtask — required for correct keyed
state. Misaligned parallelism (source partitions << operator parallelism)
leaves slots idle; source partitions >> downstream parallelism creates
rebalance overhead. Start with source partition count as a baseline and scale
bottlenecks shown in the Flink UI backpressure tab.
On Kubernetes, the Flink Kubernetes Operator or native application mode packages the JobManager and TaskManager as pods; session mode shares a long-lived cluster across short jobs. Application mode (one cluster per job) isolates failures and simplifies resource accounting for production.
DataStream API and Flink SQL
The DataStream API (Java, Scala, Python PyFlink) is the low-level
interface: create a StreamExecutionEnvironment, add sources, chain
transformations, and call env.execute(). Typical patterns:
stream.keyBy(Order::getUserId)— partition by key for stateful ops..window(TumblingEventTimeWindows.of(Time.minutes(5)))— aggregate per event-time window..process(new FraudRuleFunction())— custom keyed state inKeyedProcessFunction..connect(otherStream).keyBy(...).process(new CoProcessFunction())— stream-stream joins with timers.
Flink SQL / Table API compiles declarative queries to the same
runtime. Register Kafka topics as tables, write SELECT user_id, COUNT(*)
FROM payments WHERE ... GROUP BY TUMBLE(event_time, INTERVAL '1' MINUTE),
and Flink generates the physical plan. SQL suits analytics dashboards and teams
already fluent in warehouse dialects; DataStream suits bespoke CEP rules and
side-effect-heavy logic (calling external APIs with async I/O). Both interoperate:
convert DataStreams to Tables and back when you need SQL for one stage and custom
code for another.
Event time, watermarks, and windows
Flink's headline feature is first-class event-time processing. Assign
timestamps and watermarks in a WatermarkStrategy:
WatermarkStrategy.<Payment>forBoundedOutOfOrderness(Duration.ofSeconds(30))
tells the engine to wait up to 30 seconds for late events before closing a window.
Watermarks propagate downstream; when watermark passes 10:05, the 10:00–10:01 tumbling
window can emit its aggregate.
Window types
- Tumbling — fixed, non-overlapping slices (orders per minute).
- Sliding — overlapping windows (last 5 minutes updated every minute).
- Session — gaps in activity define variable-length sessions (click streams).
- Global — single window with custom triggers for manual control.
Allowed lateness routes very late events to side outputs instead of silently dropping them — critical for reconciling stream totals with batch ETL jobs. Measure p99 ingestion delay from Kafka lag and client clock skew before picking watermark slack; too tight drops valid revenue; too loose delays alerts.
State, checkpoints, and exactly-once
Stateful operators store data in Flink's managed state backends. Heap state is fast but limited by TaskManager RAM. RocksDB spills to local SSD — standard for production keyed state over gigabytes. Incremental checkpoints upload only changed RocksDB SST files to S3 or HDFS, keeping checkpoint duration stable as state grows.
Checkpoints are barrier-aligned snapshots: the JobManager injects checkpoint barriers into streams; when an operator receives a barrier on all inputs, it snapshots state and acknowledges. On failure, Flink restarts from the last completed checkpoint — redeploying tasks and restoring state bytes. With Kafka source and two-phase-commit sinks, Flink achieves exactly-once end-to-end: neither duplicate fraud blocks nor missed alerts after recovery.
Savepoints are manually triggered checkpoints stored with a stable format — use them to upgrade Flink versions, change parallelism, or fork jobs for testing without replaying from Kafka offset zero. Document savepoint paths in your runbook alongside disaster recovery procedures.
Delivery semantics recap
- At-most-once — no checkpoints; fastest, may lose data on crash.
- At-least-once — checkpoints without transactional sinks; duplicates possible downstream.
- Exactly-once — checkpointed state + idempotent or transactional sinks (Kafka, JDBC XA, Iceberg).
Connectors and the wider pipeline
Flink's Kafka connector reads with partition-aware parallelism and commits offsets only after checkpoint completion — the linchpin of exactly-once consumption. Common sources: Kafka, Kinesis, Pulsar, filesystem (Parquet for batch), JDBC CDC via Debezium-fed topics. Sinks: Kafka, Elasticsearch, JDBC upserts, ClickHouse, Prometheus metrics, webhooks.
Hybrid architectures are the norm: Flink computes real-time fraud scores and
publishes to a fraud_alerts topic; hourly
Spark jobs
reconcile aggregates into the
lakehouse
for BI. Schedule backfills and Flink job deploys with
Airflow;
monitor lag, checkpoint duration, and failed checkpoints in
Prometheus
via Flink's metrics reporter.
Worked example: Harbor Fleet payment fraud scorer
Harbor Fleet processes card authorizations for a fictional logistics marketplace. Requirements: flag users with >5 declined attempts in any 10-minute event-time window, and maintain a 24-hour rolling spend total per card for velocity checks. Architecture:
- Source — Kafka topic
payments.auth, 32 partitions, JSON events withevent_time,user_id,amount,status. - Watermark strategy — 45 s bounded out-of-orderness (mobile offline replay measured at p99 38 s).
- Keyed state —
MapState<String, Integer>for decline counts per window;ValueState<Double>for rolling spend with TTL 24 h on RocksDB. - Windows — tumbling 10-minute event-time on declines; processing-time timer for daily spend decay.
- Sinks —
fraud_alertsKafka topic (exactly-once producer); async side lookup to blocklist API with retry budget.
Parallelism 32 matches Kafka partitions. Checkpoint interval 60 s, incremental RocksDB to S3. After a TaskManager pod eviction, job restarts from checkpoint 1842 in 38 s — no duplicate alerts because the Kafka sink participates in the two-phase commit. Nightly Spark job compares Flink window counts to warehouse totals; allowed-lateness side output feeds a reconciliation dashboard.
Engine decision table
| Need | Prefer | Why |
|---|---|---|
| Sub-second stateful CEP on Kafka | Flink | Native event time, large keyed state, mature exactly-once |
| Lightweight transforms in Kafka microservices | Kafka Streams | No separate cluster; per-app JVM library |
| Nightly 5 TB Parquet ETL | Spark | Batch Catalyst optimizer, lakehouse integrations |
| Micro-batch dashboards every 10 s | Spark Structured Streaming | Team already on Spark SQL; latency acceptable |
| Managed AWS-only streaming SQL | Kinesis Data Analytics (Flink) or MSF | Ops offload; vendor lock-in trade-off |
| OLAP aggregations on stored columns | ClickHouse materialized views | Query-serving, not general stream programming |
Common pitfalls
- Processing-time windows for business metrics — delayed mobile batches skew revenue charts; use event time.
- Heap state at scale — OOM on TaskManagers when keyed state exceeds RAM; switch to RocksDB early.
- Ignoring backpressure — downstream sink slower than source fills network buffers; fix sink or add async I/O.
- Checkpoint timeout cascades — state too large or slow object store; enable incremental checkpoints and tune interval.
- Non-idempotent sinks with at-least-once — duplicate rows in JDBC without upsert keys; use exactly-once sink or dedupe table.
- Hot keys — one celebrity user_id owns a partition; salt keys and re-aggregate like Spark skew fixes.
- PyFlink on hot paths — Python operator overhead; keep critical CEP in Java/Scala or SQL.
Production checklist
- Pin Flink, connector, and Kafka client versions; test upgrades via savepoint restore on staging.
- Match source parallelism to Kafka partition count; scale TaskManager slots and CPU accordingly.
- Enable RocksDB incremental checkpoints to S3/HDFS; alert on checkpoint duration > interval.
- Configure event-time watermarks from measured late-data histograms, not defaults.
- Use exactly-once Kafka source/sink pair or document at-least-once dedupe downstream.
- Export Flink metrics (records in/out, lag, backpressure, last checkpoint size) to Prometheus/Grafana.
- Define savepoint-before-deploy in CI/CD; never change
keyByfields without state migration plan. - Reconcile stream aggregates with batch warehouse jobs; monitor allowed-lateness side output volume.
Key takeaways
- Flink is a stream-first engine — batch is a bounded stream, not a separate product bolted on.
- Event time, watermarks, and keyed state are the core skills; master them before tuning parallelism.
- Checkpoints and RocksDB make large state fault-tolerant; exactly-once needs transactional sinks too.
- Pair Flink with Kafka for ingestion, Spark for heavy batch reconciliation, and Airflow for orchestration.
- Read the Flink UI backpressure and checkpoint tabs first when latency regresses — guesses are expensive.
Related reading
- Stream processing explained — event time, windows, and delivery semantics across engines
- Apache Kafka explained — the log Flink reads and writes with exactly-once offsets
- Apache Spark explained — batch and micro-batch compute for lakehouse ETL
- Change data capture explained — database binlog streams into Flink via Kafka