Guide

Apache Flink explained

A card swipe in Tokyo should update a fraud score before the receipt prints — not after tomorrow's batch job finishes. Apache Flink is the open-source stream processor built for that latency class: it processes events record-by-record, maintains keyed state across millions of accounts, windows by when the transaction actually happened, and recovers from node failures without double-charging or dropping alerts. Born at TU Berlin and now governed by the Apache Software Foundation, Flink powers real-time analytics at Alibaba, Uber, Netflix, and mid-size teams running on Kubernetes or managed Confluent Cloud. This guide covers Flink's JobManager and TaskManager architecture, the DataStream and Table APIs, event time and watermarks, checkpoints and exactly-once semantics, Kafka connectors, a Harbor Fleet fraud-detection worked example, an engine decision table, common pitfalls, and a production checklist alongside our stream processing overview, Kafka guide, and Spark guide.

What Flink is — and what it is not

Flink is a distributed stream processing engine where the primary abstraction is an unbounded sequence of records. Batch jobs are a special case: Flink reads a bounded dataset as a finite stream and applies the same operators. That unification means one API, one optimizer, and one runtime for both nightly backfills and sub-second alerting.

Flink is not a message broker like Kafka — it consumes from brokers and writes to sinks; it does not durably store your event log. Nor is it a batch warehouse engine like Spark optimized for scanning terabytes of Parquet with Catalyst SQL — though Flink SQL can read lake tables for hybrid pipelines. Flink wins when stateful, low-latency logic must run continuously: sessionization, stream-stream joins with time bounds, complex event processing, and incremental materialized views.

Core concepts

  • Stream — unbounded (or bounded) sequence of events with timestamps and optional keys.
  • Operator — map, filter, keyBy, window, join — transforms one or more streams.
  • Keyed state — per-key counters, lists, maps stored in RocksDB or on-heap memory.
  • Checkpoint — consistent snapshot of operator state and Kafka offsets for fault recovery.
  • Watermark — signal that events with timestamp < t are unlikely to arrive late.
  • Parallelism — number of concurrent subtasks per operator; maps to TaskManager slots.
  • Savepoint — user-triggered checkpoint for version upgrades and job migrations.

Architecture: JobManager, TaskManagers, and slots

A Flink cluster has one JobManager (high availability setups run three with ZooKeeper or Kubernetes HA) and one or more TaskManagers. The JobManager accepts job submissions, builds the execution graph, schedules tasks, coordinates checkpoints, and exposes the web UI. TaskManagers are workers: each exposes a fixed number of slots — typically one slot per CPU core — and runs operator subtasks in those slots.

When you set global parallelism to 48, Flink creates 48 parallel instances of each parallelizable operator. A keyBy(user_id) hashes keys to subtasks so all events for the same user land on the same subtask — required for correct keyed state. Misaligned parallelism (source partitions << operator parallelism) leaves slots idle; source partitions >> downstream parallelism creates rebalance overhead. Start with source partition count as a baseline and scale bottlenecks shown in the Flink UI backpressure tab.

On Kubernetes, the Flink Kubernetes Operator or native application mode packages the JobManager and TaskManager as pods; session mode shares a long-lived cluster across short jobs. Application mode (one cluster per job) isolates failures and simplifies resource accounting for production.

DataStream API and Flink SQL

The DataStream API (Java, Scala, Python PyFlink) is the low-level interface: create a StreamExecutionEnvironment, add sources, chain transformations, and call env.execute(). Typical patterns:

  • stream.keyBy(Order::getUserId) — partition by key for stateful ops.
  • .window(TumblingEventTimeWindows.of(Time.minutes(5))) — aggregate per event-time window.
  • .process(new FraudRuleFunction()) — custom keyed state in KeyedProcessFunction.
  • .connect(otherStream).keyBy(...).process(new CoProcessFunction()) — stream-stream joins with timers.

Flink SQL / Table API compiles declarative queries to the same runtime. Register Kafka topics as tables, write SELECT user_id, COUNT(*) FROM payments WHERE ... GROUP BY TUMBLE(event_time, INTERVAL '1' MINUTE), and Flink generates the physical plan. SQL suits analytics dashboards and teams already fluent in warehouse dialects; DataStream suits bespoke CEP rules and side-effect-heavy logic (calling external APIs with async I/O). Both interoperate: convert DataStreams to Tables and back when you need SQL for one stage and custom code for another.

Event time, watermarks, and windows

Flink's headline feature is first-class event-time processing. Assign timestamps and watermarks in a WatermarkStrategy: WatermarkStrategy.<Payment>forBoundedOutOfOrderness(Duration.ofSeconds(30)) tells the engine to wait up to 30 seconds for late events before closing a window. Watermarks propagate downstream; when watermark passes 10:05, the 10:00–10:01 tumbling window can emit its aggregate.

Window types

  • Tumbling — fixed, non-overlapping slices (orders per minute).
  • Sliding — overlapping windows (last 5 minutes updated every minute).
  • Session — gaps in activity define variable-length sessions (click streams).
  • Global — single window with custom triggers for manual control.

Allowed lateness routes very late events to side outputs instead of silently dropping them — critical for reconciling stream totals with batch ETL jobs. Measure p99 ingestion delay from Kafka lag and client clock skew before picking watermark slack; too tight drops valid revenue; too loose delays alerts.

State, checkpoints, and exactly-once

Stateful operators store data in Flink's managed state backends. Heap state is fast but limited by TaskManager RAM. RocksDB spills to local SSD — standard for production keyed state over gigabytes. Incremental checkpoints upload only changed RocksDB SST files to S3 or HDFS, keeping checkpoint duration stable as state grows.

Checkpoints are barrier-aligned snapshots: the JobManager injects checkpoint barriers into streams; when an operator receives a barrier on all inputs, it snapshots state and acknowledges. On failure, Flink restarts from the last completed checkpoint — redeploying tasks and restoring state bytes. With Kafka source and two-phase-commit sinks, Flink achieves exactly-once end-to-end: neither duplicate fraud blocks nor missed alerts after recovery.

Savepoints are manually triggered checkpoints stored with a stable format — use them to upgrade Flink versions, change parallelism, or fork jobs for testing without replaying from Kafka offset zero. Document savepoint paths in your runbook alongside disaster recovery procedures.

Delivery semantics recap

  • At-most-once — no checkpoints; fastest, may lose data on crash.
  • At-least-once — checkpoints without transactional sinks; duplicates possible downstream.
  • Exactly-once — checkpointed state + idempotent or transactional sinks (Kafka, JDBC XA, Iceberg).

Connectors and the wider pipeline

Flink's Kafka connector reads with partition-aware parallelism and commits offsets only after checkpoint completion — the linchpin of exactly-once consumption. Common sources: Kafka, Kinesis, Pulsar, filesystem (Parquet for batch), JDBC CDC via Debezium-fed topics. Sinks: Kafka, Elasticsearch, JDBC upserts, ClickHouse, Prometheus metrics, webhooks.

Hybrid architectures are the norm: Flink computes real-time fraud scores and publishes to a fraud_alerts topic; hourly Spark jobs reconcile aggregates into the lakehouse for BI. Schedule backfills and Flink job deploys with Airflow; monitor lag, checkpoint duration, and failed checkpoints in Prometheus via Flink's metrics reporter.

Worked example: Harbor Fleet payment fraud scorer

Harbor Fleet processes card authorizations for a fictional logistics marketplace. Requirements: flag users with >5 declined attempts in any 10-minute event-time window, and maintain a 24-hour rolling spend total per card for velocity checks. Architecture:

  • Source — Kafka topic payments.auth, 32 partitions, JSON events with event_time, user_id, amount, status.
  • Watermark strategy — 45 s bounded out-of-orderness (mobile offline replay measured at p99 38 s).
  • Keyed stateMapState<String, Integer> for decline counts per window; ValueState<Double> for rolling spend with TTL 24 h on RocksDB.
  • Windows — tumbling 10-minute event-time on declines; processing-time timer for daily spend decay.
  • Sinksfraud_alerts Kafka topic (exactly-once producer); async side lookup to blocklist API with retry budget.

Parallelism 32 matches Kafka partitions. Checkpoint interval 60 s, incremental RocksDB to S3. After a TaskManager pod eviction, job restarts from checkpoint 1842 in 38 s — no duplicate alerts because the Kafka sink participates in the two-phase commit. Nightly Spark job compares Flink window counts to warehouse totals; allowed-lateness side output feeds a reconciliation dashboard.

Engine decision table

NeedPreferWhy
Sub-second stateful CEP on KafkaFlinkNative event time, large keyed state, mature exactly-once
Lightweight transforms in Kafka microservicesKafka StreamsNo separate cluster; per-app JVM library
Nightly 5 TB Parquet ETLSparkBatch Catalyst optimizer, lakehouse integrations
Micro-batch dashboards every 10 sSpark Structured StreamingTeam already on Spark SQL; latency acceptable
Managed AWS-only streaming SQLKinesis Data Analytics (Flink) or MSFOps offload; vendor lock-in trade-off
OLAP aggregations on stored columnsClickHouse materialized viewsQuery-serving, not general stream programming

Common pitfalls

  • Processing-time windows for business metrics — delayed mobile batches skew revenue charts; use event time.
  • Heap state at scale — OOM on TaskManagers when keyed state exceeds RAM; switch to RocksDB early.
  • Ignoring backpressure — downstream sink slower than source fills network buffers; fix sink or add async I/O.
  • Checkpoint timeout cascades — state too large or slow object store; enable incremental checkpoints and tune interval.
  • Non-idempotent sinks with at-least-once — duplicate rows in JDBC without upsert keys; use exactly-once sink or dedupe table.
  • Hot keys — one celebrity user_id owns a partition; salt keys and re-aggregate like Spark skew fixes.
  • PyFlink on hot paths — Python operator overhead; keep critical CEP in Java/Scala or SQL.

Production checklist

  • Pin Flink, connector, and Kafka client versions; test upgrades via savepoint restore on staging.
  • Match source parallelism to Kafka partition count; scale TaskManager slots and CPU accordingly.
  • Enable RocksDB incremental checkpoints to S3/HDFS; alert on checkpoint duration > interval.
  • Configure event-time watermarks from measured late-data histograms, not defaults.
  • Use exactly-once Kafka source/sink pair or document at-least-once dedupe downstream.
  • Export Flink metrics (records in/out, lag, backpressure, last checkpoint size) to Prometheus/Grafana.
  • Define savepoint-before-deploy in CI/CD; never change keyBy fields without state migration plan.
  • Reconcile stream aggregates with batch warehouse jobs; monitor allowed-lateness side output volume.

Key takeaways

  • Flink is a stream-first engine — batch is a bounded stream, not a separate product bolted on.
  • Event time, watermarks, and keyed state are the core skills; master them before tuning parallelism.
  • Checkpoints and RocksDB make large state fault-tolerant; exactly-once needs transactional sinks too.
  • Pair Flink with Kafka for ingestion, Spark for heavy batch reconciliation, and Airflow for orchestration.
  • Read the Flink UI backpressure and checkpoint tabs first when latency regresses — guesses are expensive.

Related reading