Guide

Stream processing explained

A nightly ETL job that aggregates yesterday's clicks is batch processing. A pipeline that updates a fraud score the moment a card is swiped is stream processing. Both consume events; the difference is when results materialize and how the system handles data that never stops arriving. Stream processors read unbounded sequences — Kafka topics, Kinesis shards, database change logs — and apply filters, aggregations, joins, and alerts with seconds (or milliseconds) of latency. That sounds simple until clocks disagree: a mobile client buffers clicks offline and replays them hours later, or a warehouse replica lags and emits stale updates. This guide explains event time vs processing time, windowing and watermarks, stateful operators, delivery semantics, how Kafka Streams and Flink differ, and when streaming beats batch for analytics and automation — building on message queues and event-driven architecture.

Batch, micro-batch, and true streaming

Batch processing collects a bounded dataset — "all orders from Tuesday" — runs a job, and finishes. Hadoop MapReduce and classic Spark jobs fit this model. Latency is measured in minutes to hours; correctness is easy because the input set is fixed.

Micro-batch engines (early Spark Streaming, some managed "streaming" SQL) chop the infinite stream into small batches every few seconds. You get near-real-time dashboards with simpler fault tolerance, but tail latency is bounded below by the batch interval and stateful logic can feel awkward.

True stream processing treats each record as it arrives: operators maintain incremental state, windows close based on event-time rules, and frameworks like Apache Flink or Kafka Streams pipeline records record-by-record. Latency drops to sub-second for many workloads; complexity moves into watermarks, state backends, and exactly-once configuration.

Hybrid architectures are normal: stream processors compute real-time KPIs and fan out alerts, while nightly jobs in your warehouse reconcile totals, backfill history, and train models on complete snapshots.

Event time, processing time, and ingestion time

Three timestamps appear on every event, and conflating them causes silent bugs:

  • Event time — when the thing actually happened (click timestamp from the client, created_at on the order row).
  • Ingestion time — when the broker or collector received the record (Kafka timestamp type CreateTime).
  • Processing time — when your operator runs (wall clock on the stream worker).

Analytics on "orders per minute" should almost always window by event time — otherwise a delayed mobile batch looks like a traffic spike at 3 a.m. when the server finally ingests it. Processing-time windows are acceptable for rough monitoring ("is the pipeline keeping up?") but wrong for business metrics tied to user behavior.

Late data arrives after the window you thought was closed. Stream frameworks use watermarks — monotonic estimates of "we are unlikely to see events older than t" — to decide when to finalize a window. A watermark at 10:05 might close the 10:00–10:01 tumbling window while still allowing a short allowed lateness grace period for stragglers. Tune these knobs from measured p99 ingestion delay, not defaults.

Windowing: tumbling, sliding, and session

Windows group unbounded streams into finite chunks for aggregation (COUNT, SUM, AVG, percentiles):

  • Tumbling windows — fixed, non-overlapping intervals (every 1-minute pageview count). Simplest to reason about; common for dashboards.
  • Sliding windows — fixed length, overlapping (last 5 minutes updated every minute). Useful for rolling averages and trend detection.
  • Session windows — gap-based: a new session starts after N minutes of inactivity per key (user id, device id). Essential for funnel and engagement metrics; variable window length makes them state-heavy.

Every windowed aggregation needs a keyGROUP BY user_id in SQL becomes keyBy(userId) in Flink. Key choice drives partitioning: hot keys (a viral post, a global config flag) create skew where one task owns most traffic. Salting, two-phase aggregation, or splitting hot keys are standard fixes.

Stateful operators and joins

Stateless map/filter steps are easy to scale. Real pipelines need state: running counts, deduplication sets, open sessions, and join buffers. Frameworks persist operator state to RocksDB (embedded) or remote stores, checkpointing periodically so workers can restart after failure without recomputing from scratch.

Common join patterns:

  • Stream–table join — enrich click events with user profile rows from a database changelog (CDC) or compacted Kafka topic. The table side is materialized as queryable state.
  • Stream–stream join — correlate two live feeds (ad impression + conversion within 30 minutes). Requires buffering both sides within a time bound; memory grows with join window width.
  • Temporal joins — SQL models like Flink's FOR SYSTEM_TIME AS OF pick the dimension row valid at event time — critical when reference data changes (price lists, FX rates).

State size is the hidden cost center. A 24-hour dedup set on a high-QPS topic can exceed RAM; TTL and compaction policies must be designed upfront, not after the first OOM at 2 a.m.

Delivery semantics: at-most-once, at-least-once, exactly-once

Like message queues, stream processors sit on the same spectrum:

  • At-most-once — fast, may drop records on failure. Fine for approximate metrics where loss is acceptable.
  • At-least-once — retries on failure; duplicates possible. Requires idempotent sinks (upsert by primary key, idempotency keys) or downstream dedup.
  • Exactly-once — end-to-end transactional guarantees between source, state, and sink (Kafka transactions + idempotent producers, Flink two-phase commit). Higher overhead; sink must participate (many databases and object stores do; some HTTP APIs do not).

"Exactly-once" in marketing often means exactly-once effect on the sink row, not that no duplicate bytes ever crossed the wire. Document what your pipeline actually promises before finance wires KPIs to it.

Engine landscape: Kafka Streams, Flink, Spark, managed services

EngineDeployment modelStrengthsWatch outs
Kafka Streams Library in your JVM app Tight Kafka integration, no separate cluster, good for medium complexity State sizing on app nodes; ops tied to app deploys
Apache Flink Dedicated cluster / managed (Ververica, Confluent) Event time, large state, SQL, CEPL, mature exactly-once Cluster ops learning curve; savepoint discipline
Spark Structured Streaming Spark cluster Unified batch + stream API, warehouse integrations Micro-batch latency floor; stateful ops less nimble than Flink
ksqlDB / Flink SQL Managed or self-hosted Declarative filters and aggregates for teams without Java/Scala Complex joins and UDF limits; version upgrades
AWS Kinesis / GCP Dataflow Cloud managed Low ops, autoscale, IAM integration Vendor cost at scale; egress and shard limits

Pick the engine that matches your team's ops muscle and latency SLO — not the one with the most conference talks. A Kafka-native shop with modest windowed aggregates often ships faster on Kafka Streams than on a new Flink cluster.

Typical use cases

  • Real-time analytics — live dashboards, funnel counters, anomaly spikes on payment volume.
  • Fraud and risk — rule engines and ML features scored per transaction before authorization completes.
  • Operational alerting — SLO burn alerts, error-rate thresholds, inventory low-stock triggers.
  • Search and cache sync — CDC-driven index updates instead of full reindex jobs.
  • Materialized views — maintain read-optimized projections for CQRS read sides without polling OLTP.
  • IoT and telemetry — rolling aggregates on sensor feeds with edge ingestion.

Decision table: stream vs batch

RequirementPrefer streamingPrefer batch
Latency SLO Sub-minute actions or alerts Hourly or daily is acceptable
Data completeness Approximate live view OK Must reconcile 100% of historical rows
Join complexity Enrichment from CDC dimensions Many-way joins across huge history
Correctness tolerance Idempotent sinks; late-data rules documented Single bounded run with audit trail
Team skills Existing Kafka ops and on-call SQL + Airflow maturity, no 24/7 stream on-call
Cost model Always-on compute justified by revenue or risk Spot/preemptible batch cheaper for bursty TB scans

Common mistakes

  • Windowing on processing time for user-facing metrics — shifts spikes to quiet hours when mobile clients sync.
  • No watermark / lateness strategy — either drop legitimate late events or never close windows.
  • Unbounded state — dedup sets and session maps without TTL.
  • Hot keys — one partition at 100% CPU while others idle.
  • Assuming exactly-once everywhere — HTTP webhooks and email sinks are usually at-least-once.
  • Skipping backfill path — stream for live, batch for history; without reconciliation, dashboards disagree with finance.
  • Schema chaos — breaking Avro/Protobuf changes without compatibility checks stall every consumer.

Practitioner checklist

  • Define event-time field, expected p99 lateness, and business tolerance for corrected aggregates.
  • Choose tumbling vs session windows per metric; document key cardinality and hot-key mitigations.
  • Size state: bytes per key × expected keys × replication factor; set RocksDB or remote state backend limits.
  • Configure checkpoints / savepoints and test failover — kill a task during load before launch.
  • Implement idempotent sinks or explicit dedup for at-least-once paths.
  • Monitor consumer lag, watermark lag, and state growth — alert before disk fills.
  • Register schemas with backward-compatible evolution (Schema Registry or equivalent).
  • Plan Lambda architecture: stream for live, batch job to reconcile warehouse totals nightly.

Key takeaways

  • Stream processing computes on unbounded event feeds with low latency; batch still owns complete historical reconciliation.
  • Event time + watermarks are how you get correct windows when data arrives late or out of order.
  • Stateful operators (aggregations, joins, sessions) drive memory, checkpointing, and scaling behavior.
  • Delivery semantics are a pipeline property — design sinks for the guarantee you actually need.
  • Engine choice (Kafka Streams, Flink, Spark, managed) is an ops and latency trade-off, not a purity contest.

Related reading