Guide

Apache Spark explained

Your data team lands 400 GB of order events in S3 every night. A single-machine Python script takes nine hours and dies on out-of-memory errors halfway through a join. Apache Spark is the open-source distributed compute engine behind many large-scale ETL, feature engineering, and batch analytics jobs: it splits datasets into partitions, runs transformations in parallel across executors, and optimizes query plans with the Catalyst optimizer. Originally built at UC Berkeley's AMPLab, Spark powers lakehouse pipelines at Databricks, Netflix, Apple, and thousands of mid-size teams running on Kubernetes or EMR.

This guide covers Spark's driver-executor architecture, RDDs vs DataFrames, lazy evaluation and stages, partitioning and shuffle (the dominant cost center), Spark SQL and Structured Streaming, a Harbor Supply nightly ETL worked example, an engine decision table, common pitfalls, and a production checklist. It sits alongside our ETL/ELT pipelines guide, Kafka guide, and lakehouse overview.

What Spark is — and what it is not

Spark is a general-purpose distributed data processing engine. You write transformations (map, filter, join, groupBy) in Python (PySpark), Scala, SQL, or R; Spark schedules them across a cluster and handles fault tolerance by recomputing lost partitions from lineage. It is optimized for batch and micro-batch workloads over data already stored in files (Parquet, JSON, ORC), databases, or streams — not for millisecond OLTP transactions.

Spark is not a database like ClickHouse or PostgreSQL. It does not own storage; it reads from object stores, HDFS, or JDBC sources, processes in memory and on disk spill, and writes results back out. Nor is it a workflow scheduler — pair Spark jobs with Airflow or Dagster for dependency graphs, retries, and SLAs. Spark executes the heavy step; orchestrators coordinate when it runs.

Core concepts

  • SparkSession — entry point in Spark 2.x+; holds the SparkContext, SQL catalog, and configuration.
  • DataFrame / Dataset — tabular abstraction with a schema and Catalyst optimization; preferred over raw RDDs for most work. DataFrames are untyped rows; Datasets add compile-time types in Scala and Java only.
  • RDD — resilient distributed dataset; low-level partition collection with lineage graph (legacy but still under the hood).
  • Transformation — lazy operation (filter, select, join) that builds a DAG without executing yet.
  • Action — triggers execution (count, collect, write) and materializes stages.
  • Partition — logical slice of a DataFrame; parallelism unit mapped to tasks on executors.
  • Shuffle — cross-partition data exchange (groupBy, join, repartition); expensive network and disk I/O.
  • Stage — set of tasks separated by shuffle boundaries; Spark schedules stages sequentially or in parallel where possible.

Architecture: driver, executors, and cluster managers

A Spark application has one driver process (your main() or notebook kernel) and one or more executors on worker nodes. The driver parses your code, builds the logical and physical plans, breaks work into tasks, and ships them to executors. Executors run tasks, cache partitions in memory or spill to disk, and return results or write output.

Cluster managers (standalone, YARN, Kubernetes, or Mesos, which is deprecated since Spark 3.2) allocate CPU and memory containers. On K8s, the Spark Operator or spark-submit --master k8s://... spins up driver and executor pods; dynamic allocation scales executor count with the backlog of pending tasks. Right-size executor memory (heap + overhead) and cores: too few executors underutilize the cluster; too many tiny executors add scheduling overhead.
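The "heap + overhead" sizing above can be checked with a little arithmetic. The sketch below assumes Spark's default overhead rule for JVM jobs (10% of the heap, with a 384 MiB floor); the 64 GiB, 16-core worker node is a hypothetical example, and the function names are illustrative only.

```python
MIN_OVERHEAD_MIB = 384          # floor Spark applies to executor memory overhead
DEFAULT_OVERHEAD_FACTOR = 0.10  # assumed default overhead factor for JVM jobs


def container_request_mib(heap_mib: int,
                          overhead_factor: float = DEFAULT_OVERHEAD_FACTOR) -> int:
    """Memory the cluster manager must grant one executor: heap plus overhead."""
    overhead = max(MIN_OVERHEAD_MIB, int(heap_mib * overhead_factor))
    return heap_mib + overhead


def executors_per_node(node_mib: int, node_cores: int,
                       heap_mib: int, cores: int) -> int:
    """How many executors fit on one worker, limited by memory or by cores."""
    by_memory = node_mib // container_request_mib(heap_mib)
    by_cores = node_cores // cores
    return min(by_memory, by_cores)


# Hypothetical worker: 64 GiB RAM, 16 cores; executors with 16 GiB heap, 4 cores.
request = container_request_mib(16 * 1024)
fit = executors_per_node(node_mib=64 * 1024, node_cores=16,
                         heap_mib=16 * 1024, cores=4)
print(f"container request: {request} MiB, executors per node: {fit}")
```

In this example the cores would allow four executors per node, but the overhead pushes each container past 16 GiB, so only three fit. Leaving room for the overhead when choosing the heap size avoids stranding a quarter of the node.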

Lazy evaluation and the DAG

Transformations are lazy: chaining df.filter(...).select(...).groupBy(...) only records a logical plan. An action like df.write.parquet(...) or df.count() triggers Catalyst to optimize the logical plan (predicate pushdown, column pruning, constant folding) and choose a physical plan (for example, broadcast hash join vs sort-merge join). The Spark UI, served by the driver on port 4040 by default, shows stages, shuffle read/write bytes, and straggler tasks. It is your first stop when a job runs slow.
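The split between transformations and actions can be illustrated without a cluster. The following is a toy model in plain Python, not Spark's implementation: the LazyFrame class and its methods are hypothetical names, and there is no optimizer. It only shows that transformations record a plan while the action does the work.

```python
from typing import Callable, Iterable, List, Optional


class LazyFrame:
    """Toy stand-in for a DataFrame: transformations only record steps."""

    def __init__(self, rows: Iterable[dict],
                 plan: Optional[List[str]] = None,
                 ops: Optional[List[Callable]] = None):
        self._rows = rows
        self.plan = plan or []      # human-readable description of each step
        self._ops = ops or []       # deferred functions, applied only by an action

    def filter(self, predicate: Callable[[dict], bool], label: str) -> "LazyFrame":
        op = lambda rows: (r for r in rows if predicate(r))
        return LazyFrame(self._rows, self.plan + [f"filter({label})"], self._ops + [op])

    def select(self, *cols: str) -> "LazyFrame":
        op = lambda rows: ({c: r[c] for c in cols} for r in rows)
        return LazyFrame(self._rows, self.plan + [f"select({', '.join(cols)})"],
                         self._ops + [op])

    def count(self) -> int:
        """Action: run every recorded step and materialize a result."""
        rows = iter(self._rows)
        for op in self._ops:
            rows = op(rows)
        return sum(1 for _ in rows)


evaluated = []  # records every row the predicate actually touches


def is_shipped(row: dict) -> bool:
    evaluated.append(row["order_id"])
    return row["status"] == "shipped"


orders = LazyFrame([
    {"order_id": 1, "status": "shipped"},
    {"order_id": 2, "status": "canceled"},
    {"order_id": 3, "status": "shipped"},
])

pipeline = orders.filter(is_shipped, "status = shipped").select("order_id")
print("plan:", pipeline.plan, "rows touched so far:", len(evaluated))
print("count:", pipeline.count(), "rows touched after action:", len(evaluated))
```

In real Spark the recorded plan is also rewritten by Catalyst before execution, which is why the order of chained calls in your code does not have to match the order in which the engine runs them.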

Partitions, shuffle, and performance levers

Input partition count generally follows file splits: roughly one partition per HDFS block or Parquet file, capped by spark.sql.files.maxPartitionBytes (128 MB by default). After a groupBy or wide join, Spark shuffles: it hash-partitions rows by key, writes the partitioned output to local disk, and exchanges it across the network. Shuffle dominates runtime on large joins. A 2 TB skewed join on customer_id can stall for hours if one key owns 40% of rows, because every row for that key lands in the same task.
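A small simulation makes the skew problem concrete. This is a sketch in plain Python, not Spark code: CRC32 stands in for Spark's internal hash function, and the key names and row counts are made up to mirror the "one key owns 40% of rows" case.

```python
import zlib
from collections import Counter
from typing import Iterable, List


def partition_for(key: str, num_partitions: int) -> int:
    """Deterministic hash partitioner (CRC32 is a stand-in for Spark's hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_sizes(keys: Iterable[str], num_partitions: int) -> List[int]:
    """Row count per shuffle partition after hash-partitioning by key."""
    sizes = Counter(partition_for(k, num_partitions) for k in keys)
    return [sizes.get(p, 0) for p in range(num_partitions)]


# 10,000 rows: one hot customer owns 40%, the rest spread over 600 customers.
keys = ["cust-hot"] * 4000 + [f"cust-{i % 600}" for i in range(6000)]
sizes = partition_sizes(keys, num_partitions=8)

mean = sum(sizes) / len(sizes)
print("rows per partition:", sizes)
print(f"largest partition is {max(sizes) / mean:.1f}x the mean")
```

Because every row with the same key hashes to the same partition, adding more partitions does not shrink the hot one. The stage finishes only when the task holding that partition does.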

Tuning habits that matter

  • Partition count — target 128–512 MB per partition after shuffle; use repartition(n) before heavy joins, coalesce(n) only to reduce without full shuffle.
  • Broadcast joins — when one side is below spark.sql.autoBroadcastJoinThreshold (10 MB by default, often raised to around 50 MB), Spark ships a hash table to every executor and avoids shuffling the large side.
  • Salting skewed keys — append a random suffix to hot keys on the large side, replicate the matching rows on the small side once per suffix, join, then aggregate the suffix away. This is the classic fix for order-status or celebrity-user skew.
  • File layout — write partitioned Parquet by date or region so filters skip files; avoid millions of tiny files (target 128 MB–1 GB per file).
  • Cache wisely — persist(MEMORY_AND_DISK) only for DataFrames reused across multiple actions; unpersist when done.
  • Avoid collect() — pulls entire datasets to the driver; use take(10) for samples or write to storage for full output.
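The partition-count and broadcast rules of thumb above reduce to simple arithmetic. The helper names below are hypothetical, the 256 MiB target is one point inside the 128–512 MB range, and rounding up to a multiple of the core count is an assumed convention that keeps the last wave of tasks full.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3


def shuffle_partitions(shuffle_bytes: int,
                       target_partition_bytes: int = 256 * MIB,
                       total_cores: int = 1) -> int:
    """Partitions needed to hit the target size, rounded up to full task waves."""
    by_size = max(1, math.ceil(shuffle_bytes / target_partition_bytes))
    waves = math.ceil(by_size / total_cores)
    return waves * total_cores


def should_broadcast(small_side_bytes: int,
                     threshold_bytes: int = 10 * MIB) -> bool:
    """Mirror of the size check behind spark.sql.autoBroadcastJoinThreshold."""
    return small_side_bytes <= threshold_bytes


# Eight executors with four cores each gives 32 task slots.
print(shuffle_partitions(400 * GIB, total_cores=32))   # large nightly shuffle
print(shuffle_partitions(10 * GIB, total_cores=32))    # small job
print(should_broadcast(18 * MIB))                      # default threshold
print(should_broadcast(18 * MIB, threshold_bytes=50 * MIB))
```

The result would be passed as spark.sql.shuffle.partitions or to repartition(n). Note that an 18 MB table is above the default broadcast threshold, so it is only broadcast automatically if the threshold is raised.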

Spark SQL and Structured Streaming

Spark SQL lets you query DataFrames with SQL or the DataFrame API interchangeably. Register temp views (df.createOrReplaceTempView("orders")), run spark.sql("SELECT ..."), and read/write Hive metastore tables, Delta Lake, Iceberg, or JDBC sources. Catalyst pushes filters and projections into Parquet readers — a WHERE order_date = '2026-06-01' on partitioned data skips irrelevant directories entirely.
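Directory skipping on partitioned data can be sketched in a few lines. This is a simplified model of partition pruning in plain Python, not Spark's reader: the file names are hypothetical, and only equality filters on Hive-style key=value directories are handled.

```python
from typing import Dict, List


def parse_partition_values(path: str) -> Dict[str, str]:
    """Extract Hive-style key=value segments from a file path."""
    values = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            values[key] = value
    return values


def prune(paths: List[str], **equals: str) -> List[str]:
    """Keep only files whose partition values match every equality filter."""
    return [
        p for p in paths
        if all(parse_partition_values(p).get(k) == v for k, v in equals.items())
    ]


# Hypothetical layout of a table partitioned by order_date.
paths = [
    "orders/order_date=2026-05-31/part-0000.parquet",
    "orders/order_date=2026-06-01/part-0000.parquet",
    "orders/order_date=2026-06-01/part-0001.parquet",
    "orders/order_date=2026-06-02/part-0000.parquet",
]

to_scan = prune(paths, order_date="2026-06-01")
print(f"scanning {len(to_scan)} of {len(paths)} files")
```

The filter is answered from directory names alone, so files in other partitions are never opened. That only works when queries filter on the partition column, which is why the choice of partition column should follow the common WHERE clauses.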

Structured Streaming treats a stream as an unbounded table: micro-batches read from Kafka, apply the same DataFrame transforms, and write to sinks such as Parquet files, Kafka, or JDBC (via foreachBatch). Checkpointing gives exactly-once results when the source is replayable and the sink is idempotent or transactional; the console sink is for debugging only. Watermarks bound state for windowed aggregations. For sub-second complex event processing with large state, compare Flink; for nightly batch over lake files, vanilla batch Spark is simpler and cheaper.
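How a watermark bounds state is easier to see in a toy model. The sketch below is plain Python, not the Structured Streaming engine: it assumes 10-minute tumbling windows, a 5-minute watermark delay, event times as integer seconds, and a watermark that advances only between micro-batches. All names are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

WINDOW = 600  # tumbling window length in seconds
DELAY = 300   # watermark delay: how late an event may arrive, in seconds


def run(batches: List[List[int]]) -> Tuple[Dict[int, int], Dict[int, int], int]:
    """Count events per window; return (finalized, still-open state, dropped)."""
    counts = defaultdict(int)  # open window start -> event count (the state)
    emitted = {}               # windows finalized once the watermark passed them
    dropped = 0
    max_event_time = 0
    watermark = 0
    for batch in batches:
        for event_time in batch:
            window_start = event_time - event_time % WINDOW
            if window_start + WINDOW <= watermark:
                dropped += 1   # too late: its window was already finalized
                continue
            counts[window_start] += 1
            max_event_time = max(max_event_time, event_time)
        # Advance the watermark after the batch, never backwards.
        watermark = max(watermark, max_event_time - DELAY)
        for start in sorted(counts):
            if start + WINDOW <= watermark:
                emitted[start] = counts.pop(start)  # state for this window is freed
    return emitted, dict(counts), dropped


emitted, open_state, dropped = run([
    [10, 200, 590],     # all in window [0, 600)
    [650, 580, 1000],   # 580 is late but still inside the allowed delay
    [550, 1300],        # 550 arrives after window [0, 600) was finalized
])
print("finalized:", emitted)
print("open state:", open_state)
print("dropped late events:", dropped)
```

Without the watermark, the state dictionary would have to keep every window forever in case a late event arrived. The delay is the trade-off: a longer delay tolerates later data but holds more state and emits results later.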

Worked example: Harbor Supply nightly order ETL

Harbor Supply lands raw orders/ and line_items/ Parquet in S3 hourly from application databases via CDC. A PySpark job (scheduled by Airflow at 02:00 UTC) consolidates yesterday's partition:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F  # F.sum avoids shadowing Python's built-in sum

spark = SparkSession.builder \
    .appName("harbor-orders-daily") \
    .config("spark.sql.shuffle.partitions", "200") \
    .getOrCreate()

# Read only yesterday's partition of each raw table.
orders = spark.read.parquet("s3://harbor-lake/raw/orders/dt=2026-06-07/")
items  = spark.read.parquet("s3://harbor-lake/raw/line_items/dt=2026-06-07/")

# Join, keep fulfilled orders, and aggregate per warehouse and SKU.
enriched = orders.join(items, "order_id") \
    .filter(F.col("status").isin("shipped", "delivered")) \
    .groupBy("warehouse_id", "sku") \
    .agg(
        F.sum("quantity").alias("units"),
        F.sum("line_total").alias("revenue")
    )

# Overwrite makes reruns for the same date idempotent.
enriched.write.mode("overwrite") \
    .partitionBy("warehouse_id") \
    .parquet("s3://harbor-lake/mart/daily_sku_revenue/dt=2026-06-07/")

The join is broadcast-eligible on line_items after filtering (18 MB), provided spark.sql.autoBroadcastJoinThreshold is raised above its 10 MB default or a broadcast hint is used. Output feeds ClickHouse via a follow-on clickhouse-client INSERT step. Runtime: 14 minutes on eight executors (4 cores, 16 GB each), versus nine hours for the single-machine Python script on one r5.2xlarge. The Spark UI showed one shuffle stage (the groupBy) and no skew, because canceled orders are filtered out early in the plan.
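The job above hard-codes dt=2026-06-07. In a scheduled run the date would normally be derived from the run date; the sketch below shows one way to build the three paths, assuming the 02:00 UTC run processes the previous calendar day. The partition_paths helper is hypothetical; only the bucket and prefixes come from the example.

```python
from datetime import date, timedelta
from typing import Dict

LAKE = "s3://harbor-lake"


def partition_paths(run_date: date) -> Dict[str, str]:
    """Input and output paths for the day before the run date."""
    dt = (run_date - timedelta(days=1)).isoformat()
    return {
        "orders": f"{LAKE}/raw/orders/dt={dt}/",
        "line_items": f"{LAKE}/raw/line_items/dt={dt}/",
        "output": f"{LAKE}/mart/daily_sku_revenue/dt={dt}/",
    }


# A run on 2026-06-08 consolidates the 2026-06-07 partition.
paths = partition_paths(date(2026, 6, 8))
for name, path in paths.items():
    print(f"{name}: {path}")
```

Deriving every path from a single date keeps reruns and backfills deterministic: rerunning for the same run date reads the same inputs and overwrites the same output partition.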

Engine decision table

Need | Prefer | Why
Nightly TB-scale ETL over lake files | Spark batch | Mature connectors, SQL API, cluster elasticity on EMR/K8s
Sub-second streaming joins with large state | Flink | True event-time processing, lower latency than micro-batch
Interactive SQL on billions of pre-loaded rows | ClickHouse / warehouse | Columnar storage and indexes beat recomputing from raw files
Local analytics on laptop Parquet | DuckDB | Zero cluster overhead; explore before promoting to Spark
Orchestrate DAG of shell scripts and sensors | Airflow | Scheduling and retries, not a compute engine
Python-native ML on single-node data | pandas / Polars | Spark overhead unjustified under ~50 GB in memory

Common pitfalls

  • Millions of small files — one partition per tiny file creates task overhead; compact with coalesce on write or a periodic compaction job.
  • Default shuffle partitions (200) — wrong for 10 GB (too many tiny tasks) or 50 TB (too few huge ones); tune per job size.
  • Uneven key distribution — one hot partition finishes last; monitor stage duration variance and salt keys.
  • Calling collect() or toPandas() on large results — both pull every row to the driver and cause OOM there when result sets are large.
  • Ignoring the Spark UI — guessing at bottlenecks without checking shuffle spill, GC time, and stragglers.
  • Running Spark for tiny data — cluster startup and serialization cost exceeds benefit under ~100 GB for simple transforms.
  • Python UDFs on hot paths — they are opaque to Catalyst and serialize rows between the JVM and Python; prefer built-in functions or Scala UDFs. Avoid mutable global state inside UDFs, since each executor process holds its own copy.
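The salting fix for uneven key distribution, mentioned in the list above, can be sketched in plain Python. This is a simulation of the idea, not PySpark code: the table contents, the salt count of 4, and the function names are all assumptions chosen for illustration.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple

SALTS = 4  # number of sub-keys each hot key is split into


def salt_large_side(rows: List[Tuple[str, int]], hot_keys: Set[str],
                    rng: random.Random) -> List[Tuple[Tuple[str, int], int]]:
    """Append a random salt to hot keys on the skewed (large) side."""
    return [((key, rng.randrange(SALTS) if key in hot_keys else 0), value)
            for key, value in rows]


def explode_small_side(rows: List[Tuple[str, str]],
                       hot_keys: Set[str]) -> List[Tuple[Tuple[str, int], str]]:
    """Replicate hot-key rows once per salt so every salted key finds a match."""
    out = []
    for key, value in rows:
        salts = range(SALTS) if key in hot_keys else [0]
        out.extend(((key, s), value) for s in salts)
    return out


def join_and_sum(large, small) -> Dict[Tuple[str, str], int]:
    """Join on the salted key, then aggregate with the salt dropped."""
    lookup = dict(small)
    totals = defaultdict(int)
    for salted_key, qty in large:
        if salted_key in lookup:
            totals[(salted_key[0], lookup[salted_key])] += qty
    return dict(totals)


rng = random.Random(7)  # fixed seed so the run is repeatable
events = [("cust-hot", 1)] * 4000 + [(f"cust-{i}", 1) for i in range(6)]
customers = [("cust-hot", "EU")] + [(f"cust-{i}", "US") for i in range(6)]
hot = {"cust-hot"}

salted = salt_large_side(events, hot, rng)
result = join_and_sum(salted, explode_small_side(customers, hot))
load = Counter(key for key, _ in salted)

print("rows per salted hot key:", {k: v for k, v in load.items() if k[0] == "cust-hot"})
print("total for hot key after aggregation:", result[("cust-hot", "EU")])
```

The final totals match an unsalted join, but the 4,000 hot rows are now spread over four join keys that can land in different partitions. The cost is a small side that grows by the salt count for hot keys, which is why salting is applied only to keys known to be skewed.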

Production checklist

  • Pin Spark, Scala, and Python versions; test upgrades on a staging cluster with production-sized sample data.
  • Set spark.sql.shuffle.partitions, broadcast threshold, and executor memory/overhead explicitly — do not rely on defaults at scale.
  • Partition lake output by filter columns (date, region); target 128 MB–1 GB Parquet files.
  • Enable Spark event logs and integrate with history server or Databricks for post-mortems.
  • Schedule jobs via Airflow with retries and SLA alerts; make writes idempotent (partition overwrite or merge) so a retry cannot duplicate data.
  • Monitor shuffle bytes, spill, and job duration trends — regressions often trace to upstream schema or volume changes.
  • Document skew fixes and broadcast exceptions; review Catalyst physical plans for new joins.
  • Right-size clusters with dynamic allocation; tear down idle EMR/K8s resources to control cost.
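Setting the key options explicitly, as the checklist recommends, might look like the following. This is a sketch that assembles a spark-submit command line in Python: the configuration keys are standard Spark properties, but every value, the script name, the event-log path, and the cluster URL are placeholders to be replaced per job.

```python
import shlex
from typing import Dict


def spark_submit_command(app: str, master: str, conf: Dict[str, str]) -> str:
    """Build a spark-submit command with one --conf flag per setting."""
    parts = ["spark-submit", "--master", master]
    for key in sorted(conf):               # sorted for stable, diffable output
        parts += ["--conf", f"{key}={conf[key]}"]
    parts.append(app)
    return " ".join(shlex.quote(p) for p in parts)


# Illustrative values only; size them from measured shuffle volume and node shape.
conf = {
    "spark.sql.shuffle.partitions": "400",
    "spark.sql.autoBroadcastJoinThreshold": str(32 * 1024 * 1024),
    "spark.executor.memory": "16g",
    "spark.executor.memoryOverhead": "2g",
    "spark.executor.cores": "4",
    "spark.dynamicAllocation.enabled": "true",
    "spark.eventLog.enabled": "true",
    "spark.eventLog.dir": "s3a://harbor-lake/spark-events/",  # hypothetical path
}

command = spark_submit_command(
    app="harbor_orders_daily.py",                  # hypothetical script name
    master="k8s://https://example.invalid:6443",   # placeholder cluster URL
    conf=conf,
)
print(command)
```

Keeping the settings in one dictionary under version control makes configuration changes reviewable, and the printed command can be compared against what the Spark UI's Environment tab reports for a running job.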

Key takeaways

  • Spark is a distributed compute engine for batch and micro-batch ETL — it processes data in place on lakes, not as a transactional database.
  • DataFrames, Catalyst, and lazy DAG execution are the modern API; understand partitions and shuffle before tuning.
  • Broadcast small sides, salt skewed keys, and write well-partitioned Parquet — file layout matters as much as cluster size.
  • Pair Spark with Airflow for orchestration and ClickHouse or a warehouse for interactive serving.
  • Use the Spark UI; most slow jobs are shuffle, skew, or too many small files — not "Spark is slow."
