Guide
Apache Spark explained
Your data team lands 400 GB of order events in S3 every night. A single-machine Python script takes nine hours and dies on out-of-memory errors halfway through a join. Apache Spark is the open-source distributed compute engine behind most large-scale ETL, feature engineering, and batch analytics jobs: it splits datasets into partitions, runs transformations in parallel across executors, and optimizes SQL plans with the Catalyst optimizer. Originally built at UC Berkeley's AMPLab, Spark powers lakehouse pipelines at Databricks, Netflix, Apple, and thousands of mid-size teams running on Kubernetes or EMR. This guide covers Spark's driver-executor architecture, RDDs vs DataFrames, lazy evaluation and stages, partitioning and shuffle — the dominant cost center — Spark SQL and Structured Streaming, a Harbor Supply nightly ETL worked example, an engine decision table, common pitfalls, and a production checklist alongside our ETL/ELT pipelines guide, Kafka guide, and lakehouse overview.
What Spark is — and what it is not
Spark is a general-purpose distributed data processing engine. You write transformations (map, filter, join, groupBy) in Python (PySpark), Scala, SQL, or R; Spark schedules them across a cluster and handles fault tolerance by recomputing lost partitions from lineage. It is optimized for batch and micro-batch workloads over data already stored in files (Parquet, JSON, ORC), databases, or streams — not for millisecond OLTP transactions.
Spark is not a database like ClickHouse or PostgreSQL. It does not own storage; it reads from object stores, HDFS, or JDBC sources, processes in memory and on disk spill, and writes results back out. Nor is it a workflow scheduler — pair Spark jobs with Airflow or Dagster for dependency graphs, retries, and SLAs. Spark executes the heavy step; orchestrators coordinate when it runs.
Core concepts
- SparkSession — entry point in Spark 2.x+; holds the SparkContext, SQL catalog, and configuration.
- DataFrame / Dataset — typed tabular abstraction with Catalyst optimization; preferred over raw RDDs for most work.
- RDD — resilient distributed dataset; low-level partition collection with lineage graph (legacy but still under the hood).
- Transformation — lazy operation (filter, select, join) that builds a DAG without executing yet.
- Action — triggers execution (count, collect, write) and materializes stages.
- Partition — logical slice of a DataFrame; parallelism unit mapped to tasks on executors.
- Shuffle — cross-partition data exchange (groupBy, join, repartition); expensive network and disk I/O.
- Stage — set of tasks separated by shuffle boundaries; Spark schedules stages sequentially or in parallel where possible.
Architecture: driver, executors, and cluster managers
A Spark application has one driver process (your main()
or notebook kernel) and one or more executors on worker nodes.
The driver parses your code, builds the logical and physical plans, breaks work
into tasks, and ships them to executors. Executors run tasks, cache partitions in
memory or spill to disk, and return results or write output.
Cluster managers — standalone, YARN, Mesos, or
Kubernetes
— allocate CPU and memory containers. On K8s, the Spark Operator or
spark-submit --master k8s://... spins up driver and executor pods;
dynamic allocation scales executor count with queue depth. Right-size executor
memory (heap + overhead) and cores: too few executors underutilize the cluster;
too many tiny executors add scheduling overhead.
Lazy evaluation and the DAG
Transformations are lazy: chaining df.filter(...).select(...).groupBy(...)
only records a lineage graph. An action like df.write.parquet(...) or
df.count() triggers Catalyst to optimize the logical plan (predicate
pushdown, column pruning, constant folding) and the physical plan (broadcast hash
join vs sort-merge join). The UI at port 4040 shows stages, shuffle read/write
bytes, and straggler tasks — your first stop when a job runs slow.
Partitions, shuffle, and performance levers
Default partition count often follows input file splits (one partition per HDFS block
or Parquet file). After a groupBy or wide join, Spark
shuffles: it hash-partitions by key, writes sorted buckets to disk, and exchanges
data across the network. Shuffle dominates runtime on large joins — a 2 TB skewed
join on customer_id can hang for hours if one key owns 40% of rows.
Tuning habits that matter
- Partition count — target 128–512 MB per partition after shuffle; use
repartition(n)before heavy joins,coalesce(n)only to reduce without full shuffle. - Broadcast joins — when one side is under ~10–50 MB (configurable via
spark.sql.autoBroadcastJoinThreshold), Spark ships a hash table to every executor and avoids shuffle. - Salting skewed keys — duplicate hot keys with random suffix, join, then aggregate — classic fix for order-status or celebrity-user skew.
- File layout — write partitioned Parquet by
dateorregionso filters skip files; avoid millions of tiny files (target 128 MB–1 GB per file). - Cache wisely —
persist(MEMORY_AND_DISK)only for DataFrames reused across multiple actions; unpersist when done. - Avoid
collect()— pulls entire datasets to the driver; usetake(10)for samples or write to storage for full output.
Spark SQL and Structured Streaming
Spark SQL lets you query DataFrames with SQL or the DataFrame API
interchangeably. Register temp views (df.createOrReplaceTempView("orders")),
run spark.sql("SELECT ..."), and read/write Hive metastore tables,
Delta Lake, Iceberg, or JDBC sources. Catalyst pushes filters and projections into
Parquet readers — a WHERE order_date = '2026-06-01' on partitioned
data skips irrelevant directories entirely.
Structured Streaming treats a stream as an unbounded table: micro-batches read from Kafka, apply the same DataFrame transforms, and write sinks (Parquet, JDBC, console) with checkpointed exactly-once semantics. Watermarks bound state for windowed aggregations. For sub-second complex event processing with large state, compare Flink; for nightly batch over lake files, vanilla batch Spark is simpler and cheaper.
Worked example: Harbor Supply nightly order ETL
Harbor Supply lands raw orders/ and line_items/ Parquet in
S3 hourly from application databases via
CDC.
A PySpark job (scheduled by Airflow at 02:00 UTC) consolidates yesterday's partition:
spark = SparkSession.builder \
.appName("harbor-orders-daily") \
.config("spark.sql.shuffle.partitions", "200") \
.getOrCreate()
orders = spark.read.parquet("s3://harbor-lake/raw/orders/dt=2026-06-07/")
items = spark.read.parquet("s3://harbor-lake/raw/line_items/dt=2026-06-07/")
enriched = orders.join(items, "order_id") \
.filter(col("status").isin("shipped", "delivered")) \
.groupBy("warehouse_id", "sku") \
.agg(
sum("quantity").alias("units"),
sum("line_total").alias("revenue")
)
enriched.write.mode("overwrite") \
.partitionBy("warehouse_id") \
.parquet("s3://harbor-lake/mart/daily_sku_revenue/dt=2026-06-07/")
The join is broadcast-eligible on line_items after filtering (18 MB).
Output feeds
ClickHouse via a
follow-on clickhouse-client INSERT step. Runtime: 14 minutes on eight
executors (4 cores, 16 GB each) vs nine hours on a single r5.2xlarge Python script.
Spark UI showed one shuffle stage (the groupBy); no skew after filtering canceled
orders early in the plan.
Engine decision table
| Need | Prefer | Why |
|---|---|---|
| Nightly TB-scale ETL over lake files | Spark batch | Mature connectors, SQL API, cluster elasticity on EMR/K8s |
| Sub-second streaming joins with large state | Flink | True event-time processing, lower latency than micro-batch |
| Interactive SQL on billions of pre-loaded rows | ClickHouse / warehouse | Columnar storage and indexes beat recomputing from raw files |
| Local analytics on laptop Parquet | DuckDB | Zero cluster overhead; explore before promoting to Spark |
| Orchestrate DAG of shell scripts and sensors | Airflow | Scheduling and retries — not a compute engine |
| Python-native ML on single-node data | pandas / Polars | Spark overhead unjustified under ~50 GB in memory |
Common pitfalls
- Millions of small files — one partition per tiny file creates task overhead; compact with
coalesceon write or a periodic compaction job. - Default shuffle partitions (200) — wrong for 10 GB (too many tiny tasks) or 50 TB (too few huge ones); tune per job size.
- Uneven key distribution — one hot partition finishes last; monitor stage duration variance and salt keys.
- Chaining
collect()ortoPandas()— OOM on the driver when result sets are large. - Ignoring the Spark UI — guessing at bottlenecks without checking shuffle spill, GC time, and stragglers.
- Running Spark for tiny data — cluster startup and serialization cost exceeds benefit under ~100 GB for simple transforms.
- Mutable global state in UDFs — Python UDFs bypass Catalyst and serialize row-by-row; prefer built-in functions or Scala UDFs for hot paths.
Production checklist
- Pin Spark, Scala, and Python versions; test upgrades on a staging cluster with production-sized sample data.
- Set
spark.sql.shuffle.partitions, broadcast threshold, and executor memory/overhead explicitly — do not rely on defaults at scale. - Partition lake output by filter columns (date, region); target 128 MB–1 GB Parquet files.
- Enable Spark event logs and integrate with history server or Databricks for post-mortems.
- Schedule jobs via Airflow with retry and SLA sensors; idempotent overwrite or merge writes.
- Monitor shuffle bytes, spill, and job duration trends — regressions often trace to upstream schema or volume changes.
- Document skew fixes and broadcast exceptions; review Catalyst physical plans for new joins.
- Right-size clusters with dynamic allocation; tear down idle EMR/K8s resources to control cost.
Key takeaways
- Spark is a distributed compute engine for batch and micro-batch ETL — it processes data in place on lakes, not as a transactional database.
- DataFrames, Catalyst, and lazy DAG execution are the modern API; understand partitions and shuffle before tuning.
- Broadcast small sides, salt skewed keys, and write well-partitioned Parquet — file layout matters as much as cluster size.
- Pair Spark with Airflow for orchestration and ClickHouse or a warehouse for interactive serving.
- Use the Spark UI; most slow jobs are shuffle, skew, or too many small files — not "Spark is slow."
Related reading
- ETL and ELT data pipelines explained — where Spark fits in ingest, transform, and load stages
- Apache Kafka explained — streaming source for Structured Streaming jobs
- Data warehouses and lakehouses explained — lake storage patterns Spark reads and writes
- Apache Airflow explained — scheduling and operating Spark jobs in production DAGs