If you've spent your career in SQL Server, the first time someone says "Spark" you probably picture something on fire. The second time, you hear "it's like Hadoop but faster." Neither of these is useful. Let me give you the version I wish someone had given me.
The Problem Spark Solves
SQL Server is excellent at what it does. But it runs on one machine. Even when you throw hardware at it — more RAM, faster NVMe, higher core count — you're scaling up, not out. There's a ceiling, and if your data is growing faster than hardware can scale, you hit it.
The alternative is to distribute the computation across many machines. Instead of one server doing all the work, you have ten (or a hundred) servers each handling a slice of the data simultaneously. The results are aggregated at the end. This is what Google's original MapReduce paper described in 2004, what Apache Hadoop implemented, and what Apache Spark dramatically improved on starting around 2009.
Spark is a distributed computing engine. It takes a dataset, splits it into pieces called partitions, sends those partitions to processes called executors running on worker machines, runs your transformations on each partition in parallel, and collects the results. The core innovation over Hadoop MapReduce is that Spark can keep intermediate data in memory instead of writing it to disk between steps, which is why it can be 10–100x faster for iterative workloads.
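To make the split-process-combine idea concrete, here is a minimal sketch in plain Python. It is not Spark code: threads stand in for executors, and the function names (`split_into_partitions`, `partial_sum`, `distributed_sum_of_squares`) are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, num_partitions):
    """Deal rows round-robin into num_partitions lists."""
    return [rows[i::num_partitions] for i in range(num_partitions)]


def partial_sum(partition):
    """Work done independently on one partition: sum of squares."""
    return sum(x * x for x in partition)


def distributed_sum_of_squares(rows, num_partitions=4):
    partitions = split_into_partitions(rows, num_partitions)
    # Threads stand in for executors; a real cluster uses separate machines.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(partial_sum, partitions))
    # Final aggregation: combine the per-partition results into one answer.
    return sum(partials)


total = distributed_sum_of_squares(list(range(1, 101)))
print(total)
```

Each partition is processed without knowing about the others, and only the small partial results are combined at the end. That independence is what lets the same pattern spread across many machines.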
The Spark Execution Model in Plain Terms
Every Spark application has two logical components:
- Driver: The process that runs your code — your notebook, your script, your application. It's responsible for parsing your transformations, building an execution plan, and coordinating the work across worker machines. One per application.
- Executors: JVM processes running on worker machines that actually execute the computation. Each executor handles tasks assigned by the driver and caches data in memory. Many per application, depending on cluster size.
The driver is where your code runs. The executors are where the heavy lifting happens. Understanding this split is the single most important concept in Spark programming. I'll spend an entire post on it, but keep it in mind from the start.
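The driver/executor split can be sketched as a toy model in plain Python. None of this is the Spark API: `ToyDriver` and `run_on_executor` are hypothetical names, and everything runs in one process. The point is only the division of labor: the driver records a plan, and the per-partition work happens elsewhere when a result is actually requested.

```python
class ToyDriver:
    """Hypothetical stand-in for the driver: records a plan, then dispatches it."""

    def __init__(self, partitions):
        self.partitions = partitions  # data already split across "executors"
        self.plan = []                # ordered list of (kind, function) steps

    def map(self, fn):
        self.plan.append(("map", fn))     # recorded only, nothing runs yet
        return self

    def filter(self, fn):
        self.plan.append(("filter", fn))  # recorded only, nothing runs yet
        return self

    def collect(self):
        # Requesting results: send the whole plan to each partition, gather output.
        results = []
        for partition in self.partitions:
            results.extend(run_on_executor(self.plan, partition))
        return results


def run_on_executor(plan, partition):
    """Hypothetical executor task: apply every planned step to one partition."""
    rows = list(partition)
    for kind, fn in plan:
        if kind == "map":
            rows = [fn(r) for r in rows]
        else:  # "filter"
            rows = [r for r in rows if fn(r)]
    return rows


driver = ToyDriver(partitions=[[1, 2, 3], [4, 5, 6]])
job = driver.map(lambda x: x * 10).filter(lambda x: x > 20)
steps_before_action = len(job.plan)  # the plan exists, but no rows were touched
output = job.collect()
print(steps_before_action, output)
```

Real Spark behaves similarly in spirit: transformations are recorded on the driver, and work is sent to executors only when an action such as `collect()` asks for a result.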
How This Compares to SQL Server Execution
In SQL Server, when you run a query, the query optimizer produces an execution plan and the engine executes it. With parallelism enabled (MAXDOP set to 0 or to a value greater than 1), SQL Server can use multiple threads, but they all run on the same machine and share the same memory. It's intra-node parallelism.
Spark is inter-node parallelism. Different machines. Different memory spaces. Data moves across the network when operations require it (this is a "shuffle"; more on that later). The upside is near-linear horizontal scale for workloads that partition well: add machines, get more throughput. The downside is that network I/O is slow, and certain operations that are trivial in SQL Server (like sorting the entire result set) become expensive in distributed systems.
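Here is a rough illustration of why a shuffle costs something, again in plain Python rather than Spark. To total orders per customer, every row for a given customer has to end up in the same partition. The sketch counts how many rows would have to change partitions. The function names are hypothetical, and `key % num_output_partitions` is a simplified stand-in for hash partitioning.

```python
from collections import defaultdict


def shuffle_by_key(input_partitions, num_output_partitions):
    """Re-bucket (key, value) rows so each key lands in exactly one output partition.

    Assumes input and output partition counts match, so that partition index i
    can be read as "the same machine" before and after the shuffle.
    """
    output = [[] for _ in range(num_output_partitions)]
    rows_moved = 0
    for source_index, partition in enumerate(input_partitions):
        for key, value in partition:
            target_index = key % num_output_partitions  # stand-in for hashing
            if target_index != source_index:
                rows_moved += 1  # on a cluster, this row would cross the network
            output[target_index].append((key, value))
    return output, rows_moved


def sum_by_key(partition):
    """Per-partition aggregation, safe only once all rows for a key are co-located."""
    totals = defaultdict(int)
    for key, value in partition:
        totals[key] += value
    return dict(totals)


# (customer_id, order_amount) rows; customers 1 and 2 appear in both partitions.
input_partitions = [
    [(1, 10), (2, 20), (1, 5)],
    [(2, 7), (3, 30), (1, 1)],
]
shuffled, rows_moved = shuffle_by_key(input_partitions, num_output_partitions=2)
totals_per_partition = [sum_by_key(p) for p in shuffled]
print(rows_moved, totals_per_partition)
```

In this tiny example half of the rows have to move before the aggregation can run. On a single machine that movement is a memory copy. On a cluster it is network traffic, which is why shuffle-heavy operations dominate the cost of many jobs.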
What Spark Doesn't Replace
Spark is not a database. It doesn't manage storage; it reads from and writes to external storage systems (Azure Data Lake Storage, Amazon S3, HDFS), often through a table format such as Delta Lake. It's not optimized for transactional workloads (row-by-row inserts, small frequent queries). It doesn't have indexes in the traditional sense.
What it does well: batch processing of large datasets, complex multi-step transformations, ML training pipelines, streaming ingestion at scale, and analytics on data volumes that won't fit comfortably on a single machine's storage or memory.
Databricks Is Spark, Packaged
Apache Spark is open source. You can run it yourself on a cluster, but provisioning and tuning Spark clusters is its own specialty. Databricks is a managed platform built by Spark's original creators that handles cluster management, autoscaling, notebook environments, and a number of proprietary optimizations on top of the open source engine.
For this series, I'm working in Databricks. The core Spark API is the same, so code you write here will run on any Spark cluster with minimal changes. The parts tied to the Databricks platform (Delta Lake and MLflow integrations, cluster config) will be called out explicitly.
Next up: getting a cluster running in Databricks Community Edition so you can follow along.