You stand up a 10-node Databricks cluster. You open a notebook. You write some pandas code to process a 50GB dataset. It runs slower than it did on your laptop. You file a ticket asking why Databricks is worse than your laptop. I've been that person. Here's what was actually happening.
Where pandas Runs
When you run pandas code in a Databricks notebook, it runs on the driver node only. All 9 worker nodes are sitting idle. You paid for a 10-node cluster and are using exactly one of them — and it might not even be your beefiest machine, since the driver is sized for coordination work, not data processing.
This is the driver-bound workload problem. Your computation is bound to the driver because that's where pandas executes. Spark's distributed execution model doesn't apply to pandas operations. They're single-threaded Python on one machine.
The specific trap that gets everyone:
df = spark.read.parquet("abfss://container@storage.dfs.core.windows.net/events/year=2020/")
pandas_df = df.toPandas() # <-- here is your problem
# Now you have a pandas DataFrame. On the driver. All 50GB of it.
result = pandas_df.groupby("event_type")["session_duration"].mean()
print(result)
toPandas() is collect() with extra steps. It pulls every row across the network from all executor JVMs to the driver node and materializes them as a pandas DataFrame in driver memory. If the data fits in driver RAM, it works. You've just paid for 9 nodes you didn't use. If it doesn't fit, the driver OOMs and your job fails.
The Fix: Keep the Work in Spark
The same computation, done correctly:
from pyspark.sql import functions as F
df = spark.read.parquet("abfss://container@storage.dfs.core.windows.net/events/year=2020/")
result = df.groupBy("event_type") .agg(F.avg("session_duration").alias("avg_duration")) .orderBy("avg_duration", ascending=False)
result.show(20)
Here, groupBy and agg run distributed across all your executors. Each executor computes partial aggregates on its partition. Spark combines those partial aggregates into the final result. Only the final, small result set moves to the driver for display. You used the whole cluster.
The Five Patterns That Create Driver-Bound Workloads
1. toPandas() on large DataFrames — shown above. Fine for small result sets after aggregation. Dangerous on raw tables.
2. collect() followed by Python iteration
# Don't do this
rows = df.collect()
for row in rows:
process_something(row["value"])
# Do this instead
df.foreach(lambda row: process_something(row["value"]))
# or better: write a proper transformation as a groupBy/agg/withColumn
3. Row-by-row Python UDFs on large datasets
# This UDF runs in the driver's Python process for each row
# (actually it runs on executors, but in a slow Python process with serialization overhead)
@udf(returnType=StringType())
def clean_phone(phone):
return re.sub(r'D', '', str(phone))
# Better: use built-in Spark functions when possible
from pyspark.sql import functions as F
df = df.withColumn("phone_clean", F.regexp_replace(F.col("phone"), r'D', ''))
4. Reading CSV with pandas then converting to Spark
# This reads the file into driver memory first
import pandas as pd
pandas_df = pd.read_csv("/dbfs/mnt/data/big_file.csv")
spark_df = spark.createDataFrame(pandas_df)
# Read directly with Spark instead
spark_df = spark.read.csv("/mnt/data/big_file.csv", header=True, inferSchema=True)
5. Using pandas for data exploration on unsampled production data
# Sampling first, then using pandas, is fine
sample_df = production_df.sample(fraction=0.01).toPandas()
# Now pandas on ~10k rows is totally reasonable
When pandas Is Appropriate
I'm not saying pandas is wrong in Databricks. It's wrong for large-scale distributed processing. It's right for:
- Small reference datasets (country codes, config tables) that you need to manipulate and broadcast
- Final aggregated result sets (10k rows or fewer) that you want to visualize or export
- Exploratory analysis on a representative sample
- Integration with ML libraries that require pandas input (sklearn, statsmodels)
The rule: pandas on the driver is fine when the data is small. PySpark on the executors is what you want when the data is large. The mistake is reaching for pandas out of habit when the data is large, because pandas is what you know.
Pandas API on Spark (koalas)
Databricks ships a library called Koalas (later merged into PySpark as the pandas-on-Spark API in Spark 3.2) that lets you write pandas-style code that actually executes distributed on Spark. In 2020 it's Koalas, installed separately:
import databricks.koalas as ks
kdf = ks.read_parquet("abfss://container@storage.dfs.core.windows.net/events/year=2020/")
result = kdf.groupby("event_type")["session_duration"].mean()
result.head(20)
Koalas translates pandas API calls into Spark operations under the hood. It's not 100% API-compatible and the error messages can be cryptic, but for teams heavily invested in pandas syntax, it's a lower-friction migration path than rewriting everything to native PySpark.
For new code, learn the native PySpark API. The mental model is clearer, the error messages are better, and you'll understand what's actually happening when things go wrong.
Next: partitions — the unit of parallelism that determines whether your cluster is actually working or mostly waiting.