Python Won the Data Engineering Language War
The Scala vs. Python debate for Spark has been running since Spark launched. Two years ago, the Scala camp had a real argument: Python's PySpark API lagged behind, serialization overhead for Python UDFs was measurable, and if you were doing anything performance-sensitive, Scala was the right call.
That argument is weaker now, and it's getting weaker every quarter. Python won. Here's why that happened and what it means for how data engineering teams should be staffing and coding.
What Changed on the PySpark Side
PySpark's DataFrame API, introduced in Spark 1.3 and stabilized through 1.5 and 1.6, operates at the JVM layer for most operations. When you call df.groupBy("user_id").agg(F.count("event_id")) from Python, the actual execution happens in the JVM — the Python process is just issuing instructions. The serialization overhead that made early PySpark RDD jobs slower than their Scala equivalents largely disappears when you're working through the DataFrame API.
Python UDFs are still slow. If you write a udf() in Python, data crosses the Python-JVM boundary for every row and every invocation. The answer is to not write Python UDFs for anything performance-sensitive — use the built-in Spark functions (pyspark.sql.functions) wherever possible. When you absolutely need custom logic, Scala UDFs registered and called from Python get you JVM-speed execution with a Python calling interface.
The Ecosystem Argument
The bigger reason Python won is not the performance gap narrowing. It's the ecosystem.
Data scientists run Python. The entire machine learning ecosystem — scikit-learn, NumPy, pandas, matplotlib — is Python. When your data engineering team speaks Scala and your data science team speaks Python, the handoff between them involves a format translation, an environment translation, and a language translation. Things get lost. Prototypes that work in a Jupyter notebook don't make it cleanly into production because the production system speaks a different language.
When both teams run Python, the prototype in the notebook and the production Spark job share libraries, share patterns, and can share code. The distance between "data scientist proves the logic works" and "data engineer puts it in production" shrinks dramatically.
A Real PySpark Workflow
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window
spark = SparkSession.builder.appName("user-engagement").getOrCreate()
events = spark.read.parquet("s3://data-lake/processed/events/2016/04/")
# Session reconstruction using window functions — no Python UDF needed
session_window = Window.partitionBy("user_id").orderBy("event_ts")
with_sessions = events.withColumn(
"time_since_last",
F.col("event_ts") - F.lag("event_ts").over(session_window)
).withColumn(
"new_session",
(F.col("time_since_last") > 1800).cast("int") # 30-minute session gap
).withColumn(
"session_id",
F.concat(F.col("user_id"), F.lit("_"),
F.sum("new_session").over(session_window).cast("string"))
)
session_summary = with_sessions.groupBy("user_id", "session_id").agg(
F.count("event_id").alias("event_count"),
F.min("event_ts").alias("session_start"),
F.max("event_ts").alias("session_end"),
(F.max("event_ts") - F.min("event_ts")).alias("duration_seconds")
)
session_summary.write.mode("overwrite").partitionBy("event_date") \
.parquet("s3://data-lake/processed/sessions/2016/04/")This is idiomatic PySpark using window functions — every operation runs in the JVM. A senior data scientist who has never seen Spark can read this code and understand what it does. A data engineer who has never run an ML pipeline can adapt it to process ML features instead of session events. That shared readability has real value.
Where Scala Still Wins
Structured Streaming (in preview as of Spark 2.0, coming later this year) is currently richer in Scala. Custom partitioners and serializers still need to be written in the JVM. Libraries that wrap complex native code and haven't published Python bindings. If you're building low-latency streaming jobs or writing framework-level Spark infrastructure, Scala is still the right tool. For the large majority of data pipeline work — batch transformations, aggregations, feature engineering — Python is the right default in 2016.
If you're in a shop still defaulting to Scala for batch PySpark work, I'd like to hear the reasoning. As always, I'm here to help.