Spark SQL UDFs vs Built-in Functions: When to Write Your Own and When to Stop
There's a performance trap in Databricks that SQL Server DBAs fall into consistently. You know T-SQL functions — DATEPART, ISNULL, CHARINDEX, SUBSTRING. You know they're fast. So when you move to Spark SQL, you write user-defined functions for the cases where you need something the built-ins don't cover. And you discover, usually after seeing a query take 40 minutes instead of 4, that Python UDFs in Spark are a very different beast.
The Problem With Python UDFs
When you define a Python UDF in PySpark, Spark serializes each row, sends it from the JVM (where Spark runs) to a Python process (where your function runs), executes the Python, serializes the result, and sends it back. This row-by-row serialization is slow — often 100x slower than the equivalent operation using Spark's built-in functions.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
# This looks fine but has serious performance implications on large tables
@udf(returnType=StringType())
def categorize_order(amount):
if amount < 100:
return "small"
elif amount < 1000:
return "medium"
else:
return "large"
# Slow: row-by-row serialization
df.withColumn("order_category", categorize_order("total_amount"))The Right Approach: Built-in Functions First
Almost every UDF you'd write can be replaced with Spark's built-in functions, which run in the JVM with no serialization overhead:
from pyspark.sql.functions import when, col
# Fast: runs natively in the JVM
df.withColumn("order_category",
when(col("total_amount") < 100, "small")
.when(col("total_amount") < 1000, "medium")
.otherwise("large")
)Before writing a UDF, check the Spark SQL functions documentation. String manipulation (regexp_replace, substring, split), date handling (date_trunc, date_add, datediff), math, JSON parsing (get_json_object, from_json) — the built-in library is extensive. Most "I need a custom function for this" cases turn out to have a built-in answer.
When You Genuinely Need a UDF: Pandas UDFs
For cases where you truly need custom Python logic, Pandas UDFs (also called vectorized UDFs) are dramatically faster than row-level Python UDFs. Instead of serializing row by row, they pass entire Arrow batches between JVM and Python:
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType
@pandas_udf(StringType())
def categorize_order_vectorized(amounts: pd.Series) -> pd.Series:
return amounts.apply(lambda x: "small" if x < 100 else ("medium" if x < 1000 else "large"))
df.withColumn("order_category", categorize_order_vectorized("total_amount"))Pandas UDFs are still slower than built-in Spark functions, but they're orders of magnitude faster than standard Python UDFs. If you're doing complex string parsing, calling a Python library (like a geocoder), or applying logic that genuinely can't be expressed with built-in functions, Pandas UDFs are the right tool.
The Decision Tree
- Built-in Spark SQL function? Use it — fastest option.
- Can be expressed as a combination of built-ins? Do that — still JVM-native.
- Genuinely needs custom Python logic? Use a Pandas UDF — vectorized, much faster than row-level.
- Truly row-level with no vectorizable alternative? Regular Python UDF — but profile it first on a sample.
The habit to build: when you're writing a UDF, ask whether you're writing it because you need it or because it's the first approach that came to mind. Most of the time, the built-in functions are already there. As always, I'm here to help.