Spark SQL UDFs vs Built-in Functions: When to Write Your Own and When to Stop

There's a performance trap in Databricks that SQL Server DBAs fall into consistently. You know T-SQL functions — DATEPART, ISNULL, CHARINDEX, SUBSTRING. You know they're fast. So when you move to Spark SQL, you write user-defined functions for the cases where you need something the built-ins don't cover. And you discover, usually after seeing a query take 40 minutes instead of 4, that Python UDFs in Spark are a very different beast.

The Problem With Python UDFs

When you define a Python UDF in PySpark, Spark serializes each row, sends it from the JVM (where Spark runs) to a Python process (where your function runs), executes the Python, serializes the result, and sends it back. This row-by-row serialization is slow — often 100x slower than the equivalent operation using Spark's built-in functions.

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# This looks fine but has serious performance implications on large tables
@udf(returnType=StringType())
def categorize_order(amount):
    if amount < 100:
        return "small"
    elif amount < 1000:
        return "medium"
    else:
        return "large"

# Slow: row-by-row serialization
df.withColumn("order_category", categorize_order("total_amount"))

The Right Approach: Built-in Functions First

Almost every UDF you'd write can be replaced with Spark's built-in functions, which run in the JVM with no serialization overhead:

from pyspark.sql.functions import when, col

# Fast: runs natively in the JVM
df.withColumn("order_category",
  when(col("total_amount") < 100, "small")
  .when(col("total_amount") < 1000, "medium")
  .otherwise("large")
)

Before writing a UDF, check the Spark SQL functions documentation. String manipulation (regexp_replace, substring, split), date handling (date_trunc, date_add, datediff), math, JSON parsing (get_json_object, from_json) — the built-in library is extensive. Most "I need a custom function for this" cases turn out to have a built-in answer.

When You Genuinely Need a UDF: Pandas UDFs

For cases where you truly need custom Python logic, Pandas UDFs (also called vectorized UDFs) are dramatically faster than row-level Python UDFs. Instead of serializing row by row, they pass entire Arrow batches between JVM and Python:

import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

@pandas_udf(StringType())
def categorize_order_vectorized(amounts: pd.Series) -> pd.Series:
    return amounts.apply(lambda x: "small" if x < 100 else ("medium" if x < 1000 else "large"))

df.withColumn("order_category", categorize_order_vectorized("total_amount"))

Pandas UDFs are still slower than built-in Spark functions, but they're orders of magnitude faster than standard Python UDFs. If you're doing complex string parsing, calling a Python library (like a geocoder), or applying logic that genuinely can't be expressed with built-in functions, Pandas UDFs are the right tool.

The Decision Tree

  1. Built-in Spark SQL function? Use it — fastest option.
  2. Can be expressed as a combination of built-ins? Do that — still JVM-native.
  3. Genuinely needs custom Python logic? Use a Pandas UDF — vectorized, much faster than row-level.
  4. Truly row-level with no vectorizable alternative? Regular Python UDF — but profile it first on a sample.

The habit to build: when you're writing a UDF, ask whether you're writing it because you need it or because it's the first approach that came to mind. Most of the time, the built-in functions are already there. As always, I'm here to help.

Read more