Pandas UDFs in Spark: Vectorized Operations for Python-Native Engineers

PySpark DataFrames are the native Spark API. They're powerful, they're explicit about what's happening at the execution level, and they're the thing data engineers are usually writing. But there's a class of operations where Python data scientists are already experts — statistical computations, complex feature engineering, arbitrary Python library calls — and forcing those into PySpark idioms is awkward.

Pandas UDFs bridge that gap. They let you write Python functions that operate on Pandas Series or DataFrames, executed in parallel across Spark partitions, with Arrow-based serialization that's dramatically faster than row-by-row Python UDFs.

The Two Types That Matter

There are multiple Pandas UDF types, but two cover most use cases:

Series to Series (scalar UDF) — takes one column, returns one column, applied row-by-row in vectorized batches:

import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

@pandas_udf(StringType())
def extract_domain(email_series: pd.Series) -> pd.Series:
    return email_series.str.extract(r'@(.+)$')[0].fillna("unknown")

df.withColumn("email_domain", extract_domain("email")).show(5)

Iterator of Series (grouped UDF) — takes an iterator of batches, useful when you need to apply a stateful or expensive-to-initialize operation once per batch rather than per row:

from typing import Iterator
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import FloatType

@pandas_udf(FloatType())
def score_orders(amount_iter: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Expensive initialization happens once per executor, not once per row
    model = load_scoring_model()  # hypothetical — could be a pickle, an API, etc.

    for amounts in amount_iter:
        scores = model.predict(amounts.values)
        yield pd.Series(scores)

df.withColumn("risk_score", score_orders("total_amount"))

Performance vs Regular Python UDFs

Regular Python UDFs serialize each row individually between JVM and Python — typically 10–100x slower than built-in Spark functions. Pandas UDFs use Apache Arrow for serialization, which batches rows and compresses data. The performance difference is real:

  • For simple string operations: built-in Spark functions are still faster than Pandas UDFs, which are faster than row-level Python UDFs
  • For complex operations involving Python libraries (regex, NLP, scoring): Pandas UDFs are the practical choice
  • For operations requiring initialization overhead (loading a model, connecting to an external service): the iterator form amortizes that cost across batches

When to Use Them

Use a Pandas UDF when you need Python logic that:

  • Can't be expressed with Spark's built-in functions
  • Needs access to a Python library (scipy, scikit-learn, a custom geocoder)
  • Benefits from vectorized operations (Pandas string methods, numpy math)

Don't use them as a default for everything — check the built-in function library first. But when you genuinely need Python, Pandas UDFs are the right tool. As always, I'm here to help.

Read more