PySpark Window Functions: Translating ROW_NUMBER, RANK, and LAG to Python

If you know SQL Server window functions — ROW_NUMBER, RANK, LEAD, LAG, running totals — you already know the hardest conceptual part of PySpark window functions. The syntax is different but the semantics map directly. Here's the translation.

The Import You Need

from pyspark.sql import functions as F
from pyspark.sql.window import Window

The Window class is how you define the window specification — the equivalent of the OVER (PARTITION BY ... ORDER BY ...) clause in T-SQL.

ROW_NUMBER

-- T-SQL
SELECT
    order_id,
    customer_id,
    order_date,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS rn
FROM orders;

# PySpark
w = Window.partitionBy("customer_id").orderBy(F.col("order_date").desc())

df.withColumn("rn", F.row_number().over(w))

Classic use case — keep only each customer's most recent order:

latest_orders = (
    df.withColumn("rn", F.row_number().over(w))
      .filter(F.col("rn") == 1)
      .drop("rn")
)

RANK and DENSE_RANK

-- T-SQL
SELECT
    product_id,
    category,
    revenue,
    RANK() OVER (PARTITION BY category ORDER BY revenue DESC) AS rnk,
    DENSE_RANK() OVER (PARTITION BY category ORDER BY revenue DESC) AS dense_rnk
FROM products;

# PySpark
w = Window.partitionBy("category").orderBy(F.col("revenue").desc())

df.withColumn("rnk", F.rank().over(w))   .withColumn("dense_rnk", F.dense_rank().over(w))

LAG and LEAD

-- T-SQL
SELECT
    order_id,
    customer_id,
    order_date,
    total_amount,
    LAG(total_amount, 1, 0) OVER (PARTITION BY customer_id ORDER BY order_date) AS prev_order_amount,
    LEAD(order_date, 1) OVER (PARTITION BY customer_id ORDER BY order_date) AS next_order_date
FROM orders;

# PySpark
w = Window.partitionBy("customer_id").orderBy("order_date")

df.withColumn("prev_order_amount", F.lag("total_amount", 1, 0).over(w))   .withColumn("next_order_date", F.lead("order_date", 1).over(w))

The argument order: F.lag(column, offset, default). Offset defaults to 1. Default value (when there's no previous row) defaults to None/null.

Running Totals and Cumulative Aggregations

-- T-SQL
SELECT
    order_date,
    daily_revenue,
    SUM(daily_revenue) OVER (ORDER BY order_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_revenue
FROM daily_sales;

# PySpark
w = Window.orderBy("order_date")           .rowsBetween(Window.unboundedPreceding, Window.currentRow)

df.withColumn("cumulative_revenue", F.sum("daily_revenue").over(w))

For a 7-day rolling average:

w_7day = Window.orderBy("order_date")                .rowsBetween(-6, Window.currentRow)  # current row + 6 preceding

df.withColumn("rolling_7day_avg", F.avg("daily_revenue").over(w_7day))

NTILE

-- T-SQL
SELECT customer_id, total_revenue,
    NTILE(4) OVER (ORDER BY total_revenue DESC) AS revenue_quartile
FROM customers;

# PySpark
w = Window.orderBy(F.col("total_revenue").desc())
df.withColumn("revenue_quartile", F.ntile(4).over(w))

First Value / Last Value

-- T-SQL
SELECT
    customer_id,
    order_date,
    FIRST_VALUE(order_date) OVER (PARTITION BY customer_id ORDER BY order_date) AS first_order_date
FROM orders;

# PySpark
w = Window.partitionBy("customer_id").orderBy("order_date")
df.withColumn("first_order_date", F.first("order_date").over(w))

# LAST_VALUE with ROWS BETWEEN (important for correct results)
w_last = Window.partitionBy("customer_id")                .orderBy("order_date")                .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df.withColumn("last_order_date", F.last("order_date").over(w_last))

The ROWS BETWEEN clause on LAST_VALUE is the same gotcha as in SQL Server — without it, the default window frame for ordered windows is rows preceding the current row, which gives you the current row's value, not the partition's last value.

Performance Note

Window functions in Spark trigger a shuffle — all rows in the same partition key need to land on the same executor for the ordering and frame calculation to work. Large partition keys (low cardinality, like a 2-value "status" column) can create shuffle skew. High cardinality partition keys (like customer_id on 10M customers) are usually fine — the data distributes naturally across many partition groups.

If you're joining the result of a window function back to the original table, consider whether you can chain the window operation instead of creating an intermediate DataFrame and re-joining. Every unnecessary shuffle costs you.

PySpark Window Functions: Translating ROW_NUMBER, RANK, and LAG to Python

Shannon Lowder

The Import You Need

ROW_NUMBER

RANK and DENSE_RANK

LAG and LEAD

Running Totals and Cumulative Aggregations

NTILE

First Value / Last Value

Performance Note

Read more

The Context Problem Neither Agent Mesh Nor OpenSharing Solves

Unity AI Gateway and What a Governed Model Access Layer Actually Buys You

You Don't Need Fable. You Need a Router.

DAIS 2026: Genie One and the Context Problem Databricks Is Solving