If you know SQL Server window functions — ROW_NUMBER, RANK, LEAD, LAG, running totals — you already know the hardest conceptual part of PySpark window functions. The syntax is different but the semantics map directly. Here's the translation.
The Import You Need
from pyspark.sql import functions as F
from pyspark.sql.window import Window
The Window class is how you define the window specification — the equivalent of the OVER (PARTITION BY ... ORDER BY ...) clause in T-SQL.
ROW_NUMBER
-- T-SQL
SELECT
order_id,
customer_id,
order_date,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS rn
FROM orders;
# PySpark
w = Window.partitionBy("customer_id").orderBy(F.col("order_date").desc())
df.withColumn("rn", F.row_number().over(w))
Classic use case — keep only each customer's most recent order:
latest_orders = (
df.withColumn("rn", F.row_number().over(w))
.filter(F.col("rn") == 1)
.drop("rn")
)
RANK and DENSE_RANK
-- T-SQL
SELECT
product_id,
category,
revenue,
RANK() OVER (PARTITION BY category ORDER BY revenue DESC) AS rnk,
DENSE_RANK() OVER (PARTITION BY category ORDER BY revenue DESC) AS dense_rnk
FROM products;
# PySpark
w = Window.partitionBy("category").orderBy(F.col("revenue").desc())
df.withColumn("rnk", F.rank().over(w)) .withColumn("dense_rnk", F.dense_rank().over(w))
LAG and LEAD
-- T-SQL
SELECT
order_id,
customer_id,
order_date,
total_amount,
LAG(total_amount, 1, 0) OVER (PARTITION BY customer_id ORDER BY order_date) AS prev_order_amount,
LEAD(order_date, 1) OVER (PARTITION BY customer_id ORDER BY order_date) AS next_order_date
FROM orders;
# PySpark
w = Window.partitionBy("customer_id").orderBy("order_date")
df.withColumn("prev_order_amount", F.lag("total_amount", 1, 0).over(w)) .withColumn("next_order_date", F.lead("order_date", 1).over(w))
The argument order: F.lag(column, offset, default). Offset defaults to 1. Default value (when there's no previous row) defaults to None/null.
Running Totals and Cumulative Aggregations
-- T-SQL
SELECT
order_date,
daily_revenue,
SUM(daily_revenue) OVER (ORDER BY order_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_revenue
FROM daily_sales;
# PySpark
w = Window.orderBy("order_date") .rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn("cumulative_revenue", F.sum("daily_revenue").over(w))
For a 7-day rolling average:
w_7day = Window.orderBy("order_date") .rowsBetween(-6, Window.currentRow) # current row + 6 preceding
df.withColumn("rolling_7day_avg", F.avg("daily_revenue").over(w_7day))
NTILE
-- T-SQL
SELECT customer_id, total_revenue,
NTILE(4) OVER (ORDER BY total_revenue DESC) AS revenue_quartile
FROM customers;
# PySpark
w = Window.orderBy(F.col("total_revenue").desc())
df.withColumn("revenue_quartile", F.ntile(4).over(w))
First Value / Last Value
-- T-SQL
SELECT
customer_id,
order_date,
FIRST_VALUE(order_date) OVER (PARTITION BY customer_id ORDER BY order_date) AS first_order_date
FROM orders;
# PySpark
w = Window.partitionBy("customer_id").orderBy("order_date")
df.withColumn("first_order_date", F.first("order_date").over(w))
# LAST_VALUE with ROWS BETWEEN (important for correct results)
w_last = Window.partitionBy("customer_id") .orderBy("order_date") .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df.withColumn("last_order_date", F.last("order_date").over(w_last))
The ROWS BETWEEN clause on LAST_VALUE is the same gotcha as in SQL Server — without it, the default window frame for ordered windows is rows preceding the current row, which gives you the current row's value, not the partition's last value.
Performance Note
Window functions in Spark trigger a shuffle — all rows in the same partition key need to land on the same executor for the ordering and frame calculation to work. Large partition keys (low cardinality, like a 2-value "status" column) can create shuffle skew. High cardinality partition keys (like customer_id on 10M customers) are usually fine — the data distributes naturally across many partition groups.
If you're joining the result of a window function back to the original table, consider whether you can chain the window operation instead of creating an intermediate DataFrame and re-joining. Every unnecessary shuffle costs you.