You've been writing GROUP BY aggregations in T-SQL for years. PySpark has all the same aggregate functions — and a few you probably wished SQL Server had. Here's the translation, plus a few patterns that are unique to the distributed world.
The Basic Pattern
from pyspark.sql import functions as F
# T-SQL: SELECT region, COUNT(*), SUM(revenue), AVG(revenue), MIN(revenue), MAX(revenue) FROM orders GROUP BY region
df.groupBy("region").agg(
F.count("*").alias("order_count"),
F.sum("revenue").alias("total_revenue"),
F.avg("revenue").alias("avg_revenue"),
F.min("revenue").alias("min_revenue"),
F.max("revenue").alias("max_revenue")
)
COUNT DISTINCT
-- T-SQL
SELECT region, COUNT(DISTINCT customer_id) AS unique_customers FROM orders GROUP BY region;
# PySpark
df.groupBy("region").agg(
F.countDistinct("customer_id").alias("unique_customers")
)
# approx_count_distinct() is much faster for large datasets
# Uses HyperLogLog -- accurate to within ~5%
df.groupBy("region").agg(
F.approx_count_distinct("customer_id", rsd=0.05).alias("approx_unique_customers")
)
For analytics use cases where an exact count isn't required, approx_count_distinct avoids an expensive deduplication shuffle. For a 1B-row table, the performance difference can be an order of magnitude.
Standard Deviation and Variance
-- T-SQL
SELECT product_id, STDEV(price) AS price_stdev, VAR(price) AS price_var FROM sales GROUP BY product_id;
# PySpark (sample stddev/variance, equivalent to STDEV/VAR in SQL Server)
df.groupBy("product_id").agg(
F.stddev("price").alias("price_stdev"),
F.variance("price").alias("price_var")
)
# Population versions (equivalent to STDEVP/VARP)
df.groupBy("product_id").agg(
F.stddev_pop("price"),
F.var_pop("price")
)
First and Last
-- T-SQL (SQL Server requires window functions for this -- no FIRST() aggregate)
# PySpark has first() and last() as aggregate functions
df.groupBy("customer_id").agg(
F.first("order_date").alias("first_order"), # first seen in partition order
F.last("order_date").alias("last_order")
)
# Be careful: without orderBy before groupBy, first/last are non-deterministic
# For deterministic first/last, use window functions instead
Collecting Values Into a List
This one T-SQL doesn't have natively (requires STRING_AGG or XML tricks):
# Collect all values for a key into an array
df.groupBy("customer_id").agg(
F.collect_list("product_name").alias("all_products"), # preserves duplicates
F.collect_set("product_name").alias("unique_products") # deduplicated
)
The result column is an ArrayType column. You can then explode it back out, slice it, or use array functions on it. Useful for building recommendation feature vectors, for pivoting product lists, or for audit-trail aggregation.
Pivot (like SQL Server PIVOT)
-- T-SQL PIVOT
SELECT customer_id, [Electronics], [Apparel], [Grocery]
FROM orders
PIVOT (SUM(revenue) FOR category IN ([Electronics], [Apparel], [Grocery])) AS pvt;
# PySpark
df.groupBy("customer_id") .pivot("category", ["Electronics", "Apparel", "Grocery"]) .agg(F.sum("revenue"))
Omit the values list and PySpark scans the distinct values automatically — but this requires a full pass over the data to discover the pivot values before pivoting, which triggers an extra job. For known categories, always specify them explicitly.
HAVING Clause
-- T-SQL
SELECT customer_id, COUNT(*) AS order_count FROM orders GROUP BY customer_id HAVING COUNT(*) > 10;
# PySpark: groupBy/agg first, then filter on the result
df.groupBy("customer_id") .agg(F.count("*").alias("order_count")) .filter(F.col("order_count") > 10)
There's no having() method in PySpark — you filter the aggregated result. Catalyst recognizes this pattern and optimizes it the same way SQL Server would optimize a HAVING clause.
ROLLUP and CUBE
-- T-SQL
SELECT region, category, SUM(revenue) FROM sales GROUP BY ROLLUP(region, category);
# PySpark (Spark SQL only -- no DataFrame API equivalent in 2020)
spark.sql("""
SELECT region, category, SUM(revenue) AS revenue
FROM sales
GROUP BY region, category WITH ROLLUP
""")
# CUBE
spark.sql("""
SELECT region, category, SUM(revenue) AS revenue
FROM sales
GROUP BY region, category WITH CUBE
""")
ROLLUP and CUBE produce null values for the subtotal rows — same as SQL Server. The GROUPING() function works in Spark SQL to distinguish null-as-subtotal from null-as-data.