PySpark DataFrames for the T-SQL Developer: Your First Week's Vocabulary

The operations you've been running in T-SQL for years exist in PySpark. They just look different. This post is the translation table I wish I'd had during my first week with Databricks — every fundamental T-SQL operation mapped to its PySpark equivalent, with actual examples you can run.

Setup: Imports You Need

from pyspark.sql import functions as F
from pyspark.sql.window import Window
from pyspark.sql.types import *

# In Databricks, spark (SparkSession) is already available
# df is your DataFrame -- created by reading data as shown in previous posts

SELECT

-- T-SQL
SELECT customer_id, name, region FROM customers;

# PySpark
df.select("customer_id", "name", "region")

# With column expressions
df.select(
    F.col("customer_id"),
    F.col("name"),
    F.upper(F.col("region")).alias("region_upper")
)

WHERE / Filter

-- T-SQL
SELECT * FROM customers WHERE region = 'West' AND created_date >= '2020-01-01';

# PySpark
df.filter(
    (F.col("region") == "West") &
    (F.col("created_date") >= "2020-01-01")
)

# .where() is an alias for .filter() -- same thing
df.where(F.col("region") == "West")

GROUP BY / Aggregation

-- T-SQL
SELECT region, COUNT(*) AS customer_count, SUM(total_revenue) AS revenue
FROM customers
GROUP BY region;

# PySpark
df.groupBy("region").agg(
    F.count("*").alias("customer_count"),
    F.sum("total_revenue").alias("revenue")
)

ORDER BY

-- T-SQL
SELECT * FROM customers ORDER BY created_date DESC;

# PySpark
df.orderBy(F.col("created_date").desc())

# Multiple columns
df.orderBy(
    F.col("region").asc(),
    F.col("created_date").desc()
)

JOIN

-- T-SQL
SELECT c.name, o.order_date, o.total_amount
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id;

# PySpark
customers.join(orders, on="customer_id", how="inner")          .select("name", "order_date", "total_amount")

# Join types: inner, left, right, outer (full), left_semi, left_anti
# left_semi = SQL EXISTS
# left_anti = SQL NOT EXISTS / NOT IN

# When column names differ
customers.join(orders,
    customers["customer_id"] == orders["cust_id"],
    how="left"
)

Adding / Deriving Columns

-- T-SQL
SELECT *, UPPER(name) AS name_upper, YEAR(created_date) AS created_year FROM customers;

# PySpark
df.withColumn("name_upper", F.upper(F.col("name")))   .withColumn("created_year", F.year(F.col("created_date")))

DISTINCT

-- T-SQL
SELECT DISTINCT region FROM customers;

# PySpark
df.select("region").distinct()

UNION ALL / UNION

-- T-SQL
SELECT customer_id FROM customers_2019
UNION ALL
SELECT customer_id FROM customers_2020;

# PySpark (always UNION ALL — no automatic deduplication)
customers_2019.union(customers_2020)

# For UNION (deduplicated):
customers_2019.union(customers_2020).distinct()

TOP / LIMIT

-- T-SQL
SELECT TOP 10 * FROM customers ORDER BY created_date DESC;

# PySpark
df.orderBy(F.col("created_date").desc()).limit(10)

CASE WHEN

-- T-SQL
SELECT name,
    CASE WHEN total_revenue > 100000 THEN 'Enterprise'
         WHEN total_revenue > 10000  THEN 'Mid-Market'
         ELSE 'SMB'
    END AS segment
FROM customers;

# PySpark
df.withColumn("segment",
    F.when(F.col("total_revenue") > 100000, "Enterprise")
     .when(F.col("total_revenue") > 10000, "Mid-Market")
     .otherwise("SMB")
)

NULL Handling

-- T-SQL
SELECT ISNULL(phone, 'unknown') AS phone FROM customers;
SELECT * FROM customers WHERE email IS NOT NULL;

# PySpark
df.withColumn("phone", F.coalesce(F.col("phone"), F.lit("unknown")))
df.filter(F.col("email").isNotNull())
df.filter(F.col("email").isNull())

The Key Mental Shift

T-SQL is declarative — you describe what you want and the engine plans the execution. PySpark DataFrames are also declarative, and Spark's Catalyst optimizer plays the same role as SQL Server's query optimizer. The difference is in how you chain operations: PySpark returns a new DataFrame from each transformation, so you build up a pipeline of operations before any execution happens.

Everything above is a transformation. Nothing runs until you call an action: .show(), .count(), .write(), or .collect(). That's the lazy evaluation model from the previous post, applied to everything you just learned.

For anything that maps to T-SQL and isn't in this list, check pyspark.sql.functions — it has over 300 built-in functions covering date math, string manipulation, array operations, JSON parsing, and more. If it's in T-SQL, there's almost certainly a Spark equivalent.