I started seriously learning Databricks in 2019 when Microsoft deprioritized Azure Data Lake Analytics. It's now mid-2022. Three years of production use, multiple clients, real production incidents, real performance tuning wins, and a few decisions I'd make differently if I were starting over. Here's the honest accounting.
What Carried Over Directly
The relational thinking. Every data problem I frame in terms of sets, joins, and aggregations. PySpark DataFrames are set-based. SQL on Delta is set-based. Nothing about the mental model changed — only the syntax and the execution layer changed. The ability to decompose a complex data transformation into a series of relational operations is the most transferable skill from SQL Server to Spark.
Window functions. ROW_NUMBER, RANK, LEAD, LAG — they're all there, with the same semantics. The syntax is slightly different (F.row_number().over(Window.partitionBy("x").orderBy("y")) vs. ROW_NUMBER() OVER (PARTITION BY x ORDER BY y)), but the logic is identical. Anyone who knows T-SQL window functions is 80% of the way to PySpark window functions on day one.
Incremental load patterns. The MERGE pattern (upsert new rows, update changed rows, optionally delete removed rows) is the same in Delta Lake as it is in SQL Server. The FULL pattern (truncate-and-reload), the APPEND pattern (add new rows), the SCD Type 2 pattern — all of these translate directly. I reuse the same mental framework I built for SQL Server ETL.
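As a rough sketch of the upsert half of that pattern, the helper below builds a Delta Lake MERGE statement from a list of key columns. The function name, table names, and key column are hypothetical; the statement itself uses Delta's UPDATE SET * and INSERT * shorthand, which assumes source and target share the same columns.

```python
from typing import Sequence


def build_merge_sql(target: str, source: str, keys: Sequence[str]) -> str:
    """Build a Delta Lake MERGE statement that upserts `source` into `target`."""
    if not keys:
        raise ValueError("at least one key column is required")
    # Match rows on every key column.
    on_clause = " AND ".join(f"t.{k} = s.{k}" for k in keys)
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON {on_clause}\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


# Hypothetical table and key names.
merge_sql = build_merge_sql("dim_customer", "staged_customer", ["customer_id"])
print(merge_sql)
# On a cluster with Delta Lake available, this would run as: spark.sql(merge_sql)
```

This version updates every matched row. Restricting the update to rows that actually changed would need an extra condition on the WHEN MATCHED clause, and deletes are left out of the sketch entirely.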
Index and query optimization intuition. The instinct to ask "is this predicate being pushed down to the scan?" translates directly. In SQL Server, you asked whether the WHERE clause was doing an index seek or a scan. In Spark, you ask whether the filter appears in PushedFilters in the physical plan and whether Delta's file-level statistics are letting the scan skip irrelevant files. Same concern, different mechanism.
What Required a Different Mental Model
Driver vs. executor awareness. This was the hardest shift. In SQL Server, one machine does all the work, so you never have to ask "is this running on one machine or many?" In Spark, everything you write runs either on the driver (one machine, your code context) or on executors (many machines, distributed). Plain pandas is driver-only. Built-in Spark functions are executor-distributed. Every pandas call in a Databricks notebook is a potential bottleneck, a concept that never existed in SQL Server development.
Lazy evaluation debugging. SQL Server executes each statement immediately. You can step through a stored procedure and inspect results after each operation. In Spark, transformations are lazy: you build up a DAG and then fire an action. When something goes wrong, the stack trace points to the action, not the transformation that produced bad data. Learning to add strategic intermediate .show() calls (and to remove them before production) was a workflow change.
Cluster cost as a first-order concern. In SQL Server, the server runs whether you're using it or not. You think about query performance, but you don't think about whether running a particular query is going to add $50 to the cloud bill. In Databricks, every cluster decision — size, type, termination policy, spot vs. on-demand — has a direct cost implication. This became a first-order design consideration that didn't exist in my SQL Server work.
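Those knobs map onto a handful of fields in a cluster definition. The sketch below is a hypothetical all-purpose cluster spec written as a Python dict, using field names from the Azure Databricks Clusters API. The cluster name, runtime version, VM size, and worker counts are assumptions for illustration, not recommendations.

```python
# Hypothetical all-purpose cluster spec; every value here is illustrative.
cluster_spec = {
    "cluster_name": "etl-dev",                # hypothetical name
    "spark_version": "10.4.x-scala2.12",      # assumed runtime version
    "node_type_id": "Standard_DS3_v2",        # assumed VM size (the "type" decision)
    # Size: let the cluster scale between a floor and a ceiling.
    "autoscale": {"min_workers": 2, "max_workers": 8},
    # Termination policy: shut down after 30 idle minutes.
    "autotermination_minutes": 30,
    # Spot vs. on-demand: keep the first node on-demand, use spot for the rest,
    # and fall back to on-demand if spot capacity is unavailable.
    "azure_attributes": {
        "first_on_demand": 1,
        "availability": "SPOT_WITH_FALLBACK_AZURE",
        "spot_bid_max_price": -1,
    },
}
```

Each of the four decisions from the paragraph above is one or two lines here, which is part of why they are easy to get wrong by default.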
No stored procedures, no IDENTITY columns, no SSMS. These sound small but aren't. Stored procedures in SQL Server are the fundamental unit of reusable logic and a security boundary. In PySpark, Python functions replace stored procedures: more powerful, but less governed by default. Replacing IDENTITY columns requires either monotonically_increasing_id() (which produces unique but non-consecutive IDs) or UUID generation. And SSMS: I still miss a good visual query plan viewer. The Spark UI's physical plan view is fine; it's not SSMS.
The Decisions I'd Make Differently
Use Delta Lake from day one, no exceptions. Early on I kept some staging tables as plain Parquet for "simplicity." Every one of those tables eventually caused a production issue — partial writes after job failures, schema drift, repair table headaches. Delta should have been the default from the start.
Learn partition tuning earlier. I spent months accepting slow jobs that I later discovered were slow because of the default 200 shuffle partitions (spark.sql.shuffle.partitions) or a broadcast join threshold that was too low. Understanding the partition model deeply in the first month would have saved many frustrating debugging sessions.
Set up monitoring before you need it. On SQL Server, I had Database Mail alerts and SQL Server Agent failure notifications configured by default on every instance. I didn't set up Databricks job failure notifications with the same rigor initially. An overnight job failure that I didn't discover until morning was the corrective event. Configure alerts before you go live, not after the first incident.
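For reference, failure notifications are a small part of a job definition. The fragment below is a hypothetical sketch written as a Python dict, using field names from the Databricks Jobs API. The job name, email address, and timeout are placeholders.

```python
# Hypothetical fragment of a Databricks job definition; values are placeholders.
job_settings = {
    "name": "nightly-load",  # hypothetical job name
    "email_notifications": {
        # Who gets the email when a run fails.
        "on_failure": ["data-oncall@example.com"],
        # Skipped runs are not failures; do not alert on them.
        "no_alert_for_skipped_runs": True,
    },
    # Fail the run if it hangs, so the failure notification actually fires.
    "timeout_seconds": 3 * 60 * 60,
}
```

The timeout matters as much as the notification: a job that hangs all night never fails, so it never alerts.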
The Bottom Line
The scale-out analytics vision that SQL Server PDW promised in 2010 — parallel query execution across many nodes, elastic capacity, distributed storage — is real in Databricks in 2022 and accessible without a $400,000 hardware purchase. The learning curve from T-SQL to PySpark/Spark SQL is real but not steep for someone with a strong relational foundation. The things you know carry over. The things you need to learn are learnable.
Three years in, I write more Python than SQL. I still think in SQL. Both are true simultaneously and aren't in conflict.