Databricks vs. Snowflake: The Lock-In Question Nobody Is Asking

Last month I wrote about Snowflake's architecture — what's genuinely good about it and what warrants scrutiny. I promised a direct comparison with Databricks. Here it is. I'm going to be specific about where the lock-in surfaces are, because this is a decision with a ten-year tail and the sales conversation at both companies will not surface these tradeoffs for you.

The Lock-In Surface Map

Lock-in is not a binary property — it's a spectrum with different surfaces at different layers. Let's be precise about each platform.

Snowflake lock-in surfaces:

  • Data format. Native Snowflake tables are stored in a proprietary micro-partition format that nothing except Snowflake can read. (Snowflake also offers external tables and Apache Iceberg tables, but native tables are what most warehouses are built on.) Leaving requires a full export via COPY INTO to S3, Azure Blob, or GCS, then a load into the next system. For a large warehouse, this is a multi-day project.
  • Query engine. Everything runs through Snowflake's execution engine. You cannot bring your own engine: there is no way to point Spark or other custom compute directly at native tables. Snowpark adds Python, Java, and Scala APIs and UDFs, but that code still executes inside Snowflake and is billed in credits. If a capability doesn't exist there, your workaround involves extracting data out.
  • Pricing model. Credits are the unit of cost, and the credit-to-dollar ratio is set by Snowflake. It has moved over time. There is no self-hosted option. If the price increases, your alternatives are renegotiate, leave (see: data format lock-in), or pay.

Databricks lock-in surfaces:

  • DBU rate. Databricks charges per Databricks Unit (DBU), a compute unit on top of the underlying cloud instance cost. The DBU rate is set by Databricks. This is a pricing dependency, not a data dependency.
  • Runtime optimizations. The Databricks Runtime includes proprietary optimizations over open-source Spark — better I/O, improved shuffle, query caching. If you leave Databricks, you lose these and revert to stock Spark performance.
  • Notebook environment and job scheduling UI. Databricks Notebooks and the Jobs UI are proprietary. If you leave, you need to migrate your notebook-based workflows to Jupyter or another environment.
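To see why the DBU rate is a pricing dependency rather than a data dependency, it helps to split the bill into its two parts. The sketch below does that arithmetic. Every number in it (node count, instance price, DBUs per node-hour, dollars per DBU) is a made-up placeholder, not a published Databricks or cloud rate; substitute figures from your own contract.

```python
def hourly_cluster_cost(nodes, instance_usd_per_hour, dbu_per_node_hour, usd_per_dbu):
    """Split an hourly cluster bill into the cloud part and the Databricks part."""
    cloud_usd = nodes * instance_usd_per_hour          # paid to the cloud provider
    dbu_usd = nodes * dbu_per_node_hour * usd_per_dbu  # paid to Databricks
    return {"cloud_usd": cloud_usd, "dbu_usd": dbu_usd, "total_usd": cloud_usd + dbu_usd}


# Hypothetical figures, chosen only to keep the arithmetic easy to follow.
baseline = hourly_cluster_cost(
    nodes=10, instance_usd_per_hour=1.0, dbu_per_node_hour=2.0, usd_per_dbu=0.5
)

# Same cluster after a hypothetical 25% increase in the DBU rate.
after_increase = hourly_cluster_cost(
    nodes=10, instance_usd_per_hour=1.0, dbu_per_node_hour=2.0, usd_per_dbu=0.625
)

print(baseline)
print(after_increase)
```

Only the dbu_usd term moves when Databricks changes its rate. The cloud_usd term is what you would still pay running stock Spark on the same instances, which is the comparison to make when you weigh the rate against the runtime optimizations you would give up.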

What Databricks does NOT lock you into:

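  • Your data is in Parquet on S3 (or ADLS, or GCS), typically organized as Delta Lake tables: Parquet data files plus an open-source transaction log. Open format, readable by any engine with a Delta Lake reader.
  • Your processing code is PySpark, Scala Spark, or SparkSQL, all open-source Spark APIs. It runs on stock EMR or any other Spark environment with little or no modification, as long as it avoids Databricks-only helpers such as dbutils (and minus the proprietary runtime optimizations).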
  • Your data is not inside Databricks. It lives in your cloud storage account. Databricks is a compute layer on top of your storage, not a storage system.

The Escape Hatch Comparison

Here's the concrete test: if you need to leave each platform in six months, what does that look like?

Leaving Snowflake: export all tables to S3 via COPY INTO (costs credits, takes time proportional to data volume), load into the new system (Redshift, BigQuery, Databricks), rewrite or migrate all SQL queries to the target dialect, retrain analysts on the new tool. The data itself moves; the format changes; the queries may need adjustment.
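The export step is mechanical enough to script. As a rough sketch, the function below generates one unload statement per table. The stage name @export_stage and the table names are hypothetical, and the statement shape follows my reading of Snowflake's COPY INTO <location> syntax; check it against the current documentation before running anything against a real account.

```python
def unload_statements(tables, stage="@export_stage"):
    """Build one COPY INTO <location> statement per table, unloading to Parquet.

    `stage` is a hypothetical external stage that already points at your bucket.
    """
    statements = []
    for table in tables:
        # Give each table its own prefix under the stage: DB.SCHEMA.T -> db/schema/t/
        prefix = table.lower().replace(".", "/")
        statements.append(
            f"COPY INTO {stage}/{prefix}/ FROM {table} "
            "FILE_FORMAT = (TYPE = PARQUET) HEADER = TRUE;"
        )
    return statements


# Hypothetical table names, for illustration only.
for statement in unload_statements(["ANALYTICS.PUBLIC.ORDERS", "ANALYTICS.PUBLIC.CUSTOMERS"]):
    print(statement)
```

Generating the statements is the easy part. Each one still burns credits while it runs, and the SQL rewrite and analyst retraining that follow are where most of the calendar time goes.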

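Leaving Databricks: point the new Spark environment (stock EMR, another cloud's managed Spark) at the same S3 bucket. Your Parquet files are already there. If they are organized as Delta Lake tables, add the open-source Delta Lake library so the new environment reads the transaction log instead of the raw files. PySpark code that sticks to open-source Spark APIs runs without modification on stock Spark; calls to Databricks-only helpers such as dbutils need replacing. You lose the Databricks Runtime performance advantages and the notebook/job UI. Your data and your code are intact and portable.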
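"Runs on stock Spark" only holds for code that stays on open-source APIs, so it is worth checking before you count on it. Here is a minimal sketch of such a check: it scans source text for a few Databricks-specific markers. The marker list is my own assumption and is deliberately incomplete; treat a clean scan as a starting point, not proof of portability.

```python
import re

# Assumed, non-exhaustive markers of Databricks-only functionality.
DATABRICKS_MARKERS = {
    "dbutils": re.compile(r"\bdbutils\."),
    "display()": re.compile(r"(?<![\w.])display\("),
    "notebook magic": re.compile(r"^\s*#\s*MAGIC\b"),
    "spark.databricks conf": re.compile(r"spark\.databricks\."),
}


def find_databricks_dependencies(source):
    """Return (line_number, marker_name) for each Databricks-specific hit."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for name, pattern in DATABRICKS_MARKERS.items():
            if pattern.search(line):
                hits.append((lineno, name))
    return hits


# Hypothetical job source, for illustration only.
sample = '''df = spark.read.parquet("s3://bucket/events/")
files = dbutils.fs.ls("/mnt/raw")
display(df)
df.groupBy("day").count().write.parquet("s3://bucket/out/")
'''

for lineno, name in find_databricks_dependencies(sample):
    print(f"line {lineno}: {name}")
```

In the sample, the two lines that read and write Parquet are plain Spark and would move as-is; the dbutils and display calls are the ones that would need a replacement on another platform.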

The Python Problem Snowflake Can't Solve

Snowflake is SQL first. If your organization's data work is purely SQL analytics, that's fine; Snowflake is excellent at SQL analytics. But data science teams run Python, ML feature engineering often runs on Spark, and real-time feature pipelines often run on Kafka and Flink. Snowpark brings some Python into Snowflake, but most of that stack still lives outside it, which means your architecture has a perpetual extract-transform-load cycle between Snowflake and the Python world.

Databricks runs SQL, Python, Scala, and R on the same cluster against the same data. Your analytics engineer and your ML engineer work in the same environment, against the same Parquet files, with the same cluster. No extract step. That's a compounding architectural advantage as both teams grow.

The Honest Caveat

I prefer Databricks. I'll say that plainly. The open-format foundation and the unified language model align better with how I build data systems. But Snowflake is the right call for teams that are purely SQL-analytical, have a large volume of concurrent BI queries from separate departments, and are not planning to run Python workloads on the same data. Know your workload before you pick the tool.

If you're making this decision now, I'm happy to work through your specific requirements. As always, I'm here to help.
