Don't Fight Your Platform: The Right Way to Combine Databricks and Self-Managed Jobs
The Databricks-skeptic move I see most often is this: run some jobs on Databricks, keep "control" over others with self-managed Spark on EMR or on-prem, and let both read from the same S3 data lake. The reasoning is that you get Databricks' managed runtime for the jobs that benefit from it, and you retain "flexibility" for the rest.
In theory, this is a good idea. Open formats enable it. In practice, the hybrid architecture creates conflict that costs more than either pure approach would.
This post is not an argument to go all-in on Databricks for ideological reasons. It's a systems design argument: if you mix processing engines on shared data without explicit contracts, you get the failure modes of both without the benefits of either.
The Concurrent Writer Problem
Parquet on S3 has no native write coordination mechanism. There is no lock, no transaction log, no protocol for "this partition is being written to right now, stand by." If a Databricks job and an EMR job both decide to write to the same S3 prefix in the same time window, you can end up with a partially overwritten partition — some files from the Databricks job, some from the EMR job, in whatever interleaved order they completed.
The result is not a write error. It's corrupted output that reads successfully as a valid Parquet partition with wrong data. No alarm fires. The downstream job reads it and produces subtly wrong results.
# Databricks job writes to s3://data-lake/sessions/event_date=2017-07-01/
# simultaneously, an EMR job re-processes the same partition for a backfill
# Both execute:
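# (in Spark 2.2 and earlier, mode("overwrite") with partitionBy on the table root replaces every partition, so the jobs target the partition path)
df.drop("event_date").write.mode("overwrite").parquet("s3://data-lake/sessions/event_date=2017-07-01/")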
# S3 "overwrite" is not atomic — it's a series of PUTs and DELETEs
# If both jobs run simultaneously, the partition will contain a mix of
# files from both jobs in whatever order S3 accepted the writes
Schema Drift Between Systems
When two processing environments read and write the same data, schema change management becomes a coordination problem across both environments. If the Databricks team adds a column to an output table, the EMR jobs that read from that table need to be updated. If the EMR jobs add a column, the Databricks notebooks need to be updated. Without a single authoritative schema definition, these two systems drift apart in their understanding of the data's shape — and the drift surfaces as silent wrong output or an obscure exception at 3 a.m.
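One way to catch drift before it reaches production is to have each job compare the schema it actually sees against the schema it expects, and fail loudly on any difference. The following is a minimal sketch, not a prescribed implementation: the `Schema` alias and the `schema_drift` function are hypothetical names, and a schema is assumed to be a plain mapping of column name to type string.

```python
from typing import Dict, List

# Hypothetical representation: column name -> type string, e.g. {"user_id": "bigint"}.
Schema = Dict[str, str]


def schema_drift(expected: Schema, actual: Schema) -> List[str]:
    """Return a sorted, human-readable list of differences between two schemas."""
    problems = []
    # Columns the reader expects but the data no longer has.
    for col in sorted(expected.keys() - actual.keys()):
        problems.append(f"missing column: {col}")
    # Columns the writer added without telling the reader.
    for col in sorted(actual.keys() - expected.keys()):
        problems.append(f"unexpected column: {col}")
    # Columns present on both sides whose types disagree.
    for col in sorted(expected.keys() & actual.keys()):
        if expected[col] != actual[col]:
            problems.append(
                f"type mismatch on {col}: expected {expected[col]}, got {actual[col]}"
            )
    return problems
```

In a PySpark job, `dict(df.dtypes)` yields a mapping of this shape, so a job could call `schema_drift(expected, dict(df.dtypes))` right after reading and abort if the list is non-empty.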
Runtime Version Skew
Databricks Runtime and the Spark that ships with an EMR release are different builds of Spark, often at different versions, with different behavior around edge cases in JSON parsing, Parquet schema evolution, and timestamp handling. A Parquet file written by Databricks Runtime 3.0 (Spark 2.2) may be read in subtly different ways by Spark 2.1 on EMR. These differences are hard to reproduce, hard to diagnose, and tend to surface in production on data that didn't appear in your test cases.
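A cheap guard against this kind of skew is a startup check in every job that compares the running Spark version to a pin shared by both environments. The sketch below is an assumption about how you might do it: `PINNED_SPARK`, `major_minor`, and `check_spark_version` are hypothetical names, only major.minor is compared, and in a real job the running version would come from `spark.version`.

```python
from typing import Tuple

# Hypothetical pin shared by both environments (for example, read from a common config file).
PINNED_SPARK = "2.2"


def major_minor(version: str) -> Tuple[int, int]:
    """Parse '2.2.0' or '2.2' into (2, 2); anything after the minor version is ignored."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def check_spark_version(running: str, pinned: str = PINNED_SPARK) -> None:
    """Raise if the running Spark major.minor differs from the pinned one."""
    if major_minor(running) != major_minor(pinned):
        raise RuntimeError(
            f"Spark {running} does not match pinned {pinned}; "
            "refusing to touch shared tables"
        )
```

A matching version number does not prove that two vendor builds behave identically, so this check complements an explicit read/write round-trip test and does not replace one.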
How to Mix If You Must
If you have genuine business reasons to run multiple processing environments — compliance requirements that mandate on-premises processing for certain data, specific hardware needs not available in Databricks clusters — the hybrid architecture is workable, but only with explicit contracts in place:
- Partition ownership. Each processing system owns specific partitions or tables. Databricks writes the sessions table; EMR writes the raw ingestion table. No system writes to a partition owned by another. Define and document this ownership explicitly.
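- Write coordination via sentinel files. Before writing to a shared partition, check for an in-progress sentinel file. Write the sentinel at job start, delete it at completion. The other system checks for the sentinel before writing to the same prefix. Treat this as advisory rather than a true lock: the check and the sentinel write are separate S3 operations, so two jobs can both pass the check before either writes its sentinel, and a job that crashes leaves a stale sentinel behind.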
- Aligned Spark versions. Pin both environments to compatible Spark versions and test Parquet read/write compatibility explicitly before deploying.
- Shared schema registry. Use the AWS Glue Data Catalog or a shared Hive Metastore as the single authoritative schema definition. Both systems read schemas from the same source.
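The first two contracts above can be sketched in a few lines. This is an illustration under stated assumptions, not a production implementation: the `OWNERS` map, the `s3://data-lake/raw/` prefix, the `_IN_PROGRESS` marker name, and all function names are hypothetical, and a plain dict stands in for the object store so the flow is easy to follow.

```python
from typing import Callable, Dict, Optional

# Hypothetical ownership map: table prefix -> the one system allowed to write it.
OWNERS: Dict[str, str] = {
    "s3://data-lake/sessions/": "databricks",
    "s3://data-lake/raw/": "emr",
}


def owner_of(path: str) -> Optional[str]:
    """Return the owning system for a path, or None if no owner is declared."""
    for prefix, system in OWNERS.items():
        if path.startswith(prefix):
            return system
    return None


def assert_can_write(system: str, path: str) -> None:
    """Refuse writes to any path the calling system does not own."""
    owner = owner_of(path)
    if owner != system:
        raise PermissionError(f"{system} may not write {path}; owner is {owner}")


def write_with_sentinel(
    store: Dict[str, str], system: str, path: str, write_fn: Callable[[], None]
) -> None:
    """Run write_fn while an in-progress marker is held for the path.

    `store` is an in-memory stand-in for the object store. Against a real
    store, this check followed by a separate write is not atomic, so the
    marker is advisory only.
    """
    assert_can_write(system, path)
    marker = path + "_IN_PROGRESS"
    holder = store.get(marker)
    if holder is not None:
        raise RuntimeError(f"{path} is already being written by {holder}")
    store[marker] = system  # announce the write
    try:
        write_fn()
    finally:
        store.pop(marker, None)  # always clear the marker, even on failure
```

A job would wrap its actual write in `write_fn`, for example `write_with_sentinel(store, "databricks", partition_path, lambda: df.write.mode("overwrite").parquet(partition_path))`. Clearing the marker in `finally` handles ordinary exceptions, but a killed process still leaves a stale marker, so some expiry policy is needed in practice.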
The Cleaner Answer
The cleaner answer is to pick one primary processing platform for your data lake and be deliberate about why you're using the other. Databricks as the primary, with EMR for specific workloads that have a documented technical reason to run there, is a reasonable architecture. Databricks and EMR both writing to the same tables because "we have flexibility" is not.
If you're running a hybrid architecture and hitting the conflict patterns described here, I'd like to hear what you've found for mitigation. As always, I'm here to help.