2019 Data Engineering Retrospective: The Year Delta Lake Went Open Source

Data engineering in 2019 had one defining event: Delta Lake going open source, with the code released in April and the project handed to the Linux Foundation in October. Everything else was incremental improvement. It's worth taking a few minutes to document what actually happened this year before the context fades.

The Delta Lake Open Source Release

This is the story of 2019. Delta Lake was open-sourced under the Apache 2.0 license in April, and in October it became a Linux Foundation project, moving governance of the format out of Databricks' hands. The practical impact hasn't fully materialized yet (tool support outside Databricks is still limited), but the strategic shift is real. Delta is now a community format, not a vendor feature.

For teams that were hesitant to commit to Delta because of lock-in concerns, that argument is weaker now. For teams already using it, nothing changed operationally. For tool vendors, there's now an open spec to implement against. The ripple effects will be visible in 2020 and beyond.

What Changed in My Pipelines This Year

I standardized on the bronze-silver-gold architecture across all client projects. Before 2019, each project had its own data lake layout that made sense in isolation but was confusing to anyone new to the project. The medallion pattern is opinionated enough to be consistent and flexible enough to work across different domain models.
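To make the layout idea concrete, here is a minimal sketch of a path helper that enforces the three layers. The root path, the domain and table names, and the `table_path` function are all hypothetical; the actual layouts on my projects differ in the details.

```python
from pathlib import PurePosixPath

# Layer names follow the bronze-silver-gold convention.
LAYERS = ("bronze", "silver", "gold")


def table_path(root: str, layer: str, domain: str, table: str) -> str:
    """Build a storage path of the form <root>/<layer>/<domain>/<table>.

    Hypothetical helper: rejects any layer outside the medallion set so
    every project ends up with the same top-level layout.
    """
    if layer not in LAYERS:
        raise ValueError(f"unknown layer {layer!r}; expected one of {LAYERS}")
    return str(PurePosixPath(root) / layer / domain / table)


# Example with made-up names: raw orders land in bronze.
print(table_path("/mnt/lake", "bronze", "sales", "orders_raw"))
```

Routing every read and write through one helper like this is what keeps the layout consistent: a new project cannot quietly invent a fourth layer.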

I added MLflow tracking to every production pipeline. The month-over-month row count trending and job duration tracking caught two data quality issues before they became client-facing problems. That alone justified the effort.
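The kind of check that row count trending enables can be sketched in a few lines. This is an illustration, not the code I run: the `row_count_drift` function, the 30% tolerance, and the sample counts are assumptions, and in a real pipeline the history would come from metrics already logged to the tracking server (for example via `mlflow.log_metric`).

```python
from statistics import mean


def row_count_drift(history, current, tolerance=0.3):
    """Compare the current row count against the mean of past counts.

    Returns (relative_deviation, flagged). The tolerance is an
    illustrative default, not a recommendation.
    """
    if not history:
        raise ValueError("history must contain at least one past row count")
    baseline = mean(history)
    if baseline == 0:
        raise ValueError("baseline row count is zero; deviation is undefined")
    deviation = (current - baseline) / baseline
    return deviation, abs(deviation) > tolerance


# Made-up monthly counts: the latest load is far below the usual volume.
deviation, flagged = row_count_drift([1000, 1100, 900], 400)
print(f"deviation={deviation:.0%} flagged={flagged}")
```

The same shape works for job duration: swap row counts for run times in seconds and alert on the flagged result.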

I stopped running pipelines on interactive clusters. All production pipelines now run on job clusters via the Databricks Jobs scheduler. The cloud bill went down; the reliability went up. Interactive and production workloads should have been kept separate all along.
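For reference, the difference shows up in the job definition itself: a job that asks for a `new_cluster` gets a fresh cluster for the run, while one that points at `existing_cluster_id` reuses an interactive cluster. Below is a rough sketch of building such a definition for the Jobs API. The helper name and every value passed to it are placeholders, not settings from a real workspace.

```python
import json


def job_cluster_payload(name, notebook_path, spark_version, node_type_id, num_workers):
    """Build a job definition that runs a notebook on a dedicated job cluster.

    Hypothetical helper. The cluster is described under "new_cluster",
    so it is created for the run instead of reusing an interactive one.
    """
    return {
        "name": name,
        "new_cluster": {
            "spark_version": spark_version,
            "node_type_id": node_type_id,
            "num_workers": num_workers,
        },
        "notebook_task": {"notebook_path": notebook_path},
    }


# Placeholder values only; substitute the runtime and node type you use.
payload = job_cluster_payload(
    name="nightly-orders-load",
    notebook_path="/Pipelines/orders/load",
    spark_version="<runtime-version>",
    node_type_id="<node-type>",
    num_workers=2,
)
print(json.dumps(payload, indent=2))
```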

What I'm Watching Going Into 2020

Spark 3.0 is on the horizon. The preview documentation shows Adaptive Query Execution (AQE) — a feature that lets Spark adjust query plans at runtime based on actual data statistics rather than estimates. For complex queries with unpredictable data distribution, this could be a significant performance improvement. I'm tracking the release date.
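If AQE lands as documented, trying it should be a configuration change, not a code change. A small sketch, assuming the `spark.sql.adaptive.enabled` setting is the switch and that the job is launched with `spark-submit`; the `to_submit_args` helper is hypothetical.

```python
def to_submit_args(conf):
    """Render a dict of Spark settings as spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args.extend(["--conf", f"{key}={value}"])
    return args


# Assumption: this single flag turns adaptive execution on.
aqe_conf = {"spark.sql.adaptive.enabled": "true"}

print(" ".join(to_submit_args(aqe_conf)))
```

Keeping the settings in a plain dict makes it easy to run the same job with and without the flag and compare durations.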

The ecosystem around Delta is growing faster than expected. There are credible integrations in early development for Presto and Hive to read Delta format directly. If those ship in 2020, the "Delta or Parquet" question becomes much simpler to answer.

2020 is also the year I expect Databricks to get more serious about SQL-first tooling. The gap between "data engineer working in notebooks" and "analyst writing SQL in a query tool" is real, and it's a friction point on every project. Something is going to address that. I just don't know exactly what yet. As always, I'm here to help.
