2019 Data Engineering Retrospective: The Year Delta Lake Went Open Source

Shannon Lowder

27 Dec 2019 — 2 min read

Data engineering in 2019 had one defining event: Delta Lake going open source in October. Everything else was incremental improvement. It's worth taking a few minutes to document what actually happened this year before the context fades.

The Delta Lake Open Source Release

This is the story of 2019. Delta Lake is now open source under the Apache 2.0 license, and in October it became a Linux Foundation project, removing the format's dependency on Databricks as a platform. The practical impact hasn't fully materialized yet, since tool support outside Databricks is still limited, but the strategic shift is real. Delta is now a community format, not a vendor feature.

For teams that were hesitant to commit to Delta because of lock-in concerns, that argument is weaker now. For teams already using it, nothing changed operationally. For tool vendors, there's now an open spec to implement against. The ripple effects will be visible in 2020 and beyond.

What Changed in My Pipelines This Year

I standardized on the bronze-silver-gold architecture across all client projects. Before 2019, each project had its own data lake layout that made sense in isolation but was confusing to anyone new to the project. The medallion pattern is opinionated enough to be consistent and flexible enough to work across different domain models.

I added MLflow tracking to every production pipeline. The month-over-month row count trending and job duration tracking caught two data quality issues before they became client-facing problems. That alone justified the effort.
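
As a rough illustration of the kind of check this enables, here is a minimal sketch in Python. The function names, the 25% tolerance, and the sample numbers are all hypothetical, not what runs in my pipelines. The log_metric argument is a placeholder for any callable taking a name and a value; in a real job you could pass mlflow.log_metric.

```python
from statistics import mean
from typing import Callable, Sequence


def row_count_deviation(history: Sequence[int], current: int) -> float:
    """Fractional change of `current` relative to the mean of earlier runs."""
    baseline = mean(history)
    if baseline == 0:
        # Avoid dividing by zero when earlier runs loaded nothing.
        return 0.0 if current == 0 else float("inf")
    return (current - baseline) / baseline


def check_run(
    history: Sequence[int],
    current: int,
    duration_seconds: float,
    log_metric: Callable[[str, float], None] = lambda name, value: None,
    tolerance: float = 0.25,  # hypothetical threshold, tune per pipeline
) -> bool:
    """Log the run's metrics and return True if the row count looks normal."""
    deviation = row_count_deviation(history, current)
    log_metric("row_count", current)
    log_metric("row_count_deviation", deviation)
    log_metric("duration_seconds", duration_seconds)
    return abs(deviation) <= tolerance


# Made-up row counts from three earlier monthly loads.
previous_loads = [1000, 1040, 980]
normal_run_ok = check_run(previous_loads, 1010, duration_seconds=42.0)
short_run_ok = check_run(previous_loads, 500, duration_seconds=40.0)
```

With these made-up numbers, the 1,010-row load passes and the 500-row load is flagged, because it sits about 50% below the average of the earlier runs.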

I stopped running pipelines on interactive clusters. All production pipelines now run on job clusters via the Databricks Jobs scheduler. The cloud bill went down; the reliability went up. Interactive work and production workloads should have been on separate clusters all along.
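
For anyone who hasn't made the switch, the difference shows up in the job definition: a job that declares new_cluster gets a fresh cluster for each run, while one that sets existing_cluster_id attaches to a long-running interactive cluster. Below is a minimal sketch of a Jobs API create payload built in Python. The job name, notebook path, runtime version, node type, and schedule are placeholder values, not taken from a real project.

```python
import json


def job_cluster_payload(name, notebook_path, spark_version, node_type_id, num_workers):
    """Build a Jobs API create payload that runs on a dedicated job cluster."""
    return {
        "name": name,
        # `new_cluster` asks for a cluster that exists only for this run.
        # The interactive alternative would be `existing_cluster_id`.
        "new_cluster": {
            "spark_version": spark_version,
            "node_type_id": node_type_id,
            "num_workers": num_workers,
        },
        "notebook_task": {"notebook_path": notebook_path},
        # Quartz cron syntax: every day at 02:00.
        "schedule": {
            "quartz_cron_expression": "0 0 2 * * ?",
            "timezone_id": "UTC",
        },
    }


# All values below are placeholders for illustration.
payload = job_cluster_payload(
    name="nightly-silver-load",
    notebook_path="/Pipelines/silver_load",
    spark_version="6.1.x-scala2.11",
    node_type_id="Standard_DS3_v2",
    num_workers=4,
)
print(json.dumps(payload, indent=2))
```

The cost saving comes from the cluster lifetime: a job cluster terminates when the run finishes, so nothing sits idle between scheduled runs.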

What I'm Watching Going Into 2020

Spark 3.0 is on the horizon. The preview documentation shows Adaptive Query Execution (AQE) — a feature that lets Spark adjust query plans at runtime based on actual data statistics rather than estimates. For complex queries with unpredictable data distribution, this could be a significant performance improvement. I'm tracking the release date.
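
To make the idea concrete, here is a toy sketch in Python of one decision AQE could revisit: picking a join strategy. This is not Spark's planner and none of these function names exist in Spark. The only real value is the 10 MB figure, which matches the default of spark.sql.autoBroadcastJoinThreshold; the table sizes are made up.

```python
# Matches Spark's default broadcast threshold of 10 MB.
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


def choose_join(smaller_side_bytes, threshold=BROADCAST_THRESHOLD_BYTES):
    """Pick a join strategy from the size of the smaller join input."""
    if smaller_side_bytes <= threshold:
        return "broadcast_hash_join"
    return "sort_merge_join"


def plan_with_reoptimization(estimated_bytes, observed_bytes):
    """Return the plan chosen from estimates and the plan chosen from runtime sizes."""
    static_plan = choose_join(estimated_bytes)    # decided before execution
    adaptive_plan = choose_join(observed_bytes)   # decided after a stage has run
    return static_plan, adaptive_plan


# Hypothetical case: the optimizer estimates 500 MB, but a selective filter
# leaves only 2 MB once the upstream stage has actually executed.
static_plan, adaptive_plan = plan_with_reoptimization(
    estimated_bytes=500 * 1024 * 1024,
    observed_bytes=2 * 1024 * 1024,
)
```

In this made-up case the static plan commits to a sort-merge join, while the re-planned version switches to a broadcast join because the real input turned out to be small. That is the kind of gap between estimates and reality I expect AQE to close.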

The ecosystem around Delta is growing faster than expected. There are credible integrations in early development for Presto and Hive to read Delta format directly. If those ship in 2020, the "Delta or Parquet" question becomes much simpler to answer.

2020 is also the year I expect Databricks to get more serious about SQL-first tooling. The gap between "data engineer working in notebooks" and "analyst writing SQL in a query tool" is real, and it's a friction point on every project. Something is going to address that. I just don't know exactly what yet. As always, I'm here to help.
