Five Years of Azure Data Factory: What I'd Tell 2014-Me

Five years. I shipped my first ADF v1 pipeline to production in late 2014. At the time it was simply called "Azure Data Factory," with no version number, and it was definitely not what I'd call mature. It had copy jobs, a JSON-based pipeline definition, and an on-premises gateway that mostly worked. That was about it.

It's December 2019. ADF v2 has parameterized pipelines, git integration, CI/CD via ARM templates, Mapping Data Flows, 90+ connectors, SAP connectivity, Snowflake support, monitoring via Log Analytics, and the Tumbling Window Trigger. The distance from 2014 to 2019 is substantial.

I've been thinking about what I'd tell 2014-me, if I could. Here it is.

Things 2014-Me Was Wrong About

"ADF Is SSIS in the Cloud"

This framing seemed helpful for explaining ADF to clients who knew SSIS. It was useful for the first six months. After that, it was a mental trap.

SSIS has data flow transforms; ADF didn't, so ADF looked incomplete. SSIS has rich per-package debugging; ADF had primitive logging, so ADF looked inferior. SSIS packages can be run independently; ADF pipelines couldn't be parameterized per source, so ADF looked like a step backward.

All of these comparisons were correct in 2014-2016. But they weren't comparing equivalent tools. SSIS is a local ETL execution engine with compile-time package generation and server-side execution. ADF is a cloud-native orchestration service with a service-hosted runtime, designed for API-composable infrastructure.

The comparison was useful to start with, but I held onto it too long. It made ADF's design choices look like failures — missing features — instead of what they were: deliberate architectural decisions for a different operating model. When I finally stopped asking "why doesn't ADF do what SSIS does?" and started asking "what is ADF designed to do well?" the tool made a lot more sense.

"V1 Patterns Are Worth Investing In"

I spent 2014-2016 building sophisticated ADF v1 configurations. Custom .NET activities for parameterization (because v1 didn't have parameters). Nested pipeline patterns to work around the lack of ForEach. Scheduled triggers via Azure Scheduler before the v1 Scheduler shipped. None of it carried over. V1 and v2 are separate resource types, and the v1 workarounds have no equivalent in v2.

The lesson: when a cloud service is in v1 and the vendor is clearly working on v2, don't over-invest in v1-specific workarounds. Invest in the patterns that will survive the version transition (the copy-from-source, land-in-lake, transform-to-warehouse model) and accept the v1 workarounds as temporary friction.

I could have told clients to do lighter-weight v1 implementations in 2015-2016 and spent that time building v2 skills instead. Hindsight.

"Git Integration Isn't Coming Soon"

I complained about the lack of git integration for four years. I built compensating mechanisms: JSON exports stored in SharePoint, documented portal change procedures, manual change audits. All of it was technical debt that disappeared when git integration shipped in early 2018.

The lesson: if a fundamental operational feature (source control, CI/CD, parameterization) is missing from a cloud service, it's usually on the roadmap. Don't build permanent workarounds for temporary gaps. Build the discipline and wait.

Things 2014-Me Was Right About

The JSON-Deployable Model Is Genuinely Better for Cloud-Native Teams

I argued this in 2014 against clients who wanted to stick with SSIS because it had a visual designer and DTSX packages they could inspect. The ADF v1 JSON model was clunky — no parameters, no reuse — but the principle was right: defining data pipelines as JSON infrastructure artifacts, deployable via ARM, is the correct model for cloud-native environments.

Five years later, this is table stakes. Infrastructure as code. GitOps. ARM templates. The ADF model in 2019 — pipeline JSON in git, ARM-deployed to multiple environments — is exactly what 2014-me was pointing toward when it was still theoretical.
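As a rough illustration of the "one definition, many environments" idea, here is a minimal Python sketch that generates one ARM deployment-parameters file per environment. The environment names and factory names are hypothetical, and it assumes the exported factory template takes a `factoryName` parameter; check your own template for the parameters it actually exposes.

```python
import json

# Hypothetical per-environment values; the factory names are illustrative only.
ENVIRONMENTS = {
    "dev": {"factoryName": "adf-example-dev"},
    "test": {"factoryName": "adf-example-test"},
    "prod": {"factoryName": "adf-example-prod"},
}

# Schema URL used by ARM deployment-parameters files.
PARAMETERS_SCHEMA = (
    "https://schema.management.azure.com/schemas/2015-01-01/"
    "deploymentParameters.json#"
)


def arm_parameters(values):
    """Wrap plain key/value pairs in the ARM deployment-parameters layout."""
    return {
        "$schema": PARAMETERS_SCHEMA,
        "contentVersion": "1.0.0.0",
        "parameters": {key: {"value": value} for key, value in values.items()},
    }


def render_all(environments):
    """Return {environment name: JSON text} for every environment."""
    return {
        env: json.dumps(arm_parameters(values), indent=2, sort_keys=True)
        for env, values in environments.items()
    }


if __name__ == "__main__":
    for env, text in render_all(ENVIRONMENTS).items():
        print(f"--- {env}.parameters.json ---")
        print(text)
```

The pipeline JSON and the template stay identical across environments; only these small parameter files differ, which is what makes the git-plus-ARM model reviewable.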

The Connector Ecosystem Would Grow to Cover Enterprise Sources

In 2014, ADF had about 20 connectors. SQL Server, Azure SQL, Oracle, MySQL, a few others. Clients with SAP, Salesforce, and Dynamics integration requirements couldn't use ADF as the extraction layer. I told them this would change.

It changed. 90+ connectors in 2019, including SAP (finally), Snowflake, Dynamics 365, Salesforce, REST, and the long tail of file formats and cloud storage systems. The connector gap is no longer a reason not to use ADF for the extraction layer.

The Microsoft 80% Pattern Would Play Out

Microsoft cloud services in v1 typically cover 80% of the use cases. The remaining 20% — parameterization, git integration, ForEach, better monitoring — arrives in v2 and subsequent releases. I said this to clients in 2015 when they complained about missing features. Wait 18-24 months; the obvious missing things will be there.

This is exactly what happened. Parameterization arrived in v2. ForEach arrived. Git integration arrived. Better monitoring arrived. The pattern held.
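To make those v2 additions concrete, here is a minimal Python sketch that builds a pipeline definition with one array parameter and a ForEach over it. The pipeline, parameter, and activity names are hypothetical, the inner Wait activity is only a stand-in for a real per-item copy, and the JSON shape reflects my understanding of the v2 pipeline schema rather than an authoritative reference.

```python
import json


def build_foreach_pipeline(pipeline_name, param_name):
    """Build a pipeline dict with one array parameter and a ForEach over it."""
    # Stand-in for the real per-item work (for example, a copy activity).
    inner_placeholder = {
        "name": "PlaceholderPerItemStep",
        "type": "Wait",
        "typeProperties": {"waitTimeInSeconds": 1},
    }
    return {
        "name": pipeline_name,
        "properties": {
            # The caller supplies the list at run time; v1 had no equivalent.
            "parameters": {param_name: {"type": "Array"}},
            "activities": [
                {
                    "name": "ForEachItem",
                    "type": "ForEach",
                    "typeProperties": {
                        # Expression that resolves to the parameter's value.
                        "items": {
                            "value": f"@pipeline().parameters.{param_name}",
                            "type": "Expression",
                        },
                        "activities": [inner_placeholder],
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    pipeline = build_foreach_pipeline("CopyAllTables", "tableNames")
    print(json.dumps(pipeline, indent=2))
```

One parameterized pipeline like this replaces what took a custom .NET activity plus nested pipelines in v1.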

The Biggest Thing I Missed

The biggest misread of 2014-2018: I was evaluating ADF against SSIS. The more important comparison was ADF vs. bespoke Python ETL scripts, or ADF vs. Databricks orchestration, or ADF vs. the thing clients were doing before they had a proper orchestration layer.

Compared to "a pile of SQL Agent jobs and SFTP scripts held together with a prayer," ADF was dramatically better from day one. That was the real comparison for most clients. The SSIS comparison made me hold ADF to a high bar on features SSIS had; the right comparison would have made ADF's value proposition much clearer.

ADF and SSIS coexist comfortably in 2019 in my client environments. ADF handles cloud-native orchestration, new API integrations, and workloads that need to scale on Spark. SSIS handles the legacy packages with fifteen years of invested logic that run on Azure-SSIS IR. Both are the right tool for their respective scenarios.

Five years. The tool got better, my understanding of it got sharper, and the clients who invested early are running reliable production workloads. Not a bad outcome.

Here's to the next five. As always, I'm here to help.
