Azure Data Factory: Six Months Into Production

A client engagement in Q3 2014 pushed me to put ADF into actual production — not a proof of concept, not a sandbox, but pipelines moving real data that real people depend on for real decisions. Six months later, I have an honest report on what held up and what didn't.

What Held Up

The Data Management Gateway. I expected this to be the fragile part. A lightweight agent running on a Windows server in the client's datacenter, maintaining an outbound tunnel to Azure, mediating all on-premises SQL Server access — there are plenty of ways this could break. In six months, the gateway went down twice: once because a Windows Update rebooted the host server without anyone checking whether ADF was mid-run, and once because the registration token expired in a way I didn't know was possible. Both were resolved in under an hour. For production on-premises connectivity, the gateway is solid.

Retry policies. I set retry to 3 with a 30-second interval on every Copy Activity. This fired four times over six months — twice on transient Azure SQL throttling, once on a network hiccup during a large blob copy, once on a gateway connectivity blip. All four resolved on retry without human intervention. The slice tracking meant no data was lost or duplicated — ADF knew exactly where to pick up.
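To make the retry semantics concrete, here is a minimal Python sketch of "retry 3 times, wait 30 seconds between attempts". It is an illustration of the behavior described above, not ADF's implementation; the function name run_with_retry and its parameters are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retry(
    action: Callable[[], T],
    retries: int = 3,
    interval_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `action` once, then retry up to `retries` more times on failure.

    `sleep` is injectable so the wait can be skipped or recorded in tests.
    """
    attempt = 0
    while True:
        try:
            return action()
        except Exception:
            if attempt >= retries:
                raise  # retries exhausted: surface the last error
            attempt += 1
            sleep(interval_seconds)  # fixed wait before the next attempt
```

A transient failure (throttling, a network blip) is absorbed as long as it clears within the retry budget; anything that outlasts it still fails loudly, which is what you want for a slice that needs human attention.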

JSON deployment discipline. The workflow I described in the setup post — JSON in git, PowerShell deployment, portal for monitoring only — held up well. Every change to the factory went through the deployment script. When I needed to roll back a bad dataset definition, I had the previous version in git. This discipline requires intentional effort, but it pays off immediately the first time you need to revert something.

Slice backfill. We had a two-day outage when the client's source SQL Server went offline for emergency maintenance. ADF accumulated 48 failed slices. When the server came back, ADF processed the backlog automatically, oldest first, in order. The downstream reports were stale for two days, then caught up within six hours. With SQL Agent scheduling, those 48 runs would simply have been missed, and someone would have had to rerun them by hand. The slice model's backfill behavior is a genuine operational advantage.
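The backfill behavior is easier to reason about with a small model. The Python sketch below assumes hourly slices (48 slices over two days is consistent with that, but the frequency is my assumption here) and uses illustrative status strings; it models the oldest-first ordering described above and is not ADF code.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

# A slice is a half-open time window: (start, end).
Slice = Tuple[datetime, datetime]


def hourly_slices(start: datetime, end: datetime) -> List[Slice]:
    """Split the window [start, end) into consecutive one-hour slices."""
    slices: List[Slice] = []
    cursor = start
    while cursor < end:
        nxt = cursor + timedelta(hours=1)
        slices.append((cursor, nxt))
        cursor = nxt
    return slices


def backfill_order(status: Dict[Slice, str]) -> List[Slice]:
    """Return the failed slices, oldest first, ready to be reprocessed."""
    return sorted(s for s, state in status.items() if state == "Failed")
```

Because every slice is an explicit, named unit of work, "what did we miss?" becomes a query over slice status rather than a reconstruction from job history.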

What Didn't Hold Up

Browser authoring at scale. By month four we had 23 pipelines. The Author and Deploy section of the portal had become a tree of JSON nodes that was difficult to navigate. There is no search, no filter, no grouping beyond the factory-level object type categories. Finding a specific dataset among 60 requires scrolling. Renaming an object means editing every reference to it manually. The tooling does not scale to real factory complexity.

Monitoring depth. The monitoring view shows slice status, which is useful: I know which slices failed. What it doesn't show is how long each activity in the pipeline took, how many rows the Copy Activity moved before it failed, or the full SQL error message behind a failed Stored Procedure Activity. The truncated version in the portal is often not enough to diagnose the problem without going to the Event Log on the gateway machine.

I built a supplemental monitoring layer: a SQL table that each pipeline's final stored procedure writes execution metadata into (pipeline name, slice start, slice end, rows processed, execution time, status). A Power BI report sits on top of it. This took two days to build and should have been part of the product.
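For a sense of the shape of that table, here is a rough Python sketch using SQLite as a stand-in for the real SQL database. The columns mirror the list above; the table name pipeline_run_log, the function names, and the choice to key on pipeline name plus slice start are assumptions for illustration. In the actual setup a stored procedure does the write, not Python.

```python
import sqlite3
from datetime import datetime

# One row per pipeline per slice; a rerun of a slice replaces its row.
SCHEMA = """
CREATE TABLE IF NOT EXISTS pipeline_run_log (
    pipeline_name     TEXT NOT NULL,
    slice_start       TEXT NOT NULL,
    slice_end         TEXT NOT NULL,
    rows_processed    INTEGER,
    execution_seconds REAL,
    status            TEXT NOT NULL,
    PRIMARY KEY (pipeline_name, slice_start)
)
"""


def open_log(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the log database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record_run(conn, pipeline_name: str, slice_start: datetime,
               slice_end: datetime, rows_processed: int,
               execution_seconds: float, status: str) -> None:
    """Write execution metadata for one slice, overwriting any earlier attempt."""
    conn.execute(
        "INSERT OR REPLACE INTO pipeline_run_log VALUES (?, ?, ?, ?, ?, ?)",
        (pipeline_name, slice_start.isoformat(), slice_end.isoformat(),
         rows_processed, execution_seconds, status),
    )
    conn.commit()


def failed_runs(conn) -> list:
    """List (pipeline_name, slice_start) for slices whose latest status is Failed."""
    return conn.execute(
        "SELECT pipeline_name, slice_start FROM pipeline_run_log "
        "WHERE status = 'Failed' ORDER BY slice_start"
    ).fetchall()
```

Keying on pipeline and slice start means a successful rerun overwrites the failed attempt, so the report shows the current state of each slice. If you need the full attempt history, add an attempt timestamp to the key instead.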

The git gap is getting more expensive. Two developers on the same factory, no git integration, deploying JSON files from separate machines. We established a convention: always pull the repo before making changes, always commit before deploying. It works until it doesn't — someone edits a pipeline directly in the portal to "just fix this one thing quickly" and doesn't commit. That edit is now invisible to the next deploy. We found it three weeks later when a deploy overwrote it. This is not a workflow failure; it's a tool failure.
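One way to catch this kind of drift before a deploy overwrites it is to compare the JSON in git against what is actually deployed. The Python sketch below assumes you have already loaded both sets of definitions into dictionaries keyed by object name; how you export the deployed definitions is left out, and the function names are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, List, Optional


def fingerprint(definition: Dict[str, Any]) -> str:
    """Hash a JSON definition in canonical form so key order and whitespace don't matter."""
    canonical = json.dumps(definition, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_drift(in_git: Dict[str, Dict[str, Any]],
               deployed: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return names of objects that differ between git and the factory.

    An object counts as drifted if it exists on only one side or if the
    two definitions have different fingerprints.
    """
    drifted: List[str] = []
    for name in sorted(set(in_git) | set(deployed)):
        a: Optional[Dict[str, Any]] = in_git.get(name)
        b: Optional[Dict[str, Any]] = deployed.get(name)
        if a is None or b is None or fingerprint(a) != fingerprint(b):
            drifted.append(name)
    return drifted
```

Run as a pre-deploy check, a non-empty result means someone changed the factory outside the repo, and the deploy should stop until that change is either committed or deliberately discarded.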

New Connectors Appearing

An FTP linked service appeared in late 2014 or early 2015. SFTP is in preview. Oracle via the Data Management Gateway has been production-stable for a month. The connector list is growing, and the new additions follow the same JSON pattern: a linked service definition, a dataset, and Copy Activity support. Onboarding a new connector type now takes an hour, not a day.

The Revised Verdict

ADF is production-worthy for cloud-native workloads with the right operational discipline. The data movement is reliable. The slice model's resilience is a real advantage. The gaps — monitoring depth, browser authoring at scale, no git integration — are real costs that your team will pay in engineering time. Budget for them explicitly. Don't assume the tool will handle it.

The clients I'd recommend ADF to right now: greenfield, Azure-first organizations with a small dev team (1-2 people) and workloads where cloud-native integration is the primary value. The clients I'd steer toward supplementing ADF with stronger operational tooling: teams of 3+ engineers on the same factory, teams running complex pipelines where monitoring depth matters, and environments where an audit trail of changes is a compliance requirement.

2015 is the year ADF needs to close the git and monitoring gaps. I'll be watching closely. If you're in production with ADF and have questions, I'm here to help.

Shannon Lowder
