Mapping Data Flows reached general availability in October 2019. I've been running them in production for real client workloads since the preview. Six months of real data, real failures, real optimizations. Here's what I've learned.
What Works Well
The Transformation Set Is Comprehensive
After six months, I have not hit a transformation scenario that the Data Flow catalog couldn't handle. Derived columns, aggregations, joins on multiple keys, slowly changing dimension logic via Surrogate Key and conditional splits, complex JSON flattening — all covered. The SSIS data flow transform comparison I made in 2018 holds: Data Flows have caught up on transformation capability.
Spark Scales for Large Datasets
One client's pipeline processes 800 million rows of transaction history as part of a monthly data refresh. Running this in a stored procedure on Azure SQL Managed Instance was timing out at 6 hours and consuming resources that affected production query performance. Moved to a Data Flow: 47 minutes, isolated compute, no impact on the SQL instance.
This is the Data Flow use case. Large-scale batch transformation where SQL can't scale and you don't want to write PySpark from scratch. Data Flows give you the Spark execution model with a visual authoring experience.
Debug Mode Changed How I Develop
I touched on debug mode in 2018. After six months of daily use: it's the most important Data Flow feature for developer productivity. Start a debug session in the morning, develop against a 1000-row sample throughout the day, stop the cluster at the end of the day. The per-hour cost of the debug IR is trivial compared to the time saved by not running full pipeline executions to validate logic.
The data preview at each transformation step — seeing the output schema and sample rows as you build the flow — catches errors immediately. I find bugs in my join keys or derived column expressions before I ever run a full pipeline.
Cluster TTL Solved the Cold Start Problem (Partially)
Configuring a 15-minute TTL on the Spark IR means pipelines that run frequently reuse a warm cluster. The TTL counts idle time after a run finishes, so reuse only happens when the next run starts within 15 minutes of the previous one completing. For those workloads, the first run pays the startup cost; subsequent runs within the TTL window skip it entirely.
The cost tradeoff: a warm cluster is a running cluster, and you are billed for the idle TTL window after each run. Do the math for your run frequency against the time-value of the startup delay. For a pipeline on a 30-minute schedule, a 15-minute TTL pays off only if runs take long enough that the idle gap before the next run stays under 15 minutes. Otherwise you pay for the idle window and still start cold, so either raise the TTL to cover the gap or turn it off.
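To make "do the math" concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: the `Schedule` fields, the 5-minute startup figure, the idea that the TTL clock starts when a run ends, and the simplification that startup and idle time are billed like run time. Check those against your own run history and invoice before trusting the numbers.

```python
from dataclasses import dataclass

MINUTES_PER_DAY = 24 * 60


@dataclass
class Schedule:
    interval_min: float  # minutes between scheduled run starts
    run_min: float       # transformation time once the cluster is up
    startup_min: float   # cold-start time (placeholder; measure your own)
    ttl_min: float       # idle TTL configured on the IR, 0 means no TTL


def stays_warm(s: Schedule) -> bool:
    """True if the next run starts before the idle TTL expires."""
    idle_gap = s.interval_min - s.run_min
    return s.ttl_min > 0 and idle_gap <= s.ttl_min


def daily_cluster_minutes(s: Schedule) -> float:
    """Cluster-minutes per day under the assumptions stated above."""
    if stays_warm(s):
        # The cluster never shuts down, so it is up around the clock.
        return float(MINUTES_PER_DAY)
    runs_per_day = MINUTES_PER_DAY / s.interval_min
    # Every run starts cold, then the cluster idles for the full TTL.
    return runs_per_day * (s.startup_min + s.run_min + s.ttl_min)


if __name__ == "__main__":
    for label, s in [
        ("30-min schedule, 5-min run, 15-min TTL", Schedule(30, 5, 5, 15)),
        ("30-min schedule, 5-min run, no TTL", Schedule(30, 5, 5, 0)),
        ("30-min schedule, 20-min run, 15-min TTL", Schedule(30, 20, 5, 15)),
    ]:
        print(f"{label}: warm={stays_warm(s)}, "
              f"cluster-minutes/day={daily_cluster_minutes(s):.0f}")
```

With these placeholder numbers, the first schedule never reuses the cluster and still pays for 15 idle minutes after every run (1200 cluster-minutes a day against 480 with no TTL). The third schedule stays warm because its idle gap is only 10 minutes. Multiply by your own per-minute rate and weigh the result against what the startup delay costs you.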
What's Still Challenging
Row-Level Debugging Is Limited
Data preview in debug mode shows you schema and sample rows. It does not show you why a specific row failed a transformation, or where in the data flow a specific row was routed. When a production Data Flow produces incorrect output, diagnosing which transformation step introduced the error requires adding intermediate Sink transformations that write to debug storage, then inspecting those files.
SSIS's data viewers — which let you see every row passing through a specific connection in real time — are genuinely better for row-level debugging. This is the one area where SSIS's execution model is more developer-friendly than ADF Data Flows.
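The intermediate-sink workaround gets less tedious with a small script that follows one key through the debug outputs. The sketch below is plain Python, not an ADF feature. It assumes each debug Sink wrote CSV with a header row and that the key column survives every stage; the stage names, column names, and sample data are all hypothetical. In practice you would read the files from debug storage instead of inline strings.

```python
import csv
import io
from typing import List, Optional, Tuple

Stage = Tuple[str, str]  # (stage name, CSV text written by that debug sink)


def rows_for_key(csv_text: str, key_col: str, key_value: str) -> List[dict]:
    """Return the rows in one stage's output whose key column matches."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row.get(key_col) == key_value]


def trace_key(stages: List[Stage], key_col: str, key_value: str):
    """List (stage name, matching rows) for each stage, in flow order."""
    return [(name, rows_for_key(text, key_col, key_value))
            for name, text in stages]


def first_stage_missing(trace) -> Optional[str]:
    """Name of the first stage where the key no longer appears, if any."""
    for name, rows in trace:
        if not rows:
            return name
    return None


# Hypothetical debug-sink outputs, in the order the flow produces them.
stages = [
    ("01_after_source", "order_id,amount\n1001,25.00\n1002,40.00\n"),
    ("02_after_join", "order_id,amount,customer\n1001,25.00,Acme\n"),
    ("03_after_derive", "order_id,amount,customer,tax\n1001,25.00,Acme,2.50\n"),
]

if __name__ == "__main__":
    trace = trace_key(stages, "order_id", "1002")
    print("row first missing at:", first_stage_missing(trace))
```

Here order 1002 is present after the source and gone after the join, which points at the join keys as the place to look. The same trace also shows how a row's column values change from stage to stage.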
Join Performance Tuning Is Partially Opaque
ADF generates the Spark execution plan from your Data Flow definition. You have limited control over how the plan is generated. For most workloads, the default behavior is fine. For workloads where performance matters — large joins, complex aggregations — you're working within constraints.
Available tuning knobs:
- Partitioning strategy: round robin, hash, key-based. For large joins, hash partitioning on the join key improves performance by co-locating matching rows on the same partition.
- Broadcast hints: for small lookup tables (small enough to fit comfortably in a worker node's memory), broadcast hints tell Spark to replicate the small table to all nodes, avoiding a shuffle join.
- Cluster size: more cores handle larger workloads, but don't fix logic-level inefficiencies.
Beyond these, you're trusting ADF's Spark plan generation. For most batch ETL workloads, that's fine. For high-performance scenarios, you may find Databricks' explicit PySpark gives you more control.
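If the partitioning and broadcast knobs feel abstract, this toy model in plain Python shows the idea behind each. It is not what ADF or Spark runs internally, only a sketch of the concept; the function names and sample rows are made up for illustration.

```python
import zlib
from collections import defaultdict


def partition_of(key: str, num_partitions: int) -> int:
    # Stable hash; Python's built-in hash() is salted per process for strings.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def hash_partition(rows, key_col, num_partitions):
    """Group rows by the partition their join key hashes to."""
    parts = defaultdict(list)
    for row in rows:
        parts[partition_of(row[key_col], num_partitions)].append(row)
    return parts


def partitioned_join(left, right, key_col, num_partitions):
    """Hash both sides on the key, then join each partition independently.

    Matching keys always land in the same partition, so no partition
    needs to look at another partition's rows.
    """
    left_parts = hash_partition(left, key_col, num_partitions)
    right_parts = hash_partition(right, key_col, num_partitions)
    out = []
    for p in range(num_partitions):
        lookup = defaultdict(list)
        for r in right_parts.get(p, []):
            lookup[r[key_col]].append(r)
        for l in left_parts.get(p, []):
            for r in lookup.get(l[key_col], []):
                out.append({**l, **r})
    return out


def broadcast_join(left_partitions, small, key_col):
    """Copy the small side to every partition; the large side never moves."""
    lookup = defaultdict(list)
    for r in small:
        lookup[r[key_col]].append(r)
    out = []
    for rows in left_partitions:  # each partition works against its own copy
        for l in rows:
            for r in lookup.get(l[key_col], []):
                out.append({**l, **r})
    return out


orders = [
    {"cust": "a", "amt": 1}, {"cust": "b", "amt": 2},
    {"cust": "a", "amt": 3}, {"cust": "c", "amt": 4},
]
customers = [{"cust": "a", "name": "Acme"}, {"cust": "b", "name": "Bolt"}]

if __name__ == "__main__":
    print(partitioned_join(orders, customers, "cust", num_partitions=4))
    print(broadcast_join([orders[:2], orders[2:]], customers, "cust"))
```

Both joins return the same matches. The difference is what has to move: the partitioned join redistributes both inputs by key, while the broadcast join leaves the large input where it is and ships only the small table, which is why it only makes sense when that table is small.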
Complex Multi-Flow Logic Is Harder to Debug
When your transformation spans multiple Data Flows (because ADF has per-flow limitations on complexity, or because you've modularized by concern), debugging a logical error that spans flows requires running each flow individually against test data. The flows don't share execution context across pipeline activities, so there's no unified trace of a row through the entire transformation chain.
The SSIS Comparison in 2019
A year ago I said Data Flows had closed the transformation capability gap with SSIS. I still believe that. The 2019 update:
- SSIS still wins on row-level debugging: data viewers are superior to Data Flow preview
- Data Flows win on scale: Spark execution for large datasets outperforms SSIS on a single server
- Data Flows win on cloud-native operation: no SSIS server to manage, auto-scaling, pay-per-run
- SSIS still wins on cold start latency: no Spark spin-up for high-frequency workloads
- Parity on transformation coverage: Data Flows no longer have meaningful functionality gaps relative to the SSIS transform catalog
For new greenfield ETL projects in Azure with batch workloads over large datasets: Data Flows. For legacy SSIS estates with complex packages and SSIS-expert teams: keep SSIS running on Azure-SSIS IR until you have a good reason to migrate.
Data Flows are production-grade for batch ETL. The preview roughness is gone. If you've been waiting for GA stability before committing, the wait is over. As always, I'm here to help if you want to talk through whether Data Flows are the right fit for a specific workload.