The single biggest complaint about ADF since 2014 has been the same: you can copy data, but you can't transform it. Copy Activity moves data from A to B. If you need to derive a column, aggregate rows, join two sources, or apply business logic — you had to hand off to SQL stored procedures, Databricks notebooks, or Azure Functions. ADF was an orchestrator that called other services to do the transformation work.
Mapping Data Flows are the answer. They're in preview as of mid-2018, and I've been running them on a real client project. Here's the honest evaluation.
What Data Flows Are
A Mapping Data Flow is a transformation layer inside ADF, built on Apache Spark. You design transformations visually using a drag-and-drop canvas. The execution engine is Spark running on an Azure Integration Runtime with Spark enabled.
If you've used SSIS, the transformation catalog will feel familiar:
- Source — read from a dataset (Azure SQL, Blob, ADLS, Cosmos, etc.)
- Derived Column — compute new columns using expressions (like SSIS's Derived Column transform)
- Aggregate — group by and aggregate (SUM, COUNT, MAX, etc.)
- Join — inner, left outer, right outer, full outer, cross joins
- Lookup — lookup and merge a single matching row (like SSIS's Lookup transform)
- Conditional Split — route rows to different branches based on conditions
- Flatten — expand arrays and complex types from JSON/XML sources
- Union — combine multiple streams row-wise
- Pivot / Unpivot — column rotation
- Surrogate Key — generate sequential integer keys (useful for dimension loading)
- Rank — assign row ranks within partitions
- Window — window functions (LAG, LEAD, running totals)
- Sink — write to a target dataset
That list covers the vast majority of ETL transformation work. The SSIS comparison is intentional — this is Microsoft telling the SSIS migration story in the cloud.
A Simple Data Flow Example
Here's a transformation I built for a client: read sales transactions from Azure SQL, compute a total including tax, aggregate by product category, write summary rows to Azure Blob as Parquet.
The data flow has four steps:
- SalesSource (Source) — reads from Azure SQL table
SalesTransactions - ComputeTotal (Derived Column) — adds column
TotalWithTax:Amount * (1 + TaxRate) - AggregateByCategory (Aggregate) — group by
ProductCategory, sumTotalWithTax, count rows - ParquetSink (Sink) — writes to ADLS Gen2 as Parquet, partitioned by category
Four transformations. No stored procedure. No Databricks notebook. No Azure Function. The transformation logic lives inside ADF.
The Data Flow expression language is similar to the ADF pipeline expression language but richer — it has Spark-like functions for string manipulation, date handling, and data type casting. The expression editor has decent autocomplete for this context.
The Execution Model: Spark Under the Hood
When a pipeline runs an activity that calls a Data Flow, ADF provisions an Azure IR with Spark enabled. The Data Flow compiles to a Spark execution plan and runs on the cluster.
This is the good news: Spark scales. If your source table has 100 million rows, Data Flows will handle it better than a stored procedure that times out or a SSIS package that runs out of memory on the orchestration server.
This is the other news: Spark cold start.
The Cold Start Problem
Every time a Data Flow runs against a cold cluster, you wait 3-5 minutes for the Spark cluster to provision. Your pipeline triggers. The Data Flow activity starts. You wait. Then it runs. Then it finishes.
For a daily batch pipeline that runs at 2 AM, 4 minutes of startup cost is irrelevant — you're waiting for results at 9 AM anyway. For a pipeline that runs every 15 minutes, 4 minutes of startup per run means you're spending more time on cluster spin-up than on actual data processing.
The current mitigation is to configure a Time to Live (TTL) on the Spark IR. With a 10-minute TTL, the cluster stays warm after a Data Flow run, so the next run within the TTL window skips the cold start. This is configurable but requires some thought about the cost tradeoff — a warm cluster is a running cluster billing you for the idle minutes.
The cold start problem is real. It's a fundamental consequence of the Spark execution model. Design your Data Flow usage around it.
Debug Mode
One genuinely excellent feature: debug mode. When you enable debug on a Data Flow, ADF provisions a small Spark cluster for your session. You can run the Data Flow against a sample of your data (configurable — 100 rows, 1000 rows) and see the output at each transformation step in the canvas.
This makes development much faster. You don't need to run the full pipeline to validate your transformation logic. Debug mode runs in seconds on a warm debug cluster.
The debug cluster is shared across your authoring session. You start it, use it, stop it. The cost is a few dollars per hour for the debug IR. Worth it for active development.
The Honest Verdict
Mapping Data Flows are a real transformation capability for ADF. The transformation catalog is comprehensive, the Spark execution scales for large datasets, and debug mode makes the development experience manageable.
The cold start latency makes Data Flows unsuitable for high-frequency or latency-sensitive workloads. For batch ETL that runs once or a few times per day — which is most of what ADF is designed for — the cold start is a manageable cost.
The SSIS comparison: SSIS data flows execute on the SSIS server with no startup cost. ADF Data Flows have 3-5 minute startup. SSIS wins on latency. ADF wins on scale — a Spark cluster handling 500M rows outperforms SSIS on a single server.
If you're running batch workloads and you've been calling stored procedures for transformation logic, Mapping Data Flows deserve a serious look. The preview is rough in spots, but the direction is right. As always, I'm here to help if you want to talk through whether Data Flows fit your workload.