Building a Metadata-Driven ADF Pipeline: What Works and What Doesn't

If you've done SSIS development at scale, you know BIML. You define a config table — source tables, target tables, column mappings, load patterns — and BIML generates N perfectly isolated SSIS packages at compile time. Each package is its own executable, its own set of connection managers, its own control flow. You can run them in parallel, debug them independently, fail one without affecting the others. The metadata-driven approach is the right pattern for enterprise ETL at scale.

ADF has no equivalent. I spent two months trying to build one. Here's what I found.

The Pattern I Attempted

Config table in Azure SQL Database:

CREATE TABLE dbo.ETL_Config (
  ConfigID INT IDENTITY(1,1) PRIMARY KEY,
  SourceLinkedService NVARCHAR(200),
  SourceTable NVARCHAR(200),
  SinkLinkedService NVARCHAR(200),
  SinkTable NVARCHAR(200),
  WatermarkColumn NVARCHAR(100) NULL,
  LastLoadDate DATETIME NULL,
  IsActive BIT DEFAULT 1,
  LoadOrder INT DEFAULT 0
);

The intended flow: a "controller" pipeline reads config rows, and for each active row, triggers a "worker" pipeline with the config values. The worker pipeline executes a parameterized copy for that specific source/sink combination, updates the watermark, signals success.

The problem: ADF has no native parameterization. There is no way to pass parameters to a pipeline at trigger time in V1. The "parameters" the worker pipeline needs — source table name, sink table name, watermark date — cannot be passed in at runtime. They must be baked into the pipeline JSON at deploy time.

The Workaround That Partially Works

State lives in a shared SQL table. The controller pipeline calls a stored procedure that reads the next pending config row and writes it to a "current execution" table. The worker pipeline reads its configuration from that table. Post-execution, the worker updates the execution table to mark the config row as complete and advances the watermark.

-- Current execution state table
CREATE TABLE dbo.ETL_CurrentExecution (
  ExecutionID INT IDENTITY(1,1) PRIMARY KEY,
  ConfigID INT NOT NULL,
  SliceStart DATETIME NOT NULL,
  AssignedAt DATETIME DEFAULT GETUTCDATE(),
  Status NVARCHAR(20) DEFAULT 'Pending'
);

-- Controller stored procedure
CREATE PROCEDURE dbo.usp_AssignNextConfig
  @SliceStart DATETIME
AS
BEGIN
  DECLARE @ConfigID INT;

  SELECT TOP 1 @ConfigID = ConfigID
  FROM dbo.ETL_Config
  WHERE IsActive = 1
    AND NOT EXISTS (
      SELECT 1 FROM dbo.ETL_CurrentExecution e
      WHERE e.ConfigID = dbo.ETL_Config.ConfigID
        AND e.SliceStart = @SliceStart
        AND e.Status IN ('Pending', 'Running')
    )
  ORDER BY LoadOrder;

  IF @ConfigID IS NOT NULL
  BEGIN
    INSERT INTO dbo.ETL_CurrentExecution (ConfigID, SliceStart, Status)
    VALUES (@ConfigID, @SliceStart, 'Running');
    SELECT @ConfigID AS ConfigID;
  END
END

The worker pipeline reads the current execution row, executes Copy Activity against the source/sink defined there, and marks the execution as complete. This works — in isolated conditions.

The Failure Modes

Race conditions with concurrent slices. ADF can run multiple slices in parallel if concurrency is greater than 1 on the pipeline. Two concurrent slices both calling usp_AssignNextConfig simultaneously can both pick up the same config row if the first hasn't committed its INSERT yet. You end up with the same source table being loaded twice into the same sink table for the same slice window. Data corruption, not a failing pipeline — the worst kind of bug.

Fix: set concurrency: 1 on the pipeline. Now you've eliminated the parallelism benefit that made the metadata-driven pattern worthwhile in the first place. This is the fundamental tension.

No real looping. ADF doesn't have a loop construct. To process N config rows, you chain N pipeline activations or use a trigger that re-fires the pipeline N times. Both approaches have limits: chaining creates deeply nested dependency graphs that are hard to manage, and trigger-chaining has a limit on recursion depth. For a config table with 50 source tables, this approach breaks down.

Coarse error isolation. In BIML-generated SSIS packages, each package is isolated. Source table X failing has zero effect on source table Y — they run as separate packages. In ADF, if the shared state table gets corrupted by a mid-execution failure, all subsequent config rows in the same pipeline run are affected. One bad config row can kill the entire pipeline run.

The Uncomfortable SSIS Comparison

BIML solves the metadata-driven problem at compile time. The config table drives code generation, not runtime behavior. You generate N isolated packages from the config, each with its own dependencies, its own error handling, its own connection managers. The metadata drives the build, not the execution. Runtime state is isolated per package.

ADF cannot do this. There is no compile-time code generation from metadata. The JSON for each pipeline is static — it defines exactly one workflow. To get N workflows for N config rows, you need N pipelines. That means either generating N pipeline JSON files programmatically (possible, but now you have a code generation layer outside ADF to maintain), or forcing runtime parameterization through shared state (which has the race condition and isolation problems described above).

I've heard "parameterization is coming in V2." I'm told V2 will support runtime parameters at the pipeline level — you will be able to pass values into a pipeline when you trigger it. If that ships, the metadata-driven pattern becomes genuinely viable. Until it does, ADF is repeating the SSIS pattern: Microsoft gave SSIS a deployment model without a built-in metadata framework, and third parties (BIML) filled the gap. ADF shipped without parameterization, and customers are building workarounds.

Practical Guidance

Build the metadata-driven pattern if you need it. Accept the limitations. Document them explicitly for your client or team: concurrency must be 1, error isolation is pipeline-level not config-row-level, the looping pattern has a ceiling. These are known constraints, not surprises.

For config tables with fewer than 20 rows, the workaround is manageable. For config tables with 50+ rows, generate N pipeline JSON files programmatically and deploy them — isolation is more important than elegance at that scale. For any scenario where true runtime parameterization is required, wait for V2 or build a custom activity that handles the parameterization internally.

V2 parameterization will change this calculus significantly. When it ships, I'll revisit this post and update the guidance. If you're building a metadata-driven ADF framework right now and hitting the walls I described, I'm here to help.

Read more