ADF Pipeline Concepts: Understanding Activities, Datasets, and Linked Services

When you come to ADF from SSIS, you carry a mental model of packages, connection managers, data flows, and control flows. Some of that maps cleanly to ADF. Some of it maps badly enough to cause real confusion. Let's build the right mental model from the start.

Linked Services: The Connection Layer

In SSIS, a Connection Manager holds the connection string and authentication details for a data source. It lives inside the package and travels with it. In ADF, a Linked Service is the same concept but elevated to a first-class factory-level object — shared across all datasets and activities in the factory.

Linked services come in two flavors: cloud and on-premises. Cloud linked services authenticate directly. On-premises linked services require the Data Management Gateway (a lightweight agent you install on a machine in your network). More on the gateway in the setup post — for now, just know that every connection has a linked service behind it.

{
  "name": "OnPremSQLLinkedService",
  "properties": {
    "type": "OnPremisesSqlServer",
    "typeProperties": {
      "connectionString": "Data Source=MYSERVER;Initial Catalog=MyDB;Integrated Security=False;User ID=adfuser;Password=...",
      "gatewayName": "MyDataGateway"
    }
  }
}
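For contrast, a cloud linked service has no gatewayName at all. Here is a minimal sketch of the AzureStorageLinkedService that the dataset and activity examples in this post point at, assuming account-key authentication. The account name and key are placeholders, not real values.

```json
{
  "name": "AzureStorageLinkedService",
  "properties": {
    "type": "AzureStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
    }
  }
}
```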

Datasets: Metadata, Not Data

This trips people up: a dataset in ADF is not a result set. It is metadata that describes where data lives and how it is partitioned in time. The dataset points at a linked service and defines the structure, location, and availability of data. Think of it as the schema plus the address plus the scheduling hint.

The availability block is where ADF's scheduling model lives. Every dataset has a frequency and an interval. These drive the slice model — ADF's core scheduling mechanism.

{
  "name": "DailySalesBlob",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": "AzureStorageLinkedService",
    "typeProperties": {
      "folderPath": "sales/input/{Year}/{Month}/{Day}",
      "partitionedBy": [
        { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" }},
        { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" }},
        { "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" }}
      ],
      "format": { "type": "TextFormat", "columnDelimiter": ",", "firstRowAsHeader": true }
    },
    "availability": { "frequency": "Day", "interval": 1 },
    "external": true
  }
}

The external: true flag is critical for source datasets. It tells ADF that this data is produced outside the factory — ADF is not responsible for generating it. Without this flag on your source, ADF will wait forever for a slice it expects to create itself and never will.
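By contrast, a dataset that an activity produces leaves the flag out. The sketch below shows roughly what the StagingDataset referenced in the Stored Procedure example might look like, assuming it is an Azure SQL table. The table name stg.Sales is hypothetical.

```json
{
  "name": "StagingDataset",
  "properties": {
    "type": "AzureSqlTable",
    "linkedServiceName": "AzureSQLLinkedService",
    "typeProperties": { "tableName": "stg.Sales" },
    "availability": { "frequency": "Day", "interval": 1 }
  }
}
```

With external absent, ADF expects some activity in the factory to list this dataset in its outputs and produce each daily slice.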

The Slice Model: ADF's Scheduling Mechanism

Here is where ADF diverges most sharply from SQL Agent and SSIS scheduling. ADF does not simply run a pipeline at a scheduled time. It divides the pipeline's active period into slices based on the output dataset's availability frequency. Each slice is an independent unit of work with its own state: Waiting, In Progress, Ready, Failed, or Skipped. Ready is the success state: the slice's data has been produced and downstream activities can consume it.

If you define a pipeline active from March 1 through March 31 with a daily output dataset, ADF creates 31 slices. If the pipeline was paused for five days, ADF will backfill those five slices automatically when the pipeline resumes. SQL Agent would have simply missed those runs. This automatic backfill behavior is one of ADF's genuine advantages for data pipelines where gaps are unacceptable.
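How a backlog of slices gets worked off is governed by the policy section of the activity that produces them. The fragment below is a sketch of that section on its own, wrapped in braces so it stays valid JSON. The values are illustrative assumptions, not recommendations.

```json
{
  "policy": {
    "concurrency": 1,
    "executionPriorityOrder": "OldestFirst",
    "retry": 2,
    "timeout": "01:00:00"
  }
}
```

With concurrency set to 1 and OldestFirst ordering, backfilled slices run one at a time in chronological order. Raising concurrency lets several slices run in parallel, which only makes sense if your slices are independent of each other.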

The tradeoff: the slice model is harder to reason about than simple time-based scheduling. "Why is this slice in Waiting state?" usually means a dependency isn't satisfied — a source dataset slice isn't ready, or an upstream activity failed. The dependency chain tracks through linked datasets across activities.

Activity Types

Copy Activity

The workhorse. Moves data from a source to a sink with optional column mapping, type conversion, and parallelism. The source and sink types determine what options are available. BlobSource lets you skip header lines and handle empty values. SqlSink lets you run a pre-copy script to truncate the target and write via batch insert or a stored procedure. I'll dedicate a full post to Copy Activity — there's enough to cover.
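To make those options concrete, here is a hedged sketch of a Copy Activity that uses them. The activity name, the truncate target stg.Sales, and the batch size are hypothetical. sqlWriterCleanupScript is the SqlSink property that plays the pre-copy script role described above.

```json
{
  "name": "CopySalesToStaging",
  "type": "Copy",
  "description": "Hypothetical example: activity name, table name, and batch size are placeholders",
  "inputs": [{ "name": "RawBlobDataset" }],
  "outputs": [{ "name": "StagingDataset" }],
  "typeProperties": {
    "source": { "type": "BlobSource", "skipHeaderLineCount": 1, "treatEmptyAsNull": true },
    "sink": {
      "type": "SqlSink",
      "sqlWriterCleanupScript": "TRUNCATE TABLE stg.Sales",
      "writeBatchSize": 10000
    }
  },
  "scheduler": { "frequency": "Day", "interval": 1 }
}
```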

HDInsight Activity

Submits a Hive script, Pig script, or MapReduce job to an HDInsight cluster. The cluster can be an existing one (linked service points to it) or ADF can provision an on-demand cluster, run the job, and tear it down. On-demand clusters add startup latency (8-12 minutes) but eliminate the cost of a permanently running cluster for infrequent jobs.

{
  "name": "HiveTransformActivity",
  "type": "HDInsightHive",
  "inputs": [{ "name": "RawBlobDataset" }],
  "outputs": [{ "name": "ProcessedBlobDataset" }],
  "linkedServiceName": "HDInsightOnDemandLinkedService",
  "typeProperties": {
    "scriptPath": "scripts/transform.hql",
    "scriptLinkedService": "AzureStorageLinkedService"
  }
}
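The HDInsightOnDemandLinkedService named in that activity is where the on-demand behavior is configured. A minimal sketch, assuming a four-node cluster that is torn down after ten idle minutes. Both values are illustrative, and your subscription may need additional properties such as a cluster version.

```json
{
  "name": "HDInsightOnDemandLinkedService",
  "properties": {
    "type": "HDInsightOnDemand",
    "typeProperties": {
      "clusterSize": 4,
      "timeToLive": "00:10:00",
      "linkedServiceName": "AzureStorageLinkedService"
    }
  }
}
```

timeToLive is the idle window before ADF deletes the cluster. A longer window lets consecutive slices reuse a warm cluster, at the price of paying for the idle time.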

Stored Procedure Activity

Calls a stored procedure on Azure SQL Database or Azure SQL Data Warehouse. This is the primary transformation mechanism for teams without HDInsight — Copy Activity lands data in staging, Stored Procedure Activity runs the transform logic, output dataset signals completion. It works, but it means your transformation logic lives in SQL, not in the pipeline.

{
  "name": "RunTransformProc",
  "type": "SqlServerStoredProcedure",
  "inputs": [{ "name": "StagingDataset" }],
  "outputs": [{ "name": "FinalDataset" }],
  "linkedServiceName": "AzureSQLLinkedService",
  "typeProperties": {
    "storedProcedureName": "usp_TransformSalesData",
    "storedProcedureParameters": {
      "SliceStart": "$$Text.Format('{0:yyyy-MM-dd}', SliceStart)"
    }
  }
}

Note the $$Text.Format expression — ADF's built-in expression language for injecting slice metadata into activity parameters. It's limited but essential.
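The same expression syntax works in a Copy Activity source query, which is the usual way to pull only one slice's worth of rows. The sketch below assumes a hypothetical dbo.Sales table with a SaleDate column. WindowStart and WindowEnd are the system variables for the start and end of the activity run window.

```json
{
  "type": "SqlSource",
  "sqlReaderQuery": "$$Text.Format('SELECT * FROM dbo.Sales WHERE SaleDate >= \\'{0:yyyy-MM-dd HH:mm}\\' AND SaleDate < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)"
}
```

The doubled backslashes are there because the single quotes inside the query have to survive both the JSON string and the Text.Format argument.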

Custom Activity

Runs a .NET assembly on an Azure Batch compute pool. Escape hatch for anything the built-in activity types don't cover. The overhead (Batch pool, assembly packaging) is real — only reach for this when you genuinely have no other option.
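For a sense of the moving parts, here is a sketch of a Custom Activity definition. The assembly name, entry point class, package path, and the AzureBatchLinkedService name are all hypothetical placeholders.

```json
{
  "name": "CustomCleanseActivity",
  "type": "DotNetActivity",
  "description": "Hypothetical example: assembly, entry point, package path, and linked service names are placeholders",
  "inputs": [{ "name": "RawBlobDataset" }],
  "outputs": [{ "name": "ProcessedBlobDataset" }],
  "linkedServiceName": "AzureBatchLinkedService",
  "typeProperties": {
    "assemblyName": "SalesCleanse.dll",
    "entryPoint": "SalesCleanse.CleanseActivity",
    "packageLinkedService": "AzureStorageLinkedService",
    "packageFile": "customactivities/SalesCleanse.zip"
  },
  "policy": { "concurrency": 1, "retry": 2, "timeout": "00:30:00" }
}
```

The zip in packageFile holds the compiled assembly and its dependencies, and the entry point is the class that implements the activity interface. That packaging step is a large part of the overhead mentioned above.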

Pipeline Active Period and Dependencies

A pipeline's start and end define when it runs. Activities within a pipeline execute when their input datasets have a Ready slice. Dependencies between activities are implicit — the output of Activity A becomes an input of Activity B, so B waits for A's output slice to be Ready before it starts.
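Put together, a pipeline with an active period and one implicit dependency might look like the sketch below. It reuses dataset and procedure names from the examples in this post. The pipeline name, the copy activity name, and the year are hypothetical, and the typeProperties are trimmed to the minimum.

```json
{
  "name": "DailySalesPipeline",
  "properties": {
    "description": "Hypothetical example: pipeline name, activity names, and the year are placeholders",
    "start": "2016-03-01T00:00:00Z",
    "end": "2016-04-01T00:00:00Z",
    "activities": [
      {
        "name": "CopySalesToStaging",
        "type": "Copy",
        "inputs": [{ "name": "DailySalesBlob" }],
        "outputs": [{ "name": "StagingDataset" }],
        "typeProperties": {
          "source": { "type": "BlobSource" },
          "sink": { "type": "SqlSink" }
        },
        "scheduler": { "frequency": "Day", "interval": 1 }
      },
      {
        "name": "RunTransformProc",
        "type": "SqlServerStoredProcedure",
        "inputs": [{ "name": "StagingDataset" }],
        "outputs": [{ "name": "FinalDataset" }],
        "typeProperties": { "storedProcedureName": "usp_TransformSalesData" },
        "scheduler": { "frequency": "Day", "interval": 1 }
      }
    ]
  }
}
```

Nothing in the JSON says "run the procedure after the copy". The ordering falls out of StagingDataset appearing as an output of the first activity and an input of the second.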

This is fundamentally different from SSIS precedence constraints. There is no explicit "on success, run B; on failure, run C" branching. Dependencies in ADF are data-driven: when the data is ready, the downstream activity runs. Multi-path conditional logic is difficult to express in this model — a limitation that will matter in production.

The SSIS mapping, in short: a Linked Service is a Connection Manager, a Dataset is source/destination metadata plus a schedule, Copy Activity is a simple Data Flow Task, Stored Procedure Activity is an Execute SQL Task, and a Pipeline is a package skeleton. The mappings hold at a surface level. Dig deeper and they break down, which is worth understanding before you try to migrate anything.

Next post: the honest head-to-head comparison.

Shannon Lowder
