Azure Data Factory Preview: First Look at Microsoft's Cloud ETL Offering

Every few years a tool shows up that makes you stop and reconsider your defaults. For most of us doing ETL on the Microsoft stack, SSIS is that default. It works, it's mature, you know exactly where to look when it breaks. Then Azure Data Factory hit preview in early 2014, and I had to decide whether to take it seriously or file it under "wait and see."

I took it seriously. Here's what I found.

What ADF Actually Is

Azure Data Factory is Microsoft's cloud-native ETL orchestration service. No servers to provision, no runtime to install, no SSISDB to babysit. You define pipelines in JSON, point them at data sources and sinks, and let Azure run them on a schedule. The pitch is compelling: infrastructure-free data movement at cloud scale, deployable from templates, pay only for what you run.

The preview in early 2014 is rough around the edges, but the architecture is sound. Four core concepts underpin everything in ADF, and you need them before anything else here makes sense.

The Four Core Concepts

Linked Services

A linked service is a connection definition — the ADF equivalent of an SSIS Connection Manager. It tells ADF where a data store lives and how to authenticate to it. Every dataset and every compute resource points back to a linked service.

{
  "name": "AzureStorageLinkedService",
  "properties": {
    "type": "AzureStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=mystorageaccount;AccountKey=..."
    }
  }
}

Simple enough. I'll cover the credentials problem in a later post. Spoiler: an account key sitting in plaintext JSON is not ideal.

Datasets

A dataset describes the shape and location of data within a linked service. It's not the data itself — it's the metadata: which container, which table, which file path, and how that data is partitioned over time. The partition model is one of ADF's most distinctive features and one of its more confusing ones at first encounter.

{
  "name": "AzureBlobInput",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": "AzureStorageLinkedService",
    "typeProperties": {
      "folderPath": "mycontainer/input/",
      "fileName": "data.csv",
      "format": { "type": "TextFormat", "columnDelimiter": "," }
    },
    "availability": { "frequency": "Hour", "interval": 1 },
    "external": true
  }
}

The availability block drives ADF's scheduling model. More on that when we dig into the slice concept.
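As a rough preview of that slice idea: as I understand the model, ADF tiles a pipeline's active period into windows whose size comes from the dataset's availability block, and each window is one slice. The Python sketch below is my own illustration, not ADF code. The function name enumerate_slices is made up, it handles only the Hour and Day frequencies, and it assumes the active period lines up with whole windows.

```python
from datetime import datetime, timedelta, timezone

# Fixed-length frequencies only; calendar-based ones are out of scope for this sketch.
STEP = {"Hour": timedelta(hours=1), "Day": timedelta(days=1)}

def enumerate_slices(start, end, frequency, interval):
    """Yield (slice_start, slice_end) windows that tile the period [start, end)."""
    step = STEP[frequency] * interval
    cursor = start
    while cursor < end:
        yield cursor, cursor + step
        cursor += step

# A one-month active period (March 2014) with the hourly availability
# from the AzureBlobInput dataset: frequency "Hour", interval 1.
start = datetime(2014, 3, 1, tzinfo=timezone.utc)
end = datetime(2014, 4, 1, tzinfo=timezone.utc)
slices = list(enumerate_slices(start, end, "Hour", 1))

print(len(slices))  # 744: 31 days x 24 hourly windows
print(slices[0])    # the first window, 00:00 to 01:00 UTC on March 1
```

The useful takeaway is the count: an hourly dataset over a one-month period means hundreds of independent units of work, which is why the availability block matters as much as it does.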

Activities

An activity is a unit of work within a pipeline. ADF ships with three in this preview: Copy Activity (data movement), HDInsight Activity (Hive/Pig/MapReduce on a managed cluster), and Stored Procedure Activity (call a proc on Azure SQL or SQL Data Warehouse). Custom .NET activities are also supported for anything else.

Copy Activity is the one you'll use most. It handles source-to-sink data movement with optional column mapping, retry logic, and parallelism controls. It is not a data flow: there is no transformation engine equivalent to SSIS's Data Flow Task. Data lands as-is, and transformations happen downstream in SQL or HDInsight.

Pipelines

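A pipeline is a container for activities with a defined active period. The active period (start date, end date) tells ADF when to run. Activities within a pipeline execute in a dependency order derived from each activity's inputs and outputs: if one activity's output dataset is another activity's input, the second waits for the first. There are no explicit precedence constraints of the kind you draw in an SSIS package. The example below copies AzureBlobInput into a dataset named AzureSQLOutput, which I haven't shown; assume it's an Azure SQL table dataset defined along the same lines as the blob one.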

{
  "name": "CopyPipeline",
  "properties": {
    "start": "2014-03-01T00:00:00Z",
    "end": "2014-04-01T00:00:00Z",
    "isPaused": false,
    "activities": [
      {
        "name": "CopyFromBlobToSQL",
        "type": "Copy",
        "inputs": [{ "name": "AzureBlobInput" }],
        "outputs": [{ "name": "AzureSQLOutput" }],
        "typeProperties": {
          "source": { "type": "BlobSource" },
          "sink": { "type": "SqlSink", "writeBatchSize": 10000, "writeBatchTimeout": "60:00:00" }
        },
        "policy": { "concurrency": 1, "retry": 3, "timeout": "01:00:00" }
      }
    ]
  }
}
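To make the input/output dependency rule concrete, here is a small Python sketch of how an execution order can be derived from nothing but dataset names. It illustrates the idea and makes no claim about ADF's actual scheduler. The second activity (RunStagingProc) and the ReportTable dataset are hypothetical, added only so there is something to order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def execution_order(activities):
    """Order activities so each one runs after the producers of its inputs."""
    # Map every output dataset name to the activity that writes it.
    producer = {out["name"]: act["name"]
                for act in activities for out in act["outputs"]}
    sorter = TopologicalSorter()
    for act in activities:
        # An activity depends on whichever activities produce its inputs.
        # Inputs nobody produces (external data) add no dependency.
        upstream = {producer[i["name"]] for i in act["inputs"]
                    if i["name"] in producer}
        sorter.add(act["name"], *upstream)
    return list(sorter.static_order())

# Deliberately listed out of order: the result depends on datasets, not position.
activities = [
    {"name": "RunStagingProc",  # hypothetical downstream activity
     "inputs": [{"name": "AzureSQLOutput"}],
     "outputs": [{"name": "ReportTable"}]},
    {"name": "CopyFromBlobToSQL",
     "inputs": [{"name": "AzureBlobInput"}],
     "outputs": [{"name": "AzureSQLOutput"}]},
]

print(execution_order(activities))  # ['CopyFromBlobToSQL', 'RunStagingProc']
```

Nothing in the input says "run the copy first." The order falls out of the fact that AzureSQLOutput is written by one activity and read by the other, which is the mental shift coming from SSIS precedence constraints.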

What the Preview Gets Right

No infrastructure. Seriously, this alone is worth paying attention to. No VM, no runtime installation, no Windows Service to monitor. ADF runs in Azure's managed compute fabric. For teams that have spent time managing SSIS servers, this is not a minor convenience — it's a fundamentally different operational model.

JSON-deployable from day one. The entire factory (linked services, datasets, pipelines) is JSON. You can check it into source control, template it for multiple environments, and deploy it with PowerShell. SSIS packages are XML under the hood, but it's dense, designer-generated XML that nobody enjoys diffing. ADF's artifacts are readable text from the start.
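Here is roughly what "template it for multiple environments" can look like, as a Python sketch under my own assumptions. The environment names, the storage account names, and the placeholder keys are all hypothetical, and the actual deployment step is left out. The only thing taken from this post is the shape of the linked service JSON.

```python
import json

def render_linked_service(account_name, account_key):
    """Build the AzureStorage linked service document for one environment."""
    return {
        "name": "AzureStorageLinkedService",
        "properties": {
            "type": "AzureStorage",
            "typeProperties": {
                "connectionString": (
                    "DefaultEndpointsProtocol=https;"
                    f"AccountName={account_name};AccountKey={account_key}"
                )
            },
        },
    }

# Hypothetical environments and storage account names.
ENVIRONMENTS = {"dev": "mystorageaccountdev", "prod": "mystorageaccountprod"}

def render_all(keys):
    """Return {environment: JSON text}, ready to write to disk and deploy."""
    return {
        env: json.dumps(render_linked_service(account, keys[env]), indent=2)
        for env, account in ENVIRONMENTS.items()
    }

# Placeholder keys; real ones should come from somewhere safer than source code.
rendered = render_all({"dev": "<dev-key>", "prod": "<prod-key>"})
print(rendered["dev"])
```

Building the document as a dictionary and serializing it, rather than doing find-and-replace on JSON text, means a key containing quotes or backslashes can't produce invalid JSON.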

Native Azure integration. Blob Storage, Azure SQL Database, SQL Data Warehouse, HDInsight: these aren't bolted-on connectors, they're first-class citizens. Connecting to one is a linked service with a connection string, as in the example above. No ODBC driver installs, no DSN configuration.

What the Preview Gets Wrong

No git integration. This is the gap I keep coming back to. ADF's authoring experience is a browser-based JSON editor. There is no link to a git repository. There is no commit on save. There is no diff before you deploy. You are editing live configuration in a web form, and if you close the tab before saving, your work is gone. For a tool that generates JSON artifacts that are obviously suited for version control, this is a baffling omission.
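Until that gap closes, a diff is easy enough to do yourself if you keep your own copy of the JSON. The Python sketch below is a stopgap of my own, not an ADF feature: it normalizes two versions of a document and prints a unified diff. The two policy fragments are made-up examples based on the Copy Activity policy block shown earlier.

```python
import difflib
import json

def json_diff(deployed, edited):
    """Unified diff of two JSON documents after normalizing key order and indentation."""
    def canonical(doc):
        # Sorted keys and fixed indentation keep formatting noise out of the diff.
        return json.dumps(doc, indent=2, sort_keys=True).splitlines()
    return list(difflib.unified_diff(
        canonical(deployed), canonical(edited),
        fromfile="deployed", tofile="edited", lineterm=""))

# Made-up before/after versions of an activity policy.
deployed = {"policy": {"concurrency": 1, "retry": 3, "timeout": "01:00:00"}}
edited = {"policy": {"concurrency": 1, "retry": 5, "timeout": "01:00:00"}}

print("\n".join(json_diff(deployed, edited)))
```

Run against those two fragments, the diff flags only the retry change, which is exactly the review step the browser editor skips.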

No transformation engine. Copy Activity moves data. If you need a Derived Column transformation, a Lookup join, a Conditional Split, or an Aggregate — that's not here. You'll push data to staging and run a stored procedure. That's the pattern. It works, but it means ADF is an orchestration layer, not a full ETL platform.

Connector library is thin. Azure Blob, Azure SQL Database, SQL Server via Data Management Gateway, Oracle, MySQL. That's the list in preview. No Salesforce, no SAP, no REST/HTTP sources, no FTP. If your source system isn't on that list, you're writing a custom activity.

The Preview Verdict

ADF is promising and immature. The architecture is right: cloud-native, infrastructure-free, JSON-deployable. The execution in preview is incomplete: no git integration, no transformation engine, limited connectors. This is a tool worth watching and worth piloting for cloud-native workloads. I would not migrate production SSIS workloads to it today.

Over the next several posts I'll go deep on each of the four core concepts, then compare ADF and SSIS head-to-head so you have a real decision framework. If you're evaluating ADF right now and have questions, I'm here to help.
