The connector list has grown significantly since ADF's 2014 preview. I'll run through the new connectors first, then go deep on datasets: specifically the partition model and the external flag, which are the two most common sources of pipeline failures for developers new to ADF.
Updated Connector List for Early 2015
What's available now that wasn't in 2014:
- Azure Data Lake Store — First-class support. ADLS is the storage layer for analytics workloads on Azure; having it as a native ADF source and sink is significant.
- FTP — Linked service for FTP sources. Copy Activity can pull files from FTP into Blob or Azure SQL. Finally handles the "vendor drops a file on an FTP server" pattern without a custom activity.
- SFTP — In preview. Same pattern as FTP but with SSH authentication.
- Salesforce — In preview. Limited to specific objects. Treat it as experimental until it exits preview.
- Oracle — Now stable via the Data Management Gateway (DMG). It had some rough edges in 2014 but now runs reliably in production.
- MySQL — Stable via DMG.
- HDFS — Hadoop Distributed File System as a source via DMG. For on-premises Hadoop environments.
Each new connector follows the same linked service + dataset + Copy Activity pattern. Onboarding a new source is JSON configuration, not code.
Datasets: The Partition Model
The partition model in ADF datasets is what makes time-based data organization first-class. Instead of hard-coding a file path, you define a path that incorporates slice timing:
{
  "name": "DailySalesBlob",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": "AzureStorageLinkedService",
    "typeProperties": {
      "folderPath": "sales/input/{Year}/{Month}/{Day}/",
      "fileName": "sales.csv",
      "partitionedBy": [
        {
          "name": "Year",
          "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" }
        },
        {
          "name": "Month",
          "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" }
        },
        {
          "name": "Day",
          "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" }
        }
      ],
      "format": {
        "type": "TextFormat",
        "columnDelimiter": ",",
        "firstRowAsHeader": true
      }
    },
    "availability": { "frequency": "Day", "interval": 1 },
    "external": true
  }
}
When ADF processes the slice for 2015-03-15, it resolves {Year} to 2015, {Month} to 03, {Day} to 15, producing the path sales/input/2015/03/15/sales.csv. Each day's slice reads from its own path partition. This is the standard pattern for landing zone organization where upstream processes drop files into date-partitioned folders.
You can also use SliceEnd instead of SliceStart — useful when files are named by the end of the window they cover. The format string accepts any .NET DateTime format specifier.
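To make the resolution concrete, here is a rough Python sketch of how I think about it. This is not ADF's code, just an emulation under stated assumptions: the function name resolve_folder_path is mine, and the sketch only understands the four format strings in the lookup table (yyyy, MM, dd, HH), not the full .NET format language.

```python
from datetime import datetime
import re

# Assumption: only these .NET format strings are supported by this sketch.
_DOTNET_TO_STRFTIME = {"yyyy": "%Y", "MM": "%m", "dd": "%d", "HH": "%H"}


def resolve_folder_path(type_properties, slice_start, slice_end):
    """Substitute {Name} placeholders in folderPath using partitionedBy."""
    anchors = {"SliceStart": slice_start, "SliceEnd": slice_end}
    values = {}
    for part in type_properties.get("partitionedBy", []):
        spec = part["value"]
        when = anchors[spec["date"]]  # pick the slice boundary the entry names
        values[part["name"]] = when.strftime(_DOTNET_TO_STRFTIME[spec["format"]])
    # A placeholder with no partitionedBy entry raises KeyError instead of
    # silently staying in the path.
    return re.sub(r"\{(\w+)\}", lambda m: values[m.group(1)],
                  type_properties["folderPath"])


daily_sales = {
    "folderPath": "sales/input/{Year}/{Month}/{Day}/",
    "fileName": "sales.csv",
    "partitionedBy": [
        {"name": "Year", "value": {"type": "DateTime", "date": "SliceStart", "format": "yyyy"}},
        {"name": "Month", "value": {"type": "DateTime", "date": "SliceStart", "format": "MM"}},
        {"name": "Day", "value": {"type": "DateTime", "date": "SliceStart", "format": "dd"}},
    ],
}

folder = resolve_folder_path(daily_sales, datetime(2015, 3, 15), datetime(2015, 3, 16))
print(folder + daily_sales["fileName"])  # sales/input/2015/03/15/sales.csv
```

Swapping "SliceStart" for "SliceEnd" in the Day entry would resolve the same slice to day 16, which is the difference to keep in mind when files are named by the end of their window.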
The External Flag: The Most Misunderstood Configuration
Every dataset whose data is produced outside the factory must have "external": true in its properties. In practice that means every source dataset at the head of a pipeline. The flag tells ADF that the data for this dataset is produced by something outside the factory. ADF will check whether the data exists but will not wait for a factory pipeline to create it.
Without external: true on a source dataset, ADF assumes the factory is responsible for creating that data. It looks for an ADF pipeline that outputs to this dataset. When it can't find one, the slice stays in Waiting state indefinitely. The pipeline appears to hang. There is no error message that says "you forgot external: true." You have to know to look for it.
-- Dataset WITHOUT external (ADF thinks factory should produce this)
"availability": { "frequency": "Day", "interval": 1 }
-- Missing: "external": true -- pipeline will wait forever
-- Dataset WITH external (ADF knows data comes from outside)
"availability": { "frequency": "Day", "interval": 1 },
"external": true
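Every source dataset fed from outside the factory: add external: true. Every dataset that an activity in the factory writes to, including intermediate datasets that a later activity reads: do not add external: true. This is the rule. Make it a checklist item.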
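If you keep your dataset and pipeline JSON in source control, the checklist item can be automated. The Python sketch below is one way to do it, not an ADF feature: find_external_flag_problems is a name I made up, and it assumes each pipeline is a dict with properties.activities, where every activity lists its inputs and outputs by dataset name.

```python
def find_external_flag_problems(datasets, pipelines):
    """Return (dataset_name, problem) tuples that break the external rule."""
    produced, consumed = set(), set()
    for pipeline in pipelines:
        for activity in pipeline["properties"]["activities"]:
            produced.update(o["name"] for o in activity.get("outputs", []))
            consumed.update(i["name"] for i in activity.get("inputs", []))

    problems = []
    for ds in datasets:
        name = ds["name"]
        external = ds["properties"].get("external", False)
        if name in produced and external:
            # The factory writes this dataset, so it must not be external.
            problems.append((name, "produced by a pipeline but marked external"))
        elif name in consumed and name not in produced and not external:
            # Nothing in the factory creates it: the slice would wait forever.
            problems.append((name, "consumed but never produced; missing external: true"))
    return problems


example_datasets = [
    {"name": "DailySalesBlob", "properties": {}},  # external flag forgotten
    {"name": "StagingSalesTable", "properties": {}},
]
example_pipelines = [
    {"name": "LoadSales", "properties": {"activities": [
        {"name": "CopySales",
         "inputs": [{"name": "DailySalesBlob"}],
         "outputs": [{"name": "StagingSalesTable"}]},
    ]}},
]

for name, problem in find_external_flag_problems(example_datasets, example_pipelines):
    print(name, "->", problem)
# DailySalesBlob -> consumed but never produced; missing external: true
```

The pipeline name LoadSales and activity name CopySales are placeholders for illustration only.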
SQL Dataset vs. Blob Dataset
SQL dataset pointing to a specific table:
{
  "name": "StagingSalesTable",
  "properties": {
    "type": "AzureSqlTable",
    "linkedServiceName": "AzureSQLLinkedService",
    "typeProperties": {
      "tableName": "dbo.SalesStaging"
    },
    "availability": { "frequency": "Day", "interval": 1 }
  }
}
No partitioning here: the table exists independently of slice timing. The slice model applies to the dataset's availability and scheduling, not to the table structure. If you need slice-specific table access, handle it in the activity definition: use a Stored Procedure Activity, or a Copy Activity source query or stored procedure, with the slice start and end passed in as parameters.
ADLS dataset with partitioned path (same pattern as Blob but pointing to ADLS):
{
  "name": "ADLSRawData",
  "properties": {
    "type": "AzureDataLakeStore",
    "linkedServiceName": "ADLSLinkedService",
    "typeProperties": {
      "folderPath": "/raw/sales/{Year}/{Month}/",
      "partitionedBy": [
        { "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
        { "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } }
      ],
      "format": { "type": "JsonFormat", "filePattern": "setOfObjects" }
    },
    "availability": { "frequency": "Month", "interval": 1 },
    "external": true
  }
}
The Linked Service / Dataset / Activity Dependency Chain
Every pipeline follows this dependency chain: Linked Service defines the connection, Dataset references the Linked Service and defines the data structure and schedule, Activity references input and output Datasets and executes when the input slice is Ready. If anything in this chain is misconfigured, the pipeline fails — often with a non-obvious error message.
Debugging sequence when a pipeline won't run: first check the slice state of input datasets (are source slices Ready?), then check the linked service connectivity (can ADF reach the data store?), then check the dataset definition (is external: true set correctly?), then check the activity definition. Work from the outside in.
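Some breaks in the dependency chain can be caught before anything is deployed. The sketch below is a local sanity check in Python, not part of ADF: find_broken_references is a hypothetical helper, and it assumes linked services, datasets, and pipelines are loaded as dicts shaped like the JSON in this post (pipelines with properties.activities, each activity naming its inputs and outputs).

```python
def find_broken_references(linked_services, datasets, pipelines):
    """List references in the chain that point at names that don't exist."""
    ls_names = {ls["name"] for ls in linked_services}
    ds_names = {ds["name"] for ds in datasets}
    broken = []

    # Dataset -> Linked Service
    for ds in datasets:
        ref = ds["properties"].get("linkedServiceName")
        if ref not in ls_names:
            broken.append(f"dataset {ds['name']} -> linked service {ref}")

    # Activity -> Dataset (inputs and outputs)
    for pipeline in pipelines:
        for activity in pipeline["properties"]["activities"]:
            refs = activity.get("inputs", []) + activity.get("outputs", [])
            for ref in refs:
                if ref["name"] not in ds_names:
                    broken.append(f"activity {activity['name']} -> dataset {ref['name']}")
    return broken


chain_linked_services = [{"name": "AzureStorageLinkedService"}]
chain_datasets = [
    {"name": "DailySalesBlob",
     "properties": {"linkedServiceName": "AzureStorageLinkedService"}},
    {"name": "StagingSalesTable",
     "properties": {"linkedServiceName": "AzureSQLLinkedService"}},  # not defined
]
chain_pipelines = [
    {"name": "LoadSales", "properties": {"activities": [
        {"name": "CopySales",
         "inputs": [{"name": "DailySalesBlob"}],
         "outputs": [{"name": "StagingSalesTabel"}]},  # deliberate typo
    ]}},
]

for line in find_broken_references(chain_linked_services, chain_datasets, chain_pipelines):
    print(line)
# dataset StagingSalesTable -> linked service AzureSQLLinkedService
# activity CopySales -> dataset StagingSalesTabel
```

This only checks that names line up. It says nothing about connectivity or slice state, so it complements the debugging sequence rather than replacing it.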
Next post: Copy Activity deep dive for 2015, including PolyBase staging for SQL Data Warehouse loads and the incremental load pattern. If you're troubleshooting a dataset or linked service issue, I'm here to help.