Copy Activity has matured since the 2014 preview. New source and sink types, PolyBase support for SQL Data Warehouse, and enough production experience to know the patterns that work and the ones that fight you. This is the deep dive SSIS developers need before they build their first production ADF pipeline.
The Source Query for SQL Sources
When your source is a SQL database, you are not limited to reading an entire table. The sqlReaderQuery property lets you provide a full SQL statement:
{
"name": "IncrementalCopyFromSQL",
"type": "Copy",
"inputs": [{ "name": "SourceSQLDataset" }],
"outputs": [{ "name": "StagingBlobDataset" }],
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "SELECT * FROM dbo.Orders WHERE ModifiedDate >= '$$Text.Format('{0:yyyy-MM-dd}', SliceStart)' AND ModifiedDate < '$$Text.Format('{0:yyyy-MM-dd}', SliceEnd)'"
},
"sink": {
"type": "BlobSink",
"writeBatchSize": 10000,
"writeBatchTimeout": "01:00:00"
}
},
"policy": { "concurrency": 1, "retry": 3, "timeout": "02:00:00" }
}
The $$Text.Format expressions inject SliceStart and SliceEnd into the query at runtime. This is the foundation of incremental load in ADF — each daily slice reads only the rows modified during that slice's time window. Simple, right? The catch: you need a reliable ModifiedDate column on the source table. If that column isn't maintained accurately by the source system, the incremental query produces incorrect results silently.
PolyBase Staging for SQL Data Warehouse
Loading data into Azure SQL Data Warehouse via standard INSERT is slow for large volumes. PolyBase — which loads from Azure Blob Storage directly into DW without going through the SQL engine's row-by-row path — is dramatically faster. ADF supports PolyBase staging natively in Copy Activity:
{
"name": "CopyToSQLDW",
"type": "Copy",
"inputs": [{ "name": "SourceDataset" }],
"outputs": [{ "name": "SQLDWOutputDataset" }],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "SqlDWSink",
"allowPolyBase": true,
"polyBaseSettings": {
"rejectType": "percentage",
"rejectValue": 10,
"rejectSampleValue": 200,
"useTypeDefault": true
},
"writeBatchSize": 10000
},
"enableStaging": true,
"stagingSettings": {
"linkedServiceName": "AzureStorageStagingLinkedService",
"path": "polybase-staging",
"enableCompression": true
}
}
}
With PolyBase enabled, ADF copies source data to the staging Blob location, then issues a PolyBase COPY command from DW to load from Blob. For multi-GB loads the throughput improvement is significant — expect 3-5x over standard INSERT depending on row width and DW DWU configuration. The staging Blob is cleaned up after the load completes.
The rejectValue and rejectSampleValue settings control how many rows PolyBase will tolerate as non-conformant before failing the load. Set these based on your data quality tolerance — a strict rejectValue: 0 means a single bad row fails the entire load.
All Source and Sink Types in Mid-2015
Sources: BlobSource, FileSystemSource, SqlSource, SqlDWSource, OracleSource, MySqlSource, FtpSource, HttpSource, AzureTableSource, DocumentDbCollectionSource, AzureDataLakeStoreSource.
Sinks: BlobSink, SqlSink, SqlDWSink (with PolyBase), OracleSink, AzureTableSink, DocumentDbCollectionSink, AzureDataLakeStoreSink.
The connector list has grown meaningfully from 2014. FTP as a source is now production-ready — I've had client pipelines pulling vendor files from FTP for two months without issues.
Patterns That Work
Copy to staging, then transform in SQL. This is the ADF idiom. Copy Activity lands data in a staging table. Stored Procedure Activity runs the transform — MERGE, UPSERT, aggregation, business logic. Output dataset signals completion. Clean separation of concerns, debuggable at each step.
Incremental load with high-water mark. Control table in SQL stores the last successful load date per pipeline. Copy Activity reads the high-water mark, uses it in sqlReaderQuery. Post-copy stored procedure updates the high-water mark. Each slice advances the mark. Pattern:
-- Control table
CREATE TABLE dbo.WatermarkControl (
PipelineName NVARCHAR(100) PRIMARY KEY,
LastLoadDate DATETIME
);
-- Source query using watermark (referenced in ADF via stored procedure activity first)
SELECT * FROM dbo.Orders
WHERE ModifiedDate > @LastLoadDate AND ModifiedDate <= @SliceEnd;
-- Post-copy stored procedure
UPDATE dbo.WatermarkControl
SET LastLoadDate = @SliceEnd
WHERE PipelineName = 'OrdersIncremental';
The limitation: this pattern requires two activities per pipeline slice (one to read the watermark, one to update it), and concurrent slice execution can cause watermark race conditions. Set concurrency: 1 on the pipeline if watermark integrity matters.
What Copy Activity Still Cannot Do
No row-level filtering from Blob sources. If you want only rows where Status = 'ACTIVE' from a CSV file, the entire file lands — you filter in the post-copy stored procedure. For large files where most rows are filtered out, this is an efficiency problem.
No aggregation. No joins across sources in a single Copy Activity. No incremental load from Blob without external watermark management. No schema-on-read from unstructured Blob content — the dataset must declare the schema explicitly.
These limitations aren't going to change — they're architectural, not gaps in the current release. Design your pipelines with these constraints in mind from the start and they won't surprise you in production. If you need transformations that Copy Activity can't do, they live in SQL stored procedures. That's the deal.
Next post: the git integration post I've been building toward for a year. It's time to name the problem directly. If you're wrestling with Copy Activity configuration, I'm here to help.