I've been using ADLS Gen1 as the raw landing zone for analytics workloads for about a year and a half. I've iterated through a few approaches to the ADF-to-ADLS ingestion pattern and I've landed on something that works well enough to write up.
The short version: ADF handles the extract-and-land phase, raw data goes to ADLS with a consistent partitioned path structure, downstream compute (Databricks or HDInsight) picks it up for transformation. Simple. The details matter.
The Storage Structure
Consistency in path structure pays off when you have dozens of sources landing data in the same lake. The structure I use:
/raw/{source-system}/{year}/{month}/{day}/{filename}
Examples:
/raw/salesforce/2017/06/15/Opportunity_20170615.csv
/raw/sql-server-erp/2017/06/15/Orders_20170615_000000.parquet
/raw/vendor-ftp/2017/06/15/inventory_file_20170615.csv
Year/month/day partitioning enables downstream tools to use partition pruning when reading. Hive, U-SQL, Databricks — they all understand path-based partitioning. The date in the filename is redundant with the folder structure but I keep it for clarity when files end up in operational tooling or manual inspection.
For high-frequency sources (hourly loads), I add an hour level:
/raw/{source-system}/{year}/{month}/{day}/{hour}/{filename}
The ADF Pipeline Pattern
The tumbling window trigger is the right trigger for this pattern. The window start and end times drive the partition path and the incremental extract predicate.
{
"name": "IngestToADLS",
"parameters": {
"SourceTable": { "type": "String" },
"SourceSystem": { "type": "String" },
"WindowStart": { "type": "String" },
"WindowEnd": { "type": "String" }
},
"activities": [
{
"name": "CopyToADLS",
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": {
"value": "@concat('SELECT * FROM dbo.[', pipeline().parameters.SourceTable, '] WHERE ModifiedDate >= ''', pipeline().parameters.WindowStart, ''' AND ModifiedDate < ''', pipeline().parameters.WindowEnd, '''')",
"type": "Expression"
}
},
"sink": {
"type": "AzureDataLakeStoreSink",
"fileName": {
"value": "@concat(pipeline().parameters.SourceTable, '_', replace(replace(pipeline().parameters.WindowStart, ':', ''), '-', ''), '.parquet')",
"type": "Expression"
},
"folderPath": {
"value": "@concat('raw/', pipeline().parameters.SourceSystem, '/', formatDateTime(pipeline().parameters.WindowStart, 'yyyy'), '/', formatDateTime(pipeline().parameters.WindowStart, 'MM'), '/', formatDateTime(pipeline().parameters.WindowStart, 'dd'))",
"type": "Expression"
}
}
}
}
]
}
Authentication: Service Principal, Not Storage Key
The ADLS linked service in ADF supports two authentication methods: service principal and storage account key. Use service principal in production. Always.
Storage account key authentication gives any process holding that key full read/write access to every container in the account. A leaked key compromises the entire storage account. Rotating the key requires updating every system that uses it.
Service principal authentication gives the service principal you create exactly the permissions you grant it. Different ADF instances can have different service principals with different permission scopes. Rotating credentials means rotating the service principal secret, which affects only the systems using that specific principal.
{
"name": "ADLSLinkedService",
"type": "AzureDataLakeStore",
"typeProperties": {
"dataLakeStoreUri": "https://yourdatalake.azuredatalakestore.net/webhdfs/v1",
"servicePrincipalId": "your-sp-app-id",
"servicePrincipalKey": {
"type": "SecureString",
"value": "your-sp-key"
},
"tenant": "your-tenant-id"
}
}
Grant the service principal RWX on the folders it needs to write. Read-only access for sources ADF reads from. Write access on the raw landing folders. This is the least-privilege principle applied to storage access.
File Format: Parquet Over CSV
For SQL Server sources going to ADLS, write Parquet, not CSV. The reasons:
- Parquet preserves data types. CSV converts everything to strings and relies on the reader to cast them back. Null values in dates, integers, and decimals are a constant source of downstream processing bugs with CSV.
- Parquet is columnar and compressed. A table with 50 columns where you typically query 10 of them reads much faster from Parquet — the reader skips the columns it doesn't need. CSV forces full row reads.
- Parquet file sizes are much smaller than equivalent CSV for typical analytical data. ADLS storage costs are not the concern, but smaller files mean faster reads and less network traffic in downstream processing.
ADF supports writing Parquet to ADLS natively. Set the format type in the sink dataset to ParquetFormat and you're done.
The Monitoring Gap
One frustration with this pattern: ADF doesn't tell you the exact byte count or row count written to ADLS during an active copy run. The activity run detail shows dataWritten (bytes) and rowsCopied after completion, but during the run there's no progress indicator. For large extracts that take 20-30 minutes, you're watching a spinning indicator with no indication of how far through the data the copy is.
The workaround: ADLS diagnostic logs show file creation events with sizes. Enable ADLS diagnostic logging to Log Analytics and you can see files being written in near-real-time. It's not the same as in-pipeline progress reporting, but it's something.
The downstream compute also provides a check: when Databricks reads the Parquet file and runs its first count, you get the row count. This isn't real-time monitoring but it's a useful sanity check in the transformation layer.
What I'd Change
ADLS Gen2 is on the horizon — a merger of ADLS Gen1 with Azure Blob Storage hierarchical namespace support. The connector story will be different. I'm building this architecture on Gen1 because that's what's production-stable today, but I'm watching Gen2 closely. The path structure and partitioning approach will carry over; the authentication and linked service configuration will change.
If you're designing an ADLS ingestion architecture now and want to talk through the Gen1 vs. Gen2 question or the partitioning strategy for your specific source mix, I'm here to help.