ADF and Azure Data Lake Storage: The Ingestion Architecture I've Settled On

I've been using ADLS Gen1 as the raw landing zone for analytics workloads for about a year and a half. I've iterated through a few approaches to the ADF-to-ADLS ingestion pattern and I've landed on something that works well enough to write up.

The short version: ADF handles the extract-and-land phase, raw data goes to ADLS with a consistent partitioned path structure, downstream compute (Databricks or HDInsight) picks it up for transformation. Simple. The details matter.

The Storage Structure

Consistency in path structure pays off when you have dozens of sources landing data in the same lake. The structure I use:

/raw/{source-system}/{year}/{month}/{day}/{filename}

Examples:

/raw/salesforce/2017/06/15/Opportunity_20170615.csv
/raw/sql-server-erp/2017/06/15/Orders_20170615_000000.parquet
/raw/vendor-ftp/2017/06/15/inventory_file_20170615.csv

Year/month/day partitioning enables downstream tools to use partition pruning when reading. Hive, U-SQL, Databricks — they all understand path-based partitioning. The date in the filename is redundant with the folder structure but I keep it for clarity when files end up in operational tooling or manual inspection.

For high-frequency sources (hourly loads), I add an hour level:

/raw/{source-system}/{year}/{month}/{day}/{hour}/{filename}

The ADF Pipeline Pattern

The tumbling window trigger is the right trigger for this pattern. The window start and end times drive the partition path and the incremental extract predicate.

{
  "name": "IngestToADLS",
  "parameters": {
    "SourceTable":   { "type": "String" },
    "SourceSystem":  { "type": "String" },
    "WindowStart":   { "type": "String" },
    "WindowEnd":     { "type": "String" }
  },
  "activities": [
    {
      "name": "CopyToADLS",
      "type": "Copy",
      "typeProperties": {
        "source": {
          "type": "SqlSource",
          "sqlReaderQuery": {
            "value": "@concat('SELECT * FROM dbo.[', pipeline().parameters.SourceTable, '] WHERE ModifiedDate >= ''', pipeline().parameters.WindowStart, ''' AND ModifiedDate < ''', pipeline().parameters.WindowEnd, '''')",
            "type": "Expression"
          }
        },
        "sink": {
          "type": "AzureDataLakeStoreSink",
          "fileName": {
            "value": "@concat(pipeline().parameters.SourceTable, '_', replace(replace(pipeline().parameters.WindowStart, ':', ''), '-', ''), '.parquet')",
            "type": "Expression"
          },
          "folderPath": {
            "value": "@concat('raw/', pipeline().parameters.SourceSystem, '/', formatDateTime(pipeline().parameters.WindowStart, 'yyyy'), '/', formatDateTime(pipeline().parameters.WindowStart, 'MM'), '/', formatDateTime(pipeline().parameters.WindowStart, 'dd'))",
            "type": "Expression"
          }
        }
      }
    }
  ]
}

Authentication: Service Principal, Not Storage Key

The ADLS linked service in ADF supports two authentication methods: service principal and storage account key. Use service principal in production. Always.

Storage account key authentication gives any process holding that key full read/write access to every container in the account. A leaked key compromises the entire storage account. Rotating the key requires updating every system that uses it.

Service principal authentication gives the service principal you create exactly the permissions you grant it. Different ADF instances can have different service principals with different permission scopes. Rotating credentials means rotating the service principal secret, which affects only the systems using that specific principal.

{
  "name": "ADLSLinkedService",
  "type": "AzureDataLakeStore",
  "typeProperties": {
    "dataLakeStoreUri": "https://yourdatalake.azuredatalakestore.net/webhdfs/v1",
    "servicePrincipalId": "your-sp-app-id",
    "servicePrincipalKey": {
      "type": "SecureString",
      "value": "your-sp-key"
    },
    "tenant": "your-tenant-id"
  }
}

Grant the service principal RWX on the folders it needs to write. Read-only access for sources ADF reads from. Write access on the raw landing folders. This is the least-privilege principle applied to storage access.

File Format: Parquet Over CSV

For SQL Server sources going to ADLS, write Parquet, not CSV. The reasons:

Parquet preserves data types. CSV converts everything to strings and relies on the reader to cast them back. Null values in dates, integers, and decimals are a constant source of downstream processing bugs with CSV.
Parquet is columnar and compressed. A table with 50 columns where you typically query 10 of them reads much faster from Parquet — the reader skips the columns it doesn't need. CSV forces full row reads.
Parquet file sizes are much smaller than equivalent CSV for typical analytical data. ADLS storage costs are not the concern, but smaller files mean faster reads and less network traffic in downstream processing.

ADF supports writing Parquet to ADLS natively. Set the format type in the sink dataset to ParquetFormat and you're done.

The Monitoring Gap

One frustration with this pattern: ADF doesn't tell you the exact byte count or row count written to ADLS during an active copy run. The activity run detail shows dataWritten (bytes) and rowsCopied after completion, but during the run there's no progress indicator. For large extracts that take 20-30 minutes, you're watching a spinning indicator with no indication of how far through the data the copy is.

The workaround: ADLS diagnostic logs show file creation events with sizes. Enable ADLS diagnostic logging to Log Analytics and you can see files being written in near-real-time. It's not the same as in-pipeline progress reporting, but it's something.

The downstream compute also provides a check: when Databricks reads the Parquet file and runs its first count, you get the row count. This isn't real-time monitoring but it's a useful sanity check in the transformation layer.

What I'd Change

ADLS Gen2 is on the horizon — a merger of ADLS Gen1 with Azure Blob Storage hierarchical namespace support. The connector story will be different. I'm building this architecture on Gen1 because that's what's production-stable today, but I'm watching Gen2 closely. The path structure and partitioning approach will carry over; the authentication and linked service configuration will change.

If you're designing an ADLS ingestion architecture now and want to talk through the Gen1 vs. Gen2 question or the partitioning strategy for your specific source mix, I'm here to help.

ADF and Azure Data Lake Storage: The Ingestion Architecture I've Settled On

Shannon Lowder

The Storage Structure

The ADF Pipeline Pattern

Authentication: Service Principal, Not Storage Key

File Format: Parquet Over CSV

The Monitoring Gap

What I'd Change

Read more

The Context Problem Neither Agent Mesh Nor OpenSharing Solves

Unity AI Gateway and What a Governed Model Access Layer Actually Buys You

You Don't Need Fable. You Need a Router.

DAIS 2026: Genie One and the Context Problem Databricks Is Solving