Data Engineers Have a Network Problem Too: Latency Patterns in ETL Pipelines
Most of the writing about cloud latency and connection resiliency focuses on application developers. It's their connection pools that drain and their N+1 queries that explode. But data engineers running ETL pipelines against cloud endpoints had their own version of the same problem, and the failure modes were different enough to warrant separate treatment.
I spent a good part of 2011 figuring out which SSIS patterns and T-SQL batch patterns broke against SQL Azure and what to replace them with. This is what I learned.
SSIS on a LAN vs SSIS to SQL Azure
SSIS was designed for LAN-speed connections. The default buffer sizes, batch commit intervals, and data flow architectures in the toolbox all assumed that data movement between source and destination was fast and reliable. Moving those packages to run against Azure endpoints exposed the assumptions.
The row-by-row commit problem. An OLE DB Destination with the default settings commits rows in batches of 2,048. On a LAN with a 0.2ms round-trip, the commit overhead is negligible. At 15ms per commit to SQL Azure, that's 15ms × (total rows / 2,048) in commit overhead alone. For a million-row load, that's nearly 7 minutes of pure commit latency. Increasing the batch size to 50,000–100,000 rows dropped the commit overhead by 25×.
-- OLE DB Destination setting in SSIS
-- Rows per batch: 50000 (up from default 2048)
-- Maximum insert commit size: 100000 (up from default 2147483647, which ignores this)The lookup transformation cache problem. A Lookup transformation in SSIS hits the reference table for each row by default. Against a LAN database, this was acceptable. Against SQL Azure at 15ms per lookup, a 100,000-row data flow with a lookup on every row was 25 minutes of lookup latency. The fix: use Full Cache mode, which loads the entire reference table into memory before processing. One round-trip instead of 100,000.
Connection managers and transient errors. SSIS didn't have built-in retry logic for transient SQL Azure failures in 2011. When a connection dropped mid-execution, the package failed. The pattern that worked: wrap the data flow in a For Loop container with a retry counter, catch the connection failure at the package level, and re-execute with exponential backoff. Not elegant, but functional.
T-SQL Batch Patterns for Cloud Targets
Stored procedures and T-SQL batches running against SQL Azure had their own set of anti-patterns.
Cursor-based row processing. A cursor that processed rows one at a time was 15ms per row at Azure latency. A million-row cursor was four hours. Set-based operations that processed all matching rows in a single statement — UPDATE, INSERT...SELECT, MERGE — were the obvious fix, but that required rewriting logic that had been working "fine" on-prem for years.
WHILE loops as pseudo-cursors. The DBA community had mostly learned to avoid cursors by 2011, but WHILE loops processing rows in small batches were common. Same problem: each loop iteration was a round-trip.
-- Anti-pattern: WHILE loop with per-row work
WHILE @rowsRemaining > 0
BEGIN
UPDATE TOP(1) dbo.ProcessingQueue
SET ProcessedAt = GETUTCDATE()
WHERE ProcessedAt IS NULL;
SET @rowsRemaining = @rowsRemaining - @@ROWCOUNT;
END
-- Better: single set-based operation
UPDATE dbo.ProcessingQueue
SET ProcessedAt = GETUTCDATE()
WHERE ProcessedAt IS NULL;Frequent small commits in a loop. SSIS wasn't the only place where commit granularity mattered. T-SQL batches that committed every 100 rows for "safety" were generating one round-trip per 100 rows. For large batches, consolidating to a single commit at the end (if idempotency allowed it) or committing every 10,000–50,000 rows significantly reduced the commit overhead.
The Staging Pattern
The most reliable pattern I found for large ETL operations against SQL Azure was to stage data locally before loading it to the cloud target. Extract from the source to Azure Blob Storage (fast, high throughput). Load from Blob Storage to SQL Azure using BULK INSERT or BCP (single round-trip per file, not per row). Transform in SQL Azure using set-based operations after the raw data is loaded.
This pattern eliminated the row-by-row latency problem entirely. The cloud endpoint was only involved in the bulk load step, which was optimized for throughput rather than latency. The transformation step ran inside the SQL Azure engine, which had LAN-speed access to its own data.
The Lesson That Transferred
Every cloud data migration I ran after mid-2011 included a pipeline performance audit before the migration. The checklist:
- Any SSIS Lookup transformations using row-by-row mode instead of Full Cache?
- Any SSIS destinations with commit sizes under 10,000 rows?
- Any cursor or WHILE loop row processing in the target database?
- Any batch jobs relying on sequential small commits?
Finding and fixing these before migration meant the pipeline performed acceptably on day one instead of discovering the problems under production load. That's the whole game. As always, I'm here to help.