The Network Was the Bottleneck: Ingesting NOAA, USDA, and NCIS in Days

One of the projects I'm most proud of from this period was building the automation that staged all of the NOAA, USDA, and NCIS data into our Azure Data Lake. I say "proud" not because the code was clever (it was mostly straightforward download-and-stage logic) but because of what we discovered when we ran it. The limiting factor wasn't our pipeline. It wasn't Azure. It was the outbound bandwidth at the source agencies.

That realization changed how I thought about data ingestion architecture.

The Scope

NOAA, USDA, and NCIS collectively expose decades of public data: weather observations, storm events, climate records, agricultural data, geospatial reference data, and more. The total volume in 2015 was hundreds of gigabytes of CSV, fixed-width, and binary files, organized across FTP sites and HTTP endpoints with varying degrees of documentation.

The business goal: stage all of it into ADLS as a foundation for client data science projects. Instead of each project team going to find and download the data they needed, we'd have it already staged, already organized by zone, already indexed in our metadata store.

The Automation Architecture

The ingestion automation was a C# application that read a catalog metadata table — one row per data feed, containing the source URL/FTP path, the ADLS destination path, the schedule (full reload vs incremental), and the last successful download timestamp. For each active feed, it checked whether a new file was available and downloaded it to ADLS if so.

public class DataFeedIngestor
{
    // _log (a Serilog-style logger) and _adlsClient are assumed to be fields
    // set in the constructor; the helper methods are omitted for brevity.

    public async Task IngestFeedAsync(DataFeed feed, CancellationToken ct)
    {
        // Skip the feed unless the source has newer data than our last download.
        // The token is passed along so a slow probe can be cancelled too.
        var sourceLastModified = await GetSourceLastModifiedAsync(feed.SourceUrl, ct);
        if (sourceLastModified <= feed.LastSuccessfulDownload)
        {
            _log.Information("Feed {FeedName}: no new data since {LastDownload}",
                feed.Name, feed.LastSuccessfulDownload);
            return;
        }

        // Stream the source straight into ADLS
        var destinationPath = BuildDestinationPath(feed, sourceLastModified);
        using var sourceStream = await OpenSourceStreamAsync(feed.SourceUrl, ct);
        await _adlsClient.UploadAsync(destinationPath, sourceStream, ct);

        // Record the new source timestamp so the next run skips this file.
        // Not cancellable on purpose: the upload has already succeeded.
        await UpdateFeedCatalogAsync(feed.Id, sourceLastModified, destinationPath);
        _log.Information("Feed {FeedName}: downloaded to {DestPath}", feed.Name, destinationPath);
    }
}

The catalog-driven approach meant adding a new data feed was a metadata entry, not a code change. The same automation served a hundred feeds or a thousand feeds without modification.
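
To make "a metadata entry, not a code change" concrete, here is a minimal sketch of what one catalog row and the destination-path helper might look like. It is an illustration, not the production code: the `LoadSchedule` enum, the `DestinationRoot` and `IsActive` properties, the string type of `SourceUrl`, and the date-partitioned folder layout are all assumptions. Only `Id`, `Name`, `SourceUrl`, and `LastSuccessfulDownload` appear in the snippet above.

```csharp
using System;
using System.Globalization;
using System.IO;

// Hypothetical schedule values, mirroring "full reload vs incremental".
public enum LoadSchedule { FullReload, Incremental }

// Hypothetical shape of one catalog row (one row per data feed).
public sealed class DataFeed
{
    public int Id { get; set; }
    public string Name { get; set; } = "";
    public string SourceUrl { get; set; } = "";        // HTTP or FTP location
    public string DestinationRoot { get; set; } = "";  // assumed: ADLS folder for this feed
    public LoadSchedule Schedule { get; set; }
    public bool IsActive { get; set; } = true;         // assumed flag for "active feed"
    public DateTimeOffset LastSuccessfulDownload { get; set; }
}

public static class FeedPaths
{
    // Assumed layout: <root>/<yyyy>/<MM>/<dd>/<file name taken from the source URL>.
    // The date is the source's last-modified time, normalised to UTC.
    public static string BuildDestinationPath(DataFeed feed, DateTimeOffset sourceLastModified)
    {
        var datePart = sourceLastModified.UtcDateTime
            .ToString("yyyy/MM/dd", CultureInfo.InvariantCulture);
        var fileName = Path.GetFileName(new Uri(feed.SourceUrl).AbsolutePath);
        return $"{feed.DestinationRoot.TrimEnd('/')}/{datePart}/{fileName}";
    }
}
```

With a layout like this, onboarding a feed is one new row: a name, a source URL, and a destination root. Partitioning by the source timestamp also keeps every version of a file that the agency republishes, which may or may not be what you want for full-reload feeds.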

What We Expected vs What Happened

Before running the initial full load, we estimated the time based on our Azure network throughput — which was fast, measured in hundreds of megabits per second for large file transfers within Azure. We expected the multi-hundred-gigabyte initial load to take a few days.

The load did take days, but neither our pipeline nor Azure was the limiting factor. The FTP servers at NOAA and USDA were rate-limited. NOAA's FTP site, serving thousands of researchers simultaneously, was throttling individual connections to a few megabits per second. Some endpoints were slower still. A few USDA feeds were served from what appeared to be a 1990s-era server, with all the download speed that implies.

Our bottleneck was the source agencies' outbound bandwidth, not ours.

What This Changed About the Architecture

The discovery had three practical implications:

Parallelism at the feed level, not the file level. We couldn't make any individual feed faster, because the throttling happened at the source. But we could run multiple feeds simultaneously. Running 20 feeds in parallel at 2 Mbps each gave the same total throughput as one feed at 40 Mbps and, unlike that single fast connection, was actually achievable. The ingestion service ran feeds in parallel up to a configurable concurrency limit.
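
One common way to get "parallel up to a configurable limit" in C# is a `SemaphoreSlim` used as a gate. The sketch below shows that technique; I'm not claiming it is how the original service was written. `FeedScheduler`, `RunAllAsync`, and the `ingestOne` delegate are hypothetical names, with `ingestOne` standing in for something like `IngestFeedAsync`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class FeedScheduler
{
    // Runs ingestOne for every feed, keeping at most maxConcurrency running at once.
    public static async Task RunAllAsync<TFeed>(
        IEnumerable<TFeed> feeds,
        Func<TFeed, CancellationToken, Task> ingestOne,
        int maxConcurrency,
        CancellationToken ct)
    {
        if (maxConcurrency < 1)
            throw new ArgumentOutOfRangeException(nameof(maxConcurrency));

        using var gate = new SemaphoreSlim(maxConcurrency);

        // Every feed gets a task immediately, but each task waits at the gate
        // until one of the maxConcurrency slots is free.
        var tasks = feeds.Select(async feed =>
        {
            await gate.WaitAsync(ct);
            try
            {
                await ingestOne(feed, ct);
            }
            finally
            {
                gate.Release();
            }
        }).ToList();

        // Waits for every feed to finish, then surfaces any failure.
        await Task.WhenAll(tasks);
    }
}
```

A call might look like `await FeedScheduler.RunAllAsync(activeFeeds, ingestor.IngestFeedAsync, 20, ct);`, with the limit read from configuration. Since each slow source caps its own connection, the limit is what sets total throughput, up to whatever your own side can absorb.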

Resilience over speed. A download that took 8 hours needed retry logic, checkpointing (resume from where it left off if interrupted), and a way to verify completeness after the download. A fast download could just re-download from scratch on failure. A slow download needed to be restartable.
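
The restartable pattern can be sketched like this, under some loud assumptions. The download lands in a local staging file first (the snippet earlier streams straight to ADLS), the source reports an expected length up front, and `openFromOffset` is a hypothetical delegate that opens the source at a byte offset. Over HTTP that would typically be a `Range` request, over FTP a restart offset. The bytes already on disk act as the checkpoint, and the final length check is the completeness test.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class ResumableDownload
{
    // Downloads into stagingPath, resuming from the bytes already staged after
    // each failure. Returns the final length, or throws once attempts run out.
    public static async Task<long> DownloadAsync(
        Func<long, CancellationToken, Task<Stream>> openFromOffset, // hypothetical source opener
        string stagingPath,
        long expectedLength,
        int maxAttempts,
        TimeSpan baseDelay,
        CancellationToken ct)
    {
        long StagedLength() => File.Exists(stagingPath) ? new FileInfo(stagingPath).Length : 0L;

        for (var attempt = 1; attempt <= maxAttempts; attempt++)
        {
            // More bytes than expected means the staged file is not trustworthy.
            if (StagedLength() > expectedLength) File.Delete(stagingPath);

            var offset = StagedLength(); // checkpoint: resume from here
            if (offset < expectedLength)
            {
                try
                {
                    using var source = await openFromOffset(offset, ct);
                    using var target = new FileStream(stagingPath, FileMode.Append, FileAccess.Write);
                    await source.CopyToAsync(target, 81920, ct);
                }
                catch (IOException) when (attempt < maxAttempts)
                {
                    // Connection dropped: keep the partial file and retry below.
                }
            }

            // Completeness check: a short file means the transfer ended early.
            if (StagedLength() == expectedLength) return expectedLength;

            if (attempt < maxAttempts)
            {
                // Exponential backoff: baseDelay, 2x, 4x, ...
                var delay = TimeSpan.FromMilliseconds(
                    baseDelay.TotalMilliseconds * Math.Pow(2, attempt - 1));
                await Task.Delay(delay, ct);
            }
        }

        throw new IOException($"Download incomplete after {maxAttempts} attempts: {stagingPath}");
    }
}
```

A length check only catches truncation. Where an agency publishes checksums alongside its files, comparing those is the stronger completeness test.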

Cache aggressively. Once data was in ADLS, it never needed to be downloaded again unless the source changed. The catalog tracked the source's last-modified timestamp so we only downloaded what was new. Redundant downloads from slow external sources were avoided by design.
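
For HTTP sources, the "has the source changed?" probe can be a `HEAD` request that reads the `Last-Modified` header, so nothing is downloaded just to find out. The sketch below is one possible shape for a helper like `GetSourceLastModifiedAsync`, not the original implementation. `SourceProbe` and `ShouldDownload` are hypothetical names, and downloading when the header is missing is an assumed policy. FTP sources would need a different probe.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public static class SourceProbe
{
    // Asks the server for headers only and returns Last-Modified,
    // or null when the server does not send that header.
    public static async Task<DateTimeOffset?> GetHttpLastModifiedAsync(
        HttpClient http, Uri url, CancellationToken ct)
    {
        using var request = new HttpRequestMessage(HttpMethod.Head, url);
        using var response = await http.SendAsync(request, ct);
        response.EnsureSuccessStatusCode();
        return response.Content.Headers.LastModified;
    }

    // Assumed policy: with no timestamp we cannot prove the cached copy is
    // current, so we download. Otherwise download only when the source is newer.
    public static bool ShouldDownload(
        DateTimeOffset? sourceLastModified, DateTimeOffset lastSuccessfulDownload)
        => sourceLastModified is null || sourceLastModified > lastSuccessfulDownload;
}
```

Against a source throttled to a few megabits per second, a header-only probe costs almost nothing compared with pulling the file again.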

The Lesson: Know Your Bottleneck

The most common optimization mistake is optimizing the wrong component. Our first instinct was to look at our pipeline code for ways to make it faster. The actual bottleneck was outside our system entirely. Before optimizing, measure where the time is going. It's usually not where you expect.
