ADLA Monitoring: Finding Out Your Job Did Nothing for Six Hours

Shannon Lowder

15 Jul 2015 — 3 min read

The previous post was about the geocoding job that produced zero rows. This post is about the broader problem it revealed: ADLA's monitoring story in 2015 was inadequate for detecting jobs that completed successfully but produced the wrong result. Here's what ADLA showed you, what it didn't show you, and what you had to build yourself.

What ADLA Did Show You

The job view in the Azure Portal gave you:

Job graph — the DAG of vertices, showing which had completed, were running, or were waiting
Vertex-level stats — execution time, input bytes read, output bytes written per vertex
Overall job status — Running, Succeeded, Failed, Cancelled
Resource usage — AU-hours consumed, peak AU utilization
Error messages — if a vertex threw an exception, the error text appeared in the vertex detail

For debugging failures, this was often sufficient. A vertex that threw a null reference exception showed the exception text. A vertex that ran out of memory showed an OOM error. Clear causes, clear fixes.

What ADLA Didn't Show You

Row counts. Byte counts were available at every vertex; row counts were not. A vertex that read 100 million rows and produced 0 rows showed 0 bytes output — which looked identical to a vertex that correctly produced no output for a legitimate reason. Without row count visibility, there was no built-in signal that a join or filter was more aggressive than expected.

Intermediate cardinality. If a multi-stage U-SQL script reduced 100 million rows to 0 rows in stage 2 of 5, stages 3-5 ran correctly — they processed zero rows. All vertices reported success. The job completed. The only signal was the zero-byte output file at the end.
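
One way to make that collapse visible, as a rough sketch: if each stage also wrote its own row count to a small diagnostics file (something you would have to add to the script yourself, since ADLA did not emit it), a few lines of PowerShell could point at the first stage where the data disappeared. The function name, stage names, and counts below are hypothetical and only illustrate the idea.

```powershell
# Hypothetical helper: given per-stage row counts in execution order, return
# the first stage whose output is empty although the previous stage had rows.
function Find-FirstEmptyStage {
    param(
        [Parameter(Mandatory = $true)]
        [object[]]$StageCounts   # objects with Stage (string) and Rows (number)
    )

    $previousRows = $null
    foreach ($entry in $StageCounts) {
        $rows = [long]$entry.Rows
        # Flag a stage only when rows went in (previous stage > 0) and none came out.
        if ($rows -eq 0 -and $null -ne $previousRows -and $previousRows -gt 0) {
            return $entry.Stage
        }
        $previousRows = $rows
    }
    return $null   # no stage collapsed to zero
}

# Illustrative counts only, not measurements from a real job.
$counts = @(
    [pscustomobject]@{ Stage = 'stage1-extract'; Rows = 100000000 },
    [pscustomobject]@{ Stage = 'stage2-join';    Rows = 0 },
    [pscustomobject]@{ Stage = 'stage3-filter';  Rows = 0 }
)
$firstEmpty = Find-FirstEmptyStage -StageCounts $counts
```

With these sample counts, `$firstEmpty` names the join stage, which is where you would start looking. Stages after it are not flagged because they legitimately received no rows.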

Data quality metrics. ADLA knew nothing about what you expected the output to look like. If your job was supposed to produce one row per storm event and produced 1/10th of the expected count because of a data quality issue in the source, nothing in ADLA flagged that discrepancy.

The Monitoring Pattern I Built

For any ADLA job that ran on a schedule or processed data I cared about, I added a post-job validation script that ran after the main job completed:

// geocoding-validation.usql
// Run after the main geocoding job has completed

// Latitude and Longitude are declared nullable (decimal?) so that empty
// fields are read as null and COUNT(column) leaves them out.
@geocoded =
    EXTRACT EventId int, State string, Latitude decimal?, Longitude decimal?
    FROM "/curated/analytics/storm-events-geocoded.csv"
    USING Extractors.Csv(skipFirstNRows: 1);

@stats =
    SELECT COUNT(*)               AS TotalRows,
           COUNT(Latitude)        AS RowsWithLatitude,
           COUNT(Longitude)       AS RowsWithLongitude,
           MIN(Latitude)          AS MinLat,
           MAX(Latitude)          AS MaxLat,
           MIN(Longitude)         AS MinLon,
           MAX(Longitude)         AS MaxLon
    FROM @geocoded;

// Write a header row so the runbook can address columns by name.
OUTPUT @stats
    TO "/staging/diagnostics/geocoding-validation-{0}.txt"
    USING Outputters.Tsv(outputHeader: true);

This ran in minutes (reading the output file, not reprocessing the input). The output gave me: total row count (was it roughly what I expected?), null rates for coordinates (were some events failing to geocode?), and lat/long ranges (were coordinates in valid US ranges, or had some parse error produced garbage values?).
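
Those three questions can be turned into explicit checks. The following is a minimal sketch, not the code I ran at the time: the function name, the 5% null-rate threshold, and the rough bounding box for US coordinates are all assumptions you would tune for your own data. It takes one row of stats with the column names from the validation script and emits one message per problem found.

```powershell
# Hypothetical helper: turn one row of validation stats into problem messages.
# All default thresholds and the bounding box are assumed values.
function Test-GeocodingStats {
    param(
        [Parameter(Mandatory = $true)]
        $Stats,                        # object with TotalRows, RowsWithLatitude, ... MaxLon
        [long]$MinRows = 100000,
        [double]$MaxNullRate = 0.05,
        [double]$LatLow = 18.0,        # rough box around the 50 US states (assumption)
        [double]$LatHigh = 72.0,
        [double]$LonLow = -180.0,
        [double]$LonHigh = -66.0
    )

    $total   = [long]$Stats.TotalRows
    $withLat = [long]$Stats.RowsWithLatitude
    $withLon = [long]$Stats.RowsWithLongitude

    # Was the row count roughly what we expected?
    if ($total -lt $MinRows) {
        "Row count $total is below the expected minimum $MinRows"
    }
    # Were too many events missing a coordinate?
    if ($total -gt 0) {
        $geocoded = [Math]::Min($withLat, $withLon)
        $nullRate = 1.0 - ($geocoded / $total)
        if ($nullRate -gt $MaxNullRate) {
            "Coordinate null rate {0:P1} exceeds {1:P1}" -f $nullRate, $MaxNullRate
        }
    }
    # Were the coordinates inside a plausible range?
    if ($withLat -gt 0 -and ([double]$Stats.MinLat -lt $LatLow -or [double]$Stats.MaxLat -gt $LatHigh)) {
        "Latitude range $($Stats.MinLat)..$($Stats.MaxLat) is outside $LatLow..$LatHigh"
    }
    if ($withLon -gt 0 -and ([double]$Stats.MinLon -lt $LonLow -or [double]$Stats.MaxLon -gt $LonHigh)) {
        "Longitude range $($Stats.MinLon)..$($Stats.MaxLon) is outside $LonLow..$LonHigh"
    }
}

# Made-up sample values, shaped like a row parsed from the validation file.
$sample = [pscustomobject]@{
    TotalRows = '120000'; RowsWithLatitude = '90000'; RowsWithLongitude = '90000'
    MinLat = '19.2'; MaxLat = '95.0'; MinLon = '-160.1'; MaxLon = '-67.3'
}
$problems = @(Test-GeocodingStats -Stats $sample)
```

For the sample row, the null-rate check and the latitude check fire while the row count and longitude checks pass. Wrapping the call in `@()` keeps `$problems` an array even when nothing is wrong, so `$problems.Count -gt 0` is a safe alert condition.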

The Azure Automation Wrapper

For scheduled jobs, I wired the validation script into the Azure Automation runbook that orchestrated the ADLA submissions:

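# PowerShell in Azure Automation
# Assumed to be defined earlier in the runbook: $adlaAccount, $adlsAccount,
# $validationPath (the resolved Data Lake Store path of the validation output)
# and $mail (a hashtable holding To, From and SmtpServer for Send-MailMessage).

# 1. Submit main job (the cmdlet returns a job object, not just the ID)
$mainJob = Submit-AzureRmDataLakeAnalyticsJob -Account $adlaAccount -Name "GeocodingJob" -ScriptPath "geocoding.usql"
Wait-AzureRmDataLakeAnalyticsJob -Account $adlaAccount -JobId $mainJob.JobId | Out-Null

# 2. Check main job status
$mainJob = Get-AzureRmDataLakeAnalyticsJob -Account $adlaAccount -JobId $mainJob.JobId
if ($mainJob.State -ne "Ended" -or $mainJob.Result -ne "Succeeded") {
    Send-MailMessage @mail -Subject "ADLA Geocoding Job FAILED" -Body ($mainJob.ErrorMessage | Out-String)
    exit 1
}

# 3. Submit validation job
$validationJob = Submit-AzureRmDataLakeAnalyticsJob -Account $adlaAccount -Name "GeocodingValidation" -ScriptPath "geocoding-validation.usql"
Wait-AzureRmDataLakeAnalyticsJob -Account $adlaAccount -JobId $validationJob.JobId | Out-Null

# 4. Download the validation output, then check thresholds.
#    Get-Content only reads local files, so copy the file out of the store first.
#    ConvertFrom-Csv assumes the file starts with a header row.
$localStats = Join-Path $env:TEMP "geocoding-validation.txt"
Export-AzureRmDataLakeStoreItem -Account $adlsAccount -Path $validationPath -Destination $localStats -Force | Out-Null
$stats = Get-Content $localStats | ConvertFrom-Csv -Delimiter "`t"
if ([int]$stats.TotalRows -lt 100000) {
    Send-MailMessage @mail -Subject "ADLA Geocoding: Unexpectedly Low Row Count ($($stats.TotalRows))"
}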

The Meta-Lesson

ADLA was a compute engine, not a data quality framework. It ran what you gave it and reported whether the execution succeeded. What "succeeded" meant for your business — correct row counts, valid coordinate ranges, expected null rates — was your responsibility to define and verify. The monitoring gap wasn't a bug in ADLA; it was a reminder that correctness and completion are different things, and you need to verify both. As always, I'm here to help.
