ADLA Monitoring: Finding Out Your Job Did Nothing for Six Hours
The previous post was about the geocoding job that produced zero rows. This post is about the broader problem it revealed: ADLA's monitoring story in 2015 was inadequate for detecting jobs that ran correctly but did the wrong thing. Here's what ADLA showed you, what it didn't show you, and what you had to build yourself.
What ADLA Did Show You
The job view in the Azure Portal gave you:
- Job graph — the DAG of vertices, showing which had completed, were running, or were waiting
- Vertex-level stats — execution time, input bytes read, output bytes written per vertex
- Overall job status — Running, Succeeded, Failed, Cancelled
- Resource usage — AU-hours consumed, peak AU utilization
- Error messages — if a vertex threw an exception, the error text appeared in the vertex detail
For debugging failures, this was often sufficient. A vertex that threw a null reference exception showed the exception text. A vertex that ran out of memory showed an OOM error. Clear causes, clear fixes.
What ADLA Didn't Show You
Row counts. Byte counts were available at every vertex; row counts were not. A vertex that read 100 million rows and produced 0 rows showed 0 bytes output — which looked identical to a vertex that correctly produced no output for a legitimate reason. Without row count visibility, there was no built-in signal that a join or filter was more aggressive than expected.
Intermediate cardinality. If a multi-stage U-SQL script reduced 100 million rows to 0 rows in stage 2 of 5, stages 3-5 ran correctly — they processed zero rows. All vertices reported success. The job completed. The only signal was the zero-byte output file at the end.
Data quality metrics. ADLA knew nothing about what you expected the output to look like. If your job was supposed to produce one row per storm event and produced 1/10th of the expected count because of a data quality issue in the source, nothing in ADLA flagged that discrepancy.
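The missing signal is easy to see in miniature. The following is a minimal sketch in Python (not U-SQL, so it can be run locally) of recording rows in and rows out for every stage, then naming the first stage that received rows and emitted none. The stage names, the sample events, and the `state_centroids` lookup are all hypothetical and only illustrate the kind of mismatch that can empty a join.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, object]
Stage = Tuple[str, Callable[[List[Row]], List[Row]]]


def trace_row_counts(rows: List[Row], stages: List[Stage]) -> List[Tuple[str, int, int]]:
    """Run each stage in order and record (stage name, rows in, rows out)."""
    trace = []
    current = list(rows)
    for name, stage in stages:
        result = stage(current)
        trace.append((name, len(current), len(result)))
        current = result
    return trace


def first_collapse(trace: List[Tuple[str, int, int]]) -> Optional[str]:
    """Return the first stage that received rows but emitted none, or None."""
    for name, rows_in, rows_out in trace:
        if rows_in > 0 and rows_out == 0:
            return name
    return None


# Hypothetical data: lowercase state codes never match the uppercase lookup keys.
events = [{"EventId": 1, "State": "tx"}, {"EventId": 2, "State": "ok"}]
state_centroids = {"TX": (31.0, -100.0), "OK": (35.5, -97.5)}

stages = [
    ("filter_has_state", lambda rs: [r for r in rs if r["State"]]),
    ("join_centroids", lambda rs: [r for r in rs if r["State"] in state_centroids]),
    ("project", lambda rs: [{"EventId": r["EventId"]} for r in rs]),
]

trace = trace_row_counts(events, stages)
for name, rows_in, rows_out in trace:
    print(f"{name}: {rows_in} in, {rows_out} out")
print("first collapse:", first_collapse(trace))
```

Every stage here "succeeds", including the ones that process nothing. Only the per-stage counts show that the join is where the data disappeared.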
The Monitoring Pattern I Built
For any ADLA job that ran on a schedule or processed data I cared about, I added a post-job validation script that ran after the main job completed:

    // geocoding-validation.usql
    // Run after the main geocoding job. Reads only that job's output file.

    // Latitude and Longitude are declared nullable (decimal?) so that rows
    // which failed to geocode extract as null instead of failing the job.
    // COUNT(column) then counts only the non-null values.
    @geocoded =
        EXTRACT EventId int,
                State string,
                Latitude decimal?,
                Longitude decimal?
        FROM "/curated/analytics/storm-events-geocoded.csv"
        USING Extractors.Csv(skipFirstNRows: 1);

    @stats =
        SELECT COUNT(*) AS TotalRows,
               COUNT(Latitude) AS RowsWithLatitude,
               COUNT(Longitude) AS RowsWithLongitude,
               MIN(Latitude) AS MinLat,
               MAX(Latitude) AS MaxLat,
               MIN(Longitude) AS MinLon,
               MAX(Longitude) AS MaxLon
        FROM @geocoded;

    OUTPUT @stats
    TO "/staging/diagnostics/geocoding-validation-{0}.txt"
    USING Outputters.Tsv();

This ran in minutes, because it read only the output file instead of reprocessing the input. The output gave me: total row count (was it roughly what I expected?), non-null counts for each coordinate (were some events failing to geocode?), and lat/long ranges (were coordinates in valid US ranges, or had some parse error produced garbage values?).
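To make the aggregates concrete, here is a rough local equivalent in Python, offered as a sketch and not as the original tooling. It assumes the output CSV has a header row with the column names `EventId`, `State`, `Latitude`, and `Longitude`, and that a failed geocode appears as an empty field. The function name `summarize` and the three sample rows are made up for illustration.

```python
import csv
import io
from decimal import Decimal


def summarize(csv_text: str) -> dict:
    """Compute the validation aggregates: counts of non-empty values and ranges."""
    total = 0
    lats, lons = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        total += 1
        # An empty field plays the role of a null: it is counted in the
        # total but skipped by the per-column counts and by MIN/MAX.
        if row["Latitude"]:
            lats.append(Decimal(row["Latitude"]))
        if row["Longitude"]:
            lons.append(Decimal(row["Longitude"]))
    return {
        "TotalRows": total,
        "RowsWithLatitude": len(lats),
        "RowsWithLongitude": len(lons),
        "MinLat": min(lats, default=None),
        "MaxLat": max(lats, default=None),
        "MinLon": min(lons, default=None),
        "MaxLon": max(lons, default=None),
    }


# Hypothetical sample: the third event failed to geocode.
sample = """EventId,State,Latitude,Longitude
1,TX,31.0,-100.0
2,OK,35.5,-97.5
3,KS,,
"""

stats = summarize(sample)
null_rate = 1 - stats["RowsWithLatitude"] / stats["TotalRows"]
print(stats)
print(f"latitude null rate: {null_rate:.1%}")
```

The gap between `TotalRows` and `RowsWithLatitude` is what exposes partial geocoding failures, which is why the total and the per-column counts are reported separately.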
The Azure Automation Wrapper
For scheduled jobs, I wired the validation script into the Azure Automation runbook that orchestrated the ADLA submissions:
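
    # PowerShell in Azure Automation
    # Assumed to be defined earlier in the runbook: $adlaAccount, $adlsAccount,
    # $validationPath (the Data Lake Store path the validation job wrote), and
    # $mail (a hashtable with To, From, and SmtpServer for Send-MailMessage).

    # 1. Submit main job (the cmdlet returns a job object, so keep its JobId)
    $mainJobId = (Submit-AzureRmDataLakeAnalyticsJob -Account $adlaAccount -Name "GeocodingJob" -ScriptPath "geocoding.usql").JobId
    Wait-AzureRmDataLakeAnalyticsJob -Account $adlaAccount -JobId $mainJobId

    # 2. Check main job status
    $mainJob = Get-AzureRmDataLakeAnalyticsJob -Account $adlaAccount -JobId $mainJobId
    if ($mainJob.State -ne "Ended" -or $mainJob.Result -ne "Succeeded") {
        Send-MailMessage @mail -Subject "ADLA Geocoding Job FAILED" -Body ($mainJob.ErrorMessage | Out-String)
        exit 1
    }

    # 3. Submit validation job
    $validationJobId = (Submit-AzureRmDataLakeAnalyticsJob -Account $adlaAccount -Name "GeocodingValidation" -ScriptPath "geocoding-validation.usql").JobId
    Wait-AzureRmDataLakeAnalyticsJob -Account $adlaAccount -JobId $validationJobId

    # 4. Download the validation output (Get-Content reads local files only),
    #    then check thresholds. The TSV has no header row, so name the columns.
    $localStats = Join-Path $env:TEMP "geocoding-validation.txt"
    Export-AzureRmDataLakeStoreItem -Account $adlsAccount -Path $validationPath -Destination $localStats -Force
    $columns = "TotalRows", "RowsWithLatitude", "RowsWithLongitude", "MinLat", "MaxLat", "MinLon", "MaxLon"
    $stats = Get-Content $localStats | ConvertFrom-Csv -Delimiter "`t" -Header $columns
    if ([int]$stats.TotalRows -lt 100000) {
        Send-MailMessage @mail -Subject "ADLA Geocoding: Unexpectedly Low Row Count ($($stats.TotalRows))"
    }

The Meta-Lesson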
ADLA was a compute engine, not a data quality framework. It ran what you gave it and reported whether the execution succeeded. What "succeeded" meant for your business — correct row counts, valid coordinate ranges, expected null rates — was your responsibility to define and verify. The monitoring gap wasn't a bug in ADLA; it was a reminder that correctness and completion are different things, and you need to verify both.
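One way to make that definition explicit is to write the expectations down as thresholds and check the validation stats against them. The sketch below is in Python and is only illustrative. The 100,000-row minimum comes from the runbook above; the 5% null-rate limit and the rough US bounding box (latitude 18 to 72, longitude -180 to -66) are assumptions chosen for the example, and `check_stats` is a hypothetical helper name.

```python
def check_stats(stats, min_rows=100_000, max_null_rate=0.05,
                lat_range=(18.0, 72.0), lon_range=(-180.0, -66.0)):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    total = stats["TotalRows"]
    if total < min_rows:
        problems.append(f"TotalRows {total} is below the minimum of {min_rows}")

    checks = (("Latitude", "Lat", lat_range), ("Longitude", "Lon", lon_range))
    for name, suffix, (low, high) in checks:
        present = stats[f"RowsWith{name}"]
        # Null rate is only defined when there is at least one row.
        if total > 0 and 1 - present / total > max_null_rate:
            problems.append(f"{name} is null in more than {max_null_rate:.0%} of rows")
        lo, hi = stats[f"Min{suffix}"], stats[f"Max{suffix}"]
        # MIN/MAX are None when no non-null values exist, so skip the range check.
        if lo is not None and (lo < low or hi > high):
            problems.append(f"{name} range [{lo}, {hi}] falls outside [{low}, {high}]")
    return problems


# Hypothetical stats for a healthy run and for a run that produced nothing.
healthy = {"TotalRows": 250000, "RowsWithLatitude": 249000, "RowsWithLongitude": 249000,
           "MinLat": 19.5, "MaxLat": 64.8, "MinLon": -158.0, "MaxLon": -67.2}
empty = {"TotalRows": 0, "RowsWithLatitude": 0, "RowsWithLongitude": 0,
         "MinLat": None, "MaxLat": None, "MinLon": None, "MaxLon": None}

print(check_stats(healthy))
print(check_stats(empty))
```

Returning every failed check, instead of stopping at the first, means a single alert can report a low row count and out-of-range coordinates together.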