ADF Monitoring in 2019: Building the Dashboard You Actually Need

The native ADF Monitor view is fine for debugging a specific pipeline run. It's not fine for fleet-wide health monitoring. When I need to know "how are all my ADF pipelines doing this week?" the answer is not "open the Monitor view, click through each pipeline, look at each run individually."

Here's the monitoring architecture I've settled on for production ADF environments, and the KQL queries that actually tell me what I need to know.

Step 1: Enable Diagnostic Logging

This takes 30 seconds and should be the first thing you do with any new ADF instance. In the Azure portal, navigate to your Data Factory → Diagnostic settings → Add diagnostic setting. Route to a Log Analytics workspace. Check all log categories: PipelineRuns, ActivityRuns, TriggerRuns.

Give the logs 5-10 minutes to start flowing. Once they're flowing, everything else in this post is possible.

The log tables you'll query:

  • ADFPipelineRun — one row per pipeline run
  • ADFActivityRun — one row per activity run within a pipeline
  • ADFTriggerRun — one row per trigger fire

Useful KQL Queries

Pipeline Failure Rate — Last 7 Days

ADFPipelineRun
| where TimeGenerated >= ago(7d)
| summarize
    Total = count(),
    Succeeded = countif(Status == 'Succeeded'),
    Failed = countif(Status == 'Failed'),
    Cancelled = countif(Status == 'Cancelled')
    by PipelineName
| extend FailureRate = round(todouble(Failed) / todouble(Total) * 100, 1)
| where Total > 0
| order by FailureRate desc
| project PipelineName, Total, Succeeded, Failed, FailureRate

This gives you the leaderboard of problematic pipelines. Any pipeline with a non-zero failure rate deserves attention.

Copy Activity Throughput — Average MB/s by Pipeline

ADFActivityRun
| where TimeGenerated >= ago(7d)
| where ActivityType == 'Copy'
| where Status == 'Succeeded'
| extend
    DurationSeconds = (End - Start) / 1s,
    DataReadMB = todouble(Output.dataRead) / 1048576,
    DataWrittenMB = todouble(Output.dataWritten) / 1048576
| where DurationSeconds > 0
| summarize
    AvgReadMBps = round(avg(DataReadMB / DurationSeconds), 2),
    AvgWriteMBps = round(avg(DataWrittenMB / DurationSeconds), 2),
    TotalRunsMB = sum(DataReadMB),
    RunCount = count()
    by PipelineName, ActivityName
| order by AvgReadMBps desc

Identifies bottleneck copy activities and validates expected throughput for different source/sink combinations.

Longest-Running Activities — Current Week

ADFActivityRun
| where TimeGenerated >= startofweek(now())
| where Status == 'Succeeded'
| extend DurationMinutes = round((End - Start) / 1m, 1)
| top 20 by DurationMinutes desc
| project
    PipelineName,
    ActivityName,
    ActivityType,
    DurationMinutes,
    Start

Good for identifying SLA risks before they become incidents. If a Copy Activity that normally takes 8 minutes is showing up here at 45 minutes, something changed.

Failed Activities by Error Category

ADFActivityRun
| where TimeGenerated >= ago(7d)
| where Status == 'Failed'
| extend ErrorMessage = tostring(Error.message)
| extend ErrorCategory = case(
    ErrorMessage contains 'timeout', 'Timeout',
    ErrorMessage contains 'connection', 'Connection',
    ErrorMessage contains 'authentication', 'Authentication',
    ErrorMessage contains 'throttl', 'Throttling',
    ErrorMessage contains 'not found', 'NotFound',
    'Other'
)
| summarize Count = count() by ErrorCategory, ActivityType
| order by Count desc

Categorizes errors so you can prioritize fixes. If 80% of your failures are connection errors, that's a different investigation than if 80% are throttling errors.

The Custom Audit Table Pattern

In addition to Azure Monitor logs, I run a stored procedure activity at the end of each pipeline that writes to a custom audit table:

CREATE TABLE etl.PipelineAudit (
    AuditId         BIGINT IDENTITY(1,1) PRIMARY KEY,
    PipelineName    NVARCHAR(256) NOT NULL,
    RunId           NVARCHAR(36) NOT NULL,
    StartTime       DATETIME2 NOT NULL,
    EndTime         DATETIME2 NOT NULL,
    DurationSeconds INT NOT NULL,
    RowsRead        BIGINT NULL,
    RowsWritten     BIGINT NULL,
    Status          NVARCHAR(50) NOT NULL,
    ErrorMessage    NVARCHAR(MAX) NULL,
    CreatedAt       DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME()
);

The pipeline passes @pipeline().RunId, @pipeline().TriggerTime, and activity output metrics as stored procedure parameters. This audit table is queryable from Power BI, T-SQL, or any tool that talks to Azure SQL. It's business-facing visibility in a format that business stakeholders can query directly, without needing access to Log Analytics.

The Azure Monitor Workbook

Azure Monitor Workbooks let you build custom dashboards from KQL queries. I've built a workbook that includes:

  • 24-hour pipeline health tile (Succeeded / Failed / Running counts)
  • 7-day failure rate by pipeline (bar chart)
  • Average Copy Activity throughput trend (line chart)
  • Current running pipelines and their duration so far
  • Recent failure details with error messages (table)

This workbook is the first thing I open for an ADF environment status check. Bookmark it. Share it with the team. Put it on a TV in the operations area if you have one.

What's Still Missing

Native ADF alerting without Azure Monitor configuration. Today, setting up an alert for "pipeline X failed more than 3 times in the last hour" requires creating an Azure Monitor alert rule pointing at a Log Analytics query. That works, but it's not discoverable from the ADF UI. A data engineer looking at the ADF Monitor view shouldn't need to navigate to Azure Monitor to configure alerts on their own pipelines.

In-progress row counts for running Copy Activities. The Monitor view shows you that a Copy Activity has been running for 12 minutes. It does not show you how many rows have been copied so far. Is it 10% done or 90% done? You don't know until it finishes. This makes it hard to project completion time for long-running jobs.

These gaps are solvable with workarounds (the audit table pattern gets you progress metrics if you update it mid-run, which requires staging logic). But they're gaps in the native experience that Microsoft should address.

The monitoring story in 2019 is workable. It's not turnkey. You have to invest an afternoon building the Log Analytics integration and the workbook. That afternoon is worth it. As always, if you want the workbook template or help setting up the alerting rules, I'm here to help.

Read more