DMVs Without the Hardware: What SQL Azure Monitoring Looks Like

The first time I needed to diagnose a slow query on a SQL Azure database, I opened SSMS and reached for sys.dm_os_wait_stats the way I always did. The query ran. The results were there. And then I spent ten minutes staring at wait stats that made no sense before realizing I'd been looking at cumulative waits since the last replica failover — which had happened six hours ago and reset the baseline I'd always taken for granted.

Monitoring SQL Azure requires unlearning some on-prem habits and building new ones. Here's what the DMV landscape actually looks like in 2012 and what to do about the gaps.

The DMVs That Work As Expected

Start with what you have. These DMVs work in SQL Azure and give you the same information as on-prem:

-- Active sessions and requests
SELECT 
    r.session_id,
    r.status,
    r.wait_type,
    r.wait_time / 1000.0 AS wait_seconds,
    r.total_elapsed_time / 1000.0 AS elapsed_seconds,
    r.cpu_time / 1000.0 AS cpu_seconds,
    r.logical_reads,
    DB_NAME(r.database_id) AS database_name,
    SUBSTRING(t.text, (r.statement_start_offset/2)+1,
        ((CASE r.statement_end_offset WHEN -1 THEN DATALENGTH(t.text)
          ELSE r.statement_end_offset END - r.statement_start_offset)/2)+1) AS statement_text
FROM sys.dm_exec_requests r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) t
WHERE r.session_id > 50
  AND r.session_id <> @@SPID;

-- Top queries by total elapsed time (your query performance baseline)
SELECT TOP 25
    qs.execution_count,
    qs.total_elapsed_time / 1000.0 AS total_elapsed_ms,
    qs.total_elapsed_time / qs.execution_count / 1000.0 AS avg_elapsed_ms,
    qs.total_logical_reads,
    qs.total_logical_reads / qs.execution_count AS avg_logical_reads,
    SUBSTRING(st.text, (qs.statement_start_offset/2)+1,
        ((CASE qs.statement_end_offset WHEN -1 THEN DATALENGTH(st.text)
          ELSE qs.statement_end_offset END - qs.statement_start_offset)/2)+1) AS query_text
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
ORDER BY total_elapsed_time DESC;

These work. Use them.

The Wait Stats Problem

On-prem, your standard practice was to capture wait stats, wait a defined interval (say, 15 minutes), capture again, and subtract to get the waits that accumulated during your observation window. This works because you know when the SQL Server instance last restarted and can use that as your baseline.

In SQL Azure, a replica failover resets sys.dm_os_wait_stats without you knowing it happened. If a failover occurred between your two samples, your delta calculation includes counts from before the failover mixed with counts from after — the result is meaningless.

The workaround I used: capture wait stats on a schedule (every 5 minutes) to an Azure Table Storage account, including a timestamp. When analyzing, look for discontinuities — sudden drops in cumulative wait counts — which indicate a reset happened. Discard deltas that span a reset. It's more work than on-prem but it produces reliable data.

-- Capture wait stats to your monitoring table
INSERT INTO monitoring.WaitStatsSamples
    (CapturedAt, WaitType, WaitingTasksCount, WaitTimeMs, MaxWaitTimeMs)
SELECT 
    GETUTCDATE(),
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    max_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (
    'SLEEP_TASK','WAITFOR','BROKER_TO_FLUSH','BROKER_TASK_STOP',
    'CLR_AUTO_EVENT','DISPATCHER_QUEUE_SEMAPHORE','FT_IFTS_SCHEDULER_IDLE_WAIT',
    'HADR_WORK_QUEUE','HADR_FILESTREAM_IOMGR_IOCOMPLETION','HADR_FILESTREAM_IOMGR_IORETRY',
    'LAZYWRITER_SLEEP','LOGMGR_QUEUE','ONDEMAND_TASK_QUEUE','REQUEST_FOR_DEADLOCK_MONITOR',
    'RESOURCE_QUEUE','SERVER_IDLE_CHECK','SLEEP_DBSTARTUP','SLEEP_DBRECOVER',
    'SLEEP_TEMPDBSTARTUP','SNI_HTTP_ACCEPT','SP_SERVER_DIAGNOSTICS_SLEEP',
    'SQLTRACE_BUFFER_FLUSH','WAIT_XTP_OFFLINE_CKPT_NEW_LOG','XE_DISPATCHER_WAIT',
    'XE_TIMER_EVENT','BROKER_EVENTHANDLER','CHECKPOINT_QUEUE','DBMIRROR_EVENTS_QUEUE',
    'SQLTRACE_INCREMENTAL_FLUSH_SLEEP','WAIT_XTP_CKPTSLOW'
);

Index and Statistics Monitoring

Index usage stats and missing index DMVs work as expected in SQL Azure and are particularly valuable since you can't run the Database Engine Tuning Advisor against a managed service the same way you could against on-prem SQL Server.

-- Missing indexes (what the query optimizer wishes existed)
SELECT 
    mig.index_group_handle,
    mid.index_handle,
    migs.avg_total_user_cost * migs.avg_user_impact * (migs.user_seeks + migs.user_scans) AS improvement_measure,
    'CREATE INDEX ix_' + OBJECT_NAME(mid.object_id) + '_missing_' + 
        CAST(mig.index_group_handle AS VARCHAR) + 
        ' ON ' + mid.statement + 
        ' (' + ISNULL(mid.equality_columns, '') + 
        CASE WHEN mid.equality_columns IS NOT NULL AND mid.inequality_columns IS NOT NULL THEN ',' ELSE '' END +
        ISNULL(mid.inequality_columns, '') + ')' + 
        ISNULL(' INCLUDE (' + mid.included_columns + ')', '') AS create_index_statement,
    migs.user_seeks,
    migs.user_scans,
    migs.avg_total_user_cost,
    migs.avg_user_impact
FROM sys.dm_db_missing_index_groups mig
JOIN sys.dm_db_missing_index_group_stats migs ON mig.index_group_handle = migs.group_handle
JOIN sys.dm_db_missing_index_details mid ON mig.index_handle = mid.index_handle
WHERE migs.avg_total_user_cost * migs.avg_user_impact * (migs.user_seeks + migs.user_scans) > 10
ORDER BY improvement_measure DESC;

What's Not Available

A short list of what you can't see in SQL Azure that you could see on-prem in 2012: OS-level memory and CPU via sys.dm_os_sys_info, I/O at the file level via sys.dm_io_virtual_file_stats, SQL Server process memory details via sys.dm_os_memory_clerks (partial at best), and anything related to SQL Server Agent since it doesn't exist in the managed service.

The Azure portal gave you DTU percentage, database size, and connection counts. Coarse but useful for trending and alerting. Set up portal alerts for DTU over 80% — that's your signal to tune queries or scale the tier, and it's much better to be alerted than to find out via a performance complaint.

The monitoring gaps were real in 2012 and they drove a lot of customers to SQL Server on IaaS VMs instead of the managed service. If you needed deep visibility into memory pressure, I/O patterns, or wait statistics with a stable baseline, a VM gave you all of that. The managed service asked you to trust the platform more and monitor through a narrower window. For workloads where that tradeoff was acceptable — which was most of what I was running for small business clients — it was fine. For demanding production workloads with complex performance characteristics, the visibility gap was a real constraint. As always, I'm here to help.

Read more