Database Health Checks When You Can't Touch the Box

A database health check used to mean RDP-ing to the server, checking Windows event logs, reviewing SQL Agent job history, running sp_Blitz, and having a look at perfmon counters while you were there. I ran these for clients regularly. Moving to Azure removed half of those touchpoints. The box was gone — or rather, the box was there, but it belonged to Microsoft and they weren't letting me in.

What I developed instead was a structured health check process built entirely from what SQL Azure exposed: DMVs, system catalog views, and the Azure portal. It was different from the on-prem process, but once I'd rebuilt the checklist, it was actually more repeatable — because it ran as queries rather than relying on what someone remembered to click through in a GUI.

The SQL Azure Health Check Checklist

This is the actual query sequence I ran during health checks. Each section has a clear signal: run the query, interpret the result, act if the threshold is crossed.

1. DTU Utilization Trend

DTU (Database Transaction Unit) is SQL Azure's blended measure of CPU, I/O, and memory. Pull the recent trend from your monitoring data or the portal. Sustained utilization above 80% means you're approaching the tier ceiling — either optimize the top queries or scale up the tier before users start noticing.

2. Top Queries by Resource Consumption

-- Run this, review the top 10, compare to last health check
SELECT TOP 10
    qs.execution_count,
    qs.total_elapsed_time / qs.execution_count / 1000.0 AS avg_elapsed_ms,
    qs.total_logical_reads / qs.execution_count AS avg_logical_reads,
    qs.total_worker_time / qs.execution_count / 1000.0 AS avg_cpu_ms,
    SUBSTRING(st.text, (qs.statement_start_offset/2)+1,
        ((CASE qs.statement_end_offset WHEN -1 THEN DATALENGTH(st.text)
          ELSE qs.statement_end_offset END - qs.statement_start_offset)/2)+1) AS query_text
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
ORDER BY total_logical_reads DESC;

New query at the top since last check? Find out who wrote it and when it started running. That's almost always the explanation for a sudden DTU spike.

3. Missing Index Candidates

Run the missing index query from the monitoring post. Any index with an improvement measure above 1,000 and user_seeks above 100 is a candidate to evaluate. Don't auto-create every missing index — evaluate the columns, the table size, and whether the index would help more than the write overhead it creates.

4. Unused Indexes

-- Indexes that are being maintained but never used for reads
SELECT 
    OBJECT_NAME(i.object_id) AS table_name,
    i.name AS index_name,
    i.type_desc,
    ius.user_seeks + ius.user_scans + ius.user_lookups AS total_reads,
    ius.user_updates AS total_writes
FROM sys.indexes i
LEFT JOIN sys.dm_db_index_usage_stats ius 
    ON i.object_id = ius.object_id AND i.index_id = ius.index_id
    AND ius.database_id = DB_ID()
WHERE i.type_desc <> 'HEAP'
  AND i.is_primary_key = 0
  AND i.is_unique_constraint = 0
  AND (ius.user_seeks + ius.user_scans + ius.user_lookups) = 0
  AND ius.user_updates > 100
ORDER BY ius.user_updates DESC;

An index that has zero reads and thousands of writes is pure overhead. Candidates for removal — but verify with the application owner before dropping anything.

5. Table Sizes and Growth

-- Table sizes — track these between health checks for unexpected growth
SELECT 
    t.name AS table_name,
    p.rows AS row_count,
    SUM(a.total_pages) * 8 / 1024.0 AS total_mb,
    SUM(a.used_pages) * 8 / 1024.0 AS used_mb,
    (SUM(a.total_pages) - SUM(a.used_pages)) * 8 / 1024.0 AS unused_mb
FROM sys.tables t
JOIN sys.indexes i ON t.object_id = i.object_id
JOIN sys.partitions p ON i.object_id = p.object_id AND i.index_id = p.index_id
JOIN sys.allocation_units a ON p.partition_id = a.container_id
WHERE t.is_ms_shipped = 0
GROUP BY t.name, p.rows
ORDER BY total_mb DESC;

6. Database Size vs Tier Limit

SQL Azure databases had hard size limits per tier. A database approaching its tier limit would start throwing errors. Check sys.database_files or the portal's database size metric and make sure you have at least 20% headroom.

The Health Check as a Consulting Deliverable

What changed when health checks became query-driven: I could run them remotely, from anywhere, without scheduling an on-site visit. The results were reproducible — the same queries run by different people produced the same data. And I could track trends over time by saving the results, which revealed gradual degradation patterns that a one-time snapshot missed.

The health check became a service I could offer on a recurring schedule — monthly, quarterly, or when clients reported performance complaints — with a consistent output format. The prescriptive actions that followed were grounded in specific query data rather than in whatever I happened to notice during an on-site session.

If you're running SQL Azure environments for clients and your health check process is still "log in and look around," this is worth formalizing. The repeatability is worth more than you expect. As always, I'm here to help.

Read more