HA and DR for SQL Server on Azure: AlwaysOn AGs Across Availability Sets

One of my resume items from this period is "Designed and Implemented High Availability and Disaster Recovery Plans for SQL Server." In the Azure IaaS context, this meant SQL Server AlwaysOn Availability Groups running across Azure Availability Sets. It sounds straightforward until you start working through the Azure-specific constraints that don't exist in a physical data center.

Why On-Prem HA Knowledge Doesn't Transfer Directly

AlwaysOn AGs in a physical data center require a Windows Server Failover Cluster (WSFC). WSFC uses heartbeat UDP traffic between nodes to detect failures. In a physical network, this works without special configuration. In Azure's virtual network, UDP multicast — which WSFC uses for node discovery — wasn't supported in 2013. You needed a different approach to get cluster heartbeat working.

The supported pattern was a File Share Witness: a third Azure VM (small, cheap) running a file share that the WSFC nodes used as a quorum tiebreaker. This replaced the disk quorum model that was common on-prem and didn't require multicast traffic.

Availability Sets: The Azure HA Foundation

Azure Availability Sets were the mechanism for ensuring your VMs didn't all land on the same physical host (and thus didn't all fail simultaneously if that host went down). An Availability Set guaranteed that VMs in the set were distributed across:

  • Fault domains — physical racks with separate power and networking. Two VMs in different fault domains won't both fail if a rack power supply dies.
  • Update domains — groups that Azure patches sequentially rather than simultaneously. Azure won't reboot all VMs in an Availability Set at the same time for maintenance.

For an AlwaysOn AG, both nodes needed to be in the same Availability Set but in different fault domains. This was the Azure-equivalent of "put the primary and secondary on different physical racks."

The Internal Load Balancer Requirement

On-prem, an AlwaysOn AG Listener used a virtual IP on the Windows cluster — the cluster moved the VIP to whatever node was primary, and applications connected to the VIP. In Azure, VMs didn't support floating IPs the same way. The solution was an Azure Internal Load Balancer (ILB) in front of the AG Listener.

The ILB directed traffic to the current AG primary using a health probe: a TCP probe that the AG primary responded to and the secondary did not. When a failover occurred, the secondary became the new primary, started responding to the health probe, and the ILB updated its routing. Client applications connected to the ILB's stable IP — they were insulated from the failover.

-- AG Listener configuration (PowerShell, on the primary node)
# After creating the WSFC and AG, add the listener
Add-SqlAvailabilityGroupListener `
    -Name "AGListener" `
    -StaticIp "10.0.0.100/255.255.255.0" `
    -Port 1433 `
    -Path "SQLSERVER:\SQL\SQLPrimary\DEFAULT\AvailabilityGroups\MyAG"

# The ILB is configured in Azure portal to front-end this static IP
# with a TCP health probe on port 59999 (a convention for AG probes)

The Geo-Redundancy Add-On

For DR (not just HA), you needed an asynchronous replica in a second Azure region. This was technically the same AG configuration extended to three replicas: two synchronous in the primary region (for HA with no data loss), one asynchronous in the secondary region (for DR with potential data loss equal to replication lag).

The complication: cross-region VM networking in Azure in 2013 required a VPN gateway connecting the two virtual networks. Setting up the VPN gateway, configuring the routing tables, and ensuring the AG replication traffic could flow between regions added a layer of complexity to the initial deployment and a latency variable to the replication lag that on-prem DR didn't have.

What It Cost and When It Was Worth It

An AlwaysOn AG on Azure in 2013 required: minimum 3 VMs (primary, secondary, file share witness), Premium Storage for each data-bearing VM, a VPN gateway for cross-region DR, and the SQL Server Enterprise licenses that AlwaysOn AGs required. This was not cheap.

For workloads where downtime had measurable business cost — processing delays, contractual SLA penalties, customer-facing availability commitments — the cost was justified. For workloads where a 30-minute recovery from backup was acceptable, SQL Azure's built-in HA was a better answer at a fraction of the cost.

The skill in designing HA/DR for a client wasn't just knowing how to configure AlwaysOn AGs — it was helping the client understand what RTO and RPO they actually needed versus what they said they needed, and designing the minimum architecture that met the real requirement. Most clients asked for "five nines" and actually needed "we can recover in an hour." Those are different architectures with very different price tags. Get the requirements right before you design the solution. As always, I'm here to help.

Read more