Scaling Out with Azure VMs: When One SQL Server Isn't Enough
At some point in almost every growing Azure deployment, the conversation shifted from "how do we make this SQL Server more available" to "how do we make it handle more load." Scale up — bigger VM size, more memory — was the first lever. It was often enough. But it had a ceiling: there were only so many VM sizes, and the biggest VM available in 2013 was still one server. When the workload outgrew one server, scale-out was the answer, and scale-out for SQL Server required a deliberate architectural choice.
Read Replicas: The Lowest-Risk Scale-Out
AlwaysOn AG secondary replicas could be configured for read-only access. If your workload was read-heavy — reporting queries, dashboards, analytics against the operational database — you could offload reads to the secondary while writes continued against the primary. This was effectively free: you already had the secondary for HA purposes, and enabling read access cost nothing additional.
The catch: you had to explicitly route read workloads to the secondary. Application connection strings needed to specify ApplicationIntent=ReadOnly, and the AG Listener needed to be configured to route ReadOnly connections to the secondary. This didn't happen automatically.
-- Connection string for read-only workloads
Server=AGListenerName,1433;
Database=YourDatabase;
ApplicationIntent=ReadOnly;
User ID=reportuser;
Password=...;
-- On the secondary, verify read-only routing is configured
SELECT replica_server_name, read_only_routing_url
FROM sys.availability_read_only_routing_lists;This pattern worked well for separating OLTP writes from reporting reads. It didn't help if the write load itself was the bottleneck.
Partitioning the Write Workload: Sharding
When write throughput exceeded what a single SQL Server could handle, the options were limited: sharding (distributing data across multiple SQL Server instances by a partition key) or moving to a different data store designed for horizontal write scale.
Sharding in SQL Azure was manual in 2012-2013. Microsoft later released the Elastic Database client library to help, but in the early era, sharding required application-level logic to route writes and reads to the correct shard based on a partition key — typically customer ID, tenant ID, or date range.
The classic sharding pattern:
- Choose a partition key that distributes load evenly (avoid hotspots — don't shard by status if 90% of records are in the same status)
- Build or use a shard map: a lookup table that maps partition key ranges to database endpoints
- Route all data access through a data access layer that consults the shard map and directs the operation to the correct database
- Handle cross-shard queries with fan-out (query all shards, aggregate results in the application layer) — expensive, minimize these
Sharding was and is complex. The operational overhead of managing multiple database instances, the complexity of cross-shard queries, and the limitations of transactions that span shards are real costs. Sharding made sense for workloads that genuinely couldn't fit on one server. For workloads that fit with proper indexing and query optimization, sharding was technical debt masquerading as engineering.
When to Consider a Different Store
Some scale-out problems were better solved by reconsidering the data store rather than scaling the existing one. If the write bottleneck was specifically a high-velocity event stream — user activity, sensor data, log ingestion — SQL Server on Azure VMs was not the right store at scale. Azure Table Storage, Azure Event Hubs, or DocumentDB handled those patterns more naturally and at lower cost than sharding SQL Server.
The evaluation question: is the bottleneck in the relational engine itself, or in the volume of data arriving? If it's the relational engine (complex queries, large transactions, referential integrity overhead), optimizing SQL Server was the right path. If it's pure volume of simple writes, a different store was worth evaluating seriously.
Scale-out decisions are architecture decisions, and architecture decisions made under load pressure are frequently wrong. Make the scale-out plan before you need it, not when you're in a performance incident at 2 AM. As always, I'm here to help.