Writing Code for a Network You Don't Own: Retry Policies and Connection Resiliency
SQL Azure will drop your connection. Not because something is wrong — because it's a shared service running on infrastructure that moves, fails, patches, and recovers automatically. If your application isn't written to handle transient connection failures, you're going to have a bad time in production. This post is the practical version: here's what transient failures look like, here's the retry pattern, here's how to wire it up.
What "Transient" Actually Means
A transient failure is one that will resolve on its own if you wait a moment and try again. SQL Azure has several categories:
- Throttling errors — your database tier is under load and SQL Azure is telling you to back off. Error 40501: "The service is currently busy." Usually resolves in 10–30 seconds.
- Connection termination — the service dropped your idle connection. Error 10054: "An existing connection was forcibly closed by the remote host." Retry immediately.
- Replica failover — the replica your connection was on is being failed over. Brief window where new connections succeed but existing ones are dropped. Usually resolves in under a minute.
- Resource limit — you've hit the concurrent connection or DTU limit for your tier. Error 10928/10929. Retry with backoff; if it persists, you need to scale your tier.
The key characteristic of all of these: they are temporary conditions. Retrying after a short wait succeeds. Permanent failures — wrong password, insufficient permissions, database doesn't exist — are not transient and should not be retried.
The Transient Fault Handling Application Block
Microsoft shipped the Transient Fault Handling Application Block as part of the Enterprise Library in 2011, specifically for SQL Azure scenarios. It provided two things you needed: a detection strategy (which errors are transient?) and a retry strategy (how long do you wait between attempts?).
// NuGet: Microsoft.WindowsAzure.Configuration + EnterpriseLibrary.TransientFaultHandling.Data
using Microsoft.Practices.EnterpriseLibrary.TransientFaultHandling;
// Build the retry policy once at application startup
var retryStrategy = new ExponentialBackoff(
retryCount: 5,
minBackoff: TimeSpan.FromSeconds(1),
maxBackoff: TimeSpan.FromSeconds(30),
deltaBackoff: TimeSpan.FromSeconds(2));
var retryPolicy = new RetryPolicy<SqlDatabaseTransientErrorDetectionStrategy>(retryStrategy);
// Wrap every database operation
retryPolicy.ExecuteAction(() =>
{
using (var connection = new ReliableSqlConnection(connectionString, retryPolicy))
{
connection.Open();
using (var command = connection.CreateCommand())
{
command.CommandText = "SELECT COUNT(*) FROM dbo.Orders WHERE CustomerId = @id";
command.Parameters.AddWithValue("@id", customerId);
return (int)command.ExecuteScalar();
}
}
});ReliableSqlConnection is the block's connection class — it wraps SqlConnection and applies the retry policy to the connection open itself, not just the query execution. Both layers need coverage.
Exponential Backoff: Why It Matters
The retry strategy matters as much as the detection strategy. If you retry immediately on a throttling error, you'll keep hitting the same throttle. If you retry 100 times in a tight loop, you'll make the problem worse by generating load.
Exponential backoff means each retry waits longer than the previous one: 1 second, 2 seconds, 4 seconds, 8 seconds, up to a configured maximum. This gives the service time to recover and prevents your retry logic from being the thing that takes down the database.
Add jitter — a small random offset to the backoff time — if you have multiple clients that might fail simultaneously. Without jitter, all clients back off for exactly the same duration and then all hit the service at exactly the same moment, which recreates the load spike you were trying to avoid. This is known as the thundering herd problem and it shows up in distributed systems more often than you'd expect.
Connection Strings for SQL Azure
Two settings that mattered in 2011 that weren't always in the default connection string templates:
Server=tcp:yourserver.database.windows.net,1433;
Database=YourDatabase;
User ID=youruser@yourserver;
Password=yourpassword;
Trusted_Connection=False;
Encrypt=True;
Connection Timeout=30;Encrypt=True was required for SQL Azure connections — the service wouldn't accept unencrypted connections. Connection Timeout=30 gave enough time for the connection to succeed through occasional latency spikes without waiting indefinitely. The port specification (,1433) was required explicitly because some network configurations would route to the wrong endpoint without it.
Testing Your Retry Logic
Here's the thing about retry logic: it's easy to write code that looks correct and doesn't actually retry. Test it explicitly by introducing a deliberate failure — temporarily change the connection string to an invalid server, verify the retry behavior fires, confirm the error is correctly classified as transient or permanent, and verify the exponential backoff timing is what you expect.
If your retry logic has never been tested in anger, you don't actually know whether it works. Cloud infrastructure will eventually give you a transient failure at 3 AM. You want to know before then whether your code handles it correctly.
This is the kind of defensive infrastructure code that's invisible when it works and catastrophic when it's missing. Build it in, test it deliberately, and update the retry error list when new transient error codes show up in the documentation. As always, I'm here to help.