Azure Cosmos DB is a globally distributed, multi-model NoSQL database that provides low-latency and high-throughput access to data. When developing applications in Go that interact with Cosmos DB using the Azure SDK for Go, it’s critical to validate how your application behaves under error conditions—such as throttling, timeouts, or network failures.

In this article, you’ll learn how to simulate errors using custom transports and retry policies in Go, and how to test and improve your error-handling and retry logic using these tools.

Understanding Cosmos DB SDK for Go

The Go SDK for Azure Cosmos DB is built atop the Azure Core pipeline, which allows custom HTTP pipeline behaviors via:

  • Transport Layer Injection: Customize how HTTP requests are made (e.g., simulate network failures).

  • Retry Policy Injection: Define how retries behave when requests fail (e.g., simulate throttling retries).

We’ll leverage both techniques to simulate controlled error scenarios for robust integration testing.

Setup and Dependencies

To follow along, you need:

  • Go 1.20+

  • Azure SDK for Go

  • A Cosmos DB account (optional for real testing)

Install the required packages:

bash
go get github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos

If you want to test locally without an actual Cosmos DB instance, you can simulate responses using HTTP transport mocking.

Creating the Cosmos DB Client

Here’s a basic setup for connecting to Cosmos DB using the SDK:

go

package main

import (
	"log"
	"os"

	"github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos"
)

func createClient() *azcosmos.Client {
	endpoint := os.Getenv("COSMOS_ENDPOINT")
	key := os.Getenv("COSMOS_KEY")

	cred, err := azcosmos.NewKeyCredential(key)
	if err != nil {
		log.Fatalf("Failed to create credential: %v", err)
	}

	client, err := azcosmos.NewClientWithKey(endpoint, cred, nil)
	if err != nil {
		log.Fatalf("Failed to create Cosmos DB client: %v", err)
	}
	return client
}
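The two environment variables read above might be set like this for local runs (the values are placeholders, not real credentials):

```shell
export COSMOS_ENDPOINT="https://<your-account>.documents.azure.com:443/"
export COSMOS_KEY="<primary-key>"
```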

Simulating Errors Using Custom Transports

To simulate errors like throttling (HTTP 429), timeouts, or service failures, override the default transport with a custom implementation. The client's Transport option accepts any value satisfying azcore's policy.Transporter interface, which requires a single method: Do(req *http.Request) (*http.Response, error).

Example: Simulating Throttling (HTTP 429)

go

// ThrottlingTransport satisfies policy.Transporter and answers every
// request with a simulated 429. Requires the net/http, io, and strings imports.
type ThrottlingTransport struct{}

func (t *ThrottlingTransport) Do(req *http.Request) (*http.Response, error) {
	// Simulate a 429 Too Many Requests error
	resp := &http.Response{
		StatusCode: http.StatusTooManyRequests,
		Status:     "429 Too Many Requests",
		Body:       io.NopCloser(strings.NewReader("Rate limit exceeded")),
		Header:     make(http.Header),
		Request:    req,
	}
	resp.Header.Set("Retry-After", "1")
	return resp, nil
}

Integrate it into the Cosmos DB client:

go
// Requires: import "github.com/Azure/azure-sdk-for-go/sdk/azcore"
func createClientWithThrottlingTransport() *azcosmos.Client {
	endpoint := os.Getenv("COSMOS_ENDPOINT")
	key := os.Getenv("COSMOS_KEY")

	cred, err := azcosmos.NewKeyCredential(key)
	if err != nil {
		log.Fatalf("Failed to create credential: %v", err)
	}

	// Transport lives on the embedded azcore.ClientOptions.
	opts := &azcosmos.ClientOptions{
		ClientOptions: azcore.ClientOptions{
			Transport: &ThrottlingTransport{},
		},
	}
	client, err := azcosmos.NewClientWithKey(endpoint, cred, opts)
	if err != nil {
		log.Fatalf("Error creating Cosmos client with transport: %v", err)
	}
	return client
}

Now, all operations will simulate throttling responses.

Custom Retry Policies

The Go SDK does not accept a fully custom retry policy object; instead, the built-in retry policy is tuned through azcore's policy.RetryOptions, whose ShouldRetry hook lets you decide per response whether another attempt is made. That is enough to observe and steer retries on specific status codes or transient network errors.

Example: Logging Retry Attempts

go
// Requires: import "github.com/Azure/azure-sdk-for-go/sdk/azcore/policy"
retryOptions := policy.RetryOptions{
	MaxRetries: 3,
	RetryDelay: time.Second,
	ShouldRetry: func(resp *http.Response, err error) bool {
		if resp != nil && resp.StatusCode == http.StatusTooManyRequests {
			fmt.Println("Simulated retry due to 429 throttling")
			return true
		}
		return false
	},
}

Integrate it like this:

go
opts := &azcosmos.ClientOptions{
	ClientOptions: azcore.ClientOptions{
		Retry: retryOptions,
	},
}
client, err := azcosmos.NewClientWithKey(endpoint, cred, opts)

This prints a log line every time the pipeline decides to retry a simulated 429.

End-to-End Retry Test Example

Let’s bring it together with a test that verifies retry behavior in response to simulated throttling:

go
func simulateInsertAndRetry(client *azcosmos.Client, dbName, containerName string) {
	ctx := context.TODO()

	item := map[string]interface{}{
		"id":   "12345",
		"name": "Simulated Item",
	}
	itemData, err := json.Marshal(item)
	if err != nil {
		log.Fatalf("Failed to marshal item: %v", err)
	}

	pk := azcosmos.NewPartitionKeyString("12345")
	container, err := client.NewContainer(dbName, containerName)
	if err != nil {
		log.Fatalf("Failed to get container: %v", err)
	}

	_, err = container.CreateItem(ctx, pk, itemData, nil)
	if err != nil {
		fmt.Printf("Request failed as expected: %v\n", err)
	} else {
		fmt.Println("Unexpected success: retries worked.")
	}
}

Because the ThrottlingTransport always answers with 429, this call should exhaust whatever retry budget you configured and ultimately fail; watch the logs to confirm that each retry fired before the error surfaced.

Simulating Timeouts and Network Errors

Here’s a transport that mimics timeouts:

go

// TimeoutTransport satisfies policy.Transporter and fails every request
// with a client-side deadline error.
type TimeoutTransport struct{}

func (t *TimeoutTransport) Do(req *http.Request) (*http.Response, error) {
	return nil, context.DeadlineExceeded
}

You can test how your application behaves on client-side timeouts by injecting this transport.

Observing Retry Metrics

To fully observe retries:

  • Use context with deadline to bound total retry time.

  • Instrument metrics or logs inside the custom retry policy to track retries, delays, and failures.

Example:

go
// CountingRetry counts every retry decision made by the ShouldRetry hook.
type CountingRetry struct {
	RetryCount int
}

func (c *CountingRetry) ShouldRetry(resp *http.Response, err error) bool {
	c.RetryCount++
	log.Printf("Retry decision #%d", c.RetryCount)
	// Only keep retrying on throttling; returning true unconditionally
	// would retry every failure until MaxRetries is exhausted.
	return resp != nil && resp.StatusCode == http.StatusTooManyRequests
}

Wire it in through the embedded options, for example: opts.Retry = policy.RetryOptions{ShouldRetry: counter.ShouldRetry, RetryDelay: 500 * time.Millisecond}.

Unit Testing With Simulated Errors

Use Go’s httptest server to simulate full Cosmos DB endpoints with custom responses:

go
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
w.WriteHeader(429)
w.Header().Set("Retry-After", "1")
fmt.Fprintln(w, "Simulated throttle")
}))
defer server.Close()
// Inject server.URL as the endpoint for testing.

This is useful for mocking full lifecycle behaviors in CI/CD pipelines.

Best Practices for Error Handling

  1. Fail Fast with Budgeted Retries: Always set max retry count or total time budget.

  2. Log & Track Retry Attempts: Helps detect hidden latencies in production.

  3. Test Throttling Logic: Cosmos DB enforces Request Unit (RU) budgets; test that your application degrades gracefully when it exceeds them.

  4. Use Exponential Backoff: Azure SDK supports this by default, but tune it via options.

Full Example: Throttling + Retry

go
func main() {
	client := createClientWithThrottlingTransport()
	simulateInsertAndRetry(client, "MyDatabase", "MyContainer")
}

This runs your application logic against simulated throttling errors, invoking the retry mechanism as designed.

Conclusion

Simulating errors in distributed systems—especially cloud-native services like Azure Cosmos DB—is not just a best practice; it is a critical necessity for building fault-tolerant, production-grade applications. In the context of Go, the Azure SDK provides powerful mechanisms for achieving this through custom transports and custom retry policies. By learning how to inject controlled failures such as HTTP 429 throttling, timeouts, or network disconnects, developers can verify and fine-tune their retry logic, failure recovery strategies, and observability pipelines.

This approach moves error handling from an abstract concept into a concrete, testable behavior. Rather than waiting for errors to occur in production, developers can now proactively simulate them during integration and CI/CD testing phases. This not only increases system reliability but also builds confidence in how services will respond under real-world cloud pressure—when quotas are hit, services are throttled, or regional outages occur.

Using the techniques shown—such as mocking HTTP transports to return custom responses or injecting retry policies that count or log attempts—you can:

  • Validate if your retry logic kicks in under Cosmos DB throttling (HTTP 429) and recovers gracefully.

  • Measure how your system behaves under repeated failures and whether exponential backoff is tuned appropriately.

  • Ensure that deadlines and timeouts are respected, preventing requests from hanging indefinitely.

  • Simulate infrastructure-level issues like dropped connections or transient DNS errors, which are hard to reproduce otherwise.

In addition to resilience testing, this methodology supports observability and maintainability. Custom retry policies can emit metrics (retry count, total wait time, failure types), which feed into dashboards and alerting systems. Over time, this helps teams detect trends in operational issues and optimize retry budgets or partition key strategies to reduce hotspots and contention.

On a broader level, this strategy aligns with modern chaos engineering principles—introducing controlled disruption to learn how systems behave under stress. While chaos engineering is often associated with large-scale distributed systems, applying these ideas at the SDK and transport layer gives even smaller Go applications a robust foundation against unpredictable cloud behavior.

To summarize, simulating Cosmos DB errors in Go enables developers to:

  • Test retry logic deterministically without relying on actual cloud failure conditions.

  • Design more resilient applications that can gracefully handle throttling, timeouts, and transient faults.

  • Shorten feedback loops by integrating simulated errors into test suites and CI/CD pipelines.

  • Improve system observability, giving engineering teams data to act on and optimize.

In today’s era of distributed computing, preparing for failure is no longer optional—it’s expected. The patterns and techniques covered in this article empower you to write Go applications that are not only functional but resilient, observable, and cloud-native by design. By leveraging the flexibility of the Azure SDK for Go and taking full control over HTTP transport and retry behavior, you lay the foundation for systems that can withstand and recover from real-world failure conditions, gracefully and predictably.

Now is the time to bring resilience testing into your development workflow. Start small by simulating a throttled response. Gradually expand into complex error scenarios. And soon, your Go applications will be ready to thrive—even when the cloud pushes back.