Commit 7d82cca

Merge pull request #4365 from sushantrane/main
Disaster Recovery Guidance for Azure Cosmos DB
2 parents 4978a48 + 0eda449 commit 7d82cca

6 files changed

+220
-0
lines changed

articles/cosmos-db/TOC.yml

Lines changed: 2 additions & 0 deletions
@@ -707,6 +707,8 @@
707707
items:
708708
- name: High availability and reliability
709709
href: /azure/reliability/reliability-cosmos-db-nosql?context=/azure/cosmos-db/context/context
710+
- name: Disaster Recovery guidance
711+
href: disaster-recovery-guidance.md
710712
- name: Global distribution
711713
items:
712714
- name: Global distribution overview
Lines changed: 218 additions & 0 deletions
@@ -0,0 +1,218 @@
1+
---
2+
title: Disaster recovery guidance
3+
titleSuffix: Azure Cosmos DB
4+
description: Learn about disaster recovery guidance when using Azure Cosmos DB, including how to detect outages and recover your data.
5+
author: sushantrane
6+
ms.author: srane
7+
ms.service: azure-cosmos-db
8+
ms.topic: conceptual
9+
ms.date: 02/10/2026
10+
appliesto:
11+
- ✅ NoSQL
12+
- ✅ MongoDB
13+
- ✅ Apache Cassandra
14+
- ✅ Apache Gremlin
15+
- ✅ Table
16+
---
17+
18+
# Disaster recovery guidance for Azure Cosmos DB
19+
20+
Azure Cosmos DB provides high availability through a comprehensive suite of built-in business continuity and disaster recovery (BCDR) capabilities. The service offers multiple availability guarantees depending on configuration, with SLAs of up to 99.999% for accounts that have multiple write regions or that use the per-partition automatic failover (PPAF) capability. Azure Cosmos DB also provides turnkey disaster recovery capabilities that enable quick recovery in the event of a regional outage.
21+
22+
Though Azure Cosmos DB continuously strives to provide high availability, the service might occasionally experience outages that cause an account to become unavailable and impact an application. When service monitoring detects widespread connectivity errors, failures, or performance issues, the service automatically declares an outage to keep you informed.
23+
24+
This article provides guidance on preparing for potential service outages and the actions to take during an outage to ensure business continuity.
25+
26+
## Service outage
27+
28+
In the event of an Azure Cosmos DB service outage, you can find details related to the outage in the following places:
29+
30+
### Azure portal banner
31+
32+
If a subscription is identified as impacted, a **Service Issue** outage alert appears in the Azure portal **Notifications**.
33+
34+
[![Diagram that shows an example of a service issue notification in the Azure portal.](media/disaster-recovery-guidance/notification-service-issue-example.png)](media/disaster-recovery-guidance/notification-service-issue-example.png#lightbox)
35+
36+
### Help + support or Support + troubleshooting
37+
38+
When you create a support ticket from **Help + support** or **Support + troubleshooting**, there's information about any issues impacting your resources. Select **View outage details** for more information and a summary of impact. There's also an alert in the **New support request** page.
39+
[![Diagram that shows an example of Help + support in the Azure portal.](media/disaster-recovery-guidance/help-support-service-health-notification.png)](media/disaster-recovery-guidance/help-support-service-health-notification.png#lightbox)
40+
41+
### Service Health
42+
43+
The **Service Health** page in the Azure portal contains information about Azure data center status globally. Search for `service health` in the search bar in the Azure portal, then view **Service issues** in the **Active events** category. You can also view the health of individual resources in the **Resource health** page of any resource under the **Help** menu.
44+
[![Diagram that shows an example of Service Health in the Azure portal.](media/disaster-recovery-guidance/service-health-service-issues-example-map.png)](media/disaster-recovery-guidance/service-health-service-issues-example-map.png#lightbox)
45+
46+
### Email notification
47+
48+
If alerts are configured, an email notification is sent from `azure-noreply@microsoft.com` when a service outage impacts a subscription and resource. For more information on service health alerts, see [Receive activity log alerts on Azure service notifications using Azure portal](/azure/service-health/alerts-activity-log-service-notifications-portal).
49+
50+
### Service metrics
51+
52+
You can [monitor and configure alerts for Azure Cosmos DB availability metrics](monitor.md) in the Azure portal. Azure Cosmos DB provides comprehensive metrics for monitoring availability, request rates, request units consumption, and storage.
53+
54+
## When to initiate disaster recovery during an outage
55+
56+
In the event of a service outage impacting application resources, consider the following courses of action:
57+
58+
- Azure teams work to restore service availability as quickly as possible, but depending on the root cause, recovery can take longer than expected. If an application can tolerate downtime, wait for the recovery to complete; no action is required. For updates and the latest information about the outage, check the health of individual resources on the **Resource health** page under the **Help** menu. After the region recovers, application availability is restored.
59+
60+
- If the outage duration approaches your recovery time objective (RTO), decide whether to keep waiting for service recovery or to initiate disaster recovery. Base the decision on the application's tolerance for downtime and the potential business liability of prolonged unavailability.
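
The trade-off in the second bullet can be sketched as a small helper. This is only an illustration, not part of any Azure SDK: the function name, the `expected_recovery` estimate, and the decision rule itself are assumptions made for the example.

```python
from datetime import timedelta


def should_initiate_dr(outage_elapsed: timedelta,
                       rto: timedelta,
                       expected_recovery: timedelta) -> bool:
    """Return True when waiting any longer would risk missing the RTO.

    Hypothetical rule: start disaster recovery once the downtime already
    incurred plus the time the recovery itself is expected to take
    reaches the RTO.
    """
    return outage_elapsed + expected_recovery >= rto


# 20 minutes into an outage, 1-hour RTO, recovery estimated at 15 minutes.
early = should_initiate_dr(timedelta(minutes=20), timedelta(hours=1), timedelta(minutes=15))
# 50 minutes into the same outage.
late = should_initiate_dr(timedelta(minutes=50), timedelta(hours=1), timedelta(minutes=15))
print(early, late)
```

A rule like this is most useful when the recovery estimate comes from a rehearsed failover drill rather than a guess.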
61+
62+
## Outage recovery guidance
63+
64+
The recovery approach for an Azure Cosmos DB account depends on the account configuration. This section provides detailed guidance based on different account types and outage scenarios.
65+
66+
## Recovery options by account configuration
67+
68+
The following table summarizes the recovery options available based on the Azure Cosmos DB account configuration and the type of outage:
69+
70+
| Account configuration | Outage scenario | Recovery approach | Section reference |
71+
|---|---|---|---|
72+
| Single-region account | Region outage | Wait for service restoration or request account restore from backup to another region. | [Single-region accounts](#single-region-accounts) |
73+
| Single-write region, multiple-region account | Read region outage | SDK reroutes to available regions based on configuration; consider taking the region offline for strong consistency in two-region accounts. | [Read region outage](#read-region-outage) |
74+
| Single-write region, multiple-region account | Write region outage (with PPAF enabled) | Automatic partition-level failover. | [Accounts enabled with per-partition automatic failover](#accounts-enabled-with-per-partition-automatic-failover-ppaf-preview) |
75+
| Single-write region, multiple-region account | Write region outage (without PPAF) | Perform a region offline operation. | [Region offline operation](#region-offline-operation) |
76+
| Multiple-write region account | Any region outage | Automatic routing to healthy regions via SDK configurations; no manual intervention required. | [Multiple-write region accounts](#multiple-write-region-accounts) |
77+
| Any account configuration | Data corruption or accidental deletion | Point-in-time restore (continuous backup) or restore from periodic backup. | [Continuous backup and point-in-time restore](#continuous-backup-and-point-in-time-restore), [Periodic backup and restore](#periodic-backup-and-restore) |
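
For runbook automation, the table above could be encoded as a simple lookup. The sketch below is an assumption-laden illustration: the configuration and scenario labels used as keys are hypothetical names invented for this example, and the returned strings paraphrase the table rows.

```python
# Keys are hypothetical labels for the rows of the table above.
RECOVERY_OPTIONS = {
    ("single-region", "region outage"):
        "Wait for service restoration or request account restore from backup to another region.",
    ("single-write multi-region", "read region outage"):
        "SDK reroutes to available regions; consider taking the region offline "
        "for strong consistency in two-region accounts.",
    ("single-write multi-region", "write region outage with PPAF"):
        "Automatic partition-level failover.",
    ("single-write multi-region", "write region outage without PPAF"):
        "Perform a region offline operation.",
    ("multi-write", "region outage"):
        "Automatic routing to healthy regions via SDK configuration; no manual intervention required.",
}

# The last table row applies to every account configuration.
DATA_LOSS_SCENARIO = "data corruption or accidental deletion"
DATA_LOSS_APPROACH = "Point-in-time restore (continuous backup) or restore from periodic backup."


def recovery_approach(account_configuration: str, outage_scenario: str) -> str:
    """Look up the recovery approach for a configuration and scenario."""
    if outage_scenario == DATA_LOSS_SCENARIO:
        return DATA_LOSS_APPROACH
    try:
        return RECOVERY_OPTIONS[(account_configuration, outage_scenario)]
    except KeyError:
        raise ValueError(
            f"No guidance for {account_configuration!r} / {outage_scenario!r}"
        ) from None


print(recovery_approach("single-write multi-region", "write region outage without PPAF"))
```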
78+
79+
---
80+
### Single-region accounts
81+
82+
A single-region account with **Availability Zones** can maintain read-write availability when an outage affects only one availability zone. However, if multiple availability zones or the entire region is impacted, single-region accounts lose read and write access until service is restored.
83+
84+
**Recommended actions during a single-region outage:**
85+
86+
1. **Wait for service restoration** - Monitor the Service Health page and the account's Resource Health for updates. Azure teams work to restore service as quickly as possible.
87+
88+
1. **Consider account restore** - If the outage duration exceeds your RTO, request a restore to a different region through Azure Support. See [Periodic backup and restore](#periodic-backup-and-restore) for details.
89+
90+
1. **Plan for multi-region deployment** - To prevent future single-region outages, consider deploying to multiple regions.
91+
92+
### Single-write region, multiple-region accounts
93+
94+
For accounts configured with a single write region and one or more read regions, the impact and recovery approach depends on which region is affected.
95+
96+
#### Read region outage
97+
98+
If the account is configured as zone-redundant in the affected read region, it can sustain an availability zone outage without impacting read availability. For regional outages affecting a read region, consider these actions:
99+
100+
**SDK configuration for read resilience:**
101+
102+
- **Configure preferred regions list** - Ensure that a preferred regions list is used in the Azure Cosmos DB SDK configuration. The SDK automatically retries operations in another region if a preferred region becomes unavailable. During a read region outage, the SDK detects the region outage through backend response codes, marks it as unavailable, and routes future operations to the next available region in the preference list. Ensure that the preferred regions list is set correctly and aligns with business and latency requirements. For detailed guidance, see [Troubleshoot Azure Cosmos DB SDK availability](troubleshoot-sdk-availability.md).
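
The routing behavior described above can be modeled in a few lines. This is a simplified model of the idea, not SDK code: the real SDK detects outages from backend responses and manages its own state, whereas here the set of unavailable regions is passed in by hand, and the function name and region names are placeholders.

```python
def pick_region(preferred_regions, unavailable_regions):
    """Return the first preferred region that isn't marked unavailable.

    Returns None when every preferred region is unavailable.
    """
    for region in preferred_regions:
        if region not in unavailable_regions:
            return region
    return None


preferred = ["West US 2", "East US", "North Europe"]  # example ordering only
unavailable = set()

print(pick_region(preferred, unavailable))  # first choice while everything is healthy

unavailable.add("West US 2")                # simulate a read region outage
print(pick_region(preferred, unavailable))  # falls through to the next entry
```

The order of the list is what determines where traffic lands during an outage, which is why it should reflect latency and business requirements.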
103+
104+
**Consistency level considerations:**
105+
106+
Reads should typically remain unaffected during a regional outage if the preferred regions list is configured correctly, as the Azure Cosmos DB SDK automatically reroutes requests to the next available region. However, specific consistency levels or configurations can lead to disruptions:
107+
108+
- **Strong Consistency** - For accounts with only two regions, a read region outage impacts write availability because strong consistency requires [dynamic quorum](consistency-levels.md#dynamic-quorum) to maintain strict consistency guarantees. With only one operational region, quorum can't be achieved, leading to disruptions in both read and write operations.
109+
- **Mitigation**: Perform a [region offline operation](how-to-manage-database-account.yml#perform-forced-failover-for-your-azure-cosmos-db-account) for the affected read region to restore availability. If service-managed failover is enabled, Azure Cosmos DB performs the region offline operation automatically, but the timing depends on how the outage progresses. For faster recovery, perform the region offline operation yourself.
110+
111+
- **Bounded Staleness Consistency** - When the read region has an outage and the staleness window is exceeded, write operations for the partitions in the affected region are also impacted. This occurs because Bounded Staleness consistency relies on maintaining a specific staleness threshold between regions. When this threshold is breached, the system can no longer guarantee consistency for writes.
112+
- **Mitigation**: Perform a [region offline operation](how-to-manage-database-account.yml#perform-forced-failover-for-your-azure-cosmos-db-account) for the affected read region to restore availability.
113+
114+
To minimize these risks, consider deploying additional regions and reviewing consistency level settings to ensure alignment with the application's high availability and performance requirements.
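
The two cases above can be condensed into a small check for use in a readiness review. The sketch only restates the rules given in this section; the function name and the consistency level labels are assumptions, and it doesn't model how quorum or staleness tracking work inside the service.

```python
def writes_at_risk_on_read_region_outage(consistency: str,
                                         region_count: int,
                                         staleness_window_exceeded: bool = False) -> bool:
    """Whether a read region outage can also disrupt writes, per the rules above."""
    if consistency == "strong":
        # Two-region accounts can't keep quorum with one region down.
        return region_count == 2
    if consistency == "bounded_staleness":
        # Writes are impacted once the staleness window is exceeded.
        return staleness_window_exceeded
    # This article describes no write impact for other consistency levels.
    return False


print(writes_at_risk_on_read_region_outage("strong", region_count=2))
print(writes_at_risk_on_read_region_outage("strong", region_count=3))
print(writes_at_risk_on_read_region_outage("bounded_staleness", 2, staleness_window_exceeded=True))
```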
115+
116+
#### Write region outage
117+
118+
If the account is configured as zone-redundant in the write region, it can sustain an availability zone outage without impacting write availability. When a regional outage affects the write region, you have several options:
119+
120+
##### Accounts enabled with per-partition automatic failover (PPAF) (Preview)
121+
122+
If an account is enabled with [per-partition automatic failover (PPAF)](how-to-configure-per-partition-automatic-failover.md), the service automatically manages failovers for partitions in an error state, ensuring service continuity during partial or complete regional outages.
123+
124+
125+
##### Region offline operation
126+
127+
If an account isn't enabled with PPAF, perform a region offline operation to restore availability.
128+
129+
The region offline operation removes the affected region from the account configuration, allowing the service to restore availability to the remaining regions. To initiate this operation, [take the region offline](how-to-manage-database-account.yml#perform-forced-failover-for-your-azure-cosmos-db-account) for the affected write region.
130+
131+
[![Diagram that shows an example of the region offline operation in the Azure portal.](media/disaster-recovery-guidance/offline-region-failover.png)](media/disaster-recovery-guidance/offline-region-failover.png#lightbox)
132+
133+
134+
> [!NOTE]
135+
> If using private endpoints with an Azure Cosmos DB account, ensure that the private DNS is routing correctly after the offline region operation. For detailed guidance, see [Failover considerations for private endpoints](failover-considerations-for-private-endpoints.md).
136+
137+
138+
**Region restoration:**
139+
140+
- The Azure Cosmos DB team completes the online region operation, which might take three or more business days, depending on the account size.
141+
- Once the region is brought online, it's added as a read region. If this was the write region before the outage, manually change the write region to restore it as the write region when appropriate. This might require coordinating other changes made to the application or service during the outage.
142+
143+
144+
##### Service-managed failover
145+
146+
Service-managed failover allows Azure Cosmos DB to automatically perform region offline operations for affected regions to preserve business continuity.
147+
148+
**Configuration:**
149+
150+
- **Azure portal**: Navigate to the Azure Cosmos DB account, select **Replicate data globally**, and enable **Service Managed Failover**.
151+
- **Azure PowerShell**: Follow the instructions to enable [service managed failover](manage-with-powershell.md#enable-automatic-failover) via PowerShell cmdlets.
152+
- **Azure CLI**: Follow the instructions to enable [service managed failover](manage-with-cli.md#enable-service-managed-failover) via Azure CLI commands.
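
For scripted setups, the Azure CLI option can be wrapped in a small helper that builds the command without running it. This is a sketch under assumptions: the `--enable-automatic-failover` flag on `az cosmosdb update` reflects my understanding of the Azure CLI and should be checked against the linked article, and the account and resource group names are placeholders.

```python
import shlex


def enable_service_managed_failover_command(account_name: str, resource_group: str) -> list:
    """Build (but don't run) the Azure CLI command as an argument list."""
    return [
        "az", "cosmosdb", "update",
        "--name", account_name,
        "--resource-group", resource_group,
        # Assumed flag name; verify against the Azure CLI reference.
        "--enable-automatic-failover", "true",
    ]


command = enable_service_managed_failover_command("my-cosmos-account", "my-resource-group")
print(shlex.join(command))
```

Building the argument list separately from execution makes the command easy to review or log before passing it to something like `subprocess.run`.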
153+
154+
> [!IMPORTANT]
155+
> Even with service-managed failover enabled, the timing of automatic failover depends on the nature and progression of the outage, and failover might take an hour or more. To restore write availability quickly during an outage, perform the [region offline operation](#region-offline-operation) instead of waiting for service-managed failover.
156+
157+
**Region restoration:**
158+
159+
- The Azure Cosmos DB team completes the online region operation, which might take three or more business days, depending on the outage extent and resolution. If the region is taking longer than expected to come back online, create a [support ticket](/azure/azure-portal/supportability/how-to-create-azure-support-request) to request that it be brought online.
160+
- Once the region is brought online, it's added as a read region. If this was the write region before the outage, manually switch the write region when appropriate.
161+
162+
##### Operations to avoid during region outages
163+
164+
> [!WARNING]
165+
> Don't perform any control plane operations on the affected region during an outage, because they result in account inconsistency and delay recovery. Examples of control plane operations to avoid include:
166+
> - Changing the write region (manual failover) or modifying failover priority
167+
> - Updating the account to a multi-write configuration
168+
> - Updating consistency levels or other account settings
169+
> - Updating private endpoint configurations or network settings
170+
> - Updating account throughput or scaling operations
171+
> - Any other operation that modifies the account configuration or region settings
172+
173+
### Multiple-write region accounts
174+
175+
If an Azure Cosmos DB account is configured with multiple write regions, the service automatically handles regional failures without requiring manual intervention. Applications can continue reading and writing data in available regions with minimal disruption.
176+
177+
**Best practices for multi-write accounts:**
178+
179+
- **Configure SDK** - Ensure that clients are configured to use multiple write regions, for example with `ApplicationRegion` or `ApplicationPreferredRegions` in the .NET SDK. For more information, see [Configure multi-region writes in your application](how-to-multi-master.md).
180+
181+
- **Route traffic away from affected regions** - Use regional application health probes with Azure Traffic Manager or Azure Front Door to automatically route traffic away from the affected region.
182+
183+
**Key characteristics:**
184+
- No manual failover required during regional outages.
185+
- Applications automatically connect to healthy regions based on SDK configuration.
186+
- Write availability is maintained as long as at least one region is available.
187+
188+
For more information, see [Distribute data globally](distribute-data-globally.md) and [Multi-region writes](multi-region-writes.md).
189+
190+
### Continuous backup and point-in-time restore
191+
192+
If data recovery is needed due to accidental deletion or modification (data corruption), use Azure Cosmos DB's continuous backup mode with point-in-time restore (PITR). This feature allows restoring an account to any point in time within the retention period.
193+
194+
For detailed steps on performing a point-in-time restore, see [Continuous backup with point-in-time restore](continuous-backup-restore-introduction.md).
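
Before requesting a point-in-time restore, it can help to confirm that the target timestamp falls inside the retention window. The following is a minimal sketch, not an Azure API: the function name is hypothetical, and the 7-day retention value is only an example to be replaced with the account's configured retention.

```python
from datetime import datetime, timedelta, timezone


def is_within_retention(restore_point: datetime,
                        now: datetime,
                        retention: timedelta) -> bool:
    """True when restore_point lies between (now - retention) and now."""
    return now - retention <= restore_point <= now


now = datetime(2026, 2, 10, 12, 0, tzinfo=timezone.utc)
retention = timedelta(days=7)  # example value only

print(is_within_retention(now - timedelta(days=2), now, retention))
print(is_within_retention(now - timedelta(days=9), now, retention))
```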
195+
196+
### Periodic backup and restore
197+
198+
If an account uses periodic backup mode, request a restore from Azure Support. Periodic backups are taken automatically at regular intervals, with both the interval (every four hours by default) and retention count (two most recent backups by default) being configurable.
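
To see what the defaults mean in practice, the sketch below lists which backups would be retained and which one would serve a given restore time. It's a simplified model with hypothetical function names: it assumes backups land exactly on the interval, which a real account doesn't guarantee.

```python
from datetime import datetime, timedelta


def retained_backups(latest_backup: datetime, interval: timedelta, retention_count: int) -> list:
    """Timestamps of the retained backups, newest first."""
    return [latest_backup - i * interval for i in range(retention_count)]


def backup_for_restore(restore_time: datetime, backups: list):
    """Newest retained backup taken at or before restore_time, or None."""
    candidates = [b for b in backups if b <= restore_time]
    return max(candidates) if candidates else None


# Defaults from the paragraph above: every four hours, two most recent backups.
backups = retained_backups(datetime(2026, 2, 10, 12, 0), timedelta(hours=4), 2)
print(backups)

# A restore to 11:00 has to use the 08:00 backup; a restore to 07:00 has none.
print(backup_for_restore(datetime(2026, 2, 10, 11, 0), backups))
print(backup_for_restore(datetime(2026, 2, 10, 7, 0), backups))
```

Under this model, changes made after the chosen backup aren't recoverable from it, which is why the interval and retention count should be set with the application's recovery point needs in mind.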
199+
200+
To request a restore from periodic backups:
201+
202+
1. Create a support request in the Azure portal.
203+
204+
1. Select **Backup and Restore** as the problem type.
205+
206+
1. Provide the specific time for the restore.
207+
208+
For more information, see [Configure Azure Cosmos DB account with periodic backup](periodic-backup-restore-introduction.md).
209+
210+
## Related content
211+
212+
To learn more about disaster recovery and business continuity, review:
213+
214+
- [Distribute data globally with Azure Cosmos DB](distribute-data-globally.md)
215+
- [Continuous backup with point-in-time restore in Azure Cosmos DB](continuous-backup-restore-introduction.md)
216+
- [Multi-region writes in Azure Cosmos DB](multi-region-writes.md)
217+
- [Manage an Azure Cosmos DB account](how-to-manage-database-account.yml)
218+
- [Failover considerations for private endpoints](failover-considerations-for-private-endpoints.md)