| 1 | +--- |
| 2 | +title: Disaster recovery guidance |
| 3 | +titleSuffix: Azure Cosmos DB |
| 4 | +description: Learn about disaster recovery guidance when using Azure Cosmos DB, including how to detect outages and recover your data. |
| 5 | +author: sushantrane |
| 6 | +ms.author: srane |
| 7 | +ms.service: azure-cosmos-db |
| 8 | +ms.topic: conceptual |
| 9 | +ms.date: 02/10/2026 |
| 10 | +appliesto: |
| 11 | + - ✅ NoSQL |
| 12 | + - ✅ MongoDB |
| 13 | + - ✅ Apache Cassandra |
| 14 | + - ✅ Apache Gremlin |
| 15 | + - ✅ Table |
| 16 | +--- |
| 17 | + |
| 18 | +# Disaster recovery guidance for Azure Cosmos DB |
| 19 | + |
| 20 | +Azure Cosmos DB provides high availability through a suite of built-in business continuity and disaster recovery (BCDR) capabilities. The service offers multiple availability guarantees depending on configuration, with SLAs of up to 99.999% for accounts with multiple write regions or accounts that use the per-partition automatic failover (PPAF) capability. Azure Cosmos DB also provides turnkey disaster recovery capabilities that enable quick recovery in the event of a regional outage. |
| 21 | + |
| 22 | +Though Azure Cosmos DB continuously strives to provide high availability, the service might occasionally experience outages that cause an account to become unavailable and impact an application. When service monitoring detects widespread connectivity errors, failures, or performance issues, the service automatically declares an outage to keep you informed. |
| 23 | + |
| 24 | +This article provides guidance on preparing for potential service outages and the actions to take during an outage to ensure business continuity. |
| 25 | + |
| 26 | +## Service outage |
| 27 | + |
| 28 | +In the event of an Azure Cosmos DB service outage, you can find details related to the outage in the following places: |
| 29 | + |
| 30 | +### Azure portal banner |
| 31 | + |
| 32 | +If a subscription is identified as impacted, a Service Issue outage alert appears in the Azure portal notifications. |
| 33 | + |
| 34 | +[](media/disaster-recovery-guidance/notification-service-issue-example.png#lightbox) |
| 35 | + |
| 36 | +### Help + support or Support + troubleshooting |
| 37 | + |
| 38 | +When you create a support ticket from **Help + support** or **Support + troubleshooting**, there's information about any issues impacting your resources. Select **View outage details** for more information and a summary of impact. There's also an alert in the **New support request** page. |
| 39 | +[](media/disaster-recovery-guidance/help-support-service-health-notification.png#lightbox) |
| 40 | + |
| 41 | +### Service Health |
| 42 | + |
| 43 | +The **Service Health** page in the Azure portal contains information about Azure data center status globally. Search for `service health` in the search bar in the Azure portal, then view **Service issues** in the **Active events** category. You can also view the health of individual resources in the **Resource health** page of any resource under the **Help** menu. |
| 44 | +[](media/disaster-recovery-guidance/service-health-service-issues-example-map.png#lightbox) |
| 45 | + |
| 46 | +### Email notification |
| 47 | + |
| 48 | +If alerts are configured, an email notification is sent from `azure-noreply@microsoft.com` when a service outage impacts a subscription and resource. For more information on service health alerts, see [Receive activity log alerts on Azure service notifications using Azure portal](/azure/service-health/alerts-activity-log-service-notifications-portal). |
| 49 | + |
| 50 | +### Service metrics |
| 51 | + |
| 52 | +You can [monitor and configure alerts for Azure Cosmos DB availability metrics](monitor.md) in the Azure portal. Azure Cosmos DB provides comprehensive metrics for monitoring availability, request rates, request units consumption, and storage. |
| 53 | + |
| 54 | +## When to initiate disaster recovery during an outage |
| 55 | + |
| 56 | +In the event of a service outage impacting application resources, consider the following courses of action: |
| 57 | + |
| 58 | +- Azure teams work diligently to restore service availability as quickly as possible, but depending on the root cause, recovery can sometimes take longer. If an application can tolerate downtime, wait for the recovery to complete; no action is required. For updates and the latest information about the outage, check the **Resource health** page of the affected resource under the **Help** menu. After the region recovers, application availability is restored. |
| 59 | + |
| 60 | +- If the outage duration approaches your recovery time objective (RTO), decide whether to wait for service recovery or initiate disaster recovery. Depending on the application's tolerance for downtime and potential business liability, make an informed decision about how to respond to prolonged unavailability. |
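
The wait-or-recover decision above can be sketched as a simple time check. This is a minimal illustration, not part of any Azure SDK: the function name `should_initiate_recovery`, the example timestamps, and the `recovery_duration` estimate are all hypothetical, and a real decision would also weigh business liability and the outage updates published in Service Health.

```python
from datetime import datetime, timedelta, timezone


def should_initiate_recovery(outage_start, now, rto, recovery_duration):
    """Return True when waiting any longer would risk missing the RTO.

    Hypothetical helper: recovery_duration is your own estimate of how long
    the chosen recovery action takes to complete.
    """
    elapsed = now - outage_start
    # Time left before the RTO is breached if recovery started right now.
    remaining = rto - elapsed
    return remaining <= recovery_duration


# Illustrative values only: a 2-hour RTO and a recovery action
# that is estimated to take 30 minutes.
start = datetime(2026, 2, 10, 9, 0, tzinfo=timezone.utc)
rto = timedelta(hours=2)
recovery = timedelta(minutes=30)

early = should_initiate_recovery(start, start + timedelta(minutes=45), rto, recovery)
late = should_initiate_recovery(start, start + timedelta(minutes=95), rto, recovery)
print(early, late)
```

In this sketch, the check stays negative while there's still enough time to both wait and recover, and turns positive once the remaining time drops to the estimated recovery duration.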
| 61 | + |
| 62 | +## Outage recovery guidance |
| 63 | + |
| 64 | +The recovery approach for an Azure Cosmos DB account depends on the account configuration. This section provides detailed guidance based on different account types and outage scenarios. |
| 65 | + |
| 66 | +## Recovery options by account configuration |
| 67 | + |
| 68 | +The following table summarizes the recovery options available based on the Azure Cosmos DB account configuration and the type of outage: |
| 69 | + |
| 70 | +| Account configuration | Outage scenario | Recovery approach | Section reference | |
| 71 | +|---|---|---|---| |
| 72 | +| Single-region account | Region outage | Wait for service restoration or request account restore from backup to another region. | [Single-region accounts](#single-region-accounts) | |
| 73 | +| Single-write region, multiple-region account | Read region outage | SDK reroutes to available regions based on configuration; consider taking the region offline for strong consistency in two-region accounts. | [Read region outage](#read-region-outage) | |
| 74 | +| Single-write region, multiple-region account | Write region outage (with PPAF enabled) | Automatic partition-level failover. | [Accounts enabled with per-partition automatic failover](#accounts-enabled-with-per-partition-automatic-failover-ppaf-preview) | |
| 75 | +| Single-write region, multiple-region account | Write region outage (without PPAF) | Perform a region offline operation. | [Region offline operation](#region-offline-operation) | |
| 76 | +| Multiple-write region account | Any region outage | Automatic routing to healthy regions via SDK configurations; no manual intervention required. | [Multiple-write region accounts](#multiple-write-region-accounts) | |
| 77 | +| Any account configuration | Data corruption or accidental deletion | Point-in-time restore (continuous backup) or restore from periodic backup. | [Continuous backup and point-in-time restore](#continuous-backup-and-point-in-time-restore), [Periodic backup and restore](#periodic-backup-and-restore) | |
| 78 | + |
| 79 | +--- |
| 80 | +### Single-region accounts |
| 81 | + |
| 82 | +A single-region account with **Availability Zones** can maintain read-write availability when an outage affects only one availability zone. However, if multiple availability zones or the entire region is impacted, single-region accounts lose read and write access until service is restored. |
| 83 | + |
| 84 | +**Recommended actions during a single-region outage:** |
| 85 | + |
| 86 | +1. **Wait for service restoration** - Monitor the Service Health page and the account's Resource Health for updates. Azure teams work to restore service as quickly as possible. |
| 87 | + |
| 88 | +1. **Consider account restore** - If the outage duration exceeds your RTO, request a restore to a different region through Azure Support. See [Periodic backup and restore](#periodic-backup-and-restore) for details. |
| 89 | + |
| 90 | +1. **Plan for multi-region deployment** - To prevent future single-region outages, consider deploying to multiple regions. |
| 91 | + |
| 92 | +### Single-write region, multiple-region accounts |
| 93 | + |
| 94 | +For accounts configured with a single write region and one or more read regions, the impact and recovery approach depend on which region is affected. |
| 95 | + |
| 96 | +#### Read region outage |
| 97 | + |
| 98 | +If the account is configured as zone-redundant in the affected read region, it can sustain an availability zone outage without impacting read availability. For regional outages affecting a read region, consider these actions: |
| 99 | + |
| 100 | +**SDK configuration for read resilience:** |
| 101 | + |
| 102 | +- **Configure preferred regions list** - Ensure that a preferred regions list is used in the Azure Cosmos DB SDK configuration. The SDK automatically retries operations in another region if a preferred region becomes unavailable. During a read region outage, the SDK detects the region outage through backend response codes, marks it as unavailable, and routes future operations to the next available region in the preference list. Ensure that the preferred regions list is set correctly and aligns with business and latency requirements. For detailed guidance, see [Troubleshoot Azure Cosmos DB SDK availability](troubleshoot-sdk-availability.md). |
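
The routing behavior described above can be modeled in a few lines. This is a simplified sketch of the idea only, not the SDK's actual implementation: the `pick_region` function and the region names are hypothetical, and the real SDK also handles retries, endpoint refresh, and marking regions available again.

```python
def pick_region(preferred_regions, unavailable):
    """Return the first preferred region not marked unavailable, else None."""
    for region in preferred_regions:
        if region not in unavailable:
            return region
    return None


# Illustrative preference order; use the regions configured on your account.
preferred = ["West US 2", "East US", "North Europe"]
unavailable = set()

first_choice = pick_region(preferred, unavailable)

# Simulate the client marking the first region unavailable after failed requests.
unavailable.add("West US 2")
fallback = pick_region(preferred, unavailable)

print(first_choice, "->", fallback)
```

The order of the list matters: in this model, the region that takes over during an outage is always the next available entry, so the list should reflect the latency and business priorities of the application.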
| 103 | + |
| 104 | +**Consistency level considerations:** |
| 105 | + |
| 106 | +Reads should typically remain unaffected during a regional outage if the preferred regions list is configured correctly, as the Azure Cosmos DB SDK automatically reroutes requests to the next available region. However, specific consistency levels or configurations can lead to disruptions: |
| 107 | + |
| 108 | +- **Strong Consistency** - For accounts with only two regions, a read region outage impacts write availability because strong consistency requires [dynamic quorum](consistency-levels.md#dynamic-quorum) to maintain strict consistency guarantees. With only one operational region, quorum can't be achieved, leading to disruptions in both read and write operations. |
| 109 | + - **Mitigation**: Perform a [region offline operation](how-to-manage-database-account.yml#perform-forced-failover-for-your-azure-cosmos-db-account) for the affected read region to restore availability. If service-managed failover is enabled, Azure Cosmos DB performs the region offline operation automatically, but the timing depends on how the outage progresses. For faster recovery, perform the operation manually. |
| 110 | + |
| 111 | +- **Bounded Staleness Consistency** - When the read region has an outage and the staleness window is exceeded, write operations for the partitions in the affected region are also impacted. This occurs because Bounded Staleness consistency relies on maintaining a specific staleness threshold between regions. When this threshold is breached, the system can no longer guarantee consistency for writes. |
| 112 | + - **Mitigation**: Perform a [region offline operation](how-to-manage-database-account.yml#perform-forced-failover-for-your-azure-cosmos-db-account) for the affected read region to restore availability. |
| 113 | + |
| 114 | +To minimize these risks, consider deploying additional regions and reviewing consistency level settings to ensure alignment with the application's high availability and performance requirements. |
| 115 | + |
| 116 | +#### Write region outage |
| 117 | + |
| 118 | +If the account is configured as zone-redundant in the write region, it can sustain an availability zone outage without impacting write availability. When a regional outage affects the write region, you have several options: |
| 119 | + |
| 120 | +##### Accounts enabled with per-partition automatic failover (PPAF) (Preview) |
| 121 | + |
| 122 | +If an account is enabled with [per-partition automatic failover (PPAF)](how-to-configure-per-partition-automatic-failover.md), the service automatically manages failovers for partitions in an error state, ensuring service continuity during partial or complete regional outages. |
| 123 | + |
| 124 | + |
| 125 | +##### Region offline operation |
| 126 | + |
| 127 | +If an account isn't enabled with PPAF, perform a region offline operation to restore availability. |
| 128 | + |
| 129 | +The region offline operation removes the affected region from the account configuration, allowing the service to restore availability to the remaining regions. To initiate this operation, [take the region offline](how-to-manage-database-account.yml#perform-forced-failover-for-your-azure-cosmos-db-account) for the affected write region. |
| 130 | + |
| 131 | +[](media/disaster-recovery-guidance/offline-region-failover.png#lightbox) |
| 132 | + |
| 133 | + |
| 134 | +> [!NOTE] |
| 135 | +> If using private endpoints with an Azure Cosmos DB account, ensure that the private DNS is routing correctly after the offline region operation. For detailed guidance, see [Failover considerations for private endpoints](failover-considerations-for-private-endpoints.md). |
| 136 | + |
| 137 | + |
| 138 | +**Region restoration:** |
| 139 | + |
| 140 | +- The Azure Cosmos DB team completes the online region operation, which might take three or more business days, depending on the account size. |
| 141 | +- Once the region is brought online, it's added as a read region. If this was the write region before the outage, manually change the write region to restore it as the write region when appropriate. This might require coordinating other changes made to the application or service during the outage. |
| 142 | + |
| 143 | + |
| 144 | +##### Service-managed failover |
| 145 | + |
| 146 | +Service-managed failover allows Azure Cosmos DB to automatically perform region offline operations for affected regions to preserve business continuity. |
| 147 | + |
| 148 | +**Configuration:** |
| 149 | + |
| 150 | +- **Azure portal**: Navigate to the Azure Cosmos DB account, select **Replicate data globally**, and enable **Service Managed Failover**. |
| 151 | +- **Azure PowerShell**: Follow the instructions to enable [service-managed failover](manage-with-powershell.md#enable-automatic-failover) via PowerShell cmdlets. |
| 152 | +- **Azure CLI**: Follow the instructions to enable [service-managed failover](manage-with-cli.md#enable-service-managed-failover) via Azure CLI commands. |
| 153 | + |
| 154 | +> [!IMPORTANT] |
| 155 | +> Even with service-managed failover enabled, the timing of automatic failover depends on the nature and progression of the outage, and failover might take one hour or more. To quickly restore write availability during an outage, perform the [region offline operation](#region-offline-operation) instead of waiting for service-managed failover. |
| 156 | + |
| 157 | +**Region restoration:** |
| 158 | + |
| 159 | +- The Azure Cosmos DB team completes the online region operation, which might take three or more business days, depending on the outage extent and resolution. If the region takes longer than expected to come back online, create a [support ticket](/azure/azure-portal/supportability/how-to-create-azure-support-request) to request the online region operation. |
| 160 | +- Once the region is brought online, it's added as a read region. If this was the write region before the outage, manually switch the write region when appropriate. |
| 161 | + |
| 162 | +##### Operations to avoid during region outages |
| 163 | + |
| 164 | +> [!WARNING] |
| 165 | +> Don't perform control plane operations on the affected region during an outage, because they result in account inconsistency and delay recovery. Examples of control plane operations to avoid include: |
| 166 | +> - Changing the write region (manual failover) or modifying failover priority |
| 167 | +> - Updating the account to a multi-write configuration |
| 168 | +> - Updating consistency levels or other account settings |
| 169 | +> - Updating private endpoint configurations or network settings |
| 170 | +> - Updating account throughput or performing scaling operations |
| 171 | +> - Any other operation that modifies the account configuration or region settings |
| 172 | + |
| 173 | +### Multiple-write region accounts |
| 174 | + |
| 175 | +If an Azure Cosmos DB account is configured with multiple write regions, the service automatically handles regional failures without requiring manual intervention. Applications can continue reading and writing data in available regions with minimal disruption. |
| 176 | + |
| 177 | +**Best practices for multi-write accounts:** |
| 178 | + |
| 179 | +- **Configure SDK** - Ensure that clients are configured to use multiple-write regions with `ApplicationRegion` or `PreferredRegions`. For more information, see [Configure multi-region writes in your application](how-to-multi-master.md). |
| 180 | + |
| 181 | +- **Route traffic away from affected regions** - Use regional application health probes with Azure Traffic Manager or Azure Front Door to automatically route traffic away from the affected region. |
| 182 | + |
| 183 | +**Key characteristics:** |
| 184 | +- No manual failover required during regional outages. |
| 185 | +- Applications automatically connect to healthy regions based on SDK configuration. |
| 186 | +- Write availability is maintained as long as at least one region is available. |
| 187 | + |
| 188 | +For more information, see [Distribute data globally](distribute-data-globally.md) and [Multi-region writes](multi-region-writes.md). |
| 189 | + |
| 190 | +### Continuous backup and point-in-time restore |
| 191 | + |
| 192 | +If data recovery is needed due to accidental deletion or modification (data corruption), use Azure Cosmos DB's continuous backup mode with point-in-time restore (PITR). This feature allows restoring an account to any point in time within the retention period. |
| 193 | + |
| 194 | +For detailed steps on performing a point-in-time restore, see [Continuous backup with point-in-time restore](continuous-backup-restore-introduction.md). |
| 195 | + |
| 196 | +### Periodic backup and restore |
| 197 | + |
| 198 | +If an account uses periodic backup mode, request a restore from Azure Support. Periodic backups are taken automatically at regular intervals, with both the interval (every four hours by default) and retention count (two most recent backups by default) being configurable. |
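
To see what the interval and retention settings mean for recoverability, the arithmetic can be sketched as follows. This is a rough estimate under stated assumptions, not a service guarantee: it assumes backups are evenly spaced and that a restore returns the data as of a retained backup. The function name `backup_window` is hypothetical.

```python
from datetime import timedelta


def backup_window(interval, retained_copies):
    """Estimate (worst-case data loss, maximum age of the oldest retained backup).

    Assumes evenly spaced backups; both values are upper bounds.
    """
    # Changes made after the most recent backup aren't captured yet.
    worst_case_loss = interval
    # The oldest retained copy is at most this far in the past.
    oldest_backup_age = interval * retained_copies
    return worst_case_loss, oldest_backup_age


# Defaults stated above: a backup every four hours, two copies retained.
loss, oldest = backup_window(timedelta(hours=4), 2)
print(loss, oldest)
```

If either estimate doesn't fit the application's recovery point requirements, shorten the backup interval or increase the number of retained backups.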
| 199 | + |
| 200 | +To request a restore from periodic backups: |
| 201 | + |
| 202 | +1. Create a support request in the Azure portal. |
| 203 | + |
| 204 | +1. Select **Backup and Restore** as the problem type. |
| 205 | + |
| 206 | +1. Provide the specific time for the restore. |
| 207 | + |
| 208 | +For more information, see [Configure Azure Cosmos DB account with periodic backup](periodic-backup-restore-introduction.md). |
| 209 | + |
| 210 | +## Related content |
| 211 | + |
| 212 | +To learn more about disaster recovery and business continuity, review: |
| 213 | + |
| 214 | +- [Distribute data globally with Azure Cosmos DB](distribute-data-globally.md) |
| 215 | +- [Continuous backup with point-in-time restore in Azure Cosmos DB](continuous-backup-restore-introduction.md) |
| 216 | +- [Multi-region writes in Azure Cosmos DB](multi-region-writes.md) |
| 217 | +- [Manage an Azure Cosmos DB account](how-to-manage-database-account.yml) |
| 218 | +- [Failover considerations for private endpoints](failover-considerations-for-private-endpoints.md) |