articles/cosmos-db/disaster-recovery-guidance.md
Lines changed: 12 additions & 8 deletions
@@ -57,7 +57,7 @@ In the event of a service outage impacting application resources, consider the f
 
 - Azure teams work diligently to restore service availability as quickly as possible, but depending on the root cause, recovery can sometimes take longer. If an application can tolerate downtime, wait for the recovery to complete. In this case, no action is required. View the health of individual resources on the **Resource health** page under the **Help** menu. Refer to the Resource health page for updates and the latest information about an outage. After the region recovers, application availability is restored.
 
-- If the outage duration approaches the RTO, decide whether to wait for service recovery or initiate disaster recovery. Depending on the application's tolerance for downtime and potential business liability, make an informed decision about how to respond to prolonged unavailability.
+- If the outage duration approaches your RTO, decide whether to wait for service recovery or initiate disaster recovery. Depending on the application's tolerance for downtime and potential business liability, make an informed decision about how to respond to prolonged unavailability.
 
 ## Outage recovery guidance
 
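The wait-or-initiate-recovery guidance in this hunk could be sketched as a small helper. This is an illustration only and not part of the changed file: the function name `recovery_action`, the returned labels, and the 80% "approaching" threshold are all assumptions chosen for the example.

```python
from datetime import timedelta


def recovery_action(outage: timedelta, rto: timedelta,
                    approach_fraction: float = 0.8) -> str:
    """Classify an ongoing outage against the recovery time objective (RTO).

    approach_fraction is a hypothetical threshold for "approaching the RTO";
    the source text does not define one.
    """
    if rto <= timedelta(0):
        raise ValueError("rto must be positive")
    if outage >= rto:
        # The RTO is already exceeded.
        return "rto-exceeded"
    if outage >= rto * approach_fraction:
        # Outage is approaching the RTO: decide whether to keep waiting
        # or to initiate disaster recovery.
        return "decide"
    # Well within the RTO: waiting for service recovery is acceptable.
    return "wait"
```

For example, with a one-hour RTO, a 10-minute outage is classified as `"wait"`, while a 50-minute outage crosses the assumed 80% threshold and is classified as `"decide"`.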
@@ -85,7 +85,7 @@ A single-region account with **Availability Zones** can maintain read-write avai
 
 1. **Wait for service restoration** - Monitor the Service Health page and the account's Resource Health for updates. Azure teams work to restore service as quickly as possible.
 
-1. **Consider account restoration** - If the outage duration exceeds the RTO, request a restore to a different region through Azure Support. See [Periodic backup and restore](#periodic-backup-and-restore) for details.
+1. **Consider account restore** - If the outage duration exceeds your RTO, request a restore to a different region through Azure Support. See [Periodic backup and restore](#periodic-backup-and-restore) for details.
 
 1. **Plan for multi-region deployment** - To prevent future single-region outages, consider deploying to multiple regions.
 
@@ -106,7 +106,7 @@ If the account is configured as zone-redundant in the affected read region, it c
 Reads should typically remain unaffected during a regional outage if the preferred regions list is configured correctly, as the Azure Cosmos DB SDK automatically reroutes requests to the next available region. However, specific consistency levels or configurations can lead to disruptions:
 
 - **Strong Consistency** - For accounts with only two regions, a read region outage impacts write availability because strong consistency requires [dynamic quorum](consistency-levels.md#dynamic-quorum) to maintain strict consistency guarantees. With only one operational region, quorum can't be achieved, leading to disruptions in both read and write operations.
-- **Mitigation**: Perform a [region offline operation](how-to-manage-database-account.yml) for the affected read region to restore availability. If service-managed failover is enabled, Azure Cosmos DB eventually performs the region offline operation automatically, but this could take time based on how the outage is progressing. For faster recovery, perform a region offline operation or customer-managed failover.
+- **Mitigation**: Perform a [region offline operation](how-to-manage-database-account.yml) for the affected read region to restore availability. If service-managed failover is enabled, Azure Cosmos DB performs the region offline operation, but this could take time based on how the outage is progressing. For faster recovery, perform a [region offline operation](how-to-manage-database-account.yml#perform-forced-failover-for-your-azure-cosmos-db-account).
 
 - **Bounded Staleness Consistency** - When the read region has an outage and the staleness window is exceeded, write operations for the partitions in the affected region are also impacted. This occurs because Bounded Staleness consistency relies on maintaining a specific staleness threshold between regions. When this threshold is breached, the system can no longer guarantee consistency for writes.
 - **Mitigation**: Perform a [region offline operation](how-to-manage-database-account.yml#perform-forced-failover-for-your-azure-cosmos-db-account) for the affected read region to restore availability.
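The two consistency-level cases in this hunk can be summarized as a rough decision function. This is a sketch of the rules as the text states them, not an Azure Cosmos DB API: the function name and parameters are hypothetical, and treating every other consistency level as unaffected is an assumption based on the text listing only these two cases.

```python
def writes_impacted_by_read_region_outage(consistency: str,
                                          region_count: int,
                                          staleness_window_exceeded: bool = False) -> bool:
    """Return True if a read-region outage is expected to disrupt writes.

    Encodes only the two cases described in the text; all other
    consistency levels are assumed (not confirmed) to be unaffected.
    """
    level = consistency.strip().lower()
    if level == "strong":
        # With only two regions, losing one means quorum can't be achieved.
        return region_count == 2
    if level == "bounded staleness":
        # Writes are impacted once the staleness window is exceeded.
        return staleness_window_exceeded
    return False
```

In either impacted case, the mitigation named in the text is a region offline operation for the affected read region.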
@@ -143,13 +143,13 @@ The region offline operation removes the affected region from the account config
 
 ##### Service-managed failover
 
-Service-managed failover allows Azure Cosmos DB to fail over the write region of a multiple-region account to preserve business continuity.
+Service-managed failover allows Azure Cosmos DB to automatically perform region offline operations for affected regions to preserve business continuity.
 
 **Configuration:**
 
 - **Azure portal**: Navigate to the Azure Cosmos DB account, select **Replicate data globally**, and enable **Service Managed Failover**.
-- **Azure PowerShell**: Follow the instructions to enable [service managed failover](manage-with-powershell.md#enable-automatic-failover).
-- **Azure CLI**: Follow the instructions to enable [service managed failover](manage-with-cli.md#enable-service-managed-failover)
+- **Azure PowerShell**: Follow the instructions to enable [service managed failover](manage-with-powershell.md#enable-automatic-failover) via PowerShell cmdlets.
+- **Azure CLI**: Follow the instructions to enable [service managed failover](manage-with-cli.md#enable-service-managed-failover) via Azure CLI commands.
 
 > [!IMPORTANT]
 > Even with service-managed failover enabled, the timing of automatic failover depends on the nature and progression of the outage. In these scenarios, failover might take up to one hour or more. To quickly restore write availability during outages, perform the [region offline operation](#region-offline-operation) instead of waiting for service-managed failover.
@@ -162,9 +162,13 @@ Service-managed failover allows Azure Cosmos DB to fail over the write region of
 ##### Operations to avoid during region outages
 
 > [!WARNING]
-> Don't perform the following control plane operations during outage scenarios, as they result in account inconsistency and delay recovery:
+> Don't perform any control plane operations on the affected region during outage scenarios, as they result in account inconsistency and delay recovery. Some of the example of control plane operations to avoid include:
 > - Change write region (manual failover) or modify failover priority
 > - Update the account to multi-write configuration
+> - Updating consistency levels or other account settings
+> - Updating private endpoint configurations or network settings
+> - Updating account throughput or scaling operations
+> - Any other operation that modifies the account configuration or region settings
 
 ### Multiple-write region accounts
 
@@ -191,7 +195,7 @@ For detailed steps on performing a point-in-time restore, see [Continuous backup
 
 ### Periodic backup and restore
 
-If an account uses periodic backup mode, request a restore from Azure Support. Periodic backups are taken automatically at regular intervals(every four hours by default), and the two most recent backups are retained.
+If an account uses periodic backup mode, request a restore from Azure Support. Periodic backups are taken automatically at regular intervals, with both the interval (every four hours by default) and retention count (two most recent backups by default) being configurable.
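
The defaults quoted in the added line imply a rough restore window, which the following sketch works out. It is an approximation under stated assumptions: the function name and dictionary keys are hypothetical, backups are assumed to be taken exactly on schedule, and the time needed to take a backup or complete a support-driven restore is ignored.

```python
from datetime import timedelta


def periodic_backup_window(interval: timedelta = timedelta(hours=4),
                           retained_copies: int = 2) -> dict:
    """Estimate what a periodic backup configuration can recover.

    Defaults mirror the values quoted in the text (four-hour interval,
    two retained backups); both are configurable on the account.
    """
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    if retained_copies < 1:
        raise ValueError("retained_copies must be at least 1")
    return {
        # Data written after the newest backup is not covered by any backup.
        "max_data_loss": interval,
        # The oldest retained backup is at most this old.
        "oldest_restore_point": interval * retained_copies,
    }
```

With the defaults, up to about four hours of recent writes may be missing from the newest backup, and the oldest retained backup is at most about eight hours old.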