Multi-Region Disaster Recovery with AWS Backup¶
One of the projects I worked on before joining the Observability team was designing a disaster recovery strategy using AWS Backup. The goal was simple on paper: make sure critical data survives even if an entire region goes down or an account gets compromised.
In practice, it meant setting up cross-region and cross-account backup copies with proper isolation, covering 10 different resource types across multiple production accounts.
Primary Account
┌─────────────────────────────────────────────────────┐
│                                                     │
│  us-east-1 (Virginia)         us-east-2 (Ohio)      │
│  ┌───────────────┐            ┌───────────────┐     │
│  │  AWS Backup   │  cross-    │    Backup     │     │
│  │    Vault      │  region    │    Vault      │     │
│  │ (per resource │ ─────────> │ (per resource │     │
│  │    type)      │  5 AM UTC  │    type)      │     │
│  └───────────────┘            └───────────────┘     │
│          │                                          │
└──────────┼──────────────────────────────────────────┘
           │ cross-account copy
           │ 7 AM UTC (periodic snapshots)
           v
Isolated Backup Account
┌─────────────────────────────────┐
│  us-east-1 (Virginia)           │
│  ┌───────────────┐              │
│  │   LAG Vault   │              │
│  │  (air-gapped) │   shared     │
│  │               │   via RAM    │
│  │     7-day     │<─ ─ ─ ─ ─ ─ ─  source account
│  │   retention   │    restore      can restore
│  └───────────────┘              │
└─────────────────────────────────┘
What gets backed up¶
The strategy covers Aurora, DocumentDB, Neptune, RDS, S3, DynamoDB, EBS, and EFS natively through AWS Backup. OpenSearch and MemoryDB don't have native support, so those use a different approach (more on that later).
Each resource type gets its own dedicated vault in the workload account. Not a shared vault for everything, but one vault per resource type. This keeps the blast radius small and makes it easier to manage vault policies and KMS keys independently.
The whole thing runs exclusively in production. Dev and staging don't have centralized backups. Those environments are each team's responsibility.
Two backup plans per resource¶
Every resource is protected by two separate backup plans:
The cross-region plan runs at 5 AM UTC. It creates a recovery point in us-east-1 and copies it to us-east-2 (Ohio). The in-region copy has a 1-day retention, and the Ohio copy keeps 3 days. This covers regional outages.
The cross-account plan runs at 7 AM UTC. It creates another recovery point in us-east-1 and copies it to the isolated backup account's LAG vault. The in-region copy again retains for 1 day, and the LAG vault copy keeps 7 days. This covers account compromise scenarios.
Both plans have a 60-minute start window and a 360-minute (6-hour) completion window. If a job doesn't finish in 6 hours, it times out. The monitoring system triggers an alert at 5 hours 30 minutes to give us a heads-up before that happens.
The result: each resource gets two in-region recovery points daily (one from each plan), plus a copy in Ohio and a copy in the backup account.
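As a sketch, the two plans map onto AWS Backup's CreateBackupPlan document roughly like this. The vault names, ARNs, and account IDs are made-up placeholders, and the real setup templates this per resource type; the field names follow the actual CreateBackupPlan API.

```python
# Sketch of the two backup plan documents as passed to AWS Backup's
# CreateBackupPlan API. Vault names, ARNs, and account IDs are hypothetical.

def make_plan(name, schedule, dest_vault_arn, copy_retention_days):
    """Build a plan with one rule: an in-region recovery point (1-day
    retention) plus a single copy action to a destination vault."""
    return {
        "BackupPlanName": name,
        "Rules": [{
            "RuleName": f"{name}-rule",
            "TargetBackupVaultName": "rds-vault",   # per-resource-type vault
            "ScheduleExpression": schedule,
            "StartWindowMinutes": 60,               # job must start within 1 hour
            "CompletionWindowMinutes": 360,         # and finish within 6 hours
            "Lifecycle": {"DeleteAfterDays": 1},    # in-region retention
            "CopyActions": [{
                "DestinationBackupVaultArn": dest_vault_arn,
                "Lifecycle": {"DeleteAfterDays": copy_retention_days},
            }],
        }],
    }

# Cross-region plan: 5 AM UTC, copy to Ohio, 3-day retention there.
cross_region = make_plan(
    "rds-cross-region", "cron(0 5 * * ? *)",
    "arn:aws:backup:us-east-2:111111111111:backup-vault:rds-vault", 3)

# Cross-account plan: 7 AM UTC, copy to the LAG vault, 7-day retention.
cross_account = make_plan(
    "rds-cross-account", "cron(0 7 * * ? *)",
    "arn:aws:backup:us-east-1:999999999999:backup-vault:rds-lag-vault", 7)
```

Because each plan carries its own rule with its own in-region lifecycle, running both is exactly what produces the two daily recovery points described above.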
Why two plans instead of one¶
The two-plan setup wasn't the original design. The first version had a single backup plan per resource with two copy actions in the same rule: one for cross-region, one for cross-account. Cleaner, no duplicate recovery points, less storage cost.
The problem is that AWS Backup applies all copy actions in a rule to every resource that enters the plan. There's no way to conditionally skip a specific copy action based on a resource tag. Tag-based filtering only works at the selection level (which resources enter the plan), not at the copy action level (which copies run for each resource).
This matters because some resources don't need the cross-region copy. DynamoDB Global Tables and Aurora Global Database already replicate across regions natively. Copying them to Ohio through AWS Backup is redundant and adds cost. But those same resources still need the cross-account copy to the LAG vault for ransomware protection, since native replication doesn't protect against account compromise.
With a single plan, it was all or nothing. Either a resource gets both copies or it gets none. There was no way to say "skip Ohio but keep the LAG vault copy."
We opened a support case with AWS to confirm this wasn't a configuration gap on our side. Their response was clear: this is a current service limitation. AWS Backup's architecture doesn't support conditional execution of individual copy actions within a backup rule based on resource tags. They recommended continuing with the two-plan workaround and suggested submitting a feature request, which we did.
The tradeoff is real. Two plans means two in-region recovery points per resource per day instead of one. That's duplicate storage for every resource that enters both plans. For accounts with hundreds of resources, it adds up. But it's the only way to give teams per-resource granularity on cross-region copies without losing cross-account protection.
If AWS ever ships conditional copy actions, the migration path is straightforward: merge both plans into one, move the exclude-cross-region-backup logic from the selection to the copy action, and cut the in-region storage in half.
Tag-based selection and opt-out¶
The backup plans use tag-based selection. By default, all supported resources in the production account are included. No action needed from teams to get their resources backed up.
If a team doesn't want a specific resource backed up (temporary data, caches, test tables), they have two tags available:
exclude-disaster-recovery-backup: "true" removes the resource from all backup plans. No in-region, no cross-region, no cross-account. Nothing.
exclude-cross-region-backup: "true" skips only the cross-region copy to Ohio. In-region and cross-account backups still run. This one is useful for resources that already replicate natively, like DynamoDB Global Tables or Aurora Global Database. Cross-region backups for those are redundant and just add cost.
For EBS specifically, volumes tagged with kubernetes.io/cluster/{clusterName}: "owned" are automatically excluded. These are ephemeral volumes managed by Kubernetes (Karpenter node pools, critical add-ons) that don't need backup protection.
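The opt-out model can be sketched with AWS Backup's CreateBackupSelection document. The IAM role ARN below is a hypothetical placeholder, and the exact selection layout in the real charts may differ; the condition syntax follows the actual API.

```python
# Sketch of the tag-based selection document passed to AWS Backup's
# CreateBackupSelection API. The role ARN is a hypothetical placeholder.

def make_selection(name, extra_opt_out_tags=()):
    """Select every supported resource, minus anything carrying an opt-out
    tag. StringNotEquals conditions AND together, so a resource drops out
    of the selection as soon as any one opt-out tag is set to "true"."""
    opt_outs = ["exclude-disaster-recovery-backup", *extra_opt_out_tags]
    return {
        "SelectionName": name,
        "IamRoleArn": "arn:aws:iam::111111111111:role/backup-service-role",
        "Resources": ["*"],  # opt-out model: everything is in by default
        "Conditions": {
            "StringNotEquals": [
                {"ConditionKey": f"aws:ResourceTag/{tag}",
                 "ConditionValue": "true"}
                for tag in opt_outs
            ],
        },
    }

# The cross-account plan honors only the global opt-out; the cross-region
# plan additionally honors exclude-cross-region-backup.
cross_account_sel = make_selection("rds-cross-account")
cross_region_sel = make_selection("rds-cross-region",
                                  ["exclude-cross-region-backup"])
```

This is also where the two-plan workaround lives in practice: the tag difference sits in the selections, one per plan, because it can't sit on the copy actions.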
Continuous backups where it matters¶
For S3 cross-account copies, we enabled continuous backups (enableContinuousBackup: true). Continuous backup enables point-in-time recovery (PITR), which means you're not paying for full daily snapshots; you only store the changes. For workloads with large S3 datasets, this made a real difference in cost. The cross-account S3 plan retains 35 days in-region with continuous backup enabled, while the LAG vault copy keeps 7 days.
The cross-region plans for S3 and other resources use periodic snapshots instead. The tradeoff there is simpler lifecycle management at the cost of slightly higher storage for large datasets.
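In CreateBackupPlan terms, the S3 cross-account rule looks roughly like this (vault name and ARN are placeholders; EnableContinuousBackup is the real API field):

```python
# Sketch of the S3 cross-account rule with continuous backup enabled.
# Vault names and the ARN are hypothetical placeholders.
s3_cross_account_rule = {
    "RuleName": "s3-cross-account",
    "TargetBackupVaultName": "s3-vault",
    "ScheduleExpression": "cron(0 7 * * ? *)",
    "EnableContinuousBackup": True,         # PITR instead of daily full snapshots
    "Lifecycle": {"DeleteAfterDays": 35},   # 35 days is also the PITR maximum
    "CopyActions": [{
        # Copies into the LAG vault are periodic snapshots, not continuous.
        "DestinationBackupVaultArn":
            "arn:aws:backup:us-east-1:999999999999:backup-vault:s3-lag-vault",
        "Lifecycle": {"DeleteAfterDays": 7},
    }],
}
```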
What is a logically air-gapped vault¶
A LAG vault is different from a standard AWS Backup vault. It's designed specifically for isolation. AWS manages the encryption keys (you can't bring your own), and the vault enforces a minimum retention period that nobody can override or shorten. Not even an account admin.
The idea is that even if someone gains full access to the account, they can't delete or tamper with the backups inside a LAG vault before the retention expires. It's the closest thing to a physical air gap without actually disconnecting anything.
In our case, the LAG vault in the isolated account had a 7-day retention. Short enough to keep costs reasonable, long enough to recover from most disaster scenarios.
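For concreteness, a LAG vault is created through a dedicated API, CreateLogicallyAirGappedBackupVault, rather than the standard vault call. A hypothetical sketch of that request (the vault name is a placeholder):

```python
# Sketch of the request to AWS Backup's CreateLogicallyAirGappedBackupVault
# API. Unlike a standard vault, there is no KMS key parameter: AWS owns the
# encryption key. The retention bounds are enforced by the service and
# cannot be shortened after creation, not even by an account admin.
lag_vault_request = {
    "BackupVaultName": "central-lag-vault",  # hypothetical name
    "MinRetentionDays": 7,  # recovery points immutable for at least 7 days
    "MaxRetentionDays": 7,  # and expire at 7 days to keep costs bounded
}
```

Setting min and max to the same value is what pins the vault to exactly the 7-day window described above.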
Cross-account sharing with RAM¶
The LAG vault lives in the isolated account, but the source account still needs to be able to restore from it in a disaster scenario. That's where Resource Access Manager (RAM) comes in.
The isolated account shares the LAG vault via RAM with the source account. Once the share is accepted, the source account can see and restore from the recovery points stored in the LAG vault. Without RAM, the backups would be locked in the isolated account with no way for the primary account to access them when needed.
AWS Backup handles the copy jobs into the LAG vault natively through cross-account copy rules. RAM is specifically for giving the source account read access to restore from those backups.
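The share itself can be sketched against RAM's CreateResourceShare API. The ARNs and account IDs below are placeholders, and allowExternalPrincipals=False assumes both accounts sit in the same AWS Organization, which the post doesn't state explicitly:

```python
# Sketch of the RAM resource share created in the isolated account so the
# source account can restore from the LAG vault. ARNs/IDs are hypothetical.
resource_share = {
    "name": "lag-vault-restore-access",
    "resourceArns": [
        "arn:aws:backup:us-east-1:999999999999:backup-vault:central-lag-vault",
    ],
    "principals": ["111111111111"],    # the source (workload) account
    "allowExternalPrincipals": False,  # same-Organization accounts only
}
```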
Services that AWS Backup doesn't cover¶
OpenSearch and MemoryDB don't have native AWS Backup support. Instead of leaving them unprotected, we built CronJobs that generate snapshots and store them in dedicated S3 buckets:
OpenSearch snapshots go to <prefix>-<CLUSTER_NAME>-opensearch-snapshots, triggered daily at 2 AM UTC. The script discovers all OpenSearch domains in the account, creates a snapshot repository pointing to the S3 bucket, and takes the snapshot. All domains in the account share a single bucket per cluster.
MemoryDB snapshots go to <prefix>-<CLUSTER_NAME>-memorydb-snapshots, triggered every 2 hours. The script checks for existing snapshots from the current day and two days ago, and exports any missing ones via the copy-snapshot CLI command.
Both buckets have a 2-day retention for the raw snapshots. But here's the key part: once the snapshots land in S3, they're automatically picked up by the standard S3 backup plans. So they get the same cross-region and cross-account protection as any other S3 bucket. The CronJob is just the bridge to get them into a format that AWS Backup can handle.
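The gap-filling check the MemoryDB CronJob performs can be sketched as a pure function: given the snapshots already exported to S3, work out which days in the window from today back through two days ago still need a copy-snapshot run. The function name and shape are illustrative, not the actual script:

```python
# Illustrative sketch of the MemoryDB CronJob's date-window check.
from datetime import date, timedelta

def missing_snapshot_dates(existing: set, today: date) -> list:
    """Return ISO dates in the today-through-two-days-ago window that
    have no exported snapshot yet; these get a copy-snapshot run."""
    window = [(today - timedelta(days=d)).isoformat() for d in range(3)]
    return [day for day in window if day not in existing]
```

Running every 2 hours with this kind of check makes the job idempotent: a run that finds nothing missing exports nothing.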
Infrastructure as code with Crossplane¶
The entire setup is managed as Helm charts deployed through Crossplane. Each resource type (S3, EBS, DynamoDB, EFS, RDS) has its own set of templates for vaults, vault policies, backup plans, selections, and KMS keys.
The chart is structured into three layers:
The origin chart runs in the primary region (us-east-1). It creates the vaults, backup plans, and selections for each resource type. This is where the schedules, retention policies, and copy rules are defined.
The cross-region chart runs in the destination region (us-east-2). It creates the destination vaults and their KMS keys. Each resource type gets its own vault and key in Ohio.
The archive chart runs in the isolated backup account. It creates the vault and KMS configuration that receives the cross-account copies.
In production, all resource types are always enabled. There are no feature flags or toggles. Every account gets the full set of vaults, plans, and selections for S3, EBS, EFS, DynamoDB, and RDS. The toggle mechanism (ebs.active: true/false, etc.) only exists in the staging chart, where we test changes before rolling them out.
Vault policies explicitly allow only the source account to copy into the vault. The KMS keys in the cross-region chart grant the source account permissions to encrypt and decrypt, with grants restricted to AWS resources only.
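A minimal sketch of such a destination-vault access policy statement, with placeholder account IDs (the exact statements in the charts aren't shown here; backup:CopyIntoBackupVault is the real IAM action for cross-account copies):

```python
# Sketch of a destination vault's access policy: only the source account
# may copy recovery points in. The account ID is a hypothetical placeholder.
vault_policy_statement = {
    "Sid": "AllowCopyFromSourceAccount",
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::111111111111:root"},
    "Action": "backup:CopyIntoBackupVault",
    "Resource": "*",
}
```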
KMS requirements¶
For resources encrypted with customer-managed KMS keys (Aurora, DocumentDB, Neptune, RDS, EBS), the key policy needs to grant the backup account permissions to use the key. Specifically, kms:DescribeKey, kms:Encrypt, kms:Decrypt, kms:ReEncrypt*, kms:GenerateDataKey, and kms:GenerateDataKeyWithoutPlaintext, plus grant management actions.
DynamoDB and EFS don't need this. AWS Backup automatically encrypts their backups using the KMS key associated with the vault itself.
S3 buckets need versioning enabled and must be under 50TB. ACLs must be disabled (Object Ownership set to Bucket owner enforced).
If teams use the standard provisioning automation, all of this is already configured. If they provisioned resources manually, they need to add the KMS policy themselves.
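Putting the action list above into a key policy statement looks roughly like this. The backup account ID is a placeholder, and granting to the account root (rather than a narrower principal) is an assumption:

```python
# Sketch of the key policy statement a team adds to a customer-managed KMS
# key so the backup account can use it. Account ID is hypothetical.
kms_statement = {
    "Sid": "AllowBackupAccountUseOfTheKey",
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::999999999999:root"},
    "Action": [
        "kms:DescribeKey",
        "kms:Encrypt",
        "kms:Decrypt",
        "kms:ReEncrypt*",
        "kms:GenerateDataKey",
        "kms:GenerateDataKeyWithoutPlaintext",
        # grant management actions
        "kms:CreateGrant",
        "kms:ListGrants",
        "kms:RevokeGrant",
    ],
    "Resource": "*",  # in a key policy, "*" means this key
}
```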
Monitoring¶
We built two layers of monitoring around this:
Status alerts fire when any backup or copy job finishes with a status other than Completed. These need immediate investigation.
Runtime alerts fire when a job exceeds 5 hours 30 minutes. Since the completion window is 6 hours, this gives us a 30-minute buffer to investigate before the job times out.
On top of that, there's a Metabase dashboard that tracks the percentage of resources covered by data protection in each account. It helps catch drift, like a new resource that got provisioned without the right tags, or a team that accidentally excluded something they shouldn't have.
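The post doesn't say what drives the status alerts; one common mechanism is an EventBridge rule matched against AWS Backup's "Backup Job State Change" events. A hypothetical pattern along those lines (the detail field names are an assumption and may differ from the actual event shape):

```python
# Hypothetical EventBridge event pattern for the status alerts: match any
# backup job that reaches a terminal state other than COMPLETED.
failed_job_pattern = {
    "source": ["aws.backup"],
    "detail-type": ["Backup Job State Change"],
    "detail": {"state": ["FAILED", "ABORTED", "EXPIRED"]},
}
```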
DR planning is one of those things that feels boring until you need it. The hard part isn't the tooling. AWS Backup makes the mechanics straightforward. The hard part is getting the account structure, vault type, and sharing permissions right so that the isolation is real.
A few things that stuck with me:
LAG vaults are worth the tradeoff. You lose KMS key control, but you gain tamper-proof retention. For a DR vault, that's the right trade.
One vault per resource type keeps things clean. Shared vaults sound simpler, but they make policy management and key rotation harder as you scale.
Tag-based selection with opt-out is the right default. Making backup automatic and letting teams exclude what they don't need is much safer than requiring teams to opt in. People forget to opt in. They rarely forget to opt out of something they don't want.
The CronJob bridge for unsupported services works well. OpenSearch and MemoryDB snapshots landing in S3 means they get the same protection as everything else, with no special handling in the backup plans.
Continuous backups for S3 save real money when you're doing cross-account copies of large datasets. The 35-day retention with PITR is more useful than 35 daily snapshots.
Test restores regularly. A backup that can't be restored is not a backup.