Multi-Cloud & Data · Multi-Cloud · 20 weeks

AWS Primary Kubernetes Platform with Azure Disaster Recovery

Designed and deployed a multi-cloud resilience pattern for a customer-facing multi-tier web application composed of a static frontend, Kubernetes-hosted APIs and workers, PostgreSQL, Redis, and object storage. The client needed provider-level disaster recovery rather than only regional resilience, with the production runtime hosted on AWS and a warm-standby recovery stack maintained on Azure.

AWS · EKS · AWS Load Balancer Controller · Cloudflare DNS · Cloudflare CDN · Cloudflare Load Balancing · Cloudflare WAF · RDS PostgreSQL · ElastiCache · S3 · Azure · AKS · NGINX Ingress · Azure Database for PostgreSQL · Azure Cache for Redis · Blob Storage · Key Vault · Helm

Architecture Diagram

AWS Primary + Azure Disaster Recovery: Architecture Overview

Edge
  • Cloudflare: DNS · CDN · Load Balancing · WAF · DDoS Protection
  • Primary web + API traffic goes to AWS; failover web + API traffic goes to Azure

AWS: Primary (Active)
  • S3 Static Frontend: web origin, cached by Cloudflare
  • ALB API Origin: AWS Load Balancer Controller
  • EKS API and Worker Workloads: application tier, Helm-managed releases
  • RDS PostgreSQL: system of record, multi-AZ
  • ElastiCache for Redis: session, queue, cache
  • S3 Objects: uploads, exports, media
  • ECR: container registry
  • Helm values · Kustomize overlays (shared with Azure)

Azure: Warm Standby
  • Blob Static Frontend: DR web origin
  • AKS API Ingress: NGINX ingress controller
  • AKS API and Worker Workloads: same release model, promoted on failover
  • Azure Database for PostgreSQL: Flexible Server, logical replica
  • Azure Cache for Redis: rebuilt on failover, no replication
  • Blob Objects: uploads, exports, media
  • ACR: image sync target
  • Key Vault · Helm values · Kustomize overlays

Cross-cloud paths (AWS -> Azure)
  • Frontend build sync: S3 static site -> Blob static site
  • Manifests: shared
  • PostgreSQL: logical replication, continuous, async
  • Redis: no replication
  • Objects: S3 -> Blob, scheduled sync
  • Images: ECR -> ACR, image sync
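The cross-cloud paths in the diagram can be written down as data and checked before a failover. The following is a minimal sketch, not the project's tooling: the channel names, the `max_lag_seconds` thresholds, and the `standby_blockers` helper are all hypothetical, and the lag figures would have to come from whatever monitoring the platform actually exposes.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Channel:
    """One data path from the AWS primary to the Azure standby."""
    name: str                          # hypothetical identifier
    mode: str                          # "continuous", "scheduled", "sync", or "none"
    max_lag_seconds: Optional[float]   # illustrative threshold; None if not replicated


# Modes follow the diagram where it states them; "sync" marks paths whose
# cadence the diagram does not state. Thresholds are made-up example values.
CHANNELS = [
    Channel("postgres-logical-replication", "continuous", 60.0),
    Channel("s3-to-blob-objects", "scheduled", 3600.0),
    Channel("s3-to-blob-frontend", "sync", 3600.0),
    Channel("ecr-to-acr-images", "sync", 3600.0),
    Channel("redis", "none", None),    # rebuilt on failover, never replicated
]


def standby_blockers(observed_lag: Dict[str, float]) -> List[str]:
    """Return reasons the standby is not ready, given observed lag in seconds."""
    blockers = []
    for ch in CHANNELS:
        if ch.mode == "none":
            continue  # nothing to measure: state is rebuilt on the Azure side
        lag = observed_lag.get(ch.name)
        if lag is None:
            blockers.append(f"{ch.name}: no measurement")
        elif lag > ch.max_lag_seconds:
            blockers.append(
                f"{ch.name}: lag {lag:.0f}s exceeds {ch.max_lag_seconds:.0f}s"
            )
    return blockers
```

An empty list means every replicated path is within its threshold. Redis is skipped on purpose, since the diagram marks it as rebuilt on failover rather than replicated.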

Technical Implementation

  • Defined the application as a static web frontend served from S3 and cached at the edge by Cloudflare, with API and background worker services running on Amazon EKS behind an ALB managed by the AWS Load Balancer Controller. PostgreSQL, Redis, and object storage remained separate stateful tiers so the runtime model reflected a real multi-tier application rather than a single cluster-hosted service.
  • Built the Azure DR environment with Blob Storage for the static frontend and synchronized object assets, NGINX Ingress on AKS for API entry, Azure Database for PostgreSQL Flexible Server, Azure Cache for Redis, and Key Vault, keeping the Kubernetes manifests common through Helm values and environment overlays instead of maintaining a separate application definition per cloud.
  • Replicated container images from ECR to ACR, configured PostgreSQL logical replication from RDS PostgreSQL to Azure Database for PostgreSQL, and synchronized object assets from S3 to Blob Storage on a scheduled basis so the Azure environment remained warm and recoverable without being used as an active runtime.
  • Placed Cloudflare in front of both clouds for authoritative DNS, CDN caching, health-based traffic steering, WAF policy enforcement, and DDoS protection. Cloudflare used the S3 frontend and the AWS ALB as the primary web and API origins, then failed over to Blob Storage and the AKS ingress endpoint when the AWS side was intentionally withdrawn during DR rehearsals. This moved public DNS and traffic steering out of AWS, while each cloud still needed its own ingress and origin load balancing in front of its cluster.
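The health-based steering described in the last bullet can be modeled as priority-ordered failover between two origin pools. This is a simplified sketch of the idea, not Cloudflare's API or its actual algorithm. The hostnames and the `OriginPool` and `select_pool` names are hypothetical, and treating a pool as usable only when all of its origins are healthy (so the frontend and API move together) is an assumption the source does not confirm.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class OriginPool:
    """A group of origins in one cloud, steered to as a unit."""
    name: str
    origins: Tuple[str, ...]  # hypothetical hostnames


AWS_POOL = OriginPool(
    "aws-primary",
    ("frontend.aws.example.com", "api.aws.example.com"),    # S3 site, ALB
)
AZURE_POOL = OriginPool(
    "azure-standby",
    ("frontend.azure.example.com", "api.azure.example.com"),  # Blob site, AKS ingress
)


def select_pool(
    pools: List[OriginPool], healthy: Dict[str, bool]
) -> Optional[OriginPool]:
    """Return the first pool, in priority order, whose origins are all healthy.

    Origins missing from the health map are treated as unhealthy.
    Returns None when no pool qualifies.
    """
    for pool in pools:
        if all(healthy.get(origin, False) for origin in pool.origins):
            return pool
    return None
```

With every origin healthy, `select_pool([AWS_POOL, AZURE_POOL], health)` returns the AWS pool. Marking the AWS origins unhealthy, as happens when the AWS side is withdrawn in a rehearsal, makes the same call return the Azure pool.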

Client Delivery & Handover

The delivery was run jointly with the client application, platform, and operations teams because the work crossed cloud networking, edge security, Kubernetes operations, database replication, and release engineering. The client team participated in design reviews, pipeline implementation, and DR rehearsals rather than only reviewing the end state. Handover included cloud-by-cloud architecture diagrams, Cloudflare failover and security policy procedures, replication operating notes, AKS and EKS support guidance, and rehearsal sessions for both platform operators and support leads so the failover process could be repeated without external help.

Outcome

The client retained AWS as the primary operating environment while gaining a documented and tested cross-cloud recovery path that reduced dependence on AWS for both application hosting and public traffic steering during high-impact incidents.

Project Snapshot

Category: Multi-Cloud & Data

Sector: Multi-Cloud

Duration: 20 weeks

Next Step

If this project is close to the work your team is planning, Ideamics can discuss comparable architectural decisions, delivery sequencing, and implementation tradeoffs in more detail.