EKS High Availability Architecture Guide
Written: 2026-02-10 | Updated: 2026-02-13 | Reading time: ~20 min
Reference Environment: EKS 1.30+, Karpenter v1.x, Istio 1.22+
1. Overview
Resiliency is the ability to recover to a normal state when facing failures, or to maintain service while minimizing failure impact. The core principle: Failures will happen — prepare through design.
Failure Domain Hierarchy
Pod failure → Node failure → AZ failure → Region failure → Global failure, each with corresponding defense strategies (Probes/PDB → Topology Spread → Multi-AZ/ARC → Multi-Region → Multi-Cloud).
Resiliency Maturity Model
| Level | Stage | Key Capabilities |
|---|---|---|
| 1 | Basic | Pod-level resilience: Probes, PDB, Graceful Shutdown |
| 2 | Multi-AZ | AZ fault tolerance: Topology Spread, ARC Zonal Shift |
| 3 | Cell-Based | Blast radius isolation: Cell Architecture, Shuffle Sharding |
| 4 | Multi-Region | Region fault tolerance: Active-Active, Global Accelerator |
2. Multi-AZ Strategy
- Pod Topology Spread Constraints with
minDomains(K8s 1.30 GA) - AZ-aware Karpenter NodePool with disruption budgets
- Node Readiness Controller (2026) for bootstrap completion guarantees
- ARC Zonal Shift for automatic/manual AZ traffic diversion
- EBS AZ-Pinning mitigation with
WaitForFirstConsumer - Cross-AZ cost optimization via Istio Locality-Aware Routing
3. Cell-Based Architecture
Independent, self-contained service units for blast radius isolation. Implementation via Namespace-based (soft) or Cluster-based (hard) cells. Cell Router implementation options: Route 53 ARC, ALB Target Groups, or Istio VirtualService. Shuffle Sharding for multi-tenant fault isolation.
4. Multi-Cluster / Multi-Region
Active-Active, Active-Passive, Regional Isolation, Hub-Spoke patterns. Global Accelerator + EKS, ArgoCD Multi-Cluster GitOps (ApplicationSets), Istio Multi-Cluster Federation.
5. Application Resiliency Patterns
PodDisruptionBudgets, Graceful Shutdown (preStop + terminationGracePeriodSeconds), Circuit Breaker (Istio DestinationRule), Retry/Timeout (Istio VirtualService), EKS Auto Mode considerations.
6. Chaos Engineering
AWS FIS (managed), Litmus Chaos (CNCF), Chaos Mesh (CNCF) — with scenarios for Pod deletion, AZ failure simulation, network latency injection. Game Day runbook template with 5-phase framework.