EKS High Availability Architecture Guide

Written: 2026-02-10 | Updated: 2026-02-13 | Reading time: ~20 min

Reference Environment: EKS 1.30+, Karpenter v1.x, Istio 1.22+

1. Overview

Resiliency is the ability to return to a normal state after a failure, or to keep serving traffic while limiting the failure's impact. The core principle: failures will happen, so prepare for them through design.

Failure Domain Hierarchy

Each failure domain, from smallest to largest, has a corresponding defense strategy:

  • Pod failure: Probes, PDB
  • Node failure: Topology Spread
  • AZ failure: Multi-AZ, ARC
  • Region failure: Multi-Region
  • Global failure: Multi-Cloud

Resiliency Maturity Model

| Level | Stage | Key Capabilities |
|-------|-------|------------------|
| 1 | Basic | Pod-level resilience: Probes, PDB, Graceful Shutdown |
| 2 | Multi-AZ | AZ fault tolerance: Topology Spread, ARC Zonal Shift |
| 3 | Cell-Based | Blast radius isolation: Cell Architecture, Shuffle Sharding |
| 4 | Multi-Region | Region fault tolerance: Active-Active, Global Accelerator |

2. Multi-AZ Strategy

  • Pod Topology Spread Constraints with minDomains (K8s 1.30 GA)
  • AZ-aware Karpenter NodePool with disruption budgets
  • Node Readiness Controller (2026) for bootstrap completion guarantees
  • ARC Zonal Shift for automatic/manual AZ traffic diversion
  • EBS AZ-Pinning mitigation with WaitForFirstConsumer
  • Cross-AZ cost optimization via Istio Locality-Aware Routing
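
As a rough illustration of the first bullet, the sketch below builds a zone-level `topologySpreadConstraints` entry with `minDomains` and models how the skew check behaves when fewer zones are available than `minDomains` requires. The helper names (`zone_spread_constraint`, `skew_if_placed`) and the `app` label are assumptions made for this example, and the skew function is a simplified model of the scheduler's rule rather than its actual implementation.

```python
ZONE_KEY = "topology.kubernetes.io/zone"  # well-known Kubernetes zone label


def zone_spread_constraint(app_label: str, min_domains: int = 3) -> dict:
    """Build one topologySpreadConstraints entry that spreads pods across zones.

    The 'app' label key is an assumption for this example.
    """
    return {
        "maxSkew": 1,
        "topologyKey": ZONE_KEY,
        # minDomains only takes effect together with DoNotSchedule.
        "whenUnsatisfiable": "DoNotSchedule",
        "minDomains": min_domains,
        "labelSelector": {"matchLabels": {"app": app_label}},
    }


def skew_if_placed(pods_per_zone: dict, target_zone: str, min_domains: int) -> int:
    """Simplified model: skew that would result from placing one pod in target_zone.

    pods_per_zone maps every eligible zone to its current matching pod count
    (zones with zero pods must be listed). When fewer zones are eligible than
    min_domains, the global minimum is treated as 0.
    """
    global_min = min(pods_per_zone.values())
    if len(pods_per_zone) < min_domains:
        global_min = 0
    return pods_per_zone[target_zone] + 1 - global_min
```

With only two eligible zones holding one pod each and `minDomains` set to 3, `skew_if_placed({"a": 1, "b": 1}, "a", 3)` exceeds `maxSkew: 1`, so the pod stays pending until capacity appears in a third zone. This is where an AZ-aware node provisioner such as Karpenter comes in.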

3. Cell-Based Architecture

Independent, self-contained service units for blast radius isolation. Implementation via Namespace-based (soft) or Cluster-based (hard) cells. Cell Router implementation options: Route 53 ARC, ALB Target Groups, or Istio VirtualService. Shuffle Sharding for multi-tenant fault isolation.
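
The Shuffle Sharding idea can be sketched as follows: each tenant is deterministically mapped to a small subset of cells, so that two tenants rarely share exactly the same set. This is a minimal sketch under stated assumptions. The function name `shuffle_shard`, the cell names, and the hash-seeded sampling scheme are illustrative choices, not the guide's own implementation.

```python
import hashlib
import random


def shuffle_shard(tenant_id: str, cells: list[str], shard_size: int) -> list[str]:
    """Deterministically pick shard_size cells for a tenant.

    The tenant ID is hashed to seed a private random generator, so the same
    tenant always gets the same cells regardless of the order of the cell list.
    """
    if not 0 < shard_size <= len(cells):
        raise ValueError("shard_size must be between 1 and len(cells)")
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    # Sort first so the assignment does not depend on input ordering.
    return sorted(rng.sample(sorted(cells), shard_size))


def fully_overlapping(shard_a: list[str], shard_b: list[str]) -> bool:
    """True when two tenants share every cell, i.e. one can fully impact the other."""
    return set(shard_a) == set(shard_b)
```

For example, with 8 cells and a shard size of 2 there are 28 possible shards, so a tenant that overloads its two cells fully affects only those tenants that happen to be mapped to the same pair. Tenants sharing a single cell still have one healthy cell to fall back on.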

4. Multi-Cluster / Multi-Region

Active-Active, Active-Passive, Regional Isolation, Hub-Spoke patterns. Global Accelerator + EKS, ArgoCD Multi-Cluster GitOps (ApplicationSets), Istio Multi-Cluster Federation.

5. Application Resiliency Patterns

PodDisruptionBudgets, Graceful Shutdown (preStop + terminationGracePeriodSeconds), Circuit Breaker (Istio DestinationRule), Retry/Timeout (Istio VirtualService), EKS Auto Mode considerations.
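
To make the first two patterns concrete, the sketch below generates a PodDisruptionBudget manifest and a Pod spec fragment whose `terminationGracePeriodSeconds` covers the `preStop` sleep plus the application's drain time. The helper names, the `app` label, the container name, the image reference, and the 5-second margin are all hypothetical values chosen for illustration.

```python
def pod_disruption_budget(name: str, app_label: str, min_available: str = "50%") -> dict:
    """Build a policy/v1 PodDisruptionBudget for pods labelled app=<app_label>."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": {
            "minAvailable": min_available,
            "selector": {"matchLabels": {"app": app_label}},
        },
    }


def graceful_shutdown_spec(pre_stop_sleep_s: int, drain_s: int, margin_s: int = 5) -> dict:
    """Pod spec fragment for graceful shutdown.

    The grace period countdown includes the time spent in the preStop hook,
    so it must cover the preStop sleep plus the in-flight request drain time.
    margin_s is an assumed safety buffer.
    """
    return {
        "terminationGracePeriodSeconds": pre_stop_sleep_s + drain_s + margin_s,
        "containers": [
            {
                "name": "app",  # hypothetical container name
                "image": "example.com/app:latest",  # placeholder image
                "lifecycle": {
                    # Sleep so load balancers can deregister the pod before SIGTERM.
                    "preStop": {"exec": {"command": ["sleep", str(pre_stop_sleep_s)]}}
                },
            }
        ],
    }
```

Calling `graceful_shutdown_spec(10, 30)` yields a 45-second grace period: 10 seconds for the `preStop` sleep, 30 seconds for draining, and the 5-second margin. Either dictionary can be serialized with `json.dumps` and applied as a manifest, since Kubernetes accepts JSON as well as YAML.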

6. Chaos Engineering

AWS FIS (managed), Litmus Chaos (CNCF), and Chaos Mesh (CNCF), with scenarios for Pod deletion, AZ failure simulation, and network latency injection. Includes a Game Day runbook template built on a 5-phase framework.

7. Resiliency Checklist

Level 1-4 checklists covering Probes, PDB, Topology Spread, Karpenter, ARC, Cell Architecture, Multi-Region, and cost optimization tips.

