
EKS High Availability Architecture Guide

📅 Written: 2026-02-10 | Last Modified: 2026-02-13 | ⏱️ Reading Time: ~17 min


📌 Reference environment: EKS 1.30+, Karpenter v1.x, Istio 1.22+

1. Overview

Resiliency is the ability of a system to recover to a normal state when facing failures, or to maintain service while minimizing the impact of failures. The core principle of resiliency in cloud-native environments is simple: Failures will happen — design for them.

From single Pod failures to region-wide outages, understanding the Failure Domains at each layer and establishing corresponding defense strategies is the key to EKS operations.

Failure Domain Hierarchy

Resiliency Maturity Model

Organizations can classify their resiliency level into 4 stages and progressively advance from their current position.

| Level | Stage | Core Capabilities | Implementation Items | Complexity | Cost Impact |
|---|---|---|---|---|---|
| 1 | Basic | Pod-level resilience | Probe configuration, PDB, Graceful Shutdown, Resource Limits | Low | Minimal |
| 2 | Multi-AZ | AZ fault tolerance | Topology Spread, Multi-AZ NodePool, ARC Zonal Shift | Medium | Cross-AZ traffic cost |
| 3 | Cell-Based | Blast Radius isolation | Cell Architecture, Shuffle Sharding, independent deployments | High | Per-Cell overhead |
| 4 | Multi-Region | Region fault tolerance | Active-Active architecture, Global Accelerator, data replication | Very High | Per-region infrastructure cost |
Incident Diagnosis & Response Guide Reference

For operational troubleshooting and incident resolution, refer to the EKS Incident Diagnosis and Response Guide. This document focuses on failure prevention and design, while real-time troubleshooting is covered in the incident diagnosis guide.


2. Multi-AZ Strategy

Multi-AZ deployment is the most fundamental yet powerful strategy for EKS resiliency. It distributes workloads across multiple Availability Zones so that a single AZ failure does not bring down the entire service.

Pod Topology Spread Constraints

Topology Spread Constraints distribute Pods evenly across AZs, nodes, and custom topology domains. In K8s 1.30+, the minDomains parameter allows specifying the minimum number of spread domains.

| Parameter | Description | Recommended Value |
|---|---|---|
| maxSkew | Maximum Pod count difference between domains | AZ: 1, Node: 2 |
| topologyKey | Label to spread by | topology.kubernetes.io/zone |
| whenUnsatisfiable | Behavior when constraint cannot be met | DoNotSchedule (hard) or ScheduleAnyway (soft) |
| minDomains | Minimum number of spread domains | Same as the number of AZs (e.g., 3) |
| labelSelector | Target Pod selection | Same as the Deployment's matchLabels |

Hard + Soft Combination Strategy (recommended):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-app
spec:
  replicas: 6
  selector:
    matchLabels:
      app: critical-app
  template:
    metadata:
      labels:
        app: critical-app
    spec:
      topologySpreadConstraints:
        # Hard: Even distribution across AZs (strictly enforced)
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: critical-app
          minDomains: 3
        # Soft: Distribution across nodes (best effort)
        - maxSkew: 2
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: critical-app
maxSkew Configuration Tip

maxSkew: 1 ensures the strictest even distribution. Deploying 6 replicas across 3 AZs places exactly 2 in each AZ. If scaling speed is important, setting maxSkew: 2 provides more scheduling flexibility with a looser distribution.
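The skew arithmetic is easy to sanity-check. A tiny Python sketch (illustrative only, not part of any Kubernetes API) shows how the scheduler evaluates skew:

```python
# Skew = (max pods in any domain) - (min pods in any domain).
# A placement is rejected when it would push skew past maxSkew
# and whenUnsatisfiable is DoNotSchedule.

def skew(pods_per_domain: list[int]) -> int:
    return max(pods_per_domain) - min(pods_per_domain)

print(skew([2, 2, 2]))  # 6 replicas / 3 AZs: perfectly even, skew 0
print(skew([3, 2, 2]))  # a 7th replica: skew 1, still allowed under maxSkew: 1
print(skew([4, 2, 2]))  # skew 2: blocked under maxSkew: 1, allowed under maxSkew: 2
```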

AZ-aware Karpenter Configuration

In Karpenter v1 GA, Multi-AZ distribution, Disruption budgets, and Spot + On-Demand mixed strategies are declaratively configured per NodePool.

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: multi-az-pool
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m
    # Disruption budget: limit so that no more than 20% of nodes are disrupted simultaneously
    budgets:
      - nodes: "20%"
  template:
    spec:
      requirements:
        # Provision nodes across 3 AZs
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-east-1a", "us-east-1b", "us-east-1c"]
        # Mixed Spot + On-Demand for cost optimization + stability
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - c6i.xlarge
            - c6i.2xlarge
            - c6i.4xlarge
            - c7i.xlarge
            - c7i.2xlarge
            - c7i.4xlarge
            - m6i.xlarge
            - m6i.2xlarge
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: multi-az
  limits:
    cpu: "1000"
    memory: 2000Gi
Spot Instances and Multi-AZ

Spot instances have different capacity pools per AZ. Specifying 15 or more diverse instance types minimizes provisioning failures due to insufficient Spot capacity. Always run the base capacity for mission-critical workloads on On-Demand.
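One way to keep the On-Demand base capacity guaranteed with Karpenter is a pair of weighted NodePools: a higher-weight On-Demand pool with a small `limits` cap fills first, and a lower-weight Spot pool absorbs the burst. A minimal sketch under those assumptions (pool names and limits are illustrative; nodeClassRef and instance-type requirements are trimmed for brevity):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: base-on-demand
spec:
  weight: 100          # higher-weight pools are considered first
  limits:
    cpu: "64"          # caps how much baseline runs On-Demand
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      # nodeClassRef and instance-type requirements as in the NodePool above
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: burst-spot
spec:
  weight: 10
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
      # nodeClassRef and instance-type requirements as in the NodePool above
```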

Safe Workload Placement with Node Readiness

In Multi-AZ environments, when new nodes are provisioned, the node may reach Ready state before it is actually prepared to accept workloads. Leverage these Kubernetes readiness mechanisms to prevent premature scheduling.

Node Readiness Controller (Announced February 2026)

Node Readiness Controller declaratively manages custom taints during the node bootstrap process, delaying workload scheduling until all infrastructure requirements are met — including GPU drivers, CNI plugins, CSI drivers, and security agents.

Resiliency Benefits:

  • During AZ failure recovery: When Karpenter provisions nodes in a new AZ, workloads only land after the node is fully prepared
  • Scale-out events: Even during rapid scaling, workloads are not placed on incomplete nodes
  • GPU/ML workloads: Prevents scheduling before driver loading completes, avoiding CrashLoopBackOff

Pod Scheduling Readiness (K8s 1.30 GA)

schedulingGates allow controlling scheduling timing from the Pod side. External systems verify readiness and then remove the gate to allow scheduling:

apiVersion: v1
kind: Pod
metadata:
  name: validated-pod
spec:
  schedulingGates:
    - name: "example.com/capacity-validation"
    - name: "example.com/security-clearance"
  containers:
    - name: app
      image: app:latest
      resources:
        requests:
          cpu: "4"
          memory: "8Gi"

Use Cases:

  • Allow scheduling only after resource quota pre-validation
  • Allow scheduling only after security clearance
  • Allow scheduling only after custom admission checks pass

Pod Readiness Gates (AWS LB Controller)

AWS Load Balancer Controller's Pod Readiness Gates ensure zero-downtime deployments during rolling updates:

apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    elbv2.k8s.aws/pod-readiness-gate-inject: enabled  # Enable auto-injection

Old Pods are not terminated until new Pods are registered as ALB/NLB targets and pass health checks, ensuring zero traffic loss during deployments.

Readiness Feature Selection Guide
| Requirement | Recommended Feature | Level |
|---|---|---|
| Guarantee node bootstrap completion | Node Readiness Controller | Node |
| External validation before Pod scheduling | Pod Scheduling Readiness | Pod |
| Receive traffic only after LB registration | Pod Readiness Gates | Pod |
| GPU/specialized hardware readiness | Node Readiness Controller | Node |
| Zero-downtime rolling deployments | Pod Readiness Gates | Pod |

AZ Avoidance Deployment Strategy (ARC Zonal Shift)

AWS Application Recovery Controller (ARC) Zonal Shift automatically or manually redirects traffic away from an AZ when issues are detected. EKS has supported ARC Zonal Shift since November 2024.

Enabling and Using ARC Zonal Shift:

# Enable Zonal Shift on the EKS cluster
aws eks update-cluster-config \
--name my-cluster \
--zonal-shift-config enabled=true

# Start manual Zonal Shift (redirect traffic away from a specific AZ)
aws arc-zonal-shift start-zonal-shift \
--resource-identifier arn:aws:eks:us-east-1:123456789012:cluster/my-cluster \
--away-from us-east-1b \
--expires-in 3h \
--comment "AZ-b impairment detected via Health Dashboard"

# Check Zonal Shift status
aws arc-zonal-shift list-zonal-shifts \
--resource-identifier arn:aws:eks:us-east-1:123456789012:cluster/my-cluster
Zonal Shift Limitations

A zonal shift expires after at most 3 days, but it can be extended before it expires. Enabling Zonal Autoshift allows AWS to automatically detect AZ-level impairments and shift traffic.
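To extend an active shift or end it early once the AZ recovers, the ARC CLI exposes update and cancel operations. A sketch (the shift ID is a placeholder; get it from `list-zonal-shifts`):

```shell
# Extend an active shift (expiry is reset, again capped at 3 days)
aws arc-zonal-shift update-zonal-shift \
  --zonal-shift-id <shift-id> \
  --expires-in 3d

# End the shift early once the AZ has recovered
aws arc-zonal-shift cancel-zonal-shift \
  --zonal-shift-id <shift-id>
```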

Emergency AZ Evacuation Script:

#!/bin/bash
# az-evacuation.sh - Safely evacuate all workloads from an impaired AZ
IMPAIRED_AZ=$1

if [ -z "$IMPAIRED_AZ" ]; then
echo "Usage: $0 <az-name>"
echo "Example: $0 us-east-1b"
exit 1
fi

echo "=== AZ Evacuation: ${IMPAIRED_AZ} ==="

# 1. Cordon nodes in the affected AZ (prevent new Pod scheduling)
echo "[Step 1] Cordoning nodes in ${IMPAIRED_AZ}..."
kubectl get nodes -l topology.kubernetes.io/zone=${IMPAIRED_AZ} -o name | \
xargs -I {} kubectl cordon {}

# 2. Drain nodes in the affected AZ (safely move existing Pods)
echo "[Step 2] Draining nodes in ${IMPAIRED_AZ}..."
kubectl get nodes -l topology.kubernetes.io/zone=${IMPAIRED_AZ} -o name | \
xargs -I {} kubectl drain {} \
--ignore-daemonsets \
--delete-emptydir-data \
--grace-period=30 \
--timeout=120s

# 3. Verify evacuation results
echo "[Step 3] Verifying evacuation..."
echo "Remaining pods in ${IMPAIRED_AZ} (DaemonSet pods are expected to remain):"
# Pod output does not include the AZ, so list pods per node in the affected zone
for NODE in $(kubectl get nodes -l topology.kubernetes.io/zone=${IMPAIRED_AZ} -o name | cut -d/ -f2); do
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=${NODE}
done

echo "=== Evacuation complete ==="

EBS AZ-Pinning Mitigation

EBS volumes are pinned to a specific AZ. If that AZ experiences a failure, Pods using the volume cannot be moved to another AZ.

WaitForFirstConsumer StorageClass (recommended):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: topology-aware-ebs
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true

WaitForFirstConsumer defers volume creation until the Pod is scheduled, ensuring the volume is created in the same AZ as the Pod.

EFS Cross-AZ Alternative: For workloads that require storage access even during an AZ failure, use Amazon EFS. EFS allows simultaneous access from all AZs, eliminating the AZ-Pinning problem.
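A minimal StorageClass sketch for the AWS EFS CSI driver, using dynamic provisioning via Access Points; the file system ID is a placeholder you would replace with your own:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cross-az-efs
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap              # dynamic provisioning via EFS Access Points
  fileSystemId: fs-0123456789abcdef0    # placeholder: your EFS file system ID
  directoryPerms: "700"
```

Pods mounting volumes from this class can be rescheduled into any AZ during a failure, since EFS mount targets exist in every AZ.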

| Storage | AZ Dependency | Behavior During Failure | Suitable Workloads |
|---|---|---|---|
| EBS (gp3) | Single AZ pinned | Inaccessible during AZ failure | Databases, stateful apps |
| EFS | Cross-AZ | Accessible during AZ failure | Shared files, CMS, logs |
| Instance Store | Node-bound | Data lost on node termination | Temporary cache, scratch |

Cross-AZ Cost Optimization

The primary cost factor of Multi-AZ deployments is Cross-AZ network traffic. Within the same region, inter-AZ data transfer costs $0.01/GB in each direction on AWS.

Istio Locality-Aware Routing can minimize Cross-AZ traffic:

apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: locality-aware-routing
spec:
  host: backend-service
  trafficPolicy:
    connectionPool:
      http:
        http2MaxRequests: 1000
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 10s
      baseEjectionTime: 30s
    loadBalancer:
      localityLbSetting:
        enabled: true
        # Prefer same AZ, failover to other AZs on failure
        distribute:
          - from: "us-east-1/us-east-1a/*"
            to:
              "us-east-1/us-east-1a/*": 80
              "us-east-1/us-east-1b/*": 10
              "us-east-1/us-east-1c/*": 10
          - from: "us-east-1/us-east-1b/*"
            to:
              "us-east-1/us-east-1b/*": 80
              "us-east-1/us-east-1a/*": 10
              "us-east-1/us-east-1c/*": 10
Cross-AZ Cost Savings

Applying Locality-Aware Routing keeps 80%+ of traffic within the same AZ, significantly reducing Cross-AZ data transfer costs. For high-traffic services, this can save thousands of dollars per month.
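The savings are easy to estimate with back-of-the-envelope numbers. A sketch assuming 100 TB/month of east-west traffic and the $0.01/GB-per-direction rate quoted above (the traffic volume and fractions are illustrative):

```python
def cross_az_cost(total_gb: float, cross_az_fraction: float,
                  rate_per_gb_per_direction: float = 0.01) -> float:
    """Monthly cross-AZ transfer cost; transfer is billed in both directions."""
    return total_gb * cross_az_fraction * rate_per_gb_per_direction * 2

# Random routing across 3 AZs sends ~2/3 of calls to another AZ;
# locality-aware routing keeps ~80% local (only 20% crosses AZs).
before = cross_az_cost(100_000, 2 / 3)
after = cross_az_cost(100_000, 0.20)
print(f"before: ${before:,.0f}/mo  after: ${after:,.0f}/mo")
# before: $1,333/mo  after: $400/mo
```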


3. Cell-Based Architecture

Cell-Based Architecture is an advanced resiliency pattern recommended by the AWS Well-Architected Framework. It partitions the system into independent Cells to isolate the Blast Radius of failures.

Cell Concepts and Design Principles

A Cell is a self-contained service unit that can operate independently. If one Cell experiences a failure, other Cells are not affected.

Core Cell Design Principles:

  1. Independence: Each Cell has its own data store, cache, and queue
  2. Isolation: No direct communication between Cells — coordination only through the Control Plane
  3. Homogeneity: All Cells run the same code and configuration
  4. Scalability: Growth is achieved by adding new Cells rather than scaling existing ones

Cell Implementation in EKS

| Implementation Approach | Namespace-Based Cell | Cluster-Based Cell |
|---|---|---|
| Isolation Level | Logical isolation (soft) | Physical isolation (hard) |
| Resource Isolation | ResourceQuota, LimitRange | Full cluster isolation |
| Network Isolation | NetworkPolicy | VPC/Subnet level |
| Blast Radius | Potential impact within the same cluster | Complete isolation between Cells |
| Operational Complexity | Low (single cluster) | High (multi-cluster) |
| Cost | Low | High (Control Plane cost x number of Cells) |
| Suitable For | Small to medium scale, internal services | Large scale, regulatory compliance |

Namespace-Based Cell Implementation Example:

# Cell-1 Namespace and ResourceQuota
apiVersion: v1
kind: Namespace
metadata:
  name: cell-1
  labels:
    cell-id: "cell-1"
    partition: "customers-a-h"
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: cell-1-quota
  namespace: cell-1
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "100"
---
# Cell-aware Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
  namespace: cell-1
  labels:
    cell-id: "cell-1"
spec:
  replicas: 4
  selector:
    matchLabels:
      app: api-server
      cell-id: "cell-1"
  template:
    metadata:
      labels:
        app: api-server
        cell-id: "cell-1"
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: api-server
              cell-id: "cell-1"
      containers:
        - name: api-server
          image: myapp/api-server:v2.1
          env:
            - name: CELL_ID
              value: "cell-1"
            - name: PARTITION_RANGE
              value: "A-H"
          resources:
            requests:
              cpu: "500m"
              memory: 1Gi
            limits:
              cpu: "1"
              memory: 2Gi

Cell Router Implementation

The Cell Router is the core component that routes incoming requests to the appropriate Cell. There are three implementation approaches.

1. Route 53 ARC Routing Control-Based:

Controls Cell routing at the DNS level. Health Checks and Routing Controls are configured for each Cell to block traffic at the DNS level when a Cell fails.

2. ALB Target Group-Based:

Uses ALB Weighted Target Groups to distribute traffic per Cell. Header-based routing rules implement per-customer Cell mapping.

3. Service Mesh-Based (Istio):

Uses Istio VirtualService header-based routing to implement Cell routing. This is the most flexible approach but adds Istio operational complexity.
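A sketch of approach 3, assuming an `x-cell-id` header is set upstream by the Cell Router; the header name, hosts, and gateway name are illustrative assumptions, not from the original:

```yaml
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: cell-router
spec:
  hosts:
    - api.example.com
  gateways:
    - public-gateway
  http:
    # Route each request to the Cell named in its x-cell-id header
    - match:
        - headers:
            x-cell-id:
              exact: cell-1
      route:
        - destination:
            host: api-server.cell-1.svc.cluster.local
    - match:
        - headers:
            x-cell-id:
              exact: cell-2
      route:
        - destination:
            host: api-server.cell-2.svc.cluster.local
```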

Blast Radius Isolation Strategies

| Strategy | Description | Isolation Criterion | Use Case |
|---|---|---|---|
| Customer Partitioning | Cell assignment based on customer ID hash | Customer groups | SaaS platforms |
| Geographic | Cell assignment based on geographic location | Region/Country | Global services |
| Capacity-Based | Dynamic assignment based on Cell capacity | Available resources | Services with high traffic variability |
| Tier-Based | Cell assignment based on customer tier | Service level | Premium/Standard separation |

Shuffle Sharding Pattern

Shuffle Sharding assigns each customer (or tenant) to a small number of Cells randomly selected from the total Cell pool. This ensures that a single Cell failure affects only a small subset of customers.

Principle: With 8 Cells and 2 Cells assigned per customer, there are C(8,2) = 28 possible combinations. If one specific Cell fails, only customers using that Cell are affected, and they automatically failover to their other assigned Cell.
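The combinatorics can be verified with a short Python sketch. The deterministic hash-based assignment below is an illustrative approach (function and tenant names are made up), not the document's router:

```python
import hashlib
from itertools import combinations

TOTAL_CELLS = 8
CELLS_PER_TENANT = 2

# All C(8,2) = 28 possible 2-cell shards
SHARDS = list(combinations(range(1, TOTAL_CELLS + 1), CELLS_PER_TENANT))

def assign_shard(tenant_id: str) -> list[str]:
    """Deterministically map a tenant to one of the 28 shards by hashing its ID."""
    h = int(hashlib.sha256(tenant_id.encode()).hexdigest(), 16)
    return [f"cell-{c}" for c in SHARDS[h % len(SHARDS)]]

# If cell-1 fails, only the shards containing it are impacted: 7 of 28 (25%),
# and those tenants still have their second assigned cell to fail over to.
impacted = [s for s in SHARDS if 1 in s]
print(len(SHARDS), len(impacted))  # 28 7
print(assign_shard("tenant-acme"))
```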

# Shuffle Sharding ConfigMap example
apiVersion: v1
kind: ConfigMap
metadata:
  name: shuffle-sharding-config
data:
  sharding-config.yaml: |
    totalCells: 8
    shardsPerTenant: 2
    tenantAssignments:
      tenant-acme:
        cells: ["cell-1", "cell-5"]
        primary: "cell-1"
      tenant-globex:
        cells: ["cell-3", "cell-7"]
        primary: "cell-3"
      tenant-initech:
        cells: ["cell-2", "cell-6"]
        primary: "cell-2"
Cell Architecture Trade-offs

Cell Architecture provides strong isolation but increases operational complexity and cost. Since each Cell has its own data store, additional design is required for data migration, Cross-Cell queries, and inter-Cell consistency. Consider adoption starting with services that require 99.99%+ SLA.


4. Multi-Cluster / Multi-Region

Multi-Cluster and Multi-Region strategies prepare for region-level failures.

Architecture Pattern Comparison

| Pattern | Description | RTO | RPO | Cost | Complexity | Suitable For |
|---|---|---|---|---|---|---|
| Active-Active | All regions serve traffic simultaneously | ~0 | ~0 | Very High | Very High | Global services, extreme SLA |
| Active-Passive | Only one region is active, others on standby | Min to Hours | Minutes | High | High | Most business applications |
| Regional Isolation | Independent operation per region, data isolation | Independent per region | N/A | Medium | Medium | Regulatory compliance, data sovereignty |
| Hub-Spoke | Central Hub manages, Spokes serve | Minutes | Sec to Min | Medium to High | Medium | Management efficiency focused |

Global Accelerator + EKS

AWS Global Accelerator leverages the AWS global network to route traffic to the EKS cluster in the region closest to the user.

ArgoCD Multi-Cluster GitOps

ArgoCD ApplicationSet Generator automates consistent deployments across multiple clusters.

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: multi-cluster-app
  namespace: argocd
spec:
  generators:
    # Dynamic deployment based on cluster labels
    - clusters:
        selector:
          matchLabels:
            environment: production
            resiliency-tier: "high"
  template:
    metadata:
      name: 'myapp-{{name}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/myorg/k8s-manifests.git
        targetRevision: main
        path: 'overlays/{{metadata.labels.region}}'
      destination:
        server: '{{server}}'
        namespace: production
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
          - CreateNamespace=true
        retry:
          limit: 5
          backoff:
            duration: 5s
            factor: 2
            maxDuration: 3m

Istio Multi-Cluster Federation

The Istio Multi-Primary setup operates independent Istio Control Planes in each cluster while providing cross-cluster service discovery and load balancing.

# Istio Locality-Aware Routing (Multi-Region)
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: multi-region-routing
spec:
  host: backend-service
  trafficPolicy:
    loadBalancer:
      localityLbSetting:
        enabled: true
        # Prefer same region, failover to other regions on failure
        failover:
          - from: us-east-1
            to: eu-west-1
          - from: eu-west-1
            to: us-east-1
          - from: ap-northeast-2
            to: ap-southeast-1
    outlierDetection:
      consecutive5xxErrors: 3
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
Istio API Version Note

In Istio 1.22+, both networking.istio.io/v1 and networking.istio.io/v1beta1 are supported. For new deployments, v1 is recommended, and existing v1beta1 configurations remain valid.


5. Application Resiliency Patterns

In addition to infrastructure-level resiliency, application-level fault tolerance patterns must be implemented.

PodDisruptionBudgets (PDB)

PDB ensures minimum Pod availability during Voluntary Disruptions — such as node drains, cluster upgrades, and Karpenter consolidation.

| Setting | Behavior | Suitable For |
|---|---|---|
| minAvailable: 2 | Always maintain at least 2 Pods | Services with few replicas (3-5) |
| minAvailable: "50%" | Maintain at least 50% of total | Services with many replicas |
| maxUnavailable: 1 | Allow at most 1 disruption at a time | Stability during rolling updates |
| maxUnavailable: "25%" | Allow up to 25% simultaneous disruption | When fast deployments are needed |
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api-server
---
# Percentage-based PDB suitable for large Deployments
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: worker-pdb
spec:
  maxUnavailable: "25%"
  selector:
    matchLabels:
      app: worker
PDB and Karpenter Interaction

Karpenter's Disruption budget (budgets: - nodes: "20%") and PDB work together. Karpenter respects PDB during node consolidation. If PDB is too strict (e.g., minAvailable equals the replica count), node drains may be permanently blocked, so use caution.
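The interaction can be reasoned about numerically; a small sketch (the function name is illustrative):

```python
def evictable(replicas: int, min_available: int) -> int:
    """How many pods a voluntary disruption (drain, consolidation) may evict at once."""
    return max(replicas - min_available, 0)

print(evictable(3, 2))  # 1: drains proceed, one pod at a time
print(evictable(3, 3))  # 0: PDB equals replica count, drains block indefinitely
```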

Graceful Shutdown

The Graceful Shutdown pattern safely completes in-flight requests and stops accepting new requests when a Pod is terminated.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-server
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: web
          image: myapp/web:v2.0
          ports:
            - containerPort: 8080
          lifecycle:
            preStop:
              exec:
                # Sleep to wait for Endpoint removal (prevent race between Kubelet
                # and Endpoint Controller); the kubelet sends SIGTERM to start the
                # application's Graceful Shutdown only after preStop completes.
                command: ["/bin/sh", "-c", "sleep 5"]
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            periodSeconds: 5
            failureThreshold: 1

Graceful Shutdown Timing Design:

Why preStop sleep Is Needed

When Kubernetes deletes a Pod, the preStop Hook execution and Endpoint removal occur asynchronously. Adding a 5-second sleep to the preStop ensures that the Endpoint Controller has time to remove the Pod IP from the Service, preventing traffic from being routed to a Pod that is shutting down.

Circuit Breaker (Istio DestinationRule)

The Circuit Breaker blocks requests to a failing service to prevent Cascading Failures. It is implemented using Istio's DestinationRule.

# Istio 1.22+: Both v1 and v1beta1 are supported
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: backend-circuit-breaker
spec:
  host: backend-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
        connectTimeout: 5s
      http:
        http1MaxPendingRequests: 50
        http2MaxRequests: 100
        maxRequestsPerConnection: 10
        maxRetries: 3
    outlierDetection:
      # Eject an instance from the pool after 5 consecutive 5xx errors
      consecutive5xxErrors: 5
      # Check instance health every 30 seconds
      interval: 30s
      # Minimum isolation time for ejected instances
      baseEjectionTime: 30s
      # Allow ejection of up to 50% of total instances
      maxEjectionPercent: 50

Retry / Timeout (Istio VirtualService)

apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: backend-retry
spec:
  hosts:
    - backend-service
  http:
    - route:
        - destination:
            host: backend-service
      timeout: 10s
      retries:
        attempts: 3
        perTryTimeout: 3s
        retryOn: "5xx,reset,connect-failure,retriable-4xx"
        retryRemoteLocalities: true

Retry Best Practices:

| Setting | Recommended Value | Rationale |
|---|---|---|
| attempts | 2-3 | Too many retries amplify load |
| perTryTimeout | 1/3 of total timeout | Allows 3 retries to complete within the total timeout |
| retryOn | 5xx,connect-failure | Only retry on transient failures |
| retryRemoteLocalities | true | Also retry on instances in other AZs |
Rate Limiting Adoption Caution

Rate Limiting is a core element of resiliency alongside Circuit Breaker and Retry, but misconfiguration can block legitimate traffic. When implementing using Istio's EnvoyFilter or an external Rate Limiter (e.g., Redis-based), always adopt in phases: monitoring mode, then warning mode, then blocking mode.


6. Chaos Engineering

Chaos Engineering is a practical methodology for validating system resiliency in production environments. Test "when everything is normal" to prepare for "when failures occur."

AWS Fault Injection Service (FIS)

AWS FIS is a managed Chaos Engineering service that injects faults into AWS services such as EC2, EKS, and RDS.

Scenario 1: Pod Termination (Application Resilience Test)

{
  "description": "EKS Pod termination test",
  "targets": {
    "eks-pods": {
      "resourceType": "aws:eks:pod",
      "resourceTags": {
        "app": "critical-api"
      },
      "selectionMode": "COUNT(3)",
      "parameters": {
        "clusterIdentifier": "arn:aws:eks:us-east-1:123456789012:cluster/prod-cluster",
        "namespace": "production"
      }
    }
  },
  "actions": {
    "terminate-pods": {
      "actionId": "aws:eks:pod-delete",
      "targets": {
        "Pods": "eks-pods"
      }
    }
  },
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:HighErrorRate"
    }
  ]
}

Scenario 2: AZ Failure Simulation

{
  "description": "Simulate AZ failure for EKS",
  "targets": {
    "eks-nodes-az1a": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": {
        "kubernetes.io/cluster/my-cluster": "owned"
      },
      "filters": [
        {
          "path": "Placement.AvailabilityZone",
          "values": ["us-east-1a"]
        }
      ],
      "selectionMode": "ALL"
    }
  },
  "actions": {
    "stop-instances": {
      "actionId": "aws:ec2:stop-instances",
      "parameters": {
        "startInstancesAfterDuration": "PT10M"
      },
      "targets": {
        "Instances": "eks-nodes-az1a"
      }
    }
  },
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:CriticalServiceDown"
    }
  ]
}

Scenario 3: Network Latency Injection

{
  "description": "Inject network latency to EKS nodes",
  "targets": {
    "eks-nodes": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": {
        "kubernetes.io/cluster/my-cluster": "owned",
        "app-tier": "backend"
      },
      "selectionMode": "PERCENT(50)"
    }
  },
  "actions": {
    "inject-latency": {
      "actionId": "aws:ssm:send-command",
      "parameters": {
        "documentArn": "arn:aws:ssm:us-east-1::document/AWSFIS-Run-Network-Latency",
        "documentParameters": "{\"DurationSeconds\":\"300\",\"DelayMilliseconds\":\"200\",\"Interface\":\"eth0\"}",
        "duration": "PT5M"
      },
      "targets": {
        "Instances": "eks-nodes"
      }
    }
  }
}

Litmus Chaos on EKS

Litmus is a CNCF incubating project and a Kubernetes-native Chaos Engineering framework.

Installation:

# Install Litmus ChaosCenter
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update

helm install litmus litmuschaos/litmus \
--namespace litmus --create-namespace \
--set portal.frontend.service.type=LoadBalancer

ChaosEngine Example (Pod Delete):

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-delete-chaos
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: "app=api-server"
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: CHAOS_INTERVAL
              value: "10"
            - name: FORCE
              value: "false"
            - name: PODS_AFFECTED_PERC
              value: "50"

Chaos Mesh

Chaos Mesh is a CNCF incubating project and a Kubernetes-dedicated Chaos Engineering platform that supports a wide variety of fault types.

Installation:

# Install Chaos Mesh
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update

helm install chaos-mesh chaos-mesh/chaos-mesh \
--namespace chaos-mesh --create-namespace \
--set chaosDaemon.runtime=containerd \
--set chaosDaemon.socketPath=/run/containerd/containerd.sock

NetworkChaos Example (Network Partition):

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-partition
  namespace: chaos-mesh
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      "app": "frontend"
  direction: both
  target:
    selector:
      namespaces:
        - production
      labelSelectors:
        "app": "backend"
    mode: all
  duration: "5m"
  scheduler:
    cron: "@every 24h"

PodChaos Example (Pod Kill):

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-test
  namespace: chaos-mesh
spec:
  action: pod-kill
  mode: fixed-percent
  value: "30"
  selector:
    namespaces:
      - production
    labelSelectors:
      "app": "api-server"
  duration: "1m"
  gracePeriod: 0

Chaos Engineering Tool Comparison

| Feature | AWS FIS | Litmus Chaos | Chaos Mesh |
|---|---|---|---|
| Type | Managed service | Open source (CNCF) | Open source (CNCF) |
| Scope | AWS infrastructure + K8s | Kubernetes only | Kubernetes only |
| Fault Types | EC2, EKS, RDS, network | Pod, Node, network, DNS | Pod, network, I/O, time, JVM |
| AZ Failure Simulation | Native support | Limited (Pod/Node level) | Limited (Pod/Node level) |
| Dashboard | AWS Console | Litmus Portal (web UI) | Chaos Dashboard (web UI) |
| Cost | Pay per execution | Free (infrastructure cost only) | Free (infrastructure cost only) |
| Stop Condition | CloudWatch Alarm integration | Manual / API | Manual / API |
| Operational Complexity | Low | Medium | Medium |
| GitOps Integration | CloudFormation / CDK | CRD-based (ArgoCD compatible) | CRD-based (ArgoCD compatible) |
| Recommended For | Infrastructure-level fault testing | K8s-native testing | When fine-grained fault injection is needed |
Tool Selection Guide

Start with AWS FIS for infrastructure-level faults (AZ, network), then use Litmus or Chaos Mesh for fine-grained application-level faults. A hybrid approach is recommended. The Stop Condition feature in AWS FIS (based on CloudWatch Alarms) is essential for safe testing in production environments.

Game Day Runbook Template

A Game Day is an exercise where the team gathers to execute planned failure scenarios and discover weaknesses in systems and processes.

5-Phase Game Day Execution Framework:

Game Day Automation Script:

#!/bin/bash
# game-day.sh - Game Day execution automation
set -euo pipefail

CLUSTER_NAME=$1
SCENARIO=$2
NAMESPACE=${3:-production}

echo "============================================"
echo " Game Day: ${SCENARIO}"
echo " Cluster: ${CLUSTER_NAME}"
echo " Namespace: ${NAMESPACE}"
echo " Time: $(date -u '+%Y-%m-%d %H:%M:%S UTC')"
echo "============================================"

# Phase 1: Record Steady State
echo ""
echo "[Phase 1] Recording Steady State..."
echo "--- Pod Status ---"
kubectl get pods -n ${NAMESPACE} -o wide | head -20

echo "--- Node Status ---"
kubectl get nodes -o custom-columns=\
NAME:.metadata.name,\
STATUS:.status.conditions[-1].type,\
AZ:.metadata.labels.topology\\.kubernetes\\.io/zone

echo "--- Service Endpoints ---"
kubectl get endpoints -n ${NAMESPACE}

# Phase 2: Inject Failure (per scenario)
echo ""
echo "[Phase 2] Injecting failure: ${SCENARIO}..."

case ${SCENARIO} in
"az-failure")
echo "Simulating AZ failure with ARC Zonal Shift..."
# Execute ARC Zonal Shift (1 hour)
aws arc-zonal-shift start-zonal-shift \
--resource-identifier arn:aws:eks:us-east-1:$(aws sts get-caller-identity --query Account --output text):cluster/${CLUSTER_NAME} \
--away-from us-east-1a \
--expires-in 1h \
--comment "Game Day: AZ failure simulation"
;;

"pod-delete")
echo "Deleting 30% of pods in ${NAMESPACE}..."
TOTAL=$(kubectl get pods -n ${NAMESPACE} -l app=api-server --no-headers | wc -l)
DELETE_COUNT=$(( TOTAL * 30 / 100 ))
DELETE_COUNT=$(( DELETE_COUNT < 1 ? 1 : DELETE_COUNT ))
kubectl get pods -n ${NAMESPACE} -l app=api-server -o name | \
shuf | head -n ${DELETE_COUNT} | \
xargs kubectl delete -n ${NAMESPACE}
;;

"node-drain")
echo "Draining a random node..."
NODE=$(kubectl get nodes --no-headers | shuf -n 1 | awk '{print $1}')
kubectl cordon ${NODE}
kubectl drain ${NODE} --ignore-daemonsets --delete-emptydir-data --timeout=120s
;;

*)
echo "Unknown scenario: ${SCENARIO}"
echo "Available: az-failure, pod-delete, node-drain"
exit 1
;;
esac

# Phase 3: Observe Recovery
echo ""
echo "[Phase 3] Observing recovery..."
echo "Waiting 60 seconds for recovery..."
sleep 60

echo "--- Post-Failure Pod Status ---"
kubectl get pods -n ${NAMESPACE} -o wide | head -20

echo "--- Pod Restart Counts ---"
kubectl get pods -n ${NAMESPACE} -o custom-columns=\
NAME:.metadata.name,\
RESTARTS:.status.containerStatuses[0].restartCount,\
STATUS:.status.phase

echo ""
echo "============================================"
echo " Game Day Phase 3 Complete"
echo " Review results and proceed to analysis"
echo "============================================"

7. Resiliency Checklist & References

Resiliency Implementation Checklist

Use the following checklist to assess your current resiliency level and identify the next implementation steps.

Level 1 — Basic

| Item | Description | Done |
|---|---|---|
| Liveness/Readiness Probe configuration | Appropriate Probes configured for all Deployments | [ ] |
| Resource Requests/Limits configuration | CPU and Memory resource constraints specified | [ ] |
| PodDisruptionBudget configuration | Minimum available Pod count guaranteed | [ ] |
| Graceful Shutdown implementation | preStop Hook + terminationGracePeriodSeconds | [ ] |
| Startup Probe configuration | Initialization protection for slow-starting applications | [ ] |
| Auto-restart policy | Verify restartPolicy: Always | [ ] |

Level 2 — Multi-AZ

| Item | Description | Done |
|---|---|---|
| Topology Spread Constraints | Even Pod distribution across AZs | [ ] |
| Multi-AZ Karpenter NodePool | Node provisioning across 3+ AZs | [ ] |
| WaitForFirstConsumer StorageClass | Prevent EBS AZ-Pinning | [ ] |
| ARC Zonal Shift enabled | Automatic traffic shift on AZ failure | [ ] |
| Cross-AZ traffic optimization | Locality-Aware routing configured | [ ] |
| AZ Evacuation runbook prepared | Emergency AZ evacuation procedure documented | [ ] |

Level 3 — Cell-Based

| Item | Description | Done |
|---|---|---|
| Cell boundary definition | Cell configuration based on Namespace or Cluster | [ ] |
| Cell Router implementation | Request routing to the appropriate Cell | [ ] |
| Inter-Cell isolation verification | Isolation via NetworkPolicy or VPC level | [ ] |
| Shuffle Sharding applied | Diversified Cell assignment per tenant | [ ] |
| Cell Health Monitoring | Dashboard for monitoring individual Cell health | [ ] |
| Cell Failover testing | Cell failure verified via Chaos Engineering | [ ] |

Level 4 — Multi-Region

| Item | Description | Done |
|---|---|---|
| Multi-Region architecture design | Active-Active or Active-Passive decision | [ ] |
| Global Accelerator configuration | Inter-region traffic routing | [ ] |
| Data replication strategy | Cross-Region data synchronization | [ ] |
| ArgoCD Multi-Cluster GitOps | ApplicationSet-based multi-cluster deployment | [ ] |
| Multi-Region Chaos Test | Region failure simulation Game Day | [ ] |
| RTO/RPO measurement and validation | Actual recovery time/data loss verified against targets | [ ] |

Cost Optimization Tips

| Optimization Area | Strategy | Expected Savings |
|---|---|---|
| Cross-AZ traffic | Istio Locality-Aware routing to keep 80%+ traffic within the same AZ | 60-80% reduction in inter-AZ transfer cost |
| Spot instances | Use Spot for non-critical workloads (Karpenter capacity-type mix) | 60-90% reduction in compute cost |
| Cell utilization | Design Cell sizes appropriately to minimize resource waste | 20-40% reduction in overprovisioning |
| Multi-Region | Run the Passive region at minimum capacity in Active-Passive | 50-70% reduction in Passive region cost |
| Karpenter consolidation | WhenEmptyOrUnderutilized policy for automatic removal of unused nodes | Eliminate idle resource cost |
| Selective EFS usage | Use EFS only when Cross-AZ is required; use EBS gp3 otherwise | Storage cost savings |
Cost vs Resiliency Trade-off

Higher resiliency levels come with increased costs. Multi-Region Active-Active requires more than double the infrastructure cost compared to a single region. Balance business requirements (SLA, regulatory) with cost to select the appropriate resiliency level. Not every service needs to be at Level 4.

External References