EKS Debugging Guide

Created: 2026-02-10 | Updated: 2026-04-07

Baseline environment: EKS 1.32+, kubectl 1.30+, AWS CLI v2

1. Overview

EKS operational issues span several layers: the control plane, nodes, networking, workloads, storage, and observability. This guide helps SREs, DevOps engineers, and platform teams diagnose and resolve them systematically.

Commands and examples are written to run as-is once placeholders such as <cluster-name> are filled in, and the decision trees and flowcharts support quick decisions during an incident.

EKS Debugging Layers

Debugging Approach Methodology

Two approaches are available for EKS problem diagnosis.

Approach	Description	Suitable Situations
Top-down (symptom → cause)	Start from user-reported symptoms and trace back to causes	Immediate response to service outages and performance degradation
Bottom-up (infra → app)	Check layers sequentially starting from infrastructure	Preventive checks, validation after cluster migration

Generally Recommended Order

For production incidents, a Top-down approach is recommended. First identify the symptom (Section 2 Incident Triage), then navigate to the debugging section for that layer.

2. Incident Triage (Rapid Fault Determination)

First 5 Minutes Checklist

When an incident occurs, the most important actions are scope determination and initial response.

30 seconds: Initial diagnosis

# Check cluster status
aws eks describe-cluster --name <cluster-name> --query 'cluster.status' --output text

# Check node status
kubectl get nodes

# Check abnormal Pods
kubectl get pods --all-namespaces | grep -v Running | grep -v Completed

2 minutes: Scope determination

# Check recent events (all namespaces)
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -20

# Aggregate Pod states for a specific namespace
kubectl get pods -n <namespace> --no-headers | awk '{print $3}' | sort | uniq -c | sort -rn

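# Check distribution of abnormal Pods by node
# (custom-columns avoids miscounted columns when RESTARTS prints as "3 (5m ago)")
kubectl get pods --all-namespaces --field-selector=status.phase!=Running,status.phase!=Succeeded \
  -o custom-columns=NODE:.spec.nodeName --no-headers | sort | uniq -c | sort -rn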
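The per-node counts are easier to act on when compared against the total. The sketch below is one possible heuristic and is not part of the original checklist: it reads one node name per abnormal Pod and reports whether the failures are concentrated on a single node. The function name scope_hint and the 80% threshold are assumptions chosen for illustration.

```shell
# scope_hint: read one node name per line on stdin (one line per abnormal Pod)
# and print a rough scope hint. The 80% threshold is an arbitrary assumption.
scope_hint() {
  awk '
    $1 == "" || $1 == "<none>" { next }     # skip blanks and unscheduled Pods
    {
      if (!($1 in count)) nodes++           # track distinct nodes
      count[$1]++; total++
    }
    END {
      if (total == 0) { print "no abnormal pods"; exit }
      top = ""; max = 0
      for (n in count) if (count[n] > max) { max = count[n]; top = n }
      if (max * 100 >= total * 80)
        printf "node-scoped: %s has %d of %d abnormal pods\n", top, max, total
      else
        printf "spread: %d abnormal pods across %d nodes\n", total, nodes
    }'
}

# Usage (assumes cluster access):
#   kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded \
#     -o custom-columns=NODE:.spec.nodeName --no-headers | scope_hint
```

A "node-scoped" result suggests starting with node debugging; a "spread" result points toward the workload, the network, or the cluster as a whole.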

5 minutes: Initial response

# Detailed information on the problem Pod
kubectl describe pod <pod-name> -n <namespace>

# Previous container logs (for CrashLoopBackOff)
kubectl logs <pod-name> -n <namespace> --previous

# Check resource usage
kubectl top nodes
kubectl top pods -n <namespace> --sort-by=cpu

Scope Determination Decision Tree

AZ Failure Detection

AWS Health API Requirements

The aws health describe-events API is only available on AWS Business or Enterprise Support plans. Without one of these plans, check the AWS Health Dashboard in the console or capture Health events with EventBridge rules.

# Check EKS/EC2-related events via AWS Health API (Business/Enterprise Support plan required)
aws health describe-events \
  --filter '{"services":["EKS","EC2"],"eventStatusCodes":["open"]}' \
  --region us-east-1

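# Alternative: AZ failure detection without a Support plan: create an EventBridge rule
# (the rule delivers nothing until a target, for example an SNS topic, is attached with put-targets)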
aws events put-rule \
  --name "aws-health-eks-events" \
  --event-pattern '{
    "source": ["aws.health"],
    "detail-type": ["AWS Health Event"],
    "detail": {
      "service": ["EKS", "EC2"],
      "eventTypeCategory": ["issue"]
    }
  }'

# Aggregate abnormal Pods by AZ (only Pods scheduled to nodes)
kubectl get pods --all-namespaces -o json | jq -r '
  .items[] |
  select(.status.phase != "Running" and .status.phase != "Succeeded") |
  select(.spec.nodeName != null) |
  .spec.nodeName
' | sort -u | while read node; do
  zone=$(kubectl get node "$node" -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}' 2>/dev/null)
  [ -n "$zone" ] && echo "$zone"
done | sort | uniq -c | sort -rn

# Check ARC Zonal Shift status
aws arc-zonal-shift list-zonal-shifts \
  --resource-identifier arn:aws:eks:<region>:<account-id>:cluster/<cluster-name>
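The per-AZ aggregation above calls kubectl once per node, which can be slow when the API server is already under stress. As a hedged alternative, the sketch below fetches the node-to-zone mapping once and joins it locally. The function name count_by_zone and the temporary file path are assumptions for illustration.

```shell
# count_by_zone MAP_FILE: MAP_FILE holds "<node> <zone>" per line; stdin holds
# one node name per abnormal Pod. Prints "<count> <zone>", highest count first.
count_by_zone() {
  awk -v map="$1" '
    FILENAME == map { zone[$1] = $2; next }   # first input: node -> zone map
    ($1 in zone)    { print zone[$1] }        # stdin: translate node to zone
  ' "$1" - | sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}

# Usage (assumes cluster access):
#   kubectl get nodes --no-headers \
#     -o custom-columns='NAME:.metadata.name,ZONE:.metadata.labels.topology\.kubernetes\.io/zone' \
#     > /tmp/node-zones
#   kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded \
#     -o custom-columns=NODE:.spec.nodeName --no-headers | count_by_zone /tmp/node-zones
```

Unscheduled Pods (node shown as <none>) are ignored because they have no zone.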

AZ Failure Response Using ARC Zonal Shift

# Enable Zonal Shift in EKS
aws eks update-cluster-config \
  --name <cluster-name> \
  --zonal-shift-config enabled=true

# Start manual Zonal Shift (move traffic away from failed AZ)
# --away-from takes an AZ ID (for example use1-az1), not an AZ name such as us-east-1a
aws arc-zonal-shift start-zonal-shift \
  --resource-identifier arn:aws:eks:<region>:<account-id>:cluster/<cluster-name> \
  --away-from <az-id> \
  --expires-in 3h \
  --comment "AZ impairment detected"

Zonal Shift Caveats

A Zonal Shift lasts at most 3 days, and its expiry can be extended. Starting a shift stops new traffic from reaching Pods on nodes in that AZ, so first verify that the other AZs have enough capacity to absorb the load.
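A rough way to reason about capacity: if load is spread evenly across N AZs, removing one raises the load on each remaining AZ by a factor of N / (N - 1). The sketch below applies that arithmetic to a utilization percentage. The even-distribution assumption and the function name post_shift_util are illustrative; real clusters are rarely perfectly balanced.

```shell
# post_shift_util AZ_COUNT CURRENT_UTIL_PCT
# Prints the expected per-AZ utilization (integer percent, rounded up) after one
# AZ is removed, assuming load was spread evenly across AZ_COUNT zones.
post_shift_util() {
  awk -v n="$1" -v u="$2" 'BEGIN {
    if (n < 2) { print "error: need at least 2 AZs"; exit 1 }
    v = u * n / (n - 1)
    r = int(v); if (r < v) r++               # round up to a whole percent
    print r
  }'
}

# Example: three AZs at 60% utilization each
#   post_shift_util 3 60
```

A result above 100 means the remaining AZs cannot absorb the load without scaling out first.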

Zonal Shift Only Blocks Traffic

ARC Zonal Shift only changes Load Balancer / Service-level traffic routing.

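ARC Zonal Shift Impact Scope

A Zonal Shift changes traffic routing only. The impact on each layer is as follows.

Layer	Zonal Shift Impact	Adjusted Automatically	Manual Action
ALB / NLB	Removed from the Target Group for that AZ	Yes	-
EKS Service (kube-proxy)	Endpoint weights for that AZ removed	Yes	-
Existing nodes	Keep running	No	Move Pods with kubectl drain
Existing Pods	Only traffic is blocked; the Pods keep running	No	Rescheduled automatically on drain
Karpenter NodePool	AZ settings unchanged; new nodes can still be created in that AZ	No	Modify NodePool requirements
ASG (Managed Node Group)	Subnet list unchanged; scale-out into that AZ is still possible	No	Modify ASG subnets (console/IaC)
EBS volumes	Pinned to the AZ; cannot be moved	No	Snapshot → restore in another AZ
EFS Mount Target	Mount Targets in other AZs are used automatically	Yes	-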

AZ settings for Karpenter NodePools and ASGs (Managed Node Groups) are not updated automatically. Therefore, complete AZ evacuation requires additional actions:

Start Zonal Shift → Block new traffic (automatic)
Drain nodes in that AZ → Move existing Pods
Remove the AZ from the Karpenter NodePool or ASG subnets → Prevent new node provisioning

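# 1. Identify and drain nodes in the failed AZ
# Cordon every node first so evicted Pods cannot land on another node in the same AZ
nodes=$(kubectl get nodes -l topology.kubernetes.io/zone=us-east-1a -o name)
for node in $nodes; do kubectl cordon "$node"; done
for node in $nodes; do
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --grace-period=60
done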

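# 2. Temporarily exclude the AZ from the Karpenter NodePool (modify requirements)
# Note: a merge patch replaces the whole requirements list, so include every
# existing requirement (capacity type, architecture, etc.) alongside the zone entry
kubectl patch nodepool default --type=merge -p '{
  "spec": {"template": {"spec": {"requirements": [
    {"key": "topology.kubernetes.io/zone", "operator": "In", "values": ["us-east-1b", "us-east-1c"]}
  ]}}}
}'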

# 3. Managed Node Groups require ASG subnet changes (perform via console or IaC)

After the Zonal Shift is lifted, the above changes must be reverted.
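As a minimal sketch of the node-side revert, the function below uncordons every node in a zone. The function name uncordon_zone is hypothetical. Restoring the Karpenter NodePool requirements and ASG subnets is not covered here and is usually best done by re-applying the original manifests or IaC.

```shell
# uncordon_zone ZONE: uncordon every node labelled with the given zone.
uncordon_zone() {
  kubectl get nodes -l "topology.kubernetes.io/zone=$1" -o name |
  while read -r node; do
    # </dev/null keeps the inner command from consuming the node list on stdin
    kubectl uncordon "$node" </dev/null
  done
}

# Usage (assumes cluster access):
#   uncordon_zone us-east-1a
```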

CloudWatch Anomaly Detection

# Create an anomaly detection model for Pod restart counts (an alarm that uses the model is created separately)
aws cloudwatch put-anomaly-detector \
  --single-metric-anomaly-detector '{
    "Namespace": "ContainerInsights",
    "MetricName": "pod_number_of_container_restarts",
    "Dimensions": [
      {"Name": "ClusterName", "Value": "<cluster-name>"},
      {"Name": "Namespace", "Value": "production"}
    ],
    "Stat": "Average"
  }'
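An anomaly detector on its own does not notify anyone; an alarm has to reference its band. The sketch below shows one way to create such an alarm with put-metric-alarm. The alarm name, the 300-second period, the band width of 2, and the 3 evaluation periods are illustrative assumptions, and no alarm action is attached, so add --alarm-actions with an SNS topic ARN to receive notifications.

```shell
# create_restart_anomaly_alarm CLUSTER NAMESPACE
# Creates an alarm that fires when restarts exceed the upper anomaly band.
# Alarm name, period, band width and evaluation periods are assumptions.
create_restart_anomaly_alarm() {
  cluster=$1 namespace=$2
  aws cloudwatch put-metric-alarm \
    --alarm-name "pod-restarts-anomaly-$namespace" \
    --comparison-operator GreaterThanUpperThreshold \
    --evaluation-periods 3 \
    --threshold-metric-id ad1 \
    --treat-missing-data notBreaching \
    --metrics "[
      {\"Id\": \"m1\", \"ReturnData\": true, \"MetricStat\": {
        \"Metric\": {
          \"Namespace\": \"ContainerInsights\",
          \"MetricName\": \"pod_number_of_container_restarts\",
          \"Dimensions\": [
            {\"Name\": \"ClusterName\", \"Value\": \"$cluster\"},
            {\"Name\": \"Namespace\", \"Value\": \"$namespace\"}
          ]},
        \"Period\": 300, \"Stat\": \"Average\"}},
      {\"Id\": \"ad1\", \"ReturnData\": true,
       \"Expression\": \"ANOMALY_DETECTION_BAND(m1, 2)\"}
    ]"
}

# Usage (assumes AWS credentials with CloudWatch permissions):
#   create_restart_anomaly_alarm <cluster-name> production
```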

Incident Response Escalation Matrix

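Initial response time and escalation path by severity

Severity	Initial Response Time	Escalation	Example Incidents
P1 - Critical	Within 5 minutes	Immediate on-call + manager	Control plane failure, all nodes NotReady
P2 - High	Within 15 minutes	On-call team	Failure of a specific AZ, many Pods in CrashLoopBackOff
P3 - Medium	Within 1 hour	Owning team	HPA scaling failures, intermittent timeouts
P4 - Low	Within 4 hours	Backlog	Single Pod restarts, non-production issues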
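Paging or ticketing scripts sometimes need these response times in machine-readable form. The lookup below is a small sketch that encodes the initial response times from the matrix in minutes. The function name response_deadline is hypothetical.

```shell
# response_deadline SEVERITY: print the initial response time in minutes for a
# severity level from the escalation matrix. Unknown levels return status 1.
response_deadline() {
  case "$1" in
    P1) echo 5 ;;
    P2) echo 15 ;;
    P3) echo 60 ;;
    P4) echo 240 ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```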

See High-Availability Architecture Guide

For architecture-level fault recovery strategies (TopologySpreadConstraints, PodDisruptionBudget, multi-AZ deployment, etc.), see the EKS Resiliency Guide.

10. Debugging Quick Reference

Error Pattern → Cause → Resolution Quick Reference Table

Error pattern → cause → resolution (18 entries)

#	Error Pattern	Cause	Resolution
1	CrashLoopBackOff	App crash, misconfiguration, unmet dependencies	kubectl logs --previous; check app configuration and environment variables
2	ImagePullBackOff	Image does not exist, registry authentication failure	Verify the image name/tag; configure imagePullSecrets
3	OOMKilled	Memory limits exceeded	Raise memory limits; check the app for memory leaks
4	Pending (unschedulable)	Insufficient resources, nodeSelector mismatch	Check events in kubectl describe pod; check node capacity and labels
5	CreateContainerConfigError	ConfigMap/Secret missing	Verify that the referenced ConfigMap/Secret exists
6	Node NotReady	kubelet failure, resource pressure	Connect to the node via SSM; systemctl status kubelet
7	FailedAttachVolume	EBS volume attached to another node	Delete the previous Pod; wait for the volume to detach (~6 minutes)
8	FailedMount	EFS mount target/SG misconfiguration	Verify that the mount target exists and TCP 2049 is allowed
9	NetworkNotReady	VPC CNI not started	kubectl logs -n kube-system -l k8s-app=aws-node
10	DNS resolution failed	CoreDNS failure	Check CoreDNS Pod status/logs; kubectl rollout restart
11	Unauthorized / 403	Insufficient RBAC permissions, aws-auth misconfiguration	aws sts get-caller-identity; check aws-auth/Access Entry
12	connection refused	No Service Endpoints, port mismatch	kubectl get endpoints; check selector and ports
13	Evicted	Node resource pressure (DiskPressure, etc.)	Clean up node disk; adjust Pod resource requests
14	FailedScheduling: Insufficient cpu/memory	Insufficient cluster capacity	Raise Karpenter NodePool limits; add nodes
15	Terminating (stuck)	Finalizer not completed, preStop hook delay	Check finalizers; if necessary, --force --grace-period=0
16	Back-off pulling image	Pull timeout for large images	Optimize the image; use a same-region registry such as ECR
17	readiness probe failed	Slow app startup, health check endpoint error	Add a startupProbe; adjust probe timeouts
18	Too many pods	Per-node maximum Pod count exceeded	Check the max-pods setting; enable Prefix Delegation
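Part of the quick reference can be encoded as a lookup for on-call tooling. The sketch below covers a few of the entries and prints the first command to run for a given status or error pattern. The function name first_step and the <pod> and <namespace> placeholders are illustrative, and the fallback to kubectl describe pod is an assumption rather than a rule from the table.

```shell
# first_step PATTERN: print a first diagnostic command for an error pattern.
# Placeholders such as <pod> and <namespace> must be filled in by the operator.
first_step() {
  case "$1" in
    CrashLoopBackOff)
      echo 'kubectl logs <pod> -n <namespace> --previous' ;;
    ImagePullBackOff|Pending)
      echo 'kubectl describe pod <pod> -n <namespace>' ;;
    CreateContainerConfigError)
      echo 'kubectl get configmap,secret -n <namespace>' ;;
    NetworkNotReady)
      echo 'kubectl logs -n kube-system -l k8s-app=aws-node' ;;
    "connection refused")
      echo 'kubectl get endpoints -n <namespace>' ;;
    *)
      # Assumed default: events in describe output are a reasonable start
      echo 'kubectl describe pod <pod> -n <namespace>' ;;
  esac
}
```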

Essential kubectl Command Cheat Sheet

Query and diagnosis

# See all resource status at a glance
kubectl get all -n <namespace>

# Filter only abnormal Pods
kubectl get pods --all-namespaces --field-selector=status.phase!=Running,status.phase!=Succeeded

# Pod details (including events)
kubectl describe pod <pod-name> -n <namespace>

# Namespace events (sorted by time, most recent last)
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

# Resource usage
kubectl top nodes
kubectl top pods -n <namespace> --sort-by=memory

Log inspection

# Current container logs
kubectl logs <pod-name> -n <namespace>

# Previous (crashed) container logs
kubectl logs <pod-name> -n <namespace> --previous

# Specific container in a multi-container Pod
kubectl logs <pod-name> -n <namespace> -c <container-name>

# Real-time log streaming
kubectl logs -f <pod-name> -n <namespace>

# Logs from multiple Pods by label
kubectl logs -l app=<app-name> -n <namespace> --tail=50

Debugging

# Debug with an ephemeral container
kubectl debug <pod-name> -n <namespace> -it --image=nicolaka/netshoot --target=<container-name>

# Node debugging
kubectl debug node/<node-name> -it --image=ubuntu

# Execute a command inside a Pod
kubectl exec -it <pod-name> -n <namespace> -- <command>

Deployment management

# Rollout status/history/rollback
kubectl rollout status deployment/<name> -n <namespace>
kubectl rollout history deployment/<name> -n <namespace>
kubectl rollout undo deployment/<name> -n <namespace>

# Restart a Deployment
kubectl rollout restart deployment/<name> -n <namespace>

# Node maintenance (drain)
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
kubectl uncordon <node-name>
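kubectl drain evicts Pods through the eviction API, so a PodDisruptionBudget that currently allows zero disruptions makes the drain wait and retry. The sketch below lists such budgets before draining. The function name blocking_pdbs is hypothetical, and the input format is the three-column output of the kubectl query shown in the usage comment.

```shell
# blocking_pdbs: read "<namespace> <name> <allowed-disruptions>" lines on stdin
# and print the PodDisruptionBudgets that currently allow no evictions.
blocking_pdbs() {
  awk 'NF >= 3 && $3 == 0 { print $1 "/" $2 }'
}

# Usage (assumes cluster access):
#   kubectl get pdb -A --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed \
#     | blocking_pdbs
```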

Recommended Tool Matrix

Scenario	Tool	Description
Network debugging	netshoot	Container bundled with networking tools
Node resource visualization	eks-node-viewer	Terminal-based node resource monitoring
Container runtime debugging	crictl	containerd debugging CLI
Log analysis	CloudWatch Logs Insights	AWS-native log query
Metric queries	Prometheus / Grafana	PromQL-based metric analysis
Distributed tracing	ADOT / OpenTelemetry	Request path tracing
Cluster security scanning	kube-bench	CIS Benchmark-based security scan
YAML manifest validation	kubeval / kubeconform	Pre-deployment manifest validation
Karpenter debugging	Karpenter controller logs	Diagnose node provisioning issues
IAM debugging	AWS IAM Policy Simulator	Validate IAM permissions

EKS Log Collector

EKS Log Collector is a script provided by AWS that automatically collects logs needed for debugging from EKS worker nodes and generates an archive file that can be shared with AWS Support.

Installation and execution:

# Download and run the script (after connecting to the node via SSM)
curl -O https://raw.githubusercontent.com/awslabs/amazon-eks-ami/master/log-collector-script/linux/eks-log-collector.sh
sudo bash eks-log-collector.sh

Collected items:

kubelet logs
containerd logs
iptables rules
CNI config (VPC CNI configuration)
cloud-init logs
dmesg (kernel messages)
systemd unit status

Output:

Collected logs are saved in a compressed archive following the format /var/log/eks_i-xxxx_yyyy-mm-dd_HH-MM-SS.tar.gz.
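Because the archive name embeds the instance ID and a timestamp, it is convenient to locate the newest one programmatically. The helper below is a minimal sketch that relies only on the eks_ prefix and .tar.gz suffix described above. The function name latest_eks_archive is hypothetical.

```shell
# latest_eks_archive [DIR]: print the newest eks_*.tar.gz under DIR
# (default /var/log). Prints nothing if no archive exists.
latest_eks_archive() {
  dir=${1:-/var/log}
  # ls -t sorts by modification time, newest first
  ls -t "$dir"/eks_*.tar.gz 2>/dev/null | head -n 1
}

# Usage on the node:
#   latest_eks_archive
```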

S3 upload:

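# Upload collected logs directly to S3
sudo bash eks-log-collector.sh --upload s3://my-bucket/
# If your copy of the script has no --upload option (check with --help), copy the archive with the AWS CLI
aws s3 cp /var/log/eks_i-xxxx_yyyy-mm-dd_HH-MM-SS.tar.gz s3://my-bucket/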

Leveraging AWS Support

Attaching this log file when submitting an AWS Support case enables support engineers to quickly understand node state, significantly reducing time to resolution. Always attach it when reporting node join failures, kubelet failures, or network issues.

Detailed Debugging Guides

Use the following links to view detailed debugging guides for each layer:

Document	Description	Key Topics
Control Plane Debugging	Diagnose EKS control plane issues	API Server logs, AuthN/AuthZ, Add-ons, IRSA, Pod Identity, RBAC
Node Debugging	Diagnose node-level issues	Node join failures, kubelet/containerd, resource pressure, Karpenter, Managed Node Group
Workload Debugging	Diagnose Pod and workload issues	Pod state-based debugging, Deployment, HPA/VPA, Probe configuration
Networking Debugging	Diagnose network issues	VPC CNI, DNS, Service, NetworkPolicy, Ingress/LoadBalancer
Storage Debugging	Diagnose storage issues	EBS CSI, EFS CSI, PV/PVC status, volume mount failures
Observability	Monitoring and log analysis	Container Insights, Prometheus, CloudWatch Logs Insights, ADOT

Related Documents

EKS Resiliency Guide - Architecture-level fault recovery strategies
GitOps-based EKS Cluster Operations - GitOps deployment and operations automation
Ultra-fast Autoscaling with Karpenter - Karpenter-based node provisioning optimization
Node Monitoring Agent - Node-level monitoring
