
EKS Debugging Guide

Created: 2026-02-10 | Updated: 2026-04-07 | Reading time: about 8 minutes

Baseline environment: EKS 1.32+, kubectl 1.30+, AWS CLI v2

1. Overview

Issues that occur during EKS operations span multiple layers including the control plane, nodes, network, workloads, storage, and observability. This document is a comprehensive debugging guide for SREs, DevOps engineers, and platform teams to systematically diagnose and quickly resolve these issues.

Commands and examples are written to run as-is once placeholders such as <cluster-name> are filled in, and decision trees and flowcharts support quick decisions during an incident.

EKS Debugging Layers

Debugging Approach Methodology

Two approaches are available for EKS problem diagnosis.

Approach | Description | Suitable Situations
Top-down (symptom → cause) | Start from user-reported symptoms and trace back to causes | Immediate response to service outages and performance degradation
Bottom-up (infra → app) | Check layers sequentially starting from infrastructure | Preventive checks, validation after cluster migration

Generally Recommended Order

For production incidents, a Top-down approach is recommended. First identify the symptom (Section 2 Incident Triage), then navigate to the debugging section for that layer.


2. Incident Triage (Rapid Fault Determination)

First 5 Minutes Checklist

When an incident occurs, the most important actions are scope determination and initial response.

30 seconds: Initial diagnosis

# Check cluster status
aws eks describe-cluster --name <cluster-name> --query 'cluster.status' --output text

# Check node status
kubectl get nodes

# Check abnormal Pods
kubectl get pods --all-namespaces | grep -v Running | grep -v Completed
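
The grep filter above passes over Pods that report Running while their containers are not Ready (for example READY 0/1). A minimal sketch of a stricter filter follows. The function name unhealthy_pods is hypothetical, and it assumes the default column layout of kubectl get pods --all-namespaces --no-headers (NAMESPACE, NAME, READY, STATUS, ...).

```shell
# Hypothetical helper: reads `kubectl get pods --all-namespaces --no-headers`
# output on stdin and prints Pods that are not healthy.
# Assumed columns: NAMESPACE NAME READY STATUS RESTARTS AGE
unhealthy_pods() {
  awk '{
    split($3, ready, "/")              # READY column, e.g. "1/2"
    if ($4 == "Completed") next        # finished Jobs are expected
    if ($4 != "Running" || ready[1] != ready[2]) print $1 "/" $2, $3, $4
  }'
}
```

Usage: kubectl get pods --all-namespaces --no-headers | unhealthy_pods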

2 minutes: Scope determination

# Check recent events (all namespaces)
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -20

# Aggregate Pod states for a specific namespace
kubectl get pods -n <namespace> --no-headers | awk '{print $3}' | sort | uniq -c | sort -rn

# Check distribution of abnormal Pods by node
kubectl get pods --all-namespaces --no-headers \
--field-selector=status.phase!=Running,status.phase!=Succeeded \
-o custom-columns=NODE:.spec.nodeName | sort | uniq -c | sort -rn

5 minutes: Initial response

# Detailed information on the problem Pod
kubectl describe pod <pod-name> -n <namespace>

# Previous container logs (for CrashLoopBackOff)
kubectl logs <pod-name> -n <namespace> --previous

# Check resource usage
kubectl top nodes
kubectl top pods -n <namespace> --sort-by=cpu

Scope Determination Decision Tree

AZ Failure Detection

AWS Health API Requirements

The aws health describe-events API is only available on AWS Business or Enterprise Support plans. Without a Support plan, check directly from the AWS Health Dashboard console or capture Health events via EventBridge rules.

# Check EKS/EC2-related events via AWS Health API (Business/Enterprise Support plan required)
aws health describe-events \
--filter '{"services":["EKS","EC2"],"eventStatusCodes":["open"]}' \
--region us-east-1

# Alternative: AZ failure detection without Support plan — create EventBridge rule
aws events put-rule \
--name "aws-health-eks-events" \
--event-pattern '{
"source": ["aws.health"],
"detail-type": ["AWS Health Event"],
"detail": {
"service": ["EKS", "EC2"],
"eventTypeCategory": ["issue"]
}
}'

# Count nodes hosting abnormal Pods, grouped by AZ (only Pods scheduled to nodes)
kubectl get pods --all-namespaces -o json | jq -r '
.items[] |
select(.status.phase != "Running" and .status.phase != "Succeeded") |
select(.spec.nodeName != null) |
.spec.nodeName
' | sort -u | while read node; do
zone=$(kubectl get node "$node" -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}' 2>/dev/null)
[ -n "$zone" ] && echo "$zone"
done | sort | uniq -c | sort -rn
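
The loop above deduplicates node names, so its counts are per node rather than per Pod. If a per-Pod count is more useful, the following sketch joins one node listing with one Pod listing instead of querying each node separately. The function name abnormal_pods_by_zone is hypothetical, and the sketch assumes every node carries the topology.kubernetes.io/zone label.

```shell
# Hypothetical helper: counts abnormal Pods per AZ using two kubectl calls.
abnormal_pods_by_zone() {
  nodes_file="$(mktemp)"
  # Node name -> zone map (assumes the zone label is present on every node)
  kubectl get nodes --no-headers \
    -o custom-columns='NAME:.metadata.name,ZONE:.metadata.labels.topology\.kubernetes\.io/zone' \
    > "$nodes_file"
  # Pod node/phase pairs; unscheduled Pods print <none> and are skipped below
  kubectl get pods --all-namespaces --no-headers \
    -o custom-columns='NODE:.spec.nodeName,PHASE:.status.phase' |
    awk 'NR == FNR { zone[$1] = $2; next }
         $2 != "Running" && $2 != "Succeeded" && ($1 in zone) { count[zone[$1]]++ }
         END { for (z in count) print count[z], z }' "$nodes_file" - |
    sort -rn
  rm -f "$nodes_file"
}
```

Each output line is a Pod count followed by the zone name, largest count first.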

# Check ARC Zonal Shift status
aws arc-zonal-shift list-zonal-shifts \
--resource-identifier arn:aws:eks:region:account:cluster/name

AZ Failure Response Using ARC Zonal Shift

# Enable Zonal Shift in EKS
aws eks update-cluster-config \
--name <cluster-name> \
--zonal-shift-config enabled=true

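# Start manual Zonal Shift (move traffic away from failed AZ)
# --away-from takes an AZ ID (for example use1-az1), not an AZ name. Look it up with:
#   aws ec2 describe-availability-zones --zone-names us-east-1a --query 'AvailabilityZones[0].ZoneId' --output text
aws arc-zonal-shift start-zonal-shift \
--resource-identifier arn:aws:eks:region:account:cluster/name \
--away-from <az-id> \
--expires-in 3h \
--comment "AZ impairment detected"

Zonal Shift Caveats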

The maximum duration of a Zonal Shift is 3 days and can be extended. Starting a shift blocks new traffic to Pods running on nodes in that AZ, so first verify that other AZs have sufficient capacity.
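
One way to sanity-check remaining capacity before starting a shift is to count schedulable Ready nodes per AZ. The sketch below is only a rough indicator, since node counts say nothing about free CPU or memory. The function name ready_nodes_by_zone is hypothetical, and it assumes the zone label is set on every node so that the -L column is the last field.

```shell
# Hypothetical helper: prints "<zone> <ready-node-count>" per AZ.
# Cordoned nodes (STATUS "Ready,SchedulingDisabled") and NotReady nodes
# are excluded because only an exact "Ready" status is counted.
ready_nodes_by_zone() {
  kubectl get nodes --no-headers -L topology.kubernetes.io/zone |
    awk '$2 == "Ready" { count[$NF]++ }
         END { for (z in count) print z, count[z] }' |
    sort
}
```

If the AZ to be shifted away from holds a large share of the Ready nodes, scale out the other AZs first.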

Zonal Shift Only Blocks Traffic

ARC Zonal Shift only changes Load Balancer / Service-level traffic routing.

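ARC Zonal Shift Impact Scope
Zonal Shift changes traffic routing only; check the impact on each layer.

Layer | Zonal Shift impact | Manual action
ALB / NLB | Removed from the Target Group in that AZ | -
EKS Service (kube-proxy) | Endpoint weight for that AZ removed | -
Existing nodes | Keep running | Move Pods with kubectl drain
Existing Pods | Only traffic is blocked; the Pods themselves keep running | Rescheduled automatically on drain
Karpenter NodePool | AZ settings unchanged; new nodes can still be created in that AZ | Modify NodePool requirements
ASG (Managed Node Group) | Subnet list unchanged; scale-out into that AZ is still possible | Modify ASG subnets (console/IaC)
EBS volumes | Pinned to the AZ; cannot be moved | Snapshot → restore in another AZ
EFS Mount Target | Mount Targets in other AZs are used automatically | -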

AZ settings for Karpenter NodePools and ASGs (Managed Node Groups) are not updated automatically. Therefore, complete AZ evacuation requires additional actions:

  1. Start Zonal Shift → Block new traffic (automatic)
  2. Drain nodes in that AZ → Move existing Pods
  3. Remove the AZ from the Karpenter NodePool or ASG subnets → Prevent new node provisioning

# 1. Identify and drain nodes in the failed AZ
for node in $(kubectl get nodes -l topology.kubernetes.io/zone=us-east-1a -o name); do
kubectl cordon "$node"
kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --grace-period=60
done

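# 2. Temporarily exclude the AZ from the Karpenter NodePool (modify requirements)
# Caution: a merge patch replaces the entire requirements list, so include every existing requirement in the patch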
kubectl patch nodepool default --type=merge -p '{
"spec": {"template": {"spec": {"requirements": [
{"key": "topology.kubernetes.io/zone", "operator": "In", "values": ["us-east-1b", "us-east-1c"]}
]}}}
}'

# 3. Managed Node Groups require ASG subnet changes (perform via console or IaC)

After the Zonal Shift is lifted, the above changes must be reverted.
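
Reverting the cordon step can be sketched as below. The function name uncordon_zone is hypothetical. It only makes the nodes schedulable again; the NodePool requirements and ASG subnets still have to be restored separately.

```shell
# Hypothetical helper: uncordons every node in the given AZ.
uncordon_zone() {
  zone="$1"
  kubectl get nodes -l "topology.kubernetes.io/zone=${zone}" -o name |
    while read -r node; do
      kubectl uncordon "$node"
    done
}
```

Usage: uncordon_zone us-east-1a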

CloudWatch Anomaly Detection

# Create an anomaly detector for Pod restart counts (this trains the model only; an alarm is created separately with put-metric-alarm)
aws cloudwatch put-anomaly-detector \
--single-metric-anomaly-detector '{
"Namespace": "ContainerInsights",
"MetricName": "pod_number_of_container_restarts",
"Dimensions": [
{"Name": "ClusterName", "Value": "<cluster-name>"},
{"Name": "Namespace", "Value": "production"}
],
"Stat": "Average"
}'
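
The detector above only trains a model; an alarm that fires when restarts leave the expected band has to be created separately. The following is a minimal sketch under stated assumptions: the function names, the alarm name, the 300-second period, the band width of 2 standard deviations, and the 3 evaluation periods are illustrative choices, not values from this guide.

```shell
# Hypothetical helper: prints the --metrics JSON for an anomaly-band alarm.
restart_anomaly_metrics() {
  cluster="$1"
  namespace="$2"
  band="${3:-2}"   # band width in standard deviations (assumed default)
  cat <<EOF
[
  {"Id": "m1", "ReturnData": true, "MetricStat": {
    "Metric": {
      "Namespace": "ContainerInsights",
      "MetricName": "pod_number_of_container_restarts",
      "Dimensions": [
        {"Name": "ClusterName", "Value": "${cluster}"},
        {"Name": "Namespace", "Value": "${namespace}"}
      ]},
    "Period": 300, "Stat": "Average"}},
  {"Id": "ad1", "ReturnData": true, "Label": "expected restart band",
   "Expression": "ANOMALY_DETECTION_BAND(m1, ${band})"}
]
EOF
}

# Hypothetical helper: creates an alarm that fires above the upper band.
create_restart_anomaly_alarm() {
  aws cloudwatch put-metric-alarm \
    --alarm-name "$1-$2-pod-restart-anomaly" \
    --comparison-operator GreaterThanUpperThreshold \
    --evaluation-periods 3 \
    --threshold-metric-id ad1 \
    --treat-missing-data notBreaching \
    --metrics "$(restart_anomaly_metrics "$1" "$2")"
}
```

Usage: create_restart_anomaly_alarm <cluster-name> production. Add --alarm-actions with an SNS topic ARN to route notifications.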

Incident Response Escalation Matrix

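Incident Response Escalation Matrix
Initial response time and escalation path by severity

Severity | Initial response time | Escalation | Examples
P1 - Critical | Within 5 minutes | On-call immediately + manager | Control plane failure, all nodes NotReady
P2 - High | Within 15 minutes | On-call team | Failure of a specific AZ, many Pods in CrashLoopBackOff
P3 - Medium | Within 1 hour | Owning team | HPA scaling failures, intermittent timeouts
P4 - Low | Within 4 hours | Backlog | Single Pod restart, non-production issues

See High-Availability Architecture Guide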

For architecture-level fault recovery strategies (TopologySpreadConstraints, PodDisruptionBudget, multi-AZ deployment, etc.), see the EKS Resiliency Guide.


10. Debugging Quick Reference

Error Pattern → Cause → Resolution Quick Reference Table

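Error Pattern Quick Reference
Error pattern → cause → resolution (18 entries)

# | Error pattern | Cause | Resolution
1 | CrashLoopBackOff | App crash, misconfiguration, unmet dependencies | kubectl logs --previous; check app configuration and environment variables
2 | ImagePullBackOff | Image does not exist, registry authentication failure | Verify image name/tag; configure imagePullSecrets
3 | OOMKilled | Memory limits exceeded | Increase memory limits; check the app for memory leaks
4 | Pending (unschedulable) | Insufficient resources, nodeSelector mismatch | Check events in kubectl describe pod; check node capacity and labels
5 | CreateContainerConfigError | ConfigMap/Secret does not exist | Verify that the referenced ConfigMap/Secret exists
6 | Node NotReady | kubelet failure, resource pressure | Connect to the node via SSM; systemctl status kubelet
7 | FailedAttachVolume | EBS volume attached to another node | Delete the previous Pod; wait for volume detach (~6 minutes)
8 | FailedMount | EFS mount target/SG misconfiguration | Verify the mount target exists and TCP 2049 is allowed
9 | NetworkNotReady | VPC CNI not started | kubectl logs -n kube-system -l k8s-app=aws-node
10 | DNS resolution failed | CoreDNS failure | Check CoreDNS Pod status and logs; kubectl rollout restart
11 | Unauthorized / 403 | Insufficient RBAC permissions, aws-auth misconfiguration | aws sts get-caller-identity; check aws-auth/Access Entry
12 | connection refused | Service has no Endpoints, port mismatch | kubectl get endpoints; check selector and port
13 | Evicted | Node resource pressure (DiskPressure, etc.) | Clean up node disk; adjust Pod resource requests
14 | FailedScheduling: Insufficient cpu/memory | Insufficient cluster capacity | Increase Karpenter NodePool limits; add nodes
15 | Terminating (stuck) | Finalizer not completed, preStop hook delay | Check finalizers; if necessary, --force --grace-period=0
16 | Back-off pulling image | Pull timeout on large images | Optimize the image; use a same-region registry such as ECR
17 | readiness probe failed | Slow app startup, health check endpoint error | Add a startupProbe; adjust probe timeouts
18 | Too many pods | Maximum Pods per node exceeded | Check the max-pods setting; enable Prefix Delegation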

Essential kubectl Command Cheat Sheet

Query and diagnosis

# See all resource status at a glance
kubectl get all -n <namespace>

# Filter only abnormal Pods
kubectl get pods --all-namespaces --field-selector=status.phase!=Running,status.phase!=Succeeded

# Pod details (including events)
kubectl describe pod <pod-name> -n <namespace>

# Namespace events (sorted by time, newest at the bottom)
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
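
During an incident, Normal events often drown out the relevant ones. A small sketch that keeps only Warning events follows. The function name recent_warnings and the default limit of 20 are assumptions.

```shell
# Hypothetical helper: shows the most recent Warning events in a namespace.
recent_warnings() {
  namespace="$1"
  limit="${2:-20}"   # number of trailing lines to keep (assumed default)
  kubectl get events -n "$namespace" --field-selector type=Warning \
    --sort-by='.lastTimestamp' | tail -n "$limit"
}
```

Usage: recent_warnings <namespace> 10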

# Resource usage
kubectl top nodes
kubectl top pods -n <namespace> --sort-by=memory

Log inspection

# Current container logs
kubectl logs <pod-name> -n <namespace>

# Previous (crashed) container logs
kubectl logs <pod-name> -n <namespace> --previous

# Specific container in a multi-container Pod
kubectl logs <pod-name> -n <namespace> -c <container-name>

# Real-time log streaming
kubectl logs -f <pod-name> -n <namespace>

# Logs from multiple Pods by label
kubectl logs -l app=<app-name> -n <namespace> --tail=50
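
When several Pods crash at once, the --previous lookup can be repeated for each of them. The sketch below does that for every Pod in CrashLoopBackOff in one namespace. The function name crashloop_previous_logs and the 50-line default are assumptions, and for multi-container Pods a -c flag would still be needed.

```shell
# Hypothetical helper: prints previous-container logs for each
# CrashLoopBackOff Pod in a namespace.
crashloop_previous_logs() {
  ns="$1"
  lines="${2:-50}"   # log lines per Pod (assumed default)
  kubectl get pods -n "$ns" --no-headers |
    awk '$3 == "CrashLoopBackOff" { print $1 }' |
    while read -r pod; do
      echo "=== $pod ==="
      kubectl logs "$pod" -n "$ns" --previous --tail="$lines" 2>&1
    done
}
```

Usage: crashloop_previous_logs <namespace>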

Debugging

# Debug with an ephemeral container
kubectl debug <pod-name> -n <namespace> -it --image=nicolaka/netshoot --target=<container-name>

# Node debugging
kubectl debug node/<node-name> -it --image=ubuntu

# Execute a command inside a Pod
kubectl exec -it <pod-name> -n <namespace> -- <command>

Deployment management

# Rollout status/history/rollback
kubectl rollout status deployment/<name>
kubectl rollout history deployment/<name>
kubectl rollout undo deployment/<name>

# Restart a Deployment
kubectl rollout restart deployment/<name>

# Node maintenance (drain)
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
kubectl uncordon <node-name>

Scenario | Tool | Description
Network debugging | netshoot | Container bundled with networking tools
Node resource visualization | eks-node-viewer | Terminal-based node resource monitoring
Container runtime debugging | crictl | containerd debugging CLI
Log analysis | CloudWatch Logs Insights | AWS-native log query
Metric queries | Prometheus / Grafana | PromQL-based metric analysis
Distributed tracing | ADOT / OpenTelemetry | Request path tracing
Cluster security scanning | kube-bench | CIS Benchmark-based security scan
YAML manifest validation | kubeval / kubeconform | Pre-deployment manifest validation
Karpenter debugging | Karpenter controller logs | Diagnose node provisioning issues
IAM debugging | AWS IAM Policy Simulator | Validate IAM permissions

EKS Log Collector

EKS Log Collector is a script provided by AWS that automatically collects logs needed for debugging from EKS worker nodes and generates an archive file that can be shared with AWS Support.

Installation and execution:

# Download and run the script (after SSM connecting to the node)
curl -O https://raw.githubusercontent.com/awslabs/amazon-eks-ami/master/log-collector-script/linux/eks-log-collector.sh
sudo bash eks-log-collector.sh

Collected items:

  • kubelet logs
  • containerd logs
  • iptables rules
  • CNI config (VPC CNI configuration)
  • cloud-init logs
  • dmesg (kernel messages)
  • systemd unit status

Output:

Collected logs are saved in a compressed archive following the format /var/log/eks_i-xxxx_yyyy-mm-dd_HH-MM-SS.tar.gz.
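
If the collector has run more than once on a node, the newest archive is usually the one to attach. A minimal sketch for locating it follows. The function name latest_eks_log_archive is hypothetical, and it assumes the archives follow the eks_i-*.tar.gz naming shown above.

```shell
# Hypothetical helper: prints the most recently modified collector archive.
# Prints nothing if the directory has no matching file.
latest_eks_log_archive() {
  dir="${1:-/var/log}"
  ls -t "$dir"/eks_i-*.tar.gz 2>/dev/null | head -n 1
}
```

Usage: tar -tzf "$(latest_eks_log_archive)" | head lists the first entries of the newest archive before it is shared.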

S3 upload:

# Upload the collected archive to S3 with the AWS CLI (the node role needs s3:PutObject on the bucket)
aws s3 cp /var/log/eks_i-xxxx_yyyy-mm-dd_HH-MM-SS.tar.gz s3://my-bucket/

Leveraging AWS Support

Attaching this log file when submitting an AWS Support case enables support engineers to quickly understand node state, significantly reducing time to resolution. Always attach it when reporting node join failures, kubelet failures, or network issues.


Detailed Debugging Guides

Use the following links to view detailed debugging guides for each layer:

Document | Description | Key Topics
Control Plane Debugging | Diagnose EKS control plane issues | API Server logs, AuthN/AuthZ, Add-ons, IRSA, Pod Identity, RBAC
Node Debugging | Diagnose node-level issues | Node join failures, kubelet/containerd, resource pressure, Karpenter, Managed Node Group
Workload Debugging | Diagnose Pod and workload issues | Pod state-based debugging, Deployment, HPA/VPA, Probe configuration
Networking Debugging | Diagnose network issues | VPC CNI, DNS, Service, NetworkPolicy, Ingress/LoadBalancer
Storage Debugging | Diagnose storage issues | EBS CSI, EFS CSI, PV/PVC status, volume mount failures
Observability | Monitoring and log analysis | Container Insights, Prometheus, CloudWatch Logs Insights, ADOT
