📅 Written: 2026-02-10 | Last Modified: 2026-02-13 | ⏱️ Reading Time: ~20 min

EKS Incident Diagnosis and Response Guide

📌 Reference environment: EKS 1.30+, kubectl 1.30+, AWS CLI v2

1. Overview

Issues that arise during EKS operations span multiple layers including the control plane, nodes, networking, workloads, storage, and observability. This document is a comprehensive debugging guide designed to help SREs, DevOps engineers, and platform teams systematically diagnose and quickly resolve these issues.

All commands and examples are written to be immediately executable, and Decision Trees and flowcharts are provided to facilitate rapid decision-making.

EKS Debugging Layers

Debugging Methodology

There are two approaches to diagnosing EKS issues.

Approach	Description	Suitable Scenarios
Top-down (Symptom → Cause)	Start from user-reported symptoms and trace back to the root cause	Immediate incident response such as service outages or performance degradation
Bottom-up (Infra → App)	Inspect sequentially from the infrastructure layer upward	Preventive inspections, post-migration validation

Recommended General Approach

For production incidents, the top-down approach is recommended. First identify the symptoms (Section 2 - Incident Triage), then navigate to the corresponding debugging section for that layer.

2. Incident Triage (Rapid Failure Assessment)

First 5 Minutes Checklist

When an incident occurs, the most important actions are scope identification and initial response.

30 Seconds: Initial Diagnosis

# Check cluster status
aws eks describe-cluster --name <cluster-name> --query 'cluster.status' --output text

# Check node status
kubectl get nodes

# Check unhealthy Pods
kubectl get pods --all-namespaces | grep -v Running | grep -v Completed

2 Minutes: Scope Identification

# Check recent events (all namespaces)
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -20

# Aggregate Pod status in a specific namespace
kubectl get pods -n <namespace> --no-headers | awk '{print $3}' | sort | uniq -c | sort -rn

# Check distribution of unhealthy Pods by node
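# (custom-columns avoids column shifts caused by values such as "5 (3m ago)" in RESTARTS;
#  Succeeded Pods are excluded so completed Jobs are not counted as unhealthy)
kubectl get pods --all-namespaces --no-headers \
  --field-selector=status.phase!=Running,status.phase!=Succeeded \
  -o custom-columns=NODE:.spec.nodeName | sort | uniq -c | sort -rn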

5 Minutes: Initial Response

# Detailed information for the problematic Pod
kubectl describe pod <pod-name> -n <namespace>

# Previous container logs (for CrashLoopBackOff)
kubectl logs <pod-name> -n <namespace> --previous

# Check resource usage
kubectl top nodes
kubectl top pods -n <namespace> --sort-by=cpu

Scope Identification Decision Tree

AZ Failure Detection

AWS Health API Requirement

The aws health describe-events API requires an AWS Business or Enterprise Support plan. If you don't have a Support plan, check the AWS Health Dashboard console directly or create an EventBridge rule to capture Health events.

# Check AWS Health API for EKS/EC2 events (requires Business/Enterprise Support plan)
aws health describe-events \
  --filter '{"services":["EKS","EC2"],"eventStatusCodes":["open"]}' \
  --region us-east-1

# Alternative: Detect AZ failures without a Support plan — create EventBridge rule
aws events put-rule \
  --name "aws-health-eks-events" \
  --event-pattern '{
    "source": ["aws.health"],
    "detail-type": ["AWS Health Event"],
    "detail": {
      "service": ["EKS", "EC2"],
      "eventTypeCategory": ["issue"]
    }
  }'

# Aggregate unhealthy Pods by AZ (only pods scheduled to a node)
kubectl get pods --all-namespaces -o json | jq -r '
  .items[] |
  select(.status.phase != "Running" and .status.phase != "Succeeded") |
  select(.spec.nodeName != null) |
  .spec.nodeName
' | sort -u | while read node; do
  zone=$(kubectl get node "$node" -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}' 2>/dev/null)
  [ -n "$zone" ] && echo "$zone"
done | sort | uniq -c | sort -rn

# Check ARC Zonal Shift status
aws arc-zonal-shift list-zonal-shifts \
  --resource-identifier arn:aws:eks:region:account:cluster/name

AZ Failure Response Using ARC Zonal Shift

# Enable Zonal Shift on EKS
aws eks update-cluster-config \
  --name <cluster-name> \
  --zonal-shift-config enabled=true

# Start manual Zonal Shift (move traffic away from impaired AZ)
# --away-from takes the AZ ID (for example use1-az1), not the AZ name
aws arc-zonal-shift start-zonal-shift \
  --resource-identifier arn:aws:eks:region:account:cluster/name \
  --away-from use1-az1 \
  --expires-in 3h \
  --comment "AZ impairment detected"

Zonal Shift Considerations

The maximum duration for a Zonal Shift is 3 days and can be extended. Once a Shift is initiated, new traffic to Pods running on nodes in the affected AZ is blocked, so verify that sufficient capacity exists in the other AZs before proceeding.
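The capacity check mentioned above can be sketched as a small helper that counts schedulable nodes per AZ. This is a minimal sketch, not a sizing tool: the function name ready_nodes_per_zone is hypothetical, and it assumes the input comes from kubectl get nodes --no-headers -L topology.kubernetes.io/zone, whose columns are NAME, STATUS, ROLES, AGE, VERSION, ZONE.

```shell
# Count Ready (and not cordoned) nodes per Availability Zone.
# Input (stdin): output of
#   kubectl get nodes --no-headers -L topology.kubernetes.io/zone
# Nodes whose STATUS is anything other than exactly "Ready"
# (for example "NotReady" or "Ready,SchedulingDisabled") are skipped.
ready_nodes_per_zone() {
  awk '$2 == "Ready" && $6 != "" { count[$6]++ }
       END { for (zone in count) print zone, count[zone] }' | sort
}
```

Piping the kubectl output into ready_nodes_per_zone prints one line per zone with its Ready node count. Node counts are only a rough proxy for capacity, so compare them with kubectl top nodes before starting the shift.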

Zonal Shift Only Blocks Traffic

ARC Zonal Shift only changes traffic routing at the Load Balancer / Service level.

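ARC Zonal Shift Impact Scope

Zonal Shift changes traffic routing only. Check the impact on each layer:

Layer	Zonal Shift Impact	Automatic Adjustment	Manual Action
ALB / NLB	Removed from the Target Group for the affected AZ	✅	-
EKS Service (kube-proxy)	Endpoint weight for the affected AZ is removed	✅	-
Existing nodes	Keep running	❌	Move Pods with `kubectl drain`
Existing Pods	Only traffic is blocked; the Pods themselves keep running	❌	Rescheduled automatically on drain
Karpenter NodePool	AZ settings unchanged; new nodes can still be created in the affected AZ	❌	Modify NodePool requirements
ASG (Managed Node Group)	Subnet list unchanged; scale-out into the affected AZ is still possible	❌	Modify ASG subnets (console/IaC)
EBS volumes	Pinned to the AZ; cannot be moved	❌	Snapshot → restore in another AZ
EFS Mount Target	Mount Target in another AZ is used automatically	✅	-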

Karpenter NodePool and ASG (Managed Node Group) AZ configurations are NOT automatically updated. A complete AZ evacuation requires additional steps:

Start Zonal Shift → blocks new traffic (automatic)
Drain nodes in the affected AZ → relocate existing Pods
Remove the affected AZ from Karpenter NodePool or ASG subnets → prevent new node provisioning

# 1. Identify and drain nodes in the affected AZ
for node in $(kubectl get nodes -l topology.kubernetes.io/zone=us-east-1a -o name); do
  kubectl cordon $node
  kubectl drain $node --ignore-daemonsets --delete-emptydir-data --grace-period=60
done

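# 2. Temporarily exclude the affected AZ from Karpenter NodePool
# Note: a merge patch replaces the entire requirements list, so include every
# existing requirement (instance type, capacity type, etc.) alongside the zone entry
kubectl patch nodepool default --type=merge -p '{
  "spec": {"template": {"spec": {"requirements": [
    {"key": "topology.kubernetes.io/zone", "operator": "In", "values": ["us-east-1b", "us-east-1c"]}
  ]}}}
}'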

# 3. For Managed Node Groups, update ASG subnets (via console or IaC)

Remember to revert these changes after the Zonal Shift is cancelled.

CloudWatch Anomaly Detection

# Create an anomaly detector for Pod restart counts
# (this only trains the model; create the alarm separately with aws cloudwatch put-metric-alarm)
aws cloudwatch put-anomaly-detector \
  --single-metric-anomaly-detector '{
    "Namespace": "ContainerInsights",
    "MetricName": "pod_number_of_container_restarts",
    "Dimensions": [
      {"Name": "ClusterName", "Value": "<cluster-name>"},
      {"Name": "Namespace", "Value": "production"}
    ],
    "Stat": "Average"
  }'

Incident Response Escalation Matrix

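Initial response time and escalation path by severity

Severity	Initial Response Time	Escalation	Example Incidents
🔴 P1 - Critical	Within 5 minutes	Immediate on-call + manager	Control plane failure, all nodes NotReady
🟠 P2 - High	Within 15 minutes	On-call team	Failure of a specific AZ, many Pods in CrashLoopBackOff
🟡 P3 - Medium	Within 1 hour	Responsible team	HPA scaling failure, intermittent timeouts
🔵 P4 - Low	Within 4 hours	Backlog	Single Pod restart, issues in non-production environments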

High Availability Architecture Guide Reference

For architecture-level failure recovery strategies (TopologySpreadConstraints, PodDisruptionBudget, Multi-AZ deployments, etc.), refer to the EKS High Availability Architecture Guide.

3. EKS Control Plane Debugging

Control Plane Log Types

The EKS control plane can send 5 log types to CloudWatch Logs.

Log group: /aws/eks/<cluster-name>/cluster

Log Type	Component	Description	Log Stream Prefix
api	kube-apiserver	API request/response records	kube-apiserver-*
audit	kube-apiserver-audit	Audit log (who did what, and when)	kube-apiserver-audit-*
authenticator	aws-iam-authenticator	IAM authentication events	authenticator-*
controllerManager	kube-controller-manager	Controller operation logs	kube-controller-manager-*
scheduler	kube-scheduler	Scheduling decisions and failures	kube-scheduler-*

Enabling Logs

# Enable all control plane logs
aws eks update-cluster-config \
  --region <region> \
  --name <cluster-name> \
  --logging '{"clusterLogging":[{"types":["api","audit","authenticator","controllerManager","scheduler"],"enabled":true}]}'

Cost Optimization

Enabling all log types increases CloudWatch Logs costs. For production, it is recommended to enable audit and authenticator as mandatory, and enable the remaining types only when debugging is needed.
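The recommendation above (audit and authenticator always on, the rest off) can be expressed as a single --logging payload. The helper below is a minimal sketch that only builds the JSON string; the name build_logging_config is hypothetical, and the payload shape mirrors the update-cluster-config example in the previous section.

```shell
# Print a --logging payload that enables the log types given as arguments
# and disables every other control plane log type.
# Note: if all five types are passed, the "disabled" entry has an empty
# types list; in that case use the enable-all command shown earlier instead.
build_logging_config() {
  local all="api audit authenticator controllerManager scheduler"
  local on="" off="" t
  for t in $all; do
    case " $* " in
      *" $t "*) on="${on:+$on,}\"$t\"" ;;
      *)        off="${off:+$off,}\"$t\"" ;;
    esac
  done
  printf '{"clusterLogging":[{"types":[%s],"enabled":true},{"types":[%s],"enabled":false}]}\n' \
    "$on" "$off"
}
```

The output can then be passed to the CLI, for example: aws eks update-cluster-config --region <region> --name <cluster-name> --logging "$(build_logging_config audit authenticator)".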

CloudWatch Logs Insights Queries

API Server Errors (400+) Analysis

fields @timestamp, @message
| filter @logStream like /kube-apiserver-audit/
| filter responseStatus.code >= 400
| stats count(*) as request_count by responseStatus.code
| sort request_count desc

Authentication Failure Tracking

fields @timestamp, @message
| filter @logStream like /authenticator/
| filter @message like /error/ or @message like /denied/
| sort @timestamp desc

aws-auth ConfigMap Change Detection

fields @timestamp, @message
| filter @logStream like /kube-apiserver-audit/
| filter objectRef.resource = "configmaps" and objectRef.name = "aws-auth"
| filter verb in ["update", "patch", "delete"]
| sort @timestamp desc

API Throttling Detection

fields @timestamp, @message
| filter @logStream like /kube-apiserver/
| filter @message like /throttle/ or @message like /rate limit/
| stats count() by bin(5m)

Unauthorized Access Attempts (Security Events)

fields @timestamp, @message
| filter @logStream like /kube-apiserver-audit/
| filter responseStatus.code = 403
| stats count(*) as denied_count by user.username
| sort denied_count desc

Authentication/Authorization Debugging

IAM Authentication Verification

# Check current IAM credentials
aws sts get-caller-identity

# Check cluster authentication mode
aws eks describe-cluster --name <cluster-name> \
  --query 'cluster.accessConfig.authenticationMode' --output text

aws-auth ConfigMap (CONFIG_MAP Mode)

# Check aws-auth ConfigMap
kubectl describe configmap aws-auth -n kube-system

EKS Access Entries (API / API_AND_CONFIG_MAP Mode)

# Create Access Entry
aws eks create-access-entry \
  --cluster-name <cluster-name> \
  --principal-arn arn:aws:iam::ACCOUNT:role/ROLE-NAME \
  --type STANDARD

# List Access Entries
aws eks list-access-entries --cluster-name <cluster-name>

IRSA (IAM Roles for Service Accounts) Debugging Checklist

# 1. Verify annotation on ServiceAccount
kubectl get sa <sa-name> -n <namespace> -o yaml

# 2. Check AWS environment variables inside the Pod
#    (expect AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE)
kubectl exec <pod-name> -n <namespace> -- env | grep AWS

# 3. Verify OIDC Provider
aws eks describe-cluster --name <cluster-name> \
  --query 'cluster.identity.oidc.issuer' --output text

# 4. Verify OIDC Provider ARN and conditions in IAM Role Trust Policy
aws iam get-role --role-name <role-name> \
  --query 'Role.AssumeRolePolicyDocument'

Common IRSA Mistakes

Typo in the role ARN in the ServiceAccount annotation
Mismatch in namespace/sa name in the IAM Role Trust Policy
OIDC Provider not associated with the cluster
spec.serviceAccountName not specified for the Pod to use the ServiceAccount
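The namespace/ServiceAccount mismatch in the list above is easier to spot when the expected trust policy values are printed and compared with the output of aws iam get-role. A minimal sketch follows; the function names are hypothetical, and the value format system:serviceaccount:<namespace>:<sa-name> is the standard subject of a ServiceAccount token.

```shell
# Print the "sub" value the IAM Role trust policy must match
# for a given namespace and ServiceAccount name.
irsa_expected_sub() {
  printf 'system:serviceaccount:%s:%s\n' "$1" "$2"
}

# Print the trust policy condition key for an OIDC issuer URL.
# The key is the issuer without the https:// scheme, followed by ":sub".
irsa_condition_key() {
  printf '%s:sub\n' "${1#https://}"
}
```

For example, irsa_expected_sub production my-app prints system:serviceaccount:production:my-app, and passing the issuer URL from aws eks describe-cluster to irsa_condition_key prints the key to look for under Condition in the trust policy.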

Service Account Token Expiry (HTTP 401 Unauthorized)

In Kubernetes 1.21+, service account tokens are valid for 1 hour by default and are automatically refreshed by the kubelet. However, if you are using a legacy SDK that lacks token refresh logic, long-running workloads may encounter 401 Unauthorized errors.

Symptoms:

Pod suddenly returns HTTP 401 Unauthorized errors after a certain period (typically 1 hour)
Works normally temporarily after a restart, then 401 errors recur

Cause:

Projected Service Account Tokens expire after 1 hour by default
The kubelet automatically refreshes the token, but if the application reads the token file only once and caches it, the expired token continues to be used

Minimum Required SDK Versions:

Language	SDK	Minimum Version
Go	client-go	v0.15.7+
Python	kubernetes	12.0.0+
Java	fabric8	5.0.0+

Token Refresh Verification

Verify that your SDK supports automatic token refresh. If it does not, your application must periodically re-read the /var/run/secrets/kubernetes.io/serviceaccount/token file.
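The re-read pattern described above can be sketched for shell-based workloads as follows. This is an illustration under assumptions, not an SDK replacement: the function name current_sa_token and the SA_TOKEN_FILE override are hypothetical, while the default path is the one given in this section.

```shell
# Default projected token path; overridable for testing.
SA_TOKEN_FILE="${SA_TOKEN_FILE:-/var/run/secrets/kubernetes.io/serviceaccount/token}"

# Read the token from disk on every call instead of caching it in a variable,
# so a token rotated by the kubelet is picked up on the next request.
current_sa_token() {
  cat "$SA_TOKEN_FILE"
}
```

A caller would build the Authorization header per request, for example with "Authorization: Bearer $(current_sa_token)", rather than storing the token once at startup.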

EKS Pod Identity Debugging

EKS Pod Identity is an alternative to IRSA that provides a simpler setup for granting AWS IAM permissions to Pods.

# Check Pod Identity Associations
aws eks list-pod-identity-associations --cluster-name $CLUSTER
aws eks describe-pod-identity-association --cluster-name $CLUSTER \
  --association-id $ASSOC_ID

# Check Pod Identity Agent status
kubectl get pods -n kube-system -l app.kubernetes.io/name=eks-pod-identity-agent
kubectl logs -n kube-system -l app.kubernetes.io/name=eks-pod-identity-agent --tail=50

Pod Identity Debugging Checklist:

Verify the eks-pod-identity-agent Add-on is installed
Verify the correct association is linked to the Pod's ServiceAccount
Verify the IAM Role trust policy includes the pods.eks.amazonaws.com service principal

Pod Identity vs IRSA

Pod Identity has a simpler setup than IRSA and makes cross-account access easier. Pod Identity is recommended for new workloads.

EKS Add-on Troubleshooting

# List Add-ons
aws eks list-addons --cluster-name <cluster-name>

# Check Add-on status in detail
aws eks describe-addon --cluster-name <cluster-name> --addon-name <addon-name>

# Update Add-on (resolve conflicts with PRESERVE to keep existing settings)
aws eks update-addon --cluster-name <cluster-name> --addon-name <addon-name> \
  --addon-version <version> --resolve-conflicts PRESERVE

Add-on	Common Error Patterns	Diagnosis	Resolution
CoreDNS	Pod CrashLoopBackOff, DNS timeouts	`kubectl logs -n kube-system -l k8s-app=kube-dns`	Check ConfigMap, `kubectl rollout restart deployment coredns -n kube-system`
kube-proxy	Service communication failure, iptables errors	`kubectl logs -n kube-system -l k8s-app=kube-proxy`	Verify DaemonSet image version, `kubectl rollout restart daemonset kube-proxy -n kube-system`
VPC CNI	Pod IP allocation failure, ENI errors	`kubectl logs -n kube-system -l k8s-app=aws-node`	Check IPAMD logs, verify ENI/IP limits (see Section 6)
EBS CSI	PVC Pending, volume attach failure	`kubectl logs -n kube-system -l app.kubernetes.io/name=aws-ebs-csi-driver`	Check IRSA permissions, verify AZ matching (see Section 7)

Cluster Health Issue Codes

When diagnosing infrastructure-level issues with the EKS cluster itself, check the cluster health status.

# Check cluster health issues
aws eks describe-cluster --name $CLUSTER \
  --query 'cluster.health' --output json

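Issue Code	Meaning	Resolution	Recoverability
SUBNET_NOT_FOUND	A cluster subnet was deleted	Associate new subnets	⚠️ Conditionally recoverable
SECURITY_GROUP_NOT_FOUND	The cluster security group was deleted	Recreate the security group	⚠️ Conditionally recoverable
IP_NOT_AVAILABLE	Not enough IP addresses in the subnets	Add or expand subnets	✅ Recoverable
VPC_NOT_FOUND	The VPC was deleted	Cluster must be recreated	❌ Unrecoverable
ASSUME_ROLE_ACCESS_DENIED	Permission problem with the cluster IAM Role	Fix the IAM policy	✅ Recoverable
KMS_KEY_DISABLED	The KMS key used for Secrets encryption is disabled	Re-enable the KMS key	✅ Recoverable
KMS_KEY_NOT_FOUND	The KMS key was deleted	Cannot be recovered	❌ Unrecoverable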

Unrecoverable Issues

VPC_NOT_FOUND and KMS_KEY_NOT_FOUND are unrecoverable. The cluster must be recreated.
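For runbooks or alert routing, the recoverability of each issue code listed in this section can be encoded as a lookup. This is a minimal sketch: the function name health_issue_recoverability is hypothetical, and it assumes the code strings are spelled exactly as in this section; anything else is reported as unknown.

```shell
# Map a cluster health issue code (as listed in this section) to its
# recoverability: unrecoverable, conditional, recoverable, or unknown.
health_issue_recoverability() {
  case "$1" in
    VPC_NOT_FOUND|KMS_KEY_NOT_FOUND)
      echo unrecoverable ;;
    SUBNET_NOT_FOUND|SECURITY_GROUP_NOT_FOUND)
      echo conditional ;;
    IP_NOT_AVAILABLE|ASSUME_ROLE_ACCESS_DENIED|KMS_KEY_DISABLED)
      echo recoverable ;;
    *)
      echo unknown ;;
  esac
}
```

An unrecoverable result means the response plan should switch from repair to cluster recreation.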

4. Node-Level Debugging

Node Join Failure Debugging

When a node fails to join the cluster, there can be various causes. Below are the 8 most common causes and their diagnostic methods.

Common Causes of Node Join Failure:

Node IAM Role not registered in aws-auth ConfigMap (or Access Entry not created): the node cannot authenticate with the API server
ClusterName in bootstrap script does not match the actual cluster name: kubelet attempts to connect to the wrong cluster
Node security group does not allow communication with the control plane: TCP 443 (API server) and TCP 10250 (kubelet) ports are required
Auto-assign public IP is disabled in a public subnet: the node cannot reach the internet, and therefore the API server, on clusters with only the public endpoint enabled
VPC DNS configuration issue: enableDnsHostnames or enableDnsSupport is disabled
STS regional endpoint is disabled: the STS call fails during IAM authentication
Instance profile ARN registered in aws-auth instead of node IAM Role ARN: only the Role ARN should be registered in aws-auth
kubernetes.io/cluster/<cluster-name> tag (value owned) missing (self-managed nodes): EKS cannot recognize the node as belonging to the cluster

Diagnostic Commands:

# Check node bootstrap logs (after SSM access)
sudo journalctl -u kubelet --no-pager | tail -50
sudo cat /var/log/cloud-init-output.log | tail -50

# Check security group rules
aws ec2 describe-security-groups --group-ids $CLUSTER_SG \
  --query 'SecurityGroups[].IpPermissions' --output table

# Check VPC DNS settings
aws ec2 describe-vpc-attribute --vpc-id $VPC_ID --attribute enableDnsHostnames
aws ec2 describe-vpc-attribute --vpc-id $VPC_ID --attribute enableDnsSupport

ARN to Register in aws-auth

The aws-auth ConfigMap requires the IAM Role ARN (arn:aws:iam::ACCOUNT:role/...), not the instance profile ARN (arn:aws:iam::ACCOUNT:instance-profile/...). This is an extremely common mistake and a leading cause of node join failures.
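A quick check on the ARN before it goes into aws-auth catches this mistake early. The sketch below classifies an ARN by its resource type; the function name iam_arn_kind is hypothetical, and the patterns follow the two ARN formats shown above (with aws* also matching other partitions such as aws-cn).

```shell
# Classify an IAM ARN as "role", "instance-profile", or "other".
# Only "role" is valid for the rolearn field in aws-auth.
iam_arn_kind() {
  case "$1" in
    arn:aws*:iam::*:role/*)             echo role ;;
    arn:aws*:iam::*:instance-profile/*) echo instance-profile ;;
    *)                                  echo other ;;
  esac
}
```

If the result is instance-profile, look up the role attached to that profile and register the role ARN instead.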

Node NotReady Decision Tree

kubelet / containerd Debugging

# Connect to node via SSM
aws ssm start-session --target <instance-id>

# Check kubelet status
systemctl status kubelet
journalctl -u kubelet -n 100 -f

# Check containerd status
systemctl status containerd

# Check container runtime status
crictl pods
crictl ps -a

# Check logs for a specific container
crictl logs <container-id>

SSM Access Prerequisites

SSM access requires the AmazonSSMManagedInstanceCore policy to be attached to the node's IAM Role and a running SSM Agent on the node. The EKS optimized AMIs ship with the SSM Agent preinstalled; if you are using a custom AMI, verify that the SSM Agent is installed.

Resource Pressure Diagnosis and Resolution

# Check node conditions
kubectl describe node <node-name>

Condition	Threshold	Diagnostic Command	Resolution
DiskPressure	Available disk < 10%	`df -h` (after SSM access)	Clean unused images with `crictl rmi --prune`, remove stopped containers with `crictl rm`
MemoryPressure	Available memory < 100Mi	`free -m` (after SSM access)	Evict low-priority Pods, adjust memory requests/limits, replace node
PIDPressure	Available PIDs < 5%	`ps aux | wc -l` (after SSM access)	Increase `kernel.pid_max`, identify and restart the container causing the PID leak
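For the DiskPressure row, the df check can be narrowed to the filesystems that are actually near the threshold. This is a minimal sketch: the function name disks_over_threshold is hypothetical, it expects df -P output on stdin, and it treats usage at or above 90% as roughly equivalent to "available disk < 10%".

```shell
# Print mount points whose usage is at or above the given percentage
# (default 90). Input (stdin): output of `df -P`, whose columns are
# Filesystem, 1024-blocks, Used, Available, Capacity, Mounted on.
# Mount points containing spaces are not handled.
disks_over_threshold() {
  awk -v limit="${1:-90}" '
    NR > 1 {
      use = $5
      sub(/%/, "", use)
      if (use + 0 >= limit) print $6, $5
    }'
}
```

On a node, df -P | disks_over_threshold lists the candidates to clean up first.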

Karpenter Node Provisioning Debugging

# Check Karpenter controller logs
kubectl logs -f deployment/karpenter -n kube-system

# Check NodePool status
kubectl get nodepool
kubectl describe nodepool <nodepool-name>

# Check EC2NodeClass
kubectl get ec2nodeclass
kubectl describe ec2nodeclass <nodeclass-name>

# When provisioning fails, verify:
# 1. NodePool limits have not been exceeded
# 2. EC2NodeClass subnet/security group selectors are correct
# 3. Service Quotas are sufficient for the instance types
# 4. Pod nodeSelector/affinity matches NodePool requirements

Karpenter v1 API Changes

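Provisioner was renamed to NodePool and AWSNodeTemplate to EC2NodeClass with the v1beta1 APIs (Karpenter v0.32), and Karpenter v1.0+ serves these resources as karpenter.sh/v1 (NodePool) and karpenter.k8s.aws/v1 (EC2NodeClass). If you are still using v0.x Provisioner or AWSNodeTemplate configurations, migration is required.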

Managed Node Group Error Codes

Check the health status of Managed Node Groups to diagnose provisioning and operational issues.

# Check node group health status
aws eks describe-nodegroup --cluster-name $CLUSTER --nodegroup-name $NODEGROUP \
  --query 'nodegroup.health' --output json

Error Code	Cause	Resolution
🔴 AccessDenied	The node IAM Role lacks required permissions	Check and restore the eks:node-manager ClusterRole/ClusterRoleBinding
🟡 AmiIdNotFound	The AMI ID in the Launch Template does not exist	Update to a valid EKS optimized AMI ID
🔴 AutoScalingGroupNotFound	The ASG was deleted or does not exist	Delete and recreate the node group
🔴 ClusterUnreachable	Nodes cannot connect to the EKS API server	Check VPC settings, security groups, and endpoint accessibility
🟡 Ec2SecurityGroupNotFound	The specified security group was deleted	Create the correct security group and reconfigure the node group
🟡 Ec2LaunchTemplateNotFound	The Launch Template was deleted	Create a new Launch Template and update the node group
⚪ Ec2LaunchTemplateVersionMismatch	Launch Template version mismatch	Check and correct the version referenced by the node group
🟡 IamInstanceProfileNotFound	The instance profile does not exist	Recreate the IAM instance profile
🔴 IamNodeRoleNotFound	The node IAM Role was deleted	Recreate the IAM Role and attach the required policies
🟡 AsgInstanceLaunchFailures	EC2 instance launch failed (insufficient capacity, etc.)	Add other instance types/AZs, check Service Quotas
🟡 NodeCreationFailure	General node creation failure	Check the detailed error in CloudTrail
🟡 InstanceLimitExceeded	EC2 instance limit exceeded	Request a limit increase in Service Quotas
🟡 InsufficientFreeAddresses	Not enough available IP addresses in the subnet	Expand the subnet CIDR or add a new subnet
⚪ InternalFailure	AWS internal error	Retry; contact AWS Support if it persists

AccessDenied Error Recovery -- Checking eks:node-manager ClusterRole:

The AccessDenied error typically occurs when the eks:node-manager ClusterRole or ClusterRoleBinding has been deleted or modified.

# Check eks:node-manager ClusterRole
kubectl get clusterrole eks:node-manager
kubectl get clusterrolebinding eks:node-manager

AccessDenied Recovery

If the eks:node-manager ClusterRole/ClusterRoleBinding is missing, EKS does not automatically restore them. You must recover manually using one of these methods:

Method 1: Manual Recreation (Recommended)

# eks-node-manager-role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: eks:node-manager
rules:
  - apiGroups: ['']
    resources: [pods]
    verbs: [get, list, watch, delete]
  - apiGroups: ['']
    resources: [nodes]
    verbs: [get, list, watch, patch]
  - apiGroups: ['']
    resources: [pods/eviction]
    verbs: [create]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: eks:node-manager
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: eks:node-manager
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: User
    name: eks:node-manager

kubectl auth reconcile -f eks-node-manager-role.yaml

Method 2: Recreate Node Group

# RBAC resources are created together when creating a new node group
eksctl create nodegroup --cluster=<cluster-name> --name=<new-nodegroup-name>

Method 3: Upgrade Node Group

# Upgrade process may trigger RBAC re-setup
eksctl upgrade nodegroup --cluster=<cluster-name> --name=<nodegroup-name>

Note: Kubernetes default system ClusterRoles (system:*) are auto-reconciled by the API server, but EKS-specific ClusterRoles (eks:*) are not auto-restored. Always back up RBAC resources before deleting them.

Debugging Node Bootstrap with Node Readiness Controller

New Kubernetes Feature (February 2026)

Node Readiness Controller is a new project announced on the official Kubernetes blog that declaratively solves premature scheduling issues during node bootstrapping.

Problem Scenario

In standard Kubernetes, workloads are scheduled as soon as a node reaches Ready state. However, the node may not actually be fully prepared:

Incomplete Component	Symptom	Impact
GPU driver/firmware loading	`nvidia-smi` failure, Pod `CrashLoopBackOff`	GPU workload failure
CNI plugin initializing	Pod IP unassigned, `NetworkNotReady`	Network communication failure
CSI driver not registered	PVC `Pending`, volume mount failure	Storage inaccessible
Security agent not installed	Compliance violation	Security policy unmet

How Node Readiness Controller Works

Node Readiness Controller declaratively manages custom taints, delaying workload scheduling until all infrastructure requirements are met.

Debugging Checklist

When a node is Ready but Pods are not being scheduled:

# 1. Check custom readiness taints on the node
kubectl get node <node-name> -o jsonpath='{.spec.taints}' | jq .

# 2. Filter for node.readiness related taints
kubectl get nodes -o json | jq '
  .items[] |
  select(.spec.taints // [] | any(.key | startswith("node.readiness"))) |
  {name: .metadata.name, taints: [.spec.taints[] | select(.key | startswith("node.readiness"))]}
'

# 3. Check Pod tolerations vs node taint mismatch
kubectl describe pod <pending-pod> | grep -A 20 "Events:"

schedulingGates allow controlling scheduling readiness from the Pod side:

apiVersion: v1
kind: Pod
metadata:
  name: gated-pod
spec:
  schedulingGates:
    - name: "example.com/gpu-validation"  # Scheduling waits until this gate is removed
  containers:
    - name: app
      image: app:latest

# Find Pods with schedulingGates
kubectl get pods -o json | jq '
  .items[] |
  select(.spec.schedulingGates != null and (.spec.schedulingGates | length > 0)) |
  {name: .metadata.name, namespace: .metadata.namespace, gates: .spec.schedulingGates}
'

AWS Load Balancer Controller injects Pod readiness gates into Pods created in namespaces labeled elbv2.k8s.aws/pod-readiness-gate-inject=enabled, which delays the Pod Ready state transition until ALB/NLB target registration is complete:

# Check Readiness Gate status
kubectl get pod <pod-name> -o jsonpath='{.status.conditions}' | jq '
  [.[] | select(.type | contains("target-health"))]
'

# Check if readiness gate injection is enabled for namespace
kubectl get namespace <ns> -o jsonpath='{.metadata.labels.elbv2\.k8s\.aws/pod-readiness-gate-inject}'

Readiness Feature Comparison

Feature	Target	Control Mechanism	Status
Node Readiness Controller	Node	Taint-based	New (Feb 2026)
Pod Scheduling Readiness	Pod	schedulingGates	GA (K8s 1.30)
Pod Readiness Gates	Pod	Readiness Conditions	GA (AWS LB Controller)

Using eks-node-viewer

eks-node-viewer is a tool that visualizes node resource utilization in real-time in the terminal.

# Basic usage (CPU-based)
eks-node-viewer

# View both CPU and memory
eks-node-viewer --resources cpu,memory

# View a specific NodePool only
eks-node-viewer --node-selector karpenter.sh/nodepool=<nodepool-name>

5. Workload Debugging

Pod Status Debugging Flowchart

Basic Debugging Commands

# Check Pod status
kubectl get pods -n <namespace>
kubectl describe pod <pod-name> -n <namespace>

# Check current/previous container logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous

# Check namespace events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

# Check resource usage
kubectl top pods -n <namespace>

Using kubectl debug

Ephemeral Container (Add a debug container to a running Pod)

# Basic ephemeral container
kubectl debug <pod-name> -it --image=busybox --target=<container-name>

# Image with network debugging tools
kubectl debug <pod-name> -it --image=nicolaka/netshoot --target=<container-name>

Pod Copy (Clone a Pod for debugging)

# Clone a Pod and start with a different image
kubectl debug <pod-name> --copy-to=debug-pod --image=ubuntu

# Change the command when cloning a Pod
kubectl debug <pod-name> -it --copy-to=debug-pod --container=<container-name> -- sh

Node Debugging (Direct access to a node)

# Node debugging (host filesystem is mounted at /host)
kubectl debug node/<node-name> -it --image=ubuntu

kubectl debug vs SSM

kubectl debug node/ can be used even on nodes where the SSM Agent is not installed. The debug Pod runs in the node's host network, PID, and IPC namespaces, but it is not privileged by default; add the --profile=sysadmin option when you need a privileged container.

Deployment Rollout Debugging

# Check rollout status
kubectl rollout status deployment/<name>

# Rollout history
kubectl rollout history deployment/<name>

# Roll back to the previous version
kubectl rollout undo deployment/<name>

# Roll back to a specific revision
kubectl rollout undo deployment/<name> --to-revision=2

# Restart Deployment (Rolling restart)
kubectl rollout restart deployment/<name>

HPA / VPA Debugging

# Check HPA status
kubectl get hpa
kubectl describe hpa <hpa-name>

# Verify metrics-server is running
kubectl get deployment metrics-server -n kube-system
kubectl top pods  # If this command fails, metrics-server has an issue

# Check scaling failure reasons in HPA events
kubectl describe hpa <hpa-name> | grep -A 5 "Events"

HPA Scaling Failure Analysis:

Symptom	Cause	Resolution
`unable to get metrics`	metrics-server not installed or failing	Check metrics-server Pod status and restart
`current metrics unknown`	Metric collection failing from target Pods	Verify resource requests are set on the Pod
`target not found`	scaleTargetRef mismatch	Verify Deployment/StatefulSet name and apiVersion
Scale-up followed by immediate scale-down	stabilizationWindow not configured	Set `behavior.scaleDown.stabilizationWindowSeconds`

Probe Debugging and Best Practices

# Recommended Probe configuration example
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
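spec:
  selector:
    matchLabels:
      app: web-app          # must match the Pod template labels below
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers: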
      - name: app
        image: my-app:latest
        ports:
        - containerPort: 8080
        # Startup Probe: Confirms app startup completion (essential for slow-starting apps)
        startupProbe:
          httpGet:
            path: /healthz
            port: 8080
          failureThreshold: 30    # Wait up to 300 seconds (30 x 10s)
          periodSeconds: 10
        # Liveness Probe: Checks if the app is alive (deadlock detection)
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
          successThreshold: 1
        # Readiness Probe: Checks if the app can receive traffic
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3
          successThreshold: 1

Probe Configuration Warnings

Do not include external dependencies in the Liveness Probe (such as database connectivity checks). This can trigger cascading failures where all Pods restart when an external service goes down.
Do not rely on a high initialDelaySeconds to cover slow startup. Use a startupProbe for slow-starting applications instead: liveness and readiness probes are disabled until the startupProbe succeeds.
A Readiness Probe failure does not restart the Pod; it only removes the Pod from the Service Endpoints.
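When sizing a startupProbe, it helps to compute the startup budget the configuration actually grants. The sketch below applies the same arithmetic as the comment in the example manifest (30 x 10s = 300 seconds); the function name startup_budget_seconds is hypothetical, and the result is approximate because it ignores timeoutSeconds.

```shell
# Approximate maximum startup time (seconds) allowed by a startupProbe:
#   initialDelaySeconds + failureThreshold * periodSeconds
# Arguments: failureThreshold periodSeconds [initialDelaySeconds, default 0]
startup_budget_seconds() {
  local failure_threshold="$1" period_seconds="$2" initial_delay="${3:-0}"
  echo $(( initial_delay + failure_threshold * period_seconds ))
}
```

If the slowest observed application start is close to this value, raise failureThreshold rather than adding delay to the liveness probe.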

6. Networking Debugging

Networking Debugging Workflow

VPC CNI Debugging

# Check VPC CNI Pod status
kubectl get pods -n kube-system -l k8s-app=aws-node

# Check VPC CNI logs
kubectl logs -n kube-system -l k8s-app=aws-node --tail=50

# Check current VPC CNI version
kubectl describe daemonset aws-node -n kube-system | grep Image

Resolving IP Exhaustion:

# Check available IPs per subnet
aws ec2 describe-subnets --subnet-ids <subnet-id> \
  --query 'Subnets[].{ID:SubnetId,AZ:AvailabilityZone,Available:AvailableIpAddressCount}'

# Enable Prefix Delegation (16x IP capacity increase)
kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true

ENI Limits and IP Quotas:

Each EC2 instance type has limits on the number of ENIs that can be attached and the number of IPs per ENI. Enabling Prefix Delegation significantly increases the IP allocation per ENI.
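The effect of these limits on Pod density can be estimated with the formula commonly used for the VPC CNI: ENIs x (IPs per ENI - 1) + 2, where each secondary IP slot holds a /28 prefix (16 addresses) when Prefix Delegation is enabled. The sketch below is an estimate under that assumption; the function name max_pods is hypothetical, the ENI and IP values must be looked up for your instance type, and the kubelet max-pods setting may cap the result in practice.

```shell
# Estimate the maximum number of Pod IPs on a node.
# Arguments: ENI count, IPv4 addresses per ENI, prefix delegation (true/false)
max_pods() {
  local enis="$1" ips_per_eni="$2" prefix_delegation="${3:-false}"
  local per_slot=1
  if [ "$prefix_delegation" = "true" ]; then
    per_slot=16   # each slot holds a /28 prefix instead of a single IP
  fi
  echo $(( enis * (ips_per_eni - 1) * per_slot + 2 ))
}
```

For an instance type with 3 ENIs and 10 IPv4 addresses per ENI, max_pods 3 10 gives the secondary-IP estimate and max_pods 3 10 true the Prefix Delegation estimate.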

DNS Troubleshooting

# Check CoreDNS Pod status
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50

# Test DNS resolution
kubectl run -it --rm debug --image=busybox --restart=Never -- nslookup kubernetes.default

# Check CoreDNS configuration
kubectl get configmap coredns -n kube-system -o yaml

# Restart CoreDNS
kubectl rollout restart deployment coredns -n kube-system

ndots Issue

In the default Kubernetes resolv.conf configuration, ndots:5 causes cluster-internal DNS suffixes to be tried first for domains with fewer than 5 dots. When accessing external domains, this results in 4 additional unnecessary DNS queries, increasing latency.

Solution: Set ndots:2 via dnsConfig.options in the Pod spec, or append a trailing . to external domain names (e.g., api.example.com.).

Note: The VPC DNS throttling limit is 1,024 packets/sec per ENI.
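The ndots rule above can be checked for a given name with a small helper. This is a minimal sketch of the resolver's decision only (it performs no DNS lookups); the function name search_list_first is hypothetical, and it requires bash for the pattern substitution.

```shell
# Succeed (exit 0) when the resolver would try the search suffixes
# before the name itself, i.e. the name has fewer dots than ndots
# and does not end with a trailing dot.
# Arguments: name [ndots, default 5]
search_list_first() {
  local name="$1" ndots="${2:-5}"
  case "$name" in
    *.) return 1 ;;                     # trailing dot: absolute name, search list skipped
  esac
  local dots_only="${name//[^.]/}"      # keep only the dots to count them
  [ "${#dots_only}" -lt "$ndots" ]
}
```

For example, search_list_first api.example.com 5 succeeds, while api.example.com. (trailing dot) or the same name with ndots set to 2 does not, which matches the two solutions described above.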

Service Debugging

# Check Service status
kubectl get svc <service-name>

# Check Endpoints (whether backend Pods are connected)
kubectl get endpoints <service-name>

# Detailed Service information (verify selector)
kubectl describe svc <service-name>

# Check Selector
kubectl get svc <service-name> -o jsonpath='{.spec.selector}'

# Find Pods matching the Selector
kubectl get pods -l <key>=<value>

Common Service Issues:

Symptom	What to Check	Resolution
Endpoints are empty	Service selector and Pod label mismatch	Fix labels
ClusterIP unreachable	Whether kube-proxy is running normally	`kubectl logs -n kube-system -l k8s-app=kube-proxy`
NodePort unreachable	Whether Security Group allows ports 30000-32767	Add SG inbound rule
LoadBalancer Pending	Whether AWS Load Balancer Controller is installed	Install controller and verify IAM permissions

NetworkPolicy Debugging

The most common mistake with NetworkPolicy is confusing AND vs OR selectors.

# AND logic (both selectors in the same element of the from list)
# Allow only Pods with role client in namespaces labeled user: alice
- from:
  - namespaceSelector:
      matchLabels:
        user: alice
    podSelector:
      matchLabels:
        role: client

# OR logic (two separate elements in the from list)
# Allow "all Pods in namespaces labeled user: alice" OR "Pods with role client in the policy's own namespace"
- from:
  - namespaceSelector:
      matchLabels:
        user: alice
  - podSelector:
      matchLabels:
        role: client

AND vs OR Caution

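The two YAML examples above differ only by a single - in front of podSelector, yet they produce completely different security policies. In the AND form, namespaceSelector and podSelector belong to the same element of the from list. In the OR form, each selector is its own list element. Note that a podSelector without a namespaceSelector matches only Pods in the NetworkPolicy's own namespace.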

Using netshoot

netshoot is a container image that includes all the tools needed for network debugging.

# Add as an ephemeral container to an existing Pod
kubectl debug <pod-name> -it --image=nicolaka/netshoot

# Run a standalone debugging Pod
kubectl run tmp-shell --rm -i --tty --image nicolaka/netshoot

# Available tools inside include:
# - curl, wget: HTTP testing
# - dig, nslookup: DNS testing
# - tcpdump: Packet capture
# - iperf3: Bandwidth testing
# - ss, netstat: Socket status inspection
# - traceroute, mtr: Route tracing

Practical Debugging Scenario: Verifying Pod-to-Pod Communication

# Start a netshoot Pod, then run the commands below inside its shell
kubectl run tmp-shell --rm -i --tty --image nicolaka/netshoot -- bash

# Verify DNS resolution
dig <service-name>.<namespace>.svc.cluster.local

# Test HTTP connectivity to the Service
curl -v http://<service-name>.<namespace>.svc.cluster.local:<port>/health

# Capture packets (traffic to a specific Pod IP)
tcpdump -i any host <pod-ip> -n

7. Storage Debugging

Storage Debugging Decision Tree

EBS CSI Driver Debugging

# Check EBS CSI Driver Pod status
kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-ebs-csi-driver

# Check Controller logs
kubectl logs -n kube-system -l app=ebs-csi-controller -c ebs-plugin --tail=100

# Check Node logs
kubectl logs -n kube-system -l app=ebs-csi-node -c ebs-plugin --tail=100

# Verify IRSA ServiceAccount
kubectl describe sa ebs-csi-controller-sa -n kube-system

EBS CSI Driver Error Patterns:

Error Message	Cause	Resolution
`could not create volume`	Insufficient IAM permissions	Add `ec2:CreateVolume`, `ec2:AttachVolume`, etc. to the IRSA Role
`volume is already attached to another node`	Not detached from previous node	Clean up previous Pod/node, wait for EBS volume detach (~6 min)
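`could not attach volume: already at max`	Instance EBS volume limit exceeded	Use an instance type with a higher attachment limit (most Nitro instances share 28 attachment slots among ENIs, EBS volumes, and instance store volumes; some newer generations support up to 128 volumes)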
`failed to provision volume with StorageClass`	StorageClass does not exist or misconfigured	Verify StorageClass name and parameters

Recommended StorageClass Configuration:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: topology-aware-ebs
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true

WaitForFirstConsumer

Setting volumeBindingMode: WaitForFirstConsumer defers volume provisioning and PVC binding until a Pod that uses the PVC is scheduled. The volume is then created in the AZ of the node chosen for the Pod, which prevents AZ mismatch issues.

EFS CSI Driver Debugging

# Check EFS CSI Driver Pod status
kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-efs-csi-driver

# Check Controller logs
kubectl logs -n kube-system -l app=efs-csi-controller -c efs-plugin --tail=100

# Check EFS filesystem status
aws efs describe-file-systems --file-system-id <fs-id>

# Verify Mount Targets (must exist in each AZ)
aws efs describe-mount-targets --file-system-id <fs-id>

EFS Checklist:

Verify that Mount Targets exist in the subnets of all AZs where Pods are running
Verify that the Mount Target's Security Group allows TCP 2049 (NFS) port
Verify that the node's Security Group permits outbound TCP 2049 to the EFS Mount Target

PV/PVC Status Inspection and Stuck Resolution

# Check PVC status
kubectl get pvc -n <namespace>

# Check PV status
kubectl get pv

# If PVC is stuck in Terminating (remove finalizer)
kubectl patch pvc <pvc-name> -n <namespace> -p '{"metadata":{"finalizers":null}}'

# Change PV from Released to Available (for reuse)
kubectl patch pv <pv-name> -p '{"spec":{"claimRef":null}}'

Manual Finalizer Removal Warning

Manually removing a finalizer bypasses the normal cleanup, so the backing storage (such as an EBS volume) may be left behind. First verify that the volume is no longer in use, then check the AWS console afterward for orphaned volumes.

8. Observability and Monitoring

Observability Stack Architecture

Container Insights Setup

# Install Container Insights Add-on
aws eks create-addon \
  --cluster-name <cluster-name> \
  --addon-name amazon-cloudwatch-observability

# Verify installation
kubectl get pods -n amazon-cloudwatch

Metric Debugging: PromQL Queries

CPU Throttling Detection

sum(rate(container_cpu_cfs_throttled_periods_total{namespace="production"}[5m]))
/ sum(rate(container_cpu_cfs_periods_total{namespace="production"}[5m])) > 0.25

CPU Throttling Threshold

A throttled-period ratio above 25% typically shows up as added latency and performance degradation. Consider raising or removing CPU limits. Many organizations set only CPU requests and omit CPU limits.
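When Prometheus is unavailable, the same ratio can be approximated from the cgroup counters inside a container. The sketch below is a minimal example, assuming a cpu.stat file in "key value" form with nr_periods and nr_throttled fields; the function name throttle_pct and the sample numbers are made up for illustration.

```shell
# throttle_pct: read cpu.stat-style "key value" lines on stdin and print the
# percentage of CFS periods in which the cgroup was throttled.
throttle_pct() {
  awk '
    $1 == "nr_periods"   { periods = $2 }
    $1 == "nr_throttled" { throttled = $2 }
    END {
      if (periods > 0) printf "%.1f\n", 100 * throttled / periods
      else             print "0.0"
    }'
}

# Sample counters (illustrative values only)
printf 'nr_periods 200\nnr_throttled 60\nthrottled_usec 1500000\n' | throttle_pct
```

Inside a container on a cgroup v2 node, the input would typically come from /sys/fs/cgroup/cpu.stat (the path differs on cgroup v1). These counters are cumulative since the container started, whereas the PromQL query above measures the ratio over the last 5 minutes, so the two numbers are comparable but not identical.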

OOMKilled Detection

kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} > 0

Pod Restart Rate

sum(rate(kube_pod_container_status_restarts_total[15m])) by (namespace, pod) > 0

Node CPU Utilization (Warning above 80%)

100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80

Node Memory Utilization (Warning above 85%)

(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85
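The memory expression can be cross-checked directly on a node (for example over SSM) from /proc/meminfo. The following is a small sketch that applies the same formula, (1 - MemAvailable / MemTotal) * 100; the function name mem_used_pct and the sample values are illustrative assumptions.

```shell
# mem_used_pct: read /proc/meminfo-style lines on stdin and print the
# percentage of memory that is not available, matching the PromQL above.
mem_used_pct() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END { if (total > 0) printf "%.1f\n", (1 - avail / total) * 100 }'
}

# Sample input (illustrative values only); on a node use: mem_used_pct < /proc/meminfo
printf 'MemTotal:       16000000 kB\nMemFree:         1000000 kB\nMemAvailable:    2000000 kB\n' | mem_used_pct
```

If the value printed on the node is far from what the dashboard shows, the scrape target or instance label is a more likely culprit than the formula.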

Log Debugging: CloudWatch Logs Insights

Error Log Analysis

fields @timestamp, @message, kubernetes.container_name, kubernetes.pod_name
| filter @message like /ERROR|FATAL|Exception/
| sort @timestamp desc
| limit 50

Latency Analysis

fields @timestamp, @message
| filter @message like /latency|duration|elapsed/
| parse @message /latency[=:]\s*(?<latency_ms>\d+)/
| stats avg(latency_ms), max(latency_ms), p99(latency_ms) by bin(5m)

Error Pattern Analysis for a Specific Pod

fields @timestamp, @message
| filter kubernetes.pod_name like /api-server/
| filter @message like /error|Error|ERROR/
| stats count() by bin(1m)
| sort bin asc

OOMKilled Event Tracking

fields @timestamp, @message
| filter @message like /OOMKilled|oom-kill|Out of memory/
| sort @timestamp desc
| limit 20

Container Restart Events

fields @timestamp, @message, kubernetes.pod_name
| filter @message like /Back-off restarting failed container|CrashLoopBackOff/
| stats count() as restart_count by kubernetes.pod_name
| sort restart_count desc

Alert Rules: PrometheusRule Example

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubernetes-alerts
spec:
  groups:
  - name: kubernetes-pods
    rules:
    - alert: PodCrashLooping
      expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 5 > 0
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
        description: "Pod {{ $labels.pod }} has been restarting over the last 15 minutes."

    - alert: PodOOMKilled
      expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} > 0
      for: 0m
      labels:
        severity: critical
      annotations:
        summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} OOMKilled"
        description: "Pod {{ $labels.pod }} was terminated due to out of memory. Memory limits adjustment is required."

  - name: kubernetes-nodes
    rules:
    - alert: NodeNotReady
      expr: kube_node_status_condition{condition="Ready",status="true"} == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Node {{ $labels.node }} is NotReady"

    - alert: NodeHighCPU
      expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Node {{ $labels.instance }} CPU usage above 80%"

    - alert: NodeHighMemory
      expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Node {{ $labels.instance }} memory usage above 85%"

ADOT (AWS Distro for OpenTelemetry) Debugging

ADOT is the AWS-supported distribution of OpenTelemetry. It collects traces, metrics, and logs and sends them to AWS services such as X-Ray, CloudWatch, and AMP.

# Check ADOT Add-on status
aws eks describe-addon --cluster-name $CLUSTER \
  --addon-name adot --query 'addon.{status:status,version:addonVersion}'

# Check ADOT operator Pods and logs
kubectl get pods -n opentelemetry-operator-system
kubectl logs -n opentelemetry-operator-system -l app.kubernetes.io/name=opentelemetry-operator --tail=50

# Check OpenTelemetryCollector CR
kubectl get otelcol -A
kubectl describe otelcol -n $NAMESPACE $COLLECTOR_NAME

Common ADOT Issues:

Symptom	Cause	Resolution
Operator Pod `CrashLoopBackOff`	CertManager not installed	CertManager is required for ADOT operator webhook certificate management. `kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml`
Collector fails to send to AMP	Insufficient IAM permissions	Add `aps:RemoteWrite` permission to IRSA/Pod Identity
X-Ray traces not received	Insufficient IAM permissions	Add `xray:PutTraceSegments`, `xray:PutTelemetryRecords` permissions to IRSA/Pod Identity
CloudWatch metrics not received	Insufficient IAM permissions	Add `cloudwatch:PutMetricData` permission to IRSA/Pod Identity
Collector Pod `OOMKilled`	Insufficient resources	Increase the Collector's resources.limits.memory when collecting large volumes of traces/metrics

ADOT Permission Separation

AMP remote write, X-Ray, and CloudWatch each require different IAM permissions. If the Collector sends data to multiple backends, verify that all required permissions are included in the IAM Role.

9. Incident Detection Mechanisms and Logging Architecture

9.1 Incident Detection Strategy Overview

Rapid incident detection in an EKS environment depends on a 4-layer pipeline: Data Sources -> Collection -> Analysis & Detection -> Alerting & Response. A gap in any layer delays detection, so all four must be connected end to end to minimize MTTD (Mean Time To Detect).

4-Layer Architecture Description:

Layer	Role	Key Components
Data Sources	Generates all observable signals from the cluster	Control Plane Logs, Data Plane Logs, Metrics, Traces
Collection Layer	Standardizes and forwards data from various sources to a central location	Fluent Bit, CloudWatch Agent, ADOT Collector
Analysis & Detection	Analyzes collected data and detects anomalies	CloudWatch Logs Insights, AMP, OpenSearch, Anomaly Detection
Alerting & Response	Notifies detected incidents through appropriate channels and executes auto-remediation	CloudWatch Alarms, Alertmanager, SNS -> Lambda, PagerDuty/Slack

9.2 Recommended Logging Architecture

Option A: AWS Native Stack (Small to Medium Clusters)

An architecture centered on AWS managed services that minimizes operational overhead.

Layer	Component	Purpose
Collection	Fluent Bit (DaemonSet)	Node/container log collection
Storage	CloudWatch Logs	Central log store
Analysis	CloudWatch Logs Insights	Query-based analysis
Detection	CloudWatch Anomaly Detection	ML-based anomaly detection
Alerting	CloudWatch Alarms -> SNS	Threshold/anomaly-based alerting

Fluent Bit DaemonSet Deployment Example:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: amazon-cloudwatch
  labels:
    app.kubernetes.io/name: fluent-bit
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: fluent-bit
  template:
    metadata:
      labels:
        app.kubernetes.io/name: fluent-bit
    spec:
      serviceAccountName: fluent-bit
      containers:
        - name: fluent-bit
          image: public.ecr.aws/aws-observability/aws-for-fluent-bit:2.32.0
          resources:
            limits:
              memory: 200Mi
            requests:
              cpu: 100m
              memory: 100Mi
          volumeMounts:
            - name: varlog
              mountPath: /var/log
              readOnly: true
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
            - name: fluent-bit-config
              mountPath: /fluent-bit/etc/
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers
        - name: fluent-bit-config
          configMap:
            name: fluent-bit-config

Fluent Bit vs Fluentd

Fluent Bit uses about a tenth or less of the memory that Fluentd does (~10MB vs ~100MB), which is why deploying Fluent Bit as a DaemonSet is the standard pattern in EKS environments. The amazon-cloudwatch-observability Add-on installs Fluent Bit automatically.

Option B: Open-Source Based Stack (Large-Scale / Multi-Cluster)

An architecture that combines open-source tools with AWS managed services to ensure scalability and flexibility in large-scale environments.

Layer	Component	Purpose
Collection	Fluent Bit + ADOT Collector	Unified log/metric/trace collection
Metrics	Amazon Managed Prometheus (AMP)	Time-series metric storage
Logs	Amazon OpenSearch Service	Large-scale log analytics
Traces	AWS X-Ray / Jaeger	Distributed tracing
Visualization	Amazon Managed Grafana	Unified dashboards
Alerting	Alertmanager + PagerDuty/Slack	Advanced routing, grouping, silencing

Multi-Cluster Architecture

In multi-cluster environments, a hub-and-spoke architecture is recommended where the ADOT Collector in each cluster sends metrics to a central AMP workspace. Grafana can monitor all clusters from a single dashboard.

9.3 Incident Detection Patterns

Pattern 1: Threshold-Based Detection

The most fundamental detection approach. An alert is triggered when a predefined threshold is exceeded.

# PrometheusRule - Threshold-based alert example
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: eks-threshold-alerts
  namespace: monitoring
spec:
  groups:
    - name: eks-thresholds
      rules:
        - alert: HighPodRestartRate
          expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} restart count increasing"
            description: "{{ $value }} restarts detected within 1 hour"

        - alert: NodeMemoryPressure
          expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.85
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Node {{ $labels.instance }} memory usage above 85%"

        - alert: PVCNearlyFull
          expr: kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.9
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "PVC {{ $labels.persistentvolumeclaim }} capacity above 90%"

Pattern 2: Anomaly Detection

Uses ML to learn normal patterns and detect deviations. Useful when it is difficult to predefine thresholds.

# CloudWatch Anomaly Detection setup
aws cloudwatch put-anomaly-detector \
  --single-metric-anomaly-detector '{
    "Namespace": "ContainerInsights",
    "MetricName": "pod_cpu_utilization",
    "Dimensions": [
      {"Name": "ClusterName", "Value": "'$CLUSTER'"},
      {"Name": "Namespace", "Value": "production"}
    ],
    "Stat": "Average"
  }'

# Create alarm based on Anomaly Detection
aws cloudwatch put-metric-alarm \
  --alarm-name "eks-cpu-anomaly" \
  --alarm-description "EKS CPU utilization anomaly detected" \
  --evaluation-periods 3 \
  --comparison-operator LessThanLowerOrGreaterThanUpperThreshold \
  --threshold-metric-id ad1 \
  --metrics '[
    {
      "Id": "m1",
      "MetricStat": {
        "Metric": {
          "Namespace": "ContainerInsights",
          "MetricName": "pod_cpu_utilization",
          "Dimensions": [
            {"Name": "ClusterName", "Value": "'$CLUSTER'"},
            {"Name": "Namespace", "Value": "production"}
          ]
        },
        "Period": 300,
        "Stat": "Average"
      }
    },
    {
      "Id": "ad1",
      "Expression": "ANOMALY_DETECTION_BAND(m1, 2)"
    }
  ]' \
  --alarm-actions $SNS_TOPIC_ARN

Anomaly Detection Learning Period

CloudWatch trains the anomaly detection model on up to two weeks of historical metric data, so the expected band is unreliable for metrics with little history. Immediately after deploying a new service, run threshold-based alerts in parallel.

Pattern 3: Composite Alarms

Logically combines multiple individual alarms to reduce noise and detect actual incidents accurately.

# Combine individual alarms with AND/OR
aws cloudwatch put-composite-alarm \
  --alarm-name "eks-service-degradation" \
  --alarm-rule 'ALARM("high-error-rate") AND (ALARM("high-latency") OR ALARM("pod-restart-spike"))' \
  --alarm-actions $SNS_TOPIC_ARN \
  --alarm-description "Service degradation detected: error rate increase + latency increase or Pod restart spike"

Composite Alarm Tips

Individual alarms alone often produce many False Positives. Combining multiple signals with Composite Alarms enables accurate detection of real incidents. For example: "error rate increase AND latency increase" indicates a service outage, while "error rate increase AND Pod restarts" indicates an application crash.
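The rule used in the eks-service-degradation example can be reasoned about as plain boolean logic. The sketch below is only a local truth-table check, not a CloudWatch API call; the function name composite_state is hypothetical, and its three arguments stand for the states of the high-error-rate, high-latency, and pod-restart-spike alarms.

```shell
# composite_state <error-rate> <latency> <restart-spike>
# Each argument is ALARM or OK. Mirrors the rule:
#   ALARM("high-error-rate") AND (ALARM("high-latency") OR ALARM("pod-restart-spike"))
composite_state() {
  if [ "$1" = "ALARM" ] && { [ "$2" = "ALARM" ] || [ "$3" = "ALARM" ]; }; then
    echo ALARM
  else
    echo OK
  fi
}

composite_state ALARM OK ALARM   # ALARM: errors plus restart spike
composite_state OK ALARM ALARM   # OK: error rate is not in alarm
composite_state ALARM OK OK      # OK: errors alone do not page
```

Walking through the combinations this way before creating the composite alarm helps confirm that a single noisy signal cannot trigger the action on its own.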

Pattern 4: Log-Based Metric Filters

Detects specific patterns in CloudWatch Logs, converts them to metrics, and sets up alerts.

# Convert OOMKilled events to metrics
aws logs put-metric-filter \
  --log-group-name "/aws/eks/$CLUSTER/cluster" \
  --filter-name "OOMKilledEvents" \
  --filter-pattern '{ $.reason = "OOMKilled" || $.reason = "OOMKilling" }' \
  --metric-transformations \
    metricName=OOMKilledCount,metricNamespace=EKS/Custom,metricValue=1,defaultValue=0

# Detect 403 Forbidden events (security threat)
aws logs put-metric-filter \
  --log-group-name "/aws/eks/$CLUSTER/cluster" \
  --filter-name "UnauthorizedAccess" \
  --filter-pattern '{ $.responseStatus.code = 403 }' \
  --metric-transformations \
    metricName=ForbiddenAccessCount,metricNamespace=EKS/Security,metricValue=1,defaultValue=0

9.4 Incident Detection Maturity Model

Incident detection capability can be assessed on four levels. Use the table to identify your current level and to plan the move to the next one.

Level	Stage	Detection Method	Tools	Target MTTD
Level 1	Basic	Manual monitoring + basic alarms	CloudWatch Alarms	< 30 min
Level 2	Standard	Thresholds + log metric filters	CloudWatch + Prometheus	< 10 min
Level 3	Advanced	Anomaly detection + Composite Alarms	Anomaly Detection + AMP	< 5 min
Level 4	Automated	Auto-detection + auto-remediation	Lambda + EventBridge + FIS	< 1 min

MTTD (Mean Time To Detect)

The average time from incident occurrence to detection. The goal is to continuously reduce MTTD as you grow from Level 1 to Level 4. Choose the appropriate level for your organization's SLOs.
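As a worked example of the definition, the sketch below computes MTTD from a list of incidents, each given as an occurrence time and a detection time in Unix epoch seconds. The function name mttd_minutes, the input format, and the sample timestamps are assumptions for illustration; real data would come from your incident tracker.

```shell
# mttd_minutes: stdin lines of "<occurred_epoch> <detected_epoch>";
# prints the mean detection delay in minutes.
mttd_minutes() {
  awk '
    NF == 2 { total += $2 - $1; n++ }
    END { if (n > 0) printf "%.1f\n", total / n / 60 }'
}

# Two sample incidents: detected after 5 minutes and after 15 minutes.
printf '1700000000 1700000300\n1700003600 1700004500\n' | mttd_minutes
```

The sample prints 10.0, which would sit at the boundary of the Level 2 target in the table above.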

9.5 Auto-Remediation Patterns

A pattern that uses EventBridge and Lambda to automatically execute recovery actions when a specific incident is detected.

# EventBridge rule: match the OOMKilled alarm entering ALARM state (the Lambda is attached separately as a target with aws events put-targets)
aws events put-rule \
  --name "eks-oom-auto-remediation" \
  --event-pattern '{
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {
      "alarmName": ["eks-oom-killed-alarm"],
      "state": {"value": ["ALARM"]}
    }
  }'

Auto-Remediation Caution

Apply auto-remediation to production only after thorough testing. Incorrect auto-remediation logic can worsen an incident. First validate the recovery logic in DRY_RUN mode where you only receive notifications, then gradually expand the scope of automation.
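One simple way to implement such a gate is a wrapper that prints the command instead of running it unless dry-run is explicitly switched off. The sketch below assumes the remediation steps are driven from a shell script; the wrapper name run, the DRY_RUN variable convention, and the example-app Deployment are all hypothetical. The same idea carries over to a Lambda handler.

```shell
# run: print the command when DRY_RUN is 1 (the default), execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "[dry-run] $*"
  else
    "$@"
  fi
}

# Hypothetical remediation step. With DRY_RUN unset this only prints the command.
run kubectl rollout restart deployment/example-app -n production
```

Defaulting to dry-run means a misconfigured environment fails safe: the action has to be enabled deliberately with DRY_RUN=0 once the notification-only phase has shown the trigger logic to be sound.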

9.6 Recommended Alert Channel Matrix

Set appropriate alert channels and response SLAs based on incident severity to prevent Alert Fatigue and focus on critical incidents.

Severity	Alert Channel	Response SLA	Examples
P1 (Critical)	PagerDuty + Phone Call	Within 15 min	Complete service outage, risk of data loss
P2 (High)	Slack DM + PagerDuty	Within 30 min	Partial service outage, severe performance degradation
P3 (Medium)	Slack channel	Within 4 hours	Increasing Pod restarts, resource usage warnings
P4 (Low)	Email / Jira ticket	Next business day	Disk usage growth, certificate nearing expiry

Alert Fatigue Caution

Too many alerts cause operations teams to ignore them (Alert Fatigue). Deliver P3/P4 alerts only to Slack channels, and send only genuine incidents (P1/P2) to PagerDuty. It is important to periodically review alert rules and eliminate False Positives.

10. Debugging Quick Reference

Error Pattern → Cause → Resolution Quick Reference Table

#	Error Pattern	Cause	Resolution
1	CrashLoopBackOff	App crash, misconfiguration, unmet dependency	`kubectl logs --previous`; review app configuration and environment variables
2	ImagePullBackOff	Image does not exist, registry authentication failure	Verify image name/tag; configure imagePullSecrets
3	OOMKilled	Memory limit exceeded	Increase memory limits; check the app for memory leaks
4	Pending (unschedulable)	Insufficient resources, nodeSelector mismatch	Check events with `kubectl describe pod`; check node capacity and labels
5	CreateContainerConfigError	ConfigMap/Secret does not exist	Verify that the referenced ConfigMap/Secret exists
6	Node NotReady	kubelet failure, resource pressure	Connect to the node via SSM; `systemctl status kubelet`
7	FailedAttachVolume	EBS volume attached to another node	Delete the previous Pod; wait for volume detach (~6 min)
8	FailedMount	EFS mount target/SG misconfiguration	Verify that the mount target exists and TCP 2049 is allowed
9	NetworkNotReady	VPC CNI not started	`kubectl logs -n kube-system -l k8s-app=aws-node`
10	DNS resolution failed	CoreDNS failure	Check CoreDNS Pod status/logs; `kubectl rollout restart`
11	Unauthorized / 403	Insufficient RBAC permissions, aws-auth misconfiguration	`aws sts get-caller-identity`; check aws-auth/Access Entry
12	connection refused	No Service endpoints, port mismatch	`kubectl get endpoints`; check selector and ports
13	Evicted	Node resource pressure (DiskPressure, etc.)	Clean up node disk; adjust Pod resource requests
14	FailedScheduling: Insufficient cpu/memory	Insufficient cluster capacity	Increase Karpenter NodePool limits; add nodes
15	Terminating (stuck)	Finalizer not completed, preStop hook delay	Check finalizers; if necessary, use `--force --grace-period=0`
16	Back-off pulling image	Pull timeout caused by a large image	Optimize the image; use a same-region registry such as ECR
17	readiness probe failed	Slow app startup, health check endpoint error	Add a startupProbe; adjust probe timeouts
18	Too many pods	Maximum Pods per node exceeded	Check the max-pods setting; enable Prefix Delegation

Essential kubectl Command Cheat Sheet

Inspection and Diagnosis

# View all resource status at a glance
kubectl get all -n <namespace>

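# Filter Pods whose phase is not Running or Succeeded
# (Pods in CrashLoopBackOff still report phase Running, so also check the STATUS column of kubectl get pods)
kubectl get pods --all-namespaces --field-selector=status.phase!=Running,status.phase!=Succeeded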

# Detailed Pod information (including events)
kubectl describe pod <pod-name> -n <namespace>

# Namespace events (sorted by time, most recent last)
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

# Resource usage
kubectl top nodes
kubectl top pods -n <namespace> --sort-by=memory

Log Inspection

# Current container logs
kubectl logs <pod-name> -n <namespace>

# Previous (crashed) container logs
kubectl logs <pod-name> -n <namespace> --previous

# Specific container in a multi-container Pod
kubectl logs <pod-name> -n <namespace> -c <container-name>

# Real-time log streaming
kubectl logs -f <pod-name> -n <namespace>

# View logs from multiple Pods by label
kubectl logs -l app=<app-name> -n <namespace> --tail=50

Debugging

# Debug with an ephemeral container
kubectl debug <pod-name> -it --image=nicolaka/netshoot --target=<container-name>

# Node debugging
kubectl debug node/<node-name> -it --image=ubuntu

# Execute a command inside a Pod
kubectl exec -it <pod-name> -n <namespace> -- <command>

Deployment Management

# Rollout status/history/rollback
kubectl rollout status deployment/<name>
kubectl rollout history deployment/<name>
kubectl rollout undo deployment/<name>

# Restart Deployment
kubectl rollout restart deployment/<name>

# Node maintenance (drain)
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
kubectl uncordon <node-name>

Recommended Tool Matrix

Scenario	Tool	Description
Network debugging	netshoot	Container with a comprehensive set of network tools
Node resource visualization	eks-node-viewer	Terminal-based node resource monitoring
Container runtime debugging	crictl	containerd debugging CLI
Log analysis	CloudWatch Logs Insights	AWS-native log query service
Metric querying	Prometheus / Grafana	PromQL-based metric analysis
Distributed tracing	ADOT / OpenTelemetry	Request path tracing
Cluster security audit	kube-bench	CIS Benchmark-based security scanning
YAML manifest validation	kubeval / kubeconform	Pre-deployment manifest validation
Karpenter debugging	Karpenter controller logs	Node provisioning issue diagnosis
IAM debugging	AWS IAM Policy Simulator	IAM permission verification

EKS Log Collector

EKS Log Collector is an AWS-provided script that automatically collects logs needed for debugging from EKS worker nodes and generates an archive file that can be submitted to AWS Support.

Installation and Execution:

# Download and run the script (on the node after SSM access)
curl -O https://raw.githubusercontent.com/awslabs/amazon-eks-ami/master/log-collector-script/linux/eks-log-collector.sh
sudo bash eks-log-collector.sh

Collected Items:

kubelet logs
containerd logs
iptables rules
CNI config (VPC CNI settings)
cloud-init logs
dmesg (kernel messages)
systemd units status

Output:

The collected logs are compressed and saved in the format /var/log/eks_i-xxxx_yyyy-mm-dd_HH-MM-SS.tar.gz.

S3 Upload:

# Upload collected logs directly to S3
sudo bash eks-log-collector.sh --upload s3://my-bucket/

Leveraging AWS Support

Attaching this log file when submitting an AWS Support case enables support engineers to quickly assess the node status, significantly reducing resolution time. Be sure to attach it when reporting node join failures, kubelet failures, network issues, etc.

EKS High Availability Architecture Guide - Architecture-level failure recovery strategies
GitOps-Based EKS Cluster Operations - GitOps deployment and operational automation
Ultra-Fast Autoscaling with Karpenter - Karpenter-based node provisioning optimization
Node Monitoring Agent - Node-level monitoring
