
📅 Written: 2026-02-10 | Last Modified: 2026-02-13 | ⏱️ Reading Time: ~20 min

EKS Incident Diagnosis and Response Guide

📌 Reference environment: EKS 1.30+, kubectl 1.30+, AWS CLI v2

1. Overview

Issues that arise during EKS operations span multiple layers including the control plane, nodes, networking, workloads, storage, and observability. This document is a comprehensive debugging guide designed to help SREs, DevOps engineers, and platform teams systematically diagnose and quickly resolve these issues.

All commands and examples are written to be immediately executable, and Decision Trees and flowcharts are provided to facilitate rapid decision-making.

EKS Debugging Layers

Debugging Methodology

There are two approaches to diagnosing EKS issues.

| Approach | Description | Suitable Scenarios |
| --- | --- | --- |
| Top-down (Symptom → Cause) | Start from user-reported symptoms and trace back to the root cause | Immediate incident response such as service outages or performance degradation |
| Bottom-up (Infra → App) | Inspect sequentially from the infrastructure layer upward | Preventive inspections, post-migration validation |
Recommended General Approach

For production incidents, the top-down approach is recommended. First identify the symptoms (Section 2 - Incident Triage), then navigate to the corresponding debugging section for that layer.


2. Incident Triage (Rapid Failure Assessment)

First 5 Minutes Checklist

When an incident occurs, the most important actions are scope identification and initial response.

30 Seconds: Initial Diagnosis

# Check cluster status
aws eks describe-cluster --name <cluster-name> --query 'cluster.status' --output text

# Check node status
kubectl get nodes

# Check unhealthy Pods
kubectl get pods --all-namespaces | grep -v Running | grep -v Completed

2 Minutes: Scope Identification

# Check recent events (all namespaces)
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -20

# Aggregate Pod status in a specific namespace
kubectl get pods -n <namespace> --no-headers | awk '{print $3}' | sort | uniq -c | sort -rn

# Check distribution of unhealthy Pods by node
kubectl get pods --all-namespaces -o wide --field-selector=status.phase!=Running | \
awk 'NR>1 {print $8}' | sort | uniq -c | sort -rn

5 Minutes: Initial Response

# Detailed information for the problematic Pod
kubectl describe pod <pod-name> -n <namespace>

# Previous container logs (for CrashLoopBackOff)
kubectl logs <pod-name> -n <namespace> --previous

# Check resource usage
kubectl top nodes
kubectl top pods -n <namespace> --sort-by=cpu

Scope Identification Decision Tree

AZ Failure Detection

AWS Health API Requirement

The aws health describe-events API requires an AWS Business or Enterprise Support plan. If you don't have a Support plan, check the AWS Health Dashboard console directly or create an EventBridge rule to capture Health events.

# Check AWS Health API for EKS/EC2 events (requires Business/Enterprise Support plan)
aws health describe-events \
--filter '{"services":["EKS","EC2"],"eventStatusCodes":["open"]}' \
--region us-east-1

# Alternative: Detect AZ failures without a Support plan — create EventBridge rule
aws events put-rule \
  --name "aws-health-eks-events" \
  --event-pattern '{
    "source": ["aws.health"],
    "detail-type": ["AWS Health Event"],
    "detail": {
      "service": ["EKS", "EC2"],
      "eventTypeCategory": ["issue"]
    }
  }'

# Aggregate unhealthy Pods by AZ (only pods scheduled to a node)
kubectl get pods --all-namespaces -o json | jq -r '
  .items[]
  | select(.status.phase != "Running" and .status.phase != "Succeeded")
  | select(.spec.nodeName != null)
  | .spec.nodeName
' | sort -u | while read -r node; do
  zone=$(kubectl get node "$node" -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}' 2>/dev/null)
  [ -n "$zone" ] && echo "$zone"
done | sort | uniq -c | sort -rn

# Check ARC Zonal Shift status
aws arc-zonal-shift list-zonal-shifts \
--resource-identifier arn:aws:eks:region:account:cluster/name

AZ Failure Response Using ARC Zonal Shift

# Enable Zonal Shift on EKS
aws eks update-cluster-config \
--name <cluster-name> \
--zonal-shift-config enabled=true

# Start manual Zonal Shift (move traffic away from impaired AZ)
aws arc-zonal-shift start-zonal-shift \
--resource-identifier arn:aws:eks:region:account:cluster/name \
--away-from us-east-1a \
--expires-in 3h \
--comment "AZ impairment detected"
Zonal Shift Considerations

A Zonal Shift can last up to 3 days and can be extended before it expires. Once a shift starts, new traffic to Pods running on nodes in the affected AZ is blocked, so verify that sufficient capacity exists in the other AZs before proceeding.

Zonal Shift Only Blocks Traffic

ARC Zonal Shift only changes traffic routing at the Load Balancer / Service level.

ARC Zonal Shift impact by layer (only traffic routing changes):

| Layer | Zonal Shift Impact | Manual Action Required |
| --- | --- | --- |
| 🔀 ALB / NLB | Targets in the affected AZ are removed from the Target Group | - (automatic) |
| 🔀 EKS Service (kube-proxy) | Endpoint weight for the affected AZ is removed | - (automatic) |
| 💻 Existing nodes | Keep running | Move Pods with kubectl drain |
| 📦 Existing Pods | Only traffic is blocked; the Pods themselves keep running | Rescheduled automatically when drained |
| ⚙️ Karpenter NodePool | AZ settings unchanged; new nodes can still be created in the affected AZ | Edit NodePool requirements |
| 📊 ASG (Managed Node Group) | Subnet list unchanged; can still scale out into the affected AZ | Edit ASG subnets (console/IaC) |
| 💾 EBS volumes | Pinned to their AZ; cannot move | Snapshot, then restore in another AZ |
| 📁 EFS Mount Targets | Mount Targets in other AZs are used automatically | - (automatic) |

Karpenter NodePool and ASG (Managed Node Group) AZ configurations are NOT automatically updated. A complete AZ evacuation requires additional steps:

  1. Start Zonal Shift → blocks new traffic (automatic)
  2. Drain nodes in the affected AZ → relocate existing Pods
  3. Remove the affected AZ from Karpenter NodePool or ASG subnets → prevent new node provisioning
# 1. Identify and drain nodes in the affected AZ
for node in $(kubectl get nodes -l topology.kubernetes.io/zone=us-east-1a -o name); do
  kubectl cordon "$node"
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --grace-period=60
done

# 2. Temporarily exclude the affected AZ from the Karpenter NodePool
# Note: a merge patch replaces the entire requirements list, so include
# every existing requirement, not just the zone entry
kubectl patch nodepool default --type=merge -p '{
  "spec": {"template": {"spec": {"requirements": [
    {"key": "topology.kubernetes.io/zone", "operator": "In", "values": ["us-east-1b", "us-east-1c"]}
  ]}}}
}'

# 3. For Managed Node Groups, update ASG subnets (via console or IaC)

Remember to revert these changes after the Zonal Shift is cancelled.

CloudWatch Anomaly Detection

# Set up Anomaly Detection alarm for Pod restart counts
aws cloudwatch put-anomaly-detector \
  --single-metric-anomaly-detector '{
    "Namespace": "ContainerInsights",
    "MetricName": "pod_number_of_container_restarts",
    "Dimensions": [
      {"Name": "ClusterName", "Value": "<cluster-name>"},
      {"Name": "Namespace", "Value": "production"}
    ],
    "Stat": "Average"
  }'

Incident Response Escalation Matrix

| Severity | Initial Response | Escalation Path | Example Scenarios |
| --- | --- | --- | --- |
| 🔴 P1 - Critical | Within 5 minutes | Page on-call and management immediately | Control plane outage, all nodes NotReady |
| 🟠 P2 - High | Within 15 minutes | On-call team | Single-AZ failure, many Pods in CrashLoopBackOff |
| 🟡 P3 - Medium | Within 1 hour | Owning team | HPA scaling failures, intermittent timeouts |
| 🔵 P4 - Low | Within 4 hours | Backlog | Single Pod restarts, non-production issues |
High Availability Architecture Guide Reference

For architecture-level failure recovery strategies (TopologySpreadConstraints, PodDisruptionBudget, Multi-AZ deployments, etc.), refer to the EKS High Availability Architecture Guide.


3. EKS Control Plane Debugging

Control Plane Log Types

The EKS control plane can send 5 log types to CloudWatch Logs.

Log group: /aws/eks/<cluster-name>/cluster

| Log Type | Component | Contents | Log Stream Prefix |
| --- | --- | --- | --- |
| api | kube-apiserver | API request/response records | kube-apiserver-* |
| audit | kube-apiserver-audit | Audit log (who, what, when) | kube-apiserver-audit-* |
| authenticator | aws-iam-authenticator | IAM authentication events | authenticator-* |
| controllerManager | kube-controller-manager | Controller activity logs | kube-controller-manager-* |
| scheduler | kube-scheduler | Scheduling decisions and failures | kube-scheduler-* |

Enabling Logs

# Enable all control plane logs
aws eks update-cluster-config \
--region <region> \
--name <cluster-name> \
--logging '{"clusterLogging":[{"types":["api","audit","authenticator","controllerManager","scheduler"],"enabled":true}]}'
Cost Optimization

Enabling all log types increases CloudWatch Logs costs. For production, it is recommended to enable audit and authenticator as mandatory, and enable the remaining types only when debugging is needed.

CloudWatch Logs Insights Queries

API Server Errors (400+) Analysis

fields @timestamp, @message
| filter @logStream like /kube-apiserver-audit/
| filter responseStatus.code >= 400
| stats count() by responseStatus.code
| sort count desc

Authentication Failure Tracking

fields @timestamp, @message
| filter @logStream like /authenticator/
| filter @message like /error/ or @message like /denied/
| sort @timestamp desc

aws-auth ConfigMap Change Detection

fields @timestamp, @message
| filter @logStream like /kube-apiserver-audit/
| filter objectRef.resource = "configmaps" and objectRef.name = "aws-auth"
| filter verb in ["update", "patch", "delete"]
| sort @timestamp desc

API Throttling Detection

fields @timestamp, @message
| filter @logStream like /kube-apiserver/
| filter @message like /throttle/ or @message like /rate limit/
| stats count() by bin(5m)

Unauthorized Access Attempts (Security Events)

fields @timestamp, @message
| filter @logStream like /kube-apiserver-audit/
| filter responseStatus.code = 403
| stats count() by user.username
| sort count desc

Authentication/Authorization Debugging

IAM Authentication Verification

# Check current IAM credentials
aws sts get-caller-identity

# Check cluster authentication mode
aws eks describe-cluster --name <cluster-name> \
--query 'cluster.accessConfig.authenticationMode' --output text

aws-auth ConfigMap (CONFIG_MAP Mode)

# Check aws-auth ConfigMap
kubectl describe configmap aws-auth -n kube-system

EKS Access Entries (API / API_AND_CONFIG_MAP Mode)

# Create Access Entry
aws eks create-access-entry \
--cluster-name <cluster-name> \
--principal-arn arn:aws:iam::ACCOUNT:role/ROLE-NAME \
--type STANDARD

# List Access Entries
aws eks list-access-entries --cluster-name <cluster-name>

IRSA (IAM Roles for Service Accounts) Debugging Checklist

# 1. Verify annotation on ServiceAccount
kubectl get sa <sa-name> -n <namespace> -o yaml

# 2. Check AWS environment variables inside the Pod
kubectl exec -it <pod-name> -- env | grep AWS

# 3. Verify OIDC Provider
aws eks describe-cluster --name <cluster-name> \
--query 'cluster.identity.oidc.issuer' --output text

# 4. Verify OIDC Provider ARN and conditions in IAM Role Trust Policy
aws iam get-role --role-name <role-name> \
--query 'Role.AssumeRolePolicyDocument'
Common IRSA Mistakes
  • Typo in the role ARN in the ServiceAccount annotation
  • Mismatch in namespace/sa name in the IAM Role Trust Policy
  • OIDC Provider not associated with the cluster
  • spec.serviceAccountName not specified for the Pod to use the ServiceAccount
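To make the second checklist item concrete, this is a minimal sketch of the trust policy shape IRSA expects. The account ID, region, and OIDC provider ID (`EXAMPLE1234`) are placeholders, and the `sub`/`aud` conditions are the parts where namespace and ServiceAccount name mismatches typically hide:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::111122223333:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE1234"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE1234:sub": "system:serviceaccount:<namespace>:<sa-name>",
          "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE1234:aud": "sts.amazonaws.com"
        }
      }
    }
  ]
}
```

The `sub` value must match the Pod's namespace and ServiceAccount exactly; a typo here produces the same AccessDenied symptoms as a missing annotation.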

Service Account Token Expiry (HTTP 401 Unauthorized)

In Kubernetes 1.21+, service account tokens are valid for 1 hour by default and are automatically refreshed by the kubelet. However, if you are using a legacy SDK that lacks token refresh logic, long-running workloads may encounter 401 Unauthorized errors.

Symptoms:

  • Pod suddenly returns HTTP 401 Unauthorized errors after a certain period (typically 1 hour)
  • Works normally temporarily after a restart, then 401 errors recur

Cause:

  • Projected Service Account Tokens expire after 1 hour by default
  • The kubelet automatically refreshes the token, but if the application reads the token file only once and caches it, the expired token continues to be used

Minimum Required SDK Versions:

| Language | SDK | Minimum Version |
| --- | --- | --- |
| Go | client-go | v0.15.7+ |
| Python | kubernetes | 12.0.0+ |
| Java | fabric8 | 5.0.0+ |
Token Refresh Verification

Verify that your SDK supports automatic token refresh. If it does not, your application must periodically re-read the /var/run/secrets/kubernetes.io/serviceaccount/token file.
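If you suspect stale-token caching, one quick check is to decode the token's `exp` claim and see how much lifetime remains. This is a hedged sketch: the token path is the standard projected mount, and the fallback payload is a fabricated sample (expiry 2100-01-01) so the snippet also runs outside a Pod:

```shell
# Inspect the exp claim of a projected service account token
TOKEN_FILE=/var/run/secrets/kubernetes.io/serviceaccount/token
if [ -r "$TOKEN_FILE" ]; then
  payload=$(cut -d. -f2 "$TOKEN_FILE")
else
  payload=$(printf '{"exp":4102444800}' | base64 | tr -d '=\n')  # fabricated sample claim
fi
# JWT payloads are base64url without padding; restore both before decoding
pad=$(( (4 - ${#payload} % 4) % 4 ))
padding=$(head -c "$pad" /dev/zero | tr '\0' '=')
exp=$(printf '%s%s' "$payload" "$padding" | tr -- '-_' '+/' | base64 -d | jq -r .exp)
echo "token expires in $(( exp - $(date +%s) )) seconds"
```

If the remaining lifetime is negative while the file on disk holds a fresh token, the application is caching a stale copy rather than re-reading the file.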

EKS Pod Identity Debugging

EKS Pod Identity is an alternative to IRSA that provides a simpler setup for granting AWS IAM permissions to Pods.

# Check Pod Identity Associations
aws eks list-pod-identity-associations --cluster-name $CLUSTER
aws eks describe-pod-identity-association --cluster-name $CLUSTER \
--association-id $ASSOC_ID

# Check Pod Identity Agent status
kubectl get pods -n kube-system -l app.kubernetes.io/name=eks-pod-identity-agent
kubectl logs -n kube-system -l app.kubernetes.io/name=eks-pod-identity-agent --tail=50

Pod Identity Debugging Checklist:

  • Verify the eks-pod-identity-agent Add-on is installed
  • Verify the correct association is linked to the Pod's ServiceAccount
  • Verify the IAM Role trust policy includes the pods.eks.amazonaws.com service principal
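For the last checklist item, this is a sketch of the trust policy Pod Identity expects. Unlike IRSA there is no OIDC condition block; note that sts:TagSession is required alongside sts:AssumeRole:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "pods.eks.amazonaws.com"
      },
      "Action": [
        "sts:AssumeRole",
        "sts:TagSession"
      ]
    }
  ]
}
```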
Pod Identity vs IRSA

Pod Identity has a simpler setup than IRSA and makes cross-account access easier. Pod Identity is recommended for new workloads.

EKS Add-on Troubleshooting

# List Add-ons
aws eks list-addons --cluster-name <cluster-name>

# Check Add-on status in detail
aws eks describe-addon --cluster-name <cluster-name> --addon-name <addon-name>

# Update Add-on (resolve conflicts with PRESERVE to keep existing settings)
aws eks update-addon --cluster-name <cluster-name> --addon-name <addon-name> \
--addon-version <version> --resolve-conflicts PRESERVE
| Add-on | Common Error Patterns | Diagnosis | Resolution |
| --- | --- | --- | --- |
| CoreDNS | Pod CrashLoopBackOff, DNS timeouts | kubectl logs -n kube-system -l k8s-app=kube-dns | Check ConfigMap, kubectl rollout restart deployment coredns -n kube-system |
| kube-proxy | Service communication failure, iptables errors | kubectl logs -n kube-system -l k8s-app=kube-proxy | Verify DaemonSet image version, kubectl rollout restart daemonset kube-proxy -n kube-system |
| VPC CNI | Pod IP allocation failure, ENI errors | kubectl logs -n kube-system -l k8s-app=aws-node | Check IPAMD logs, verify ENI/IP limits (see Section 6) |
| EBS CSI | PVC Pending, volume attach failure | kubectl logs -n kube-system -l app.kubernetes.io/name=aws-ebs-csi-driver | Check IRSA permissions, verify AZ matching (see Section 7) |

Cluster Health Issue Codes

When diagnosing infrastructure-level issues with the EKS cluster itself, check the cluster health status.

# Check cluster health issues
aws eks describe-cluster --name $CLUSTER \
--query 'cluster.health' --output json
Cluster health issue codes reported by aws eks describe-cluster --query 'cluster.health':

| Issue Code | Cause | Action | Recoverable? |
| --- | --- | --- | --- |
| SUBNET_NOT_FOUND | A cluster subnet was deleted | Associate new subnets | ⚠️ Conditionally |
| SECURITY_GROUP_NOT_FOUND | The cluster security group was deleted | Recreate the security group | ⚠️ Conditionally |
| IP_NOT_AVAILABLE | The subnet has run out of IP addresses | Add or expand subnets | ✅ Recoverable |
| VPC_NOT_FOUND | The VPC was deleted | Recreate the cluster | ❌ Not recoverable |
| ASSUME_ROLE_ACCESS_DENIED | Cluster IAM Role permission problem | Fix the IAM policy | ✅ Recoverable |
| KMS_KEY_DISABLED | The secrets-encryption KMS key is disabled | Re-enable the KMS key | ✅ Recoverable |
| KMS_KEY_NOT_FOUND | The KMS key was deleted | Recreate the cluster | ❌ Not recoverable |
Unrecoverable Issues

VPC_NOT_FOUND and KMS_KEY_NOT_FOUND are unrecoverable. The cluster must be recreated.


4. Node-Level Debugging

Node Join Failure Debugging

When a node fails to join the cluster, there can be various causes. Below are the 8 most common causes and their diagnostic methods.

Common Causes of Node Join Failure:

  1. Node IAM Role not registered in aws-auth ConfigMap (or Access Entry not created) — The node cannot authenticate with the API server
  2. ClusterName in bootstrap script does not match the actual cluster name — kubelet attempts to connect to the wrong cluster
  3. Node security group does not allow communication with the control plane — TCP 443 (API server) and TCP 10250 (kubelet) ports are required
  4. Auto-assign public IP is disabled in a public subnet — Cannot access the internet on clusters with only the public endpoint enabled
  5. VPC DNS configuration issue: enableDnsHostnames or enableDnsSupport is disabled
  6. STS regional endpoint is disabled — STS call fails during IAM authentication
  7. Instance profile ARN registered in aws-auth instead of node IAM Role ARN — Only the Role ARN should be registered in aws-auth
  8. kubernetes.io/cluster/<cluster-name> tag missing on self-managed nodes, so EKS cannot recognize the node as belonging to the cluster

Diagnostic Commands:

# Check node bootstrap logs (after SSM access)
sudo journalctl -u kubelet --no-pager | tail -50
sudo cat /var/log/cloud-init-output.log | tail -50

# Check security group rules
aws ec2 describe-security-groups --group-ids $CLUSTER_SG \
--query 'SecurityGroups[].IpPermissions' --output table

# Check VPC DNS settings
aws ec2 describe-vpc-attribute --vpc-id $VPC_ID --attribute enableDnsHostnames
aws ec2 describe-vpc-attribute --vpc-id $VPC_ID --attribute enableDnsSupport
ARN to Register in aws-auth

The aws-auth ConfigMap requires the IAM Role ARN (arn:aws:iam::ACCOUNT:role/...), not the instance profile ARN (arn:aws:iam::ACCOUNT:instance-profile/...). This is an extremely common mistake and a leading cause of node join failures.
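A quick way to catch this mistake is to scan mapRoles for instance-profile ARNs. In a real cluster you would feed this from kubectl get configmap aws-auth -n kube-system -o yaml; the sample entry below (using AWS's documentation account ID and a hypothetical profile name) keeps the check standalone:

```shell
# Flag instance-profile ARNs inside aws-auth mapRoles (a common node join failure)
map_roles='- rolearn: arn:aws:iam::111122223333:instance-profile/eksNodeInstanceProfile
  username: system:node:{{EC2PrivateDNSName}}
  groups:
  - system:bootstrappers
  - system:nodes'
if printf '%s\n' "$map_roles" | grep -q ':instance-profile/'; then
  echo "WARNING: mapRoles contains an instance profile ARN; register the role ARN instead"
fi
```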

Node NotReady Decision Tree

kubelet / containerd Debugging

# Connect to node via SSM
aws ssm start-session --target <instance-id>

# Check kubelet status
systemctl status kubelet
journalctl -u kubelet -n 100 -f

# Check containerd status
systemctl status containerd

# Check container runtime status
crictl pods
crictl ps -a

# Check logs for a specific container
crictl logs <container-id>
SSM Access Prerequisites

SSM access requires the AmazonSSMManagedInstanceCore policy to be attached to the node's IAM Role. This is included by default in EKS managed node groups, but if you are using a custom AMI, verify that the SSM Agent is installed.

Resource Pressure Diagnosis and Resolution

# Check node conditions
kubectl describe node <node-name>
| Condition | Threshold | Diagnostic Command | Resolution |
| --- | --- | --- | --- |
| DiskPressure | Available disk < 10% | df -h (after SSM access) | Clean unused images with crictl rmi --prune, remove stopped containers with crictl rm |
| MemoryPressure | Available memory < 100Mi | free -m (after SSM access) | Evict low-priority Pods, adjust memory requests/limits, replace the node |
| PIDPressure | Available PIDs < 5% | ps aux \| wc -l (after SSM access) | Increase kernel.pid_max, identify and restart the container leaking PIDs |
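The PIDPressure check can be expressed as a short one-liner to run on the node after SSM access; it works on any Linux host, so it can be tried locally first:

```shell
# Compare the live process count against the kernel PID limit
pids=$(ls /proc | grep -c '^[0-9][0-9]*$')
pid_max=$(cat /proc/sys/kernel/pid_max)
echo "PIDs in use: $pids / $pid_max ($(( pids * 100 / pid_max ))%)"
```

A sustained usage above ~95% of pid_max is what trips the PIDPressure condition.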

Karpenter Node Provisioning Debugging

# Check Karpenter controller logs
kubectl logs -f deployment/karpenter -n kube-system

# Check NodePool status
kubectl get nodepool
kubectl describe nodepool <nodepool-name>

# Check EC2NodeClass
kubectl get ec2nodeclass
kubectl describe ec2nodeclass <nodeclass-name>

# When provisioning fails, verify:
# 1. NodePool limits have not been exceeded
# 2. EC2NodeClass subnet/security group selectors are correct
# 3. Service Quotas are sufficient for the instance types
# 4. Pod nodeSelector/affinity matches NodePool requirements
Karpenter v1 API Changes

In Karpenter v1.0+, Provisioner has been renamed to NodePool and AWSNodeTemplate to EC2NodeClass. If you are using v0.x configurations, migration is required. Update the API groups to karpenter.sh/v1 (NodePool) and karpenter.k8s.aws/v1 (EC2NodeClass).

Managed Node Group Error Codes

Check the health status of Managed Node Groups to diagnose provisioning and operational issues.

# Check node group health status
aws eks describe-nodegroup --cluster-name $CLUSTER --nodegroup-name $NODEGROUP \
--query 'nodegroup.health' --output json
Managed Node Group error codes reported by aws eks describe-nodegroup --query 'nodegroup.health':

| Error Code | Cause | Resolution |
| --- | --- | --- |
| 🔴 AccessDenied | Node IAM Role lacks required permissions | Check and restore the eks:node-manager ClusterRole/ClusterRoleBinding |
| 🟡 AmiIdNotFound | The Launch Template's AMI ID does not exist | Update to a valid EKS-optimized AMI ID |
| 🔴 AutoScalingGroupNotFound | The ASG was deleted or does not exist | Delete and recreate the node group |
| 🔴 ClusterUnreachable | Nodes cannot reach the EKS API server | Check VPC settings, security groups, and endpoint accessibility |
| 🟡 Ec2SecurityGroupNotFound | The specified security group was deleted | Create the correct security group and reconfigure the node group |
| 🟡 Ec2LaunchTemplateNotFound | The Launch Template was deleted | Create a new Launch Template and update the node group |
| Ec2LaunchTemplateVersionMismatch | Launch Template version mismatch | Check and fix the version the node group references |
| 🟡 IamInstanceProfileNotFound | The instance profile does not exist | Recreate the IAM instance profile |
| 🔴 IamNodeRoleNotFound | The node IAM Role was deleted | Recreate the IAM Role and attach the required policies |
| 🟡 AsgInstanceLaunchFailures | EC2 instance launch failure (e.g. insufficient capacity) | Add other instance types/AZs, check Service Quotas |
| 🟡 NodeCreationFailure | Generic node creation failure | Check CloudTrail for detailed errors |
| 🟡 InstanceLimitExceeded | EC2 instance limit exceeded | Request a limit increase via Service Quotas |
| 🟡 InsufficientFreeAddresses | Subnet lacks available IP addresses | Expand the subnet CIDR or add new subnets |
| InternalFailure | AWS internal error | Retry; contact AWS Support if it persists |

AccessDenied Error Recovery: Checking the eks:node-manager ClusterRole

The AccessDenied error typically occurs when the eks:node-manager ClusterRole or ClusterRoleBinding has been deleted or modified.

# Check eks:node-manager ClusterRole
kubectl get clusterrole eks:node-manager
kubectl get clusterrolebinding eks:node-manager
AccessDenied Recovery

If the eks:node-manager ClusterRole/ClusterRoleBinding is missing, EKS does not automatically restore them. You must recover manually using one of these methods:

Method 1: Manual Recreation (Recommended)

# eks-node-manager-role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: eks:node-manager
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch", "delete"]
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "list", "watch", "patch"]
- apiGroups: [""]
  resources: ["pods/eviction"]
  verbs: ["create"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: eks:node-manager
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: eks:node-manager
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: eks:node-manager

kubectl auth reconcile -f eks-node-manager-role.yaml

Method 2: Recreate Node Group

# RBAC resources are created together when creating a new node group
eksctl create nodegroup --cluster=<cluster-name> --name=<new-nodegroup-name>

Method 3: Upgrade Node Group

# Upgrade process may trigger RBAC re-setup
eksctl upgrade nodegroup --cluster=<cluster-name> --name=<nodegroup-name>

Note: Kubernetes default system ClusterRoles (system:*) are auto-reconciled by the API server, but EKS-specific ClusterRoles (eks:*) are not auto-restored. Always backup RBAC resources before deletion.

Debugging Node Bootstrap with Node Readiness Controller

New Kubernetes Feature (February 2026)

Node Readiness Controller is a new project announced on the official Kubernetes blog that declaratively solves premature scheduling issues during node bootstrapping.

Problem Scenario

In standard Kubernetes, workloads are scheduled as soon as a node reaches Ready state. However, the node may not actually be fully prepared:

| Incomplete Component | Symptom | Impact |
| --- | --- | --- |
| GPU driver/firmware loading | nvidia-smi failure, Pod CrashLoopBackOff | GPU workload failure |
| CNI plugin initializing | Pod IP unassigned, NetworkNotReady | Network communication failure |
| CSI driver not registered | PVC Pending, volume mount failure | Storage inaccessible |
| Security agent not installed | Compliance violation | Security policy unmet |

How Node Readiness Controller Works

Node Readiness Controller declaratively manages custom taints, delaying workload scheduling until all infrastructure requirements are met:

Debugging Checklist

When a node is Ready but Pods are not being scheduled:

# 1. Check custom readiness taints on the node
kubectl get node <node-name> -o jsonpath='{.spec.taints}' | jq .

# 2. Filter for node.readiness related taints
kubectl get nodes -o json | jq '
  .items[]
  | select(.spec.taints // [] | any(.key | startswith("node.readiness")))
  | {name: .metadata.name, taints: [.spec.taints[] | select(.key | startswith("node.readiness"))]}
'

# 3. Check Pod tolerations vs node taint mismatch
kubectl describe pod <pending-pod> | grep -A 20 "Events:"

schedulingGates allow controlling scheduling readiness from the Pod side:

apiVersion: v1
kind: Pod
metadata:
  name: gated-pod
spec:
  schedulingGates:
  - name: "example.com/gpu-validation"  # Scheduling waits until this gate is removed
  containers:
  - name: app
    image: app:latest

# Find Pods with schedulingGates
kubectl get pods -o json | jq '
  .items[]
  | select(.spec.schedulingGates != null and (.spec.schedulingGates | length > 0))
  | {name: .metadata.name, namespace: .metadata.namespace, gates: .spec.schedulingGates}
'

AWS Load Balancer Controller uses the elbv2.k8s.aws/pod-readiness-gate-inject annotation to delay Pod Ready state transition until ALB/NLB target registration is complete:

# Check Readiness Gate status
kubectl get pod <pod-name> -o jsonpath='{.status.conditions}' | jq '
[.[] | select(.type | contains("target-health"))]
'

# Check if readiness gate injection is enabled for namespace
kubectl get namespace <ns> -o jsonpath='{.metadata.labels.elbv2\.k8s\.aws/pod-readiness-gate-inject}'
Readiness Feature Comparison
| Feature | Target | Control Mechanism | Status |
| --- | --- | --- | --- |
| Node Readiness Controller | Node | Taint-based | New (Feb 2026) |
| Pod Scheduling Readiness | Pod | schedulingGates | GA (K8s 1.30) |
| Pod Readiness Gates | Pod | Readiness Conditions | GA (AWS LB Controller) |

Using eks-node-viewer

eks-node-viewer is a tool that visualizes node resource utilization in real-time in the terminal.

# Basic usage (CPU-based)
eks-node-viewer

# View both CPU and memory
eks-node-viewer --resources cpu,memory

# View a specific NodePool only
eks-node-viewer --node-selector karpenter.sh/nodepool=<nodepool-name>

5. Workload Debugging

Pod Status Debugging Flowchart

Basic Debugging Commands

# Check Pod status
kubectl get pods -n <namespace>
kubectl describe pod <pod-name> -n <namespace>

# Check current/previous container logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous

# Check namespace events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

# Check resource usage
kubectl top pods -n <namespace>

Using kubectl debug

Ephemeral Container (Add a debug container to a running Pod)

# Basic ephemeral container
kubectl debug <pod-name> -it --image=busybox --target=<container-name>

# Image with network debugging tools
kubectl debug <pod-name> -it --image=nicolaka/netshoot --target=<container-name>

Pod Copy (Clone a Pod for debugging)

# Clone a Pod and start with a different image
kubectl debug <pod-name> --copy-to=debug-pod --image=ubuntu

# Change the command when cloning a Pod
kubectl debug <pod-name> --copy-to=debug-pod --container=<container-name> -- sh

Node Debugging (Direct access to a node)

# Node debugging (host filesystem is mounted at /host)
kubectl debug node/<node-name> -it --image=ubuntu
kubectl debug vs SSM

kubectl debug node/ can be used even on nodes where the SSM Agent is not installed. However, to access the host network namespace, add the --profile=sysadmin option.

Deployment Rollout Debugging

# Check rollout status
kubectl rollout status deployment/<name>

# Rollout history
kubectl rollout history deployment/<name>

# Roll back to the previous version
kubectl rollout undo deployment/<name>

# Roll back to a specific revision
kubectl rollout undo deployment/<name> --to-revision=2

# Restart Deployment (Rolling restart)
kubectl rollout restart deployment/<name>

HPA / VPA Debugging

# Check HPA status
kubectl get hpa
kubectl describe hpa <hpa-name>

# Verify metrics-server is running
kubectl get deployment metrics-server -n kube-system
kubectl top pods # If this command fails, metrics-server has an issue

# Check scaling failure reasons in HPA events
kubectl describe hpa <hpa-name> | grep -A 5 "Events"

HPA Scaling Failure Analysis:

| Symptom | Cause | Resolution |
| --- | --- | --- |
| unable to get metrics | metrics-server not installed or failing | Check metrics-server Pod status and restart |
| current metrics unknown | Metric collection failing from target Pods | Verify resource requests are set on the Pod |
| target not found | scaleTargetRef mismatch | Verify Deployment/StatefulSet name and apiVersion |
| Scale-up followed by immediate scale-down | stabilizationWindow not configured | Set behavior.scaleDown.stabilizationWindowSeconds |
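The stabilizationWindow fix from the last row looks like this in an autoscaling/v2 manifest; the resource names and thresholds are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # use the highest recommendation from the last 5 minutes
```

With a 300-second window, the HPA only scales down to the highest replica count it recommended over the previous 5 minutes, which suppresses flapping.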

Probe Debugging and Best Practices

# Recommended Probe configuration example
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
      - name: app
        image: my-app:latest
        ports:
        - containerPort: 8080
        # Startup Probe: confirms app startup completion (essential for slow-starting apps)
        startupProbe:
          httpGet:
            path: /healthz
            port: 8080
          failureThreshold: 30  # Wait up to 300 seconds (30 x 10s)
          periodSeconds: 10
        # Liveness Probe: checks if the app is alive (deadlock detection)
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
          successThreshold: 1
        # Readiness Probe: checks if the app can receive traffic
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3
          successThreshold: 1
Probe Configuration Warnings
  • Do not include external dependencies in the Liveness Probe (such as database connectivity checks). This can trigger cascading failures where all Pods restart when an external service goes down.
  • Do not set a high initialDelaySeconds without a startupProbe. Since liveness/readiness probes are disabled until the startupProbe succeeds, use a startupProbe for slow-starting applications.
  • A Readiness Probe failure does not restart the Pod; it only removes the Pod from the Service Endpoints.

6. Networking Debugging

Networking Debugging Workflow

VPC CNI Debugging

# Check VPC CNI Pod status
kubectl get pods -n kube-system -l k8s-app=aws-node

# Check VPC CNI logs
kubectl logs -n kube-system -l k8s-app=aws-node --tail=50

# Check current VPC CNI version
kubectl describe daemonset aws-node -n kube-system | grep Image

Resolving IP Exhaustion:

# Check available IPs per subnet
aws ec2 describe-subnets --subnet-ids <subnet-id> \
--query 'Subnets[].{ID:SubnetId,AZ:AvailabilityZone,Available:AvailableIpAddressCount}'

# Enable Prefix Delegation (16x IP capacity increase)
kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true

ENI Limits and IP Quotas:

Each EC2 instance type has limits on the number of ENIs that can be attached and the number of IPs per ENI. Enabling Prefix Delegation significantly increases the IP allocation per ENI.
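The max-pods arithmetic can be sketched as a small shell function. The ENI and IP-per-ENI figures are per instance type; m5.large (3 ENIs, 10 IPv4 addresses per ENI) is used as an illustrative example, and one IP per ENI is reserved for the primary address:

```shell
# VPC CNI max-pods formula: ENIs * (IPs per ENI - 1) + 2
max_pods() {
  local enis=$1 ips_per_eni=$2 prefix_delegation=$3
  if [ "$prefix_delegation" = "true" ]; then
    # With prefix delegation each secondary IP slot holds a /28 prefix (16 IPs)
    echo $(( enis * (ips_per_eni - 1) * 16 + 2 ))
  else
    echo $(( enis * (ips_per_eni - 1) + 2 ))
  fi
}
max_pods 3 10 false   # m5.large default: 29
max_pods 3 10 true    # with prefix delegation: 434
```

Note that even though the raw prefix-delegation number is large, the EKS max-pods calculator caps the recommendation (110 for smaller instance types) to stay within kubelet limits.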

DNS Troubleshooting

# Check CoreDNS Pod status
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50

# Test DNS resolution
kubectl run -it --rm debug --image=busybox --restart=Never -- nslookup kubernetes.default

# Check CoreDNS configuration
kubectl get configmap coredns -n kube-system -o yaml

# Restart CoreDNS
kubectl rollout restart deployment coredns -n kube-system
ndots Issue

In the default Kubernetes resolv.conf configuration, ndots:5 causes cluster-internal DNS suffixes to be tried first for domains with fewer than 5 dots. When accessing external domains, this results in 4 additional unnecessary DNS queries, increasing latency.

Solution: Set ndots:2 via dnsConfig.options in the Pod spec, or append a trailing . to external domain names (e.g., api.example.com.).
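A minimal sketch of the dnsConfig variant of that fix (Pod name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: low-ndots-pod
spec:
  dnsConfig:
    options:
    - name: ndots
      value: "2"  # names with 2+ dots are tried as absolute first
  containers:
  - name: app
    image: app:latest
```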

Note: The VPC DNS throttling limit is 1,024 packets/sec per ENI.
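To check whether that per-ENI allowance is actually being hit, the ENA driver exposes a linklocal_allowance_exceeded counter, which increments when traffic to link-local services (including the VPC DNS resolver) exceeds the allowance. On a node you would run ethtool -S eth0; a sample of that output is parsed here so the pipeline itself runs standalone:

```shell
# Parse ENA statistics for link-local (VPC DNS) throttling
sample='NIC statistics:
     linklocal_allowance_exceeded: 42
     bw_in_allowance_exceeded: 0'
exceeded=$(printf '%s\n' "$sample" | awk -F': *' '/linklocal_allowance_exceeded/ {print $2}')
if [ "$exceeded" -gt 0 ]; then
  echo "link-local (VPC DNS) traffic is being throttled on this ENI"
fi
```

A steadily increasing counter on CoreDNS nodes is a strong signal to add CoreDNS replicas or enable NodeLocal DNSCache.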

Service Debugging

# Check Service status
kubectl get svc <service-name>

# Check Endpoints (whether backend Pods are connected)
kubectl get endpoints <service-name>

# Detailed Service information (verify selector)
kubectl describe svc <service-name>

# Check Selector
kubectl get svc <service-name> -o jsonpath='{.spec.selector}'

# Find Pods matching the Selector
kubectl get pods -l <key>=<value>

Common Service Issues:

| Symptom | What to Check | Resolution |
| --- | --- | --- |
| Endpoints are empty | Service selector and Pod label mismatch | Fix labels |
| ClusterIP unreachable | Whether kube-proxy is running normally | kubectl logs -n kube-system -l k8s-app=kube-proxy |
| NodePort unreachable | Whether the Security Group allows ports 30000-32767 | Add SG inbound rule |
| LoadBalancer Pending | Whether AWS Load Balancer Controller is installed | Install the controller and verify IAM permissions |
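The empty-Endpoints case can be verified mechanically by testing whether the Pod's labels satisfy the Service selector. In a real cluster the two JSON inputs come from kubectl get svc <name> -o jsonpath='{.spec.selector}' and kubectl get pod <name> -o jsonpath='{.metadata.labels}'; the sample values below are illustrative:

```shell
# A selector matches when every selector key/value appears in the Pod labels
selector='{"app":"web","tier":"frontend"}'
labels='{"app":"web","tier":"backend","version":"v2"}'
match=$(jq -n --argjson s "$selector" --argjson l "$labels" \
  '$s | to_entries | all(.value == $l[.key])')
echo "selector matches pod labels: $match"
```

Here the result is false because tier differs, which is exactly the mismatch that leaves Endpoints empty.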

NetworkPolicy Debugging

The most common mistake with NetworkPolicy is confusing AND vs OR selectors.

# AND logic (combining two selectors within the same from entry)
# Allow only "Pods with role client in the alice namespace"
- from:
- namespaceSelector:
matchLabels:
user: alice
podSelector:
matchLabels:
role: client

# OR logic (separating into distinct from entries)
# Allow "all Pods in the alice namespace" OR "Pods with role client in any namespace"
- from:
- namespaceSelector:
matchLabels:
user: alice
- podSelector:
matchLabels:
role: client
AND vs OR Caution

The two YAML examples above differ by only one indentation level but result in completely different security policies. In AND logic, namespaceSelector and podSelector are within the same - from entry, while in OR logic, they are separate - from entries.
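For orientation, here is a complete manifest sketch wrapping the AND-logic snippet above. The policy name and target namespace are illustrative, not from the source:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-alice-clients   # illustrative name
  namespace: backend          # illustrative target namespace
spec:
  podSelector: {}             # applies to all Pods in this namespace
  policyTypes:
  - Ingress
  ingress:
  - from:
    # single from entry -> namespaceSelector AND podSelector must both match
    - namespaceSelector:
        matchLabels:
          user: alice
      podSelector:
        matchLabels:
          role: client
```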

Using netshoot

netshoot is a container image that includes all the tools needed for network debugging.

# Add as an ephemeral container to an existing Pod
kubectl debug <pod-name> -it --image=nicolaka/netshoot

# Run a standalone debugging Pod
kubectl run tmp-shell --rm -i --tty --image nicolaka/netshoot

# Available tools inside include:
# - curl, wget: HTTP testing
# - dig, nslookup: DNS testing
# - tcpdump: Packet capture
# - iperf3: Bandwidth testing
# - ss, netstat: Socket status inspection
# - traceroute, mtr: Route tracing

Practical Debugging Scenario: Verifying Pod-to-Pod Communication

# Test connectivity to another Service from a netshoot Pod
kubectl run tmp-shell --rm -i --tty --image nicolaka/netshoot -- bash

# Verify DNS resolution
dig <service-name>.<namespace>.svc.cluster.local

# Test TCP connectivity
curl -v http://<service-name>.<namespace>.svc.cluster.local:<port>/health

# Capture packets (traffic to a specific Pod IP)
tcpdump -i any host <pod-ip> -n

7. Storage Debugging

Storage Debugging Decision Tree

EBS CSI Driver Debugging

# Check EBS CSI Driver Pod status
kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-ebs-csi-driver

# Check Controller logs
kubectl logs -n kube-system -l app=ebs-csi-controller -c ebs-plugin --tail=100

# Check Node logs
kubectl logs -n kube-system -l app=ebs-csi-node -c ebs-plugin --tail=100

# Verify IRSA ServiceAccount
kubectl describe sa ebs-csi-controller-sa -n kube-system

EBS CSI Driver Error Patterns:

| Error Message | Cause | Resolution |
|---|---|---|
| could not create volume | Insufficient IAM permissions | Add ec2:CreateVolume, ec2:AttachVolume, etc. to the IRSA Role |
| volume is already attached to another node | Not detached from previous node | Clean up previous Pod/node, wait for EBS volume detach (~6 min) |
| could not attach volume: already at max | Instance EBS volume limit exceeded | Use a larger instance type (Nitro instances: up to 128 volumes) |
| failed to provision volume with StorageClass | StorageClass does not exist or misconfigured | Verify StorageClass name and parameters |

Recommended StorageClass Configuration:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: topology-aware-ebs
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
WaitForFirstConsumer

Using volumeBindingMode: WaitForFirstConsumer defers PVC binding until Pod scheduling time. This ensures that the volume is created in the same AZ where the Pod is scheduled, preventing AZ mismatch issues.
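A PVC sketch that consumes this StorageClass (name and size are illustrative). With WaitForFirstConsumer, expect the PVC to remain Pending until a Pod that references it is scheduled; this is normal, not an error:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc               # illustrative name
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: topology-aware-ebs
  resources:
    requests:
      storage: 20Gi            # illustrative size
```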

EFS CSI Driver Debugging

# Check EFS CSI Driver Pod status
kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-efs-csi-driver

# Check Controller logs
kubectl logs -n kube-system -l app=efs-csi-controller -c efs-plugin --tail=100

# Check EFS filesystem status
aws efs describe-file-systems --file-system-id <fs-id>

# Verify Mount Targets (must exist in each AZ)
aws efs describe-mount-targets --file-system-id <fs-id>

EFS Checklist:

  • Verify that Mount Targets exist in the subnets of all AZs where Pods are running
  • Verify that the Mount Target's Security Group allows TCP 2049 (NFS) port
  • Verify that the node's Security Group permits outbound TCP 2049 to the EFS Mount Target

PV/PVC Status Inspection and Stuck Resolution

# Check PVC status
kubectl get pvc -n <namespace>

# Check PV status
kubectl get pv

# If PVC is stuck in Terminating (remove finalizer)
kubectl patch pvc <pvc-name> -n <namespace> -p '{"metadata":{"finalizers":null}}'

# Change PV from Released to Available (for reuse)
kubectl patch pv <pv-name> -p '{"spec":{"claimRef":null}}'
Manual Finalizer Removal Warning

Manually removing a finalizer may leave associated storage resources (such as EBS volumes) uncleaned. First verify that the volume is not in use, and check the AWS console to ensure no orphan volumes are created.


8. Observability and Monitoring

Observability Stack Architecture

Container Insights Setup

# Install Container Insights Add-on
aws eks create-addon \
  --cluster-name <cluster-name> \
  --addon-name amazon-cloudwatch-observability

# Verify installation
kubectl get pods -n amazon-cloudwatch

Metric Debugging: PromQL Queries

CPU Throttling Detection

sum(rate(container_cpu_cfs_throttled_periods_total{namespace="production"}[5m]))
/ sum(rate(container_cpu_cfs_periods_total{namespace="production"}[5m])) > 0.25
CPU Throttling Threshold

Throttling above 25% causes performance degradation. Consider removing or increasing CPU limits. Many organizations adopt a strategy of setting only CPU requests without CPU limits.
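An illustrative resources fragment for that strategy (values are placeholders, not recommendations). Note that the memory limit is kept, since memory is not compressible and an unbounded container risks OOM-killing neighbors:

```yaml
resources:
  requests:
    cpu: 500m          # scheduling guarantee; CFS shares scale from this
    memory: 512Mi
  limits:
    memory: 512Mi      # memory limit retained
    # no cpu limit -> the container may burst, avoiding CFS throttling
```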

OOMKilled Detection

kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} > 0

Pod Restart Rate

sum(rate(kube_pod_container_status_restarts_total[15m])) by (namespace, pod) > 0

Node CPU Utilization (Warning above 80%)

100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80

Node Memory Utilization (Warning above 85%)

(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85

Log Debugging: CloudWatch Logs Insights

Error Log Analysis

fields @timestamp, @message, kubernetes.container_name, kubernetes.pod_name
| filter @message like /ERROR|FATAL|Exception/
| sort @timestamp desc
| limit 50

Latency Analysis

fields @timestamp, @message
| filter @message like /latency|duration|elapsed/
| parse @message /latency[=:]\s*(?<latency_ms>\d+)/
| stats avg(latency_ms), max(latency_ms), p99(latency_ms) by bin(5m)

Error Pattern Analysis for a Specific Pod

fields @timestamp, @message
| filter kubernetes.pod_name like /api-server/
| filter @message like /error|Error|ERROR/
| stats count() by bin(1m)
| sort bin asc

OOMKilled Event Tracking

fields @timestamp, @message
| filter @message like /OOMKilled|oom-kill|Out of memory/
| sort @timestamp desc
| limit 20

Container Restart Events

fields @timestamp, @message, kubernetes.pod_name
| filter @message like /Back-off restarting failed container|CrashLoopBackOff/
| stats count() by kubernetes.pod_name
| sort count desc

Alert Rules: PrometheusRule Example

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubernetes-alerts
spec:
  groups:
  - name: kubernetes-pods
    rules:
    - alert: PodCrashLooping
      expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 5 > 0
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
        description: "Pod {{ $labels.pod }} has been restarting over the last 15 minutes."

    - alert: PodOOMKilled
      expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} > 0
      for: 0m
      labels:
        severity: critical
      annotations:
        summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} OOMKilled"
        description: "Pod {{ $labels.pod }} was terminated due to out of memory. Memory limits adjustment is required."

  - name: kubernetes-nodes
    rules:
    - alert: NodeNotReady
      expr: kube_node_status_condition{condition="Ready",status="true"} == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Node {{ $labels.node }} is NotReady"

    - alert: NodeHighCPU
      expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Node {{ $labels.instance }} CPU usage above 80%"

    - alert: NodeHighMemory
      expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Node {{ $labels.instance }} memory usage above 85%"

ADOT (AWS Distro for OpenTelemetry) Debugging

ADOT is the AWS-managed distribution of OpenTelemetry that collects traces, metrics, and logs and sends them to various AWS services (X-Ray, CloudWatch, AMP, etc.).

# Check ADOT Add-on status
aws eks describe-addon --cluster-name $CLUSTER \
  --addon-name adot --query 'addon.{status:status,version:addonVersion}'

# Check ADOT Collector Pods
kubectl get pods -n opentelemetry-operator-system
kubectl logs -n opentelemetry-operator-system -l app.kubernetes.io/name=opentelemetry-operator --tail=50

# Check OpenTelemetryCollector CR
kubectl get otelcol -A
kubectl describe otelcol -n $NAMESPACE $COLLECTOR_NAME

Common ADOT Issues:

| Symptom | Cause | Resolution |
|---|---|---|
| Operator Pod CrashLoopBackOff | CertManager not installed | CertManager is required for ADOT operator webhook certificate management. kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml |
| Collector fails to send to AMP | Insufficient IAM permissions | Add aps:RemoteWrite permission to IRSA/Pod Identity |
| X-Ray traces not received | Insufficient IAM permissions | Add xray:PutTraceSegments, xray:PutTelemetryRecords permissions to IRSA/Pod Identity |
| CloudWatch metrics not received | Insufficient IAM permissions | Add cloudwatch:PutMetricData permission to IRSA/Pod Identity |
| Collector Pod OOMKilled | Insufficient resources | Increase the Collector's resources.limits.memory when collecting large volumes of traces/metrics |
ADOT Permission Separation

AMP remote write, X-Ray, and CloudWatch each require different IAM permissions. If the Collector sends data to multiple backends, verify that all required permissions are included in the IAM Role.


9. Incident Detection Mechanisms and Logging Architecture

9.1 Incident Detection Strategy Overview

To rapidly detect incidents in an EKS environment, you must systematically build a 4-layer pipeline: Data Sources -> Collection -> Analysis & Detection -> Alerting & Response. Each layer must be organically connected to minimize MTTD (Mean Time To Detect).

4-Layer Architecture Description:

| Layer | Role | Key Components |
|---|---|---|
| Data Sources | Generates all observable signals from the cluster | Control Plane Logs, Data Plane Logs, Metrics, Traces |
| Collection Layer | Standardizes and forwards data from various sources to a central location | Fluent Bit, CloudWatch Agent, ADOT Collector |
| Analysis & Detection | Analyzes collected data and detects anomalies | CloudWatch Logs Insights, AMP, OpenSearch, Anomaly Detection |
| Alerting & Response | Notifies detected incidents through appropriate channels and executes auto-remediation | CloudWatch Alarms, Alertmanager, SNS -> Lambda, PagerDuty/Slack |

9.2 Logging Architecture Options

Option A: AWS Native Stack (Small to Medium Clusters)

An architecture centered on AWS managed services that minimizes operational overhead.

| Layer | Component | Purpose |
|---|---|---|
| Collection | Fluent Bit (DaemonSet) | Node/container log collection |
| Transport | CloudWatch Logs | Central log store |
| Analysis | CloudWatch Logs Insights | Query-based analysis |
| Detection | CloudWatch Anomaly Detection | ML-based anomaly detection |
| Alerting | CloudWatch Alarms -> SNS | Threshold/anomaly-based alerting |

Fluent Bit DaemonSet Deployment Example:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: amazon-cloudwatch
  labels:
    app.kubernetes.io/name: fluent-bit
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: fluent-bit
  template:
    metadata:
      labels:
        app.kubernetes.io/name: fluent-bit
    spec:
      serviceAccountName: fluent-bit
      containers:
      - name: fluent-bit
        image: public.ecr.aws/aws-observability/aws-for-fluent-bit:2.32.0
        resources:
          limits:
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 100Mi
        volumeMounts:
        - name: varlog
          mountPath: /var/log
          readOnly: true
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: fluent-bit-config
          mountPath: /fluent-bit/etc/
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: fluent-bit-config
        configMap:
          name: fluent-bit-config
Fluent Bit vs Fluentd

Fluent Bit uses more than 10x less memory than Fluentd (~10MB vs ~100MB). Deploying Fluent Bit as a DaemonSet is the standard pattern in EKS environments. Using the amazon-cloudwatch-observability Add-on automatically installs Fluent Bit.
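The DaemonSet above mounts a fluent-bit-config ConfigMap that is not shown. A minimal sketch of one follows, assuming the cloudwatch_logs output plugin bundled in the aws-for-fluent-bit image; the region, cluster, and log-group names are placeholders:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: amazon-cloudwatch
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush         5
        Log_Level     info
    [INPUT]
        Name          tail
        Path          /var/log/containers/*.log
        Tag           kube.*
        Mem_Buf_Limit 50MB
    [OUTPUT]
        Name              cloudwatch_logs
        Match             kube.*
        region            us-east-1
        log_group_name    /aws/containerinsights/my-cluster/application
        log_stream_prefix from-fluent-bit-
        auto_create_group true
```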

Option B: Open-Source Based Stack (Large-Scale / Multi-Cluster)

An architecture that combines open-source tools with AWS managed services to ensure scalability and flexibility in large-scale environments.

| Layer | Component | Purpose |
|---|---|---|
| Collection | Fluent Bit + ADOT Collector | Unified log/metric/trace collection |
| Metrics | Amazon Managed Prometheus (AMP) | Time-series metric storage |
| Logs | Amazon OpenSearch Service | Large-scale log analytics |
| Traces | AWS X-Ray / Jaeger | Distributed tracing |
| Visualization | Amazon Managed Grafana | Unified dashboards |
| Alerting | Alertmanager + PagerDuty/Slack | Advanced routing, grouping, silencing |
Multi-Cluster Architecture

In multi-cluster environments, a hub-and-spoke architecture is recommended where the ADOT Collector in each cluster sends metrics to a central AMP workspace. Grafana can monitor all clusters from a single dashboard.
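As a sketch of the spoke side of that architecture (the workspace ID, region, and resource names are assumptions): an OpenTelemetryCollector CR that scrapes Prometheus metrics and remote-writes them to a central AMP workspace, authenticating with the sigv4auth extension via an IRSA-bound ServiceAccount.

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: amp-collector               # illustrative name
  namespace: opentelemetry-operator-system
spec:
  mode: deployment
  serviceAccount: adot-collector    # SA bound to a role with aps:RemoteWrite
  config: |
    extensions:
      sigv4auth:
        region: us-east-1           # placeholder region
    receivers:
      prometheus:
        config:
          scrape_configs:
          - job_name: kubernetes-pods
            kubernetes_sd_configs:
            - role: pod
    exporters:
      prometheusremotewrite:
        endpoint: https://aps-workspaces.us-east-1.amazonaws.com/workspaces/<workspace-id>/api/v1/remote_write
        auth:
          authenticator: sigv4auth
    service:
      extensions: [sigv4auth]
      pipelines:
        metrics:
          receivers: [prometheus]
          exporters: [prometheusremotewrite]
```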

9.3 Incident Detection Patterns

Pattern 1: Threshold-Based Detection

The most fundamental detection approach. An alert is triggered when a predefined threshold is exceeded.

# PrometheusRule - Threshold-based alert example
# PrometheusRule - Threshold-based alert example
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: eks-threshold-alerts
  namespace: monitoring
spec:
  groups:
  - name: eks-thresholds
    rules:
    - alert: HighPodRestartRate
      expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} restart count increasing"
        description: "{{ $value }} restarts detected within 1 hour"

    - alert: NodeMemoryPressure
      expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.85
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Node {{ $labels.instance }} memory usage above 85%"

    - alert: PVCNearlyFull
      expr: kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.9
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "PVC {{ $labels.persistentvolumeclaim }} capacity above 90%"

Pattern 2: Anomaly Detection

Uses ML to learn normal patterns and detect deviations. Useful when it is difficult to predefine thresholds.

# CloudWatch Anomaly Detection setup
aws cloudwatch put-anomaly-detector \
  --single-metric-anomaly-detector '{
    "Namespace": "ContainerInsights",
    "MetricName": "pod_cpu_utilization",
    "Dimensions": [
      {"Name": "ClusterName", "Value": "'$CLUSTER'"},
      {"Name": "Namespace", "Value": "production"}
    ],
    "Stat": "Average"
  }'

# Create alarm based on Anomaly Detection
aws cloudwatch put-metric-alarm \
  --alarm-name "eks-cpu-anomaly" \
  --alarm-description "EKS CPU utilization anomaly detected" \
  --evaluation-periods 3 \
  --comparison-operator LessThanLowerOrGreaterThanUpperThreshold \
  --threshold-metric-id ad1 \
  --metrics '[
    {
      "Id": "m1",
      "MetricStat": {
        "Metric": {
          "Namespace": "ContainerInsights",
          "MetricName": "pod_cpu_utilization",
          "Dimensions": [
            {"Name": "ClusterName", "Value": "'$CLUSTER'"}
          ]
        },
        "Period": 300,
        "Stat": "Average"
      }
    },
    {
      "Id": "ad1",
      "Expression": "ANOMALY_DETECTION_BAND(m1, 2)"
    }
  ]' \
  --alarm-actions $SNS_TOPIC_ARN
Anomaly Detection Learning Period

Anomaly Detection requires a minimum learning period of 2 weeks. Use threshold-based alerts in parallel immediately after deploying a new service.

Pattern 3: Composite Alarms

Logically combines multiple individual alarms to reduce noise and detect actual incidents accurately.

# Combine individual alarms with AND/OR
aws cloudwatch put-composite-alarm \
  --alarm-name "eks-service-degradation" \
  --alarm-rule 'ALARM("high-error-rate") AND (ALARM("high-latency") OR ALARM("pod-restart-spike"))' \
  --alarm-actions $SNS_TOPIC_ARN \
  --alarm-description "Service degradation detected: error rate increase + latency increase or Pod restart spike"
Composite Alarm Tips

Individual alarms alone often produce many False Positives. Combining multiple signals with Composite Alarms enables accurate detection of real incidents. For example: "error rate increase AND latency increase" indicates a service outage, while "error rate increase AND Pod restarts" indicates an application crash.

Pattern 4: Log-Based Metric Filters

Detects specific patterns in CloudWatch Logs, converts them to metrics, and sets up alerts.

# Convert OOMKilled events to metrics
aws logs put-metric-filter \
  --log-group-name "/aws/eks/$CLUSTER/cluster" \
  --filter-name "OOMKilledEvents" \
  --filter-pattern '{ $.reason = "OOMKilled" || $.reason = "OOMKilling" }' \
  --metric-transformations \
    metricName=OOMKilledCount,metricNamespace=EKS/Custom,metricValue=1,defaultValue=0

# Detect 403 Forbidden events (security threat)
aws logs put-metric-filter \
  --log-group-name "/aws/eks/$CLUSTER/cluster" \
  --filter-name "UnauthorizedAccess" \
  --filter-pattern '{ $.responseStatus.code = 403 }' \
  --metric-transformations \
    metricName=ForbiddenAccessCount,metricNamespace=EKS/Security,metricValue=1,defaultValue=0

9.4 Incident Detection Maturity Model

An organization's incident detection capability can be classified into four levels, helping you diagnose your current state and plan the path to the next level.

| Level | Stage | Detection Method | Tools | Target MTTD |
|---|---|---|---|---|
| Level 1 | Basic | Manual monitoring + basic alarms | CloudWatch Alarms | < 30 min |
| Level 2 | Standard | Thresholds + log metric filters | CloudWatch + Prometheus | < 10 min |
| Level 3 | Advanced | Anomaly detection + Composite Alarms | Anomaly Detection + AMP | < 5 min |
| Level 4 | Automated | Auto-detection + auto-remediation | Lambda + EventBridge + FIS | < 1 min |
MTTD (Mean Time To Detect)

The average time from incident occurrence to detection. The goal is to continuously reduce MTTD as you grow from Level 1 to Level 4. Choose the appropriate level for your organization's SLOs.

9.5 Auto-Remediation Patterns

A pattern that uses EventBridge and Lambda to automatically execute recovery actions when a specific incident is detected.

# EventBridge rule: Detect Pod OOMKilled → trigger Lambda
aws events put-rule \
  --name "eks-oom-auto-remediation" \
  --event-pattern '{
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {
      "alarmName": ["eks-oom-killed-alarm"],
      "state": {"value": ["ALARM"]}
    }
  }'
Auto-Remediation Caution

Apply auto-remediation to production only after thorough testing. Incorrect auto-remediation logic can worsen an incident. First validate the recovery logic in DRY_RUN mode where you only receive notifications, then gradually expand the scope of automation.
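A minimal Lambda handler sketch for this pattern, illustrating the DRY_RUN-first approach described above. The function name, environment variable, and alarm-to-action mapping are assumptions for illustration, not from the source; a real handler would call the Kubernetes API or an AWS SDK in the non-dry-run branch.

```python
import json
import os

# DRY_RUN=true (the default) means: report the planned action, change nothing.
DRY_RUN = os.environ.get("DRY_RUN", "true").lower() == "true"

# Illustrative mapping from alarm name to a remediation action.
REMEDIATIONS = {
    "eks-oom-killed-alarm": "rollout-restart-deployment",
}

def handler(event, context=None):
    """Handle a CloudWatch Alarm State Change event delivered by EventBridge."""
    detail = event.get("detail", {})
    alarm = detail.get("alarmName", "")
    state = detail.get("state", {}).get("value", "")

    if state != "ALARM":
        # Ignore OK / INSUFFICIENT_DATA transitions.
        return {"alarm": alarm, "action": "none", "reason": f"state={state}"}

    action = REMEDIATIONS.get(alarm, "notify-only")
    if DRY_RUN:
        # Notification-only mode: safe default while validating the logic.
        return {"alarm": alarm, "action": action, "dry_run": True}

    # Real remediation (e.g., patching the Deployment) would go here.
    return {"alarm": alarm, "action": action, "dry_run": False}

if __name__ == "__main__":
    sample = {
        "detail": {
            "alarmName": "eks-oom-killed-alarm",
            "state": {"value": "ALARM"},
        }
    }
    print(json.dumps(handler(sample)))
```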

9.6 Severity-Based Alert Routing

Set appropriate alert channels and response SLAs based on incident severity to prevent Alert Fatigue and focus on critical incidents.

| Severity | Alert Channel | Response SLA | Examples |
|---|---|---|---|
| P1 (Critical) | PagerDuty + Phone Call | Within 15 min | Complete service outage, risk of data loss |
| P2 (High) | Slack DM + PagerDuty | Within 30 min | Partial service outage, severe performance degradation |
| P3 (Medium) | Slack channel | Within 4 hours | Increasing Pod restarts, resource usage warnings |
| P4 (Low) | Email / Jira ticket | Next business day | Disk usage growth, certificate nearing expiry |
Alert Fatigue Caution

Too many alerts cause operations teams to ignore them (Alert Fatigue). Deliver P3/P4 alerts only to Slack channels, and send only genuine incidents (P1/P2) to PagerDuty. It is important to periodically review alert rules and eliminate False Positives.
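A minimal Alertmanager routing sketch matching this policy (receiver names, keys, and the channel are placeholders): critical alerts page via PagerDuty, while warnings go only to Slack.

```yaml
route:
  receiver: slack-alerts        # default: low-severity noise stays in Slack
  group_by: ['alertname', 'namespace']
  routes:
  - matchers:
    - severity = "critical"     # P1/P2-class alerts
    receiver: pagerduty-oncall

receivers:
- name: pagerduty-oncall
  pagerduty_configs:
  - routing_key: <pagerduty-integration-key>
- name: slack-alerts
  slack_configs:
  - api_url: <slack-webhook-url>
    channel: '#eks-alerts'
```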


10. Debugging Quick Reference

Error Pattern → Cause → Resolution Quick Reference Table

| # | Error Pattern | Cause | Resolution |
|---|---|---|---|
| 1 | CrashLoopBackOff | App crash, misconfiguration, unmet dependency | kubectl logs --previous, check app config/env vars |
| 2 | ImagePullBackOff | Image does not exist, registry auth failure | Verify image name/tag, configure imagePullSecrets |
| 3 | OOMKilled | Memory limits exceeded | Increase memory limits, check app for memory leaks |
| 4 | Pending (unschedulable) | Insufficient resources, nodeSelector mismatch | Check kubectl describe pod events, node capacity/labels |
| 5 | CreateContainerConfigError | ConfigMap/Secret missing | Verify that the referenced ConfigMap/Secret exists |
| 6 | Node NotReady | kubelet failure, resource pressure | Access node via SSM, systemctl status kubelet |
| 7 | FailedAttachVolume | EBS volume attached to another node | Delete the previous Pod, wait for volume detach (~6 min) |
| 8 | FailedMount | EFS mount target/SG misconfiguration | Verify mount target exists and TCP 2049 is allowed |
| 9 | NetworkNotReady | VPC CNI not started | kubectl logs -n kube-system -l k8s-app=aws-node |
| 10 | DNS resolution failed | CoreDNS failure | Check CoreDNS Pod status/logs, kubectl rollout restart |
| 11 | Unauthorized / 403 | Insufficient RBAC permissions, aws-auth misconfiguration | aws sts get-caller-identity, check aws-auth/Access Entry |
| 12 | connection refused | No Service Endpoints, port mismatch | kubectl get endpoints, verify selector and ports |
| 13 | Evicted | Node resource pressure (DiskPressure, etc.) | Clean up node disk, adjust Pod resource requests |
| 14 | FailedScheduling: Insufficient cpu/memory | Insufficient cluster capacity | Increase Karpenter NodePool limits, add nodes |
| 15 | Terminating (stuck) | Unfinished finalizer, delayed preStop hook | Check finalizers; if needed, --force --grace-period=0 |
| 16 | Back-off pulling image | Pull timeout on large images | Optimize image size, use an ECR registry in the same region |
| 17 | readiness probe failed | Slow app startup, broken health check endpoint | Add a startupProbe, adjust probe timeouts |
| 18 | Too many pods | Max Pods per node exceeded | Check the max-pods setting, enable Prefix Delegation |

Essential kubectl Command Cheat Sheet

Inspection and Diagnosis

# View all resource status at a glance
kubectl get all -n <namespace>

# Filter only unhealthy Pods
kubectl get pods --all-namespaces --field-selector=status.phase!=Running,status.phase!=Succeeded

# Detailed Pod information (including events)
kubectl describe pod <pod-name> -n <namespace>

# Namespace events (most recent first)
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

# Resource usage
kubectl top nodes
kubectl top pods -n <namespace> --sort-by=memory

Log Inspection

# Current container logs
kubectl logs <pod-name> -n <namespace>

# Previous (crashed) container logs
kubectl logs <pod-name> -n <namespace> --previous

# Specific container in a multi-container Pod
kubectl logs <pod-name> -n <namespace> -c <container-name>

# Real-time log streaming
kubectl logs -f <pod-name> -n <namespace>

# View logs from multiple Pods by label
kubectl logs -l app=<app-name> -n <namespace> --tail=50

Debugging

# Debug with an ephemeral container
kubectl debug <pod-name> -it --image=nicolaka/netshoot --target=<container-name>

# Node debugging
kubectl debug node/<node-name> -it --image=ubuntu

# Execute a command inside a Pod
kubectl exec -it <pod-name> -n <namespace> -- <command>

Deployment Management

# Rollout status/history/rollback
kubectl rollout status deployment/<name>
kubectl rollout history deployment/<name>
kubectl rollout undo deployment/<name>

# Restart Deployment
kubectl rollout restart deployment/<name>

# Node maintenance (drain)
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
kubectl uncordon <node-name>
Recommended Tools by Scenario

| Scenario | Tool | Description |
|---|---|---|
| Network debugging | netshoot | Container with a comprehensive set of network tools |
| Node resource visualization | eks-node-viewer | Terminal-based node resource monitoring |
| Container runtime debugging | crictl | containerd debugging CLI |
| Log analysis | CloudWatch Logs Insights | AWS-native log query service |
| Metric querying | Prometheus / Grafana | PromQL-based metric analysis |
| Distributed tracing | ADOT / OpenTelemetry | Request path tracing |
| Cluster security audit | kube-bench | CIS Benchmark-based security scanning |
| YAML manifest validation | kubeval / kubeconform | Pre-deployment manifest validation |
| Karpenter debugging | Karpenter controller logs | Node provisioning issue diagnosis |
| IAM debugging | AWS IAM Policy Simulator | IAM permission verification |

EKS Log Collector

EKS Log Collector is an AWS-provided script that automatically collects logs needed for debugging from EKS worker nodes and generates an archive file that can be submitted to AWS Support.

Installation and Execution:

# Download and run the script (on the node after SSM access)
curl -O https://raw.githubusercontent.com/awslabs/amazon-eks-ami/master/log-collector-script/linux/eks-log-collector.sh
sudo bash eks-log-collector.sh

Collected Items:

  • kubelet logs
  • containerd logs
  • iptables rules
  • CNI config (VPC CNI settings)
  • cloud-init logs
  • dmesg (kernel messages)
  • systemd units status

Output:

The collected logs are compressed and saved in the format /var/log/eks_i-xxxx_yyyy-mm-dd_HH-MM-SS.tar.gz.

S3 Upload:

# Upload collected logs directly to S3
sudo bash eks-log-collector.sh --upload s3://my-bucket/
Leveraging AWS Support

Attaching this log file when submitting an AWS Support case enables support engineers to quickly assess the node status, significantly reducing resolution time. Be sure to attach it when reporting node join failures, kubelet failures, network issues, etc.

References