EKS Auto Mode Debugging

EKS Auto Mode is an operational model in which AWS fully manages node provisioning, networking, and storage. While convenient, the reduced management surface also changes the debugging approach.

Auto Mode vs Standard Mode Differences

| Item | Standard Mode | Auto Mode | Debugging Impact |
|---|---|---|---|
| Node management | User (MNG/Karpenter) | AWS-managed (NodePool) | Check state via NodePool CRD; EC2 API access is limited |
| VPC CNI | Manual configuration/upgrade | Automatically managed | Custom CNI configuration unavailable; ENI debugging simplified |
| GPU Driver | GPU Operator installation | AWS-managed | Beware of Device Plugin conflicts (devicePlugin=false) |
| Storage | Separate EBS CSI installation | Built-in driver (gp3) | io2 Block Express constraints; EFS requires separate install |
| CoreDNS | Add-on management | Automatically managed | Custom CoreDNS configuration is restricted |
| Node SSH | Available (MNG/Karpenter) | Restricted (AWS Systems Manager) | kubectl debug node required |
| Auto Scaling | Karpenter/CA | NodePool auto-scaling | Spot interruption handling is automated |
| Network Policy | Calico/Cilium installable | VPC CNI Network Policy | Feature limitations exist |

NodePool Architecture

Auto Mode node lifecycle:

NodePool Debugging

Check NodePool Status

# List NodePools
kubectl get nodepools

# Example output
# NAME READY AGE
# default True 7d
# gpu-nodepool True 2d

# NodePool details
kubectl describe nodepool default

# Key items to check:
# - Conditions: Ready, CapacityAvailable
# - Instance Types: allowed instance types
# - Constraints: labels, taints, availability zones
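
When a cluster has many NodePools, the READY column can be filtered mechanically. The following is a minimal sketch, assuming the table layout shown in the example output above; `not_ready_nodepools` is a hypothetical helper name, not a kubectl feature, and it locates the READY column by its header so that extra columns do not break it.

```shell
# Print the names of NodePools whose READY column is not "True".
# Reads the table printed by `kubectl get nodepools` on stdin.
# Assumption: the header row contains a READY column and no cell is empty.
not_ready_nodepools() {
  awk '
    NR == 1 {
      # Find the READY column by header name instead of hard-coding its position.
      for (i = 1; i <= NF; i++) if ($i == "READY") col = i
      next
    }
    col && $col != "True" { print $1 }
  '
}
```

Usage: `kubectl get nodepools | not_ready_nodepools`. An empty result means every NodePool reports Ready.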

NodeClaim Lifecycle

# List NodeClaims (actual node requests)
kubectl get nodeclaims

# Example output
# NAME TYPE CAPACITY READY AGE
# default-abc123 t3.xlarge 4 True 2d
# default-def456 t3.xlarge 4 True 1d
# gpu-nodepool-xyz789 g5.2xlarge 8 True 6h

# NodeClaim details
kubectl describe nodeclaim <nodeclaim-name>

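# Key fields:
# - Lifecycle stage: Pending/Launched/Registered/Ready/Terminating
#   (inferred from the conditions and the deletion timestamp)
# - Conditions: Launched, Registered, Initialized, Ready, Drifted
# - Provider ID: contains the EC2 instance ID
# - Node Name: corresponding Kubernetes node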
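
To see which lifecycle stage a NodeClaim is stuck at, its conditions can be printed one per line. This is a sketch under the assumption that the installed NodeClaim CRD exposes Launched, Registered, Initialized, and Ready as status conditions (the stage names listed above); both function names are hypothetical.

```shell
# Print each status condition of a NodeClaim as TYPE=STATUS, one per line.
# Usage: nodeclaim_conditions <nodeclaim-name>
nodeclaim_conditions() {
  kubectl get nodeclaim "$1" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
}

# Print only the lifecycle conditions that are not "True".
# Other conditions (for example Drifted) are ignored because "False" is their healthy value.
blocking_conditions() {
  nodeclaim_conditions "$1" |
    awk -F= '($1 == "Launched" || $1 == "Registered" || $1 == "Initialized" || $1 == "Ready") && $2 != "True" { print $1 }'
}
```

Usage: `blocking_conditions default-abc123`. The first name printed is usually the stage to investigate with `kubectl describe nodeclaim`.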

NodeClaim State Transitions

Pending → Launched → Registered → Ready → Terminating

Instance Type Selection Failure

Symptom: Pod stays in Pending, NodeClaim is not created

# Check Pod events
kubectl describe pod <pod-name>

# Example error:
# Warning FailedScheduling No nodes available to schedule pod

# Check NodePool constraints
kubectl get nodepool <nodepool-name> -o yaml | grep -A 10 requirements

# Common causes:
# 1. Pod resource request exceeds every instance type in the NodePool
# 2. Availability zone constraint (insufficient capacity in specific AZs)
# 3. Spot capacity shortage (capacityType: spot)
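
Cause 1 can be checked by hand before touching the NodePool. The sketch below compares a Pod's CPU request (in whole vCPUs) with the vCPU counts of the t3 sizes used in this section. The lookup table is hard-coded for illustration, memory is ignored, and both function names are hypothetical.

```shell
# vCPU count for the instance types used in this section.
# Hard-coded for illustration; extend the table for other types.
vcpus_for() {
  case "$1" in
    t3.large)   echo 2 ;;
    t3.xlarge)  echo 4 ;;
    t3.2xlarge) echo 8 ;;
    *)          echo 0 ;;   # unknown type: treated as unable to fit anything
  esac
}

# Usage: fits_any <cpu-request-in-whole-vcpus> <instance-type>...
# Prints every listed type whose vCPU count is at least the request and
# returns non-zero when none fits.
# Allocatable CPU on a real node is lower than the vCPU count, so a match
# here is necessary but not sufficient.
fits_any() {
  request=$1
  shift
  found=1
  for itype in "$@"; do
    if [ "$(vcpus_for "$itype")" -ge "$request" ]; then
      echo "$itype"
      found=0
    fi
  done
  return "$found"
}
```

Usage: `fits_any 6 t3.large t3.xlarge` prints nothing and returns non-zero, which matches the symptom of a Pod that stays Pending without a NodeClaim being created.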

Resolution:

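# Modify NodePool: add a larger instance type
# (NodePool is a karpenter.sh/v1 resource; eks.amazonaws.com/v1 is the API group of NodeClass)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef: # required field; "default" is the built-in Auto Mode NodeClass
        group: eks.amazonaws.com
        kind: NodeClass
        name: default
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - t3.large
            - t3.xlarge
            - t3.2xlarge # ← added
        - key: karpenter.sh/capacity-type
          operator: In
          values:
            - spot
            - on-demand # ← On-Demand fallback when Spot is unavailable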

Storage Debugging

Auto Mode Storage Constraints

| Storage Type | Standard Mode | Auto Mode | Constraints |
|---|---|---|---|
| gp3 | Requires EBS CSI installation | Built-in support | Provided by default; no extra configuration |
| gp2 | Supported | Not supported | Must migrate to gp3 |
| io2 | Supported | Limited support | io2 Block Express not supported |
| EFS | Install EFS CSI | EFS CSI installation required | Not automatically supported |
| FSx for Lustre | Install FSx CSI | FSx CSI installation required | Not automatically supported |
| EBS encryption | Custom KMS key possible | Default EBS encryption | Custom KMS key constraints |

PVC Pending Debugging

# Check PVC status
kubectl get pvc

# Example output (issue)
# NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
# my-pvc Pending gp3 5m

# Check PVC events
kubectl describe pvc my-pvc

# Common errors:
# 1. "waiting for a volume to be created" → check storage driver
# 2. "failed to provision volume" → check IAM permissions
# 3. "io2-block-express is not supported" → switch to gp3
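
The error-to-action mapping above is small enough to encode directly. The following sketch mirrors the three messages from the list; `pvc_hint` is a hypothetical helper, and the patterns are not an exhaustive catalogue of provisioner errors.

```shell
# Map PVC event text to the first thing to check.
# The patterns and hints mirror the "Common errors" list above.
# When several patterns occur in the text, the first matching branch wins.
pvc_hint() {
  case "$1" in
    *"waiting for a volume to be created"*) echo "check storage driver" ;;
    *"failed to provision volume"*)         echo "check IAM permissions" ;;
    *"io2-block-express is not supported"*) echo "switch to gp3" ;;
    *)                                      echo "unrecognized: read the full event text" ;;
  esac
}
```

Usage: `pvc_hint "$(kubectl describe pvc my-pvc)"`.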

Check StorageClass

# List StorageClasses
kubectl get storageclass

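# Auto Mode StorageClass
# (the built-in driver uses the provisioner ebs.csi.eks.amazonaws.com, not ebs.csi.aws.com;
#  if no such StorageClass is listed, create one that references this provisioner)
# NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
# gp3 (default) ebs.csi.eks.amazonaws.com Delete WaitForFirstConsumer true 7d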

# io2 Block Express not supported (Auto Mode constraint)
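
A PVC that omits `storageClassName` depends on a default StorageClass being present. As a rough sketch, the default class and its provisioner can be extracted from the listing; `default_storageclass` is a hypothetical helper that relies on kubectl appending ` (default)` to the name in its table output.

```shell
# Print "<name> <provisioner>" for the default StorageClass.
# Reads `kubectl get storageclass` output on stdin. kubectl marks the default
# class with a separate "(default)" token, which shifts the provisioner to field 3.
# Exits non-zero when no default class is found.
default_storageclass() {
  awk '
    NR > 1 && $2 == "(default)" { print $1, $3; found = 1 }
    END { exit (found ? 0 : 1) }
  '
}
```

Usage: `kubectl get storageclass | default_storageclass || echo "no default StorageClass"`.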

Networking Debugging

VPC CNI Auto Management

In Auto Mode, VPC CNI cannot be configured directly:

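# Check VPC CNI version (auto-managed)
# Note: on Auto Mode nodes the CNI may run as a node-level service rather than the aws-node
# DaemonSet, in which case this query returns nothing on a cluster without MNG nodes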
kubectl get daemonset -n kube-system aws-node -o yaml | grep image:

# Custom CNI configuration attempts produce an error
# Auto Mode blocks modifications to the VPC CNI ConfigMap
kubectl edit configmap -n kube-system aws-node
# Error: Auto Mode managed resource cannot be modified

Constraints:

  • Supported: ENI auto-allocation, Security Group for Pods, IPv6
  • Not supported: Custom CIDR blocks, disabling Prefix Delegation, manual ENI management

Pod Networking Issues

# Check Pod IP allocation
kubectl get pods -o wide

# Check ENI allocation state (node level)
kubectl describe node <node-name> | grep -A 5 "Allocatable"

# Example output:
# Allocatable:
# vpc.amazonaws.com/pod-eni: 38 # ← ENI-based IP count

# Check Security Group for Pods
kubectl get securitygrouppolicies -A
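
Scanning the wide output by eye gets tedious in busy namespaces. A minimal sketch for listing only the Pods that have not been assigned an IP follows; `pods_without_ip` is a hypothetical helper, and it uses `custom-columns` because that output format prints `<none>` for a missing field.

```shell
# List Pods in a namespace that have no IP assigned yet.
# Usage: pods_without_ip [namespace]   (defaults to "default")
# custom-columns prints "<none>" for a missing field, which is what awk matches on.
pods_without_ip() {
  kubectl get pods -n "${1:-default}" --no-headers \
    -o custom-columns=NAME:.metadata.name,IP:.status.podIP,NODE:.spec.nodeName |
    awk '$2 == "<none>" { print $1 }'
}
```

A Pod listed here that is already bound to a node points at IP or ENI allocation on that node; one with no node yet is still a scheduling problem.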

CoreDNS Debugging

# CoreDNS Pod status
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=100

# DNS resolution test (--restart=Never runs a one-off Pod that --rm can clean up)
kubectl run -it --rm debug --image=busybox --restart=Never -- nslookup kubernetes.default

# Common issues:
# 1. CoreDNS Pod not Running → node resource shortage
# 2. DNS query timeout → verify Security Group allows UDP 53

GPU Workloads with Auto Mode

GPU Operator Conflict

Auto Mode manages the GPU driver on its own nodes. Installing the GPU Operator with its Device Plugin enabled therefore causes a Device Plugin conflict.

To run GPU workloads that need the GPU Operator, add a managed node group (MNG) alongside Auto Mode in a hybrid configuration:

GPU MNG Configuration

# ClusterPolicy: Device Plugin must be disabled
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  operator:
    defaultRuntime: containerd
  driver:
    enabled: true
  devicePlugin:
    enabled: false # ← prevent conflict with Auto Mode
  dcgm:
    enabled: true # metric collection still possible
  gfd:
    enabled: true # GPU Feature Discovery works
  nodeStatusExporter:
    enabled: true
---
# Add a Taint to MNG nodes (GPU workloads only)
# Equivalent command: kubectl taint nodes gpu-node-1 nvidia.com/gpu=true:NoSchedule
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1
spec:
  taints:
    - key: nvidia.com/gpu
      value: "true"
      effect: NoSchedule
---
# GPU Pods add Toleration
apiVersion: v1
kind: Pod
metadata:
  name: vllm-server
spec:
  tolerations:
    - key: nvidia.com/gpu
      operator: Equal
      value: "true"
      effect: NoSchedule
  containers:
    - name: vllm
      image: vllm/vllm-openai:latest
      resources:
        limits:
          nvidia.com/gpu: 4

For detailed GPU debugging, see GPU/AI Workload Debugging.

Auto Mode Constraint Summary

Supported Features

  • NodePool-based auto-scaling
  • Automatic Spot/On-Demand fallback
  • gp3 storage out of the box
  • Automatic VPC CNI management (including Security Group for Pods)
  • Karpenter-like Consolidation
  • Drift detection and automatic replacement
  • DCGM/GFD metrics (partial GPU Operator support)

Limitations

  • Custom VPC CNI configuration not available
  • GPU Device Plugin conflict (MNG hybrid required)
  • io2 Block Express not supported
  • EFS/FSx CSI require separate installation
  • Custom CoreDNS configuration restricted
  • Node SSH access restricted (use kubectl debug node; SSM sessions are available on MNG nodes only)
  • Direct EC2 instance management not available

Hybrid Configuration (Auto Mode + MNG)

When Is Hybrid Needed?

| Scenario | Auto Mode Only | Hybrid (Auto Mode + MNG) |
|---|---|---|
| General web/API servers | Sufficient | Not needed |
| GPU inference/training | Many constraints | Required (GPU Operator) |
| High-performance storage (io2 BE) | Not supported | Available on MNG |
| Custom VPC CNI | Not supported | Available on MNG |
| Use of a specific AMI | Limited | MNG Launch Template |

Hybrid Configuration Example

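# 1. Create Auto Mode cluster
# (Auto Mode requires compute, load balancing, and block storage to be enabled together)
aws eks create-cluster \
  --name hybrid-cluster \
  --role-arn <cluster-role-arn> \
  --resources-vpc-config subnetIds=<subnet-ids> \
  --access-config authenticationMode=API \
  --compute-config '{"enabled":true,"nodePools":["general-purpose","system"],"nodeRoleArn":"<node-role-arn>"}' \
  --kubernetes-network-config '{"elasticLoadBalancing":{"enabled":true}}' \
  --storage-config '{"blockStorage":{"enabled":true}}'

# 2. Add GPU MNG
# (--taints uses the key/value/effect shorthand; the effect is spelled NO_SCHEDULE in the EKS API)
aws eks create-nodegroup \
  --cluster-name hybrid-cluster \
  --nodegroup-name gpu-nodes \
  --node-role <node-role-arn> \
  --subnets <subnet-ids> \
  --instance-types g5.2xlarge g5.4xlarge \
  --scaling-config minSize=0,maxSize=10,desiredSize=2 \
  --labels workload=gpu \
  --taints key=nvidia.com/gpu,value=true,effect=NO_SCHEDULE

# 3. Install GPU Operator (targeting MNG nodes)
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set operator.defaultRuntime=containerd \
  --set driver.enabled=true \
  --set devicePlugin.enabled=false # ← key setting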

Diagnostic Command Collection

# === NodePool ===
# NodePool status
kubectl get nodepools -o wide
kubectl describe nodepool <nodepool-name>

# NodeClaim status
kubectl get nodeclaims -o wide
kubectl describe nodeclaim <nodeclaim-name>

# NodeClaim to Node mapping
kubectl get nodeclaims -o json | jq -r '.items[] | "\(.metadata.name) → \(.status.nodeName)"'

# === Storage ===
# PVC status
kubectl get pvc -A
kubectl describe pvc <pvc-name>

# Check StorageClasses
kubectl get storageclass

# Check EBS volumes (AWS CLI)
aws ec2 describe-volumes --filters "Name=tag:kubernetes.io/cluster/<cluster-name>,Values=owned"

# === Networking ===
# VPC CNI version
kubectl get daemonset -n kube-system aws-node -o yaml | grep image:

# Pod IP allocation
kubectl get pods -A -o wide

# CoreDNS status
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50

# DNS test (--restart=Never runs a one-off Pod that --rm can clean up)
kubectl run -it --rm debug --image=busybox --restart=Never -- nslookup kubernetes.default

# === GPU (hybrid configuration) ===
# GPU Operator status (MNG nodes only)
kubectl get clusterpolicy -A
kubectl get pods -n gpu-operator

# Check GPU resources
kubectl get nodes -o json | jq -r '.items[] | select(.status.allocatable."nvidia.com/gpu" != null) | "\(.metadata.name): \(.status.allocatable."nvidia.com/gpu") GPUs"'

# === Node debugging ===
# Run an interactive debug Pod on the node
kubectl debug node/<node-name> -it --image=ubuntu

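# Connect to a node via Systems Manager instead of SSH
# (MNG nodes in a hybrid cluster; Auto Mode nodes do not allow SSH or SSM sessions, so use kubectl debug node for them)
aws ssm start-session --target <instance-id>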
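
When a problem has to be handed to someone else, it helps to capture the read-only commands above in one pass. The following is a sketch only: `snapshot` and `collect_diagnostics` are hypothetical helper names, and the file names are arbitrary.

```shell
# Run one kubectl command and store its output (stdout and stderr) in a file.
# $1 = output file, remaining arguments = kubectl arguments.
# A failing command (for example a CRD that is not installed) is noted in the
# file and does not stop the run.
snapshot() {
  file=$1
  shift
  kubectl "$@" > "$file" 2>&1 || echo "command failed: kubectl $*" >> "$file"
}

# Usage: collect_diagnostics <output-dir>
# Captures the NodePool, NodeClaim, storage, and Pod listings from this section.
collect_diagnostics() {
  outdir=${1:-}
  if [ -z "$outdir" ]; then
    echo "usage: collect_diagnostics <output-dir>" >&2
    return 1
  fi
  mkdir -p "$outdir" || return 1
  snapshot "$outdir/nodepools.txt"    get nodepools -o wide
  snapshot "$outdir/nodeclaims.txt"   get nodeclaims -o wide
  snapshot "$outdir/pvc.txt"          get pvc -A
  snapshot "$outdir/storageclass.txt" get storageclass
  snapshot "$outdir/pods.txt"         get pods -A -o wide
  echo "wrote diagnostics to $outdir"
}
```

Usage: `collect_diagnostics ./eks-diag-$(date +%Y%m%d-%H%M%S)`. Only `get` commands are issued, so the helper does not modify the cluster.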

Checklist by Problem

Pod Stuck in Pending (NodeClaim Not Created)

  • Do any NodePool instance types satisfy the Pod's resource request?
  • Are there availability zone constraints on the NodePool?
  • Spot capacity shortage? (add On-Demand fallback)
  • Do NodePool labels/taints match the Pod?

PVC Stuck in Pending

  • Is the StorageClass gp3? (io2 Block Express not supported)
  • Is the PVC size within allowed range?
  • Are IAM permissions correct? (EBS creation)
  • Is EBS capacity sufficient in the availability zone?

GPU Workload Scheduling Failure

  • Was an MNG added? (Auto Mode alone has GPU constraints)
  • Is devicePlugin.enabled=false set in the GPU Operator?
  • Does the MNG node have a Taint and does the Pod have a matching Toleration?
  • Is the Pod's nvidia.com/gpu resource request correct?

VPC CNI Configuration Not Possible

  • Auto Mode manages VPC CNI automatically (no custom configuration allowed)
  • If a specific CNI configuration is required, add an MNG
  • Security Group for Pods is supported

References