
EKS GPU Node Strategy

1. Overview

When operating GPU workloads on EKS, node type selection directly impacts operational complexity, cost, and available features. GPU inference and training workloads have requirements that general container workloads do not:

  • Driver dependencies: NVIDIA GPU drivers, Container Toolkit, Device Plugin
  • Advanced features: MIG (Multi-Instance GPU), Time-Slicing, Fractional GPU
  • Monitoring: DCGM (Data Center GPU Manager)-based metrics
  • Scheduling: Topology-Aware Placement, Gang Scheduling

AWS EKS provides 4 node types for GPU workloads:

| Node Type | Description |
| --- | --- |
| EKS Auto Mode | AWS fully manages the entire node lifecycle (GPU drivers pre-installed) |
| Karpenter | Auto-scaling + Custom AMI, MIG, full user customization |
| Managed Node Group | AWS-managed node groups, only option supporting DRA (Dynamic Resource Allocation) |
| Hybrid Node | Connect on-premises GPU servers to the EKS cluster |
Core Principle

You can operate multiple node types simultaneously in a single EKS cluster. Configure the optimal node combination matching your workload characteristics.

Scope of This Document

This document focuses on node type selection and hybrid architecture design. Detailed NVIDIA software stack (GPU Operator/DCGM/Dynamo), GPU autoscaling, llm-d distributed inference, and security/troubleshooting are covered in their respective specialized documents (see Section 7 Related Documents).


2. Node Type Comparison

2.1 Feature Comparison Table

| Feature | Auto Mode | Karpenter | Managed Node Group | Hybrid Node |
| --- | --- | --- | --- | --- |
| Management Owner | AWS fully managed | Self-managed | AWS managed | On-premises |
| Auto-scaling | Automatic (AWS controlled) | Automatic (NodePool-based) | Manual/Limited | Manual |
| Custom AMI | Not available | Available | Available | Available |
| SSH Access | Not available | Available | Available | Available |
| GPU Driver | Pre-installed (AWS) | User-installed | User-installed | User-installed |
| GPU Operator | Available (Device Plugin label disabled) | Available | Available | Available |
| Root Filesystem | Read-only | Read-write | Read-write | Read-write |
| MIG Support | Not available (NodeClass read-only) | Available | Available | Available |
| DRA Compatible | Not available (internal Karpenter-based) | Not available (#1231) | Available (recommended) | Available |
| DCGM Exporter | Install via GPU Operator | Included in GPU Operator | Manual installation | Included in GPU Operator |
| Run:ai Compatible | Available (Device Plugin disabled) | Available | Available | Available |
| Cost | Low (no management needed) | Medium | Medium | Low (CapEx) |
| Suitable Workloads | Simple inference | Advanced GPU features | DRA workloads | On-premises integration |

2.2 Selection Guide: When to Use Which Node

Choose Auto Mode when:

  • You want to quickly start inference services without GPU driver management burden
  • Serving large models (70B+) that don't require MIG or Fractional GPU
  • System/non-GPU workloads (API Gateway, Agent, Observability)

Choose Karpenter when:

  • You need flexible control over MIG partitioning, Custom AMI, Spot Instances
  • Using projects dependent on GPU Operator ClusterPolicy (Run:ai, KAI Scheduler)
  • Optimizing GPU utilization for small/medium models (MIG partitioning)

Choose Managed Node Group when:

  • DRA (Dynamic Resource Allocation)-based GPU management is required
  • Using DRA-exclusive instances like P6e-GB200 UltraServer

Choose Hybrid Node when:

  • Integrating existing on-premises GPU server assets into EKS
  • Data residency requirements

3. EKS Auto Mode GPU Support and Limitations

3.1 GPU Stack Auto-Provided by Auto Mode

EKS Auto Mode pre-installs the following on GPU instances:

  1. NVIDIA GPU Driver - AWS-managed version, /dev/nvidia* devices auto-created
  2. NVIDIA Container Toolkit - containerd plugin auto-configured
  3. NVIDIA Device Plugin - nvidia.com/gpu resource auto-registered
  4. GPU Resource Registration - Pods can immediately request nvidia.com/gpu: 1
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  containers:
    - name: cuda-test
      image: nvidia/cuda:12.2.0-runtime-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```
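A quick way to confirm the pre-installed stack is working, assuming the manifest above is saved as gpu-test.yaml (file name is an assumption):

```bash
kubectl apply -f gpu-test.yaml   # Pod manifest from above
kubectl get pod gpu-test -w      # wait while Auto Mode provisions a GPU node and starts the Pod
kubectl logs gpu-test            # should print the nvidia-smi device table
```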

3.2 Installing GPU Operator on Auto Mode: Device Plugin Disable Pattern

GPU Operator can be installed on Auto Mode. The key is to disable only the Device Plugin via node labels while keeping other components (DCGM Exporter, NFD, GFD) running normally. This pattern was validated in awslabs/ai-on-eks PR #288.

Why is GPU Operator needed? Several projects including KAI Scheduler and Run:ai depend on GPU Operator's ClusterPolicy CRD. Without ClusterPolicy, these projects cannot even start. This is the core reason for installing GPU Operator on Auto Mode.

For complete GPU Operator architecture and component details, see NVIDIA GPU Stack.

```
ClusterPolicy CRD (GPU Operator)
    ↓ depends on
KAI Scheduler (GPU-aware Pod placement)
Run:ai (Fractional GPU, Gang Scheduling)
    ↓ reads
DCGM Exporter (GPU metrics)
NFD/GFD (Hardware labels)
```

| GPU Operator Component | Auto Mode Setting | Reason |
| --- | --- | --- |
| Driver | enabled: false | Pre-installed in AMI |
| Container Toolkit | enabled: false | Pre-installed in AMI |
| Device Plugin | Disabled via label | AWS manages its own Device Plugin |
| DCGM Exporter | enabled: true | GPU metrics collection |
| NFD / GFD | enabled: true | Hardware feature detection and GPU attribute labeling |

NodePool label configuration to disable Device Plugin:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-auto-mode
spec:
  template:
    metadata:
      labels:
        nvidia.com/gpu.deploy.device-plugin: "false"
    spec:
      requirements:
        - key: eks.amazonaws.com/instance-family
          operator: In
          values: ["p5", "g6e", "g5"]
      nodeClassRef:
        group: eks.amazonaws.com
        kind: NodeClass
        name: default
```

Helm Values (for Auto Mode):

```yaml
driver:
  enabled: false
toolkit:
  enabled: false
devicePlugin:
  enabled: true        # Globally enabled, selectively disabled via node labels
dcgmExporter:
  enabled: true
  serviceMonitor:
    enabled: true
nfd:
  enabled: true
gfd:
  enabled: true
```
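A minimal install sketch using the values above, assuming they are saved as values-auto-mode.yaml (file name and the gpu-operator namespace are assumptions; chart and repository names are NVIDIA's published defaults):

```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  -f values-auto-mode.yaml
```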
Actual Auto Mode Limitations

While GPU Operator installation is possible, the NodeClass is read-only, so the following are not available:

  • MIG Partitioning: Cannot configure MIG profiles in NodeClass
  • Custom AMI: Cannot pin specific driver versions
  • SSH/SSM Access: Cannot directly debug nodes

If MIG-based GPU partitioning is needed, switch to Karpenter + GPU Operator.

3.3 Large GPU Instance Support Status (Verified 2026.04)

Auto Mode large-GPU instance support was confirmed during a GLM-5 (744B MoE) deployment: p5.48xlarge Spot provisioning succeeded, but p5en/p6 currently have limitations.

Detailed Support Status: See EKS Auto Mode GPU Instance Support Status

3.4 Auto Mode + MNG Hybrid Limitation

The hybrid pattern of adding MNG to an Auto Mode cluster for p5en/p6 usage is currently not possible:

  • MNG creation stalls in CREATING state for 30+ minutes
  • CloudFormation stack Resources field remains null
  • Auto Mode's managed compute layer conflicts internally with MNG's ASG-based management

Conclusion: For large GPUs (H200+, B200), use EKS Standard Mode + Karpenter + MNG.

3.5 Device Plugin Conflict Resolution

Installing GPU Operator with devicePlugin.enabled=true on Auto Mode nodes conflicts with the built-in Device Plugin.

```bash
kubectl describe node <gpu-node> | grep nvidia.com/gpu
# Allocatable: nvidia.com/gpu: 0 (expected: 8)
```

Solution: Add nvidia.com/gpu.deploy.device-plugin: "false" label to NodePool (see Section 3.2)
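To confirm the fix took effect (node name is a placeholder), check that the label propagated and that the allocatable GPU count is restored:

```bash
kubectl get nodes -l nvidia.com/gpu.deploy.device-plugin=false
kubectl describe node <gpu-node> | grep "nvidia.com/gpu"
# Allocatable should again report the full GPU count (e.g. 8) once only one Device Plugin is active
```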

3.6 Node Force Termination Not Available

EC2 instances managed by Auto Mode block direct ec2:TerminateInstances calls. To recover an abnormal node:

  1. Delete workload: kubectl delete pod <gpu-pod>
  2. Delete NodeClaim: kubectl delete nodeclaim <nodeclaim-name>
  3. Karpenter detects empty node and auto-terminates (5-10 min)
  4. New NodeClaim creation starts a healthy node
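The same recovery sequence as kubectl commands; Pod, namespace, and NodeClaim names are illustrative:

```bash
kubectl delete pod <gpu-pod> -n <namespace>     # 1. remove the workload from the bad node
kubectl get nodeclaim                           #    find the NodeClaim backing that node
kubectl delete nodeclaim <nodeclaim-name>       # 2. delete the NodeClaim
kubectl get nodeclaim -w                        # 3-4. watch the node drain and its replacement come up
```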

3.7 How to Verify Auto Mode Instance Support

You can pre-verify specific instance type support with a NodePool dry-run:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-test-dryrun
spec:
  template:
    spec:
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["p5en.48xlarge"]
      nodeClassRef:
        group: eks.amazonaws.com
        kind: NodeClass
        name: default
  limits:
    nvidia.com/gpu: "8"
```

If NoCompatibleInstanceTypes appears in the NodeClaim events (kubectl get nodeclaim / kubectl describe nodeclaim) after the dry-run, that instance type is not supported in Auto Mode.
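One way to run the check end to end, assuming the NodePool above is saved as gpu-test-dryrun.yaml (file name is an assumption):

```bash
kubectl apply -f gpu-test-dryrun.yaml
# Leave a pending Pod that requests nvidia.com/gpu against this NodePool, then inspect the NodeClaims:
kubectl get nodeclaim
kubectl describe nodeclaim | grep -i "NoCompatibleInstanceTypes"
```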


4. Karpenter GPU NodePool Configuration

4.1 Why Karpenter

Karpenter offers the best balance: it retains Auto Mode's auto-scaling advantages while allowing full use of GPU Operator.

| Feature | Auto Mode | Karpenter |
| --- | --- | --- |
| Auto-scaling | Automatic (AWS controlled) | Automatic (NodePool-based) |
| GPU Operator | Available (Device Plugin disabled) | Fully available |
| Custom AMI | Not available | Available |
| MIG Support | Not available | Available |
| Spot Instances | Limited | Fully supported |
| Node Replacement Speed | Fast | Very fast |

4.2 Inference Workload NodePool

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-inference
spec:
  template:
    metadata:
      labels:
        node-type: gpu-inference
        gpu-operator: enabled
    spec:
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - p5.48xlarge    # H100 x8 (640GB HBM3)
            - g6e.12xlarge   # L40S x4 (192GB GDDR6)
            - g5.12xlarge    # A10G x4 (96GB GDDR6)
        - key: karpenter.sh/capacity-type
          operator: In
          values: [on-demand]
        - key: topology.kubernetes.io/zone
          operator: In
          values: [us-west-2a, us-west-2b, us-west-2c]
      taints:
        - key: nvidia.com/gpu
          effect: NoSchedule
          value: "true"
      kubelet:
        maxPods: 110
        evictionHard:
          memory.available: "10Gi"
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 5m
  limits:
    cpu: "1000"
    memory: "4000Gi"
    nvidia.com/gpu: "32"
```

4.3 Training Workload NodePool (Spot + On-Demand fallback)

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-training
spec:
  template:
    metadata:
      labels:
        node-type: gpu-training
        gpu-operator: enabled
    spec:
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - p5.48xlarge   # H100 x8
        - key: karpenter.sh/capacity-type
          operator: In
          values: [spot, on-demand]   # Spot first, On-Demand fallback
      taints:
        - key: workload
          effect: NoSchedule
          value: "training"
      kubelet:
        maxPods: 50
        evictionHard:
          memory.available: "20Gi"
  disruption:
    consolidationPolicy: WhenUnderutilized
    consolidateAfter: 30m   # Prevent training interruption
  limits:
    nvidia.com/gpu: "64"
```

4.4 EC2NodeClass Configuration

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-inference
spec:
  amiSelectorTerms:
    - alias: al2023
  role: KarpenterNodeRole-eks-genai-cluster
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: eks-genai-cluster
        subnet-type: private
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: eks-genai-cluster
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 200Gi
        volumeType: gp3
        iops: 16000
        throughput: 1000
        encrypted: true
        deleteOnTermination: true
  metadataOptions:
    httpEndpoint: enabled
    httpPutResponseHopLimit: 2
    httpTokens: required   # IMDSv2
  tags:
    Environment: production
    ManagedBy: karpenter
```
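Applying the NodePool/EC2NodeClass pair and confirming Karpenter registered them (file names are assumptions; resource names match the manifests above):

```bash
kubectl apply -f ec2nodeclass-gpu-inference.yaml -f nodepool-gpu-inference.yaml
kubectl get ec2nodeclass gpu-inference
kubectl get nodepool gpu-inference
```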

4.5 GPU Operator Helm Values (for Karpenter Nodes)

```yaml
# helm install gpu-operator nvidia/gpu-operator -f values.yaml
driver:
  enabled: false   # AL2023: Pre-installed in AMI

toolkit:
  enabled: false   # AL2023: Pre-installed in AMI

devicePlugin:
  enabled: true
  nodeSelector:
    gpu-operator: enabled
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule

migManager:
  enabled: true
  nodeSelector:
    gpu-operator: enabled
  config:
    name: mig-parted-config
    default: "all-balanced"

dcgmExporter:
  enabled: true
  serviceMonitor:
    enabled: true
    interval: 15s
  nodeSelector:
    gpu-operator: enabled

nfd:
  enabled: true

gfd:
  enabled: true
  nodeSelector:
    gpu-operator: enabled

operator:
  nodeSelector:
    node-type: gpu-inference   # Karpenter NodePool label
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  defaultRuntime: containerd
```

Key Configuration Points:

  • nodeSelector: gpu-operator: enabled -- Excludes Auto Mode nodes
  • driver/toolkit: false -- Pre-installed in AL2023 AMI
  • migManager: true -- Enables MIG functionality on Karpenter nodes
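A quick check that these selectors behave as intended; the gpu-operator namespace is an assumption and depends on how the chart was installed:

```bash
kubectl get nodes -l gpu-operator=enabled    # Karpenter GPU nodes only
kubectl get pods -n gpu-operator -o wide     # operands should land only on those nodes
```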

4.6 GPU Topology-Based Scheduling

In distributed training, placing GPUs connected via NVLink on the same node is critical for performance:

```yaml
# GPU topology hints in Pod configuration
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: pytorch-ddp
      resources:
        limits:
          nvidia.com/gpu: 4
  # Place GPUs within the same NVLink domain
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: distributed-training
```

4.7 Spot Price Comparison (us-east-2, 2026.04)

| Instance | On-Demand | Spot (Lowest) | VRAM | Savings |
| --- | --- | --- | --- | --- |
| p5.48xlarge | $98/hr | $12.5/hr | 640GB | 87% |
| p5en.48xlarge | ~$120/hr | $12.1/hr | 1,128GB | 90% |
| p6-b200.48xlarge | $180/hr | $11.4/hr | 1,536GB | 94% |
Spot Usage Recommendation

Large GPU instances can achieve 85-94% cost savings with Spot. Actively use Spot for PoC/demo environments, and set consolidationPolicy: WhenEmpty to prevent unnecessary disruption. Prices are approximate; verify real-time pricing at AWS Spot Pricing.


5. Hybrid Architecture Design

5.1 3-Node Type Coexistence Architecture

Operate Auto Mode + Karpenter + Hybrid Node simultaneously in a single EKS cluster.

5.2 Per-Workload Node Placement Strategy

| Workload Type | Node Type | GPU Operator | Reason |
| --- | --- | --- | --- |
| System Components | Auto Mode | Not needed | No management needed, cost minimization |
| API Gateway / Agent | Auto Mode | Not needed | CPU workloads |
| Simple GPU Inference (70B+) | Auto Mode | Optional (needed for DCGM) | MIG not needed, fast scaling |
| MIG-Based Inference | Karpenter | Required | MIG Manager needed |
| Fractional GPU | Karpenter | Required | Run:ai needed |
| Model Training | Karpenter | Required | Gang Scheduling, Spot |
| DRA Workloads | Managed Node Group | Required | Not supported on Karpenter/Auto Mode |
| On-Premises GPU | Hybrid Node | Required | No AWS-managed GPU stack |
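A minimal sketch of how a workload is routed to the intended node type, using the node-type label and nvidia.com/gpu taint defined in the Karpenter NodePools of Section 4 (Deployment name and image are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-inference
  template:
    metadata:
      labels:
        app: vllm-inference
    spec:
      nodeSelector:
        node-type: gpu-inference          # Karpenter gpu-inference NodePool label
      tolerations:
        - key: nvidia.com/gpu             # taint applied by the gpu-inference NodePool
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest  # illustrative image
          resources:
            limits:
              nvidia.com/gpu: 1
```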

5.3 MNG Hybrid for DRA Workloads

DRA (Dynamic Resource Allocation) was promoted to GA in K8s 1.34, providing advanced GPU management beyond Device Plugin including fine-grained GPU memory allocation and NVLink topology-aware scheduling. However, DRA cannot be used with Karpenter and Auto Mode.

DRA + Karpenter/Auto Mode Incompatibility

Karpenter skips node provisioning when it detects spec.resourceClaims in a Pod (PR #2384). Karpenter simulates Pod requirements to calculate the optimal instance type, but DRA's ResourceSlice is only published by the DRA driver after a node exists, making pre-node-creation simulation impossible (a chicken-and-egg problem).

The only officially supported node management method for DRA workloads is Managed Node Group + Cluster Autoscaler.
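For orientation, a minimal DRA request sketch as it would run on an MNG node. It assumes the NVIDIA DRA driver is installed and publishes a gpu.nvidia.com DeviceClass, and uses the resource.k8s.io/v1 API that went GA in Kubernetes 1.34; verify the exact schema against your cluster with kubectl explain resourceclaim.spec:

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: single-h100
spec:
  devices:
    requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.nvidia.com   # DeviceClass published by the NVIDIA DRA driver (assumed)
          count: 1
---
apiVersion: v1
kind: Pod
metadata:
  name: dra-gpu-test
spec:
  resourceClaims:
    - name: gpu
      resourceClaimName: single-h100
  containers:
    - name: cuda
      image: nvidia/cuda:12.2.0-runtime-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        claims:
          - name: gpu
```

Because the Pod carries spec.resourceClaims, Karpenter will not provision a node for it; it has to land on an existing Managed Node Group node.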

| Workload | Node Type | GPU Allocation Method | Scaling |
| --- | --- | --- | --- |
| DRA Workloads (llm-d, P6e-GB200) | Managed Node Group | ResourceClaim (DRA) | Cluster Autoscaler |
| Standard GPU Inference (vLLM standalone) | Karpenter / Auto Mode | nvidia.com/gpu (Device Plugin) | Karpenter |
| Non-GPU Workloads | Karpenter / Auto Mode | - | Karpenter |

For detailed DRA scale-out strategies, see GPU Resource Management.

5.4 Node Selection by Model Size

| Model Size | Example | Recommended Node | Reason |
| --- | --- | --- | --- |
| 70B+ | Qwen3-72B, Llama-3-70B | Auto Mode + llm-d | Uses nearly the full GPU, management convenience |
| 30B-65B | Qwen3-32B | Auto Mode or Karpenter | 50%+ GPU usage, choose based on situation |
| 13B-30B | Llama-3-13B | Karpenter + MIG 2-way split | GPU utilization improvement needed |
| 7B and below | Llama-3-8B, Mistral-7B | Karpenter + MIG 4-7 way split | Severe GPU waste, MIG essential |
| Multi-Model | Multiple models simultaneously | Karpenter + MIG | Separate MIG partitions per model |
| Dev/Test | Model agnostic | Auto Mode | Quick start |

5.5 Cost Impact by Model Size

Based on p5.48xlarge (H100 x8), monthly cost approximately $98,000:

| Configuration | 7B Model Instances | GPUs Used | GPU Utilization | Effective Cost per Instance |
| --- | --- | --- | --- | --- |
| Auto Mode (full GPU allocation) | 8 | 8 GPUs | ~25% | $12,250 |
| Karpenter + MIG (4-way split) | 8 | 2 GPUs | ~80% | $3,063 |
| Savings | Same | 75% reduction | 3.2x improvement | 75% reduction |
Model Size and Cost Efficiency

The smaller the model parameters, the greater the GPU waste on Auto Mode. Running a 7B model on H100 leaves 80% of GPU memory idle, which is a direct cost waste. MIG partitioning is essential for small/medium models.
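As an illustration of the MIG approach, a hedged Pod sketch requesting a single MIG slice for a 7B-class model. The exact extended resource name depends on the MIG profile the MIG Manager applies (1g.20gb is shown as an example H100 profile), and the image is illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mistral-7b-mig
spec:
  nodeSelector:
    node-type: gpu-inference            # Karpenter NodePool label from Section 4.2
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: vllm
      image: vllm/vllm-openai:latest    # illustrative image
      resources:
        limits:
          nvidia.com/mig-1g.20gb: 1     # one MIG slice instead of a full H100
```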

5.6 Optimal Configuration for Current Timeframe (2026.04)

For most LLM serving environments, DRA is not yet essential. Device Plugin + MIG combination can sufficiently cover GPU partitioning and topology placement, and Karpenter's fast scale-out is more favorable for LLM serving SLOs than MNG + Cluster Autoscaler.

| Criteria | Karpenter + Device Plugin | MNG + DRA |
| --- | --- | --- |
| Scale-out Speed | Fast (Karpenter) | Slow (Cluster Autoscaler) |
| GPU Partitioning | MIG supported (GPU Operator) | DRA native |
| Operational Complexity | Single stack | MNG + Karpenter mixed |
| K8s Version | 1.32+ | 1.34+ (DRA GA) |
| Ecosystem Maturity | Production-proven | Early stage |

Small Scale (< 32 GPUs)

Configuration: Auto Mode + Karpenter (GPU dedicated)
- Auto Mode: General workloads
- Karpenter: GPU inference (Device Plugin)
- GPU Operator: DCGM monitoring
Cost: $5,000 - $15,000/month

Medium Scale (32 - 128 GPUs)

Configuration: Karpenter + GPU Operator + KEDA
- Karpenter NodePool: Separate Prefill / Decode / Small models
- GPU Operator: MIG, DCGM, NFD/GFD
- KEDA: KV Cache / TTFT-based Pod scaling
Cost: $15,000 - $80,000/month

Large Scale (> 128 GPUs)

Configuration: Karpenter + GPU Operator + Run:ai + Hybrid Node
- Karpenter: GPU Operator + Run:ai
- Hybrid Node: On-premises GPU farm integration
- When adopting P6e-GB200: Add MNG + DRA
Cost: $80,000 - $500,000/month (cloud) + Capex (on-premises)

5.8 DRA Transition Timing

| Condition | Transition Required |
| --- | --- |
| P6e-GB200 UltraServer Adoption | Required (Device Plugin not supported) |
| Multi-Node NVLink / IMEX Needed | Required (ComputeDomain is DRA-exclusive) |
| CEL-Based Fine-Grained GPU Attribute Selection | Recommended |
| GPU Sharing (MPS) | Recommended |
| Karpenter DRA Support GA | Optimal transition timing (MNG not needed) |
Transition Strategy

Now: Karpenter + GPU Operator (Device Plugin + MIG) -- Fastest and most operationally viable production configuration

When Adopting P6e-GB200: MNG (DRA, GPU) + Karpenter (non-GPU) hybrid

After Karpenter DRA GA: Karpenter + DRA integration -- Final target configuration


6. AWS Accelerator Selection Guide (NVIDIA vs Neuron)

EKS GPU node strategies have traditionally been designed around NVIDIA GPUs (p/g series), but as of 2026, Trainium2/Inferentia2-based AWS custom accelerators have matured into production alternatives. Neuron stack details are covered in AWS Neuron Stack; this section only summarizes the selection criteria relevant to node strategy planning.

6.1 NVIDIA GPU vs AWS Neuron Decision Matrix

| Criteria | NVIDIA GPU (p5/p5en/p6/g6e) | AWS Neuron (trn2/inf2) |
| --- | --- | --- |
| Model Ecosystem Recency | Immediate support (new models Day-1) | AWS porting cycle delay (weeks to months) |
| Long-Term TCO | Higher (H100/H200/B200 Spot still expensive) | Favorable cost per token (per AWS data) |
| Capacity Availability | Tight depending on region/timing | Relatively easier to secure |
| Custom CUDA Kernels | Full support | Not supported (NEFF compilation required) |
| Quantization Formats | AWQ/GPTQ/GGUF extensive | BF16/FP16/FP8; AWQ/GPTQ limited |
| Observability Ecosystem | GPU Operator + DCGM mature | neuron-monitor + OSS exporter |
| Open-Source Serving | vLLM, SGLang, TRT-LLM rich | NxD Inference / vLLM Neuron / TGI Neuron |
| Bedrock Continuity | Unrelated | Same path as Bedrock internal stack |
| Hybrid (On-Premises) | Possible with Hybrid Node | EC2 only (on-premises not available) |

6.2 Selection Flow

  • Frontier (Latest Models) Layer: NVIDIA GPU (p5en/p6) — Rapid adoption of new models
  • Volume (High-Frequency Inference) Layer: Neuron (trn2/inf2) — Low-cost serving of stable models at scale
  • Edge/On-Premises: Hybrid Node + NVIDIA GPU — Neuron is EC2-only

For detailed Neuron SDK, Device Plugin, Karpenter NodePool, and inference framework selection (NxD Inference / vLLM Neuron / TGI Neuron), see AWS Neuron Stack.


7. Node Strategy Decision Flowchart

Decision Summary Table

| Question | Answer | Recommended Node Type | GPU Operator |
| --- | --- | --- | --- |
| GPU not needed | - | Auto Mode | Not needed |
| Simple GPU inference (no MIG) | - | Auto Mode GPU | Optional |
| MIG needed | - | Karpenter | Required |
| DRA needed | - | Managed Node Group | Required |
| Fractional GPU / Run:ai | - | Karpenter | Required |
| On-premises GPU | - | Hybrid Node | Required |
| Cost minimization (Spot acceptable) | - | Karpenter Spot | Required |
| Large-scale training (Gang Scheduling) | - | Karpenter + Run:ai | Required |
| P6e-GB200 | DRA required | Managed Node Group | Required |

GPU Stack and Monitoring

For detailed NVIDIA GPU software stack including GPU Operator, DCGM, MIG, Time-Slicing, KAI Scheduler, and Dynamo, see the dedicated document.

  • NVIDIA GPU Stack - GPU Operator, DCGM Exporter, MIG Manager, Dynamo, KAI Scheduler

GPU Resource Management

For GPU autoscaling strategies based on Karpenter, KEDA, and DRA, see:

Inference Engines

Hybrid Infrastructure

For EKS Hybrid Node registration of on-premises GPU servers, VPN/Direct Connect configuration, and GPU Operator installation, see:

Deployment and Security

For production deployment YAML, security policies (Pod Security Standards, NetworkPolicy, IAM), and troubleshooting guides for GPU workloads, see Reference Architecture.

Platform Architecture