GPU Resource Management
GPU resource management strategies in EKS environments are organized around three axes.
| Axis | Key Question | Core Technologies |
|---|---|---|
| Provisioning | Which GPU nodes to create and when? | Karpenter, EKS Auto Mode, Managed Node Group |
| Scheduling | Which node to place GPU Pods on? | Device Plugin, DRA, Topology-Aware Routing |
| Scaling | How to respond to traffic changes? | KEDA, HPA, Cluster Autoscaler |
This document covers the architecture and design decision criteria for each axis. For GPU Operator details (ClusterPolicy, DCGM, MIG, Time-Slicing, Dynamo, KAI Scheduler, and other NVIDIA software stack components), see NVIDIA GPU Stack.
Karpenter GPU NodePool
Karpenter v1.2+
Karpenter has been generally available (GA) since v1.0, and all examples in this document use the `karpenter.sh/v1` API.
GPU Node Auto-Provisioning Concept
Karpenter reads the resource requests of Pending Pods (`nvidia.com/gpu`, memory, CPU) and automatically provisions an EC2 instance that satisfies them. For GPU workloads, its main benefits are:
- Instance diversity: Support for various GPU instances (p4d, p5, g5, g6e, etc.) in a single NodePool
- Spot/On-Demand mix: Balance cost and stability with the `karpenter.sh/capacity-type` requirement
- Consolidation: Automatically clean up idle GPU nodes for cost savings
- Taint-based isolation: Set the `nvidia.com/gpu` taint on GPU nodes to exclude non-GPU workloads
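To make the provisioning trigger concrete, the following is a minimal sketch of a Pod that would stay Pending until Karpenter launches a GPU node for it. The Pod name, image, and CPU/memory sizes are hypothetical placeholders. The `node-type: gpu-inference` selector and the `nvidia.com/gpu` toleration assume the label and taint used in this document's NodePool example.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: llm-inference              # hypothetical name
spec:
  nodeSelector:
    node-type: gpu-inference       # assumes the label set by the GPU NodePool
  tolerations:
    - key: nvidia.com/gpu          # tolerate the GPU node taint
      operator: Exists
      effect: NoSchedule
  containers:
    - name: server
      image: example.com/llm-server:latest   # placeholder image
      resources:
        requests:
          cpu: "4"                 # illustrative sizing only
          memory: 16Gi
        limits:
          nvidia.com/gpu: 1        # extended resources are set under limits
```

Pods without the toleration are kept off GPU nodes by the taint, which is what isolates expensive GPU capacity from general workloads.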
NodePool Configuration Example
```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-inference-pool
spec:
  template:
    metadata:
      labels:
        node-type: gpu-inference
        workload: genai
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - p4d.24xlarge  # 8x A100 40GB
            - p5.48xlarge   # 8x H100 80GB
            - g5.48xlarge   # 8x A10G 24GB
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: Gt
          values: ["0"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-nodeclass
      taints:
        - key: nvidia.com/gpu
          value: "true"
          effect: NoSchedule
  limits:
    cpu: 1000
    memory: 4000Gi
    nvidia.com/gpu: 64
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
  weight: 100
```
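The NodePool references an EC2NodeClass named `gpu-nodeclass` that is not defined in this section. A minimal sketch of what it might look like follows. The IAM role name, the `my-cluster` discovery tag value, and the volume size are hypothetical. It also assumes the `al2023@latest` alias resolves to a GPU-capable AMI for the listed instance types, which should be verified against the Karpenter version in use.

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-nodeclass
spec:
  role: KarpenterNodeRole-my-cluster     # hypothetical IAM role name
  amiSelectorTerms:
    - alias: al2023@latest               # assumption: pin a specific version in production
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster   # hypothetical tag value
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster   # hypothetical tag value
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 200Gi                # illustrative; sized for large model images
        volumeType: gp3
        encrypted: true
```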
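NodePool design points:
- `limits.nvidia.com/gpu: 64`: Caps the total number of GPUs this NodePool can provision, preventing cost runaway. The limit applies per NodePool, not cluster-wide. With the 8-GPU instance types listed above, it allows at most 8 nodes.
- `disruption.consolidateAfter: 30s`: GPU nodes are expensive, so empty or underutilized nodes are considered for removal after 30 seconds.
- `weight: 100`: Sets priority among multiple NodePools. Karpenter considers NodePools with a higher weight first.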
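To illustrate how weight orders NodePools, here is a sketch of a second, lower-priority pool that Karpenter would consider only after a pool with `weight: 100`. The pool name, the choice of `g6e.48xlarge`, and the 16-GPU limit are assumptions made for illustration, not part of the original configuration.

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-inference-fallback       # hypothetical name
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["g6e.48xlarge"]   # 8x L40S 48GB, illustrative choice
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-nodeclass
      taints:
        - key: nvidia.com/gpu
          value: "true"
          effect: NoSchedule
  limits:
    nvidia.com/gpu: 16               # separate cap; limits are per NodePool
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
  weight: 10                         # lower than 100, so evaluated later
```

Because each NodePool carries its own limits, the effective GPU ceiling across pools is the sum of the per-pool limits (64 + 16 = 80 in this sketch).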