GPU Resource Management
GPU resource management strategies in EKS environments are organized around three axes.
| Axis | Key Question | Core Technologies |
|---|---|---|
| Provisioning | Which GPU nodes to create and when? | Karpenter, EKS Auto Mode, Managed Node Group |
| Scheduling | Which node to place GPU Pods on? | Device Plugin, DRA, Topology-Aware Routing |
| Scaling | How to respond to traffic changes? | KEDA, HPA, Cluster Autoscaler |
This document covers the architecture and design decision criteria for each axis. For GPU Operator details (ClusterPolicy, DCGM, MIG, Time-Slicing, Dynamo, KAI Scheduler, and other NVIDIA software stack components), see NVIDIA GPU Stack.
Karpenter GPU NodePool
Karpenter has been GA since v1.0, and all examples in this document use the karpenter.sh/v1 API.
GPU Node Auto-Provisioning Concept
Karpenter analyzes Pending Pod resource requests (nvidia.com/gpu, memory, CPU) and automatically provisions the optimal EC2 instance. Its core benefits for GPU workloads include:
- Instance diversity: Support for various GPU instances (p4d, p5, g5, g6e, etc.) in a single NodePool
- Spot/On-Demand mix: Balance cost and stability with capacity-type
- Consolidation: Automatically clean up idle GPU nodes for cost savings
- Taint-based isolation: Set the `nvidia.com/gpu` taint on GPU nodes to exclude non-GPU workloads
NodePool Configuration Example
```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-inference-pool
spec:
  template:
    metadata:
      labels:
        node-type: gpu-inference
        workload: genai
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - p4d.24xlarge   # 8x A100 40GB
            - p5.48xlarge    # 8x H100 80GB
            - g5.48xlarge    # 8x A10G 24GB
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: Gt
          values: ["0"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-nodeclass
      taints:
        - key: nvidia.com/gpu
          value: "true"
          effect: NoSchedule
  limits:
    cpu: 1000
    memory: 4000Gi
    nvidia.com/gpu: 64
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
  weight: 100
```
Design Points:
- `limits.nvidia.com/gpu: 64`: cluster-wide GPU cap to prevent cost runaway
- `disruption.consolidateAfter: 30s`: quick cleanup is key since GPU nodes are expensive
- `weight: 100`: priority setting among multiple NodePools
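The NodePool references an EC2NodeClass named `gpu-nodeclass` that is not shown above. A minimal sketch, assuming an AL2023 AMI alias and illustrative role/tag names (adjust to your cluster):

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-nodeclass
spec:
  amiSelectorTerms:
    - alias: al2023@latest              # assumed AMI family; a custom AMI with preloaded drivers/models also works
  role: KarpenterNodeRole-genai         # illustrative IAM role name
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: genai-cluster   # illustrative discovery tag
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: genai-cluster
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 500Gi               # room for container images and model weights
        volumeType: gp3
```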
GPU Instance Type Comparison
| Instance Type | GPU | GPU Memory | vCPU | Memory | Network | Use Case |
|---|---|---|---|---|---|---|
| p4d.24xlarge | 8x A100 | 40GB x 8 | 96 | 1152 GiB | 400 Gbps EFA | Large-scale LLM inference |
| p5.48xlarge | 8x H100 | 80GB x 8 | 192 | 2048 GiB | 3200 Gbps EFA | Ultra-large models, training |
| p5e.48xlarge | 8x H200 | 141GB x 8 | 192 | 2048 GiB | 3200 Gbps EFA | Large model training/inference |
| g5.48xlarge | 8x A10G | 24GB x 8 | 192 | 768 GiB | 100 Gbps | Small/medium model inference |
| g6e.xlarge ~ g6e.48xlarge | NVIDIA L40S | Up to 8x 48GB | Up to 192 | Up to 1536 GiB | Up to 400 Gbps | Cost-efficient inference |
| trn2.48xlarge | 16x Trainium2 | - | 192 | 2048 GiB | 1600 Gbps | AWS native training |
- p5e.48xlarge: 100B+ parameter models, maximize H200 memory
- p5.48xlarge: 70B+ parameter models, highest performance requirements
- p4d.24xlarge: 13B-70B parameter models, balanced cost-performance
- g6e: 13B-70B models, cost-efficient inference with L40S
- g5.48xlarge: 7B and below models, cost-efficient inference
- trn2.48xlarge: AWS native training workloads
EKS Auto Mode can also provision GPU nodes through its built-in Karpenter. Its default node pools cover general-purpose instance families only, so GPU workloads need a NodePool that allows GPU instance types; Auto Mode then selects the optimal instance based on Pod resource requests.
Kubernetes GPU Scheduling
Device Plugin Model
The default method for using GPUs in Kubernetes is the NVIDIA Device Plugin. It registers nvidia.com/gpu extended resources with kubelet, and Pods specify GPU count in resources.requests.
```yaml
resources:
  requests:
    nvidia.com/gpu: 1
  limits:
    nvidia.com/gpu: 1
```
Device Plugin is simple and stable but can only allocate GPUs as whole units and cannot do attribute-based selection (e.g., MIG profiles, specific GPU models).
Topology-Aware Routing
Topology-Aware Routing, stable since K8s 1.33 via the trafficDistribution field, keeps Service traffic to endpoints in the same AZ (availability zone) whenever possible. Avoiding cross-zone hops reduces latency, which particularly improves performance for multi-node tensor parallelism workloads.
```yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-inference
  annotations:
    service.kubernetes.io/topology-mode: "Auto"
spec:
  selector:
    app: vllm
  ports:
    - port: 8000
  trafficDistribution: PreferClose
```
Gang Scheduling
For large-scale LLM training or tensor parallel inference, multiple GPU Pods must be scheduled simultaneously. If only some are placed, the rest remain Pending and occupy resources, creating a deadlock.
Solutions:
- Coscheduling Plugin (scheduler-plugins): PodGroup CRD to specify minimum Pod count for all-or-nothing scheduling
- Volcano: Batch scheduler with native Gang Scheduling support
- KAI Scheduler: NVIDIA's GPU-aware scheduler with GPU topology-aware Gang Scheduling (details in NVIDIA GPU Stack)
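A minimal all-or-nothing sketch with the coscheduling plugin; the PodGroup name, minMember, scheduler name, and pod-group label key are assumptions that depend on the scheduler-plugins version and how it was installed:

```yaml
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: llm-tensor-parallel
spec:
  minMember: 4                 # schedule only if all 4 GPU Pods can be placed together
---
apiVersion: v1
kind: Pod
metadata:
  name: llm-worker-0
  labels:
    scheduling.x-k8s.io/pod-group: llm-tensor-parallel   # associates the Pod with the PodGroup
spec:
  schedulerName: scheduler-plugins-scheduler              # name used by a typical scheduler-plugins install
  containers:
    - name: worker
      image: vllm/vllm-openai:latest
      resources:
        limits:
          nvidia.com/gpu: 1
```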
DRA (Dynamic Resource Allocation)
Concept and Necessity
DRA is Kubernetes' new GPU resource management paradigm that overcomes Device Plugin limitations.
- K8s 1.26-1.30: Alpha (`v1alpha2` API, feature gate required)
- K8s 1.31: New structured-parameters implementation (KEP #4381), `v1alpha3` API
- K8s 1.32: Promoted to Beta, `v1beta1` API
- K8s 1.33: `v1beta2` API added, approaching stability
- K8s 1.34: GA (Stable), prioritized alternatives support
- K8s 1.35: Recommended for production
DRA Core Model
DRA separates declarative resource requests (ResourceClaim) from immediate allocation. When a Pod requests GPUs based on attributes like "1 H100 GPU, MIG 3g.20gb profile", the DRA Driver matches it with actual hardware.
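A hedged sketch of such a request with the NVIDIA DRA driver, using the `v1beta1` API; the `gpu.nvidia.com` DeviceClass and the `productName` attribute follow the driver's published examples, so verify the attribute names against the ResourceSlices your driver version actually exposes:

```yaml
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: h100-claim
spec:
  devices:
    requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com
        selectors:
          - cel:
              expression: device.attributes["gpu.nvidia.com"].productName.matches("H100")
---
apiVersion: v1
kind: Pod
metadata:
  name: inference
spec:
  resourceClaims:
    - name: gpu
      resourceClaimName: h100-claim
  containers:
    - name: app
      image: vllm/vllm-openai:latest
      resources:
        claims:
          - name: gpu             # the container consumes the claimed device
```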
DRA vs Device Plugin Comparison
| Item | Device Plugin | DRA |
|---|---|---|
| Resource Allocation | Static registration at node start | Dynamic allocation at Pod scheduling |
| Allocation Unit | Whole GPU only | GPU partitioning possible (MIG, Time-Slicing) |
| Attribute-based Selection | Not possible (index-based) | GPU attribute matching via CEL expressions |
| Multi-resource Coordination | Not possible | Pod-level coordination of multiple resources |
| Karpenter Compatible | Fully supported | Not supported (MNG required) |
| Maturity | Production | K8s 1.34+ GA |
Node Provisioning Compatibility
| Node Provisioning | DRA Compatible | Notes |
|---|---|---|
| Managed Node Group | ✅ Supported | Recommended |
| Self-Managed Node Group | ✅ Supported | Manual configuration required |
| Karpenter | ❌ Not supported | Skips Pods with ResourceClaim |
| EKS Auto Mode | ❌ Not supported | Same limitation due to internal Karpenter |
Why Karpenter cannot support DRA:
Karpenter analyzes Pod requirements to calculate optimal instances for nodes that don't yet exist. This calculation is impossible with DRA.
- ResourceSlice is created after node exists: DRA Driver issues ResourceSlice after detecting GPUs on the node, but Karpenter needs this information before node creation (chicken-and-egg problem)
- No instance→ResourceSlice mapping: With Device Plugin, `p5.48xlarge → nvidia.com/gpu: 8` is statically known, but with DRA the content varies by Driver implementation
- CEL expression simulation impossible: ResourceSlice attribute values needed for evaluation don't exist before node creation
In contrast, Cluster Autoscaler works without interpreting DRA. It only needs the simple decision "there are Pending Pods, so scale up MNG."
DRA Selection Guide
DRA is needed when:
- GPU partitioning required (MIG, Time-Slicing, MPS)
- CEL-based GPU attribute selection in multi-tenant environments
- Topology-aware scheduling (NVLink, NUMA)
- P6e-GB200 UltraServer environments (DRA required)
- K8s 1.34+ environments
Device Plugin is sufficient when:
- Only whole GPU allocation needed
- Using Karpenter or EKS Auto Mode
- K8s 1.33 or below
KEDA GPU-Based Autoscaling
Scaling Architecture
GPU workload autoscaling operates as a 2-stage chain.
- Workload Scaling (KEDA/HPA): Adjust Pod count based on GPU metrics
- Node Scaling (Karpenter/CA): Auto-provision GPU nodes when Pending Pods occur
LLM Serving Metrics-Based ScaledObject
For LLM serving, KV Cache saturation, TTFT, and queue depth are more sensitive scaling signals than simple GPU utilization.
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-serving-scaler
spec:
  scaleTargetRef:
    name: llm-serving
  minReplicaCount: 2
  maxReplicaCount: 10
  triggers:
    # KV Cache saturation — most sensitive signal for LLM serving
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090  # adjust to your Prometheus endpoint
        query: avg(vllm_gpu_cache_usage_perc{model="exaone"})
        threshold: "80"
    # Waiting request count
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        query: sum(vllm_num_requests_waiting{model="exaone"})
        threshold: "10"
    # TTFT SLO violation approaching
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        query: |
          histogram_quantile(0.95,
            rate(vllm_time_to_first_token_seconds_bucket[5m]))
        threshold: "2"
```
Disaggregated Serving Scaling Criteria
When operating Prefill and Decode separately, the bottleneck signals differ for each role.
| | Prefill | Decode |
|---|---|---|
| Bottleneck Signal | TTFT increase, input queue backlog | TPS decrease, KV Cache saturation |
| Scaling Criterion | Input token processing wait time | Concurrent generation session count |
| Resource Profile | GPU compute intensive | GPU memory intensive |
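A hedged sketch of scaling the two roles independently, assuming prefill and decode run as separate Deployments; the llm-prefill / llm-decode names, the Prometheus address, and the label selectors are illustrative:

```yaml
# Prefill pool: compute-bound, scale on input queue backlog
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-prefill-scaler
spec:
  scaleTargetRef:
    name: llm-prefill
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        query: sum(vllm_num_requests_waiting{service="llm-prefill"})
        threshold: "10"
---
# Decode pool: memory-bound, scale on KV Cache saturation
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-decode-scaler
spec:
  scaleTargetRef:
    name: llm-decode
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        query: avg(vllm_gpu_cache_usage_perc{service="llm-decode"})
        threshold: "80"
```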
Recommended Scaling Thresholds
| Workload Type | Scale Up Threshold | Scale Down Threshold | Cooldown |
|---|---|---|---|
| Real-time Inference | GPU 70% | GPU 30% | 60s |
| Batch Processing | GPU 85% | GPU 40% | 300s |
| Conversational Service | GPU 60% | GPU 25% | 30s |
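For the plain GPU-utilization thresholds above, a sketch of the real-time inference row (70% scale-up target, 60s cooldown) using the DCGM-Exporter metric DCGM_FI_DEV_GPU_UTIL; the Prometheus address and label selector are assumptions:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: realtime-inference-gpu-scaler
spec:
  scaleTargetRef:
    name: llm-serving
  minReplicaCount: 2
  maxReplicaCount: 10
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 60   # cooldown from the real-time inference row
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090   # adjust to your Prometheus endpoint
        query: avg(DCGM_FI_DEV_GPU_UTIL{kubernetes_namespace="inference"})  # label selector is illustrative
        threshold: "70"                    # scale-up target for real-time inference
```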
DRA Workload Scale-out
DRA workloads cannot use Karpenter, so they are configured with MNG + Cluster Autoscaler + KEDA.
```
LLM Metrics (KV Cache, TTFT, Queue)
  → KEDA: Pod scale-out
    → kube-scheduler: ResourceClaim matching attempt
        ├─ Success → Place on existing node
        └─ Failure → Pod Pending
            → Cluster Autoscaler: MNG +1
              → New GPU node → DRA Driver install
                → ResourceSlice creation → Pod placement
```
Cost Optimization Strategies
GPU Workload Cost Structure
Inference Workloads
| Component | Purpose | AWS Integration |
|---|---|---|
| DCGM-Exporter | Collect GPU metrics | CloudWatch Container Insights |
| Karpenter GPU NodePool | Provision GPU nodes | EC2 Spot API, CloudWatch metrics |
| CloudWatch Dashboard | Visualize GPU health | Native AWS service |
| CloudWatch Alarms | Alert on GPU issues | SNS notifications |
| IAM Roles (IRSA) | Secure S3 model access | Pod-level permissions |
Training Workloads
| Component | Purpose | Scaling Trigger |
|---|---|---|
| KEDA | Pod autoscaling | Redis queue depth, SQS, CloudWatch |
| Karpenter | Node autoscaling | Pod pressure from KEDA scaling |
| ALB Ingress | Multi-model routing | Path-based routing |
| Redis Streams | Task queue | Persistent, distributed queue |
| CloudWatch | Observability | Custom metrics for latency, throughput |
Cost Optimization Strategy Effects
| Component | Purpose | Cost Optimization |
|---|---|---|
| Dedicated NodePool | Isolate training from inference | Spot instances, right-sized for training |
| Kubeflow/AWS Batch | Distributed training orchestration | Multi-node GPU utilization |
| Checkpointing | Spot interruption recovery | Minimize wasted compute |
| FSx for Lustre | High-throughput data access | Reduce training time |
| EFA Networking | Low-latency GPU communication | Faster distributed training |
4 Key Karpenter-Based Cost Optimization Strategies
Karpenter capabilities that underpin these strategies:
| Feature | Benefit | Configuration |
|---|---|---|
| Spot + On-Demand Mix | 70% cost savings with automatic fallback | `capacity-type: [spot, on-demand]` |
| Multi-Instance Support | Select optimal GPU type per workload | `instance-family: [g5, g6, p4d, p5]` |
| Consolidation | Bin-pack pods to minimize GPU waste | `consolidationPolicy: WhenEmptyOrUnderutilized` |
| Graceful Disruption | Respect PDBs during node replacement | `budgets: nodes: 10%` |
| Fast Scaling | Provision GPU nodes in under 60 seconds | Direct EC2 API calls |
| Custom AMIs | Pre-loaded models and drivers | `amiSelectorTerms` |
The four strategies and their expected impact:
| Strategy | Core Mechanism | Expected Savings | Target |
|---|---|---|---|
| Spot Instance Priority | capacity-type: spot + diverse instance types | 60-90% | Inference (stateless) workloads |
| Time-based Disruption Budget | Business hours nodes: 10%, off-hours nodes: 50% (see the sketch below the table) | 30-40% | Services with clear business hour patterns |
| Consolidation | WhenEmptyOrUnderutilized + consolidateAfter: 30s | 20-30% | All GPU workloads |
| Per-workload Instance Optimization | Small models→g5, large models→p5, weight for priority | 15-25% | Operating various model sizes |
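The time-based disruption budget row can be expressed directly on a NodePool. A minimal sketch, assuming a 09:00-18:00 weekday business window (Karpenter evaluates budget schedules in UTC):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-inference-pool-scheduled
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: Gt
          values: ["0"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-nodeclass
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
    budgets:
      - nodes: "10%"                # stricter cap while the business-hours window is active
        schedule: "0 9 * * 1-5"     # window start, Monday-Friday
        duration: 9h
      - nodes: "50%"                # default cap outside the window
```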
Inference workloads: Spot (70%) + Consolidation (20%) + Time-based scheduling (30%) = ~85% total savings
Training workloads: Savings Plans 1-year commitment (35%) + Spot for experiments (40%) + checkpoint restart = ~60% total savings
These estimates compound on the remaining spend rather than adding up: for inference, 1 - (0.30 × 0.80 × 0.70) ≈ 83%, and for training, 1 - (0.65 × 0.60) ≈ 61%.
LLMOps Cost Governance
Both infrastructure costs and token-level costs must be tracked for complete cost visibility.
- Infrastructure Layer (Bifrost/LiteLLM): Per-model token pricing, per-team/project budget allocation, monthly cost reports
- Application Layer (Langfuse): Per-agent-workflow-step token consumption, end-to-end cost, trace-based bottleneck analysis
Spot instance operational notes:
- Interruption handling: 2-minute advance notice. Implement graceful shutdown with `terminationGracePeriodSeconds` and `preStop` hooks (see the sketch below)
- Workload suitability: Suitable for stateless inference workloads
- Availability: Spot capacity for specific instance types may be scarce; specify diverse types
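A minimal sketch of graceful shutdown for Spot interruptions; the sleep duration and image are illustrative, and the grace period should be tuned to your model's drain time while staying under the 2-minute notice:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-serving
  template:
    metadata:
      labels:
        app: llm-serving
    spec:
      terminationGracePeriodSeconds: 110   # stay under the 2-minute Spot interruption notice
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 20"]  # allow endpoint removal before SIGTERM reaches the server
          resources:
            limits:
              nvidia.com/gpu: 1
```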
Cost Optimization Checklist
| Item | Description | Expected Savings |
|---|---|---|
| Spot Instance Usage | Non-production and fault-tolerant workloads | 60-90% |
| Enable Consolidation | Auto-cleanup of idle nodes | 20-30% |
| Right-sizing | Select instances matching workloads | 15-25% |
| Schedule-based Scaling | Reduce resources during off-hours | 30-40% |
Related Documents
- NVIDIA GPU Stack — GPU Operator, DCGM, MIG, Time-Slicing, Dynamo
- EKS GPU Node Strategy — Auto Mode + Karpenter + Hybrid Node configuration
- vLLM Model Serving — Inference engine deployment