
GPU Resource Management

GPU resource management strategies in EKS environments are organized around three axes.

| Axis | Key Question | Core Technologies |
|---|---|---|
| Provisioning | Which GPU nodes to create and when? | Karpenter, EKS Auto Mode, Managed Node Group |
| Scheduling | Which node to place GPU Pods on? | Device Plugin, DRA, Topology-Aware Routing |
| Scaling | How to respond to traffic changes? | KEDA, HPA, Cluster Autoscaler |

This document covers the architecture and design decision criteria for each axis. For GPU Operator details (ClusterPolicy, DCGM, MIG, Time-Slicing, Dynamo, KAI Scheduler, and other NVIDIA software stack components), see NVIDIA GPU Stack.


Karpenter GPU NodePool

Karpenter v1.2+ GA

Karpenter has been GA since v1.0, and all examples in this document use the karpenter.sh/v1 API.

GPU Node Auto-Provisioning Concept

Karpenter analyzes Pending Pod resource requests (nvidia.com/gpu, memory, CPU) to automatically provision the optimal EC2 instance. The core value of Karpenter for GPU workloads includes:

  • Instance diversity: Support for various GPU instances (p4d, p5, g5, g6e, etc.) in a single NodePool
  • Spot/On-Demand mix: Balance cost and stability with capacity-type
  • Consolidation: Automatically clean up idle GPU nodes for cost savings
  • Taint-based isolation: Set nvidia.com/gpu taint on GPU nodes to exclude non-GPU workloads

NodePool Configuration Example

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-inference-pool
spec:
  template:
    metadata:
      labels:
        node-type: gpu-inference
        workload: genai
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - p4d.24xlarge  # 8x A100 40GB
            - p5.48xlarge   # 8x H100 80GB
            - g5.48xlarge   # 8x A10G 24GB
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: Gt
          values: ["0"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-nodeclass
      taints:
        - key: nvidia.com/gpu
          value: "true"
          effect: NoSchedule
  limits:
    cpu: 1000
    memory: 4000Gi
    nvidia.com/gpu: 64
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
  weight: 100

Design Points:

  • limits.nvidia.com/gpu: 64 — Caps the total GPUs this NodePool can provision to prevent cost runaway
  • disruption.consolidateAfter: 30s — Quick cleanup is key since GPU nodes are expensive
  • weight: 100 — Priority setting among multiple NodePools
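
The NodePool above references an EC2NodeClass named gpu-nodeclass. A minimal sketch of that class is shown below; the AMI alias, IAM role, discovery tags, and volume size are assumptions that must be adapted to the target cluster:

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-nodeclass
spec:
  amiSelectorTerms:
    - alias: al2023@latest              # assumed AMI alias; pin a concrete version in production
  role: KarpenterNodeRole-my-cluster    # placeholder IAM role name
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster   # placeholder discovery tag
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 500Gi               # extra room for large model images and weights
        volumeType: gp3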

GPU Instance Type Comparison

| Instance Type | GPU | GPU Memory | vCPU | Memory | Network | Use Case |
|---|---|---|---|---|---|---|
| p4d.24xlarge | 8x A100 | 40GB x 8 | 96 | 1152 GiB | 400 Gbps EFA | Large-scale LLM inference |
| p5.48xlarge | 8x H100 | 80GB x 8 | 192 | 2048 GiB | 3200 Gbps EFA | Ultra-large models, training |
| p5e.48xlarge | 8x H200 | 141GB x 8 | 192 | 2048 GiB | 3200 Gbps EFA | Large model training/inference |
| g5.48xlarge | 8x A10G | 24GB x 8 | 192 | 768 GiB | 100 Gbps | Small/medium model inference |
| g6e.xlarge ~ g6e.48xlarge | NVIDIA L40S (up to 8x) | 48GB per GPU | Up to 192 | Up to 768 GiB | Up to 100 Gbps | Cost-efficient inference |
| trn2.48xlarge | 16x Trainium2 | - | 192 | 2048 GiB | 1600 Gbps | AWS native training |

Instance Selection Guide
  • p5e.48xlarge: 100B+ parameter models, maximize H200 memory
  • p5.48xlarge: 70B+ parameter models, highest performance requirements
  • p4d.24xlarge: 13B-70B parameter models, balanced cost-performance
  • g6e: 13B-70B models, cost-efficient inference with L40S
  • g5.48xlarge: 7B and below models, cost-efficient inference
  • trn2.48xlarge: AWS native training workloads
EKS Auto Mode

EKS Auto Mode automatically detects GPU workloads and provisions appropriate GPU instances. Without separate NodePool configuration, it selects optimal instances based on Pod resource requests.


Kubernetes GPU Scheduling

Device Plugin Model

The default method for using GPUs in Kubernetes is the NVIDIA Device Plugin. It registers nvidia.com/gpu extended resources with kubelet, and Pods specify GPU count in resources.requests.

resources:
  requests:
    nvidia.com/gpu: 1
  limits:
    nvidia.com/gpu: 1

Device Plugin is simple and stable but can only allocate GPUs as whole units and cannot do attribute-based selection (e.g., MIG profiles, specific GPU models).
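
Putting this together with the Karpenter NodePool above, a minimal Pod sketch that requests one whole GPU and tolerates the nvidia.com/gpu taint; the image and label values are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-inference-pod
spec:
  nodeSelector:
    node-type: gpu-inference           # label applied by the NodePool template
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: inference
      image: vllm/vllm-openai:latest   # placeholder image
      resources:
        requests:
          nvidia.com/gpu: 1
        limits:
          nvidia.com/gpu: 1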

Topology-Aware Routing

Topology-aware routing keeps Service traffic between GPU nodes within the same AZ (availability zone) to minimize network latency. The trafficDistribution: PreferClose field used below went GA in K8s 1.33; keeping traffic zone-local particularly benefits multi-node tensor parallelism workloads.

apiVersion: v1
kind: Service
metadata:
  name: vllm-inference
  annotations:
    service.kubernetes.io/topology-mode: "Auto"
spec:
  selector:
    app: vllm
  ports:
    - port: 8000
  trafficDistribution: PreferClose

Gang Scheduling

For large-scale LLM training or tensor parallel inference, multiple GPU Pods must be scheduled simultaneously. If only some of them are placed, those Pods hold GPUs while the rest stay Pending, which can deadlock the cluster.

Solutions:

  • Coscheduling Plugin (scheduler-plugins): PodGroup CRD to specify minimum Pod count for all-or-nothing scheduling (see the sketch after this list)
  • Volcano: Batch scheduler with native Gang Scheduling support
  • KAI Scheduler: NVIDIA's GPU-aware scheduler with GPU topology-aware Gang Scheduling (details in NVIDIA GPU Stack)
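
A hedged sketch of the Coscheduling Plugin pattern referenced above: a PodGroup declares the minimum member count and worker Pods opt in via a label. The scheduler name, replica count, and image are assumptions based on a default scheduler-plugins installation:

apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: tp-inference
spec:
  minMember: 4                     # schedule only if all 4 shards can be placed together
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: tp-worker
spec:
  replicas: 4
  serviceName: tp-worker
  selector:
    matchLabels:
      app: tp-worker
  template:
    metadata:
      labels:
        app: tp-worker
        pod-group.scheduling.sigs.k8s.io: tp-inference   # binds these Pods to the PodGroup
    spec:
      schedulerName: scheduler-plugins-scheduler         # assumes the coscheduling scheduler runs under this name
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: worker
          image: vllm/vllm-openai:latest                 # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1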

DRA (Dynamic Resource Allocation)

Concept and Necessity

DRA is Kubernetes' new GPU resource management paradigm that overcomes Device Plugin limitations.

⚠️ Fundamental Limitations of Device Plugin Model
| Limitation | Description | Impact |
|---|---|---|
| Static Allocation | Resource quantities fixed at node startup | Cannot allocate partial GPU, low utilization |
| No Fine-Grained Control | Can only allocate entire GPU to Pod | No GPU partitioning support (MIG unavailable) |
| No Priority Support | Only first-come-first-served allocation | QoS classes not applied, difficult to ensure fair resource distribution |
| No Dynamic Requirements | Cannot change resources at runtime | Initial request values fixed, difficult to scale |
| No Multi-Resource Coordination | Cannot coordinate multiple resource types | Pod receives 1 GPU but insufficient memory scenario |

DRA Version History
  • K8s 1.26-1.30: Alpha (v1alpha2 API, feature gate required)
  • K8s 1.31: Promoted to Beta, enabled by default
  • K8s 1.32: New implementation (KEP #4381), v1beta1 API
  • K8s 1.33+: v1beta1 stabilized
  • K8s 1.34+: DRA GA (Stable), prioritized alternatives support
  • K8s 1.35: recommended baseline for production use

DRA Core Model

DRA decouples the declarative resource request (a ResourceClaim) from the allocation decision. A Pod requests GPUs by attributes, for example "1 H100 GPU with a MIG 3g.20gb profile", and the DRA Driver matches the claim against the hardware it advertises in ResourceSlices.
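
A hedged sketch of this model using the v1beta1 API: a ResourceClaimTemplate selects devices by attribute with a CEL expression, and the Pod consumes the claim instead of nvidia.com/gpu. The device class name and productName attribute follow NVIDIA DRA driver conventions and are assumptions that depend on the installed driver:

apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: single-h100
spec:
  spec:
    devices:
      requests:
        - name: gpu
          deviceClassName: gpu.nvidia.com      # DeviceClass published by the NVIDIA DRA driver (assumption)
          selectors:
            - cel:
                expression: device.attributes["gpu.nvidia.com"].productName.contains("H100")
---
apiVersion: v1
kind: Pod
metadata:
  name: dra-inference-pod
spec:
  resourceClaims:
    - name: gpu
      resourceClaimTemplateName: single-h100
  containers:
    - name: inference
      image: vllm/vllm-openai:latest           # placeholder image
      resources:
        claims:
          - name: gpu                          # consumes the allocated device via the claim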

DRA vs Device Plugin Comparison

| Item | Device Plugin | DRA |
|---|---|---|
| Resource Allocation | Static registration at node start | Dynamic allocation at Pod scheduling |
| Allocation Unit | Whole GPU only | GPU partitioning possible (MIG, Time-Slicing) |
| Attribute-based Selection | Not possible (index-based) | GPU attribute matching via CEL expressions |
| Multi-resource Coordination | Not possible | Pod-level coordination of multiple resources |
| Karpenter Compatible | Fully supported | Not supported (MNG required) |
| Maturity | Production | K8s 1.34+ GA |

Node Provisioning Compatibility

DRA is not compatible with Karpenter/Auto Mode
| Node Provisioning | DRA Compatible | Notes |
|---|---|---|
| Managed Node Group | ✅ Supported | Recommended |
| Self-Managed Node Group | ✅ Supported | Manual configuration required |
| Karpenter | ❌ Not supported | Skips Pods with ResourceClaim |
| EKS Auto Mode | ❌ Not supported | Same limitation due to internal Karpenter |

Why Karpenter cannot support DRA:

Karpenter analyzes Pod requirements to calculate optimal instances for nodes that don't yet exist. This calculation is impossible with DRA.

  1. ResourceSlice is created after node exists: DRA Driver issues ResourceSlice after detecting GPUs on the node, but Karpenter needs this information before node creation (chicken-and-egg problem)
  2. No instance→ResourceSlice mapping: With Device Plugin, p5.48xlarge → nvidia.com/gpu: 8 is statically known, but with DRA the content varies by Driver implementation
  3. CEL expression simulation impossible: ResourceSlice attribute values needed for evaluation don't exist before node creation

In contrast, Cluster Autoscaler works without interpreting DRA. It only needs the simple decision "there are Pending Pods, so scale up MNG."

DRA Selection Guide

When to use DRA

DRA is needed when:

  • GPU partitioning required (MIG, Time-Slicing, MPS)
  • CEL-based GPU attribute selection in multi-tenant environments
  • Topology-aware scheduling (NVLink, NUMA)
  • P6e-GB200 UltraServer environments (DRA required)
  • K8s 1.34+ environments

Device Plugin is sufficient when:

  • Only whole GPU allocation needed
  • Using Karpenter or EKS Auto Mode
  • K8s 1.33 or below

KEDA GPU-Based Autoscaling

Scaling Architecture

GPU workload autoscaling operates as a 2-stage chain.

  1. Workload Scaling (KEDA/HPA): Adjust Pod count based on GPU metrics
  2. Node Scaling (Karpenter/CA): Auto-provision GPU nodes when Pending Pods occur

LLM Serving Metrics-Based ScaledObject

For LLM serving, KV Cache saturation, TTFT, and queue depth are more sensitive scaling signals than simple GPU utilization.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-serving-scaler
spec:
  scaleTargetRef:
    name: llm-serving
  minReplicaCount: 2
  maxReplicaCount: 10
  triggers:
    # KV Cache saturation — most sensitive signal for LLM serving
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090  # example Prometheus endpoint
        query: avg(vllm_gpu_cache_usage_perc{model="exaone"})
        threshold: "80"
    # Waiting request count
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        query: sum(vllm_num_requests_waiting{model="exaone"})
        threshold: "10"
    # TTFT SLO violation approaching
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        query: |
          histogram_quantile(0.95,
            rate(vllm_time_to_first_token_seconds_bucket[5m]))
        threshold: "2"

Disaggregated Serving Scaling Criteria

When operating Prefill and Decode separately, the bottleneck signals differ for each role.

| | Prefill | Decode |
|---|---|---|
| Bottleneck Signal | TTFT increase, input queue backlog | TPS decrease, KV Cache saturation |
| Scale Criteria | Input token processing wait time | Concurrent generation session count |
| Scale Unit | GPU compute intensive | GPU memory intensive |

| Workload Type | Scale Up Threshold | Scale Down Threshold | Cooldown |
|---|---|---|---|
| Real-time Inference | GPU 70% | GPU 30% | 60s |
| Batch Processing | GPU 85% | GPU 40% | 300s |
| Conversational Service | GPU 60% | GPU 25% | 30s |
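
The cooldown and threshold values above map onto KEDA ScaledObject fields; a hedged sketch for the conversational-service profile follows, where the target name, Prometheus endpoint, and metric label are assumptions:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: chat-serving-scaler
spec:
  scaleTargetRef:
    name: chat-serving                         # placeholder Deployment name
  minReplicaCount: 2
  maxReplicaCount: 20
  pollingInterval: 15                          # evaluate triggers every 15s
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 30       # "Cooldown" column for conversational services
          policies:
            - type: Percent
              value: 50
              periodSeconds: 60
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090   # placeholder endpoint
        query: avg(DCGM_FI_DEV_GPU_UTIL{app="chat-serving"})   # GPU utilization from DCGM-Exporter
        threshold: "60"                        # scale up above 60% GPU utilization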

DRA Workload Scale-out

DRA workloads cannot use Karpenter, so they are configured with MNG + Cluster Autoscaler + KEDA.

LLM Metrics (KV Cache, TTFT, Queue)
→ KEDA: Pod scale-out
→ kube-scheduler: ResourceClaim matching attempt
├─ Success → Place on existing node
└─ Failure → Pod Pending
→ Cluster Autoscaler: MNG +1
→ New GPU node → DRA Driver install
→ ResourceSlice creation → Pod placement
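
A hedged sketch of the MNG half of this chain as an eksctl config; the cluster name, region, instance type, and sizes are assumptions, and Cluster Autoscaler must be installed separately and discover the group through the tags below:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster                  # placeholder cluster name
  region: us-west-2
managedNodeGroups:
  - name: gpu-dra-mng
    instanceType: p4d.24xlarge
    minSize: 0
    maxSize: 8
    desiredCapacity: 1
    labels:
      node-type: gpu-dra
    taints:
      - key: nvidia.com/gpu
        value: "true"
        effect: NoSchedule
    tags:
      # Auto-discovery tags for Cluster Autoscaler (cluster name is a placeholder)
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/my-cluster: "owned"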

Cost Optimization Strategies

GPU Workload Cost Comparison

Inference Workloads

Inference serving is stateless and interruption-tolerant, so it benefits most from Spot capacity. A typical Spot-backed inference stack:

| Component | Purpose | AWS Integration |
|---|---|---|
| DCGM-Exporter | Collect GPU metrics | CloudWatch Container Insights |
| Karpenter GPU NodePool | Provision GPU nodes | EC2 Spot API, CloudWatch metrics |
| CloudWatch Dashboard | Visualize GPU health | Native AWS service |
| CloudWatch Alarms | Alert on GPU issues | SNS notifications |
| IAM Roles (IRSA) | Secure S3 model access | Pod-level permissions |

Training Workloads

Training runs are long and predictable, so Savings Plans commitments cover the baseline while Spot handles experiments. A typical training/batch stack:

| Component | Purpose | Scaling Trigger |
|---|---|---|
| KEDA | Pod autoscaling | Redis queue depth, SQS, CloudWatch |
| Karpenter | Node autoscaling | Pod pressure from KEDA scaling |
| ALB Ingress | Multi-model routing | Path-based routing |
| Redis Streams | Task queue | Persistent, distributed queue |
| CloudWatch | Observability | Custom metrics for latency, throughput |

Cost Optimization Strategy Effects

| Component | Purpose | Cost Optimization |
|---|---|---|
| Dedicated NodePool | Isolate training from inference | Spot instances, right-sized for training |
| Kubeflow/AWS Batch | Distributed training orchestration | Multi-node GPU utilization |
| Checkpointing | Spot interruption recovery | Minimize wasted compute |
| FSx for Lustre | High-throughput data access | Reduce training time |
| EFA Networking | Low-latency GPU communication | Faster distributed training |

4 Key Karpenter-Based Cost Optimization Strategies

Karpenter GPU Workload Optimization

| Feature | Benefit | Configuration |
|---|---|---|
| Spot + On-Demand Mix | 70% cost savings with automatic fallback | `capacity-type: [spot, on-demand]` |
| Multi-Instance Support | Select optimal GPU type per workload | `instance-family: [g5, g6, p4d, p5]` |
| Consolidation | Bin-pack pods to minimize GPU waste | `consolidationPolicy: WhenEmptyOrUnderutilized` |
| Graceful Disruption | Respect PDBs during node replacement | `budgets: nodes: 10%` |
| Fast Scaling | Provision GPU nodes in under 60 seconds | Direct EC2 API calls |
| Custom AMIs | Pre-loaded models and drivers | `amiSelectorTerms` |

| Strategy | Core Mechanism | Expected Savings | Target |
|---|---|---|---|
| Spot Instance Priority | capacity-type: spot + diverse instance types | 60-90% | Inference (stateless) workloads |
| Time-based Disruption Budget | Business hours nodes: 10%, off-hours nodes: 50% | 30-40% | Services with clear business-hour patterns |
| Consolidation | WhenEmptyOrUnderutilized + consolidateAfter: 30s | 20-30% | All GPU workloads |
| Per-workload Instance Optimization | Small models → g5, large models → p5, weight for priority | 15-25% | Operating various model sizes |
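
A hedged sketch of the time-based disruption budget strategy on the NodePool from earlier; the cron window and percentages are assumptions. Karpenter applies the most restrictive active budget, so the 10% budget caps disruption during business hours and the 50% budget governs the rest of the time:

spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
    budgets:
      # Business hours (weekdays, 08:00 UTC for 10h): disrupt at most 10% of nodes
      - nodes: "10%"
        schedule: "0 8 * * mon-fri"
        duration: 10h
      # Always-active ceiling; effectively applies to nights and weekends
      - nodes: "50%"
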
Combined Cost Optimization Effect

Inference workloads: Spot (70%) + Consolidation (20%) + Time-based scheduling (30%) = ~85% total savings

Training workloads: Savings Plans 1-year commitment (35%) + Spot for experiments (40%) + checkpoint restart = ~60% total savings

LLMOps Cost Governance

Both infrastructure costs and token-level costs must be tracked for complete cost visibility.

  • Infrastructure Layer (Bifrost/LiteLLM): Per-model token pricing, per-team/project budget allocation, monthly cost reports
  • Application Layer (Langfuse): Per-agent-workflow-step token consumption, end-to-end cost, trace-based bottleneck analysis
Spot Instance Cautions
  • Interruption handling: 2-minute advance notice. Implement graceful shutdown with terminationGracePeriodSeconds and preStop hooks (see the sketch below)
  • Workload suitability: Suitable for stateless inference workloads
  • Availability: Spot availability for specific instance types may be low; specify diverse types
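
A minimal sketch of the graceful-shutdown pattern for Spot-backed inference Pods; the image, sleep time, and grace period are assumptions and should be tuned to how long in-flight requests take to drain within the 2-minute interruption notice:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-spot
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-spot
  template:
    metadata:
      labels:
        app: vllm-spot
    spec:
      terminationGracePeriodSeconds: 100   # must complete within the ~2-minute Spot notice
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1
          lifecycle:
            preStop:
              exec:
                # Stop accepting new requests and let in-flight generations drain
                command: ["sh", "-c", "sleep 30"]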

Cost Optimization Checklist

| Item | Description | Expected Savings |
|---|---|---|
| Spot Instance Usage | Non-production and fault-tolerant workloads | 60-90% |
| Enable Consolidation | Auto-cleanup of idle nodes | 20-30% |
| Right-sizing | Select instances matching workloads | 15-25% |
| Schedule-based Scaling | Reduce resources during off-hours | 30-40% |
