GPU Resource Management
GPU resource management strategies in EKS environments are organized around three axes.
| Axis | Key Question | Core Technologies |
|---|---|---|
| Provisioning | Which GPU nodes to create and when? | Karpenter, EKS Auto Mode, Managed Node Group |
| Scheduling | Which node to place GPU Pods on? | Device Plugin, DRA, Topology-Aware Routing |
| Scaling | How to respond to traffic changes? | KEDA, HPA, Cluster Autoscaler |
This document covers the architecture and design decision criteria for each axis. For GPU Operator details (ClusterPolicy, DCGM, MIG, Time-Slicing, Dynamo, KAI Scheduler, and other NVIDIA software stack components), see NVIDIA GPU Stack.
Karpenter GPU NodePool
Karpenter has been GA since v1.0, and all examples in this document use the karpenter.sh/v1 API.
GPU Node Auto-Provisioning Concept
Karpenter analyzes Pending Pod resource requests (nvidia.com/gpu, memory, CPU) and automatically provisions the optimal EC2 instance. Its core benefits for GPU workloads include:
- Instance diversity: Support for various GPU instances (p4d, p5, g5, g6e, etc.) in a single NodePool
- Spot/On-Demand mix: Balance cost and stability with capacity-type
- Consolidation: Automatically clean up idle GPU nodes for cost savings
- Taint-based isolation: Set the `nvidia.com/gpu` taint on GPU nodes to exclude non-GPU workloads
NodePool Configuration Example
```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-inference-pool
spec:
  template:
    metadata:
      labels:
        node-type: gpu-inference
        workload: genai
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - p4d.24xlarge   # 8x A100 40GB
            - p5.48xlarge    # 8x H100 80GB
            - g5.48xlarge    # 8x A10G 24GB
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: Gt
          values: ["0"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-nodeclass
      taints:
        - key: nvidia.com/gpu
          value: "true"
          effect: NoSchedule
  limits:
    cpu: 1000
    memory: 4000Gi
    nvidia.com/gpu: 64
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
  weight: 100
```
Design Points:
- `limits.nvidia.com/gpu: 64`: cluster-wide GPU cap to prevent cost runaway
- `disruption.consolidateAfter: 30s`: quick cleanup is key since GPU nodes are expensive
- `weight: 100`: priority setting among multiple NodePools
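The NodePool references an EC2NodeClass named `gpu-nodeclass` that is not shown above. A minimal sketch, assuming an AL2023 AMI alias and illustrative role/tag names (adjust to your cluster):

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-nodeclass
spec:
  amiSelectorTerms:
    - alias: al2023@latest              # assumed AMI family; a custom AMI with preloaded drivers/models also works
  role: KarpenterNodeRole-genai         # illustrative IAM role name
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: genai-cluster   # illustrative discovery tag
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: genai-cluster
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 500Gi               # room for container images and model weights
        volumeType: gp3
```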
GPU Instance Type Comparison
| Instance Type | GPU | GPU Memory | vCPU | Memory | Network | Use Case |
|---|---|---|---|---|---|---|
| p4d.24xlarge | 8x A100 | 40GB x 8 | 96 | 1152 GiB | 400 Gbps EFA | Large-scale LLM inference |
| p5.48xlarge | 8x H100 | 80GB x 8 | 192 | 2048 GiB | 3200 Gbps EFA | Ultra-large models, training |
| p5e.48xlarge | 8x H200 | 141GB x 8 | 192 | 2048 GiB | 3200 Gbps EFA | Large model training/inference |
| g5.48xlarge | 8x A10G | 24GB x 8 | 192 | 768 GiB | 100 Gbps | Small/medium model inference |
| g6e.xlarge ~ g6e.48xlarge | NVIDIA L40S | Up to 8x 48GB | Up to 192 | Up to 1536 GiB | Up to 400 Gbps | Cost-efficient inference |
| trn2.48xlarge | 16x Trainium2 | - | 192 | 2048 GiB | 1600 Gbps | AWS native training |
- p5e.48xlarge: 100B+ parameter models, maximize H200 memory
- p5.48xlarge: 70B+ parameter models, highest performance requirements
- p4d.24xlarge: 13B-70B parameter models, balanced cost-performance
- g6e: 13B-70B models, cost-efficient inference with L40S
- g5.48xlarge: 7B and below models, cost-efficient inference
- trn2.48xlarge: AWS native training workloads
EKS Auto Mode can also provision GPU nodes through its built-in Karpenter. Its default node pools cover general-purpose instance families only, so GPU workloads need a NodePool that allows GPU instance types; Auto Mode then selects the optimal instance based on Pod resource requests.
Kubernetes GPU Scheduling
Device Plugin Model
The default method for using GPUs in Kubernetes is the NVIDIA Device Plugin. It registers nvidia.com/gpu extended resources with kubelet, and Pods specify GPU count in resources.requests.
```yaml
resources:
  requests:
    nvidia.com/gpu: 1
  limits:
    nvidia.com/gpu: 1
```
Device Plugin is simple and stable but can only allocate GPUs as whole units and cannot do attribute-based selection (e.g., MIG profiles, specific GPU models).
Topology-Aware Routing
Topology-Aware Routing, stable since K8s 1.33 via the trafficDistribution field, keeps Service traffic to endpoints in the same AZ (availability zone) whenever possible. Avoiding cross-zone hops reduces latency, which particularly improves performance for multi-node tensor parallelism workloads.
```yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-inference
  annotations:
    service.kubernetes.io/topology-mode: "Auto"
spec:
  selector:
    app: vllm
  ports:
    - port: 8000
  trafficDistribution: PreferClose
```
Gang Scheduling
For large-scale LLM training or tensor parallel inference, multiple GPU Pods must be scheduled simultaneously. If only some are placed, the rest remain Pending and occupy resources, creating a deadlock.
Solutions:
- Coscheduling Plugin (scheduler-plugins): PodGroup CRD to specify minimum Pod count for all-or-nothing scheduling
- Volcano: Batch scheduler with native Gang Scheduling support
- KAI Scheduler: NVIDIA's GPU-aware scheduler with GPU topology-aware Gang Scheduling (details in NVIDIA GPU Stack)
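A minimal all-or-nothing sketch with the coscheduling plugin; the PodGroup name, minMember, scheduler name, and pod-group label key are assumptions that depend on the scheduler-plugins version and how it was installed:

```yaml
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: llm-tensor-parallel
spec:
  minMember: 4                 # schedule only if all 4 GPU Pods can be placed together
---
apiVersion: v1
kind: Pod
metadata:
  name: llm-worker-0
  labels:
    scheduling.x-k8s.io/pod-group: llm-tensor-parallel   # associates the Pod with the PodGroup
spec:
  schedulerName: scheduler-plugins-scheduler              # name used by a typical scheduler-plugins install
  containers:
    - name: worker
      image: vllm/vllm-openai:latest
      resources:
        limits:
          nvidia.com/gpu: 1
```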
DRA (Dynamic Resource Allocation)
Concept and Necessity
DRA is Kubernetes' new GPU resource management paradigm that overcomes Device Plugin limitations.
- K8s 1.26-1.30: Alpha (`v1alpha2` API, feature gate required)
- K8s 1.31: New structured-parameters implementation (KEP #4381), `v1alpha3` API
- K8s 1.32: Promoted to Beta, `v1beta1` API
- K8s 1.33: `v1beta2` API added, approaching stability
- K8s 1.34: GA (Stable), prioritized alternatives support
- K8s 1.35: Recommended for production
DRA Core Model
DRA separates declarative resource requests (ResourceClaim) from immediate allocation. When a Pod requests GPUs based on attributes like "1 H100 GPU, MIG 3g.20gb profile", the DRA Driver matches it with actual hardware.
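A hedged sketch of such a request with the NVIDIA DRA driver, using the `v1beta1` API; the `gpu.nvidia.com` DeviceClass and the `productName` attribute follow the driver's published examples, so verify the attribute names against the ResourceSlices your driver version actually exposes:

```yaml
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: h100-claim
spec:
  devices:
    requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com
        selectors:
          - cel:
              expression: device.attributes["gpu.nvidia.com"].productName.matches("H100")
---
apiVersion: v1
kind: Pod
metadata:
  name: inference
spec:
  resourceClaims:
    - name: gpu
      resourceClaimName: h100-claim
  containers:
    - name: app
      image: vllm/vllm-openai:latest
      resources:
        claims:
          - name: gpu             # the container consumes the claimed device
```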
DRA vs Device Plugin Comparison
| Item | Device Plugin | DRA |
|---|---|---|
| Resource Allocation | Static registration at node start | Dynamic allocation at Pod scheduling |
| Allocation Unit | Whole GPU only | GPU partitioning possible (MIG, Time-Slicing) |
| Attribute-based Selection | Not possible (index-based) | GPU attribute matching via CEL expressions |
| Multi-resource Coordination | Not possible | Pod-level coordination of multiple resources |
| Karpenter Compatible | Fully supported | Not supported (MNG required) |
| Maturity | Production | K8s 1.34+ GA |
Node Provisioning Compatibility
| Node Provisioning | DRA Compatible | Notes |
|---|---|---|
| Managed Node Group | ✅ Supported | Recommended |
| Self-Managed Node Group | ✅ Supported | Manual configuration required |
| Karpenter | ❌ Not supported | Skips Pods with ResourceClaim |
| EKS Auto Mode | ❌ Not supported | Same limitation due to internal Karpenter |
Why Karpenter cannot support DRA:
Karpenter analyzes Pod requirements to calculate optimal instances for nodes that don't yet exist. This calculation is impossible with DRA.
- ResourceSlice is created after node exists: DRA Driver issues ResourceSlice after detecting GPUs on the node, but Karpenter needs this information before node creation (chicken-and-egg problem)
- No instance→ResourceSlice mapping: With Device Plugin, `p5.48xlarge → nvidia.com/gpu: 8` is statically known, but with DRA the content varies by Driver implementation
- CEL expression simulation impossible: ResourceSlice attribute values needed for evaluation don't exist before node creation
In contrast, Cluster Autoscaler works without interpreting DRA. It only needs the simple decision "there are Pending Pods, so scale up MNG."
DRA Selection Guide
DRA is needed when:
- GPU partitioning required (MIG, Time-Slicing, MPS)
- CEL-based GPU attribute selection in multi-tenant environments
- Topology-aware scheduling (NVLink, NUMA)
- P6e-GB200 UltraServer environments (DRA required)
- K8s 1.34+ environments
Device Plugin is sufficient when:
- Only whole GPU allocation needed
- Using Karpenter or EKS Auto Mode
- K8s 1.33 or below
KEDA GPU-Based Autoscaling
Scaling Architecture
GPU workload autoscaling operates as a 2-stage chain.
- Workload Scaling (KEDA/HPA): Adjust Pod count based on GPU metrics
- Node Scaling (Karpenter/CA): Auto-provision GPU nodes when Pending Pods occur
LLM Serving Metrics-Based ScaledObject
For LLM serving, KV Cache saturation, TTFT, and queue depth are more sensitive scaling signals than simple GPU utilization.
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-serving-scaler
spec:
  scaleTargetRef:
    name: llm-serving
  minReplicaCount: 2
  maxReplicaCount: 10
  triggers:
    # KV Cache saturation — most sensitive signal for LLM serving
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090  # adjust to your Prometheus endpoint
        query: avg(vllm_gpu_cache_usage_perc{model="exaone"})
        threshold: "80"
    # Waiting request count
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        query: sum(vllm_num_requests_waiting{model="exaone"})
        threshold: "10"
    # TTFT SLO violation approaching
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        query: |
          histogram_quantile(0.95,
            rate(vllm_time_to_first_token_seconds_bucket[5m]))
        threshold: "2"
```
Disaggregated Serving Scaling Criteria
When operating Prefill and Decode separately, the bottleneck signals differ for each role.
| | Prefill | Decode |
|---|---|---|
| Bottleneck Signal | TTFT increase, input queue backlog | TPS decrease, KV Cache saturation |
| Scaling Criterion | Input token processing wait time | Concurrent generation session count |
| Resource Profile | GPU compute intensive | GPU memory intensive |
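A hedged sketch of scaling the two roles independently, assuming prefill and decode run as separate Deployments; the llm-prefill / llm-decode names, the Prometheus address, and the label selectors are illustrative:

```yaml
# Prefill pool: compute-bound, scale on input queue backlog
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-prefill-scaler
spec:
  scaleTargetRef:
    name: llm-prefill
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        query: sum(vllm_num_requests_waiting{service="llm-prefill"})
        threshold: "10"
---
# Decode pool: memory-bound, scale on KV Cache saturation
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-decode-scaler
spec:
  scaleTargetRef:
    name: llm-decode
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        query: avg(vllm_gpu_cache_usage_perc{service="llm-decode"})
        threshold: "80"
```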
Recommended Scaling Thresholds
| Workload Type | Scale Up Threshold | Scale Down Threshold | Cooldown |
|---|---|---|---|
| Real-time Inference | GPU 70% | GPU 30% | 60s |
| Batch Processing | GPU 85% | GPU 40% | 300s |
| Conversational Service | GPU 60% | GPU 25% | 30s |
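For the plain GPU-utilization thresholds above, a sketch of the real-time inference row (70% scale-up target, 60s cooldown) using the DCGM-Exporter metric DCGM_FI_DEV_GPU_UTIL; the Prometheus address and label selector are assumptions:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: realtime-inference-gpu-scaler
spec:
  scaleTargetRef:
    name: llm-serving
  minReplicaCount: 2
  maxReplicaCount: 10
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 60   # cooldown from the real-time inference row
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090   # adjust to your Prometheus endpoint
        query: avg(DCGM_FI_DEV_GPU_UTIL{kubernetes_namespace="inference"})  # label selector is illustrative
        threshold: "70"                    # scale-up target for real-time inference
```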
DRA Workload Scale-out
DRA workloads cannot use Karpenter, so they are configured with MNG + Cluster Autoscaler + KEDA.
```
LLM Metrics (KV Cache, TTFT, Queue)
  → KEDA: Pod scale-out
    → kube-scheduler: ResourceClaim matching attempt
        ├─ Success → Place on existing node
        └─ Failure → Pod Pending
            → Cluster Autoscaler: MNG +1
              → New GPU node → DRA Driver install
                → ResourceSlice creation → Pod placement
```
Cost Optimization Strategies
GPU Workload Cost Structure
Inference Workloads
| Component | Purpose | AWS Integration |
|---|---|---|
| DCGM-Exporter | Collect GPU metrics | CloudWatch Container Insights |
| Karpenter GPU NodePool | Provision GPU nodes | EC2 Spot API, CloudWatch metrics |
| CloudWatch Dashboard | Visualize GPU health | Native AWS service |
| CloudWatch Alarms | Alert on GPU issues | SNS notifications |
| IAM Roles (IRSA) | Secure S3 model access | Pod-level permissions |
Training Workloads
| Component | Purpose | Scaling Trigger |
|---|---|---|
| KEDA | Pod autoscaling | Redis queue depth, SQS, CloudWatch |
| Karpenter | Node autoscaling | Pod pressure from KEDA scaling |
| ALB Ingress | Multi-model routing | Path-based routing |
| Redis Streams | Task queue | Persistent, distributed queue |
| CloudWatch | Observability | Custom metrics for latency, throughput |
Cost Optimization Strategy Effects
| Component | Purpose | Cost Optimization |
|---|---|---|
| Dedicated NodePool | Isolate training from inference | Spot instances, right-sized for training |
| Kubeflow/AWS Batch | Distributed training orchestration | Multi-node GPU utilization |
| Checkpointing | Spot interruption recovery | Minimize wasted compute |
| FSx for Lustre | High-throughput data access | Reduce training time |
| EFA Networking | Low-latency GPU communication | Faster distributed training |
4 Key Karpenter-Based Cost Optimization Strategies
Karpenter capabilities that underpin these strategies:
| Feature | Benefit | Configuration |
|---|---|---|
| Spot + On-Demand Mix | 70% cost savings with automatic fallback | `capacity-type: [spot, on-demand]` |
| Multi-Instance Support | Select optimal GPU type per workload | `instance-family: [g5, g6, p4d, p5]` |
| Consolidation | Bin-pack pods to minimize GPU waste | `consolidationPolicy: WhenEmptyOrUnderutilized` |
| Graceful Disruption | Respect PDBs during node replacement | `budgets: nodes: 10%` |
| Fast Scaling | Provision GPU nodes in under 60 seconds | Direct EC2 API calls |
| Custom AMIs | Pre-loaded models and drivers | `amiSelectorTerms` |
The four strategies and their expected impact:
| Strategy | Core Mechanism | Expected Savings | Target |
|---|---|---|---|
| Spot Instance Priority | capacity-type: spot + diverse instance types | 60-90% | Inference (stateless) workloads |
| Time-based Disruption Budget | Business hours nodes: 10%, off-hours nodes: 50% (see the sketch below the table) | 30-40% | Services with clear business hour patterns |
| Consolidation | WhenEmptyOrUnderutilized + consolidateAfter: 30s | 20-30% | All GPU workloads |
| Per-workload Instance Optimization | Small models→g5, large models→p5, weight for priority | 15-25% | Operating various model sizes |
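The time-based disruption budget row can be expressed directly on a NodePool. A minimal sketch, assuming a 09:00-18:00 weekday business window (Karpenter evaluates budget schedules in UTC):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-inference-pool-scheduled
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: Gt
          values: ["0"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-nodeclass
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
    budgets:
      - nodes: "10%"                # stricter cap while the business-hours window is active
        schedule: "0 9 * * 1-5"     # window start, Monday-Friday
        duration: 9h
      - nodes: "50%"                # default cap outside the window
```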
Inference workloads: Spot (70%) + Consolidation (20%) + Time-based scheduling (30%) = ~85% total savings
Training workloads: Savings Plans 1-year commitment (35%) + Spot for experiments (40%) + checkpoint restart = ~60% total savings
These estimates compound on the remaining spend rather than adding up: for inference, 1 - (0.30 × 0.80 × 0.70) ≈ 83%, and for training, 1 - (0.65 × 0.60) ≈ 61%.
LLMOps Cost Governance
Both infrastructure costs and token-level costs must be tracked for complete cost visibility.
- Infrastructure Layer (Bifrost/LiteLLM): Per-model token pricing, per-team/project budget allocation, monthly cost reports
- Application Layer (Langfuse): Per-agent-workflow-step token consumption, end-to-end cost, trace-based bottleneck analysis
Spot instance operational notes:
- Interruption handling: 2-minute advance notice. Implement graceful shutdown with `terminationGracePeriodSeconds` and `preStop` hooks (see the sketch below)
- Workload suitability: Suitable for stateless inference workloads
- Availability: Spot capacity for specific instance types may be scarce; specify diverse types
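A minimal sketch of graceful shutdown for Spot interruptions; the sleep duration and image are illustrative, and the grace period should be tuned to your model's drain time while staying under the 2-minute notice:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-serving
  template:
    metadata:
      labels:
        app: llm-serving
    spec:
      terminationGracePeriodSeconds: 110   # stay under the 2-minute Spot interruption notice
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 20"]  # allow endpoint removal before SIGTERM reaches the server
          resources:
            limits:
              nvidia.com/gpu: 1
```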
Cost Optimization Checklist
| Item | Description | Expected Savings |
|---|---|---|
| Spot Instance Usage | Non-production and fault-tolerant workloads | 60-90% |
| Enable Consolidation | Auto-cleanup of idle nodes | 20-30% |
| Right-sizing | Select instances matching workloads | 15-25% |
| Schedule-based Scaling | Reduce resources during off-hours | 30-40% |
Related Documents
- NVIDIA GPU Stack — GPU Operator, DCGM, MIG, Time-Slicing, Dynamo
- EKS GPU Node Strategy — Auto Mode + Karpenter + Hybrid Node configuration
- vLLM Model Serving — Inference engine deployment