
GPU Resources · Observability · Hybrid Node · Lessons Learned

Overview

The majority of LLM serving operational cost comes from GPU uptime, and cost efficiency requires autoscaling, observability, fallback, and on-premises integration to work in concert. This document consolidates 2-Tier scaling, DCGM/vLLM monitoring, Bifrost→Bedrock Cascade Fallback, EKS Hybrid Node integration, and lessons learned from large MoE model deployments.

GPU Resource Management & Autoscaling

2-Tier Scaling Architecture

LLM serving is scaled in two tiers: Pod scaling (KEDA) and Node scaling (Karpenter).

KEDA Scaling Configuration

Three core scaling signals for LLM serving:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-inference-scaler
spec:
  scaleTargetRef:
    name: vllm-deployment
  minReplicaCount: 2
  maxReplicaCount: 8
  triggers:
    # 1. KV Cache saturation — most sensitive signal
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090  # adjust to your Prometheus endpoint
        query: avg(vllm_gpu_cache_usage_perc)
        threshold: "80"
    # 2. Number of waiting requests
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        query: sum(vllm_num_requests_waiting)
        threshold: "10"
    # 3. TTFT SLO violation proximity
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        query: |
          histogram_quantile(0.95,
            rate(vllm_time_to_first_token_seconds_bucket[5m]))
        threshold: "2"

Disaggregated Serving Scaling Criteria

Prefill and Decode have different bottleneck signals.

|                     | Prefill                                | Decode                                  |
| ------------------- | -------------------------------------- | --------------------------------------- |
| Bottleneck Signal   | TTFT increase, input queue backlog     | TPS decrease, KV Cache saturation       |
| Scaling Criterion   | Input token processing wait time       | Concurrent generation session count     |
| GPU Characteristics | Compute-intensive (compute bottleneck) | Memory-intensive (bandwidth bottleneck) |
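
Where Prefill and Decode run as separate Deployments, these two signals can drive separate ScaledObjects. A minimal sketch, assuming hypothetical vllm-prefill / vllm-decode Deployments, a tier label on the metrics, and illustrative thresholds:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: prefill-scaler
spec:
  scaleTargetRef:
    name: vllm-prefill                 # hypothetical Prefill Deployment
  triggers:
    # Prefill bottleneck: input queue backlog (drives TTFT)
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        query: sum(vllm_num_requests_waiting{tier="prefill"})   # tier label is an assumption
        threshold: "10"
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: decode-scaler
spec:
  scaleTargetRef:
    name: vllm-decode                  # hypothetical Decode Deployment
  triggers:
    # Decode bottleneck: KV Cache saturation (concurrent generation sessions)
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        query: avg(vllm_gpu_cache_usage_perc{tier="decode"})
        threshold: "80"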

DRA (Dynamic Resource Allocation) Reality

DRA provides GPU partitioning and topology-aware scheduling, available as v1beta1 in K8s 1.32+ and GA in 1.34+. However, it has an architectural incompatibility with Karpenter/Auto Mode.

  • Karpenter must simulate GPU resources before node creation, but DRA's ResourceSlice is published by DRA Driver after node creation
  • Due to this "chicken and egg" problem, DRA Pods are skipped in Karpenter
  • When using DRA: Managed Node Groups (MNG) + Cluster Autoscaler are required

DRA Usage Decision

When DRA is needed: MIG partitioning, CEL-based attribute GPU selection, P6e-GB200 environments

When Device Plugin is sufficient: Whole GPU unit allocation, Karpenter/Auto Mode usage
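
For reference, CEL-based GPU selection with DRA looks roughly like the following ResourceClaimTemplate. The device class and attribute names depend on the installed DRA driver; the gpu.nvidia.com domain and the profile attribute shown here are illustrative:

apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: mig-1g-10gb
spec:
  spec:
    devices:
      requests:
        - name: gpu
          deviceClassName: mig.nvidia.com        # DeviceClass published by the DRA driver (illustrative)
          selectors:
            - cel:
                # Select a specific MIG profile via device attributes (attribute name is illustrative)
                expression: device.attributes["gpu.nvidia.com"].profile == "1g.10gb"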

Cost Optimization Stack

Combining four strategies can achieve approximately 85% total cost reduction.

| Strategy               | Reduction Effect | Application Method                                                 |
| ---------------------- | ---------------- | ------------------------------------------------------------------ |
| Spot Instances         | 60-90%           | Karpenter capacity-type: spot, p5 Spot $13-15/hr (us-east-2)        |
| Consolidation          | 20-30%           | consolidationPolicy: WhenEmptyOrUnderutilized, 30s wait             |
| Right-sizing           | 15-25%           | Automatic instance type selection by model size (NodePool weight)   |
| Time-based Scheduling  | 30-40%           | disruption budget to reduce 50%+ during non-business hours          |

# Karpenter time-based disruption budget example
disruption:
  consolidationPolicy: WhenEmptyOrUnderutilized
  consolidateAfter: 30s
  budgets:
    # Business hours: stability priority
    - nodes: "10%"
      schedule: "0 9 * * 1-5"
      duration: 9h
    # Non-business hours: cost priority
    - nodes: "50%"
      schedule: "0 18 * * 1-5"
      duration: 15h
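
The Spot and Right-sizing rows map directly to NodePool settings. A minimal sketch, assuming an illustrative gpu-inference-spot NodePool and gpu-nodeclass EC2NodeClass, with instance types matched to the target model size:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-inference-spot
spec:
  weight: 10                               # prefer this pool over lower-weight on-demand pools
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["p5.48xlarge"]          # illustrative; choose per model memory footprint
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-nodeclass                # illustrative name
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s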

Observability & Fallback Strategy

GPU Monitoring Stack

Core Monitoring Metrics

GPU Infrastructure Metrics (DCGM):

| Metric                    | Description             | Threshold                        |
| ------------------------- | ----------------------- | -------------------------------- |
| DCGM_FI_DEV_GPU_UTIL      | GPU SM utilization      | > 90%: warning, > 95%: critical  |
| DCGM_FI_DEV_MEM_COPY_UTIL | Memory copy utilization | > 80%: caution                   |
| DCGM_FI_DEV_FB_USED       | Framebuffer usage       | Available memory < 10%: critical |
| DCGM_FI_DEV_POWER_USAGE   | GPU power consumption   | Caution when approaching TDP     |

vLLM Inference Metrics:

| Metric                                    | Description           | Threshold                      |
| ----------------------------------------- | --------------------- | ------------------------------ |
| vllm:gpu_cache_usage_perc                 | KV Cache usage        | > 80%: scale out               |
| vllm:num_requests_waiting                 | Waiting requests      | > 10: scale out                |
| vllm:time_to_first_token_seconds          | TTFT                  | P95 > 2s: action required      |
| vllm:num_preemptions_total                | Preemption count      | High indicates memory shortage |
| vllm:avg_generation_throughput_toks_per_s | Generation throughput | Monitor vs baseline            |
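
Assuming the Prometheus Operator is installed, the thresholds above can be wired into alerts roughly as follows (rule names and label values are illustrative):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: llm-serving-alerts
spec:
  groups:
    - name: gpu-infrastructure
      rules:
        - alert: GPUUtilizationCritical
          expr: avg by (instance) (DCGM_FI_DEV_GPU_UTIL) > 95   # critical threshold from the DCGM table
          for: 5m
          labels:
            severity: critical
    - name: vllm-inference
      rules:
        - alert: RequestQueueBacklog
          expr: sum(vllm:num_requests_waiting) > 10             # scale-out threshold from the vLLM table
          for: 2m
          labels:
            severity: warning
        - alert: TTFTSLOBreach
          expr: |
            histogram_quantile(0.95,
              rate(vllm:time_to_first_token_seconds_bucket[5m])) > 2
          for: 5m
          labels:
            severity: warning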

2-Tier Cost Tracking

Track both infrastructure and application levels for complete cost visibility.

  • Bifrost (Infrastructure Level): Token unit price per model, team/project budget management, monthly cost reports
  • Langfuse (Application Level): Token consumption per Agent workflow stage, chain end-to-end latency, Trace-based performance bottleneck analysis

This 2-Tier strategy enables simultaneous understanding of "which models were used how much" (infrastructure) and "which features drive costs" (application).

Bifrost → Bedrock Cascade Fallback

When self-hosted models (vLLM/llm-d) are overloaded or failing, Cascade Routing can be configured to automatically fallback to Amazon Bedrock's managed models. Bifrost (or LiteLLM) acts as the Gateway, switching requests to Bedrock on response failures/timeouts.

Bifrost Cascade Routing Configuration:

# bifrost-config.yaml
routing:
  defaultModel: self-hosted-qwen3
  strategy: cascade
  cascadeOrder:
    - self-hosted-qwen3        # Primary: EKS Self-hosted (cost optimized)
    - self-hosted-glm5         # Secondary: EKS Self-hosted alternative
    - bedrock-claude-sonnet    # Tertiary: Bedrock managed (fallback)
  fallbackConditions:
    - statusCode: [500, 502, 503, 504]
    - latencyMs: "> 30000"     # Fallback if exceeding 30s
    - errorRate: "> 0.1"       # Fallback if error rate exceeds 10%

models:
  - name: self-hosted-qwen3
    provider: openai-compatible
    baseUrl: http://inference-gateway.llm-d:8080/v1
    model: Qwen/Qwen3-32B
    priority: 1
    costPer1kTokens: 0.001     # Self-hosted estimated cost

  - name: self-hosted-glm5
    provider: openai-compatible
    baseUrl: http://glm5-service.agentic-serving:8000/v1
    model: zai-org/GLM-5-FP8
    priority: 2
    costPer1kTokens: 0.003

  - name: bedrock-claude-sonnet
    provider: bedrock
    model: anthropic.claude-sonnet-4-20250514
    region: us-east-1
    priority: 3
    costPer1kTokens: 0.003     # Bedrock official pricing
    maxTokens: 4096

Advantages of Cascade Routing:

| Perspective       | Self-hosted Only                           | Cascade (Self-hosted + Bedrock)                               |
| ----------------- | ------------------------------------------ | -------------------------------------------------------------- |
| Availability      | Service interruption on GPU failure        | Uninterrupted via Bedrock fallback                              |
| Cost              | Fixed GPU cost                             | Regular Self-hosted (low cost) + peak Bedrock (pay-as-you-go)   |
| Capacity Planning | Secure GPU for peak traffic                | GPU for baseline traffic only, excess to Bedrock                |
| Cold Start        | Several minutes delay on Spot interruption | Bedrock immediate response                                      |

Cost Optimization Pattern

Processing 80% of regular traffic with Self-hosted and offloading 20% peak to Bedrock eliminates the need to provision GPUs for peak capacity, achieving an additional 30-40% infrastructure cost reduction. Bedrock also serves as immediate backup during Spot instance interruptions.

Hybrid Node: On-Premises GPU Farm Integration

Overview

EKS Hybrid Node is a feature for registering on-premises GPU servers as nodes in an EKS cluster (GA November 2024). Existing DGX and other GPU servers can be integrated with cloud EKS to build a hybrid inference architecture.

Hybrid Node Registration

# 1. Create Hybrid Node IAM Role
aws iam create-role \
--role-name EKSHybridNodeRole \
--assume-role-policy-document file://hybrid-node-trust-policy.json

aws iam attach-role-policy \
--role-name EKSHybridNodeRole \
--policy-arn arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy

# 2. Register Hybrid Node from on-premises server
curl -o hybrid-node-installer.sh https://hybrid.eks.amazonaws.com/installer
chmod +x hybrid-node-installer.sh

sudo ./hybrid-node-installer.sh \
--cluster-name genai-platform \
--region us-west-2 \
--role-arn arn:aws:iam::123456789012:role/EKSHybridNodeRole \
--credential-provider ssm

# 3. Verify nodes
kubectl get nodes -l node.kubernetes.io/instance-type=hybrid

Hybrid Node GPU Operator Installation

On-premises nodes lack the AWS-managed GPU software stack, so the NVIDIA GPU Operator is required.

# GPU Operator Helm Values (Hybrid Node dedicated)
driver:
  enabled: true                # On-premises: driver installation required
  version: "580.126.18"
  nodeSelector:
    node.kubernetes.io/instance-type: hybrid

toolkit:
  enabled: true
  nodeSelector:
    node.kubernetes.io/instance-type: hybrid

devicePlugin:
  enabled: true                # On-premises: Device Plugin installation required
  nodeSelector:
    node.kubernetes.io/instance-type: hybrid

dcgmExporter:
  enabled: true
  serviceMonitor:
    enabled: true
    additionalLabels:
      location: on-premises    # Separate on-premises/cloud metrics
  nodeSelector:
    node.kubernetes.io/instance-type: hybrid

3-Tier Cascade: On-Prem → Cloud → Bedrock

Combining Hybrid Node with Bifrost Cascade creates a 3-Tier architecture that maximizes both cost efficiency and availability.

| Tier   | Infrastructure            | Cost Structure             | Role                                    |
| ------ | ------------------------- | -------------------------- | ---------------------------------------- |
| Tier 1 | On-Prem Hybrid Node (DGX) | Fixed cost (already owned) | Handle baseline traffic (always active)  |
| Tier 2 | Cloud GPU (EKS Spot/OD)   | Variable cost (hourly)     | Peak traffic bursts                      |
| Tier 3 | Amazon Bedrock            | Pay-as-you-go (per token)  | Failure/overload fallback                |

# Bifrost 3-Tier Cascade configuration
routing:
  strategy: cascade
  cascadeOrder:
    - onprem-dgx-llm       # Primary: On-Prem (fixed cost, always active)
    - cloud-eks-llm        # Secondary: Cloud GPU (Spot, elastic)
    - bedrock-fallback     # Tertiary: Bedrock (pay-as-you-go, unlimited capacity)

models:
  - name: onprem-dgx-llm
    provider: openai-compatible
    baseUrl: http://hybrid-node-vllm.inference:8000/v1
    model: Qwen/Qwen3-32B
    priority: 1
    healthCheck:
      endpoint: /health
      intervalMs: 10000

  - name: cloud-eks-llm
    provider: openai-compatible
    baseUrl: http://inference-gateway.llm-d:8080/v1
    model: Qwen/Qwen3-32B
    priority: 2

  - name: bedrock-fallback
    provider: bedrock
    model: anthropic.claude-sonnet-4-20250514
    region: us-east-1
    priority: 3

Pod Placement Strategy: Workload Separation with nodeSelector

# Deploy on On-Prem Hybrid Node (baseline inference)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-onprem
spec:
  template:
    spec:
      nodeSelector:
        node.kubernetes.io/instance-type: hybrid
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.6.3
          args: ["Qwen/Qwen3-32B-FP8", "--gpu-memory-utilization=0.95"]
          resources:
            limits:
              nvidia.com/gpu: 1
---
# Deploy on Cloud GPU Node (burst traffic)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-cloud-burst
spec:
  template:
    spec:
      nodeSelector:
        karpenter.sh/nodepool: gpu-inference   # Cloud Karpenter NodePool
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.6.3
          args: ["Qwen/Qwen3-32B-FP8", "--gpu-memory-utilization=0.95"]
          resources:
            limits:
              nvidia.com/gpu: 1

Hybrid Node Network Considerations
  • Latency: 10-50ms additional delay via VPN/Direct Connect compared to cloud nodes
  • Bandwidth: Multi-node NCCL communication requires high bandwidth → pipeline parallelism (PP) within the On-Prem cluster is feasible, but On-Prem↔Cloud PP is not recommended
  • Recommendation: On-Prem nodes serve independent models, connect with Cloud nodes via Cascade Routing at Gateway level

Lessons Learned: Large MoE Model Deployment

Image/Model Download Failure Mitigation

Large model (744GB+) weight download is the most common Cold Start bottleneck in LLM serving. Downloading hundreds of GB from HuggingFace Hub frequently fails due to network instability, timeouts, and disk shortage.

Problem Types and Responses

| Problem                          | Symptoms                               | Response                                             |
| -------------------------------- | -------------------------------------- | ----------------------------------------------------- |
| HF Hub Download Timeout          | Pod CrashLoopBackOff, ConnectionError  | Retry + resume support (HF_HUB_ENABLE_HF_TRANSFER=1)   |
| Large File Partial Download      | Corruption error during model loading  | Checksum verification + re-download                    |
| Slow Container Image Pull        | ImagePullBackOff, several minutes wait | Pre-cache images (Bottlerocket data volume, SOCI)      |
| Multi-node Simultaneous Download | Network bandwidth contention           | S3 caching + init container sequential loading         |
| Slow EFS Download                | 30+ minutes loading time               | Switch to NVMe emptyDir                                |

Strategy 1: HuggingFace Transfer Acceleration

hf_transfer is a Rust-based high-speed download library, 3-5x faster than default download.

env:
  - name: HF_HUB_ENABLE_HF_TRANSFER
    value: "1"
  - name: HF_TOKEN
    valueFrom:
      secretKeyRef:
        name: hf-token
        key: token
  # Extend the per-request download timeout
  - name: HF_HUB_DOWNLOAD_TIMEOUT
    value: "600"   # 10 minute timeout

Strategy 2: S3 Pre-caching + Init Container

This is the most stable method: pre-upload model weights to S3, then copy them to local NVMe in an init container.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-with-s3-cache
spec:
  template:
    spec:
      initContainers:
        # Stage 1: Download model from S3 to NVMe
        - name: model-downloader
          image: amazon/aws-cli:latest
          command: ["/bin/sh", "-c"]
          args:
            - |
              echo "Checking local cache..."
              if [ -f /models/config.json ]; then
                echo "Model already cached, skipping download"
                exit 0
              fi
              echo "Downloading model from S3..."
              aws s3 sync s3://model-cache/qwen3-32b-fp8/ /models/ \
                --no-progress
              echo "Download complete, verifying..."
              # Basic integrity check: the safetensors index file must exist
              if [ -f /models/model.safetensors.index.json ]; then
                echo "Model verified successfully"
              else
                echo "ERROR: Model incomplete, retrying..."
                rm -rf /models/*
                aws s3 sync s3://model-cache/qwen3-32b-fp8/ /models/
              fi
          volumeMounts:
            - name: model-cache
              mountPath: /models
          resources:
            requests:
              cpu: 2
              memory: 4Gi
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.6.3
          args:
            - /models
            - "--gpu-memory-utilization=0.95"
          volumeMounts:
            - name: model-cache
              mountPath: /models
      volumes:
        - name: model-cache
          emptyDir:
            sizeLimit: 200Gi   # NVMe emptyDir

Strategy 3: Container Image Pre-caching

Methods to reduce Pull time for vLLM/SGLang images (10-20GB).

# Keep pulled images cached on GPU nodes via kubelet image GC tuning
# (in the Karpenter v1 API, kubelet settings are configured on the EC2NodeClass)
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-inference
spec:
  kubelet:
    # Raise image GC thresholds so cached images are retained longer
    imageGCHighThresholdPercent: 90
    imageGCLowThresholdPercent: 85

Using SOCI (Seekable OCI) Index:

Creating a SOCI index for images in ECR enables lazy loading of image layers at pull time, reducing container start time by 70-80%.

# Create and push a SOCI index using the soci CLI (soci-snapshotter)
sudo soci create 123456789012.dkr.ecr.us-east-2.amazonaws.com/vllm:v0.6.3
sudo soci push 123456789012.dkr.ecr.us-east-2.amazonaws.com/vllm:v0.6.3

# EKS Auto Mode supports SOCI automatically
# Karpenter: native SOCI support when using the Bottlerocket AMI

Strategy 4: Multi-node LWS Model Download Coordination

When deploying with LWS multi-node, network contention occurs if Leader and Worker simultaneously download the same model.

# Leader Pod: download from S3 to local NVMe, then publish a completion marker to S3
initContainers:
  - name: model-downloader
    command: ["/bin/sh", "-c"]
    args:
      - |
        # Leader downloads first
        aws s3 sync s3://model-cache/glm5-fp8/ /models/
        # Publish a marker object so Workers know the Leader has finished
        # (a marker on the Leader's local NVMe is not visible to other nodes)
        touch /tmp/.download-complete
        aws s3 cp /tmp/.download-complete s3://model-cache/glm5-fp8/.download-complete

# Worker Pod: wait for the Leader's marker, then download independently
initContainers:
  - name: model-downloader
    command: ["/bin/sh", "-c"]
    args:
      - |
        # Wait for the Leader to finish, avoiding simultaneous bandwidth contention
        until aws s3 ls s3://model-cache/glm5-fp8/.download-complete > /dev/null 2>&1; do
          sleep 10
        done
        # NVMe emptyDir is node-local and cannot be shared, so each Worker downloads its own copy
        aws s3 sync s3://model-cache/glm5-fp8/ /models/

Download Performance Comparison

| Method                     | 744GB Model Time | Stability         | Cost            |
| -------------------------- | ---------------- | ----------------- | --------------- |
| HF Hub Direct              | 20-40min         | Frequent timeouts | Free            |
| HF Hub + hf_transfer       | 10-15min         | Good              | Free            |
| S3 Pre-caching             | 5-10min          | Very Stable       | S3 Storage Cost |
| FSx for Lustre             | 5-8min           | Stable            | High            |
| NVMe Local Cache (Restart) | < 1min           | Best              | Free            |

EKS Auto Mode GPU Limitations

Core limitations identified during GLM-5 (744B MoE) and Kimi K2.5 (1T MoE) deployments.

p6-b200 Not Supported

As of April 2026, EKS Auto Mode's managed Karpenter cannot provision p6-b200.48xlarge. NodePool validation passes but actual NodeClaim creation fails with NoCompatibleInstanceTypes error.

GPU Instance Capacity Acquisition

p5.48xlarge frequently hits InsufficientCapacity in the Seoul and Tokyo regions. It is available on us-east-2 (Ohio) Spot for $13-15/hr (an 85% reduction vs the On-Demand price of ~$98/hr).

| Region                 | p5.48xlarge On-Demand | p5.48xlarge Spot | Spot Price |
| ---------------------- | --------------------- | ---------------- | ---------- |
| ap-northeast-2 (Seoul) | InsufficientCapacity  | Unconfirmed      | -          |
| ap-northeast-1 (Tokyo) | InsufficientCapacity  | Unconfirmed      | -          |
| us-east-2 (Ohio)       | Variable availability | Available        | $13-15/hr  |

GPU Operator Conflict

Installing GPU Operator with devicePlugin.enabled=true conflicts with Auto Mode's built-in Device Plugin, resulting in allocatable=0. Must install with devicePlugin.enabled=false.

Cannot Directly Terminate EC2 Instances

Auto Mode managed nodes block ec2:TerminateInstances via resource-based policy. Node cleanup must be performed indirectly through Karpenter NodePool deletion or Pod removal.

Serving Framework Compatibility

| Model              | vLLM Support  | SGLang Support | Notes                                                                  |
| ------------------ | ------------- | -------------- | ----------------------------------------------------------------------- |
| Qwen3-32B          | Supported     | Supported      | llm-d default model, Apache 2.0                                          |
| Kimi K2.5 (1T MoE) | Supported     | Supported      | INT4 W4A16 Marlin MoE, gpu_memory_utilization=0.85                       |
| GLM-5 (744B MoE)   | Not supported | Supported      | glm_moe_dsa architecture → requires transformers v5.2+, vLLM uses v4.x   |
| DeepSeek V3.2      | Supported     | Supported      | MoE, 671B/37B active                                                     |

GLM-5 Deployment Caution

GLM-5 is not supported in vLLM. Must use SGLang-dedicated image (lmsysorg/sglang:glm5-hopper), and configure --pp-size 2 --nnodes 2 --dist-init-addr <leader>:5000 for multi-node deployment.
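
A minimal LeaderWorkerSet sketch of that multi-node launch is shown below. The image and SGLang flags come from the caution above; the LeaderWorkerSet layout, port, per-node GPU counts, and the LWS_LEADER_ADDRESS environment variable injected by LWS are assumptions to verify against the LWS and SGLang documentation:

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: glm5-sglang                        # illustrative name
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2                                # 1 leader + 1 worker, matching --nnodes 2
    leaderTemplate:
      spec:
        containers:
          - name: sglang
            image: lmsysorg/sglang:glm5-hopper
            command: ["/bin/sh", "-c"]
            args:
              - >
                python3 -m sglang.launch_server
                --model-path zai-org/GLM-5-FP8
                --pp-size 2 --nnodes 2 --node-rank 0
                --dist-init-addr ${LWS_LEADER_ADDRESS}:5000
            resources:
              limits:
                nvidia.com/gpu: 8          # illustrative per-node GPU count
    workerTemplate:
      spec:
        containers:
          - name: sglang
            image: lmsysorg/sglang:glm5-hopper
            command: ["/bin/sh", "-c"]
            args:
              - >
                python3 -m sglang.launch_server
                --model-path zai-org/GLM-5-FP8
                --pp-size 2 --nnodes 2 --node-rank 1
                --dist-init-addr ${LWS_LEADER_ADDRESS}:5000
            resources:
              limits:
                nvidia.com/gpu: 8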

Storage Strategy

Storage performance is critical for large model (744GB+) weight loading.

| Storage             | Sequential Read | Multi-node Sharing | Recommended Scenario               |
| ------------------- | --------------- | ------------------ | ----------------------------------- |
| NVMe emptyDir       | ~3,500 MB/s     | Node-independent   | p5 built-in NVMe, best performance  |
| EFS                 | ~100-300 MB/s   | ReadWriteMany      | Small models, when sharing needed   |
| S3 + init container | ~1,000 MB/s     | S3 shared          | Medium performance, cost efficient  |
| FSx for Lustre      | ~1,000+ MB/s    | ReadWriteMany      | Training workloads                  |

Large Model Recommendation

For large models like GLM-5 (744GB) and Kimi K2.5 (630GB), local NVMe (emptyDir) is recommended. p5.48xlarge has 8×3.84TB NVMe SSDs built in, providing the best performance at no additional cost. First startup takes 10-20min with direct HuggingFace Hub download, but subsequent loads are fast.
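
For emptyDir to land on the instance-store NVMe rather than the EBS root volume, the Karpenter EC2NodeClass can assemble the local disks into a RAID0 array. An excerpt-style sketch (the gpu-nodeclass name is illustrative; other required EC2NodeClass fields are omitted):

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-nodeclass                      # illustrative name
spec:
  amiSelectorTerms:
    - alias: bottlerocket@latest
  # RAID0 the local NVMe instance-store disks and use them for kubelet state
  # and emptyDir volumes, so model caches land on NVMe instead of EBS
  instanceStorePolicy: RAID0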

GPU Quota Pitfall

EC2 vCPU quotas are split into separate buckets by instance category, so caution is required.

| Quota                                | Applicable Instances | Default | Caution                                                   |
| ------------------------------------ | -------------------- | ------- | ---------------------------------------------------------- |
| Running On-Demand P instances        | p4d, p5, p5en        | 384     | Can have 2 p5.48xlarge (192 vCPU each)                      |
| Running On-Demand G and VT instances | g5, g6, g6e          | 64      | Cannot even have 1 g6e.48xlarge → quota increase required   |

Setting instance-category: [g, p] together in a GPU NodePool may cause Karpenter to try G types first and hit the G quota (64 vCPU). If only P types are needed, specify them explicitly, as in the excerpt below.
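
A NodePool requirements excerpt restricting provisioning to the P bucket (the families listed are illustrative):

# NodePool requirements excerpt: provision only P-family GPU instances
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["p"]                    # avoid falling back to G types and their separate 64 vCPU quota
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["p5", "p4d"]            # illustrative; list only the families actually used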
