GPU Resources · Observability · Hybrid Node · Lessons Learned
Overview
The majority of LLM serving operational cost is GPU uptime, so cost efficiency depends on autoscaling, observability, fallback, and on-premises integration working together. This document consolidates 2-Tier scaling, DCGM/vLLM monitoring, Bifrost→Bedrock Cascade Fallback, EKS Hybrid Node integration, and lessons learned from large MoE model deployments.
GPU Resource Management & Autoscaling
2-Tier Scaling Architecture
LLM serving scales in two tiers: Pod scaling (KEDA, driven by inference metrics) and Node scaling (Karpenter, driven by unschedulable Pods).
KEDA Scaling Configuration
Three core scaling signals for LLM serving:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-inference-scaler
spec:
  scaleTargetRef:
    name: vllm-deployment
  minReplicaCount: 2
  maxReplicaCount: 8
  triggers:
  # 1. KV Cache saturation — most sensitive signal
  # (vllm:gpu_cache_usage_perc is a 0-1 ratio, hence the * 100)
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090  # adjust to your Prometheus endpoint
      query: avg(vllm:gpu_cache_usage_perc) * 100
      threshold: "80"
  # 2. Number of waiting requests
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090
      query: sum(vllm:num_requests_waiting)
      threshold: "10"
  # 3. TTFT SLO violation proximity
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090
      query: |
        histogram_quantile(0.95,
          rate(vllm:time_to_first_token_seconds_bucket[5m]))
      threshold: "2"
Disaggregated Serving Scaling Criteria
Prefill and Decode have different bottleneck signals.
| | Prefill | Decode |
|---|---|---|
| Bottleneck Signal | TTFT increase, input queue backlog | TPS decrease, KV Cache saturation |
| Scaling Criterion | Input token processing wait time | Concurrent generation session count |
| GPU Characteristics | Compute-intensive (compute bottleneck) | Memory-intensive (bandwidth bottleneck) |
DRA (Dynamic Resource Allocation) Reality
DRA provides GPU partitioning and topology-aware scheduling, shipping as v1beta1 in K8s 1.32+ and GA in 1.34+. However, it has an architectural incompatibility with Karpenter/Auto Mode.
- Karpenter must simulate GPU resources before node creation, but DRA's ResourceSlice is published by the DRA Driver only after node creation
- Due to this "chicken and egg" problem, Karpenter skips DRA Pods
- When using DRA: MNG + Cluster Autoscaler is required
- When DRA is needed: MIG partitioning, CEL-based GPU attribute selection, P6e-GB200 environments (see the sketch below)
- When the Device Plugin is sufficient: whole-GPU allocation, Karpenter/Auto Mode usage
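For illustration, a minimal ResourceClaimTemplate sketch that selects a MIG slice via a CEL attribute expression (v1beta1 API as in K8s 1.32; the device class and attribute names follow NVIDIA DRA driver examples and are assumptions, not verified values):
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: mig-1g-10gb
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: mig.nvidia.com  # DeviceClass published by the NVIDIA DRA driver
        selectors:
        - cel:
            expression: device.attributes["gpu.nvidia.com"].profile == "1g.10gb"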
Cost Optimization Stack
Combining four strategies can achieve approximately 85% total cost reduction.
| Strategy | Reduction Effect | Application Method |
|---|---|---|
| Spot Instances | 60-90% | Karpenter capacity-type: spot, p5 Spot $13-15/hr (us-east-2) |
| Consolidation | 20-30% | consolidationPolicy: WhenEmptyOrUnderutilized, 30s wait |
| Right-sizing | 15-25% | Automatic instance type selection by model size (NodePool weight) |
| Time-based Scheduling | 30-40% | Disruption budgets allowing 50%+ consolidation during non-business hours |
# Karpenter time-based disruption budget example
disruption:
consolidationPolicy: WhenEmptyOrUnderutilized
consolidateAfter: 30s
budgets:
# Business hours: stability priority
- nodes: "10%"
schedule: "0 9 * * 1-5"
duration: 9h
# Non-business hours: cost priority
- nodes: "50%"
schedule: "0 18 * * 1-5"
duration: 15h
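For context, a minimal NodePool sketch combining the Spot and Right-sizing rows of the table above (karpenter.sh/v1 API; the pool name, weight, and instance-family pinning are illustrative choices):
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-inference-spot
spec:
  weight: 10  # preferred over lower-weight on-demand pools
  template:
    spec:
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot"]
      - key: karpenter.k8s.aws/instance-family
        operator: In
        values: ["p5"]  # pin to P types (see the quota pitfall below)
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-inference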
Observability & Fallback Strategy
GPU Monitoring Stack
Core Monitoring Metrics
GPU Infrastructure Metrics (DCGM):
| Metric | Description | Threshold |
|---|---|---|
| DCGM_FI_DEV_GPU_UTIL | GPU SM utilization | > 90%: warning, > 95%: critical |
| DCGM_FI_DEV_MEM_COPY_UTIL | Memory copy utilization | > 80%: caution |
| DCGM_FI_DEV_FB_USED | Framebuffer usage | Available memory < 10%: critical |
| DCGM_FI_DEV_POWER_USAGE | GPU power consumption | Caution when approaching TDP |
vLLM Inference Metrics:
| Metric | Description | Threshold |
|---|---|---|
| vllm:gpu_cache_usage_perc | KV Cache usage | > 80%: scale out |
| vllm:num_requests_waiting | Waiting requests | > 10: scale out |
| vllm:time_to_first_token_seconds | TTFT | P95 > 2s: action required |
| vllm:num_preemptions_total | Preemption count | High values indicate memory shortage |
| vllm:avg_generation_throughput_toks_per_s | Generation throughput | Monitor vs baseline |
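As a sketch, two of these thresholds wired into alerting rules (Prometheus Operator PrometheusRule; rule names and severities are illustrative):
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: llm-serving-alerts
spec:
  groups:
  - name: gpu-inference
    rules:
    - alert: GPUUtilCritical
      expr: avg by (gpu) (DCGM_FI_DEV_GPU_UTIL) > 95  # "> 95%: critical" row
      for: 5m
      labels:
        severity: critical
    - alert: TTFTSLOViolation
      expr: |
        histogram_quantile(0.95,
          rate(vllm:time_to_first_token_seconds_bucket[5m])) > 2
      for: 10m
      labels:
        severity: warning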
2-Tier Cost Tracking
Track both infrastructure and application levels for complete cost visibility.
- Bifrost (Infrastructure Level): Token unit price per model, team/project budget management, monthly cost reports
- Langfuse (Application Level): Token consumption per Agent workflow stage, chain end-to-end latency, Trace-based performance bottleneck analysis
This 2-Tier strategy enables simultaneous understanding of "which models were used how much" (infrastructure) and "which features drive costs" (application).
Bifrost → Bedrock Cascade Fallback
When self-hosted models (vLLM/llm-d) are overloaded or failing, Cascade Routing can be configured to fall back automatically to Amazon Bedrock's managed models. Bifrost (or LiteLLM) acts as the Gateway, switching requests to Bedrock on response failures and timeouts.
Bifrost Cascade Routing Configuration:
# bifrost-config.yaml
routing:
defaultModel: self-hosted-qwen3
strategy: cascade
cascadeOrder:
- self-hosted-qwen3 # Primary: EKS Self-hosted (cost optimized)
- self-hosted-glm5 # Secondary: EKS Self-hosted alternative
- bedrock-claude-sonnet # Tertiary: Bedrock managed (fallback)
fallbackConditions:
- statusCode: [500, 502, 503, 504]
- latencyMs: "> 30000" # Fallback if exceeding 30s
- errorRate: "> 0.1" # Fallback if error rate exceeds 10%
models:
- name: self-hosted-qwen3
provider: openai-compatible
baseUrl: http://inference-gateway.llm-d:8080/v1
model: Qwen/Qwen3-32B
priority: 1
costPer1kTokens: 0.001 # Self-hosted estimated cost
- name: self-hosted-glm5
provider: openai-compatible
baseUrl: http://glm5-service.agentic-serving:8000/v1
model: zai-org/GLM-5-FP8
priority: 2
costPer1kTokens: 0.003
- name: bedrock-claude-sonnet
provider: bedrock
model: anthropic.claude-sonnet-4-20250514
region: us-east-1
priority: 3
costPer1kTokens: 0.003 # Bedrock official pricing
maxTokens: 4096
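Since every tier speaks the OpenAI-compatible API, the gateway path can be smoke-tested with plain curl (the gateway address and port below are illustrative):
# Smoke test: the gateway should route to self-hosted-qwen3 first
curl -s http://bifrost.gateway:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "self-hosted-qwen3", "messages": [{"role": "user", "content": "ping"}]}'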
Advantages of Cascade Routing:
| Perspective | Self-hosted Only | Cascade (Self-hosted + Bedrock) |
|---|---|---|
| Availability | Service interruption on GPU failure | Uninterrupted via Bedrock fallback |
| Cost | Fixed GPU cost | Regular Self-hosted (low cost) + peak Bedrock (pay-as-you-go) |
| Capacity Planning | Secure GPU for peak traffic | GPU for baseline traffic only, excess to Bedrock |
| Cold Start | Several minutes delay on Spot interruption | Bedrock immediate response |
Processing 80% of regular traffic with Self-hosted and offloading 20% peak to Bedrock eliminates the need to provision GPUs for peak capacity, achieving an additional 30-40% infrastructure cost reduction. Bedrock also serves as immediate backup during Spot instance interruptions.
Hybrid Node: On-Premises GPU Farm Integration
Overview
EKS Hybrid Nodes is a feature for registering on-premises servers as nodes of an EKS cluster (GA December 2024). Existing DGX and GPU servers can be integrated with cloud EKS to build a hybrid inference architecture.
Hybrid Node Registration
# 1. Create Hybrid Node IAM Role
aws iam create-role \
--role-name EKSHybridNodeRole \
--assume-role-policy-document file://hybrid-node-trust-policy.json
aws iam attach-role-policy \
--role-name EKSHybridNodeRole \
--policy-arn arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
# 2. Register the on-premises server with nodeadm (the EKS Hybrid Nodes CLI);
#    Kubernetes version is an example, cluster name/region and SSM credentials
#    live in nodeConfig.yaml (sketch below)
curl -OL 'https://hybrid-assets.eks.amazonaws.com/releases/latest/bin/linux/amd64/nodeadm'
chmod +x nodeadm
sudo ./nodeadm install 1.31 --credential-provider ssm
sudo ./nodeadm init -c file://nodeConfig.yaml
# 3. Verify nodes
kubectl get nodes -l node.kubernetes.io/instance-type=hybrid
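A minimal sketch of the nodeConfig.yaml referenced above, assuming SSM hybrid activations (the activation ID/code come from aws ssm create-activation against the EKSHybridNodeRole; all values are illustrative):
apiVersion: node.eks.aws/v1alpha1
kind: NodeConfig
spec:
  cluster:
    name: genai-platform
    region: us-west-2
  hybrid:
    ssm:
      activationId: <activation-id>
      activationCode: <activation-code>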
Hybrid Node GPU Operator Installation
On-premises nodes lack the AWS-managed GPU stack, so the GPU Operator is required.
# GPU Operator Helm Values (Hybrid Node dedicated)
driver:
enabled: true # On-premises: driver installation required
version: "580.126.18"
nodeSelector:
node.kubernetes.io/instance-type: hybrid
toolkit:
enabled: true
nodeSelector:
node.kubernetes.io/instance-type: hybrid
devicePlugin:
enabled: true # On-premises: Device Plugin installation required
nodeSelector:
node.kubernetes.io/instance-type: hybrid
dcgmExporter:
enabled: true
serviceMonitor:
enabled: true
additionalLabels:
location: on-premises # Separate on-premises/cloud metrics
nodeSelector:
node.kubernetes.io/instance-type: hybrid
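Installation is a standard Helm flow against NVIDIA's official chart repository (release name, namespace, and values file name are illustrative):
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm upgrade --install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  -f values-hybrid-node.yaml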
3-Tier Cascade: On-Prem → Cloud → Bedrock
Combining Hybrid Node with Bifrost Cascade creates a 3-Tier architecture that maximizes both cost efficiency and availability.
| Tier | Infrastructure | Cost Structure | Role |
|---|---|---|---|
| Tier 1 | On-Prem Hybrid Node (DGX) | Fixed cost (already owned) | Handle baseline traffic (always active) |
| Tier 2 | Cloud GPU (EKS Spot/OD) | Variable cost (hourly) | Peak traffic bursts |
| Tier 3 | Amazon Bedrock | Pay-as-you-go (per token) | Failure/overload fallback |
# Bifrost 3-Tier Cascade configuration
routing:
strategy: cascade
cascadeOrder:
- onprem-dgx-llm # Primary: On-Prem (fixed cost, always active)
- cloud-eks-llm # Secondary: Cloud GPU (Spot, elastic)
- bedrock-fallback # Tertiary: Bedrock (pay-as-you-go, unlimited capacity)
models:
- name: onprem-dgx-llm
provider: openai-compatible
baseUrl: http://hybrid-node-vllm.inference:8000/v1
model: Qwen/Qwen3-32B
priority: 1
healthCheck:
endpoint: /health
intervalMs: 10000
- name: cloud-eks-llm
provider: openai-compatible
baseUrl: http://inference-gateway.llm-d:8080/v1
model: Qwen/Qwen3-32B
priority: 2
- name: bedrock-fallback
provider: bedrock
model: anthropic.claude-sonnet-4-20250514
region: us-east-1
priority: 3
Pod Placement Strategy: Workload Separation with nodeSelector
# Deploy on On-Prem Hybrid Node (baseline inference)
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-onprem
spec:
template:
spec:
nodeSelector:
node.kubernetes.io/instance-type: hybrid
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
containers:
- name: vllm
image: vllm/vllm-openai:v0.6.3
args: ["Qwen/Qwen3-32B-FP8", "--gpu-memory-utilization=0.95"]
resources:
limits:
nvidia.com/gpu: 1
---
# Deploy on Cloud GPU Node (burst traffic)
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-cloud-burst
spec:
template:
spec:
nodeSelector:
karpenter.sh/nodepool: gpu-inference # Cloud Karpenter NodePool
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
containers:
- name: vllm
image: vllm/vllm-openai:v0.6.3
args: ["Qwen/Qwen3-32B-FP8", "--gpu-memory-utilization=0.95"]
resources:
limits:
nvidia.com/gpu: 1
Hybrid Node constraints:
- Latency: 10-50ms additional delay over VPN/Direct Connect compared to cloud nodes
- Bandwidth: multi-node NCCL communication requires high bandwidth → pipeline parallelism (PP) within On-Prem is feasible, but On-Prem↔Cloud PP is not recommended
- Recommendation: serve independent models on On-Prem nodes and connect them to Cloud nodes via Cascade Routing at the Gateway level
Lessons Learned: Large MoE Model Deployment
Image/Model Download Failure Mitigation
Downloading large model weights (744GB+) is the most common cold-start bottleneck in LLM serving. Pulling hundreds of GB from HuggingFace Hub frequently fails due to network instability, timeouts, and insufficient disk space.
Problem Types and Responses
| Problem | Symptoms | Response |
|---|---|---|
| HF Hub Download Timeout | Pod CrashLoopBackOff, ConnectionError | Retry + resume support (HF_HUB_ENABLE_HF_TRANSFER=1) |
| Large File Partial Download | Corruption error during model loading | Checksum verification + re-download |
| Slow Container Image Pull | ImagePullBackOff, several minutes wait | Pre-cache images (Bottlerocket data volume, SOCI) |
| Multi-node Simultaneous Download | Network bandwidth contention | S3 caching + init container sequential loading |
| Slow EFS Download | 30+ minutes loading time | Switch to NVMe emptyDir |
Strategy 1: HuggingFace Transfer Acceleration
hf_transfer is a Rust-based high-throughput download library, typically 3-5x faster than the default Python downloader.
env:
- name: HF_HUB_ENABLE_HF_TRANSFER
value: "1"
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: hf-token
key: token
# Extend the per-request download timeout
- name: HF_HUB_DOWNLOAD_TIMEOUT
value: "600" # 10 minute timeout
Strategy 2: S3 Pre-caching + Init Container
This is the most stable method: pre-upload the model weights to S3, then copy them to local NVMe in an init container.
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-with-s3-cache
spec:
template:
spec:
initContainers:
# Stage 1: Download model from S3 to NVMe
- name: model-downloader
image: amazon/aws-cli:latest
command: ["/bin/sh", "-c"]
args:
- |
echo "Checking local cache..."
if [ -f /models/config.json ]; then
echo "Model already cached, skipping download"
exit 0
fi
echo "Downloading model from S3..."
          aws s3 sync s3://model-cache/qwen3-32b-fp8/ /models/ --no-progress
          echo "Download complete, verifying..."
          # Basic integrity check: the safetensors index must exist
          # (a stricter check would verify file checksums against a manifest)
          if [ -f /models/model.safetensors.index.json ]; then
            echo "Model verified successfully"
          else
            echo "ERROR: Model incomplete, retrying..."
            rm -rf /models/*
            aws s3 sync s3://model-cache/qwen3-32b-fp8/ /models/ --no-progress
          fi
volumeMounts:
- name: model-cache
mountPath: /models
resources:
requests:
cpu: 2
memory: 4Gi
containers:
- name: vllm
image: vllm/vllm-openai:v0.6.3
args:
        - "--model"
        - /models
- "--gpu-memory-utilization=0.95"
volumeMounts:
- name: model-cache
mountPath: /models
volumes:
- name: model-cache
emptyDir:
sizeLimit: 200Gi # NVMe emptyDir
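Seeding the S3 cache is a one-time job, e.g. from a CI runner or bastion host (bucket name and paths are illustrative):
# One-time seeding of the S3 model cache
pip install -U "huggingface_hub[cli]" hf_transfer
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download \
  Qwen/Qwen3-32B-FP8 --local-dir ./qwen3-32b-fp8
aws s3 sync ./qwen3-32b-fp8/ s3://model-cache/qwen3-32b-fp8/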
Strategy 3: Container Image Pre-caching
Methods to reduce Pull time for vLLM/SGLang images (10-20GB).
# Keep pulled images cached longer by raising kubelet image GC thresholds
# (in Karpenter v1, kubelet settings live on the EC2NodeClass)
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-inference
spec:
  kubelet:
    imageGCHighThresholdPercent: 90
    imageGCLowThresholdPercent: 85
Using SOCI (Seekable OCI) Index:
Creating a SOCI index in ECR enables lazy loading of image layers at pull time, reducing container start time by 70-80%.
# Create and push a SOCI index with the soci CLI (from soci-snapshotter);
# the image must already be in the local containerd content store
sudo soci create 123456789012.dkr.ecr.us-east-2.amazonaws.com/vllm:v0.6.3
sudo soci push 123456789012.dkr.ecr.us-east-2.amazonaws.com/vllm:v0.6.3
# EKS Auto Mode supports SOCI automatically
# Karpenter: native SOCI support when using the Bottlerocket AMI
Strategy 4: Multi-node LWS Model Download Coordination
When deploying multi-node with LWS, network bandwidth contention occurs if the Leader and all Workers download the same model simultaneously. Since NVMe emptyDir is node-local, each Pod must fetch its own copy; the practical mitigation is to stagger the downloads.
# Leader Pod: downloads from S3 immediately, caching to its local NVMe
initContainers:
- name: model-downloader
  command: ["/bin/sh", "-c"]
  args:
  - |
    aws s3 sync s3://model-cache/glm5-fp8/ /models/
# Worker Pods: stagger start, then download independently
# (NVMe emptyDir is node-local and cannot be shared with the Leader)
initContainers:
- name: model-downloader
  command: ["/bin/sh", "-c"]
  args:
  - |
    # WORKER_INDEX is assumed injected, e.g. via the downward API
    # from the leaderworkerset.sigs.k8s.io/worker-index label
    sleep $(( ${WORKER_INDEX:-1} * 30 ))
    aws s3 sync s3://model-cache/glm5-fp8/ /models/
| Method | 744GB Model Time | Stability | Cost |
|---|---|---|---|
| HF Hub Direct | 20-40min | Frequent timeouts | Free |
| HF Hub + hf_transfer | 10-15min | Good | Free |
| S3 Pre-caching | 5-10min | Very Stable | S3 Storage Cost |
| FSx for Lustre | 5-8min | Stable | High |
| NVMe Local Cache (Restart) | < 1min | Best | Free |
EKS Auto Mode GPU Limitations
Core limitations identified during GLM-5 (744B MoE) and Kimi K2.5 (1T MoE) deployments.
p6-b200 Not Supported
As of April 2026, EKS Auto Mode's managed Karpenter cannot provision p6-b200.48xlarge. NodePool validation passes, but the actual NodeClaim creation fails with a NoCompatibleInstanceTypes error.
GPU Instance Capacity Acquisition
p5.48xlarge frequently hits InsufficientCapacity in the Seoul/Tokyo regions, but is available as Spot in us-east-2 (Ohio) for $13-15/hr (an ~85% reduction vs the ~$98/hr On-Demand price).
| Region | p5.48xlarge On-Demand | p5.48xlarge Spot | Spot Price |
|---|---|---|---|
| ap-northeast-2 (Seoul) | InsufficientCapacity | Unconfirmed | — |
| ap-northeast-1 (Tokyo) | InsufficientCapacity | Unconfirmed | — |
| us-east-2 (Ohio) | Variable availability | Available | $13~15/hr |
GPU Operator Conflict
Installing the GPU Operator with devicePlugin.enabled=true conflicts with Auto Mode's built-in Device Plugin, resulting in allocatable=0. Install with devicePlugin.enabled=false instead, as sketched below.
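A sketch of the corresponding Helm invocation (release/namespace names are illustrative; driver.enabled=false is a common companion setting on Auto Mode, whose node images already ship the NVIDIA driver):
helm upgrade --install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set devicePlugin.enabled=false \
  --set driver.enabled=false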
Cannot Directly Terminate EC2 Instances
Auto Mode managed nodes block ec2:TerminateInstances via resource-based policy. Node cleanup must be performed indirectly through Karpenter NodePool deletion or Pod removal.
Serving Framework Compatibility
| Model | vLLM Support | SGLang Support | Notes |
|---|---|---|---|
| Qwen3-32B | Supported | Supported | llm-d default model, Apache 2.0 |
| Kimi K2.5 (1T MoE) | Supported | Supported | INT4 W4A16 Marlin MoE, gpu_memory_utilization=0.85 |
| GLM-5 (744B MoE) | Not supported | Supported | glm_moe_dsa architecture → requires transformers v5.2+, vLLM uses v4.x |
| DeepSeek V3.2 | Supported | Supported | MoE, 671B/37B active |
GLM-5 is not supported by vLLM. Use the dedicated SGLang image (lmsysorg/sglang:glm5-hopper) and configure --pp-size 2 --nnodes 2 --dist-init-addr <leader>:5000 for multi-node deployment, as in the launch sketch below.
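A launch sketch for that multi-node SGLang deployment (flags from the note above; the leader host name and node ranks are illustrative):
# Leader (node rank 0)
python3 -m sglang.launch_server \
  --model-path zai-org/GLM-5-FP8 \
  --pp-size 2 --nnodes 2 --node-rank 0 \
  --dist-init-addr glm5-leader:5000
# Worker (node rank 1)
python3 -m sglang.launch_server \
  --model-path zai-org/GLM-5-FP8 \
  --pp-size 2 --nnodes 2 --node-rank 1 \
  --dist-init-addr glm5-leader:5000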
Storage Strategy
Storage performance is critical for large model (744GB+) weight loading.
| Storage | Sequential Read | Multi-node Sharing | Recommended Scenario |
|---|---|---|---|
| NVMe emptyDir | ~3,500 MB/s | Node-independent | p5 built-in NVMe, best performance |
| EFS | ~100-300 MB/s | ReadWriteMany | Small models, when sharing needed |
| S3 + init container | ~1,000 MB/s | S3 shared | Medium performance, cost efficient |
| FSx for Lustre | ~1,000+ MB/s | ReadWriteMany | Training workloads |
For large models like GLM-5 (744GB) and Kimi K2.5 (630GB), local NVMe (emptyDir) is recommended. p5.48xlarge has 8×3.84TB NVMe SSDs built in, providing the best performance at no additional cost. First startup still pays the download cost (see the comparison table above), but subsequent loads on the same node are fast.
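To actually expose the p5's instance-store NVMe to emptyDir, Karpenter can assemble the local disks into a RAID0 array via the EC2NodeClass (sketch; the node class name is illustrative):
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-inference
spec:
  instanceStorePolicy: RAID0  # RAID0 over local NVMe backs kubelet/emptyDir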
GPU Quota Pitfall
EC2 vCPU quotas are tracked in separate buckets per instance family group, which is an easy pitfall.
| Quota | Applicable Instances | Default | Caution |
|---|---|---|---|
| Running On-Demand P instances | p4d, p5, p5en | 384 | Can have 2 p5.48xlarge (192 vCPU each) |
| Running On-Demand G and VT instances | g5, g6, g6e | 64 | Cannot even have 1 g6e.48xlarge → quota increase required |
Setting karpenter.k8s.aws/instance-category: [g, p] together in a GPU NodePool may cause Karpenter to try G types first and hit the G quota (64 vCPU). If only P types are needed, pin them explicitly.
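Both buckets can be checked with the Service Quotas CLI (the quota codes shown are the EC2 codes for these buckets; verify them in your account):
# Running On-Demand P instances (vCPU)
aws service-quotas get-service-quota --service-code ec2 --quota-code L-417C2ED4
# Running On-Demand G and VT instances (vCPU)
aws service-quotas get-service-quota --service-code ec2 --quota-code L-DB2E81BA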
References
Official Documentation
- KEDA Documentation — Kubernetes Event-driven Autoscaling
- Karpenter Documentation — Node auto-provisioning, Disruption, Consolidation
- EKS Hybrid Nodes — On-premises GPU farm integration
- NVIDIA DCGM Exporter — GPU sensor metrics collection
- Langfuse Self-hosted — Agent observability OSS
Papers & Technical Blogs
- a16z "The Economics of AI" — GPU cost structure analysis
- AWS Bottlerocket & SOCI — Container image lazy-loading
- Spot Instance Operations Guide (AWS) — Karpenter Spot interruption response
- NVIDIA Triton & DCGM Metrics Guide — GPU metrics interpretation
Related Documentation
- Inference Optimization on EKS (Overview) — Inference optimization category entry point
- KV Cache Optimization (vLLM Deep Dive + Cache-Aware Routing) — vLLM/llm-d/Dynamo deep dive
- Disaggregated Serving + LWS Multi-Node — Prefill/Decode separation, LWS deployment
- GPU Resource Management — GPU scaling, DRA
- NVIDIA GPU Software Stack — GPU Operator, DCGM
- Agent Monitoring (Langfuse Canonical) — Langfuse-based Agent observability