
Inference Optimization on EKS

Overview

In production LLM services, inference accounts for 80-90% of total AI operational expenses (a16z "The Economics of AI", NVIDIA GTC 2024, SemiAnalysis). Training is a one-time expense, but inference runs 24/7 for as long as the service is live, and GPU time translates directly into cost: a single p5.48xlarge (H100 × 8) On-Demand instance costs $98.32/hour, so operating two nodes runs roughly $141,580 per month (2 × $98.32/hr × 720 hr).

This document consolidates architectural patterns for maximizing LLM Inference performance on EKS, based on lessons learned from building a telecommunications carrier's Agentic AI platform and deployment cases of large MoE models such as GLM-5 (744B) and Kimi K2.5 (1T).

Covered Content

This category covers the following topics, spread across this document and three deep-dive documents.

Key Topics by Document

  1. EKS GPU Infrastructure Strategy — Auto Mode vs Karpenter vs MNG selection criteria (this document)
  2. Model Serving Engine — vLLM core technologies and GPU memory design (KV Cache Optimization)
  3. KV Cache-Aware Routing — Comparison of llm-d and NVIDIA Dynamo (KV Cache Optimization)
  4. Disaggregated Serving — Prefill/Decode separation architecture (Disaggregated Serving)
  5. LWS Multi-Node Serving — LeaderWorkerSet-based 700B+ model deployment (Disaggregated Serving)
  6. GPU Resource Management — 2-Tier autoscaling and DRA (Cost · Observability · Hybrid)
  7. Observability & Fallback — GPU monitoring, Bifrost→Bedrock fallback (Cost · Observability · Hybrid)
  8. Hybrid Node — On-premises GPU farm integration with EKS (Cost · Observability · Hybrid)
  9. Lessons Learned — Image download failure mitigation, large MoE deployment pitfalls (Cost · Observability · Hybrid)

Key Performance Metrics

| Metric | Description | Optimization Target |
| --- | --- | --- |
| TTFT (Time to First Token) | Time to generate the first token | < 2s (conversational), < 5s (batch) |
| TPS (Tokens per Second) | Token generation rate | Varies by model |
| GPU Utilization | GPU compute utilization | > 70% |
| KV Cache Hit Rate | KV cache reuse ratio | > 60% (shared prompts) |
| P99 Latency | 99th percentile response time | Adhere to SLO requirements |
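
These targets map onto metrics that vLLM and DCGM Exporter already expose to Prometheus, so they can be enforced as alerts. A minimal sketch, assuming the kube-prometheus-stack PrometheusRule CRD and the default vLLM/DCGM metric names (the alert names, namespace, and thresholds are illustrative):

# Sketch: encode the TTFT and GPU utilization targets above as Prometheus alerts
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: llm-inference-slo
  namespace: monitoring
spec:
  groups:
    - name: llm-inference
      rules:
        - alert: TTFTP99TooHigh        # P99 TTFT above the 2s conversational target
          expr: |
            histogram_quantile(0.99, sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le)) > 2
          for: 10m
        - alert: GPUUnderutilized      # fleet-wide GPU compute utilization below 70%
          expr: avg(DCGM_FI_DEV_GPU_UTIL) < 70
          for: 30m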

EKS GPU Infrastructure Strategy

Three Deployment Model Comparison

When running GPU workloads on EKS, capabilities and operational complexity vary significantly depending on the node management approach.

| Criteria | EKS Auto Mode | Karpenter + GPU Operator | MNG + Cluster Autoscaler |
| --- | --- | --- | --- |
| GPU Driver Management | AWS managed | Pre-installed in AMI | Pre-installed in AMI |
| MIG / Time-Slicing | Not possible | Supported | Supported |
| DRA Compatibility | Not supported | Not supported | Only option |
| DCGM Monitoring | Possible with GPU Operator | Fully supported | Fully supported |
| Operational Complexity | Low | Medium | Medium |
| Suitable Model Size | 70B+ (full GPU utilization) | 7B~700B+ (MIG partitioning) | DRA-required workloads |

Selection Guide
  • Quick Start / PoC: Auto Mode — automatic GPU driver and Device Plugin management
  • Production (fine-grained GPU control): Karpenter + GPU Operator — MIG and custom AMI support (see the NodePool sketch below)
  • When DRA is required: MNG + Cluster Autoscaler — Karpenter and Auto Mode do not yet provision nodes for Pods that use DRA resource claims
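
For the Karpenter + GPU Operator path, a dedicated NodePool keeps GPU capacity isolated from general workloads and tainted so that only inference Pods land on it. A minimal sketch, assuming Karpenter v1 CRDs (the pool name, instance families, limits, and EC2NodeClass name are illustrative):

# Sketch: Karpenter NodePool for GPU inference nodes
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-inference
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["g6e", "p5"]        # pick families from the instance matrix below
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      taints:
        - key: nvidia.com/gpu          # keep non-GPU workloads off these nodes
          value: "true"
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu                      # illustrative EC2NodeClass with a GPU-ready AMI
  limits:
    nvidia.com/gpu: 16                 # cap the total GPUs this pool may provision
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 5m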

GPU Instance Selection Matrix

| Instance | GPU | GPU Memory (Total) | Suitable Model Size | Hourly Cost (On-Demand) |
| --- | --- | --- | --- | --- |
| g5.xlarge~48xlarge | A10G | 24~192GB | ≤7B | $1.01~$16.29 |
| g6e.xlarge~48xlarge | L40S | 48~384GB | 13B~70B | Cost-effective |
| p4d.24xlarge | A100 40GB × 8 | 320GB | 13B~70B | $32.77 |
| p5.48xlarge | H100 80GB × 8 | 640GB | 70B~700B+ | $98.32 |
| p5e.48xlarge | H200 141GB × 8 | 1,128GB | 100B+ | Maximum memory |
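
Once a model has been sized against this matrix, the simplest way to guarantee it lands on the intended hardware is an instance-type nodeSelector plus a GPU resource request. A minimal sketch for a mid-size model on a g6e.12xlarge (4× L40S); the Pod name and image are illustrative:

# Sketch: pin a model server to a specific instance type and claim its GPUs
apiVersion: v1
kind: Pod
metadata:
  name: mid-size-llm
spec:
  nodeSelector:
    node.kubernetes.io/instance-type: g6e.12xlarge    # 4× L40S, 192GB GPU memory total
  tolerations:
    - key: nvidia.com/gpu                             # matches the GPU NodePool taint
      operator: Exists
      effect: NoSchedule
  containers:
    - name: server
      image: vllm/vllm-openai:latest                  # illustrative serving image
      resources:
        limits:
          nvidia.com/gpu: 4                           # request all GPUs on the node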

Auto Mode GPU Operator Hybrid Configuration

The GPU Operator can be installed on Auto Mode: disable only the Device Plugin via node labels, and DCGM Exporter, NFD, and GFD continue to operate normally.

# Add the NVIDIA Helm repository
helm repo add nvidia https://helm.nvidia.com/nvidia && helm repo update

# GPU Operator installation (Auto Mode compatible)
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set driver.enabled=false \
  --set toolkit.enabled=false

# Add the Device Plugin disable label to the NodePool:
#   nvidia.com/gpu.deploy.device-plugin: "false"

This maintains Auto Mode convenience while collecting granular DCGM metrics (SM utilization, NVLink bandwidth). ClusterPolicy-dependent projects like KAI Scheduler are also usable.
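
The per-node disable is best applied through the NodePool template so every GPU node comes up already labeled. A minimal sketch, assuming an EKS Auto Mode NodePool referencing the default Auto Mode NodeClass (the pool name and requirements are illustrative):

# Sketch: Auto Mode NodePool whose nodes opt out of the GPU Operator's Device Plugin
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-auto
spec:
  template:
    metadata:
      labels:
        nvidia.com/gpu.deploy.device-plugin: "false"   # Auto Mode's built-in plugin stays in charge
    spec:
      requirements:
        - key: eks.amazonaws.com/instance-family
          operator: In
          values: ["g6e", "p5"]
      nodeClassRef:
        group: eks.amazonaws.com
        kind: NodeClass
        name: default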

GPU Operator + Auto Mode Caution

Installing the GPU Operator with devicePlugin.enabled=true conflicts with Auto Mode's built-in Device Plugin and leaves nvidia.com/gpu allocatable at 0. Disable it with devicePlugin.enabled=false or via the node label shown above.

Decision Flow

| Tier | Model Scale | Infrastructure | Serving Engine | Routing | Examples |
| --- | --- | --- | --- | --- | --- |
| Tier 1 | ≤32B | Auto Mode, g6e/p5 | vLLM (Single GPU) | Round-Robin | Qwen3-32B FP8 |
| Tier 2 | 70B~200B | Karpenter + GPU Operator | vLLM TP=4~8 | llm-d KV Cache-aware | Llama-3.3-70B |
| Tier 3 | 700B+ MoE | MNG or Karpenter + LWS | vLLM/SGLang PP+TP | Disaggregated + NIXL | GLM-5, Kimi K2.5 |

Common to all tiers: Bifrost Cascade Routing with Bedrock fallback is recommended, so service continues uninterrupted through GPU failures and Spot interruptions.
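
As a concrete reference point for Tier 2, the following Deployment sketch runs a 70B-class model with tensor parallelism across the 8 GPUs of a single p5.48xlarge. The model id, image tag, and flags are illustrative and should be checked against the vLLM version in use:

# Sketch: Tier 2 serving Deployment (vLLM, TP=8 on one p5.48xlarge)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-70b-tp8
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama-70b-tp8
  template:
    metadata:
      labels:
        app: llama-70b-tp8
    spec:
      nodeSelector:
        node.kubernetes.io/instance-type: p5.48xlarge
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - --model=meta-llama/Llama-3.3-70B-Instruct
            - --tensor-parallel-size=8        # shard the model across all 8 H100s
            - --gpu-memory-utilization=0.90   # fraction of GPU memory vLLM may use (weights + KV cache)
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 8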

Hybrid Architecture: Complete Picture

Migration Path

Phased transitions minimize operational risk while progressively improving performance.

Phase 1: Auto Mode + vLLM + Bifrost→Bedrock fallback → PoC, dev environments

Phase 1.5: Auto Mode + GPU Operator + llm-d → Enhanced monitoring, KV Cache routing

Phase 2: Karpenter + llm-d Disaggregated + LWS multi-node → MIG, Prefill/Decode separation

Phase 3: Karpenter + Dynamo + Hybrid Node → On-premises integration, 3-Tier Cascade

Phase 4: Full integration → On-Prem→Cloud→Bedrock Cascade, SLO-based autoscaling

References

Official Documentation

Papers & Technical Blogs