
llm-d Based EKS Distributed Inference Guide

Current Version: llm-d v0.5+ (2026.03)

Overview

llm-d is an Apache 2.0-licensed Kubernetes-native distributed inference stack led by Red Hat. It combines the vLLM inference engine, Envoy-based Inference Gateway, and Kubernetes Gateway API to provide intelligent inference routing for large language models.

While existing vLLM deployments rely on simple Round-Robin load balancing, llm-d delivers intelligent routing that is KV Cache state-aware, forwarding requests with identical prefixes to Pods that already hold the corresponding KV Cache. This significantly reduces Time To First Token (TTFT) and saves GPU computation.

Production Deployment Guide

For llm-d EKS deployment YAML, helmfile commands, and cluster creation, see the Custom Model Deployment Guide.

llm-d Inference Gateway ≠ General-purpose Gateway API Implementation

llm-d's Envoy-based Inference Gateway is a special-purpose gateway designed exclusively for LLM inference requests.

  • llm-d Gateway: InferenceModel/InferencePool CRD-based, KV Cache-aware routing, inference traffic only
  • General Gateway API: HTTPRoute/GRPCRoute-based, TLS/auth/Rate Limiting, cluster-wide traffic management

In production, the recommended architecture has a general Gateway API implementation handling the cluster entry point, with llm-d optimizing AI inference traffic underneath.
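
For reference, a minimal sketch of this layering, assuming the Gateway API Inference Extension's support for InferencePool backends; resource names, namespaces, and the /v1 path prefix are placeholders and may differ in your llm-d release.

```yaml
# A general-purpose Gateway owns the cluster entry point (TLS, auth, rate
# limiting), and an HTTPRoute hands inference traffic to the llm-d
# InferencePool, which then applies KV Cache-aware routing.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-inference
  namespace: llm-d
spec:
  parentRefs:
    - name: cluster-gateway          # general-purpose Gateway at the entry point
      namespace: gateway-system
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /v1               # OpenAI-compatible inference endpoints
      backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: qwen3-32b-pool       # see the CRD example later in this guide
```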

llm-d's 3 Well-Lit Paths

llm-d provides three validated deployment paths.

  • Intelligent Inference Scheduling (recommended): intelligent request distribution with KV Cache-aware routing. 📌 General-purpose LLM serving (this guide)
  • Prefill/Decode Disaggregation: separates the Prefill and Decode stages into distinct Pod groups. 📌 Large batch processing, long context handling
  • Wide Expert-Parallelism: distributes MoE model Experts across multiple nodes. 📌 MoE models (Mixtral, DeepSeek, etc.)

Architecture

llm-d's Intelligent Inference Scheduling architecture combines the Envoy-based Inference Gateway, an Endpoint Picker (EPP) that makes the KV Cache-aware routing decisions, and the vLLM Pod pool defined by an InferencePool.

llm-d vs Traditional vLLM Deployment Comparison

| Feature | Traditional vLLM Deployment | llm-d Deployment |
| --- | --- | --- |
| Routing Method | Round-Robin / Random | KV Cache-aware intelligent routing |
| Gateway Integration | Separate Ingress/Service configuration | Native Gateway API integration |
| Scaling Management | Manual HPA configuration | Automatic management via InferencePool |
| KV Cache Utilization | Independent management per Pod | Cross-Pod prefix reuse for reduced TTFT |
| Installation Method | Combining individual Helm charts | Unified helmfile deployment (single command) |
| Model Definition | Writing Deployment YAML directly | Declarative management via InferenceModel CRD |

Gateway API CRD

llm-d uses Kubernetes Gateway API and Inference Extension CRDs.

Installed CRDs:

  • Gateway: defines Envoy-based proxy instances
  • HTTPRoute: defines routing rules
  • InferencePool: defines vLLM Pod groups (serving endpoint pools)
  • InferenceModel: maps model names to InferencePools
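
As a concrete illustration, a minimal InferencePool/InferenceModel pair for the default Qwen3-32B deployment might look like the sketch below. Field names follow the Gateway API Inference Extension (v1alpha2); the resource names, Pod label, and EPP Service reference are placeholders, and in a real install these objects are rendered by the llm-d helmfile rather than written by hand.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen3-32b-pool
  namespace: llm-d
spec:
  targetPortNumber: 8000       # vLLM serving port
  selector:
    app: vllm-qwen3-32b        # label on the vLLM Pods in this pool
  extensionRef:
    name: qwen3-32b-epp        # Endpoint Picker (EPP) Service that does the routing
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen3-32b
  namespace: llm-d
spec:
  modelName: Qwen/Qwen3-32B    # model name clients send in requests
  poolRef:
    name: qwen3-32b-pool       # route this model to the pool above
```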

Default Deployment Configuration

| Setting | Default Value | Description |
| --- | --- | --- |
| Model | Qwen/Qwen3-32B | Apache 2.0, BF16 ~65GB VRAM |
| vLLM Version | v0.6+ | CUDA 12.x support, H100/H200 optimized |
| Tensor Parallelism | TP=2 | 2 GPUs per replica |
| Replicas | 8 | 16 GPUs total (2× p5.48xlarge) |
| Max Model Length | 32,768 | Maximum context length |
| GPU Memory Utilization | 0.90 | KV Cache allocation ratio |
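
These defaults map directly onto vLLM's standard serving flags. A minimal container sketch is shown below, assuming the stock vllm-openai image; in an actual llm-d deployment the same values are set through the helmfile values rather than a hand-written Deployment, and the image tag is a placeholder.

```yaml
containers:
  - name: vllm
    image: vllm/vllm-openai:latest        # pin a CUDA 12.x / v0.6+ tag in practice
    args:
      - --model=Qwen/Qwen3-32B
      - --tensor-parallel-size=2          # TP=2 → 2 GPUs per replica
      - --max-model-len=32768             # maximum context length
      - --gpu-memory-utilization=0.90     # fraction of VRAM reserved, including KV Cache
    ports:
      - name: http
        containerPort: 8000               # OpenAI-compatible API and /metrics
    resources:
      limits:
        nvidia.com/gpu: "2"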

Qwen3-32B Model Selection Rationale

Why Qwen3-32B was selected:

  • Model Name: Qwen/Qwen3-32B
  • Parameters: 32B (Dense)
  • License: Apache 2.0
  • Precision: BF16 (~65GB VRAM)
  • Context: up to 32,768 tokens
  • Features: official default model for llm-d, excellent multilingual support, one of the most widely adopted open-source LLMs
Qwen3-32B Selection Background

Qwen3-32B is llm-d's official default model and is Apache 2.0-licensed for free commercial use. Requiring ~65GB VRAM at BF16, it can be stably served with TP=2 (2x GPU) on H100 80GB.


KV Cache-aware Routing

The core differentiator of llm-d is intelligent routing that is aware of KV Cache state.

Routing Operation Principles

  1. Request reception: Client sends inference request to Inference Gateway
  2. Prefix analysis: Gateway hashes the request's prompt prefix for identification
  3. Cache lookup: Checks KV Cache state of each vLLM Pod to find Pods holding the prefix
  4. Intelligent routing: Routes to matching Pod on cache hit; load-balanced on miss
  5. Response return: vLLM returns inference results to client via Gateway

KV Cache-aware Routing Effects

| Metric | Cache Miss (Traditional) | Cache Hit (llm-d) | Improvement |
| --- | --- | --- | --- |
| TTFT (Time To First Token) | High (full prefill required) | Low (prefill skipped) | 50-80% reduction |
| GPU Computation | Full prompt processing | Only new tokens processed | Computation savings |
| Throughput | Baseline | Improved | 1.5-3x improvement |

Maximizing Cache Hit Rate

KV Cache-aware routing is most effective in applications using identical system prompts. For example, in RAG pipelines that repeatedly reference the same context documents, reusing the prefix's KV Cache can significantly reduce TTFT.


EKS Auto Mode Integration

Auto Mode Advantages and Limitations

Advantages:

  • Automatic GPU driver management: AWS automatically installs and updates NVIDIA GPU drivers
  • Automatic NodeClass selection: Using default NodeClass lets Auto Mode auto-select optimal AMI and driver version
  • Operational simplification: Eliminates driver installation, CUDA version management, and driver compatibility verification burden
  • GPU Operator installable: only the Device Plugin is disabled via a node label; DCGM, NFD, and GFD operate normally

Limitations:

  • MIG/Time-Slicing not available: Auto Mode's NodeClass is AWS-managed (read-only), so GPU partitioning configuration is not possible
  • Custom AMI not available: Cannot pin specific CUDA versions or drivers

Auto Mode vs Karpenter + GPU Operator Comparison

Auto Mode is suitable for large model serving without GPU driver management burden, while Karpenter is advantageous for workloads requiring advanced GPU features like MIG/Time-Slicing.

Detailed comparison and cost analysis: See EKS GPU Node Strategy — Node Type Comparison

GPU Instance Specifications

p5.48xlarge Instance Specifications

  • GPU: 8× NVIDIA H100 80GB HBM3
  • GPU Memory: 640GB total
  • vCPU: 192
  • System Memory: 2,048 GiB
  • GPU Interconnect: NVSwitch (900 GB/s)
  • Network: EFA 3,200 Gbps
  • Storage: 8× 3.84TB NVMe SSD

p5e.48xlarge Instance Specifications (H200)

  • GPU: 8× NVIDIA H200 141GB HBM3
  • GPU Memory: 1,128GB total
  • vCPU: 192
  • System Memory: 2,048 GiB
  • GPU Interconnect: NVSwitch (900 GB/s)
  • Network: EFA 3,200 Gbps
  • Storage: 8× 3.84TB NVMe SSD

Instance Selection Guide
  • p5e.48xlarge (H200): 100B+ parameter models, maximum memory utilization
  • p5.48xlarge (H100): 70B+ parameter models, highest performance
  • g6e family (L40S): 13B-70B models, cost-efficient inference
llm-d + DRA Karpenter/Auto Mode Constraints

When llm-d ModelService requests GPUs via DRA (ResourceClaim), node provisioning does not work on Karpenter and EKS Auto Mode. DRA workloads require Managed Node Group + Cluster Autoscaler configuration.

Details: EKS GPU Node Strategy — MNG Hybrid for DRA Workloads


llm-d v0.5+ Key Features

| Feature | Description | Status |
| --- | --- | --- |
| Prefill/Decode Disaggregation | Separates Prefill and Decode into distinct Pod groups, maximizing throughput for large batches and long contexts | GA |
| Expert Parallelism | Distributed serving of MoE model (Mixtral, DeepSeek) Experts across multiple nodes | GA |
| LoRA Adapter Hot-swap | Dynamically load/unload multiple LoRA adapters on a single base model | GA |
| Multi-model Serving | Simultaneously serve multiple models via InferenceModel CRD in a single cluster | GA |
| Gateway API Inference Extension | K8s-native routing based on InferencePool/InferenceModel CRDs | GA |

Disaggregated Serving Concept

Disaggregated Serving separates the two phases of LLM inference for independent optimization:

| Phase | Characteristics | Optimization Direction |
| --- | --- | --- |
| Prefill | Processes the entire prompt at once (compute-bound) | GPU compute focused, high TP |
| Decode | Autoregressive token-by-token generation (memory-bound) | GPU memory focused, low TP |

NIXL (NVIDIA Inference Xfer Library): Common KV transfer engine used by most projects including Dynamo, llm-d, production-stack, and aibrix. Transfers KV Cache at ultra-high speed via direct GPU communication (NVLink/RDMA).

Disaggregated Serving on EKS Auto Mode

Since MIG partitioning is not possible on Auto Mode, Prefill/Decode roles are separated at the instance (node) level.

  • Prefill NodePool (compute-heavy): p5.48xlarge × N → Prefill Pods (TP=4, 4 GPUs each)
  • Decode NodePool (memory-heavy): p5.48xlarge × N → Decode Pods (TP=2, 2 GPUs each, 4 Pods per node)
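
A hedged NodePool sketch for the Prefill side follows, assuming Karpenter-style NodePools as used by Auto Mode; the llm-d.ai/role label key and NodePool name are illustrative, and the Decode NodePool is identical apart from the role value. Prefill and Decode Pods then pin themselves to the right nodes with a matching nodeSelector.

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: prefill-p5
spec:
  template:
    metadata:
      labels:
        llm-d.ai/role: prefill                 # Prefill Pods use a nodeSelector on this label
    spec:
      nodeClassRef:
        group: eks.amazonaws.com               # Auto Mode default NodeClass
        kind: NodeClass
        name: default
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["p5.48xlarge"]
```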

| Item | Auto Mode (Node Separation) | Karpenter + GPU Operator (MIG Separation) |
| --- | --- | --- |
| Separation Unit | Instance (node) | GPU unit (MIG partition) |
| GPU Utilization | Optimizable with Decode Pod TP=2 × 4 per node | High utilization with intra-GPU MIG partitioning |
| Operational Complexity | Low | Medium (GPU Operator + MIG configuration) |
| Scaling | Easy independent Prefill/Decode scaling | Node-level MIG reconfiguration causes disruption |

Minimizing GPU Idle

Recommended strategy: Validate on Auto Mode first, then transition to Karpenter + GPU Operator + MIG when cost optimization is needed.


llm-d vs NVIDIA Dynamo

llm-d and NVIDIA Dynamo both provide LLM inference routing/scheduling but with different approaches. For detailed comparison, see NVIDIA GPU Stack — llm-d vs Dynamo Selection Guide.

| Item | llm-d | NVIDIA Dynamo |
| --- | --- | --- |
| Lead | Red Hat (Apache 2.0) | NVIDIA (Apache 2.0) |
| Architecture | Aggregated + Disaggregated | Aggregated + Disaggregated (equal support) |
| KV Cache Transfer | NIXL (network supported) | NIXL (NVLink/RDMA ultra-fast) |
| KV Cache Indexing | Prefix-aware routing | Flash Indexer (radix tree-based) |
| Routing | Gateway API + Envoy EPP | Dynamo Router + custom EPP (Gateway API integration) |
| Pod Scheduling | K8s default scheduler | KAI Scheduler (GPU-aware Pod placement) |
| Autoscaling | HPA/KEDA integration | Planner (SLO-based: profiling → autoscale) + KEDA/HPA |
| GPU Operator Required | Optional (Auto Mode compatible) | Required (KAI Scheduler's ClusterPolicy dependency) |
| Complexity | Low | High |
| Strengths | K8s-native, lightweight, fast adoption | Flash Indexer, KAI Scheduler, Planner SLO autoscaling |

Selection Guide
  • EKS Auto Mode + quick start: llm-d (GPU Operator optional)
  • Small-medium scale (16 GPUs or less): llm-d
  • Large scale (16+ GPUs), maximum throughput: Dynamo (Flash Indexer + Planner)
  • Long context (128K+): Dynamo (3-tier KV Cache: GPU→CPU→SSD)
  • K8s Gateway API standard compliance: llm-d

Starting with llm-d and transitioning to Dynamo as scale grows is practical. Dynamo 1.0 can integrate llm-d as an internal component, making it more of a superset than a complete alternative.

Migration Path

Phased transition path:

| Phase | Configuration | Suitable For |
| --- | --- | --- |
| Phase 1 | Auto Mode + llm-d | PoC, dev environments, 16 GPUs or less |
| Phase 1.5 | Auto Mode + GPU Operator + llm-d | Enhanced monitoring/scheduling |
| Phase 2a | Karpenter + llm-d Disaggregated | Mid-scale production, MIG utilization |
| Phase 2b | MNG + DRA + llm-d | P6e-GB200, DRA-required environments |
| Phase 3 | Karpenter + Dynamo | Large scale (16+ GPUs), maximum performance |

Transition Notes

Auto Mode and self-managed Karpenter can coexist in the same cluster. In Phase 1.5, add the nvidia.com/gpu.deploy.device-plugin: "false" label to Auto Mode NodePool to prevent Device Plugin conflicts.
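
A minimal sketch of that Phase 1.5 label, assuming an Auto Mode NodePool backed by the default NodeClass; the NodePool name is a placeholder. The label lands on every node the pool provisions, so the GPU Operator skips its Device Plugin DaemonSet there while DCGM, NFD, and GFD still deploy.

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-auto-mode
spec:
  template:
    metadata:
      labels:
        nvidia.com/gpu.deploy.device-plugin: "false"   # Auto Mode already provides a device plugin
    spec:
      nodeClassRef:
        group: eks.amazonaws.com
        kind: NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
```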


Monitoring

Key Monitoring Metrics

| Metric | Description | Normal Range |
| --- | --- | --- |
| vllm_num_requests_running | Number of requests currently being processed | Varies by workload |
| vllm_num_requests_waiting | Number of waiting requests | < 50 |
| vllm_gpu_cache_usage_perc | GPU KV Cache utilization | 60-90% |
| vllm_avg_generation_throughput_toks_per_s | Tokens generated per second | Varies by model/GPU |
| vllm_avg_prompt_throughput_toks_per_s | Prompt tokens processed per second | Varies by model/GPU |
| vllm_e2e_request_latency_seconds | End-to-end request latency | P95 < 30s |
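
A scrape sketch for these metrics using the Prometheus Operator, assuming vLLM exposes /metrics on its serving port (named http, 8000) and that the Pods carry the app: vllm-qwen3-32b label used earlier; the label, port name, and namespace are placeholders for your deployment.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: vllm-metrics
  namespace: llm-d
spec:
  selector:
    matchLabels:
      app: vllm-qwen3-32b
  podMetricsEndpoints:
    - port: http            # name of the container port serving /metrics
      path: /metrics
      interval: 15s
```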

Model Loading Time

Expected Time by Model Loading Method

| Loading Method | Expected Time | Notes |
| --- | --- | --- |
| HuggingFace Hub (initial) | 10-20 min | Varies by network speed |
| S3 Cache | 3-5 min | Loading from same-region S3 |
| Node Local Cache | 1-2 min | When redeploying on the same node |
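
One way to get the node-local cache times above is to point the HuggingFace cache at node-local NVMe, as in the sketch below; the hostPath location and the use of HF_HOME are assumptions rather than llm-d defaults, and the cache only helps when Pods land back on the same node.

```yaml
containers:
  - name: vllm
    env:
      - name: HF_HOME                    # huggingface_hub / vLLM download cache root
        value: /models/hf-cache
    volumeMounts:
      - name: model-cache
        mountPath: /models/hf-cache
volumes:
  - name: model-cache
    hostPath:
      path: /mnt/nvme/hf-cache           # e.g. local NVMe on p5.48xlarge
      type: DirectoryOrCreate
```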

Cost Optimization

Cost Optimization Strategies

| Strategy | Description | Estimated Savings |
| --- | --- | --- |
| Savings Plans | 1-year/3-year Compute Savings Plans commitment | 30-60% |
| Off-Peak Scale Down | Reduce replicas during nights/weekends via CronJob (see the sketch below) | 40-60% |
| Model Quantization | Reduce GPU count with INT8/INT4 | 50% GPU cost |
| Spot Instances | Apply to fault-tolerant workloads (risk of interruption) | 60-90% |
| TP Optimization | Use the minimum TP value appropriate for the model size | Avoids unnecessary GPUs |

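A minimal sketch of the off-peak scale-down row: a CronJob that shrinks the serving Deployment at night, with a matching morning job (not shown) scaling it back up. The Deployment name, namespace, schedule, and the scale-down-sa ServiceAccount (which needs RBAC on deployments/scale) are placeholders.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: llm-off-peak-scale-down
  namespace: llm-d
spec:
  schedule: "0 13 * * *"                      # runs nightly; adjust to your off-peak window (UTC)
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: scale-down-sa
          restartPolicy: OnFailure
          containers:
            - name: kubectl
              image: bitnami/kubectl:latest
              command:
                - kubectl
                - scale
                - deployment/qwen3-32b-decode   # placeholder Deployment name
                - --replicas=2
                - --namespace=llm-d
```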
Cost Caution

p5.48xlarge costs approximately $98.32/hr (us-west-2 On-Demand). Running 2 instances costs ~$141,580/month. Always clean up resources after testing.


EKS Auto Mode GPU Instance Support Status (Verified 2026.04)

Instance Support Matrix

| Instance Type | GPU | VRAM (Total) | Auto Mode Support | Verification Status |
| --- | --- | --- | --- | --- |
| g5.xlarge~48xlarge | A10G | 24~192GB | Normal | Provisioning confirmed |
| g6.xlarge~48xlarge | L4 | 24~192GB | Normal | Provisioning confirmed |
| g6e.xlarge~48xlarge | L40S | 48~384GB | Normal | Provisioning confirmed |
| p4d.24xlarge | A100 40GB × 8 | 320GB | Normal | Dry-run confirmed |
| p5.48xlarge | H100 80GB × 8 | 640GB | Normal | Spot provisioning confirmed (us-east-2) |
| p5en.48xlarge | H200 141GB × 8 | 1,128GB | Limited | Dry-run passes, offering matching may fail |
| p6-b200.48xlarge | B200 192GB × 8 | 1,536GB | Not supported | NoCompatibleInstanceTypes error |

p6 Instance Not Supported

As of April 2026, EKS Auto Mode's managed Karpenter cannot provision p6-b200.48xlarge. Use EKS Standard Mode + Karpenter if p6 instances are needed.

Per-Region GPU Capacity Availability

| Region | p5.48xlarge On-Demand | p5.48xlarge Spot | Spot Price |
| --- | --- | --- | --- |
| ap-northeast-2 (Seoul) | InsufficientCapacity | Unconfirmed | - |
| us-east-2 (Ohio) | Availability varies | Successfully acquired | $13-15/hr |

Spot Price Comparison (us-east-2, 2026.04): p5 instances offer 85-90% cost savings on Spot. For detailed pricing, see EKS GPU Node Strategy — Spot Price Comparison.

GPU Quota Notes

| Quota Name | Applicable Instances | Default (vCPUs) |
| --- | --- | --- |
| Running On-Demand P instances | p4d, p4de, p5, p5en | 384 |
| Running On-Demand G and VT instances | g5, g6, g6e | 64 |

G Instance Quota Trap

When setting instance-category: [g, p] together in GPU NodePool, Karpenter may try G-type instances first. To use P-type only, explicitly specify instance-category: [p].
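
A minimal NodePool requirements sketch for the P-only case, assuming EKS Auto Mode's label key (self-managed Karpenter uses karpenter.k8s.aws/instance-category instead); the NodePool name is a placeholder.

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-p-only
spec:
  template:
    spec:
      nodeClassRef:
        group: eks.amazonaws.com
        kind: NodeClass
        name: default
      requirements:
        - key: eks.amazonaws.com/instance-category
          operator: In
          values: ["p"]                  # P-type only; listing "g" here invites the quota trap
```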

