llm-d Based EKS Distributed Inference Guide
Current Version: llm-d v0.5+ (2026.03)
Overview
llm-d is an Apache 2.0-licensed Kubernetes-native distributed inference stack led by Red Hat. It combines the vLLM inference engine, Envoy-based Inference Gateway, and Kubernetes Gateway API to provide intelligent inference routing for large language models.
While existing vLLM deployments rely on simple Round-Robin load balancing, llm-d delivers intelligent routing that is KV Cache state-aware, forwarding requests with identical prefixes to Pods that already hold the corresponding KV Cache. This significantly reduces Time To First Token (TTFT) and saves GPU computation.
For llm-d EKS deployment YAML, helmfile commands, and cluster creation, see the Custom Model Deployment Guide.
llm-d's Envoy-based Inference Gateway is a special-purpose gateway designed exclusively for LLM inference requests.
- llm-d Gateway: InferenceModel/InferencePool CRD-based, KV Cache-aware routing, inference traffic only
- General Gateway API: HTTPRoute/GRPCRoute-based, TLS/auth/Rate Limiting, cluster-wide traffic management
In production, the recommended architecture has a general Gateway API implementation handling the cluster entry point, with llm-d optimizing AI inference traffic underneath.
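As a sketch of that layering (resource names such as cluster-gateway and qwen3-32b-pool are placeholders), an HTTPRoute attached to the general Gateway can hand inference traffic to an llm-d InferencePool:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-inference-route
spec:
  parentRefs:
  - name: cluster-gateway                  # general Gateway at the cluster entry point (TLS, auth, rate limiting)
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /v1                         # OpenAI-compatible inference endpoints
    backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool                  # llm-d takes over with KV Cache-aware endpoint picking
      name: qwen3-32b-pool
```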
llm-d's 3 Well-Lit Paths
llm-d provides three validated deployment paths ("well-lit paths"): Intelligent Inference Scheduling, Prefill/Decode Disaggregation, and Wide Expert Parallelism for MoE models.
Architecture
llm-d's Intelligent Inference Scheduling architecture consists of an Envoy-based Inference Gateway at the front, an Endpoint Picker (EPP) that scores candidate Pods by KV Cache and load state, and an InferencePool of vLLM Pods that serve the requests.
llm-d vs Traditional vLLM Deployment Comparison
| Feature | Traditional vLLM Deployment | llm-d Deployment ✨ |
|---|---|---|
| Routing Method | Round-Robin / Random | KV Cache-aware Intelligent Routing |
| Gateway Integration | Separate Ingress/Service configuration | Native Gateway API integration |
| Scaling Management | Manual HPA configuration | Automatic management via InferencePool |
| KV Cache Utilization | Independent management per Pod | Cross-pod prefix reuse for reduced TTFT |
| Installation Method | Combining individual Helm charts | Unified helmfile deployment (single command) |
| Model Definition | Writing Deployment YAML directly | Declarative management via InferenceModel CRD |
Gateway API CRD
llm-d uses Kubernetes Gateway API and Inference Extension CRDs.
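A minimal sketch of the two core CRDs is shown below; the API version, field names, and resource names are illustrative and should be checked against the Inference Extension release bundled with your llm-d version:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen3-32b-pool
spec:
  targetPortNumber: 8000                   # vLLM OpenAI server port
  selector:
    app: qwen3-32b-vllm                    # label on the vLLM Pods that belong to this pool
  extensionRef:
    name: qwen3-32b-epp                    # Endpoint Picker service that scores Pods per request
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen3-32b
spec:
  modelName: Qwen/Qwen3-32B
  criticality: Critical
  poolRef:
    name: qwen3-32b-pool
```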
Default Deployment Configuration
| Setting | Default Value | Description |
|---|---|---|
| Model | Qwen/Qwen3-32B | Apache 2.0, BF16 ~65GB VRAM |
| vLLM Version | v0.6+ | CUDA 12.x support, H100/H200 optimized |
| Tensor Parallelism | TP=2 | 2 GPUs per replica |
| Replicas | 8 | 16 GPUs total (2× p5.48xlarge) |
| Max Model Length | 32,768 | Maximum context length |
| GPU Memory Utilization | 0.90 | Fraction of GPU memory vLLM may use (weights + KV Cache) |
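The defaults in the table map directly onto standard vLLM server flags. The fragment below is illustrative only; the actual Deployment is generated by the llm-d helmfile, and the container name and image tag are placeholders:

```yaml
containers:
- name: vllm
  image: vllm/vllm-openai:latest           # placeholder tag; pin the version your llm-d release expects
  args:
  - --model=Qwen/Qwen3-32B
  - --tensor-parallel-size=2               # TP=2: one replica spans 2 GPUs
  - --max-model-len=32768
  - --gpu-memory-utilization=0.90          # fraction of VRAM vLLM may use for weights + KV Cache
  resources:
    limits:
      nvidia.com/gpu: "2"
```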
Qwen3-32B Model Selection Rationale
Qwen3-32B is llm-d's official default model and is Apache 2.0-licensed for free commercial use. Requiring ~65GB VRAM at BF16, it can be stably served with TP=2 (2x GPU) on H100 80GB.
KV Cache-aware Routing
The core differentiator of llm-d is intelligent routing that is aware of KV Cache state.
Routing Operation Principles
- Request reception: Client sends inference request to Inference Gateway
- Prefix analysis: Gateway hashes the request's prompt prefix for identification
- Cache lookup: Checks KV Cache state of each vLLM Pod to find Pods holding the prefix
- Intelligent routing: Routes to matching Pod on cache hit; load-balanced on miss
- Response return: vLLM returns inference results to client via Gateway
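The scoring logic in steps 2-4 lives in the gateway's Endpoint Picker (EPP), which weighs candidate Pods with pluggable scorers. The sketch below is a rough illustration only; the kind, plugin names, and weights are assumptions, so consult the llm-d / Gateway API Inference Extension release you deploy for the exact configuration schema:

```yaml
# Illustrative only: plugin names and schema are assumptions, not verified against a specific release.
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
- type: prefix-cache-scorer        # favors Pods likely to hold the prompt prefix's KV Cache
- type: queue-scorer               # penalizes Pods with deep request queues
- type: kv-cache-scorer            # prefers Pods with free KV Cache blocks
- type: max-score-picker           # picks the highest-scoring Pod
schedulingProfiles:
- name: default
  plugins:
  - pluginRef: prefix-cache-scorer
    weight: 2
  - pluginRef: queue-scorer
    weight: 1
  - pluginRef: kv-cache-scorer
    weight: 1
  - pluginRef: max-score-picker
```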
KV Cache-aware Routing Effects
| Metric | Cache Miss (Traditional) | Cache Hit (llm-d) | Improvement |
|---|---|---|---|
| TTFT (Time To First Token) | High (full prefill required) | Low (prefill skipped) | 50-80% reduction |
| GPU Computation | Full prompt processing | Only new tokens processed | Computation savings |
| Throughput | Baseline | Improved | 1.5-3x improvement |
KV Cache-aware routing is most effective in applications using identical system prompts. For example, in RAG pipelines that repeatedly reference the same context documents, reusing the prefix's KV Cache can significantly reduce TTFT.
EKS Auto Mode Integration
Auto Mode Advantages and Limitations
Advantages:
- Automatic GPU driver management: AWS automatically installs and updates NVIDIA GPU drivers
- Automatic NodeClass selection: Using defaultNodeClass lets Auto Mode auto-select the optimal AMI and driver version
- Operational simplification: Eliminates the burden of driver installation, CUDA version management, and driver compatibility verification
- GPU Operator installable: Only Device Plugin disabled via label; DCGM/NFD/GFD operate normally
Limitations:
- MIG/Time-Slicing not available: Auto Mode's NodeClass is AWS-managed (read-only), so GPU partitioning configuration is not possible
- Custom AMI not available: Cannot pin specific CUDA versions or drivers
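To make the NodeClass point concrete, a minimal Auto Mode GPU NodePool sketch is shown below. It assumes the karpenter.sh/v1 NodePool API that Auto Mode exposes and the AWS-managed default NodeClass; the name and instance selection are placeholders:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-ondemand
spec:
  template:
    spec:
      nodeClassRef:
        group: eks.amazonaws.com
        kind: NodeClass
        name: default                      # AWS-managed NodeClass: AMI and NVIDIA driver are handled for you
      requirements:
      - key: eks.amazonaws.com/instance-family
        operator: In
        values: ["p5"]
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["on-demand"]
      taints:
      - key: nvidia.com/gpu
        value: "true"
        effect: NoSchedule                 # keeps non-GPU Pods off the expensive nodes
```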
Auto Mode vs Karpenter + GPU Operator Comparison
Auto Mode is suitable for large model serving without GPU driver management burden, while Karpenter is advantageous for workloads requiring advanced GPU features like MIG/Time-Slicing.
Detailed comparison and cost analysis: See EKS GPU Node Strategy — Node Type Comparison
GPU Instance Specifications
- p5e.48xlarge (H200): 100B+ parameter models, maximum memory utilization
- p5.48xlarge (H100): 70B+ parameter models, highest performance
- g6e family (L40S): 13B-70B models, cost-efficient inference
When llm-d ModelService requests GPUs via DRA (ResourceClaim), neither Karpenter nor EKS Auto Mode currently provisions nodes for the claim. DRA workloads require a Managed Node Group + Cluster Autoscaler configuration.
Details: EKS GPU Node Strategy — MNG Hybrid for DRA Workloads
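For context, a minimal sketch of what a DRA GPU request looks like, assuming Kubernetes 1.32+ (resource.k8s.io/v1beta1) and the NVIDIA DRA driver's gpu.nvidia.com DeviceClass; names are placeholders:

```yaml
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com    # DeviceClass installed by the NVIDIA DRA driver
---
# Pod fragment: the claim replaces the classic nvidia.com/gpu resource request,
# which is why Karpenter / Auto Mode cannot see a GPU requirement to provision for.
spec:
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu
  containers:
  - name: vllm
    resources:
      claims:
      - name: gpu
```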
llm-d v0.5+ Key Features
| Feature | Description | Status |
|---|---|---|
| Prefill/Decode Disaggregation | Separate Prefill and Decode into distinct Pod groups, maximizing throughput for large batches and long contexts | GA |
| Expert Parallelism | Distributed serving of MoE model (Mixtral, DeepSeek) Experts across multiple nodes | GA |
| LoRA Adapter Hot-swap | Dynamically load/unload multiple LoRA adapters on a single base model | GA |
| Multi-model Serving | Simultaneously serve multiple models via InferenceModel CRD in a single cluster | GA |
| Gateway API Inference Extension | K8s-native routing based on InferencePool/InferenceModel CRDs | GA |
Disaggregated Serving Concept
Disaggregated Serving separates the two phases of LLM inference for independent optimization:
| Phase | Characteristics | Optimization Direction |
|---|---|---|
| Prefill | Processes entire prompt at once (compute-bound) | GPU computing focused, high TP |
| Decode | Autoregressive token-by-token generation (memory-bound) | GPU memory focused, low TP |
NIXL (NVIDIA Inference Xfer Library): Common KV transfer engine used by most projects including Dynamo, llm-d, production-stack, and aibrix. Transfers KV Cache at ultra-high speed via direct GPU communication (NVLink/RDMA).
Disaggregated Serving on EKS Auto Mode
Since MIG partitioning is not possible on Auto Mode, Prefill/Decode roles are separated at the instance (node) level.
- Prefill NodePool (compute-heavy): p5.48xlarge x N → Prefill Pods (each TP=4, 4 GPUs; 2 Pods/node)
- Decode NodePool (memory-heavy): p5.48xlarge x N → Decode Pods (each TP=2, 2 GPUs; 4 Pods/node)
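One way to express this node-level separation on Auto Mode is two NodePools whose node labels the Prefill and Decode Pods select. The sketch below is illustrative; the workload-role label and NodePool names are placeholders, not official llm-d labels:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: prefill-pool
spec:
  template:
    metadata:
      labels:
        workload-role: prefill             # Prefill Pods (TP=4) use nodeSelector: workload-role: prefill
    spec:
      nodeClassRef:
        group: eks.amazonaws.com
        kind: NodeClass
        name: default
      requirements:
      - key: eks.amazonaws.com/instance-family
        operator: In
        values: ["p5"]
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: decode-pool
spec:
  template:
    metadata:
      labels:
        workload-role: decode              # Decode Pods (TP=2) use nodeSelector: workload-role: decode
    spec:
      nodeClassRef:
        group: eks.amazonaws.com
        kind: NodeClass
        name: default
      requirements:
      - key: eks.amazonaws.com/instance-family
        operator: In
        values: ["p5"]
```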
| Item | Auto Mode (Node Separation) | Karpenter + GPU Operator (MIG Separation) |
|---|---|---|
| Separation Unit | Instance (node) | GPU unit (MIG partition) |
| GPU Utilization | Optimizable with Decode Pod TP=2 x 4/node | High utilization with intra-GPU MIG partitioning |
| Operational Complexity | Low | Medium (GPU Operator + MIG configuration) |
| Scaling | Easy independent Prefill/Decode scaling | Node-level MIG reconfiguration causes disruption |
Recommended strategy: Validate on Auto Mode first, then transition to Karpenter + GPU Operator + MIG when cost optimization is needed.
llm-d vs NVIDIA Dynamo
llm-d and NVIDIA Dynamo both provide LLM inference routing/scheduling but with different approaches. For detailed comparison, see NVIDIA GPU Stack — llm-d vs Dynamo Selection Guide.
| Item | llm-d | NVIDIA Dynamo |
|---|---|---|
| Lead | Red Hat (Apache 2.0) | NVIDIA (Apache 2.0) |
| Architecture | Aggregated + Disaggregated | Aggregated + Disaggregated (equal support) |
| KV Cache Transfer | NIXL (network supported) | NIXL (NVLink/RDMA ultra-fast) |
| KV Cache Indexing | Prefix-aware routing | Flash Indexer (radix tree-based) |
| Routing | Gateway API + Envoy EPP | Dynamo Router + custom EPP (Gateway API integration) |
| Pod Scheduling | K8s default scheduler | KAI Scheduler (GPU-aware Pod placement) |
| Autoscaling | HPA/KEDA integration | Planner (SLO-based: profiling → autoscale) + KEDA/HPA |
| GPU Operator Required | Optional (Auto Mode compatible) | Required (KAI Scheduler's ClusterPolicy dependency) |
| Complexity | Low | High |
| Strengths | K8s native, lightweight, fast adoption | Flash Indexer, KAI Scheduler, Planner SLO autoscaling |
- EKS Auto Mode + quick start: llm-d (GPU Operator optional)
- Small-medium scale (16 GPUs or less): llm-d
- Large scale (16+ GPUs), maximum throughput: Dynamo (Flash Indexer + Planner)
- Long context (128K+): Dynamo (3-tier KV Cache: GPU→CPU→SSD)
- K8s Gateway API standard compliance: llm-d
Starting with llm-d and transitioning to Dynamo as scale grows is practical. Dynamo 1.0 can integrate llm-d as an internal component, making it more of a superset than a complete alternative.
Migration Path
Phased transition path:
| Phase | Configuration | Suitable For |
|---|---|---|
| Phase 1 | Auto Mode + llm-d | PoC, dev environments, 16 GPUs or less |
| Phase 1.5 | Auto Mode + GPU Operator + llm-d | Enhanced monitoring/scheduling |
| Phase 2a | Karpenter + llm-d Disaggregated | Mid-scale production, MIG utilization |
| Phase 2b | MNG + DRA + llm-d | P6e-GB200, DRA-required environments |
| Phase 3 | Karpenter + Dynamo | Large scale (16+ GPUs), maximum performance |
Auto Mode and self-managed Karpenter can coexist in the same cluster. In Phase 1.5, add the nvidia.com/gpu.deploy.device-plugin: "false" label to Auto Mode NodePool to prevent Device Plugin conflicts.
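In practice this means adding the label to the GPU NodePool's node template, for example (a minimal fragment):

```yaml
# Fragment of an Auto Mode GPU NodePool: nodes get this label so the GPU Operator
# skips deploying its Device Plugin (Auto Mode already provides one); DCGM/NFD/GFD still run.
spec:
  template:
    metadata:
      labels:
        nvidia.com/gpu.deploy.device-plugin: "false"
```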
Monitoring
Key Monitoring Metrics
| Metric | Description | Normal Range |
|---|---|---|
| vllm_num_requests_running | Number of currently processing requests | Varies by workload |
| vllm_num_requests_waiting | Number of waiting requests | < 50 |
| vllm_gpu_cache_usage_perc | GPU KV Cache utilization | 60-90% |
| vllm_avg_generation_throughput_toks_per_s | Tokens generated per second | Varies by model/GPU |
| vllm_avg_prompt_throughput_toks_per_s | Prompt tokens processed per second | Varies by model/GPU |
| vllm_e2e_request_latency_seconds | End-to-end request latency | P95 < 30s |
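If the Prometheus Operator is installed, the thresholds above can be expressed as alert rules. A minimal sketch follows; rule and alert names are placeholders, and the latency rule assumes the metric is exposed as a Prometheus histogram:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vllm-inference-alerts
spec:
  groups:
  - name: vllm.rules
    rules:
    - alert: VLLMQueueBacklog
      expr: vllm_num_requests_waiting > 50          # threshold from the table above
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "vLLM request queue backlog exceeds 50 waiting requests"
    - alert: VLLME2ELatencyHigh
      expr: histogram_quantile(0.95, sum by (le) (rate(vllm_e2e_request_latency_seconds_bucket[5m]))) > 30
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "vLLM P95 end-to-end request latency above 30 seconds"
```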
Model Loading Time
| Loading Method | Expected Time | Notes |
|---|---|---|
| HuggingFace Hub (initial) | 10-20 min | Varies by network speed |
| S3 Cache | 3-5 min | Loading from same-region S3 |
| Node Local Cache | 1-2 min | When redeploying on the same node |
Cost Optimization
| Strategy | Description | Estimated Savings |
|---|---|---|
| Savings Plans | 1-year/3-year Compute Savings Plans commitment | 30-60% |
| Off-Peak Scale Down | Reduce replicas during nights/weekends (using CronJob) | 40-60% |
| Model Quantization | Reduce GPU count with INT8/INT4 | 50% GPU cost |
| Spot Instances | Apply to fault-tolerant workloads (risk of interruption) | 60-90% |
| TP Optimization | Use minimum TP value appropriate for model size | Avoid unnecessary GPUs |
p5.48xlarge costs approximately $98.32/hr (us-west-2 On-Demand). Running 2 instances around the clock costs ~$141,580/month (at ~720 hours). Always clean up resources after testing.
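As a sketch of the Off-Peak Scale Down strategy from the table above, a CronJob can shrink replicas outside business hours. The Deployment name, schedule, and ServiceAccount are placeholders, and the ServiceAccount needs RBAC permission to scale the target workload:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: llm-offpeak-scale-down
spec:
  schedule: "0 13 * * 1-5"                 # 13:00 UTC on weekdays; pair with a matching scale-up job
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: llm-scaler   # needs RBAC: update/patch on deployments/scale
          restartPolicy: OnFailure
          containers:
          - name: kubectl
            image: bitnami/kubectl:latest  # placeholder image tag
            command: ["kubectl", "scale", "deployment/qwen3-32b-decode", "--replicas=2"]
```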
EKS Auto Mode GPU Instance Support Status (Verified 2026.04)
Instance Support Matrix
| Instance Type | GPU | VRAM (Total) | Auto Mode Support | Verification Status |
|---|---|---|---|---|
| g5.xlarge~48xlarge | A10G | 24~192GB | Normal | Provisioning confirmed |
| g6.xlarge~48xlarge | L4 | 24~192GB | Normal | Provisioning confirmed |
| g6e.xlarge~48xlarge | L40S | 48~384GB | Normal | Provisioning confirmed |
| p4d.24xlarge | A100 40GB x 8 | 320GB | Normal | Dry-run confirmed |
| p5.48xlarge | H100 80GB x 8 | 640GB | Normal | Spot provisioning confirmed (us-east-2) |
| p5en.48xlarge | H200 141GB x 8 | 1,128GB | Limited | Dry-run passes, offering matching may fail |
| p6-b200.48xlarge | B200 192GB x 8 | 1,536GB | Not supported | NoCompatibleInstanceTypes error |
As of April 2026, EKS Auto Mode's managed Karpenter cannot provision p6-b200.48xlarge. Use EKS Standard Mode + Karpenter if p6 instances are needed.
Per-Region GPU Capacity Availability
| Region | p5.48xlarge On-Demand | p5.48xlarge Spot | Spot Price |
|---|---|---|---|
| ap-northeast-2 (Seoul) | InsufficientCapacity | Unconfirmed | -- |
| us-east-2 (Ohio) | Availability varies | Successfully acquired | $13-15/hr |
Spot Price Comparison (us-east-2, 2026.04): p5 instances offer 85-90% cost savings on Spot. For detailed pricing, see EKS GPU Node Strategy — Spot Price Comparison.
GPU Quota Notes
| Quota Name | Applicable Instances | Default |
|---|---|---|
| Running On-Demand P instances | p4d, p4de, p5, p5en | 384 |
| Running On-Demand G and VT instances | g5, g6, g6e | 64 |
When setting instance-category: [g, p] together in GPU NodePool, Karpenter may try G-type instances first. To use P-type only, explicitly specify instance-category: [p].
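For example, a NodePool requirements fragment that restricts provisioning to P-type instances (shown with the EKS Auto Mode label key; self-managed Karpenter uses karpenter.k8s.aws/instance-category instead):

```yaml
# Fragment of spec.template.spec in a GPU NodePool: limits candidates to P-type instances only.
requirements:
- key: eks.amazonaws.com/instance-category
  operator: In
  values: ["p"]
```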
Next Steps
- EKS GPU Node Strategy -- Auto Mode vs Karpenter vs Hybrid Node, per-model-size cost analysis
- vLLM Model Serving and Performance Optimization -- vLLM basics and deployment
- MoE Model Serving Guide -- Mixture of Experts model serving
- GPU Resource Management -- GPU cluster resource management