llm-d Based EKS Distributed Inference Guide
Current Version: llm-d v0.5+ (2026.03)
Overview
llm-d is an Apache 2.0-licensed Kubernetes-native distributed inference stack led by Red Hat. It combines the vLLM inference engine, Envoy-based Inference Gateway, and Kubernetes Gateway API to provide intelligent inference routing for large language models.
While existing vLLM deployments rely on simple Round-Robin load balancing, llm-d delivers intelligent routing that is KV Cache state-aware, forwarding requests with identical prefixes to Pods that already hold the corresponding KV Cache. This significantly reduces Time To First Token (TTFT) and saves GPU computation.
For llm-d EKS deployment YAML, helmfile commands, and cluster creation, see the Custom Model Deployment Guide.
llm-d's Envoy-based Inference Gateway is a special-purpose gateway designed exclusively for LLM inference requests.
- llm-d Gateway: InferenceModel/InferencePool CRD-based, KV Cache-aware routing, inference traffic only
- General Gateway API: HTTPRoute/GRPCRoute-based, TLS/auth/Rate Limiting, cluster-wide traffic management
In production, the recommended architecture has a general Gateway API implementation handling the cluster entry point, with llm-d optimizing AI inference traffic underneath.
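As a sketch of that layering (resource names such as cluster-gateway and qwen3-32b-pool are placeholders), an HTTPRoute attached to the general Gateway can hand inference traffic to an llm-d InferencePool:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-inference-route
spec:
  parentRefs:
  - name: cluster-gateway                  # general Gateway at the cluster entry point (TLS, auth, rate limiting)
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /v1                         # OpenAI-compatible inference endpoints
    backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool                  # llm-d takes over with KV Cache-aware endpoint picking
      name: qwen3-32b-pool
```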
llm-d's 3 Well-Lit Paths
llm-d provides three validated deployment paths ("well-lit paths"): Intelligent Inference Scheduling, Prefill/Decode Disaggregation, and Wide Expert Parallelism for MoE models.
Architecture
llm-d's Intelligent Inference Scheduling architecture consists of an Envoy-based Inference Gateway at the front, an Endpoint Picker (EPP) that scores candidate Pods by KV Cache and load state, and an InferencePool of vLLM Pods that serve the requests.
llm-d vs Traditional vLLM Deployment Comparison
| Feature | Traditional vLLM Deployment | llm-d Deployment ✨ |
|---|---|---|
| Routing Method | Round-Robin / Random | KV Cache-aware Intelligent Routing |
| Gateway Integration | Separate Ingress/Service configuration | Native Gateway API integration |
| Scaling Management | Manual HPA configuration | Automatic management via InferencePool |
| KV Cache Utilization | Independent management per Pod | Cross-pod prefix reuse for reduced TTFT |
| Installation Method | Combining individual Helm charts | Unified helmfile deployment (single command) |
| Model Definition | Writing Deployment YAML directly | Declarative management via InferenceModel CRD |
Gateway API CRD
llm-d uses Kubernetes Gateway API and Inference Extension CRDs.
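A minimal sketch of the two core CRDs is shown below; the API version, field names, and resource names are illustrative and should be checked against the Inference Extension release bundled with your llm-d version:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen3-32b-pool
spec:
  targetPortNumber: 8000                   # vLLM OpenAI server port
  selector:
    app: qwen3-32b-vllm                    # label on the vLLM Pods that belong to this pool
  extensionRef:
    name: qwen3-32b-epp                    # Endpoint Picker service that scores Pods per request
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen3-32b
spec:
  modelName: Qwen/Qwen3-32B
  criticality: Critical
  poolRef:
    name: qwen3-32b-pool
```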
Default Deployment Configuration
| Setting | Default Value | Description |
|---|---|---|
| Model | Qwen/Qwen3-32B | Apache 2.0, BF16 ~65GB VRAM |
| vLLM Version | v0.6+ | CUDA 12.x support, H100/H200 optimized |
| Tensor Parallelism | TP=2 | 2 GPUs per replica |
| Replicas | 8 | 16 GPUs total (2× p5.48xlarge) |
| Max Model Length | 32,768 | Maximum context length |
| GPU Memory Utilization | 0.90 | Fraction of GPU memory vLLM may use (weights + KV Cache) |
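The defaults in the table map directly onto standard vLLM server flags. The fragment below is illustrative only; the actual Deployment is generated by the llm-d helmfile, and the container name and image tag are placeholders:

```yaml
containers:
- name: vllm
  image: vllm/vllm-openai:latest           # placeholder tag; pin the version your llm-d release expects
  args:
  - --model=Qwen/Qwen3-32B
  - --tensor-parallel-size=2               # TP=2: one replica spans 2 GPUs
  - --max-model-len=32768
  - --gpu-memory-utilization=0.90          # fraction of VRAM vLLM may use for weights + KV Cache
  resources:
    limits:
      nvidia.com/gpu: "2"
```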
Qwen3-32B Model Selection Rationale
Qwen3-32B is llm-d's official default model and is Apache 2.0-licensed for free commercial use. Requiring ~65GB VRAM at BF16, it can be stably served with TP=2 (2x GPU) on H100 80GB.
KV Cache-aware Routing
The core differentiator of llm-d is intelligent routing that is aware of KV Cache state.
Routing Operation Principles
- Request reception: Client sends inference request to Inference Gateway
- Prefix analysis: Gateway hashes the request's prompt prefix for identification
- Cache lookup: Checks KV Cache state of each vLLM Pod to find Pods holding the prefix
- Intelligent routing: Routes to matching Pod on cache hit; load-balanced on miss
- Response return: vLLM returns inference results to client via Gateway
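The scoring logic in steps 2-4 lives in the gateway's Endpoint Picker (EPP), which weighs candidate Pods with pluggable scorers. The sketch below is a rough illustration only; the kind, plugin names, and weights are assumptions, so consult the llm-d / Gateway API Inference Extension release you deploy for the exact configuration schema:

```yaml
# Illustrative only: plugin names and schema are assumptions, not verified against a specific release.
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
- type: prefix-cache-scorer        # favors Pods likely to hold the prompt prefix's KV Cache
- type: queue-scorer               # penalizes Pods with deep request queues
- type: kv-cache-scorer            # prefers Pods with free KV Cache blocks
- type: max-score-picker           # picks the highest-scoring Pod
schedulingProfiles:
- name: default
  plugins:
  - pluginRef: prefix-cache-scorer
    weight: 2
  - pluginRef: queue-scorer
    weight: 1
  - pluginRef: kv-cache-scorer
    weight: 1
  - pluginRef: max-score-picker
```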
KV Cache-aware Routing Effects
| Metric | Cache Miss (Traditional) | Cache Hit (llm-d) | Improvement |
|---|---|---|---|
| TTFT (Time To First Token) | High (full prefill required) | Low (prefill skipped) | 50-80% reduction |
| GPU Computation | Full prompt processing | Only new tokens processed | Computation savings |
| Throughput | Baseline | Improved | 1.5-3x improvement |
KV Cache-aware routing is most effective in applications using identical system prompts. For example, in RAG pipelines that repeatedly reference the same context documents, reusing the prefix's KV Cache can significantly reduce TTFT.
EKS Auto Mode Integration
Auto Mode Advantages and Limitations
Advantages:
- Automatic GPU driver management: AWS automatically installs and updates NVIDIA GPU drivers
- Automatic NodeClass selection: Using defaultNodeClass lets Auto Mode auto-select the optimal AMI and driver version
- Operational simplification: Eliminates the burden of driver installation, CUDA version management, and driver compatibility verification
- GPU Operator installable: Only Device Plugin disabled via label; DCGM/NFD/GFD operate normally
Limitations:
- MIG/Time-Slicing not available: Auto Mode's NodeClass is AWS-managed (read-only), so GPU partitioning configuration is not possible
- Custom AMI not available: Cannot pin specific CUDA versions or drivers
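To make the NodeClass point concrete, a minimal Auto Mode GPU NodePool sketch is shown below. It assumes the karpenter.sh/v1 NodePool API that Auto Mode exposes and the AWS-managed default NodeClass; the name and instance selection are placeholders:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-ondemand
spec:
  template:
    spec:
      nodeClassRef:
        group: eks.amazonaws.com
        kind: NodeClass
        name: default                      # AWS-managed NodeClass: AMI and NVIDIA driver are handled for you
      requirements:
      - key: eks.amazonaws.com/instance-family
        operator: In
        values: ["p5"]
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["on-demand"]
      taints:
      - key: nvidia.com/gpu
        value: "true"
        effect: NoSchedule                 # keeps non-GPU Pods off the expensive nodes
```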
Auto Mode vs Karpenter + GPU Operator Comparison
Auto Mode is suitable for large model serving without GPU driver management burden, while Karpenter is advantageous for workloads requiring advanced GPU features like MIG/Time-Slicing.
Detailed comparison and cost analysis: See EKS GPU Node Strategy — Node Type Comparison
GPU Instance Specifications
- p5e.48xlarge (H200): 100B+ parameter models, maximum memory utilization
- p5.48xlarge (H100): 70B+ parameter models, highest performance
- g6e family (L40S): 13B-70B models, cost-efficient inference
When llm-d ModelService requests GPUs via DRA (ResourceClaim), neither Karpenter nor EKS Auto Mode currently provisions nodes for the claim. DRA workloads require a Managed Node Group + Cluster Autoscaler configuration.
Details: EKS GPU Node Strategy — MNG Hybrid for DRA Workloads
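For context, a minimal sketch of what a DRA GPU request looks like, assuming Kubernetes 1.32+ (resource.k8s.io/v1beta1) and the NVIDIA DRA driver's gpu.nvidia.com DeviceClass; names are placeholders:

```yaml
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com    # DeviceClass installed by the NVIDIA DRA driver
---
# Pod fragment: the claim replaces the classic nvidia.com/gpu resource request,
# which is why Karpenter / Auto Mode cannot see a GPU requirement to provision for.
spec:
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu
  containers:
  - name: vllm
    resources:
      claims:
      - name: gpu
```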
llm-d v0.5+ Key Features
| Feature | Description | Status |
|---|---|---|
| Prefill/Decode Disaggregation | Separate Prefill and Decode into distinct Pod groups, maximizing throughput for large batches and long contexts | GA |
| Expert Parallelism | Distributed serving of MoE model (Mixtral, DeepSeek) Experts across multiple nodes | GA |
| LoRA Adapter Hot-swap | Dynamically load/unload multiple LoRA adapters on a single base model | GA |
| Multi-model Serving | Simultaneously serve multiple models via InferenceModel CRD in a single cluster | GA |
| Gateway API Inference Extension | K8s-native routing based on InferencePool/InferenceModel CRDs | GA |
Disaggregated Serving Concept
Disaggregated Serving separates the two phases of LLM inference for independent optimization:
| Phase | Characteristics | Optimization Direction |
|---|---|---|
| Prefill | Processes entire prompt at once (compute-bound) | GPU computing focused, high TP |
| Decode | Autoregressive token-by-token generation (memory-bound) | GPU memory focused, low TP |
NIXL (NVIDIA Inference Xfer Library): Common KV transfer engine used by most projects including Dynamo, llm-d, production-stack, and aibrix. Transfers KV Cache at ultra-high speed via direct GPU communication (NVLink/RDMA).
Disaggregated Serving on EKS Auto Mode
Since MIG partitioning is not possible on Auto Mode, Prefill/Decode roles are separated at the instance (node) level.
- Prefill NodePool (compute-heavy): p5.48xlarge x N → Prefill Pods (each TP=4, 4 GPUs; 2 Pods/node)
- Decode NodePool (memory-heavy): p5.48xlarge x N → Decode Pods (each TP=2, 2 GPUs; 4 Pods/node)
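One way to express this node-level separation on Auto Mode is two NodePools whose node labels the Prefill and Decode Pods select. The sketch below is illustrative; the workload-role label and NodePool names are placeholders, not official llm-d labels:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: prefill-pool
spec:
  template:
    metadata:
      labels:
        workload-role: prefill             # Prefill Pods (TP=4) use nodeSelector: workload-role: prefill
    spec:
      nodeClassRef:
        group: eks.amazonaws.com
        kind: NodeClass
        name: default
      requirements:
      - key: eks.amazonaws.com/instance-family
        operator: In
        values: ["p5"]
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: decode-pool
spec:
  template:
    metadata:
      labels:
        workload-role: decode              # Decode Pods (TP=2) use nodeSelector: workload-role: decode
    spec:
      nodeClassRef:
        group: eks.amazonaws.com
        kind: NodeClass
        name: default
      requirements:
      - key: eks.amazonaws.com/instance-family
        operator: In
        values: ["p5"]
```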
| Item | Auto Mode (Node Separation) | Karpenter + GPU Operator (MIG Separation) |
|---|---|---|
| Separation Unit | Instance (node) | GPU unit (MIG partition) |
| GPU Utilization | Optimizable with Decode Pod TP=2 x 4/node | High utilization with intra-GPU MIG partitioning |
| Operational Complexity | Low | Medium (GPU Operator + MIG configuration) |
| Scaling | Easy independent Prefill/Decode scaling | Node-level MIG reconfiguration causes disruption |
Recommended strategy: Validate on Auto Mode first, then transition to Karpenter + GPU Operator + MIG when cost optimization is needed.
llm-d vs NVIDIA Dynamo
llm-d and NVIDIA Dynamo both provide LLM inference routing/scheduling but with different approaches. For detailed comparison, see NVIDIA GPU Stack — llm-d vs Dynamo Selection Guide.
| Item | llm-d | NVIDIA Dynamo |
|---|---|---|
| Lead | Red Hat (Apache 2.0) | NVIDIA (Apache 2.0) |
| Architecture | Aggregated + Disaggregated | Aggregated + Disaggregated (equal support) |
| KV Cache Transfer | NIXL (network supported) | NIXL (NVLink/RDMA ultra-fast) |
| KV Cache Indexing | Prefix-aware routing | Flash Indexer (radix tree-based) |
| Routing | Gateway API + Envoy EPP | Dynamo Router + custom EPP (Gateway API integration) |
| Pod Scheduling | K8s default scheduler | KAI Scheduler (GPU-aware Pod placement) |
| Autoscaling | HPA/KEDA integration | Planner (SLO-based: profiling → autoscale) + KEDA/HPA |
| GPU Operator Required | Optional (Auto Mode compatible) | Required (KAI Scheduler's ClusterPolicy dependency) |
| Complexity | Low | High |
| Strengths | K8s native, lightweight, fast adoption | Flash Indexer, KAI Scheduler, Planner SLO autoscaling |
- EKS Auto Mode + quick start: llm-d (GPU Operator optional)
- Small-medium scale (16 GPUs or less): llm-d
- Large scale (16+ GPUs), maximum throughput: Dynamo (Flash Indexer + Planner)
- Long context (128K+): Dynamo (3-tier KV Cache: GPU→CPU→SSD)
- K8s Gateway API standard compliance: llm-d
Starting with llm-d and transitioning to Dynamo as scale grows is practical. Dynamo 1.0 can integrate llm-d as an internal component, making it more of a superset than a complete alternative.
Migration Path
Phased transition path:
| Phase | Configuration | Suitable For |
|---|---|---|
| Phase 1 | Auto Mode + llm-d | PoC, dev environments, 16 GPUs or less |
| Phase 1.5 | Auto Mode + GPU Operator + llm-d | Enhanced monitoring/scheduling |
| Phase 2a | Karpenter + llm-d Disaggregated | Mid-scale production, MIG utilization |
| Phase 2b | MNG + DRA + llm-d | P6e-GB200, DRA-required environments |
| Phase 3 | Karpenter + Dynamo | Large scale (16+ GPUs), maximum performance |
Auto Mode and self-managed Karpenter can coexist in the same cluster. In Phase 1.5, add the nvidia.com/gpu.deploy.device-plugin: "false" label to Auto Mode NodePool to prevent Device Plugin conflicts.
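In practice this means adding the label to the GPU NodePool's node template, for example (a minimal fragment):

```yaml
# Fragment of an Auto Mode GPU NodePool: nodes get this label so the GPU Operator
# skips deploying its Device Plugin (Auto Mode already provides one); DCGM/NFD/GFD still run.
spec:
  template:
    metadata:
      labels:
        nvidia.com/gpu.deploy.device-plugin: "false"
```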
Monitoring
Key Monitoring Metrics
| Metric | Description | Normal Range |
|---|---|---|
| vllm_num_requests_running | Number of currently processing requests | Varies by workload |
| vllm_num_requests_waiting | Number of waiting requests | < 50 |
| vllm_gpu_cache_usage_perc | GPU KV Cache utilization | 60-90% |
| vllm_avg_generation_throughput_toks_per_s | Tokens generated per second | Varies by model/GPU |
| vllm_avg_prompt_throughput_toks_per_s | Prompt tokens processed per second | Varies by model/GPU |
| vllm_e2e_request_latency_seconds | End-to-end request latency | P95 < 30s |
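If the Prometheus Operator is installed, the thresholds above can be expressed as alert rules. A minimal sketch follows; rule and alert names are placeholders, and the latency rule assumes the metric is exposed as a Prometheus histogram:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vllm-inference-alerts
spec:
  groups:
  - name: vllm.rules
    rules:
    - alert: VLLMQueueBacklog
      expr: vllm_num_requests_waiting > 50          # threshold from the table above
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "vLLM request queue backlog exceeds 50 waiting requests"
    - alert: VLLME2ELatencyHigh
      expr: histogram_quantile(0.95, sum by (le) (rate(vllm_e2e_request_latency_seconds_bucket[5m]))) > 30
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "vLLM P95 end-to-end request latency above 30 seconds"
```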
Model Loading Time
| Loading Method | Expected Time | Notes |
|---|---|---|
| HuggingFace Hub (initial) | 10-20 min | Varies by network speed |
| S3 Cache | 3-5 min | Loading from same-region S3 |
| Node Local Cache | 1-2 min | When redeploying on the same node |
Cost Optimization
| Strategy | Description | Estimated Savings |
|---|---|---|
| Savings Plans | 1-year/3-year Compute Savings Plans commitment | 30-60% |
| Off-Peak Scale Down | Reduce replicas during nights/weekends (using CronJob) | 40-60% |
| Model Quantization | Reduce GPU count with INT8/INT4 | 50% GPU cost |
| Spot Instances | Apply to fault-tolerant workloads (risk of interruption) | 60-90% |
| TP Optimization | Use minimum TP value appropriate for model size | Avoid unnecessary GPUs |
p5.48xlarge costs approximately $98.32/hr (us-west-2 On-Demand). Running 2 instances around the clock costs ~$141,580/month (at ~720 hours). Always clean up resources after testing.
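As a sketch of the Off-Peak Scale Down strategy from the table above, a CronJob can shrink replicas outside business hours. The Deployment name, schedule, and ServiceAccount are placeholders, and the ServiceAccount needs RBAC permission to scale the target workload:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: llm-offpeak-scale-down
spec:
  schedule: "0 13 * * 1-5"                 # 13:00 UTC on weekdays; pair with a matching scale-up job
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: llm-scaler   # needs RBAC: update/patch on deployments/scale
          restartPolicy: OnFailure
          containers:
          - name: kubectl
            image: bitnami/kubectl:latest  # placeholder image tag
            command: ["kubectl", "scale", "deployment/qwen3-32b-decode", "--replicas=2"]
```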
EKS Auto Mode GPU Instance Support Status (Verified 2026.04)
Instance Support Matrix
| Instance Type | GPU | VRAM (Total) | Auto Mode Support | Verification Status |
|---|---|---|---|---|
| g5.xlarge~48xlarge | A10G | 24~192GB | Normal | Provisioning confirmed |
| g6.xlarge~48xlarge | L4 | 24~192GB | Normal | Provisioning confirmed |
| g6e.xlarge~48xlarge | L40S | 48~384GB | Normal | Provisioning confirmed |
| p4d.24xlarge | A100 40GB x 8 | 320GB | Normal | Dry-run confirmed |
| p5.48xlarge | H100 80GB x 8 | 640GB | Normal | Spot provisioning confirmed (us-east-2) |
| p5en.48xlarge | H200 141GB x 8 | 1,128GB | Limited | Dry-run passes, offering matching may fail |
| p6-b200.48xlarge | B200 192GB x 8 | 1,536GB | Not supported | NoCompatibleInstanceTypes error |
As of April 2026, EKS Auto Mode's managed Karpenter cannot provision p6-b200.48xlarge. Use EKS Standard Mode + Karpenter if p6 instances are needed.
Per-Region GPU Capacity Availability
| Region | p5.48xlarge On-Demand | p5.48xlarge Spot | Spot Price |
|---|---|---|---|
| ap-northeast-2 (Seoul) | InsufficientCapacity | Unconfirmed | -- |
| us-east-2 (Ohio) | Availability varies | Successfully acquired | $13-15/hr |
Spot Price Comparison (us-east-2, 2026.04): p5 instances offer 85-90% cost savings on Spot. For detailed pricing, see EKS GPU Node Strategy — Spot Price Comparison.
GPU Quota Notes
| Quota Name | Applicable Instances | Default |
|---|---|---|
| Running On-Demand P instances | p4d, p4de, p5, p5en | 384 |
| Running On-Demand G and VT instances | g5, g6, g6e | 64 |
When setting instance-category: [g, p] together in GPU NodePool, Karpenter may try G-type instances first. To use P-type only, explicitly specify instance-category: [p].
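For example, a NodePool requirements fragment that restricts provisioning to P-type instances (shown with the EKS Auto Mode label key; self-managed Karpenter uses karpenter.k8s.aws/instance-category instead):

```yaml
# Fragment of spec.template.spec in a GPU NodePool: limits candidates to P-type instances only.
requirements:
- key: eks.amazonaws.com/instance-category
  operator: In
  values: ["p"]
```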
Next Steps
- EKS GPU Node Strategy -- Auto Mode vs Karpenter vs Hybrid Node, per-model-size cost analysis
- vLLM Model Serving and Performance Optimization -- vLLM basics and deployment
- MoE Model Serving Guide -- Mixture of Experts model serving
- GPU Resource Management -- GPU cluster resource management