Distributed Inference on EKS with llm-d

Current version: llm-d v0.5+ (2026.03)

Created: 2026-02-10 | Updated: 2026-04-06 | Reading time: about 8 minutes

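Overview

llm-d is a Kubernetes-native distributed inference stack led by Red Hat and licensed under Apache 2.0. It combines the vLLM inference engine, an Envoy-based Inference Gateway, and the Kubernetes Gateway API to provide intelligent inference routing for large language models.

Traditional vLLM deployments rely on simple Round-Robin load balancing. llm-d instead routes with awareness of KV Cache state, forwarding requests that share a prefix to the Pod that already holds the matching KV Cache. This significantly shortens Time To First Token (TTFT) and saves GPU compute.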

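Hands-On Deployment Guide

For hands-on deployment, including the EKS deployment YAML for llm-d, helmfile commands, and cluster creation, see the Custom Model Deployment Guide.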

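The llm-d Inference Gateway Is Not a General-Purpose Gateway API Implementation

llm-d's Envoy-based Inference Gateway is a special-purpose gateway designed specifically for LLM inference requests.

llm-d Gateway: based on the InferenceModel/InferencePool CRDs, KV Cache-aware routing, dedicated to inference traffic
General-purpose Gateway API: based on HTTPRoute/GRPCRoute, TLS/authentication/Rate Limiting, cluster-wide traffic management

In production, the recommended layout is for a general-purpose Gateway API implementation to handle cluster ingress, with llm-d optimizing AI inference traffic beneath it.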

llm-d's 3 Well-Lit Paths

llm-d provides three validated deployment paths.

Intelligent Inference Scheduling (recommended)

Intelligent request distribution with KV Cache-aware routing

📌 General-purpose LLM serving (this guide)

Prefill/Decode Disaggregation

Separates Prefill and Decode stages for processing

📌 Large batch processing, long context handling

Wide Expert-Parallelism

Distributes MoE model Experts across multiple nodes

📌 MoE models (Mixtral, DeepSeek, etc.)

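Architecture

In llm-d's Intelligent Inference Scheduling architecture, clients send requests to the Envoy-based Inference Gateway, which routes them to vLLM Pods based on KV Cache state.

llm-d vs Traditional vLLM Deployment Comparison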

Feature	Traditional vLLM Deployment	llm-d Deployment ✨
Routing Method	Round-Robin / Random	KV Cache-aware Intelligent Routing
Gateway Integration	Separate Ingress/Service configuration	Native Gateway API integration
Scaling Management	Manual HPA configuration	Automatic management via InferencePool
KV Cache Utilization	Independent management per Pod	Cross-pod prefix reuse for reduced TTFT
Installation Method	Combining individual Helm charts	Unified helmfile deployment (single command)
Model Definition	Writing Deployment YAML directly	Declarative management via InferenceModel CRD

Gateway API CRD

llm-d uses the Kubernetes Gateway API and the Inference Extension CRDs.

Installed CRDs

Gateway: Defines Envoy-based proxy instances
HTTPRoute: Defines routing rules
InferencePool: Defines vLLM Pod groups (serving endpoint pools)
InferenceModel: Maps model names to InferencePools

Default Deployment Configuration

Setting	Default Value	Description
Model	Qwen/Qwen3-32B	Apache 2.0, BF16 ~65GB VRAM
vLLM Version	v0.6+	CUDA 12.x support, H100/H200 optimized
Tensor Parallelism	TP=2	2 GPUs per replica
Replicas	8	16 GPUs total (2× p5.48xlarge)
Max Model Length	32,768	Maximum context length
GPU Memory Utilization	0.90	KV Cache allocation ratio
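
The settings in the table correspond to standard vLLM engine options. The sketch below shows one way to map them onto a `vllm serve` command line. It is only an illustration of how the values relate to each other: llm-d normally renders these arguments through its Helm charts, and the `DEFAULTS` dict and `vllm_serve_argv` helper are hypothetical names introduced here.

```python
import shlex

# Values copied from the default configuration table above.
DEFAULTS = {
    "model": "Qwen/Qwen3-32B",
    "tensor_parallel_size": 2,
    "max_model_len": 32768,
    "gpu_memory_utilization": 0.90,
}


def vllm_serve_argv(cfg: dict) -> list[str]:
    """Build the argument vector for `vllm serve` from a settings dict."""
    return [
        "vllm", "serve", cfg["model"],
        "--tensor-parallel-size", str(cfg["tensor_parallel_size"]),
        "--max-model-len", str(cfg["max_model_len"]),
        "--gpu-memory-utilization", str(cfg["gpu_memory_utilization"]),
    ]


# Print the command as a single shell-safe line.
print(shlex.join(vllm_serve_argv(DEFAULTS)))
```

Each replica runs one such process across two GPUs, so the eight replicas in the table account for the 16 GPUs listed.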

Why Qwen3-32B Was Selected

Model Name: Qwen/Qwen3-32B
Parameters: 32B (Dense)
License: Apache 2.0
Precision: BF16 (~65GB VRAM)
Context: Up to 32,768 tokens
Features: Official default model for llm-d, strong multilingual support, widely used among open-source LLMs

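Background on Selecting Qwen3-32B

Qwen3-32B is the official default model for llm-d, and its Apache 2.0 license permits commercial use. In BF16 the weights need about 65GB of VRAM, so TP=2 (two GPUs) can serve the model stably on H100 80GB.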
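
The memory argument can be checked with simple arithmetic. The sketch below is a rough estimate under stated assumptions: weights split evenly across the tensor-parallel ranks, the 0.90 GPU memory utilization from the default configuration, and no allowance for activations or CUDA context overhead. The helper name `per_gpu_budget_gb` is hypothetical.

```python
def per_gpu_budget_gb(weights_gb: float, tp: int,
                      gpu_gb: float = 80.0, mem_util: float = 0.90):
    """Return (weights per GPU, memory left per GPU for KV Cache), in GB.

    Assumes weights shard evenly across `tp` GPUs and that vLLM may use
    `mem_util` of each GPU's memory.
    """
    weights_per_gpu = weights_gb / tp
    usable = gpu_gb * mem_util
    return weights_per_gpu, usable - weights_per_gpu


for tp in (1, 2):
    w, free = per_gpu_budget_gb(65.0, tp)
    print(f"TP={tp}: {w:.1f} GB weights per GPU, {free:.1f} GB left for KV Cache")
```

Under these assumptions TP=1 leaves only about 7 GB per GPU for KV Cache, while TP=2 leaves about 39.5 GB, which is why the two-GPU layout is the comfortable choice on H100 80GB.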

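KV Cache-Aware Routing

llm-d's core differentiator is intelligent routing that is aware of KV Cache state.

How Routing Works

Request received: the client sends an inference request to the Inference Gateway
Prefix analysis: the Gateway hashes the request's prompt prefix to identify it
Cache lookup: the Gateway checks the KV Cache state of each vLLM Pod and looks for a Pod that holds that prefix
Intelligent routing: on a cache hit the request goes to that Pod; on a miss it is load-balanced by current load
Response returned: vLLM returns the inference result to the client through the Gateway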
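
The steps above can be sketched as a toy router. This is not llm-d's implementation: it hashes fixed-size character blocks rather than token blocks, keeps its own guess of what each Pod has cached, and never decrements load. `BLOCK`, `prefix_hashes`, and `Router` are hypothetical names used only to illustrate prefix hashing, cache lookup, and the load-based fallback.

```python
import hashlib

BLOCK = 16  # hypothetical block size in characters for this sketch


def prefix_hashes(prompt: str) -> list[str]:
    """Hash every block-aligned prefix; each hash covers the whole prefix so far."""
    hashes, h = [], hashlib.sha256()
    full_len = len(prompt) - len(prompt) % BLOCK
    for i in range(0, full_len, BLOCK):
        h.update(prompt[i:i + BLOCK].encode())
        hashes.append(h.hexdigest())
    return hashes


class Router:
    def __init__(self, pods: list[str]):
        self.cache = {p: set() for p in pods}  # pod -> prefix hashes believed cached
        self.load = {p: 0 for p in pods}       # pod -> requests routed so far

    def _hit_len(self, pod: str, hashes: list[str]) -> int:
        """Number of leading prefix blocks already cached on the pod."""
        n = 0
        for hsh in hashes:
            if hsh not in self.cache[pod]:
                break
            n += 1
        return n

    def route(self, prompt: str) -> str:
        hashes = prefix_hashes(prompt)
        # Longest cached prefix wins; ties (including a full miss) go to the least-loaded pod.
        pod = max(self.cache, key=lambda p: (self._hit_len(p, hashes), -self.load[p]))
        self.cache[pod].update(hashes)
        self.load[pod] += 1
        return pod


router = Router(["pod-a", "pod-b"])
system = "You are a helpful assistant for EKS questions. "
first = router.route(system + "What is Karpenter?")
other = router.route("Completely unrelated prompt text goes here.")
again = router.route(system + "What is Auto Mode?")
print(first, other, again)
```

The third request shares its system prompt with the first, so it lands on the same Pod even though both Pods carry equal load at that point.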

Effects of KV Cache-aware Routing

Metric	Cache Miss (Traditional)	Cache Hit (llm-d)	Improvement
TTFT (Time To First Token)	High (full prefill required)	Low (prefill skipped for the cached prefix)	50-80% reduction
GPU Computation	Full prompt processing	Only new tokens processed	Computation savings
Throughput	Baseline	Improved	1.5-3x improvement

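Maximizing the Cache Hit Rate

KV Cache-aware routing is most effective in applications that reuse the same system prompt. For example, when a RAG pipeline repeatedly references the same context documents, reusing the KV Cache for that prefix can significantly shorten TTFT.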
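
To see why the hit rate matters, a simple blended-average model helps. The sketch below assumes every cache hit receives the same fractional TTFT reduction (the 50-80% range from the effects table) and every miss pays the full prefill cost. The 2.0 s miss latency and 70% hit rate are made-up inputs for illustration, not measurements.

```python
def blended_ttft(miss_ttft_s: float, hit_rate: float, hit_reduction: float) -> float:
    """Average TTFT when `hit_rate` of requests get a `hit_reduction` cut in TTFT."""
    hit_ttft = miss_ttft_s * (1.0 - hit_reduction)
    return hit_rate * hit_ttft + (1.0 - hit_rate) * miss_ttft_s


# Hypothetical workload: 2.0 s TTFT on a miss, 70% of requests hit the cache.
for reduction in (0.5, 0.8):
    avg = blended_ttft(2.0, hit_rate=0.7, hit_reduction=reduction)
    print(f"hit reduction {reduction:.0%}: average TTFT {avg:.2f} s")
```

Because the average is linear in the hit rate, raising the share of requests with a common prefix improves TTFT as directly as making each hit faster.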

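EKS Auto Mode Integration

Advantages and Limitations of Auto Mode

Advantages:

Automatic GPU driver management: AWS installs and updates the NVIDIA GPU driver automatically
Automatic NodeClass selection: with the default NodeClass, Auto Mode picks the optimal AMI and driver version automatically
Simplified operations: removes the operational burden of driver installation, CUDA version management, and driver compatibility validation
GPU Operator can be installed: only the Device Plugin is disabled via a label; DCGM, NFD, and GFD run normally

Limitations:

No MIG/Time-Slicing: the Auto Mode NodeClass is AWS-managed (read-only), so GPU partitioning cannot be configured
No Custom AMI: workloads that need a specific CUDA version or a pinned driver cannot be accommodated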

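Auto Mode vs Karpenter + GPU Operator Comparison

Criterion	EKS Auto Mode	Auto Mode + GPU Operator	Karpenter + GPU Operator
Suitable model size	70B+ (whole-GPU usage)	70B+ (whole-GPU usage)	7B-30B (MIG partitioning possible)
GPU driver management	Managed automatically by AWS	Managed automatically by AWS	Preinstalled in the AMI
Device Plugin	Managed by AWS	Disabled via label	Managed by GPU Operator
DCGM monitoring	Basic metrics only	Fine-grained metrics via DCGM Exporter	Fine-grained metrics via DCGM Exporter
MIG / Time-Slicing	Not available	Not available	Available
KAI Scheduler	Not available	Available (depends on ClusterPolicy)	Available
Operational complexity	Low	Medium	Medium

For a detailed cost analysis by model size, see EKS GPU Node Strategy.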

GPU Instance Specifications

p5.48xlarge Instance Specifications

GPU: 8× NVIDIA H100 80GB HBM3
GPU Memory: 640GB total
vCPU: 192
System Memory: 2,048 GiB
GPU Interconnect: NVSwitch (900 GB/s)
Network: EFA 3,200 Gbps
Storage: 8× 3.84TB NVMe SSD

p5e.48xlarge Instance Specifications (H200)

GPU: 8× NVIDIA H200 141GB HBM3e
GPU Memory: 1,128GB total
vCPU: 192
System Memory: 2,048 GiB
GPU Interconnect: NVSwitch (900 GB/s)
Network: EFA 3,200 Gbps
Storage: 8× 3.84TB NVMe SSD

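Instance Selection Guide

p5e.48xlarge (H200): 100B+ parameter models, maximum memory utilization
p5.48xlarge (H100): 70B+ parameter models, highest performance
g6e family (L40S): 13B-70B models, cost-effective inference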

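Node Limitations When Using llm-d with DRA

When an llm-d ModelService requests GPUs through DRA (ResourceClaim), Karpenter and EKS Auto Mode cannot provision nodes. A DRA ResourceSlice is published by the DRA Driver only after the node has been created, so Karpenter cannot simulate scheduling before it creates the node.

Deployments that use DRA: GPU nodes must be managed with a Managed Node Group + Cluster Autoscaler
Deployments that do not use DRA (nvidia.com/gpu Device Plugin): Auto Mode and Karpenter work normally
P6e-GB200 UltraServer: DRA is required (the Device Plugin is not supported)

Details: EKS GPU Node Strategy, section on mixing in MNG for DRA workloads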
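
The rule above can be expressed as a small check over a Pod spec. The sketch assumes the spec is already loaded as a Python dict, and it relies on two standard Kubernetes fields: `resourceClaims` at the Pod level for DRA, and `resources.limits["nvidia.com/gpu"]` on a container for the Device Plugin path. The function names and the example claim names are hypothetical.

```python
def uses_dra(pod_spec: dict) -> bool:
    """True when the Pod requests devices through DRA (spec.resourceClaims)."""
    return bool(pod_spec.get("resourceClaims"))


def uses_device_plugin(pod_spec: dict) -> bool:
    """True when any container asks for nvidia.com/gpu via resource limits."""
    for container in pod_spec.get("containers", []):
        limits = container.get("resources", {}).get("limits", {})
        if "nvidia.com/gpu" in limits:
            return True
    return False


def node_provisioner(pod_spec: dict) -> str:
    """Pick the node management option described in the list above."""
    if uses_dra(pod_spec):
        return "Managed Node Group + Cluster Autoscaler"
    if uses_device_plugin(pod_spec):
        return "EKS Auto Mode or Karpenter"
    return "no GPU requested"


dra_pod = {
    "resourceClaims": [{"name": "gpu", "resourceClaimTemplateName": "example-gpu-claim"}],
    "containers": [{"name": "vllm", "resources": {"claims": [{"name": "gpu"}]}}],
}
plugin_pod = {
    "containers": [{"name": "vllm", "resources": {"limits": {"nvidia.com/gpu": 2}}}],
}
print(node_provisioner(dra_pod))
print(node_provisioner(plugin_pod))
```

A check like this could run in CI against rendered manifests to catch a DRA-based workload being targeted at an Auto Mode or Karpenter NodePool.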

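Key Features in llm-d v0.5+

Feature	Description	Status
Prefill/Decode Disaggregation	Splits Prefill and Decode into separate Pod groups to maximize throughput for large batches and long contexts	GA
Expert Parallelism	Distributes the Experts of MoE models (Mixtral, DeepSeek) across multiple nodes for serving	GA
LoRA adapter hot-swapping	Dynamically loads and unloads multiple LoRA adapters on a single base model	GA
Multi-model serving	Serves multiple models in one cluster simultaneously through the InferenceModel CRD	GA
Gateway API Inference Extension	Kubernetes-native routing based on the InferencePool/InferenceModel CRDs	GA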

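Disaggregated Serving Concepts

Disaggregated Serving separates the two phases of LLM inference and optimizes each one independently:

Phase	Characteristics	Optimization direction
Prefill	Processes the entire prompt in one pass (compute-bound)	Focus on GPU compute, high TP
Decode	Generates tokens autoregressively, one at a time (memory-bound)	Focus on GPU memory, low TP

NIXL (NVIDIA Inference Xfer Library): the common KV transfer engine used by most projects in this space, including Dynamo, llm-d, production-stack, and aibrix. It transfers KV Cache at very high speed over direct GPU-to-GPU communication (NVLink/RDMA).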

Disaggregated Serving in EKS Auto Mode

MIG partitioning is not available in Auto Mode, so the Prefill and Decode roles are separated at instance (node) granularity.

Prefill NodePool (compute-heavy):
  p5.48xlarge x N nodes -> Prefill Pods (TP=4 each, 4 GPUs per Pod)

Decode NodePool (memory-heavy):
  p5.48xlarge x N nodes -> Decode Pods (TP=2 each, 2 GPUs per Pod x 4 Pods per node)
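
The packing in the layout above follows from dividing the 8 GPUs of a p5.48xlarge by the TP degree. The sketch below works that out and also estimates how many nodes a given Pod count needs. It assumes whole GPUs only (no MIG) and one Pod per TP group; the helper names are hypothetical.

```python
GPUS_PER_NODE = 8  # p5.48xlarge


def pods_per_node(tp: int) -> int:
    """Pods that fit on one node when each Pod takes `tp` whole GPUs."""
    return GPUS_PER_NODE // tp


def idle_gpus_per_node(tp: int) -> int:
    """GPUs left unused on a fully packed node."""
    return GPUS_PER_NODE % tp


def nodes_needed(pods: int, tp: int) -> int:
    """Nodes required for `pods` Pods, using ceiling division."""
    per_node = pods_per_node(tp)
    return -(-pods // per_node)


print("Prefill, TP=4:", pods_per_node(4), "Pods per node")
print("Decode,  TP=2:", pods_per_node(2), "Pods per node")
print("Nodes for 6 Decode Pods:", nodes_needed(6, tp=2))
```

TP values that divide 8 evenly leave no GPU idle on a full node, which is one reason TP=2 and TP=4 pair well with node-level separation.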

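Item	Auto Mode (node-level separation)	Karpenter + GPU Operator (MIG separation)
Separation unit	Instance (node)	GPU (MIG partition)
GPU utilization	Can be optimized with 4 Decode Pods at TP=2 per node	MIG partitions a single GPU, high utilization
Operational complexity	Low	Medium (GPU Operator + MIG setup)
Scaling	Prefill and Decode scale independently with ease	Reconfiguring MIG within a node causes an interruption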

Minimizing GPU Idle Time

Recommended strategy: validate on Auto Mode first, then move to Karpenter + GPU Operator + MIG when cost optimization is needed.

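llm-d vs NVIDIA Dynamo

llm-d and NVIDIA Dynamo both provide LLM inference routing and scheduling, but they take different approaches. For a detailed comparison, see NVIDIA GPU Stack: llm-d vs Dynamo.

Item	llm-d	NVIDIA Dynamo
Led by	Red Hat (Apache 2.0)	NVIDIA (Apache 2.0)
Architecture	Aggregated + Disaggregated	Aggregated + Disaggregated (equally supported)
KV Cache transfer	NIXL (network transport also supported)	NIXL (very high speed over NVLink/RDMA)
KV Cache indexing	Prefix-aware routing	Flash Indexer (radix tree)
Routing	Gateway API + Envoy EPP	Dynamo Router + its own EPP (Gateway API integration)
Pod scheduling	Default Kubernetes scheduler	KAI Scheduler (GPU-aware Pod placement)
Autoscaling	Integrates with HPA/KEDA	Planner (SLO-based: profiling -> autoscale) + KEDA/HPA
GPU Operator required	Optional (compatible with Auto Mode)	Required (KAI Scheduler depends on ClusterPolicy)
Complexity	Low	High
Strengths	Kubernetes-native, lightweight, fast to adopt	Flash Indexer, KAI Scheduler, Planner SLO autoscaling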

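Selection Guide

EKS Auto Mode + quick start: llm-d (GPU Operator optional)
Small to medium scale (16 GPUs or fewer): llm-d
Large scale (more than 16 GPUs), maximum throughput: Dynamo (Flash Indexer + Planner)
Long context (128K+): Dynamo (3-tier KV Cache: GPU->CPU->SSD)
Kubernetes Gateway API standards compliance: llm-d

Starting with llm-d and moving to Dynamo as scale grows is a realistic path. Dynamo 1.0 can integrate llm-d as an internal component, so rather than a full replacement, Dynamo is better viewed as a superset that includes llm-d.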
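
The selection guide can be condensed into a small decision function. This is one possible reading, not an official rule: the order in which the criteria are checked, the choice of 128,000 tokens as the "128K" cutoff, and treating exactly 16 GPUs as small-to-medium scale are all assumptions made for the sketch.

```python
def choose_stack(gpu_count: int,
                 max_context_tokens: int = 32_768,
                 need_max_throughput: bool = False) -> str:
    """Suggest llm-d or Dynamo following the selection guide above."""
    # Assumption: long-context needs take priority over cluster size.
    if max_context_tokens >= 128_000:
        return "Dynamo"
    # Assumption: "large scale" means strictly more than 16 GPUs.
    if gpu_count > 16 and need_max_throughput:
        return "Dynamo"
    return "llm-d"


print(choose_stack(gpu_count=16))
print(choose_stack(gpu_count=64, need_max_throughput=True))
print(choose_stack(gpu_count=8, max_context_tokens=131_072))
```

Defaulting to llm-d reflects the guide's advice to start there and move to Dynamo only once scale or context length demands it.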

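Migration Path

Phased migration path:

Phase	Configuration	Intended for
Phase 1	Auto Mode + llm-d	PoC, development environments, 16 GPUs or fewer
Phase 1.5	Auto Mode + GPU Operator + llm-d	Stronger monitoring and scheduling
Phase 2a	Karpenter + llm-d Disaggregated	Medium-scale production, MIG usage
Phase 2b	MNG + DRA + llm-d	P6e-GB200, environments where DRA is required
Phase 3	Karpenter + Dynamo	Large scale (more than 16 GPUs), maximum performance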

Migration Notes

Auto Mode and self-managed Karpenter can be mixed in the same cluster. In Phase 1.5, add the label nvidia.com/gpu.deploy.device-plugin: "false" to the Auto Mode NodePool to prevent Device Plugin conflicts.
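
As a rough illustration of where that label goes, the sketch below builds a partial NodePool manifest as a Python dict and prints it as JSON. It is a fragment, not a deployable manifest: requirements, taints, and limits are omitted, the NodePool name `gpu-auto` is hypothetical, and the `nodeClassRef` values assume the built-in `default` NodeClass that Auto Mode provides.

```python
import json


def gpu_nodepool_fragment(name: str = "gpu-auto") -> dict:
    """Partial Auto Mode NodePool showing only the Device Plugin label placement."""
    return {
        "apiVersion": "karpenter.sh/v1",
        "kind": "NodePool",
        "metadata": {"name": name},  # hypothetical name
        "spec": {
            "template": {
                "metadata": {
                    # Node label that tells GPU Operator not to deploy its Device Plugin.
                    "labels": {"nvidia.com/gpu.deploy.device-plugin": "false"},
                },
                "spec": {
                    # Assumption: the AWS-managed default NodeClass is used.
                    "nodeClassRef": {
                        "group": "eks.amazonaws.com",
                        "kind": "NodeClass",
                        "name": "default",
                    },
                },
            },
        },
    }


print(json.dumps(gpu_nodepool_fragment(), indent=2))
```

The label value is the string "false", not a boolean, because Kubernetes label values are always strings.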

Monitoring

Key Monitoring Metrics

Metric	Description	Normal Range
vllm:num_requests_running	Number of currently processing requests	Varies by workload
vllm:num_requests_waiting	Number of waiting requests	< 50
vllm:gpu_cache_usage_perc	GPU KV Cache utilization	60-90%
vllm:avg_generation_throughput_toks_per_s	Tokens generated per second	Varies by model/GPU
vllm:avg_prompt_throughput_toks_per_s	Prompt tokens processed per second	Varies by model/GPU
vllm:e2e_request_latency_seconds	End-to-end request latency	P95 < 30s
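
The normal ranges in the table can be turned into a simple health check. The sketch below assumes the metrics have already been scraped and summarized into a dict; the keys (`num_requests_waiting`, `gpu_cache_usage`, `e2e_latency_p95_s`) are hypothetical shorthand, and KV Cache usage is assumed to be a fraction between 0 and 1.

```python
def check_vllm_sample(sample: dict) -> list[str]:
    """Return warnings for values outside the normal ranges in the table above."""
    warnings = []
    if sample["num_requests_waiting"] >= 50:
        warnings.append("queue depth at or above 50 waiting requests")
    if not 0.60 <= sample["gpu_cache_usage"] <= 0.90:
        warnings.append("KV Cache usage outside the 60-90% range")
    if sample["e2e_latency_p95_s"] >= 30.0:
        warnings.append("P95 end-to-end latency at or above 30s")
    return warnings


healthy = {"num_requests_waiting": 3, "gpu_cache_usage": 0.75, "e2e_latency_p95_s": 12.0}
overloaded = {"num_requests_waiting": 120, "gpu_cache_usage": 0.97, "e2e_latency_p95_s": 45.0}
print(check_vllm_sample(healthy))
print(check_vllm_sample(overloaded))
```

In practice the same thresholds would more likely live in Prometheus alerting rules; the function only makes the ranges explicit.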

Model Loading Time

Expected Time by Model Loading Method

Loading Method	Expected Time	Notes
HuggingFace Hub (initial)	10-20 min	Varies by network speed
S3 Cache	3-5 min	Loading from same-region S3
Node Local Cache	1-2 min	When redeploying on the same node

Cost Optimization

Cost Optimization Strategies

Strategy	Description	Estimated Savings
Savings Plans	1-year/3-year Compute Savings Plans commitment	30-60%
Off-Peak Scale Down	Reduce replicas during nights/weekends (using CronJob)	40-60%
Model Quantization	Reduce GPU count with INT8/INT4	50% GPU cost
Spot Instances	Apply to fault-tolerant workloads (risk of interruption)	60-90%
TP Optimization	Use minimum TP value appropriate for model size	Avoid unnecessary GPUs

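Cost Warning

p5.48xlarge costs about $98.32 per hour (us-west-2 On-Demand). Running two instances comes to about $141,580 per month (2 × $98.32 × 720 hours). Be sure to clean up resources once testing is complete.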
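
The monthly figure follows from a 720-hour month (30 days × 24 hours), which is the basis the numbers above imply. The sketch below reproduces it and lets the hours vary, which is useful for estimating the effect of off-peak scale-down. The 220-hour schedule is a made-up example.

```python
HOURS_PER_MONTH = 720  # 30 days x 24 hours


def monthly_cost(hourly_usd: float, instances: int, hours: float = HOURS_PER_MONTH) -> float:
    """On-Demand cost in USD for `instances` running `hours` per month."""
    return hourly_usd * instances * hours


always_on = monthly_cost(98.32, instances=2)
# Hypothetical schedule: 10 hours a day on 22 working days.
business_hours = monthly_cost(98.32, instances=2, hours=10 * 22)
print(f"always on:      ${always_on:,.0f}")
print(f"business hours: ${business_hours:,.0f}")
```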

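EKS Auto Mode GPU Instance Support Status (verified 2026.04)

Instance Support Matrix

Instance type	GPU	VRAM (total)	Auto Mode support	Verification status
g5.xlarge to 48xlarge	A10G	24-192GB	Supported	Provisioning confirmed
g6.xlarge to 48xlarge	L4	24-192GB	Supported	Provisioning confirmed
g6e.xlarge to 48xlarge	L40S	48-384GB	Supported	Provisioning confirmed
p4d.24xlarge	A100 40GB x 8	320GB	Supported	dry-run confirmed
p5.48xlarge	H100 80GB x 8	640GB	Supported	Spot provisioning confirmed (us-east-2)
p5en.48xlarge	H200 141GB x 8	1,128GB	Limited	dry-run passes, offering match may fail
p6-b200.48xlarge	B200 192GB x 8	1,536GB	Not supported	`NoCompatibleInstanceTypes` occurs

p6 Instances Not Supported

As of April 2026, the managed Karpenter in EKS Auto Mode cannot provision p6-b200.48xlarge. If you need p6 instances, use EKS Standard Mode + Karpenter.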

GPU Capacity Availability by Region

Region	p5.48xlarge On-Demand	p5.48xlarge Spot	Spot price
ap-northeast-2 (Seoul)	InsufficientCapacity	Not verified	--
us-east-2 (Ohio)	Availability fluctuates	Acquired successfully	$13-15/hr

Spot price comparison (us-east-2, 2026.04):

Instance	On-Demand	Spot (lowest)	VRAM	Savings
p5.48xlarge	$55/hr	$12.5/hr	640GB	77%
p5en.48xlarge	~$76/hr	$12.1/hr	1,128GB	84%
p6-b200.48xlarge	$114/hr	$11.4/hr	1,536GB	90%
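
The savings column is one minus the ratio of Spot to On-Demand price. The sketch below recomputes it from the prices in the table; the `PRICES` dict and `savings_pct` helper are names introduced only for this example.

```python
# (On-Demand $/hr, lowest Spot $/hr) from the comparison table above.
PRICES = {
    "p5.48xlarge": (55.0, 12.5),
    "p5en.48xlarge": (76.0, 12.1),
    "p6-b200.48xlarge": (114.0, 11.4),
}


def savings_pct(on_demand: float, spot: float) -> int:
    """Spot savings relative to On-Demand, rounded to a whole percent."""
    return round((1 - spot / on_demand) * 100)


for name, (on_demand, spot) in PRICES.items():
    print(f"{name}: {savings_pct(on_demand, spot)}% cheaper on Spot")
```

Spot prices move with capacity, so the percentages are a snapshot rather than a guaranteed discount.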

GPU Quota Notes

Quota name	Applicable instances	Default value (vCPUs)
Running On-Demand P instances	p4d, p4de, p5, p5en	384
Running On-Demand G and VT instances	g5, g6, g6e	64
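
To translate a quota into a number of instances, divide it by the instance's vCPU count. The sketch below assumes the quota values are counted in vCPUs, as EC2 On-Demand quotas are, and uses the 192 vCPUs listed for p5.48xlarge in the instance specifications. The helper name is hypothetical.

```python
def max_instances(quota_vcpus: int, vcpus_per_instance: int) -> int:
    """Whole instances that fit under a vCPU-based On-Demand quota."""
    return quota_vcpus // vcpus_per_instance


P_QUOTA = 384       # default from the table above
G_QUOTA = 64        # default from the table above
P5_48XL_VCPUS = 192

print("p5.48xlarge under the default P quota:", max_instances(P_QUOTA, P5_48XL_VCPUS))
# Any instance size with more than 64 vCPUs cannot launch under the default G quota.
print("192-vCPU instance under the default G quota:", max_instances(G_QUOTA, 192))
```

Under the default P quota, exactly two p5.48xlarge instances fit, which matches the two-node default deployment; scaling beyond that requires a quota increase first.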

G Instance Quota Pitfall

When a GPU NodePool sets instance-category: [g, p], Karpenter may try G-type instances first. To use only P-type instances, specify instance-category: [p] explicitly.

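Next Steps

EKS GPU Node Strategy -- Auto Mode vs Karpenter vs Hybrid Node, cost analysis by model size
vLLM Model Serving and Performance Optimization -- vLLM basics and deployment
MoE Model Serving Guide -- Serving Mixture of Experts models
GPU Resource Management -- GPU cluster resource management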
