Disaggregated Serving + LWS Multi-Node
Prefill/Decode disaggregation architecture, the NIXL-based common KV transfer engine, and a guide to LeaderWorkerSet-based multi-node deployment of 700B+ MoE models
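Before the per-page summaries below, a minimal sketch of the kind of LeaderWorkerSet resource such a multi-node deployment is built around, submitted through the Kubernetes Python client; the image, GPU count, namespace, and group size are placeholder assumptions, not values from the guide.

```python
# Sketch only: create a LeaderWorkerSet for a two-pod vLLM serving group.
# Image, GPU count, and namespace are illustrative placeholders.
from kubernetes import client, config


def vllm_pod(role: str) -> dict:
    """Bare-bones pod template; real setups add RDMA/EFA, shared memory, probes, etc."""
    return {
        "spec": {
            "containers": [
                {
                    "name": f"vllm-{role}",
                    "image": "vllm/vllm-openai:latest",          # placeholder image
                    "resources": {"limits": {"nvidia.com/gpu": "8"}},
                }
            ]
        }
    }


lws = {
    "apiVersion": "leaderworkerset.x-k8s.io/v1",
    "kind": "LeaderWorkerSet",
    "metadata": {"name": "vllm-moe", "namespace": "inference"},
    "spec": {
        "replicas": 1,                     # number of serving groups
        "leaderWorkerTemplate": {
            "size": 2,                     # pods per group: 1 leader + 1 worker
            "leaderTemplate": vllm_pod("leader"),
            "workerTemplate": vllm_pod("worker"),
        },
    },
}

config.load_kube_config()                  # assumes a working kubeconfig
client.CustomObjectsApi().create_namespaced_custom_object(
    group="leaderworkerset.x-k8s.io",
    version="v1",
    namespace="inference",
    plural="leaderworkersets",
    body=lws,
)
```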
vLLM·llm-d·MoE·NeMo — the AI framework layer that performs model serving, distributed inference, and fine-tuning on GPUs
A benchmark plan comparing Bedrock AgentCore as the baseline against self-managed EKS (vLLM, llm-d, Bifrost/LiteLLM) across features, performance, and cost
A summary of core vLLM technologies such as PagedAttention, Continuous Batching, and FP8 KV Cache, plus a comparison of KV Cache-Aware Routing and Gateway configuration in llm-d and NVIDIA Dynamo (an FP8 KV cache sketch follows this list)
llm-d architecture concepts, KV Cache-aware routing, Disaggregated Serving, EKS Auto Mode integration strategy
A model serving guide divided into a GPU infrastructure layer and an inference/training framework layer
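To make the FP8 KV Cache item in the list above concrete, here is a minimal sketch using vLLM's offline LLM API; the model name and memory fraction are placeholder assumptions rather than values taken from the summarized pages.

```python
# Sketch only: enable an FP8 KV cache in vLLM to roughly halve KV memory per token.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    kv_cache_dtype="fp8",                      # store the KV cache in FP8 instead of FP16/BF16
    gpu_memory_utilization=0.90,               # fraction of GPU memory for weights + KV cache
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```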