Disaggregated Serving + LWS Multi-Node
A guide to Prefill/Decode disaggregation architecture, the NIXL KV cache transfer engine, and LeaderWorkerSet-based multi-node deployment of 700B+ large MoE models
Kubernetes Pod CPU/Memory resource configuration, QoS classes, VPA/HPA autoscaling, and resource right-sizing strategies
Two-tier GPU autoscaling, DCGM/vLLM metrics monitoring, Bifrost→Bedrock cascade fallback, on-premises integration with Hybrid Nodes, and lessons learned from large MoE deployments
EKS architecture overview for maximizing LLM inference performance — the starting point for vLLM, KV Cache-Aware Routing, Disaggregated Serving, LWS multi-node, and Hybrid Node integration
A summary of core technologies such as vLLM PagedAttention, Continuous Batching, and FP8 KV Cache, plus a comparison of llm-d and NVIDIA Dynamo for KV Cache-Aware Routing and Gateway configuration
FinOps strategies for achieving 30-90% cost reduction in Amazon EKS environments, covering cost structure analysis, Karpenter optimization, tool selection, and real-world success cases
vLLM PagedAttention, parallelization strategies, Multi-LoRA, and hardware support architecture