
KV Cache Optimization (vLLM Deep Dive + Cache-Aware Routing)

Overview

The performance of an LLM inference engine depends largely on how efficiently the KV cache (key-value cache) is managed. This document covers vLLM's core technology stack and GPU memory design principles, as well as KV cache-aware routing strategies (llm-d vs. NVIDIA Dynamo) for sharing and reusing KV cache across multiple Pods.

vLLM Deep Dive

Core Technology Stack

vLLM (v0.19.x) is currently the most widely used LLM inference engine. Its core technologies and their performance impact are summarized below.

| Technology | Performance Impact | Description |
| --- | --- | --- |
| PagedAttention | 60-80% KV cache memory reduction | Stores KV cache in non-contiguous blocks using OS virtual-memory techniques |
| Continuous Batching | 2-24x throughput improvement | Dynamically adds/removes requests at the iteration level |
| FP8 KV Cache | 2x memory reduction | Stores KV cache in FP8 precision (v0.6+) |
| Prefix Caching | 400%+ improvement for repeated prompts | Reuses the KV cache of common system prompts |
| Speculative Decoding | 2-3x speed improvement | Small draft model predicts tokens, main model validates |
| Chunked Prefill | TTFT/throughput balance improvement | Processes prefill and decode mixed in the same batch |

GPU Memory Calculation

Accurate GPU memory calculation is required before model deployment.

Required GPU Memory = Model Weights + Non-torch Memory + PyTorch Activation + (KV Cache × Batch Size)

Memory Requirements by Precision:

| Precision | Bytes per Parameter | 70B Model | 32B Model |
| --- | --- | --- | --- |
| FP32 | 4 | 280GB | 128GB |
| BF16/FP16 | 2 | 140GB | 64GB |
| INT8 | 1 | 70GB | 32GB |
| INT4 | 0.5 | 35GB | 16GB |
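
As a rough illustration of the formula and table above, the sketch below estimates weight and per-sequence KV cache memory for a dense GQA transformer. The Qwen3-32B-like shape (64 layers, 8 KV heads, head dim 128) and the fixed overhead figure are assumptions for the example, not measured values.

```python
# Rough GPU memory estimate for a dense transformer (illustrative sketch only).

def weight_gib(params_b: float, bytes_per_param: float) -> float:
    """Model weights in GiB, given parameters in billions."""
    return params_b * 1e9 * bytes_per_param / 2**30

def kv_cache_gib_per_seq(layers: int, kv_heads: int, head_dim: int,
                         seq_len: int, kv_bytes: float) -> float:
    """KV cache for one sequence: 2 (K and V) x layers x kv_heads x head_dim x seq_len."""
    return 2 * layers * kv_heads * head_dim * seq_len * kv_bytes / 2**30

if __name__ == "__main__":
    weights = weight_gib(32, 1)                           # 32B model, FP8 weights (1 byte/param)
    kv_seq = kv_cache_gib_per_seq(64, 8, 128, 32_768, 1)  # FP8 KV cache, 32K context
    overhead_gib = 6                                      # assumed non-torch + activation memory
    batch = 8
    print(f"weights≈{weights:.0f} GiB, KV/seq≈{kv_seq:.0f} GiB, "
          f"total≈{weights + overhead_gib + kv_seq * batch:.0f} GiB for batch {batch}")
```

With those assumed numbers the total lands around 68 GiB, which is why a single 80GB H100 suffices for the FP8 32B entry in the parallelization table below.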

Parallelization Strategy Selection Criteria

Recommended Configuration by Model Size:

| Model Example | Parameters | Precision | GPU Configuration | Parallelization |
| --- | --- | --- | --- | --- |
| Qwen3-32B | 32B | FP8 | 1× H100 80GB | None |
| Llama-3.3-70B | 70B | BF16 | 4× H100 (TP=4) | Tensor Parallel |
| Kimi K2.5 | 1T MoE (32B active) | INT4 | 8× H100 (TP=8) | Tensor + Expert Parallel |
| GLM-5 | 744B MoE (40B active) | FP8 | 16× H100 (PP=2, TP=8) | Pipeline + Tensor Parallel |
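
The GPU counts above follow from the same memory arithmetic. Below is a minimal sketch of one possible heuristic for picking a tensor-parallel degree: reserve per-GPU headroom for KV cache and overhead, then round the required GPU count up to a power of two. The 30% headroom value is an assumption, and a real choice must also keep attention heads evenly divisible across GPUs.

```python
import math

def pick_tensor_parallel(weights_gib: float, gpu_gib: float = 80.0,
                         kv_headroom: float = 0.3) -> int:
    """Smallest power-of-two TP degree whose aggregate memory holds the weights
    while leaving `kv_headroom` of each GPU free for KV cache and overhead."""
    usable_per_gpu = gpu_gib * (1 - kv_headroom)
    gpus_needed = math.ceil(weights_gib / usable_per_gpu)
    return 1 << (gpus_needed - 1).bit_length()  # round up to a power of two

print(pick_tensor_parallel(30))    # ~30 GiB weights (32B in FP8)    -> 1
print(pick_tensor_parallel(140))   # ~140 GiB weights (70B in BF16)  -> 4
```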

Core Performance Parameters

```bash
# Flags: --gpu-memory-utilization = fraction of VRAM vLLM pre-allocates (default 0.9);
# --max-model-len = maximum sequence length (directly affects KV cache size);
# --enable-prefix-caching = reuse KV cache for common prefixes;
# --kv-cache-dtype fp8 = ~2x KV cache memory reduction; --enable-auto-tool-choice and
# --tool-call-parser hermes = automatic tool calling with the Hermes parser.
vllm serve Qwen/Qwen3-32B-FP8 \
  --gpu-memory-utilization=0.95 \
  --max-model-len=32768 \
  --enable-prefix-caching \
  --kv-cache-dtype=fp8 \
  --enable-auto-tool-choice \
  --tool-call-parser=hermes
```
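
Once the server is up, `--enable-prefix-caching` pays off when many requests share a long, identical system prompt. Below is a minimal client sketch against the OpenAI-compatible API that `vllm serve` exposes; the localhost URL and port assume the default settings, and the system prompt is a placeholder.

```python
from openai import OpenAI

# vllm serve exposes an OpenAI-compatible API (default: http://localhost:8000/v1).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

SYSTEM_PROMPT = "You are the support assistant for ExampleCorp. Follow policies X, Y, Z ..."  # placeholder

for question in ["How do I reset my password?", "Where can I download my invoice?"]:
    resp = client.chat.completions.create(
        model="Qwen/Qwen3-32B-FP8",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},  # identical prefix -> prefill is reused
            {"role": "user", "content": question},
        ],
    )
    print(resp.choices[0].message.content)
```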

Quantization Strategy Comparison

| Quantization | Memory Reduction | Quality Loss | Inference Speed | Recommended Scenario |
| --- | --- | --- | --- | --- |
| FP8 | 50% | Minimal | Fast | Production default (quality priority) |
| AWQ | 75% | Low | Very fast | Cost optimization |
| GPTQ | 75% | Low | Fast | Offline quantization |
| GGUF | 50-75% | Low to medium | Fast | Various precision options |
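
For AWQ/GPTQ checkpoints, the quantization method can also be selected through vLLM's offline Python API, as in the sketch below. The model name is a placeholder, and recent vLLM versions usually detect the scheme from the checkpoint's config, so the explicit `quantization` argument is often optional.

```python
from vllm import LLM, SamplingParams

# Placeholder checkpoint name; substitute a real AWQ-quantized model.
llm = LLM(model="example-org/example-32b-awq", quantization="awq", max_model_len=8192)

outputs = llm.generate(
    ["Summarize the trade-offs of KV cache quantization in one sentence."],
    SamplingParams(temperature=0.2, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```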

KV Cache-Aware Routing

The Problem: Round-Robin Limitations

Conventional vLLM deployments rely on simple round-robin load balancing. When requests that share the same system prompt land on a different Pod each time, every Pod repeats the identical prefill work, which wastes GPU computation and increases TTFT.

Solution: KV Cache State-Aware Routing

llm-d and NVIDIA Dynamo recognize the KV Cache state of each vLLM Pod and route requests with identical prefixes to Pods that already hold the corresponding KV Cache.
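
Conceptually, the router keeps a per-Pod index of which prompt-prefix blocks are likely cached and prefers the Pod with the longest prefix hit, falling back to load when nothing matches. The sketch below illustrates that idea only; it is not the actual llm-d or Dynamo scheduler code, and the block size and tie-breaking rule are assumptions.

```python
import hashlib
from collections import defaultdict

BLOCK_TOKENS = 16  # assumed KV block granularity

def block_hashes(tokens: list[int]) -> list[str]:
    """Cumulative hash per full token block, mirroring how paged KV blocks chain on their prefix."""
    hashes, running = [], hashlib.sha256()
    full = len(tokens) - len(tokens) % BLOCK_TOKENS
    for i in range(0, full, BLOCK_TOKENS):
        running.update(str(tokens[i:i + BLOCK_TOKENS]).encode())
        hashes.append(running.hexdigest())
    return hashes

class CacheAwareRouter:
    def __init__(self, pods: list[str]):
        self.pods = pods
        self.cached = defaultdict(set)   # pod -> block hashes believed to be resident
        self.load = defaultdict(int)     # pod -> in-flight requests

    def route(self, tokens: list[int]) -> str:
        hashes = block_hashes(tokens)

        def prefix_hits(pod: str) -> int:
            hits = 0
            for h in hashes:             # count leading blocks already cached on this pod
                if h not in self.cached[pod]:
                    break
                hits += 1
            return hits

        # Longest prefix hit wins; ties go to the least-loaded pod (plain LB fallback).
        pod = max(self.pods, key=lambda p: (prefix_hits(p), -self.load[p]))
        self.cached[pod].update(hashes)  # the pod will hold these blocks after prefill
        self.load[pod] += 1
        return pod
```

In llm-d this scoring runs in the Endpoint Picker (EPP) behind the Gateway; Dynamo's Flash Indexer plays the analogous role.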

Effects of KV Cache-Aware Routing:

| Scenario | TTFT Improvement | GPU Computation Reduction | Throughput Improvement |
| --- | --- | --- | --- |
| Same system prompt | 50-80% reduction | Prefill skipped | 400%+ |
| RAG with repeated context | 30-60% reduction | Partial reuse | 200%+ |
| Completely random requests | No change | None | None (LB fallback) |

llm-d vs NVIDIA Dynamo Comparison

Both projects provide KV Cache-aware routing but with different approaches.

| Item | llm-d v0.5+ | NVIDIA Dynamo v1.0 |
| --- | --- | --- |
| Led by | Red Hat (Apache 2.0) | NVIDIA (Apache 2.0) |
| KV Cache Indexing | Prefix-aware routing | Flash Indexer (radix tree) |
| KV Cache Transfer | NIXL (network) | NIXL (NVLink/RDMA, ultra-fast) |
| Routing | Gateway API + Envoy EPP | Dynamo Router + custom EPP |
| Pod Scheduling | K8s default scheduler | KAI Scheduler (GPU-aware) |
| Autoscaling | HPA/KEDA integration | Planner (SLO-based profiling) |
| KV Cache Tiering | Memory only | 3-tier: GPU→CPU→SSD |
| Complexity | Low | High |
| Benchmark Performance | Lightweight, K8s native | 7x (Flash Indexer + Planner) |

Selection Criteria

  • Small to medium scale (≤16 GPUs): llm-d — rapid adoption, K8s Gateway API native
  • Large scale (16+ GPUs), maximum throughput: Dynamo — Flash Indexer, SLO-based autoscaling
  • Long context (128K+): Dynamo — 3-tier KV cache (GPU→CPU→SSD)
  • Gradual transition: start with llm-d → switch to Dynamo when scaling up (both use NIXL)

Gateway Architecture: llm-d Deployment Configuration

References

Official Documentation

Papers & Technical Blogs