KV Cache Optimization (vLLM Deep Dive + Cache-Aware Routing)
Overview
LLM inference engine performance largely depends on how efficiently KV Cache (Key-Value Cache) is managed. This document covers vLLM's core technology stack and GPU memory design principles, as well as KV Cache-Aware Routing strategies (llm-d vs NVIDIA Dynamo) for sharing and reusing KV Cache across multiple Pods.
vLLM Deep Dive
Core Technology Stack
vLLM (v0.19.x) is currently the most widely used LLM inference engine. Core technologies and performance impacts are as follows.
| Technology | Performance Impact | Description |
|---|---|---|
| PagedAttention | 60-80% KV Cache memory reduction | Stores KV cache in non-contiguous blocks using OS virtual memory techniques |
| Continuous Batching | 2-24x throughput improvement | Dynamically adds/removes requests at iteration level |
| FP8 KV Cache | 2x memory reduction | Stores KV cache in FP8 precision (v0.6+) |
| Prefix Caching | 400%+ improvement for repeated prompts | Reuses KV cache of common system prompts |
| Speculative Decoding | 2-3x speed improvement | Small draft model predicts tokens, main model validates |
| Chunked Prefill | Balances TTFT (time to first token) and throughput | Mixes prefill and decode work in the same batch |
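To make the PagedAttention row concrete, here is a minimal Python sketch of the block-table indirection it relies on. This is not vLLM internals; the class names are invented, and only the 16-token block size mirrors vLLM's default.

```python
# Conceptual sketch of PagedAttention's block-table indirection.
# Not vLLM code: pool size and names are illustrative only.

BLOCK_SIZE = 16  # tokens per KV block (vLLM's default block size)

class BlockAllocator:
    """Hands out fixed-size physical KV blocks from a free pool."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        return self.free.pop()  # any free block will do; blocks need not be contiguous

class Sequence:
    """A request's logical token stream mapped to physical blocks via a block table."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % BLOCK_SIZE == 0:   # current block is full (or first token)
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def physical_location(self, token_idx: int) -> tuple[int, int]:
        """Where token_idx's KV vectors live: (physical block id, offset in block)."""
        return self.block_table[token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE

allocator = BlockAllocator(num_blocks=1024)
seq = Sequence(allocator)
for _ in range(40):                   # 40 tokens -> 3 blocks, possibly non-contiguous
    seq.append_token()
print(seq.block_table, seq.physical_location(35))
```

Because blocks are allocated on demand and need not be contiguous, memory is reserved per block rather than per maximum sequence length, which is where the 60-80% KV cache savings come from.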
GPU Memory Calculation
Accurate GPU memory calculation is required before model deployment.
Required GPU Memory = Model Weights + Non-torch Memory + PyTorch Activation + (KV Cache × Batch Size)
Memory Requirements by Precision:
| Precision | Bytes per Parameter | 70B Model | 32B Model |
|---|---|---|---|
| FP32 | 4 | 280GB | 128GB |
| BF16/FP16 | 2 | 140GB | 64GB |
| INT8 | 1 | 70GB | 32GB |
| INT4 | 0.5 | 35GB | 16GB |
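As a worked example of the formula above, the sketch below estimates weight memory from the precision table and per-token KV cache size as 2 × layers × KV heads × head_dim × bytes per element. The Llama-3-70B architecture numbers (80 layers, 8 KV heads, head_dim 128) are assumptions taken from its published config; substitute your own model's values.

```python
# Rough GPU memory estimator for the formula above (illustrative, not a sizing tool).
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1, "int4": 0.5}
GB = 1e9

def weight_memory_gb(num_params_b: float, precision: str) -> float:
    """Weight footprint in GB for a parameter count given in billions."""
    return num_params_b * 1e9 * BYTES_PER_PARAM[precision] / GB

def kv_cache_per_token_bytes(num_layers: int, num_kv_heads: int,
                             head_dim: int, kv_dtype_bytes: int = 2) -> int:
    """Per-token KV cache: 2 (K and V) x layers x KV heads x head_dim x bytes/element."""
    return 2 * num_layers * num_kv_heads * head_dim * kv_dtype_bytes

# Llama-3.3-70B in BF16 with a BF16 KV cache (architecture values assumed from its config)
weights = weight_memory_gb(70, "bf16")                 # 140 GB, matching the table above
per_token = kv_cache_per_token_bytes(80, 8, 128, 2)    # 327,680 bytes (~320 KB per token)
per_seq = per_token * 32_768 / GB                      # ~10.7 GB for one full 32K-token sequence

print(f"weights: {weights:.0f} GB, KV/token: {per_token} B, KV/32K seq: {per_seq:.1f} GB")
```

Switching the KV cache to FP8 (`kv_dtype_bytes=1`) halves the per-token figure, which is exactly the 2x reduction claimed in the technology table.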
Parallelization Strategy Selection Criteria
Recommended Configuration by Model Size:
| Model Example | Parameters | Precision | GPU Configuration | Parallelization |
|---|---|---|---|---|
| Qwen3-32B | 32B | FP8 | 1× H100 80GB | None |
| Llama-3.3-70B | 70B | BF16 | 4× H100 (TP=4) | Tensor Parallel |
| Kimi K2.5 | 1T MoE (32B active) | INT4 | 8× H100 (TP=8) | Tensor + Expert Parallel |
| GLM-5 | 744B MoE (40B active) | FP8 | 16× H100 (PP=2, TP=8) | Pipeline + Tensor Parallel |
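A rough sanity check behind this table: divide the weight footprint (plus headroom for KV cache and activations) by per-GPU memory to get a lower bound on the GPU count. The helper below is a rule-of-thumb sketch, not a sizing tool; the 1.3x headroom factor is an assumption, not a vLLM recommendation.

```python
# Rule-of-thumb lower bound on tensor-parallel degree (illustrative only).
import math

def min_gpus(num_params_b: float, bytes_per_param: float,
             gpu_mem_gb: float = 80, headroom: float = 1.3) -> int:
    weights_gb = num_params_b * bytes_per_param     # e.g. 70B x 2 bytes = 140 GB
    needed_gb = weights_gb * headroom               # leave room for KV cache and activations
    return math.ceil(needed_gb / gpu_mem_gb)

print(min_gpus(70, 2))   # Llama-3.3-70B BF16 -> 3; round up to TP=4 in practice
print(min_gpus(32, 1))   # Qwen3-32B FP8      -> 1 (fits a single H100 80GB)
```

In practice the tensor-parallel degree is rounded up to a power of two that divides the model's attention heads, which is why the table pairs the 70B model with TP=4 rather than 3.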
Core Performance Parameters
```bash
vllm serve Qwen/Qwen3-32B-FP8 \
  --gpu-memory-utilization=0.95 \
  --max-model-len=32768 \
  --enable-prefix-caching \
  --kv-cache-dtype=fp8 \
  --enable-auto-tool-choice \
  --tool-call-parser=hermes
```
- `--gpu-memory-utilization`: fraction of GPU memory vLLM may use for weights, activations, and KV cache (default 0.9)
- `--max-model-len`: maximum sequence length (directly determines KV cache size per request)
- `--enable-prefix-caching`: reuse KV cache for common prefixes
- `--kv-cache-dtype=fp8`: 2x memory reduction for the KV cache
- `--enable-auto-tool-choice` / `--tool-call-parser=hermes`: automatic tool calling with the Hermes parser
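Once the server above is running, it exposes an OpenAI-compatible API (port 8000 by default). The client sketch below uses the official openai Python SDK; the base URL and placeholder API key assume a local, unauthenticated deployment.

```python
# Minimal client for the vLLM OpenAI-compatible server started above.
# Assumes the server is reachable at localhost:8000 (vLLM's default port);
# the API key is a placeholder since local vLLM does not require one by default.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="Qwen/Qwen3-32B-FP8",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},  # shared prefix -> prefix-cache hit
        {"role": "user", "content": "Summarize PagedAttention in one sentence."},
    ],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```

Because `--enable-prefix-caching` is set, repeated requests that share the same system prompt skip the prefill for that prefix on subsequent calls.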
Quantization Strategy Comparison
| Quantization | Memory Reduction | Quality Loss | Inference Speed | Recommended Scenario |
|---|---|---|---|---|
| FP8 | 50% | Minimal | Fast | Production default (quality priority) |
| AWQ | 75% | Low | Very fast | Cost optimization |
| GPTQ | 75% | Low | Fast | Offline quantization |
| GGUF | 50-75% | Low to Medium | Fast | Various precision options |
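For quantized checkpoints, the offline vLLM API accepts an explicit quantization argument (the method is usually also auto-detected from the checkpoint config). The snippet below is a minimal sketch; the model name is a placeholder, not a real repository.

```python
# Minimal sketch: loading an AWQ-quantized checkpoint with vLLM's offline API.
# The model name is a placeholder; quantization is passed explicitly here,
# though vLLM typically detects it from the checkpoint's config.
from vllm import LLM, SamplingParams

llm = LLM(model="org/model-32B-AWQ", quantization="awq", max_model_len=8192)
outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```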
KV Cache-Aware Routing
The Problem: Round-Robin Limitations
Typical vLLM deployments rely on simple round-robin load balancing. When requests that share the same system prompt land on a different Pod each time, every Pod repeats the identical prefill computation, which wastes GPU compute and increases TTFT.
Solution: KV Cache State-Aware Routing
llm-d and NVIDIA Dynamo recognize the KV Cache state of each vLLM Pod and route requests with identical prefixes to Pods that already hold the corresponding KV Cache.
Effects of KV Cache-Aware Routing:
| Scenario | TTFT Improvement | GPU Computation Reduction | Throughput Improvement |
|---|---|---|---|
| Same system prompt | 50-80% reduction | Prefill skip | 400%+ |
| RAG repeated context | 30-60% reduction | Partial reuse | 200%+ |
| Completely random requests | No change | None | LB fallback |
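The routing idea itself can be sketched independently of either project: hash the prompt's token blocks, track which Pod holds which block hashes, score Pods by the length of the matched prefix, and fall back to the least-loaded Pod when nothing matches. The Python below is a conceptual toy, not llm-d or Dynamo code; all names are invented.

```python
# Conceptual sketch of KV cache-aware routing (not llm-d or Dynamo code).
# Requests are scored against each Pod by how many leading prompt blocks that
# Pod already has cached; ties fall back to the least-loaded Pod.
import hashlib

BLOCK_SIZE = 16  # tokens per KV block, mirroring the engine's block size

def block_hashes(token_ids: list[int]) -> list[str]:
    """Chained hashes of the prompt's blocks, so each hash identifies a full prefix."""
    hashes, prev = [], ""
    for i in range(0, len(token_ids) - len(token_ids) % BLOCK_SIZE, BLOCK_SIZE):
        block = token_ids[i:i + BLOCK_SIZE]
        prev = hashlib.sha256((prev + str(block)).encode()).hexdigest()
        hashes.append(prev)
    return hashes

class Router:
    def __init__(self, pods: list[str]):
        self.cached: dict[str, set[str]] = {p: set() for p in pods}   # pod -> cached block hashes
        self.load: dict[str, int] = {p: 0 for p in pods}              # pod -> active requests

    def route(self, token_ids: list[int]) -> str:
        hashes = block_hashes(token_ids)

        def matched_prefix(pod: str) -> int:
            n = 0
            for h in hashes:                  # count consecutive leading blocks already cached
                if h not in self.cached[pod]:
                    break
                n += 1
            return n

        # Prefer the longest cached prefix; break ties by current load.
        pod = max(self.cached, key=lambda p: (matched_prefix(p), -self.load[p]))
        self.cached[pod].update(hashes)       # the chosen pod will now hold these blocks
        self.load[pod] += 1
        return pod

router = Router(["pod-a", "pod-b"])
system_prompt = list(range(64))                   # stand-in for a tokenized system prompt
print(router.route(system_prompt + [1, 2, 3]))    # first request: picked by load
print(router.route(system_prompt + [4, 5, 6]))    # same prefix: routed to the same pod
```

The production systems differ mainly in how this index is built and kept fresh (Envoy EPP scoring in llm-d, Flash Indexer in Dynamo), but the routing decision follows the same shape.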
llm-d vs NVIDIA Dynamo Comparison
Both projects provide KV Cache-aware routing but with different approaches.
| Item | llm-d v0.5+ | NVIDIA Dynamo v1.0 |
|---|---|---|
| Led by | Red Hat (Apache 2.0) | NVIDIA (Apache 2.0) |
| KV Cache Indexing | Prefix-aware routing | Flash Indexer (radix tree) |
| KV Cache Transfer | NIXL (Network) | NIXL (NVLink/RDMA ultra-fast) |
| Routing | Gateway API + Envoy EPP | Dynamo Router + custom EPP |
| Pod Scheduling | K8s default scheduler | KAI Scheduler (GPU-aware) |
| Autoscaling | HPA/KEDA integration | Planner (SLO-based profiling) |
| KV Cache Tiering | Memory only | 3-tier: GPU→CPU→SSD |
| Complexity | Low | High |
| Benchmark Performance | Lightweight, K8s native | 7x (Flash Indexer + Planner) |
Selection Guide:
- Small to medium scale (≤16 GPUs): llm-d — quick to adopt, K8s Gateway API native
- Large scale (16+ GPUs), maximum throughput: Dynamo — Flash Indexer, SLO-based autoscaling
- Long context (128K+): Dynamo — 3-tier KV Cache (GPU→CPU→SSD)
- Gradual transition: start with llm-d, then switch to Dynamo when scaling up (both use NIXL)
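The comparison above credits Dynamo's Flash Indexer with radix-tree indexing. As a rough illustration of why a prefix tree suits this job, the toy below stores cached token-block sequences per Pod in a trie, so the longest cached prefix for a new request is found in a single downward walk. This is a from-scratch sketch, not NVIDIA's implementation.

```python
# Toy radix/prefix tree over token blocks (not NVIDIA's Flash Indexer).
# One downward walk yields the longest cached prefix and the Pods that hold it.
class TrieNode:
    def __init__(self):
        self.children: dict[tuple, "TrieNode"] = {}   # token block -> child node
        self.pods: set[str] = set()                   # Pods caching the prefix ending here

class PrefixIndex:
    def __init__(self, block_size: int = 16):
        self.root = TrieNode()
        self.block_size = block_size

    def _blocks(self, tokens: list[int]) -> list[tuple]:
        return [tuple(tokens[i:i + self.block_size])
                for i in range(0, len(tokens) - len(tokens) % self.block_size, self.block_size)]

    def insert(self, tokens: list[int], pod: str):
        node = self.root
        for block in self._blocks(tokens):
            node = node.children.setdefault(block, TrieNode())
            node.pods.add(pod)

    def longest_prefix(self, tokens: list[int]) -> tuple[int, set[str]]:
        """Number of matched blocks and the Pods holding that longest cached prefix."""
        node, matched, pods = self.root, 0, set()
        for block in self._blocks(tokens):
            if block not in node.children:
                break
            node = node.children[block]
            matched, pods = matched + 1, node.pods
        return matched, pods

idx = PrefixIndex()
idx.insert(list(range(64)), "pod-a")                      # pod-a caches a 4-block prefix
print(idx.longest_prefix(list(range(64)) + [9, 9, 9]))    # -> (4, {'pod-a'})
```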
Gateway Architecture: llm-d Deployment Configuration
References
Official Documentation
- vLLM Official Documentation — Optimization and tuning guide
- vLLM GitHub — v0.19.x release notes
- llm-d GitHub — K8s native distributed inference
- NVIDIA Dynamo — Distributed inference framework
Papers & Technical Blogs
- PagedAttention Paper (SOSP 2023) — "Efficient Memory Management for Large Language Model Serving with PagedAttention"
- Flash Indexer Design (NVIDIA) — Radix tree-based KV Cache indexing
- Red Hat llm-d Blog — KV Cache-aware routing design
Related Documentation
- Disaggregated Serving + LWS Multi-Node — Prefill/Decode separation, NIXL KV transfer
- GPU Resources · Observability · Hybrid Node · Lessons Learned — KEDA scaling, monitoring
- vLLM-based FM Deployment and Performance Optimization — vLLM detailed guide
- llm-d-based EKS Distributed Inference — llm-d deployment guide
- NVIDIA GPU Software Stack — GPU Operator, DCGM, Dynamo