7 docs tagged with "kv-cache"

Cache-Hit Strategy

Unifying the three layers of inference caching (KV/Prefix, Prompt, Semantic) into a single decision framework with hit-rate targets, measurement points, and tuning levers for each layer.

HyperPod Inference Operator (Managed KV Cache and Intelligent Routing)

Comparing SageMaker HyperPod Inference Operator's managed KV cache, intelligent routing, and DPD with a Tiered Gateway, and clarifying its role and limitations as an L2 inference routing layer.

KV Cache Optimization (vLLM Deep Dive + Cache-Aware Routing)

A summary of core technologies such as vLLM PagedAttention, continuous batching, and FP8 KV cache, and a comparison of KV cache-aware routing and gateway configuration in llm-d and NVIDIA Dynamo.

llm-d Based EKS Distributed Inference Guide

llm-d architecture concepts, KV cache-aware routing, disaggregated serving, and an EKS Auto Mode integration strategy.

LMCache: KV Cache Offloading and Sharing

The concept of LMCache, which offloads KV cache beyond GPU memory to CPU and disk and shares it across inference instances, and its relationship to vLLM prefix cache, NIXL, and KV cache-aware routing.

Model Serving & Inference Infrastructure

A guide to the GPU infrastructure, inference framework, and inference optimization layers, with a single map of the end-to-end LLM inference request path and per-layer tuning levers — inference gateway, prefill/decode disaggregation, KV cache-aware routing, LMCache, and cache-hit strategy.

NVIDIA Dynamo Inference Benchmark

A benchmark comparing aggregated and disaggregated LLM serving performance with NVIDIA Dynamo, running four AIPerf modes in an EKS environment.