vLLM Model Serving
Overview
vLLM is a high-performance LLM inference engine. Its PagedAttention algorithm eliminates the 60-80% KV cache memory waste typical of traditional serving engines, and its Continuous Batching scheduler delivers 2-24x higher throughput. Major companies including Meta, Mistral AI, Cohere, and IBM use it in production, and its OpenAI-compatible API lets existing applications migrate with minimal code changes.
📌 Current Version: vLLM v0.18+ / v0.19.x (as of 2026-04)
Why vLLM Became the Standard
Traditional LLM serving engines statically allocated KV cache memory, resulting in 60-80% memory waste. Static batching waited until a fixed number of requests accumulated, leading to long GPU idle times. vLLM eliminates these two fundamental bottlenecks, providing up to 24x higher throughput on the same hardware.
vLLM core innovations:
- PagedAttention: Inspired by OS virtual memory management, manages KV cache as non-contiguous blocks
- Continuous Batching: Removes batch boundaries and dynamically adds/removes requests at the iteration level
- OpenAI API Compatible: Migration possible without changing existing application code
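As a quick illustration of the API compatibility, an existing OpenAI client only needs its `base_url` repointed at the vLLM server; a minimal sketch (the endpoint and served model name are placeholders):

```python
from openai import OpenAI

# Point the standard OpenAI client at a local vLLM server.
client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="EMPTY",  # ignored unless the server was started with --api-key
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",  # must match the served model
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```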
Core Architecture
PagedAttention and KV Cache Management
Due to the autoregressive nature of Transformer architecture, each request must store key-value pairs from previous tokens. This KV cache grows linearly with input sequence length and concurrent users. Traditional approaches pre-allocate memory for maximum length, wasting space regardless of actual usage.
vLLM's PagedAttention divides KV cache into fixed-size blocks stored non-contiguously. Short requests allocate fewer blocks; longer ones allocate additional blocks as needed. Block tables maintain logical ordering, eliminating memory fragmentation.
Memory efficiency improvement:
- Traditional: Pre-allocate max sequence length × batch size → 60-80% waste
- PagedAttention: Dynamically allocate only actual usage → waste eliminated
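To make the paging analogy concrete, here is a toy sketch of the block-table bookkeeping (illustrative only; the block size and data structures are not vLLM's internal API):

```python
BLOCK_SIZE = 16  # tokens per KV cache block (illustrative)

class BlockTable:
    """Maps a request's logical blocks to physical GPU blocks, like a page table."""

    def __init__(self, free_blocks: list[int]):
        self.free_blocks = free_blocks  # pool of free physical block IDs
        self.logical_to_physical: list[int] = []

    def grow_to(self, num_tokens: int) -> None:
        # Allocate a new physical block only when the current one fills up.
        blocks_needed = -(-num_tokens // BLOCK_SIZE)  # ceiling division
        while len(self.logical_to_physical) < blocks_needed:
            self.logical_to_physical.append(self.free_blocks.pop())

# A request holding 20 tokens occupies 2 blocks (32 token slots)
# instead of a contiguous max_seq_len-sized reservation.
table = BlockTable(free_blocks=list(range(1024)))
table.grow_to(20)
print(table.logical_to_physical)  # e.g. [1023, 1022]
```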
Continuous Batching
Static batching waits for a fixed number of requests before processing. With irregular request arrivals, GPUs are only partially utilized, reducing throughput. Also, requests that finish early must wait for the entire batch to complete.
vLLM's continuous batching completely removes batch boundaries:
- Scheduler operates at the iteration level
- Completed requests are immediately removed and new requests dynamically added
- GPU always operates at maximum capacity
- Both average latency and throughput are improved
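Conceptually, the scheduler loop looks like the sketch below (a simplified illustration, not vLLM's actual scheduler; the `Request` class is invented for the example):

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    tokens_left: int  # tokens still to generate

    def decode_one_token(self) -> None:
        self.tokens_left -= 1

    def is_finished(self) -> bool:
        return self.tokens_left <= 0

def serve_loop(waiting: deque, max_batch: int) -> None:
    """Iteration-level scheduling: the running batch is rebuilt every step."""
    running: list[Request] = []
    while waiting or running:
        # Admit new requests the moment a slot (and KV cache space) frees up.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        for req in running:  # one decode iteration across the whole batch
            req.decode_one_token()
        # Finished requests leave immediately; none waits for a batch boundary.
        running = [r for r in running if not r.is_finished()]

serve_loop(deque([Request(3), Request(1), Request(5)]), max_batch=2)
```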
Speculative Decoding
Speculative decoding uses a small draft model to predict tokens, with the main model verifying in parallel, providing 2-3x speed improvement. Especially effective for predictable outputs (code generation, structured responses).
```python
from vllm import LLM

# Draft/target pair for speculative decoding (model names are placeholders).
llm = LLM(
    model="large-model",
    speculative_config={
        "model": "small-draft-model",
        "num_speculative_tokens": 5,  # tokens the draft proposes per step
    },
)
```
V1 Engine Architecture
vLLM v0.19.x introduces the V1 engine with these improvements:
- Chunked Prefill: Mixes prefill (compute-intensive) and decode (memory-intensive) in the same batch
- FP8 KV Cache: Halves KV cache memory, enabling longer context support
- Improved Prefix Caching: 400%+ throughput improvement through common prefix reuse
GPU Memory Requirements
Accurately calculate required GPU memory before model deployment. Memory usage breaks down as:
Required GPU Memory = Model Weights + Non-torch Memory + PyTorch Activation Peak Memory + (Per-batch KV Cache Memory × Batch Size)
Model Weight Memory
Determined by parameter count and precision.
| Precision | Bytes per Parameter | 70B Model Memory |
|---|---|---|
| FP32 | 4 | 280GB |
| FP16/BF16 | 2 | 140GB |
| INT8 | 1 | 70GB |
| INT4 | 0.5 | 35GB |
Example calculation:
- Llama-3.3-70B (FP16): 70B × 2 bytes = 140GB (weights only)
- KV Cache (batch size 256, sequence length 8192): ~40GB
- Activation and other overhead: ~20GB
- Total: ~200GB → Not possible on single H100 80GB, TP=4 needed (50GB per GPU)
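The same arithmetic as a small helper; the overhead figure is the rough estimate used above, not a measured value:

```python
def required_gpu_memory_gb(
    params_b: float,  # parameter count in billions
    bytes_per_param: float,  # 2 for FP16/BF16, 1 for INT8, 0.5 for INT4
    kv_cache_gb: float,  # workload-dependent (batch size x sequence length)
    overhead_gb: float = 20.0,  # activations + non-torch memory (rough estimate)
) -> float:
    return params_b * bytes_per_param + kv_cache_gb + overhead_gb

total = required_gpu_memory_gb(params_b=70, bytes_per_param=2, kv_cache_gb=40)
print(f"{total:.0f} GB total -> {total / 4:.0f} GB per GPU at TP=4")
# 200 GB total -> 50 GB per GPU at TP=4
```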
Quantizing a 70B parameter model to INT4 reduces it to 35GB, making it deployable on a single A100 80GB or H100 with KV cache headroom.
Parallelization Strategies
Large models may not fit on a single GPU or may need multiple GPUs for higher throughput. vLLM supports four parallelization strategies.
Tensor Parallelism (TP)
Distributes parameters within each model layer across multiple GPUs. The most common strategy for deploying large models within a single node.
When to use:
- When the model doesn't fit on a single GPU
- When reducing per-GPU memory pressure to free KV cache space
```python
from vllm import LLM

# Distribute the model across 4 GPUs
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    tensor_parallel_size=4,
)
```
Constraint: tensor_parallel_size must be a divisor of the model's attention head count. For example, a 70B model with 64 attention heads supports TP=2, 4, 8, 16, etc.
Pipeline Parallelism (PP)
Distributes model layers across multiple GPUs in sequential stages; tokens flow through the pipeline stage by stage.
When to use:
- When tensor parallelism is maxed out but more GPUs are needed
- When multi-node deployment is required
```bash
# TP=4 within each node, PP=2 across 2 nodes
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 2
```
Parallelization Strategy Combination Matrix
| Scenario | Model Size | GPU Configuration | Parallelization Strategy | TP × PP |
|---|---|---|---|---|
| Small model | 7B-13B | 1×H100 80GB | None | 1 × 1 |
| Medium model | 32B-70B | 4×H100 80GB (single node) | TP=4 | 4 × 1 |
| Large model | 175B-405B | 8×H100 (2 nodes) | TP=4, PP=2 | 4 × 2 |
| Ultra-large model | 744B MoE | 16×H100 (2 nodes) | TP=8, PP=2 | 8 × 2 |
PP Multi-node Constraints (V1 Engine, 2026.04)
vLLM V1 engine's multiproc_executor performs multi-node synchronization via NCCL TCPStore. For large models (744B class), loading time may exceed VLLM_ENGINE_READY_TIMEOUT_S (default 600s), causing deadlock.
Symptoms: the leader Pod times out waiting for worker responses → workers hit TCPStore "Broken pipe" errors → the Pods restart in a loop
Solutions:
- Use SGLang (recommended): Stably supports multi-node PP
- Ray-based vLLM: Ray Cluster configuration (increased operational complexity)
- Single node deployment: Use H200 (141GB × 8) or B200 (192GB × 8) to eliminate PP
For details, see the Custom Model Deployment Guide.
Data Parallelism (DP)
Replicates the entire model across multiple servers, each processing requests independently; combined with the Kubernetes HPA (Horizontal Pod Autoscaler), this enables elastic scaling.
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    # Pods-type metrics require a custom metrics pipeline
    # (e.g. prometheus-adapter) exposing vLLM's metrics to the HPA.
    - type: Pods
      pods:
        metric:
          name: vllm_num_requests_waiting
        target:
          type: AverageValue
          averageValue: "10"
```
Expert Parallelism (EP)
A specialized strategy for MoE (Mixture-of-Experts) models. Tokens are routed only to relevant "experts," reducing unnecessary computation.
```bash
vllm serve model-name --enable-expert-parallel
```
For details, see MoE Model Serving.
Supported Hardware
vLLM v0.19.x supports various hardware accelerators:
| Hardware | Support Level | Primary Use | AWS Instance Type |
|---|---|---|---|
| NVIDIA H100 (80GB) | Full support | Production inference | p5.48xlarge (H100×8) |
| NVIDIA H200 (141GB) | Full support | Large model inference | p5en.48xlarge (H200×8) |
| NVIDIA B200 (192GB) | Full support | Ultra-large model inference | p6-b200.48xlarge (B200×8) |
| NVIDIA L4 (24GB) | Full support | Cost-efficient inference | g6.xlarge~g6.48xlarge (L4×1~8) |
| AWS Trainium2 | Supported | AWS native acceleration | trn2.48xlarge (Trn2×16) |
| AMD MI300X | Supported | Alternative GPU infrastructure | - |
AWS EKS Recommended Configuration:
- Production: p5.48xlarge (H100 × 8, 640GB HBM3) → Deploy 175B models with TP=8
- Large models: p5en.48xlarge (H200 × 8, 1,128GB HBM3e) → Deploy 405B models with TP=8
- Cost optimization: g6 instances (L4) → 7B~13B models, Spot instance utilization
Multi-LoRA Serving
vLLM can simultaneously serve multiple LoRA adapters on a single base model. This enables efficient operation of domain-specific models on a single GPU set, significantly saving GPU resources.
Architecture Concept
Base Model + Adapter Hot-swap:
- Base Model (70B) is always loaded in GPU memory
- LoRA adapters (hundreds of MB to several GB) are dynamically loaded/unloaded per request
- Adapter switching overhead: tens to hundreds of ms (100x faster than full model reloading)
Memory efficiency:
- Traditional: Per-domain full model × N deployments = 140GB × 5 = 700GB
- Multi-LoRA: Base Model (140GB) + Adapter cache (10GB) = 150GB
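The dynamic loading mentioned above can be driven at runtime through vLLM's LoRA management endpoint; a minimal sketch (adapter name and path are placeholders, and the server must be started with VLLM_ALLOW_RUNTIME_LORA_UPDATING=True):

```python
import requests

# Register a new adapter on a running server without restarting it.
resp = requests.post(
    "http://localhost:8000/v1/load_lora_adapter",
    json={"lora_name": "legal", "lora_path": "./lora-legal"},
)
print(resp.status_code)  # 200 on success
```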
Key Configuration Options
| Option | Description | Default |
|---|---|---|
| --enable-lora | Enable Multi-LoRA serving | False |
| --lora-modules | Pre-load LoRA modules (name=path) | None |
| --max-loras | Maximum simultaneous loaded LoRAs | 1 |
| --max-lora-rank | Maximum supported LoRA rank | 16 |
| --lora-extra-vocab-size | LoRA adapter extra vocabulary size | 256 |
Basic usage example:
```bash
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --enable-lora \
  --lora-modules customer-support=./lora-cs finance=./lora-fin \
  --max-loras 4 \
  --max-lora-rank 64
```
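Each registered adapter is then addressable as its own model name through the OpenAI-compatible API; a minimal sketch (adapter names match the --lora-modules registration above):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Route this request through the "customer-support" LoRA adapter;
# passing the base model name instead would bypass the adapters.
response = client.chat.completions.create(
    model="customer-support",
    messages=[{"role": "user", "content": "Where is my order?"}],
)
print(response.choices[0].message.content)
```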
For Multi-LoRA hot-swap deployment, per-customer adapter routing, A/B testing, and S3 dynamic loading, see the Custom Model Pipeline Guide.
Performance Optimization
Quantization
Balances model quality and memory efficiency.
| Quantization Method | Memory Savings | Quality Loss | Inference Speed | Recommended Use |
|---|---|---|---|---|
| FP8 | 50% | Minimal (<1%) | Fast (H100 optimized) | Production inference |
| AWQ | 75% | Low (1-3%) | Very fast | High-throughput services |
| GPTQ | 75% | Low (1-3%) | Fast | GPU memory-constrained environments |
| GGUF | 50-75% | Low-Medium | Fast | CPU/edge deployment |
Usage examples:
```bash
# FP8 quantization (recommended)
vllm serve Qwen/Qwen3-32B-FP8 --quantization fp8

# AWQ quantization
vllm serve TheBloke/Llama-2-70B-AWQ --quantization awq

# GGUF quantization (vLLM v0.6+): pass the local .gguf file as the model
# (download llama-2-70b.Q4_K_M.gguf from TheBloke/Llama-2-70B-GGUF first)
# and point --tokenizer at the original, unquantized model repo
vllm serve ./llama-2-70b.Q4_K_M.gguf \
  --tokenizer meta-llama/Llama-2-70b-hf
```
FP8 reduces memory by half with virtually no quality degradation. INT4 (AWQ, GPTQ) may cause quality degradation in complex reasoning tasks, so per-workload profiling is necessary.
Prefix Caching
Reuses the KV cache of standardized system prompts or repeated contexts, providing 400%+ throughput improvement on workloads with shared prefixes.
```bash
vllm serve model-name --enable-prefix-caching
```
How it works:
- System prompt KV cache is computed once and shared
- Requests with identical prefixes avoid redundant computation
- Hit rate varies by application (especially effective in RAG systems)
Applicable scenarios:
- RAG systems (common context reuse)
- Fixed system prompt usage
- Few-shot learning (same examples repeated)
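A minimal offline sketch (the prompt text and model name are illustrative); the second call reuses the cached KV blocks of the shared prefix instead of recomputing them:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="model-name", enable_prefix_caching=True)

shared_prefix = "You are a support assistant for ACME Corp. " * 50  # long shared context
params = SamplingParams(max_tokens=64)

# The first call computes and caches the prefix's KV blocks;
# the second call with the same prefix skips that prefill work.
llm.generate(shared_prefix + "Question: How do I reset my password?", params)
llm.generate(shared_prefix + "Question: What is your refund policy?", params)
```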
Chunked Prefill
Mixes prefill (compute-intensive) and decode (memory-intensive) operations in the same batch, improving both throughput and latency. Enabled by default in vLLM V1.
```python
from vllm import LLM

llm = LLM(
    model="model-name",
    max_num_batched_tokens=2048,  # tunable per-iteration token budget
)
```
Adjust max_num_batched_tokens to trade TTFT (Time To First Token) and inter-token latency against each other:
- Higher value → more prefill work per iteration: higher throughput and lower TTFT, but decode steps are interrupted more often (higher inter-token latency)
- Lower value → smoother decoding (lower inter-token latency), but lower throughput and higher TTFT
CUDA Graph
Captures repetitive computation patterns as graphs to reduce GPU kernel execution overhead. Enabled by default in vLLM V1.
```bash
vllm serve model-name --enforce-eager  # disable CUDA Graph (for debugging)
```
CUDA Graph provides 10-20% performance improvement in most cases, but may add overhead with dynamic sequence length patterns.
DeepGEMM (FP8)
Custom GEMM kernel that accelerates FP8 operations on NVIDIA H100 GPUs.
```bash
VLLM_USE_DEEP_GEMM=1 vllm serve model-name --kv-cache-dtype=fp8
```
Provides 20-30% additional performance improvement when using FP8 models on H100.
Optimization Option Comparison
| Optimization Technique | Throughput Improvement | TTFT Improvement | GPU Memory Savings | Implementation Difficulty |
|---|---|---|---|---|
| Prefix Caching | +400% | Yes | Yes | Low (single flag) |
| FP8 Quantization | +50% | Yes | 50% | Low (model selection) |
| Chunked Prefill | +30% | +20% | - | Low (enabled by default) |
| Speculative Decoding | +200% | +100% | - | Medium (draft model) |
| CUDA Graph | +15% | Yes | - | Low (enabled by default) |
| DeepGEMM | +25% | - | - | Low (H100 only) |
Monitoring Metrics
vLLM exposes various metrics in Prometheus format.
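The endpoint can be checked directly on a running server; a quick sketch (assumes the default port 8000):

```python
import requests

# Scrape the Prometheus endpoint of a running vLLM server.
metrics = requests.get("http://localhost:8000/metrics").text
for line in metrics.splitlines():
    if line.startswith(("vllm:num_requests", "vllm:gpu_cache")):
        print(line)
```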
Key Metrics
| Metric | Description | Threshold Example |
|---|---|---|
| vllm:num_requests_running | Currently processing request count | < max_num_seqs |
| vllm:num_requests_waiting | Waiting request count | < 50 (prevent overload) |
| vllm:gpu_cache_usage_perc | GPU KV cache utilization | 70-90% (optimal) |
| vllm:num_preemptions_total | Preempted request count | < 10/min (lower is better) |
| vllm:avg_prompt_throughput_toks_per_s | Prompt throughput (tokens/sec) | Measure against target |
| vllm:avg_generation_throughput_toks_per_s | Generation throughput (tokens/sec) | Measure against target |
| vllm:time_to_first_token_seconds | Time to First Token (TTFT) | < 1s (conversational services) |
| vllm:time_per_output_token_seconds | Time Per Output Token (TPOT) | < 0.1s (real-time streaming) |
| vllm:e2e_request_latency_seconds | End-to-end request latency | Measure against target SLA |
Preemption Handling
When KV cache space is insufficient, vLLM preempts requests to free space. If the following warning occurs frequently, action is needed:
```text
WARNING Sequence group 0 is preempted by PreemptionMode.RECOMPUTE
```
Remediation options (a tuned configuration is sketched below):
- Increase `gpu_memory_utilization` (0.9 → 0.95)
- Decrease `max_num_seqs` or `max_num_batched_tokens`
- Increase `tensor_parallel_size` to secure per-GPU memory
- Decrease `max_model_len` (match the actual workload)
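These knobs map directly onto engine arguments; a minimal sketch (the values are illustrative starting points, not tuned recommendations):

```python
from vllm import LLM

llm = LLM(
    model="model-name",
    gpu_memory_utilization=0.95,  # raised from the 0.9 default
    max_model_len=8192,  # capped to the actual workload's context length
    max_num_seqs=128,  # lower cap on concurrently running sequences
)
```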
For Prometheus + Grafana monitoring stack setup, alert threshold configuration, and dashboard templates, see the Monitoring Stack Setup Guide.
Related Documents
Production Deployment
- Custom Model Deployment: Kubernetes deployment YAML, LWS multi-node, S3 model cache, vLLM PP multi-node constraint details, coding-specialized model deployment guide
- Custom Model Pipeline: Multi-LoRA hot-swap, per-customer adapter routing, A/B testing, S3 dynamic loading
- Monitoring Stack Setup: Prometheus + Grafana setup, alert thresholds, dashboard templates
Related Technologies
- llm-d EKS Auto Mode: Disaggregated Serving via vLLM + llm-d integration
- MoE Model Serving: Expert Parallelism, GLM-5/Kimi K2.5 deployment strategy
- GPU Resource Management: Karpenter, KEDA, GPU Operator configuration
References
- GenAI on EKS Starter Kit: Bifrost, vLLM, Langfuse, Milvus and other GenAI component deployment automation
- Scalable Model Inference and Agentic AI on Amazon EKS: Comprehensive architecture including llm-d, Karpenter, RAG workflows
- vLLM Official Documentation: Optimization and tuning guide
- vLLM Kubernetes Deployment Guide