
vLLM Model Serving

Overview

vLLM is a high-performance LLM inference engine. Its PagedAttention algorithm eliminates the 60-80% KV cache memory waste seen in traditional serving engines, and Continuous Batching delivers a 2-24x throughput improvement. Major companies including Meta, Mistral AI, Cohere, and IBM run it in production, and its OpenAI-compatible API makes migrating existing applications straightforward.

📌 Current Version: vLLM v0.18+ / v0.19.x (as of 2026-04)

Why vLLM Became the Standard

Traditional LLM serving engines statically allocated KV cache memory, resulting in 60-80% memory waste. Static batching waited until a fixed number of requests accumulated, leading to long GPU idle times. vLLM eliminates these two fundamental bottlenecks, providing up to 24x higher throughput on the same hardware.

vLLM core innovations:

  • PagedAttention: Inspired by OS virtual memory management, manages KV cache as non-contiguous blocks
  • Continuous Batching: Removes batch boundaries and dynamically adds/removes requests at the iteration level
  • OpenAI API Compatible: Migration possible without changing existing application code
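
For example, an application built on the official OpenAI Python client can usually be pointed at a vLLM server by changing only the base URL. A minimal sketch (the endpoint, API key, and model name are placeholders for your deployment):

from openai import OpenAI

# Point the standard OpenAI client at a local vLLM OpenAI-compatible server.
client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's default serving endpoint
    api_key="EMPTY",                      # ignored unless the server sets --api-key
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
)
print(response.choices[0].message.content)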

Core Architecture

PagedAttention and KV Cache Management

Due to the autoregressive nature of Transformer architecture, each request must store key-value pairs from previous tokens. This KV cache grows linearly with input sequence length and concurrent users. Traditional approaches pre-allocate memory for maximum length, wasting space regardless of actual usage.

vLLM's PagedAttention divides KV cache into fixed-size blocks stored non-contiguously. Short requests allocate fewer blocks; longer ones allocate additional blocks as needed. Block tables maintain logical ordering, eliminating memory fragmentation.

Memory efficiency improvement:

  • Traditional: Pre-allocate max sequence length × batch size → 60-80% waste
  • PagedAttention: Dynamically allocate only actual usage → waste eliminated
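
The block-table idea can be illustrated with a small conceptual sketch (an illustration only, not vLLM's internal data structures): each sequence maps logical block indices to physical blocks drawn from a shared free pool, so memory is allocated only as tokens are actually generated.

BLOCK_SIZE = 16  # tokens per KV cache block (illustrative value)

class BlockTable:
    """Toy block table: logical block index -> physical block id."""
    def __init__(self, free_pool):
        self.free_pool = free_pool        # shared pool of physical block ids
        self.physical_blocks = []

    def append_token(self, token_index):
        # A new physical block is needed only when the previous one is full.
        if token_index % BLOCK_SIZE == 0:
            self.physical_blocks.append(self.free_pool.pop())

free_pool = list(range(1024))             # physical KV cache blocks on the GPU
seq = BlockTable(free_pool)
for t in range(40):                       # a 40-token sequence occupies only 3 blocks
    seq.append_token(t)
print(len(seq.physical_blocks))           # -> 3, instead of a max-length preallocation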

Continuous Batching

Static batching waits for a fixed number of requests before processing. With irregular request arrivals, GPUs are only partially utilized, reducing throughput. Also, requests that finish early must wait for the entire batch to complete.

vLLM's continuous batching completely removes batch boundaries:

  • Scheduler operates at the iteration level
  • Completed requests are immediately removed and new requests dynamically added
  • GPU always operates at maximum capacity
  • Both average latency and throughput are improved
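
The scheduling behavior can be sketched with a toy loop (conceptual only, not the actual vLLM scheduler): finished sequences leave the running set every iteration, and waiting requests are admitted the moment a slot opens.

from collections import deque

MAX_RUNNING = 4                                    # illustrative batch capacity
waiting = deque([("req%d" % i, n) for i, n in enumerate([3, 8, 2, 6, 4])])
running = {}                                       # request id -> tokens left to generate

iterations = 0
while waiting or running:
    # Admit new requests as soon as a slot frees up -- no batch boundary.
    while waiting and len(running) < MAX_RUNNING:
        rid, tokens_left = waiting.popleft()
        running[rid] = tokens_left
    # One decode iteration for every running request.
    for rid in list(running):
        running[rid] -= 1
        if running[rid] == 0:
            del running[rid]                       # finished requests exit immediately
    iterations += 1
print("all requests finished after", iterations, "iterations")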

Speculative Decoding

Speculative decoding uses a small draft model to predict tokens, with the main model verifying in parallel, providing 2-3x speed improvement. Especially effective for predictable outputs (code generation, structured responses).

from vllm import LLM

llm = LLM(
    model="large-model",
    speculative_model="small-draft-model",
    num_speculative_tokens=5
)

V1 Engine Architecture

vLLM v0.19.x introduces the V1 engine with these improvements:

  • Chunked Prefill: Mixes prefill (compute-intensive) and decode (memory-intensive) in the same batch
  • FP8 KV Cache: Reduces KV cache memory by 2x for longer context support
  • Improved Prefix Caching: 400%+ throughput improvement through common prefix reuse
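
As an example, the FP8 KV cache can be requested through the kv_cache_dtype engine argument (model name is a placeholder; actual support depends on your vLLM version and GPU):

from vllm import LLM

llm = LLM(
    model="model-name",
    kv_cache_dtype="fp8",   # store the KV cache in FP8, roughly halving its memory footprint
)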

GPU Memory Requirements

Accurately calculate required GPU memory before model deployment. Memory usage breaks down as:

Required GPU Memory = Model Weights + Non-torch Memory + PyTorch Activation Peak Memory + (Per-batch KV Cache Memory × Batch Size)

Model Weight Memory

Determined by parameter count and precision.

| Precision | Bytes per Parameter | 70B Model Memory |
|-----------|---------------------|------------------|
| FP32 | 4 | 280GB |
| FP16/BF16 | 2 | 140GB |
| INT8 | 1 | 70GB |
| INT4 | 0.5 | 35GB |

Example calculation:

  • Llama-3.3-70B (FP16): 70B × 2 bytes = 140GB (weights only)
  • KV Cache (batch size 256, sequence length 8192): ~40GB
  • Activation and other overhead: ~20GB
  • Total: ~200GB → Not possible on single H100 80GB, TP=4 needed (50GB per GPU)

Quantizing a 70B parameter model to INT4 reduces it to 35GB, making it deployable on a single A100 80GB or H100 with KV cache headroom.
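
Since weight memory is simply parameter count × bytes per parameter, the table above can be sanity-checked with a few lines of arithmetic (approximate decimal GB, ignoring activations and KV cache):

BYTES_PER_PARAM = {"FP32": 4.0, "FP16/BF16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weight_memory_gb(params_in_billions: float, precision: str) -> float:
    # 1B parameters at 1 byte each is roughly 1 GB
    return params_in_billions * BYTES_PER_PARAM[precision]

for precision in BYTES_PER_PARAM:
    print(f"70B @ {precision:>9}: {weight_memory_gb(70, precision):>5.0f} GB")
# FP32: 280 GB, FP16/BF16: 140 GB, INT8: 70 GB, INT4: 35 GB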

Parallelization Strategies

Large models may not fit on a single GPU or may need multiple GPUs for higher throughput. vLLM supports four parallelization strategies.

Tensor Parallelism (TP)

Distributes parameters within each model layer across multiple GPUs. The most common strategy for deploying large models within a single node.

When to use:

  • When the model doesn't fit on a single GPU
  • When reducing per-GPU memory pressure to free KV cache space
from vllm import LLM

# Distribute model across 4 GPUs
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    tensor_parallel_size=4
)

Constraint: tensor_parallel_size must be a divisor of the model's attention head count. For example, a 70B model with 64 attention heads supports TP=2, 4, 8, 16, etc.

Pipeline Parallelism (PP)

Distributes model layers sequentially across multiple GPUs. Tokens flow through the pipeline sequentially.

When to use:

  • When tensor parallelism is maxed out but more GPUs are needed
  • When multi-node deployment is required
# 4 GPUs tensor parallel, 2 nodes pipeline parallel
vllm serve meta-llama/Llama-3.3-70B-Instruct \
--tensor-parallel-size 4 \
--pipeline-parallel-size 2

Parallelization Strategy Combination Matrix

| Scenario | Model Size | GPU Configuration | Parallelization Strategy | TP × PP |
|----------|------------|-------------------|--------------------------|---------|
| Small model | 7B-13B | 1×H100 80GB | None | 1 × 1 |
| Medium model | 32B-70B | 4×H100 80GB (single node) | TP=4 | 4 × 1 |
| Large model | 175B-405B | 8×H100 (2 nodes) | TP=4, PP=2 | 4 × 2 |
| Ultra-large model | 744B MoE | 16×H100 (2 nodes) | TP=8, PP=2 | 8 × 2 |

PP Multi-node Constraints (V1 Engine, 2026.04)

vLLM V1 engine's multiproc_executor performs multi-node synchronization via NCCL TCPStore. For large models (744B class), loading time may exceed VLLM_ENGINE_READY_TIMEOUT_S (default 600s), causing deadlock.

Symptoms: the leader Pod times out waiting for worker responses → workers hit TCPStore "Broken pipe" errors → the Pods enter a restart loop

Solutions:

  1. Use SGLang (recommended): Stably supports multi-node PP
  2. Ray-based vLLM: Ray Cluster configuration (increased operational complexity)
  3. Single node deployment: Use H200 (141GB × 8) or B200 (192GB × 8) to eliminate PP

For details, see the Custom Model Deployment Guide.

Data Parallelism (DP)

Replicates the entire model across multiple servers for independent request processing. Combined with Kubernetes HPA (Horizontal Pod Autoscaler) for elastic scaling.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm_num_requests_waiting
      target:
        type: AverageValue
        averageValue: "10"

Expert Parallelism (EP)

A specialized strategy for MoE (Mixture-of-Experts) models. Tokens are routed only to relevant "experts," reducing unnecessary computation.

vllm serve model-name --enable-expert-parallel

For details, see MoE Model Serving.

Supported Hardware

vLLM v0.19.x supports various hardware accelerators:

| Hardware | Support Level | Primary Use | AWS Instance Type |
|----------|---------------|-------------|-------------------|
| NVIDIA H100 (80GB) | Full support | Production inference | p5.48xlarge (H100×8) |
| NVIDIA H200 (141GB) | Full support | Large model inference | p5en.48xlarge (H200×8) |
| NVIDIA B200 (192GB) | Full support | Ultra-large model inference | p6-b200.48xlarge (B200×8) |
| NVIDIA L4 (24GB) | Full support | Cost-efficient inference | g6e.xlarge~12xlarge (L4×1~8) |
| AWS Trainium2 | Supported | AWS native acceleration | trn2.48xlarge (Trn2×16) |
| AMD MI300X | Supported | Alternative GPU infrastructure | - |

AWS EKS Recommended Configuration:

  • Production: p5.48xlarge (H100 × 8, 640GB HBM3) → Deploy 175B models with TP=8
  • Large models: p5en.48xlarge (H200 × 8, 1,128GB HBM3e) → Deploy 405B models with TP=8
  • Cost optimization: g6e instances (L4) → 7B~13B models, Spot instance utilization

Multi-LoRA Serving

vLLM can simultaneously serve multiple LoRA adapters on a single base model. This enables efficient operation of domain-specific models on a single GPU set, significantly saving GPU resources.

Architecture Concept

Base Model + Adapter Hot-swap:

  • Base Model (70B) is always loaded in GPU memory
  • LoRA adapters (hundreds of MB to several GB) are dynamically loaded/unloaded per request
  • Adapter switching overhead: tens to hundreds of ms (100x faster than full model reloading)

Memory efficiency:

  • Traditional: Per-domain full model × N deployments = 140GB × 5 = 700GB
  • Multi-LoRA: Base Model (140GB) + Adapter cache (10GB) = 150GB

Key Configuration Options

| Option | Description | Default |
|--------|-------------|---------|
| --enable-lora | Enable Multi-LoRA serving | False |
| --lora-modules | Pre-load LoRA modules (name=path) | None |
| --max-loras | Maximum simultaneously loaded LoRAs | 1 |
| --max-lora-rank | Maximum supported LoRA rank | 16 |
| --lora-extra-vocab-size | Extra vocabulary size for LoRA adapters | 256 |

Basic usage example:

vllm serve meta-llama/Llama-3.3-70B-Instruct \
--enable-lora \
--lora-modules customer-support=./lora-cs finance=./lora-fin \
--max-loras 4 \
--max-lora-rank 64
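
Once the server is running, a client selects an adapter per request by passing the registered adapter name as the model. A sketch against the OpenAI-compatible API (the adapter name matches the --lora-modules registration above; endpoint and prompt are placeholders):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# "customer-support" is the LoRA adapter registered via --lora-modules above.
response = client.chat.completions.create(
    model="customer-support",
    messages=[{"role": "user", "content": "My order arrived damaged. What should I do?"}],
)
print(response.choices[0].message.content)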
Detailed Guide

For Multi-LoRA hot-swap deployment, per-customer adapter routing, A/B testing, and S3 dynamic loading, see the Custom Model Pipeline Guide.

Performance Optimization

Quantization

Balances model quality and memory efficiency.

| Quantization Method | Memory Savings | Quality Loss | Inference Speed | Recommended Use |
|---------------------|----------------|--------------|-----------------|-----------------|
| FP8 | 50% | Minimal (<1%) | Fast (H100 optimized) | Production inference |
| AWQ | 75% | Low (1-3%) | Very fast | High-throughput services |
| GPTQ | 75% | Low (1-3%) | Fast | GPU memory-constrained environments |
| GGUF | 50-75% | Low-Medium | Fast | CPU/edge deployment |

Usage examples:

# FP8 quantization (recommended)
vllm serve Qwen/Qwen3-32B-FP8 --quantization fp8

# AWQ quantization
vllm serve TheBloke/Llama-2-70B-AWQ --quantization awq

# GGUF quantization (vLLM v0.6+): pass a local GGUF file as the model
# (e.g., llama-2-70b.Q4_K_M.gguf downloaded from TheBloke/Llama-2-70B-GGUF)
vllm serve ./llama-2-70b.Q4_K_M.gguf \
--quantization gguf

FP8 reduces memory by half with virtually no quality degradation. INT4 (AWQ, GPTQ) may cause quality degradation in complex reasoning tasks, so per-workload profiling is necessary.

Prefix Caching

Provides 400%+ throughput improvement when system prompts are standardized or contexts are repeated across requests.

vllm serve model-name --enable-prefix-caching

How it works:

  • System prompt KV cache is computed once and shared
  • Requests with identical prefixes avoid redundant computation
  • Hit rate varies by application (especially effective in RAG systems)

Applicable scenarios:

  • RAG systems (common context reuse)
  • Fixed system prompt usage
  • Few-shot learning (same examples repeated)
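
A minimal sketch of the fixed-system-prompt pattern with the offline Python API (model name and prompts are placeholders): both requests share the same long prefix, so its KV cache is computed once and reused.

from vllm import LLM, SamplingParams

llm = LLM(model="model-name", enable_prefix_caching=True)

system_prompt = (
    "You are a support assistant. Answer strictly from the provided product manual.\n"
)
questions = ["How do I reset my password?", "How do I cancel my subscription?"]

# Identical prefix, different suffixes -> the prefix KV cache is shared across requests.
outputs = llm.generate(
    [system_prompt + q for q in questions],
    SamplingParams(max_tokens=128),
)
for out in outputs:
    print(out.outputs[0].text)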

Chunked Prefill

Mixes prefill (compute-intensive) and decode (memory-intensive) operations in the same batch, improving both throughput and latency. Enabled by default in vLLM V1.

from vllm import LLM

llm = LLM(
    model="model-name",
    max_num_batched_tokens=2048  # Tunable
)

Adjust max_num_batched_tokens to balance TTFT (Time To First Token) and throughput:

  • Higher value → increased throughput, increased TTFT
  • Lower value → decreased TTFT, decreased throughput

CUDA Graph

Captures repetitive computation patterns as graphs to reduce GPU kernel execution overhead. Enabled by default in vLLM V1.

vllm serve model-name --enforce-eager  # Disable CUDA Graph (for debugging)

CUDA Graph provides 10-20% performance improvement in most cases, but may add overhead with dynamic sequence length patterns.

DeepGEMM (FP8)

Custom GEMM kernel that accelerates FP8 operations on NVIDIA H100 GPUs.

VLLM_USE_DEEP_GEMM=1 vllm serve model-name --kv-cache-dtype=fp8

Provides 20-30% additional performance improvement when using FP8 models on H100.

Optimization Option Comparison

| Optimization Technique | Throughput Improvement | TTFT Improvement | GPU Memory Savings | Implementation Difficulty |
|------------------------|------------------------|------------------|--------------------|---------------------------|
| Prefix Caching | +400% | Yes | Yes | Low (single flag) |
| FP8 Quantization | +50% | Yes | 50% | Low (model selection) |
| Chunked Prefill | +30% | +20% | - | Low (enabled by default) |
| Speculative Decoding | +200% | +100% | - | Medium (draft model) |
| CUDA Graph | +15% | Yes | - | Low (enabled by default) |
| DeepGEMM | +25% | - | - | Low (H100 only) |

Monitoring Metrics

vLLM exposes various metrics in Prometheus format.

Key Metrics

| Metric | Description | Threshold Example |
|--------|-------------|-------------------|
| vllm:num_requests_running | Currently processing request count | < max_num_seqs |
| vllm:num_requests_waiting | Waiting request count | < 50 (prevent overload) |
| vllm:gpu_cache_usage_perc | GPU KV cache utilization | 70-90% (optimal) |
| vllm:num_preemptions_total | Preempted request count | < 10/min (lower is better) |
| vllm:avg_prompt_throughput_toks_per_s | Prompt throughput (tokens/sec) | Measure against target |
| vllm:avg_generation_throughput_toks_per_s | Generation throughput (tokens/sec) | Measure against target |
| vllm:time_to_first_token_seconds | Time to First Token (TTFT) | < 1s (conversational services) |
| vllm:time_per_output_token_seconds | Time Per Output Token (TPOT) | < 0.1s (real-time streaming) |
| vllm:e2e_request_latency_seconds | End-to-end request latency | Measure against target SLA |
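
These metrics are exposed in Prometheus text format on the server's /metrics endpoint; a quick way to inspect them without a full monitoring stack (the URL is a placeholder for your deployment):

import requests

# Dump only the vLLM metric samples from a running OpenAI-compatible server.
body = requests.get("http://localhost:8000/metrics", timeout=5).text
for line in body.splitlines():
    if line.startswith("vllm:"):
        print(line)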

Preemption Handling

When KV cache space is insufficient, vLLM preempts requests to free space. If the following warning occurs frequently, action is needed:

WARNING Sequence group 0 is preempted by PreemptionMode.RECOMPUTE

Remediation:

  1. Increase gpu_memory_utilization (0.9 → 0.95)
  2. Decrease max_num_seqs or max_num_batched_tokens
  3. Increase tensor_parallel_size to secure per-GPU memory
  4. Decrease max_model_len (match actual workload)
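
The same knobs are available as engine arguments in the offline API; a hedged sketch with illustrative starting values (model name and numbers are placeholders, not recommendations):

from vllm import LLM

llm = LLM(
    model="model-name",
    gpu_memory_utilization=0.95,   # more headroom for the KV cache (default 0.90)
    max_num_seqs=128,              # cap concurrent sequences per iteration
    max_model_len=8192,            # match the longest context actually served
)
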
Detailed Guide

For Prometheus + Grafana monitoring stack setup, alert threshold configuration, and dashboard templates, see the Monitoring Stack Setup Guide.

Production Deployment

  • Custom Model Deployment: Kubernetes deployment YAML, LWS multi-node, S3 model cache, vLLM PP multi-node constraint details, coding-specialized model deployment guide
  • Custom Model Pipeline: Multi-LoRA hot-swap, per-customer adapter routing, A/B testing, S3 dynamic loading
  • Monitoring Stack Setup: Prometheus + Grafana setup, alert thresholds, dashboard templates
