vLLM¶
High-throughput LLM serving engine with PagedAttention for efficient memory management. Supports continuous batching, prefix caching, multi-LoRA, and quantization (FP8, GPTQ, AWQ).
| Category | llm-model |
| Official Docs | vLLM Documentation |
| CLI Install | ./cli llm-model vllm install |
| CLI Uninstall | ./cli llm-model vllm uninstall |
| Namespace | vllm |
Overview¶
vLLM is a fast and memory-efficient inference engine for large language models. Key features:
- PagedAttention: Efficient KV cache management inspired by virtual memory paging
- Continuous Batching: Dynamic request batching for higher throughput
- Prefix Caching: Cache common prompt prefixes to reduce TTFT
- Quantization Support: FP8, GPTQ, AWQ, SqueezeLLM
- Multi-LoRA: Serve multiple LoRA adapters on a single base model
- Tensor/Pipeline Parallel: Distributed inference across multiple GPUs
- Neuron Support: AWS Inferentia2/Trainium accelerators
Installation¶
The installer:
1. Prompts for model selection from config.json
2. Downloads model to PVC (if not cached)
3. Deploys vLLM server with optimal settings
4. Exposes OpenAI-compatible API endpoint
Model Selection¶
Pre-configured models in config.json:
{
"llm-model": {
"vllm": {
"models": [
{ "name": "qwen3-30b-instruct-fp8", "deploy": true },
{ "name": "qwen3-32b-fp8", "deploy": true },
{ "name": "deepseek-r1-qwen3-8b", "deploy": false },
{ "name": "qwen3-8b-neuron", "deploy": false }
]
}
}
}
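To see which models the installer would deploy, the entries flagged `"deploy": true` can be pulled out of config.json. A minimal sketch, assuming the one-entry-per-line layout shown above; the function name `deployable_models` is hypothetical, and a JSON-aware tool such as jq would be more robust for other layouts.

```shell
# Print the names of models marked "deploy": true.
# Assumes each model entry sits on its own line, as in the sample above.
deployable_models() {
  config_file="${1:-config.json}"
  grep '"deploy": true' "$config_file" \
    | sed -E 's/.*"name": "([^"]+)".*/\1/'
}

# Usage: deployable_models config.json
```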
Verification¶
# Check vLLM pods
kubectl get pods -n vllm
# Check service
kubectl get svc -n vllm
# Port-forward for testing (--address 0.0.0.0 listens on all interfaces; omit it to bind to localhost only)
kubectl port-forward svc/vllm 8000:8000 -n vllm --address 0.0.0.0 &
# List models
curl http://localhost:8000/v1/models
# Test completion
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-30b-instruct-fp8",
"prompt": "Once upon a time",
"max_tokens": 50
}'
# Test chat completion
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-30b-instruct-fp8",
"messages": [{"role": "user", "content": "Explain quantum computing"}],
"max_tokens": 200
}'
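The server can take a while to load model weights, so the requests above may fail right after installation. A minimal sketch of a readiness wait, assuming the port-forward above is active; `wait_until` is a hypothetical helper, and the attempt count and delay are arbitrary.

```shell
# Retry a command until it succeeds or the attempts run out.
# Usage: wait_until <max-attempts> <delay-seconds> <command...>
wait_until() {
  attempts="$1"; delay="$2"; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example: poll the model list every 5 seconds, up to 60 times.
# wait_until 60 5 curl -sf http://localhost:8000/v1/models
```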
Configuration¶
Model Download¶
Models are downloaded to PVC during installation:
- Storage: Uses platform-specific StorageClass (NFS for K8s, EFS for EKS)
- Path: /models/<model-name>/
- Auto-detection: Skips download if model files exist
vLLM Arguments¶
Common arguments configured via installer or config.json:
# GPU memory utilization
--gpu-memory-utilization 0.90
# Tensor parallel size (split model across N GPUs)
--tensor-parallel-size 2
# Max model length (context size)
--max-model-len 8192
# Quantization
--quantization fp8
# Block size: tokens per KV-cache block (supported values depend on the backend; upstream vLLM documents a maximum of 32 on CUDA)
--block-size 128
# Enable prefix caching
--enable-prefix-caching
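Put together, the flags above form a single server command line. A sketch, assuming the upstream `vllm serve` entrypoint and the /models/<model-name>/ layout described under Model Download; how the installer actually passes these arguments is not shown in this page, the function name `vllm_command` is hypothetical, and `--block-size` is left at its default.

```shell
# Print the assembled launch command for a model stored on the PVC.
# Values mirror the examples above; tune them for your GPUs.
vllm_command() {
  model_name="$1"
  echo vllm serve "/models/${model_name}" \
    --gpu-memory-utilization 0.90 \
    --tensor-parallel-size 2 \
    --max-model-len 8192 \
    --quantization fp8 \
    --enable-prefix-caching
}

# Usage: vllm_command qwen3-30b-instruct-fp8
```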
Neuron Support¶
For AWS Inferentia2/Trainium:
# Select Neuron-optimized model
? Select model: qwen3-8b-neuron
# vLLM automatically uses neuron backend
--device neuron
--tensor-parallel-size 2 # Neuron cores
Advanced Features¶
Multi-LoRA Serving¶
Serve multiple LoRA adapters on a single base model:
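The server-side flags for this step are not reproduced in this section. As a sketch, upstream vLLM enables adapters with `--enable-lora` and registers each one with `--lora-modules name=path`; the helper name `lora_flags`, the adapter path, and the `--max-loras` value below are illustrative assumptions.

```shell
# Print the extra server flags that register one LoRA adapter at startup.
# Usage: lora_flags <adapter-name> <adapter-path>
lora_flags() {
  printf '%s\n' "--enable-lora --lora-modules $1=$2 --max-loras 4"
}

# Example (hypothetical path on the model PVC):
# lora_flags my-lora-adapter /models/loras/my-lora-adapter
```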
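Then specify the adapter at request time. Upstream vLLM's OpenAI-compatible server selects an adapter by its registered name in the `model` field, so the `lora_name` field shown here may be specific to this deployment: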
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "base-model",
"prompt": "Once upon a time",
"lora_name": "my-lora-adapter"
}'
Speculative Decoding¶
Use smaller draft model to speed up generation:
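A sketch using the `--speculative-model` flag listed under Latency Optimization together with `--num-speculative-tokens`. Both are version-dependent (newer vLLM releases group these settings under `--speculative-config`), the helper name `speculative_flags` is hypothetical, and the default of 5 draft tokens is an arbitrary assumption.

```shell
# Print speculative-decoding flags for a given draft model.
# Usage: speculative_flags <draft-model> [num-speculative-tokens]
speculative_flags() {
  printf '%s\n' "--speculative-model $1 --num-speculative-tokens ${2:-5}"
}

# Example: speculative_flags <draft-model> 4
```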
Chunked Prefill¶
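Enable for long prompts. Chunked prefill splits a long prompt's prefill into chunks that are batched together with decode steps, which mainly smooths inter-token latency for other requests; its effect on TTFT and throughput depends on the per-batch token budget: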
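A sketch of the relevant flags, assuming upstream vLLM's `--enable-chunked-prefill` and `--max-num-batched-tokens`. The helper name `chunked_prefill_flags` is hypothetical and the 2048-token budget is only an illustrative starting point.

```shell
# Print chunked-prefill flags with a per-batch token budget.
# Usage: chunked_prefill_flags [max-num-batched-tokens]
chunked_prefill_flags() {
  printf '%s\n' "--enable-chunked-prefill --max-num-batched-tokens ${1:-2048}"
}

# Example: chunked_prefill_flags 4096
```

A smaller budget tends to favor steadier decode latency, a larger one favors faster prefill.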
Quantization¶
FP8 (H100/H200)¶
GPTQ¶
AWQ¶
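Each of the three methods in this section comes down to a `--quantization` value. A minimal sketch that maps a method name to the flag; `quantization_flag` is a hypothetical helper covering only the methods named here. GPTQ and AWQ expect a checkpoint that was already quantized with that method, while FP8 can also be applied at load time to an unquantized checkpoint.

```shell
# Map a quantization method from this section to the server flag.
# Usage: quantization_flag <fp8|gptq|awq>
quantization_flag() {
  case "$1" in
    fp8|gptq|awq)
      printf '%s\n' "--quantization $1"
      ;;
    *)
      echo "method not covered in this section: $1" >&2
      return 1
      ;;
  esac
}

# Example: quantization_flag awq
```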
Model Management¶
List Downloaded Models¶
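The original command for this step is not reproduced here. A sketch, assuming models live under /models/ on the PVC as described in Model Download and that a vLLM pod with the volume mounted is running; `list_downloaded_models` is a hypothetical helper.

```shell
# List model directories on the PVC from inside a running vLLM pod.
# Usage: list_downloaded_models <vllm-pod>
list_downloaded_models() {
  kubectl exec -n vllm "$1" -- ls -1 /models
}

# Find the pod name first with: kubectl get pods -n vllm
```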
Add New Model¶
- Edit config.json:
- Create model config at components/llm-model/vllm/models/my-new-model.yaml:
- Reinstall:
Integration with LiteLLM¶
vLLM endpoints are automatically discovered by LiteLLM:
model_list:
- model_name: qwen3-30b-instruct-fp8
litellm_params:
model: openai/qwen3-30b-instruct-fp8
api_base: http://vllm.vllm:8000/v1
Performance Tuning¶
Memory Optimization¶
# Increase GPU memory for KV cache
--gpu-memory-utilization 0.95
# Reduce block size (more blocks, less memory per block)
--block-size 64
# Enable CPU offloading
--cpu-offload-gb 16
Throughput Optimization¶
# Increase max parallel requests
--max-num-seqs 256
# Larger block size
--block-size 128
# Disable per-request logging to reduce overhead
--disable-log-requests
Latency Optimization¶
# Enable prefix caching
--enable-prefix-caching
# Chunked prefill for long prompts
--enable-chunked-prefill
# Speculative decoding
--speculative-model <draft-model>
Troubleshooting¶
OOM (Out of Memory)¶
# Reduce GPU memory utilization
--gpu-memory-utilization 0.80
# Reduce max model length
--max-model-len 4096
# Use quantization
--quantization fp8
Low throughput¶
# Check logs for warnings
kubectl logs -n vllm -l app=vllm
# Increase max num seqs
--max-num-seqs 512
# Check GPU utilization
kubectl exec -it -n vllm <vllm-pod> -- nvidia-smi
Model download fails¶
# Check PVC exists and is bound
kubectl get pvc -n vllm
# Check HuggingFace token (for gated models)
kubectl get secret -n vllm huggingface-secret
# Manual download
kubectl exec -it -n vllm <vllm-pod> -- bash
huggingface-cli download <model-id> --local-dir /models/<model-name>