13 docs tagged with "vllm"

Custom Model Pipeline Guide

Building a domain-optimized model serving pipeline with LoRA fine-tuning, multi-LoRA hot-swapping, and SLM cascade routing

Inference Frameworks

vLLM · llm-d · MoE · NeMo: the AI framework layer covering model serving, distributed inference, and fine-tuning on GPUs

Inference Optimization on EKS

EKS architecture overview for maximizing LLM inference performance: the starting point for vLLM, KV cache-aware routing, disaggregated serving, LWS multi-node deployment, and Hybrid Nodes integration

MoE Model Serving Concept Guide

Architecture concepts, distributed deployment strategies, and performance optimization principles for Mixture of Experts models

NVIDIA Dynamo Inference Benchmark

Benchmark comparing aggregated vs. disaggregated LLM serving performance with NVIDIA Dynamo, running AIPerf in four modes in an EKS environment

vLLM Model Serving

vLLM PagedAttention, parallelization strategies, Multi-LoRA, and hardware support architecture
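
As a quick taste of the Multi-LoRA serving covered in that doc, here is a minimal sketch using vLLM's offline LLM API. The base model name, adapter name, and adapter path are placeholders, and max_loras is an illustrative value; PagedAttention itself needs no configuration and is applied automatically.

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Enable LoRA support at engine startup; max_loras caps how many
# adapters can be active in a single batch at once.
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",  # placeholder base model
    enable_lora=True,
    max_loras=2,
)

# Each request can name a different adapter, so vLLM batches requests
# for multiple LoRAs together on the same base model weights.
# The adapter path below is a hypothetical local directory.
outputs = llm.generate(
    ["Summarize the quarterly report."],
    SamplingParams(temperature=0.0, max_tokens=64),
    lora_request=LoRARequest("finance-adapter", 1, "/adapters/finance"),
)
print(outputs[0].outputs[0].text)
```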