Custom Model Deployment Guide
Hands-on guide to deploying large open-source models on EKS, based on the GLM-5.1 experience
Building a domain-optimized model serving pipeline with LoRA fine-tuning, Multi-LoRA hot-swapping, and SLM cascade routing
vLLM·llm-d·MoE·NeMo — AI framework layer for actual model serving, distributed inference, and fine-tuning on GPUs
EKS architecture overview for maximizing LLM inference performance — starting point for vLLM, KV Cache-Aware Routing, Disaggregated Serving, LWS multi-node, and Hybrid Node integration
A benchmark plan comparing Bedrock AgentCore as baseline against self-managed EKS (vLLM, llm-d, Bifrost/LiteLLM) across features, performance, and cost
Summary of core technologies such as vLLM PagedAttention, Continuous Batching, and FP8 KV Cache, plus a comparison of llm-d and NVIDIA Dynamo KV Cache-Aware Routing and Gateway configurations
Benchmark comparing performance and cost efficiency of GPU instances (p5, p4d, g6e) and AWS custom silicon (Trainium2, Inferentia2) for vLLM-based Llama 4 model serving
llm-d architecture concepts, KV Cache-Aware Routing, Disaggregated Serving, and an EKS Auto Mode integration strategy
End-to-end ML lifecycle management with Kubeflow + MLflow + vLLM + ArgoCD GitOps
Model serving guide divided into GPU infrastructure layer and inference/training framework layer
Architecture concepts, distributed deployment strategies, and performance optimization principles for Mixture of Experts models
Benchmark comparing Aggregated vs Disaggregated LLM serving performance using NVIDIA Dynamo — running AIPerf in 4 modes in an EKS environment
vLLM PagedAttention, parallelization strategies, Multi-LoRA, and hardware support architecture