Custom Model Deployment Guide
Hands-on guide to deploying large open-source models on EKS, based on the GLM-5.1 experience
Building a domain-optimized model serving pipeline with LoRA fine-tuning, Multi-LoRA hot-swapping, and SLM cascade routing
vLLM·llm-d·MoE·NeMo — AI framework layer for actual model serving, distributed inference, and fine-tuning on GPUs
EKS architecture overview for maximizing LLM inference performance — starting point for vLLM, KV Cache-Aware Routing, Disaggregated Serving, LWS multi-node, and Hybrid Node integration
A benchmark plan comparing Bedrock AgentCore as baseline against self-managed EKS (vLLM, llm-d, Bifrost/LiteLLM) across features, performance, and cost
Summary of core technologies such as vLLM PagedAttention, Continuous Batching, and FP8 KV Cache, plus a comparison of llm-d and NVIDIA Dynamo KV Cache-Aware Routing and Gateway configurations
Benchmark comparing performance and cost efficiency of GPU instances (p5, p4d, g6e) and AWS custom silicon (Trainium2, Inferentia2) for vLLM-based Llama 4 model serving
llm-d architecture concepts, KV Cache-Aware Routing, Disaggregated Serving, and an EKS Auto Mode integration strategy
End-to-end ML lifecycle management with Kubeflow + MLflow + vLLM + ArgoCD GitOps
Model serving guide divided into GPU infrastructure layer and inference/training framework layer
Architecture concepts, distributed deployment strategies, and performance optimization principles for Mixture of Experts models
Benchmark comparing Aggregated vs Disaggregated LLM serving performance using NVIDIA Dynamo — running AIPerf in 4 modes in an EKS environment
vLLM PagedAttention, parallelization strategies, Multi-LoRA, and hardware support architecture