AWS Neuron Stack — Trainium2/Inferentia2 on EKS
Guide to Neuron SDK, Device Plugin, and NxD Inference for operating AWS custom AI accelerators (Trainium2/Inferentia2) on EKS
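For orientation, a minimal sketch of how the Neuron Device Plugin exposes accelerators to pods: workloads request the `aws.amazon.com/neuron` resource, and the scheduler places them on Neuron-capable nodes. The image name and instance type below are illustrative assumptions, not values from the guide.

```yaml
# Pod requesting Neuron devices via the Neuron device plugin resource.
apiVersion: v1
kind: Pod
metadata:
  name: nxd-inference
spec:
  nodeSelector:
    node.kubernetes.io/instance-type: trn2.48xlarge  # assumed Trainium2 node
  containers:
    - name: inference
      image: my-registry/nxd-inference:latest        # hypothetical image
      resources:
        limits:
          aws.amazon.com/neuron: 2   # Neuron devices advertised by the plugin
```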
Prefill/Decode disaggregation architecture with NIXL as the common KV transfer engine, plus a LeaderWorkerSet-based multi-node deployment guide for 700B+ large MoE models
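As a rough sketch of the LWS pattern referenced above, the manifest below defines one replica group of four pods (one leader, three workers) so a single model instance can span four nodes; the image names are hypothetical.

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: moe-serving
spec:
  replicas: 1                  # number of model replicas (groups)
  leaderWorkerTemplate:
    size: 4                    # pods per group: 1 leader + 3 workers
    leaderTemplate:
      spec:
        containers:
          - name: leader
            image: my-registry/inference-leader:latest   # hypothetical image
    workerTemplate:
      spec:
        containers:
          - name: worker
            image: my-registry/inference-worker:latest   # hypothetical image
```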
2-tier GPU autoscaling, DCGM/vLLM monitoring, Bifrost→Bedrock cascade fallback, on-premises integration with EKS Hybrid Nodes, and lessons learned from large MoE deployments
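One plausible shape for the node tier of such a 2-tier setup is a Karpenter NodePool that provisions GPU capacity while an HPA scales pods within it; the instance families, NodeClass name, and limits here are illustrative assumptions.

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-pool
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["g6e", "p5"]           # assumed GPU instance families
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-nodes                   # hypothetical EC2NodeClass
  limits:
    nvidia.com/gpu: 16                    # cap total GPUs this pool can own
```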
vLLM·llm-d·MoE·NeMo — the AI framework layer for model serving, distributed inference, and fine-tuning on GPUs
EKS architecture overview for maximizing LLM inference performance — the starting point for vLLM, KV Cache-Aware Routing, Disaggregated Serving, LWS multi-node deployment, and Hybrid Nodes integration
A benchmark plan comparing Bedrock AgentCore as a baseline against self-managed EKS (vLLM, llm-d, Bifrost/LiteLLM) across features, performance, and cost
Summary of core vLLM technologies (PagedAttention, Continuous Batching, FP8 KV Cache) and a comparison of KV Cache-Aware Routing and Gateway configuration across llm-d and NVIDIA Dynamo
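To make the FP8 KV Cache point concrete, here is a sketch of vLLM server flags expressed as container args; the model and context length are assumed examples, while `--kv-cache-dtype=fp8` is the standard vLLM flag that roughly halves KV cache memory relative to FP16.

```yaml
containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    args:
      - --model=meta-llama/Llama-3.1-8B-Instruct   # assumed model
      - --kv-cache-dtype=fp8                       # quantize KV cache to FP8
      - --max-model-len=8192                       # assumed context limit
```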
Benchmark comparing performance and cost efficiency of GPU instances (p5, p4d, g6e) and AWS custom silicon (Trainium2, Inferentia2) for vLLM-based Llama 4 model serving
llm-d architecture concepts, KV Cache-aware routing, Disaggregated Serving, and an EKS Auto Mode integration strategy
Model serving guide divided into GPU infrastructure layer and inference/training framework layer
Architecture concepts, distributed deployment strategies, and performance optimization principles for Mixture of Experts models
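A brief sketch of one common MoE distribution strategy in vLLM: tensor parallelism for the dense layers combined with expert parallelism so experts are sharded across GPUs. The model and parallel sizes are illustrative assumptions.

```yaml
containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    args:
      - --model=mistralai/Mixtral-8x7B-Instruct-v0.1  # assumed MoE model
      - --tensor-parallel-size=8                      # shard dense layers over 8 GPUs
      - --enable-expert-parallel                      # shard experts across the TP group
```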
Benchmark comparing Aggregated vs. Disaggregated LLM serving performance using NVIDIA Dynamo — running AIPerf across 4 modes in an EKS environment
A hybrid ML architecture that trains on SageMaker and serves on EKS
vLLM PagedAttention, parallelization strategies, Multi-LoRA, and hardware support architecture
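As a closing sketch of the Multi-LoRA capability mentioned above, vLLM can serve several adapters on one base model; the base model, adapter names, and paths below are hypothetical.

```yaml
containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    args:
      - --model=meta-llama/Llama-3.1-8B-Instruct   # assumed base model
      - --enable-lora                              # turn on LoRA serving
      - --max-loras=4                              # adapters resident per batch
      - --lora-modules                             # name=path pairs
      - sql-adapter=/adapters/sql                  # hypothetical adapter
      - chat-adapter=/adapters/chat                # hypothetical adapter
```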