20 docs tagged with "gpu"

Accelerated Computing Infrastructure

EKS GPU node strategy, Karpenter·KEDA·DRA resource management, NVIDIA GPU stack, AWS Neuron stack — the accelerated computing layer covering GPUs and AWS custom accelerators

Agentic AI Platform

In-depth technical documentation on the architecture, deployment, and operations of the Agentic AI Platform

CRIU-based GPU Live Migration (Preview)

Technical status and EKS application scenarios for GPU workload checkpoint/restore during Spot reclaim and scheduling events (Experimental)

Custom Model Deployment Guide

Hands-on guide to deploying large open-source models on EKS, based on the GLM-5.1 experience

EKS GPU Node Strategy

Optimal node strategies for GPU workloads across EKS Auto Mode, Karpenter, MNG, and Hybrid Nodes

EKS Hybrid Nodes Complete Guide

A complete guide for adopting Amazon EKS Hybrid Nodes: architecture, configuration, networking, DNS, GPU servers, cost analysis, and Dynamic Resource Allocation (DRA)

EKS-based Agentic AI Open Architecture

Guide to building an Agentic AI platform using Amazon EKS and the open-source ecosystem

GPU Autoscaling & Large Model Deployment Operations

2-Tier GPU autoscaling (KEDA·Karpenter), DRA compatibility, and operational lessons learned from large MoE model (GLM-5·Kimi K2.5) deployments for LLM serving

GPU Resource Management

GPU resource management and cost optimization using Karpenter, KEDA, and DRA on EKS

Inference Optimization on EKS

EKS architecture overview for maximizing LLM inference performance: a starting point for vLLM, KV cache-aware routing, disaggregated serving, LWS multi-node, and GPU autoscaling

Kubernetes DRA — Dynamic Resource Allocation Framework

A framework guide covering the DRA core model (DeviceClass, ResourceClaim, ResourceSlice), resource types beyond GPUs (NICs, interconnects, FPGAs), and adoption criteria

Llama 4 FM Serving Benchmark: GPU vs AWS Custom Silicon

Benchmark comparing performance and cost efficiency of GPU instances (p5, p4d, g6e) and AWS custom silicon (Trainium2, Inferentia2) for vLLM-based Llama 4 model serving

llm-d Based EKS Distributed Inference Guide

llm-d architecture concepts, KV cache-aware routing, disaggregated serving, and EKS Auto Mode integration strategy

Model Serving & Inference Infrastructure

A guide to the GPU infrastructure, inference framework, and inference optimization layers, with a single map of the end-to-end LLM inference request path and per-layer tuning levers: inference gateway, prefill/decode disaggregation, KV cache-aware routing, LMCache, and cache-hit strategy