4 docs tagged with "model-serving"

HyperPod Inference Operator (Managed KV Cache and Intelligent Routing)

Compares the managed KV cache, intelligent routing, and DPD of the SageMaker HyperPod Inference Operator with a Tiered Gateway, and clarifies the operator's role and limitations as an L2 inference routing layer.

Model Lifecycle

Custom model deployment, fine-tuning pipelines, MLOps orchestration, continuous training pipelines

Model Serving & Inference Infrastructure

A guide to the GPU infrastructure, inference framework, and inference optimization layers. It gives a single map of the end-to-end LLM inference request path and the tuning levers at each layer: inference gateway, prefill/decode disaggregation, KV cache-aware routing, LMCache, and cache-hit strategy.

MoE Model Serving Concept Guide

Architecture concepts, distributed deployment strategies, and performance optimization principles for Mixture of Experts models