Model Serving & Inference Infrastructure
Deploying and serving LLMs on GPUs and other accelerators is covered in two layers.
- GPU Infrastructure Layer: Manages GPU instances, drivers, schedulers, and partitioning on top of Kubernetes; it determines which nodes receive GPUs and how they are allocated (see the sketch after this list).
- Inference Framework Layer: The framework layer that actually serves models, runs distributed inference, and fine-tunes on the GPUs provisioned by the layer below. vLLM, llm-d, MoE, and NeMo belong here.
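As a concrete sketch of the infrastructure layer, the snippet below uses the Kubernetes Python client to request one GPU for a pod through the NVIDIA device plugin and pin it to a node pool via a node label. The label key/value, container image, and namespace are assumptions, not values from this document; a MIG-partitioned cluster would typically request slice resources such as nvidia.com/mig-1g.5gb instead of a whole GPU.

```python
# Minimal sketch, assuming the NVIDIA device plugin and GPU feature discovery
# are installed and kubectl credentials are available locally.
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-smoke-test"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        # Placeholder label: steer the pod to a specific GPU node pool.
        node_selector={"nvidia.com/gpu.product": "A100-SXM4-80GB"},
        containers=[
            client.V1Container(
                name="cuda-check",
                image="nvidia/cuda:12.4.1-base-ubuntu22.04",  # placeholder image
                command=["nvidia-smi"],
                # The scheduler only places this pod on a node with a free GPU.
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"}
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```

If the pod schedules and `nvidia-smi` runs, the driver stack and device plugin on that node pool are working, which is the precondition for everything in the inference layer.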
Learning Order
Reading GPU Infrastructure → Inference Frameworks in that order is natural: first settle which nodes, partitioning scheme, and driver stack to use in GPU Infrastructure, then cover how to deploy vLLM and llm-d on top of that in Inference Frameworks, as sketched below.
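As a minimal sketch of the inference-framework layer, the snippet below loads a model with vLLM's offline API and generates a completion. The model name and tensor_parallel_size are assumptions and must match the GPUs the infrastructure layer actually allocated to the pod.

```python
# Minimal vLLM offline-inference sketch; run inside a container that was
# scheduled onto a GPU node by the infrastructure layer above.
from vllm import LLM, SamplingParams

# Placeholder model; tensor_parallel_size should equal the number of GPUs
# requested in the pod spec.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=1)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Summarize what llm-d adds on top of vLLM."], params)

for out in outputs:
    print(out.outputs[0].text)
```

In a real cluster deployment the same model is more commonly exposed through vLLM's OpenAI-compatible HTTP server (`vllm serve <model>`) behind a Kubernetes Service, which is the pattern llm-d builds on for distributed serving.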