Model Serving & Inference Infrastructure
Deploying and serving LLMs on GPUs and other accelerators is covered in two layers.
- GPU Infrastructure Layer: Manages GPU instances, drivers, schedulers, and partitioning on top of Kubernetes; it determines which nodes receive GPUs and how they are allocated (see the sketch after this list).
- Inference Framework Layer: The framework layer that actually serves models, runs distributed inference, and fine-tunes on the GPUs provisioned by the layer below. vLLM, llm-d, MoE, and NeMo belong here.
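As a concrete sketch of the infrastructure layer, the snippet below uses the Kubernetes Python client to request one GPU for a pod through the NVIDIA device plugin and pin it to a node pool via a node label. The label key/value, container image, and namespace are assumptions, not values from this document; a MIG-partitioned cluster would typically request slice resources such as nvidia.com/mig-1g.5gb instead of a whole GPU.

```python
# Minimal sketch, assuming the NVIDIA device plugin and GPU feature discovery
# are installed and kubectl credentials are available locally.
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-smoke-test"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        # Placeholder label: steer the pod to a specific GPU node pool.
        node_selector={"nvidia.com/gpu.product": "A100-SXM4-80GB"},
        containers=[
            client.V1Container(
                name="cuda-check",
                image="nvidia/cuda:12.4.1-base-ubuntu22.04",  # placeholder image
                command=["nvidia-smi"],
                # The scheduler only places this pod on a node with a free GPU.
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"}
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```

If the pod schedules and `nvidia-smi` runs, the driver stack and device plugin on that node pool are working, which is the precondition for everything in the inference layer.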
Learning Order
Reading GPU Infrastructure → Inference Frameworks in that order is natural: first settle which nodes, partitioning scheme, and driver stack to use in GPU Infrastructure, then cover how to deploy vLLM and llm-d on top of that in Inference Frameworks, as sketched below.
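As a minimal sketch of the inference-framework layer, the snippet below loads a model with vLLM's offline API and generates a completion. The model name and tensor_parallel_size are assumptions and must match the GPUs the infrastructure layer actually allocated to the pod.

```python
# Minimal vLLM offline-inference sketch; run inside a container that was
# scheduled onto a GPU node by the infrastructure layer above.
from vllm import LLM, SamplingParams

# Placeholder model; tensor_parallel_size should equal the number of GPUs
# requested in the pod spec.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=1)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Summarize what llm-d adds on top of vLLM."], params)

for out in outputs:
    print(out.outputs[0].text)
```

In a real cluster deployment the same model is more commonly exposed through vLLM's OpenAI-compatible HTTP server (`vllm serve <model>`) behind a Kubernetes Service, which is the pattern llm-d builds on for distributed serving.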