Inference Frameworks
The AI framework layer that sits on top of the GPU infrastructure and performs the actual work of LLM serving, distributed inference, and fine-tuning. It covers single-node high-performance serving (vLLM), Kubernetes-native distributed inference (llm-d), Mixture-of-Experts (MoE) model inference, and NVIDIA NeMo-based training.
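To anchor the single-node layer concretely, here is a minimal vLLM offline-inference sketch. The model name and prompt are illustrative placeholders, not part of this repository; any Hugging Face-compatible model works.

```python
# Minimal vLLM offline-inference sketch (single-node serving layer).
# Assumption: "facebook/opt-125m" is just a small illustrative model
# chosen for a quick smoke test; substitute your own.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# generate() batches prompts and runs them through vLLM's paged-attention engine.
outputs = llm.generate(["What is paged attention?"], sampling)
for out in outputs:
    print(out.outputs[0].text)
```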
Reading Order
Read in the order vLLM → llm-d → MoE → NeMo. This follows a progression of increasing difficulty: single-node optimization → distributed inference → large-scale MoE → training framework.