Inference Frameworks

This section covers the AI framework layer that runs on top of the GPU infrastructure and performs the actual LLM serving, distributed inference, and fine-tuning. It includes single-node high-performance serving (vLLM), Kubernetes-native distributed inference (llm-d), MoE model processing, and NVIDIA NeMo-based training.

Reading Order

Read the pages in the order vLLM → llm-d → MoE → NeMo. This order follows increasing difficulty: single-node optimization → distributed inference → large-scale MoE → training framework.