Inference Frameworks

This layer sits on top of the GPU infrastructure and performs the actual LLM serving, distributed inference, and fine-tuning. It covers single-node high-performance serving (vLLM), Kubernetes-native distributed inference (llm-d), MoE model serving, NVIDIA NeMo-based training, and gateway-level semantic caching.

🚀

vLLM Model Serving

High-performance LLM inference with PagedAttention, continuous batching, tensor/pipeline parallelism, and multi-LoRA hot-swapping.

🔀

llm-d Distributed Inference

Kubernetes-native distributed inference scheduler with KV cache-aware routing, prefix cache optimization, and disaggregated serving.

🧩

MoE Model Serving

Efficient serving of Mixture of Experts (MoE) models: expert parallelism, dynamic routing, and memory optimization.

🧠

NeMo Framework

Large-scale training and fine-tuning with NVIDIA NeMo: distributed training, EFA high-speed networking, and checkpointing.

⚡

Semantic Caching Strategy

Semantic caching at the LLM gateway level: similarity threshold design, the distinction between the three cache layers (KV, prompt, semantic), and observability and security guidance.

Reading Order

Reading the documents in the order vLLM → llm-d → MoE → NeMo follows a progression of increasing difficulty: single-node optimization → distributed inference → large-scale MoE → training framework.