
2 docs tagged with "serving"


Inference Frameworks

vLLM·llm-d·MoE·NeMo — the AI framework layer for model serving, distributed inference, and fine-tuning on GPUs

vLLM Model Serving

vLLM's PagedAttention, parallelization strategies, Multi-LoRA serving, and supported hardware architectures