
Disaggregated Serving + LWS Multi-Node

Overview

Large-scale LLM inference splits into two fundamentally different computational stages (Prefill / Decode), each with a distinct hardware profile, and 700B+ models cannot fit on a single node, so they require multi-node pipeline parallelism. This document covers the Disaggregated Serving architecture and LeaderWorkerSet (LWS)-based multi-node deployment patterns.

Disaggregated Serving

Need for Prefill/Decode Separation

LLM inference consists of two fundamentally different computational stages.

| Stage | Characteristics | Bottleneck | GPU Requirements |
| --- | --- | --- | --- |
| Prefill | Processes the entire input prompt | Compute-bound | High compute capability (TP=4) |
| Decode | Sequential token-by-token generation | Memory-bound | High memory bandwidth (TP=2) |

Running both stages in the same Pod lets Prefill's compute bursts degrade Decode's token latency. Separating them enables independent scaling of each stage and maximizes GPU utilization.

Separation Architecture
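
At a high level, the separated request path is: router → Prefill Pod (computes the prompt's KV cache) → GPU-to-GPU KV Cache transfer → Decode Pod (generates and streams output tokens). Each pool then scales on its own bottleneck: Prefill on compute, Decode on memory bandwidth. The following subsections cover the transfer engine and the node-level separation on EKS.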

NIXL: Common KV Cache Transfer Engine

NIXL (NVIDIA Inference Xfer Library) is the common KV transfer engine used by most projects including llm-d, Dynamo, production-stack, and aibrix. It provides ultra-fast GPU-to-GPU KV Cache transfer leveraging NVLink/RDMA.
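
As a concrete example, vLLM wires NIXL in through its KV-transfer connector. A minimal sketch of the container args for both Prefill and Decode Pods (assuming vLLM's NixlConnector; the exact JSON schema of --kv-transfer-config can vary between vLLM versions):

# Enable NIXL-based KV Cache transfer (same flag on Prefill and Decode Pods)
args:
  - "--kv-transfer-config"
  - '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'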

Disaggregated Serving on EKS Auto Mode

Since MIG partitioning is not possible on Auto Mode, roles are separated at the instance (node) level.

# Prefill-dedicated NodePool
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-prefill
spec:
  template:
    metadata:
      labels:
        llm-d-role: prefill
    spec:
      requirements:
        - key: eks.amazonaws.com/instance-family
          operator: In
          values: ["p5"]
      nodeClassRef:
        group: eks.amazonaws.com
        kind: NodeClass
        name: default
      taints:
        - key: llm-d-role
          value: prefill
          effect: NoSchedule
---
# Decode-dedicated NodePool
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-decode
spec:
  template:
    metadata:
      labels:
        llm-d-role: decode
    spec:
      requirements:
        - key: eks.amazonaws.com/instance-family
          operator: In
          values: ["p5"]
      nodeClassRef:
        group: eks.amazonaws.com
        kind: NodeClass
        name: default
      taints:
        - key: llm-d-role
          value: decode
          effect: NoSchedule

GPU Placement Strategy (see the Pod sketch after this list):

  • Prefill: 2 Prefill Pods per p5.48xlarge (each TP=4, 4 GPUs)
  • Decode: 4 Decode Pods per p5.48xlarge (each TP=2, 2 GPUs)
  • Minimizes GPU idle time
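
Each role's Pods pin themselves to the right pool with a matching nodeSelector and toleration. A minimal Prefill Pod sketch (the Pod name and model are illustrative; the Decode variant swaps the label/taint value to decode and requests 2 GPUs):

apiVersion: v1
kind: Pod
metadata:
  name: prefill-example            # illustrative
spec:
  nodeSelector:
    llm-d-role: prefill            # matches the NodePool label above
  tolerations:
    - key: llm-d-role
      operator: Equal
      value: prefill               # tolerates the prefill-only taint
      effect: NoSchedule
  containers:
    - name: vllm
      image: vllm/vllm-openai:v0.18.1
      command: ["vllm", "serve"]
      args:
        - "<model-id>"                 # illustrative
        - "--tensor-parallel-size=4"   # TP=4 per the placement strategy
      resources:
        limits:
          nvidia.com/gpu: "4"          # 2 such Pods fill a p5.48xlarge (8 GPUs)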

LWS-Based Multi-Node Large Model Serving

LeaderWorkerSet Overview

700B+ MoE models cannot fit on a single node (8 GPUs), requiring multi-node pipeline parallelism. LeaderWorkerSet (LWS) is a Kubernetes-native multi-node workload pattern that enables multi-node Pipeline Parallelism without depending on Ray.

LWS vs Ray Comparison

| Item | LWS + vLLM | Ray + vLLM |
| --- | --- | --- |
| Dependencies | LWS CRD only | Ray Cluster (head + worker) |
| Complexity | Low | High |
| Pod Management | K8s StatefulSet-based | Ray's own scheduler |
| Failure Recovery | RecreateGroupOnPodRestart | Ray reconnection |
| EKS Auto Mode | Compatible | Compatible |
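
LWS is not part of core Kubernetes; the only prerequisite is installing its CRD and controller from the upstream release manifests (the version shown is illustrative, pick the latest release):

kubectl apply --server-side -f https://github.com/kubernetes-sigs/lws/releases/download/v0.6.1/manifests.yaml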

Deployment Example: GLM-5 744B (PP=2, TP=8)

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm-glm5-fp8
  namespace: agentic-serving
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2  # leader + worker = 2 pods (16 GPUs)
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      spec:
        tolerations:
          - key: nvidia.com/gpu
            operator: Exists
            effect: NoSchedule
        containers:
          - name: vllm
            image: vllm/vllm-openai:v0.18.1
            command: ["vllm", "serve"]
            args:
              - "zai-org/GLM-5-FP8"
              - "--tensor-parallel-size=8"
              - "--pipeline-parallel-size=2"
              - "--gpu-memory-utilization=0.92"
              - "--enable-prefix-caching"
            env:
              - name: VLLM_USE_DEEP_GEMM
                value: "1"
              - name: NCCL_DEBUG
                value: "INFO"
            resources:
              limits:
                nvidia.com/gpu: "8"  # extended resources must be set as limits
            volumeMounts:
              - name: model-cache
                mountPath: /models
              - name: dshm
                mountPath: /dev/shm
        volumes:
          - name: model-cache
            emptyDir:
              sizeLimit: 1Ti
          - name: dshm
            emptyDir:
              medium: Memory  # tmpfs for PyTorch/NCCL shared memory
              sizeLimit: 32Gi
    workerTemplate:
      spec:
        # Same container spec as the leader (only the node rank differs in args)
        tolerations:
          - key: nvidia.com/gpu
            operator: Exists
            effect: NoSchedule
        containers:
          - name: vllm
            image: vllm/vllm-openai:v0.18.1
            command: ["vllm", "serve"]
            args:
              - "zai-org/GLM-5-FP8"
              - "--tensor-parallel-size=8"
              - "--pipeline-parallel-size=2"
              - "--gpu-memory-utilization=0.92"
              - "--enable-prefix-caching"
            env:
              - name: VLLM_USE_DEEP_GEMM
                value: "1"
            resources:
              limits:
                nvidia.com/gpu: "8"
            volumeMounts:
              - name: model-cache
                mountPath: /models
              - name: dshm
                mountPath: /dev/shm
        volumes:
          - name: model-cache
            emptyDir:
              sizeLimit: 1Ti
          - name: dshm
            emptyDir:
              medium: Memory
              sizeLimit: 32Gi
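
Only the leader Pod serves the OpenAI-compatible API, so client traffic is typically routed through a Service that selects on the labels the LWS controller attaches to Pods. A minimal sketch (assuming vLLM's default port 8000; worker-index 0 is the leader):

apiVersion: v1
kind: Service
metadata:
  name: vllm-glm5-fp8-leader
  namespace: agentic-serving
spec:
  selector:
    leaderworkerset.sigs.k8s.io/name: vllm-glm5-fp8
    leaderworkerset.sigs.k8s.io/worker-index: "0"  # leader Pod only
  ports:
    - name: http
      port: 8000
      targetPort: 8000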

NCCL / EFA Network Optimization

Inter-node communication performance is critical for multi-node pipeline parallelism. p5.48xlarge provides 3,200 Gbps of EFA (Elastic Fabric Adapter) bandwidth.

# NCCL environment variable optimization (add to LWS Pod)
env:
  - name: NCCL_DEBUG
    value: "INFO"
  - name: FI_PROVIDER
    value: "efa"        # route NCCL traffic over EFA via libfabric
  - name: FI_EFA_USE_DEVICE_RDMA
    value: "1"          # enable device-level RDMA on EFA
  - name: NCCL_ALGO
    value: "Ring"       # Ring is suitable for multi-node PP
  - name: NCCL_PROTO
    value: "Simple"     # Stable on EFA
  - name: NCCL_MIN_NCHANNELS
    value: "4"

LWS Failure Recovery

Setting restartPolicy: RecreateGroupOnPodRestart recreates the entire group when either the Leader or a Worker Pod fails. Multi-node NCCL communication requires every rank to be in sync, so a full group restart is more stable than restarting individual Pods.
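
Group recreation can be observed directly: delete any single Pod in the group and watch the whole group cycle back through Pending → Running together.

kubectl get pods -n agentic-serving -l leaderworkerset.sigs.k8s.io/name=vllm-glm5-fp8 -w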
