
Disaggregated Serving + LWS Multi-Node

Overview

Large-scale LLM inference splits into two fundamentally different computational stages (Prefill / Decode), each with a distinct hardware profile, and 700B+ models cannot fit on a single node, so they require multi-node pipeline parallelism. This document covers the Disaggregated Serving architecture and LeaderWorkerSet (LWS)-based multi-node deployment patterns.

Disaggregated Serving

Need for Prefill/Decode Separation

LLM inference consists of two fundamentally different computational stages.

| Stage | Characteristics | Bottleneck | GPU Requirements |
| --- | --- | --- | --- |
| Prefill | Processes the entire input prompt | Compute-bound | High compute capability (TP=4) |
| Decode | Sequential token-by-token generation | Memory-bound | High memory bandwidth (TP=2) |

Running both stages in the same Pod lets Prefill's compute bursts degrade Decode's token latency. Separating them enables independent scaling of each stage and maximizes GPU utilization.

Separation Architecture
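
At a high level, the separated request path is: router → Prefill Pod (computes the prompt's KV cache) → GPU-to-GPU KV Cache transfer → Decode Pod (generates and streams output tokens). Each pool then scales on its own bottleneck: Prefill on compute, Decode on memory bandwidth. The following subsections cover the transfer engine and the node-level separation on EKS.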

NIXL: Common KV Cache Transfer Engine

NIXL (NVIDIA Inference Xfer Library) is the common KV transfer engine used by most projects including llm-d, Dynamo, production-stack, and aibrix. It provides ultra-fast GPU-to-GPU KV Cache transfer leveraging NVLink/RDMA.
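
As a concrete example, vLLM wires NIXL in through its KV-transfer connector. A minimal sketch of the container args for both Prefill and Decode Pods (assuming vLLM's NixlConnector; the exact JSON schema of --kv-transfer-config can vary between vLLM versions):

# Enable NIXL-based KV Cache transfer (same flag on Prefill and Decode Pods)
args:
  - "--kv-transfer-config"
  - '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'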

Disaggregated Serving on EKS Auto Mode

Since MIG partitioning is not possible on Auto Mode, roles are separated at the instance (node) level.

# Prefill-dedicated NodePool
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-prefill
spec:
  template:
    metadata:
      labels:
        llm-d-role: prefill
    spec:
      requirements:
        - key: eks.amazonaws.com/instance-family
          operator: In
          values: ["p5"]
      nodeClassRef:
        group: eks.amazonaws.com
        kind: NodeClass
        name: default
      taints:
        - key: llm-d-role
          value: prefill
          effect: NoSchedule
---
# Decode-dedicated NodePool
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-decode
spec:
  template:
    metadata:
      labels:
        llm-d-role: decode
    spec:
      requirements:
        - key: eks.amazonaws.com/instance-family
          operator: In
          values: ["p5"]
      nodeClassRef:
        group: eks.amazonaws.com
        kind: NodeClass
        name: default
      taints:
        - key: llm-d-role
          value: decode
          effect: NoSchedule

GPU Placement Strategy (see the Pod sketch after this list):

  • Prefill: 2 Prefill Pods per p5.48xlarge (each TP=4, 4 GPUs)
  • Decode: 4 Decode Pods per p5.48xlarge (each TP=2, 2 GPUs)
  • Minimizes GPU idle time
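
Each role's Pods pin themselves to the right pool with a matching nodeSelector and toleration. A minimal Prefill Pod sketch (the Pod name and model are illustrative; the Decode variant swaps the label/taint value to decode and requests 2 GPUs):

apiVersion: v1
kind: Pod
metadata:
  name: prefill-example            # illustrative
spec:
  nodeSelector:
    llm-d-role: prefill            # matches the NodePool label above
  tolerations:
    - key: llm-d-role
      operator: Equal
      value: prefill               # tolerates the prefill-only taint
      effect: NoSchedule
  containers:
    - name: vllm
      image: vllm/vllm-openai:v0.18.1
      command: ["vllm", "serve"]
      args:
        - "<model-id>"                 # illustrative
        - "--tensor-parallel-size=4"   # TP=4 per the placement strategy
      resources:
        limits:
          nvidia.com/gpu: "4"          # 2 such Pods fill a p5.48xlarge (8 GPUs)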

LWS-Based Multi-Node Large Model Serving

LeaderWorkerSet Overview

700B+ MoE models cannot fit on a single node (8 GPUs), requiring multi-node pipeline parallelism. LeaderWorkerSet (LWS) is a Kubernetes-native multi-node workload pattern that enables multi-node Pipeline Parallelism without depending on Ray.

LWS vs Ray Comparison

| Item | LWS + vLLM | Ray + vLLM |
| --- | --- | --- |
| Dependencies | LWS CRD only | Ray Cluster (head + worker) |
| Complexity | Low | High |
| Pod Management | K8s StatefulSet-based | Ray's own scheduler |
| Failure Recovery | RecreateGroupOnPodRestart | Ray reconnection |
| EKS Auto Mode | Compatible | Compatible |
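
LWS is not part of core Kubernetes; the only prerequisite is installing its CRD and controller from the upstream release manifests (the version shown is illustrative, pick the latest release):

kubectl apply --server-side -f https://github.com/kubernetes-sigs/lws/releases/download/v0.6.1/manifests.yaml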

Deployment Example: GLM-5 744B (PP=2, TP=8)

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm-glm5-fp8
  namespace: agentic-serving
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2  # leader + worker = 2 pods (16 GPUs)
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      spec:
        tolerations:
          - key: nvidia.com/gpu
            operator: Exists
            effect: NoSchedule
        containers:
          - name: vllm
            image: vllm/vllm-openai:v0.18.1
            command: ["vllm", "serve"]
            args:
              - "zai-org/GLM-5-FP8"
              - "--tensor-parallel-size=8"
              - "--pipeline-parallel-size=2"
              - "--gpu-memory-utilization=0.92"
              - "--enable-prefix-caching"
            env:
              - name: VLLM_USE_DEEP_GEMM
                value: "1"
              - name: NCCL_DEBUG
                value: "INFO"
            resources:
              limits:
                nvidia.com/gpu: "8"  # extended resources must be set as limits
            volumeMounts:
              - name: model-cache
                mountPath: /models
              - name: dshm
                mountPath: /dev/shm
        volumes:
          - name: model-cache
            emptyDir:
              sizeLimit: 1Ti
          - name: dshm
            emptyDir:
              medium: Memory  # tmpfs for PyTorch/NCCL shared memory
              sizeLimit: 32Gi
    workerTemplate:
      spec:
        # Same container spec as the leader (only the node rank differs in args)
        tolerations:
          - key: nvidia.com/gpu
            operator: Exists
            effect: NoSchedule
        containers:
          - name: vllm
            image: vllm/vllm-openai:v0.18.1
            command: ["vllm", "serve"]
            args:
              - "zai-org/GLM-5-FP8"
              - "--tensor-parallel-size=8"
              - "--pipeline-parallel-size=2"
              - "--gpu-memory-utilization=0.92"
              - "--enable-prefix-caching"
            env:
              - name: VLLM_USE_DEEP_GEMM
                value: "1"
            resources:
              limits:
                nvidia.com/gpu: "8"
            volumeMounts:
              - name: model-cache
                mountPath: /models
              - name: dshm
                mountPath: /dev/shm
        volumes:
          - name: model-cache
            emptyDir:
              sizeLimit: 1Ti
          - name: dshm
            emptyDir:
              medium: Memory
              sizeLimit: 32Gi
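
Only the leader Pod serves the OpenAI-compatible API, so client traffic is typically routed through a Service that selects on the labels the LWS controller attaches to Pods. A minimal sketch (assuming vLLM's default port 8000; worker-index 0 is the leader):

apiVersion: v1
kind: Service
metadata:
  name: vllm-glm5-fp8-leader
  namespace: agentic-serving
spec:
  selector:
    leaderworkerset.sigs.k8s.io/name: vllm-glm5-fp8
    leaderworkerset.sigs.k8s.io/worker-index: "0"  # leader Pod only
  ports:
    - name: http
      port: 8000
      targetPort: 8000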

NCCL / EFA Network Optimization

Inter-node communication performance is critical for multi-node pipeline parallelism. p5.48xlarge provides 3,200 Gbps of EFA (Elastic Fabric Adapter) bandwidth.

# NCCL environment variable optimization (add to LWS Pod)
env:
  - name: NCCL_DEBUG
    value: "INFO"
  - name: FI_PROVIDER
    value: "efa"        # route NCCL traffic over EFA via libfabric
  - name: FI_EFA_USE_DEVICE_RDMA
    value: "1"          # enable device-level RDMA on EFA
  - name: NCCL_ALGO
    value: "Ring"       # Ring is suitable for multi-node PP
  - name: NCCL_PROTO
    value: "Simple"     # Stable on EFA
  - name: NCCL_MIN_NCHANNELS
    value: "4"

LWS Failure Recovery

Setting restartPolicy: RecreateGroupOnPodRestart recreates the entire group when either the Leader or a Worker Pod fails. Multi-node NCCL communication requires every rank to be in sync, so a full group restart is more stable than restarting individual Pods.
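
Group recreation can be observed directly: delete any single Pod in the group and watch the whole group cycle back through Pending → Running together.

kubectl get pods -n agentic-serving -l leaderworkerset.sigs.k8s.io/name=vllm-glm5-fp8 -w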
