Skip to content

Dynamo Platform

NVIDIA Dynamo Platform provides orchestration, scheduling, and lifecycle management for LLM serving workloads. It includes CRDs, Operator, etcd, NATS, Grove KV Router, and KAI Scheduler.

Category nvidia-platform
Official Docs NVIDIA Dynamo Platform
CLI Install ./cli nvidia-platform dynamo-platform install
CLI Uninstall ./cli nvidia-platform dynamo-platform uninstall
Namespace dynamo-system

Overview

The Dynamo Platform is the control plane for NVIDIA's LLM serving infrastructure. It manages: - DynamoGraphDeployment (DGD): Declarative model deployment with aggregated/disaggregated modes - DynamoGraphDeploymentRequest (DGDR): SLA-driven auto-configuration and deployment - DynamoWorkerMetadata: Worker state tracking and discovery - Grove: KV-aware request routing - KAI Scheduler: Intelligent GPU scheduling - etcd: Service discovery and state storage - NATS: Internal messaging bus

Installation

./cli nvidia-platform dynamo-platform install

Auto-Configuration

The installer automatically detects and configures:

Component Detection Action
Prometheus monitoring installed Auto-configure prometheusEndpoint
GPU Operator Detected Status check only
StorageClass (K8s) local-path not found Auto-install local-path-provisioner
Ingress (K8s) Not found Prompt to install
EFS (EKS) Not found Auto-install EFS CSI Driver + StorageClass

Platform-Specific Configuration

K8s Mode: - Uses nfs StorageClass for model cache PVC (ReadWriteMany) - Uses local-path for etcd/NATS persistence (ReadWriteOnce) - Prompts to install ingress-nginx if not found

EKS Mode: - Uses efs StorageClass for model cache PVC (ReadWriteMany) - Auto-installs EFS CSI Driver if not found - ALB Ingress annotations auto-added

Verification

# Check Dynamo Platform pods
kubectl get pods -n dynamo-system

# Check CRDs
kubectl get crds | grep nvidia.com

# Check etcd cluster
kubectl get pods -n dynamo-system -l app.kubernetes.io/name=etcd

# Check NATS
kubectl get pods -n dynamo-system -l app.kubernetes.io/name=nats

# Check Dynamo Operator
kubectl logs -n dynamo-system -l app=dynamo-operator

Expected CRDs: - dynamographdeployments.nvidia.com - dynamographdeploymentrequests.nvidia.com - dynamoworkermetadatas.nvidia.com

Configuration

Configuration is managed through config.json:

{
  "platform": {
    "k8s": {
      "storageClass": "nfs",
      "dynamoPlatform": {
        "releaseVersion": "0.9.0-post1",
        "namespace": "dynamo-system",
        "groveEnabled": true,
        "kaiSchedulerEnabled": true
      }
    },
    "eks": {
      "storageClass": "efs",
      "dynamoPlatform": {
        "releaseVersion": "0.9.1",
        "namespace": "dynamo-system",
        "groveEnabled": true,
        "kaiSchedulerEnabled": true
      }
    }
  }
}

Storage Architecture

The Dynamo Platform uses two different StorageClasses:

StorageClass Purpose Access Mode Used By
nfs / efs Model cache PVC ReadWriteMany Dynamo vLLM (model download)
local-path etcd, NATS persistence ReadWriteOnce etcd, NATS StatefulSets

Components

Dynamo Operator

The Operator reconciles DynamoGraphDeployment and DynamoGraphDeploymentRequest resources: - Creates Frontend + Worker Deployments/StatefulSets - Configures Prometheus PodMonitors - Sets worker environment variables (DYN_SYSTEM_PORT, structured logging) - Manages service discovery (etcd or kubernetes)

etcd Cluster

Distributed key-value store for: - Service discovery (preferred over kubernetes-native for KVBM stability) - Worker metadata and health state - Configuration coordination

NATS

Lightweight messaging system for: - Internal component communication - Event-driven orchestration - Worker state updates

Grove KV Router

KV-aware request routing: - Routes requests to workers with cached KV blocks - Improves Time to First Token (TTFT) - Configurable temperature and overlap scoring

KAI Scheduler

Intelligent GPU scheduling for: - Multi-instance GPU (MIG) workloads - Fractional GPU allocation - Resource optimization

Discovery Backends

Dynamo Platform supports two discovery backends:

Backend Description Use Case
etcd External etcd cluster Recommended for KVBM stability in multi-replica disaggregated mode
kubernetes K8s-native discovery Simpler setup, but may cause KVBM handshake failures

The CLI defaults to etcd for all deployments. DGDR templates include nvidia.com/dynamo-discovery-backend: etcd annotation.

Integration with Monitoring

When Monitoring is installed before Dynamo Platform: 1. Installer detects Prometheus Service 2. Sets --set prometheusEndpoint=http://prometheus-kube-prometheus-prometheus.monitoring:9090 3. Dynamo Operator auto-creates PodMonitors 4. Worker metrics endpoint auto-configured at :9090/metrics

Known Issues

CRD Chart Version Mismatch (v0.9.0)

Issue Workaround
CRD chart is v0.9.0, Platform chart is v0.9.0-post1 CLI strips -postN suffix when fetching CRD chart
DynamoWorkerMetadata CRD missing from v0.9.0 chart CLI applies bundled CRD after Helm install

Discovery Backend + KVBM (v0.9.0)

Issue Workaround
discoveryBackend: kubernetes causes KVBM handshake failures Platform defaults to discoveryBackend: etcd

Learn More