
NVIDIA GPU Software Stack

📅 Created: 2026-03-20 | Updated: 2026-03-20 | ⏱️ Reading Time: ~10 minutes

Overview

The NVIDIA GPU software stack is structured in three layers for efficient GPU operations in Kubernetes environments. GPU Operator (driver and infrastructure automation) connects GPUs to Kubernetes, DCGM (Data Center GPU Manager) monitors GPU status, and Run:ai handles GPU orchestration at the top layer. This document covers the configuration and operation of each layer, along with MIG/Time-Slicing partitioning strategies and the NVIDIA Dynamo distributed inference framework.


GPU Operator Architecture

GPU Operator Latest Version (v25.10.1, as of March 2026)
| Component | Version | Role |
|---|---|---|
| GPU Operator | v25.10.1 | Full GPU stack lifecycle management |
| NVIDIA Driver | 580.126.18 | GPU kernel driver |
| DCGM | v4.5.2 | GPU monitoring engine |
| DCGM Exporter | v4.5.2-4.8.1 | Prometheus metrics exposure |
| Device Plugin | v0.19.0 | K8s GPU resource registration |
| GFD (GPU Feature Discovery) | v0.19.0 | GPU node labeling |
| MIG Manager | v0.13.1 | MIG partition auto-management |
| Container Toolkit (CDI) | v1.17.5 | Container GPU runtime |

v25.10.1 Key Features:

  • Blackwell Architecture Support: Full support for B200/GB200 GPUs
  • HPC Job Mapping: GPU job-level metrics collection and accounting
  • CDMM (Confidential Data & Model Management): GPU support for Confidential Computing environments
  • CDI (Container Device Interface): Container runtime-independent device management

3-Layer Architecture

Role of each layer:

  • GPU Operator (Orchestrator): Orchestration layer that bundles the entire GPU stack via ClusterPolicy CRD. Each component (Driver, Container Toolkit, Device Plugin, DCGM Exporter, NFD, GFD, MIG Manager) can be independently enabled/disabled. Can be installed on EKS Auto Mode — only Device Plugin is disabled via node labels while other components (DCGM Exporter, NFD, GFD, etc.) operate normally.
  • DCGM (Sensor): Monitoring engine that reads GPU status. Collects SM Utilization, Tensor Core Activity, Memory, Power, Temperature, ECC Errors, etc.
  • Run:ai (Control Tower): Scheduling/management layer operating on top of GPU Operator and DCGM. Provides Fractional GPU, Dynamic MIG, Gang Scheduling, and Quota management.

Dependencies

| Combination | Possible | Use Case |
|---|---|---|
| GPU Operator only | Yes | Basic GPU inference, manual MIG setup, DCGM metrics |
| GPU Operator + Run:ai | Yes | Enterprise GPU cluster management (recommended) |
| DCGM only (manual driver install) | Yes | Bare metal, single-server monitoring |
| Run:ai only (without GPU Operator) | No | GPU Operator ClusterPolicy is a required dependency |
| EKS Auto Mode + Run:ai | Yes | Install GPU Operator, disable Device Plugin via label |

GPU Management by EKS Environment

| Node Type | GPU Driver | GPU Operator | MIG Support | Run:ai Support |
|---|---|---|---|---|
| Auto Mode | AWS auto-install | Installable (Device Plugin disabled via label) | Not supported | Supported (Device Plugin disabled via label) |
| Karpenter (Self-Managed) | GPU Operator install | Full support | Full support | Full support |
| Managed Node Group | GPU Operator install | Full support | Full support | Full support |
| Hybrid Node (on-premises) | GPU Operator required | Required | Full support | Full support |

Node Strategy Detailed Guide

For detailed information on hybrid configuration of EKS Auto Mode and Karpenter, GPU Operator installation methods, and Hybrid Node GPU farm setup, refer to EKS GPU Node Strategy.

GPU Operator Component Details

GPU Operator manages the entire GPU stack declaratively through the ClusterPolicy CRD.

GPU Driver Constraints by AMI
  • AL2023 / Bottlerocket: GPU drivers are pre-installed in the AMI, so GPU Operator's driver component must be set to enabled: false.
  • AL2 (Custom AMI): GPU Operator can install drivers directly.
  • EKS Auto Mode: AWS manages drivers automatically, so both driver and toolkit must be set to enabled: false.
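The constraints above can be encoded as a small decision helper. This is a hypothetical sketch (the function and node-type names are illustrative, not a real API) that maps a node type to the Helm value overrides and node labels described in this document.

```python
# Hypothetical helper (illustrative names, not a real API) encoding the
# per-AMI GPU Operator constraints described above.
def gpu_operator_overrides(node_type: str):
    """Return (helm --set overrides, extra node labels) for a node type."""
    if node_type in ("AL2023", "Bottlerocket"):
        # Driver and container toolkit are pre-installed in the AMI.
        return {"driver.enabled": "false", "toolkit.enabled": "false"}, {}
    if node_type == "AL2_CUSTOM":
        # GPU Operator installs the driver itself -- keep chart defaults.
        return {}, {}
    if node_type == "EKS_AUTO_MODE":
        # AWS manages driver/toolkit; Device Plugin is disabled via node label.
        return ({"driver.enabled": "false", "toolkit.enabled": "false"},
                {"nvidia.com/gpu.deploy.device-plugin": "false"})
    raise ValueError(f"unknown node type: {node_type}")

overrides, labels = gpu_operator_overrides("AL2023")
print(" ".join(f"--set {k}={v}" for k, v in overrides.items()))
# --set driver.enabled=false --set toolkit.enabled=false
```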

Helm Installation Example (Karpenter + Self-Managed):

# Add NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install GPU Operator (AL2023/Bottlerocket — disable driver/toolkit)
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set driver.enabled=false \
  --set toolkit.enabled=false \
  --set dcgmExporter.serviceMonitor.enabled=true \
  --set migManager.enabled=true \
  --set gfd.enabled=true \
  --set nfd.enabled=true

ClusterPolicy CRD Example:

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  operator:
    defaultRuntime: containerd
  driver:
    enabled: false  # AL2023/Bottlerocket: pre-installed in AMI
  toolkit:
    enabled: false  # AL2023/Bottlerocket: pre-installed in AMI
  devicePlugin:
    enabled: true
  dcgmExporter:
    enabled: true
    version: "4.5.2-4.8.1"
    serviceMonitor:
      enabled: true
  migManager:
    enabled: true
    config:
      name: default-mig-parted-config
  gfd:
    enabled: true
  nfd:
    enabled: true
  nodeStatusExporter:
    enabled: false

EKS Auto Mode NodePool Labels (Device Plugin Disabled):

On Auto Mode, install GPU Operator but disable only the Device Plugin via node labels. GPU Operator installation is required for projects like KAI Scheduler that depend on ClusterPolicy.

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-auto-mode
spec:
  template:
    metadata:
      labels:
        nvidia.com/gpu.deploy.device-plugin: "false"  # Disable Device Plugin
    spec:
      requirements:
        - key: eks.amazonaws.com/instance-family
          operator: In
          values: ["p5", "p4d"]
      nodeClassRef:
        group: eks.amazonaws.com
        kind: NodeClass
        name: default

GPU Operator Configuration by EKS Environment

| Environment | GPU Operator | Driver Management | MIG | Limitations |
|---|---|---|---|---|
| EKS Auto Mode | Installable (Device Plugin disabled) | AWS automatic (AMI pre-installed) | Not supported | Device Plugin disabled via label; DCGM/NFD/GFD operate normally |
| EKS + Karpenter | Helm install | Operator managed | Full support | GPU AMI required in NodePool |
| EKS Managed Node Group | Helm install | Operator managed | Full support | Node group-level management |
| EKS Hybrid Nodes | Helm install (required) | Operator required | Full support | On-premises GPU farm, network setup required |

DCGM Monitoring

NVIDIA DCGM (Data Center GPU Manager) is a core component that monitors GPU status and exposes metrics to Prometheus.

Deployment Method Selection

DCGM Exporter can be deployed as DaemonSet or Sidecar. DaemonSet is recommended for most production environments.

| Item | Description |
|---|---|
| Resource Efficiency | One instance per node, minimal overhead |
| Management | Centralized, auto-managed by GPU Operator |
| Metrics Scope | Collects all GPU metrics on the node |
| Security | Only the DaemonSet needs SYS_ADMIN |
| Suitable Environment | Production environments (most cases) |

Valid Sidecar Scenarios:

  • Multi-tenant billing: Need to precisely track GPU usage by tenant at Pod level
  • Cannot install DaemonSet: Environments with limited node access like EKS Auto Mode
  • Pod isolation: Need to independently monitor only specific Pod's GPU metrics
DaemonSet Deployment Example (Recommended):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: gpu-monitoring
  labels:
    app: dcgm-exporter
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: dcgm-exporter
          image: nvcr.io/nvidia/k8s/dcgm-exporter:4.5.2-4.8.1-ubuntu22.04
          ports:
            - name: metrics
              containerPort: 9400
          env:
            - name: DCGM_EXPORTER_LISTEN
              value: ":9400"
            - name: DCGM_EXPORTER_KUBERNETES
              value: "true"
            - name: DCGM_EXPORTER_COLLECTORS
              value: "/etc/dcgm-exporter/dcp-metrics-included.csv"
          volumeMounts:
            - name: pod-resources
              mountPath: /var/lib/kubelet/pod-resources
              readOnly: true
          securityContext:
            runAsNonRoot: false
            runAsUser: 0
            capabilities:
              add: ["SYS_ADMIN"]
      volumes:
        - name: pod-resources
          hostPath:
            path: /var/lib/kubelet/pod-resources

DCGM Exporter 3.3+ Features

DCGM Exporter 3.3+ provides the following enhanced features:

  • H100/H200 Support: Latest GPU metrics collection
  • Enhanced Metrics: More granular GPU status monitoring
  • Performance Improvements: Metrics collection with lower overhead

Sidecar Deployment (Special Purpose)

Use Kubernetes 1.33+'s stabilized Sidecar Containers to collect GPU metrics at Pod level.

apiVersion: v1
kind: Pod
metadata:
  name: vllm-with-monitoring
  namespace: ai-inference
spec:
  initContainers:
    # DCGM Exporter running as a sidecar (special purpose)
    - name: dcgm-sidecar
      image: nvcr.io/nvidia/k8s/dcgm-exporter:4.5.2-4.8.1-ubuntu22.04
      restartPolicy: Always  # K8s 1.33+ native sidecar
      ports:
        - name: metrics
          containerPort: 9400
      securityContext:
        capabilities:
          add: ["SYS_ADMIN"]
      resources:
        requests:
          cpu: "100m"
          memory: "128Mi"
        limits:
          cpu: "200m"
          memory: "256Mi"
  containers:
    - name: vllm
      image: vllm/vllm-openai:latest
      resources:
        requests:
          nvidia.com/gpu: 2
        limits:
          nvidia.com/gpu: 2

Key GPU Metrics

Core metrics collected by DCGM Exporter.

| Metric Name | Description | Scaling Usage |
|---|---|---|
| DCGM_FI_DEV_GPU_UTIL | GPU core utilization (%) | HPA trigger criterion |
| DCGM_FI_DEV_MEM_COPY_UTIL | Memory bandwidth utilization (%) | Memory bottleneck detection |
| DCGM_FI_DEV_FB_USED | Framebuffer usage (MB) | OOM prevention |
| DCGM_FI_DEV_FB_FREE | Framebuffer available (MB) | Capacity planning |
| DCGM_FI_DEV_POWER_USAGE | Power usage (W) | Cost monitoring |
| DCGM_FI_DEV_SM_CLOCK | SM clock speed (MHz) | Performance monitoring |
| DCGM_FI_DEV_GPU_TEMP | GPU temperature (°C) | Thermal management |
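These metrics arrive in the standard Prometheus text exposition format on port 9400. A minimal sketch of consuming them (the sample payload and the 80% threshold are invented for illustration):

```python
# Sketch: parse a few lines of DCGM Exporter's Prometheus text output and
# derive the signals described above (HPA trigger, OOM headroom).
# The SAMPLE payload is made up for illustration.
SAMPLE = """\
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-aaa"} 87
DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-aaa"} 35840
DCGM_FI_DEV_FB_FREE{gpu="0",UUID="GPU-aaa"} 5120
"""

def parse_metrics(text):
    """Map metric name -> value (labels ignored for this single-GPU sketch)."""
    metrics = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        name = line.split("{", 1)[0]
        value = float(line.rsplit(" ", 1)[1])
        metrics[name] = value
    return metrics

m = parse_metrics(SAMPLE)
scale_up = m["DCGM_FI_DEV_GPU_UTIL"] > 80            # HPA trigger criterion
fb_free_ratio = m["DCGM_FI_DEV_FB_FREE"] / (
    m["DCGM_FI_DEV_FB_USED"] + m["DCGM_FI_DEV_FB_FREE"])  # OOM headroom
print(scale_up, round(fb_free_ratio, 3))  # True 0.125
```

In production these computations would live in PromQL recording rules rather than application code; the snippet only shows what the raw exporter output looks like.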

Prometheus ServiceMonitor

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: gpu-monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
  namespaceSelector:
    matchNames:
      - gpu-monitoring

GPU Partitioning Strategies

MIG (Multi-Instance GPU) Based Partitioning

MIG divides H100, A100, H200, and other Ampere/Hopper/Blackwell architecture GPUs into up to 7 independent GPU instances. Each MIG instance has isolated memory, cache, and SM (Streaming Multiprocessor), ensuring stable performance without workload interference.

GPU Operator's MIG Manager automatically manages MIG profiles based on ConfigMap.

# MIG Profile ConfigMap (mig-parted format)
apiVersion: v1
kind: ConfigMap
metadata:
  name: default-mig-parted-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      # 7 small instances: multi-serving of small models
      all-1g.5gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.5gb": 7

      # Mixed configuration: simultaneous large + small operation
      mixed-balanced:
        - devices: all
          mig-enabled: true
          mig-devices:
            "3g.20gb": 1
            "2g.10gb": 1
            "1g.5gb": 2

      # Single large instance: 70B+ models
      single-7g:
        - devices: all
          mig-enabled: true
          mig-devices:
            "7g.40gb": 1

---

# Apply MIG profile (select by node label)
# kubectl label node gpu-node-01 nvidia.com/mig.config=mixed-balanced

---

# Use a MIG device in a Pod
apiVersion: v1
kind: Pod
metadata:
  name: vllm-mig-inference
  namespace: ai-inference
spec:
  containers:
    - name: vllm
      image: vllm/vllm-openai:latest
      command: ["python", "-m", "vllm.entrypoints.openai.api_server"]
      args:
        - "--model"
        - "meta-llama/Llama-2-7b-hf"
        - "--gpu-memory-utilization"
        - "0.9"
      resources:
        requests:
          memory: "4Gi"
          cpu: "4"
          nvidia.com/mig-1g.5gb: 1  # Request a specific MIG profile
        limits:
          nvidia.com/mig-1g.5gb: 1

A100 40GB MIG Profiles:

| Profile | Memory | SM Count | Use Case | Expected Throughput |
|---|---|---|---|---|
| 1g.5gb | 5GB | 14 | Small models (3B or less) | ~20 tok/s |
| 1g.10gb | 10GB | 14 | Small models (3B-7B) | ~25 tok/s |
| 2g.10gb | 10GB | 28 | Medium models (7B-13B) | ~50 tok/s |
| 3g.20gb | 20GB | 42 | Medium-large models (13B-30B) | ~100 tok/s |
| 4g.20gb | 20GB | 56 | Large models (13B-30B) | ~130 tok/s |
| 7g.40gb | 40GB | 98 | Extra-large models (70B+) | ~200 tok/s |
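Choosing a profile comes down to fitting the model's memory footprint. A minimal sketch, using the profile sizes from the table above; the fp16 assumption and the 1.2x KV cache headroom factor are illustrative, not a sizing rule:

```python
# Sketch: pick the smallest A100 MIG profile (from the table above) that
# fits an estimated model footprint. The 2 bytes/param (fp16) and 1.2x
# headroom factor are assumptions for illustration.
PROFILES = [  # (profile, memory in GB), ascending
    ("1g.5gb", 5), ("1g.10gb", 10), ("2g.10gb", 10),
    ("3g.20gb", 20), ("4g.20gb", 20), ("7g.40gb", 40),
]

def pick_profile(params_billion, bytes_per_param=2, headroom=1.2):
    """Smallest profile whose memory covers the weights plus headroom."""
    need_gb = params_billion * bytes_per_param * headroom
    for name, mem_gb in PROFILES:
        if mem_gb >= need_gb:
            return name
    return "full GPU (no MIG)"

print(pick_profile(3))   # 3B fp16 ~= 7.2 GB with headroom -> 1g.10gb
print(pick_profile(13))  # 13B fp16 ~= 31.2 GB -> 7g.40gb
print(pick_profile(70))  # does not fit a single MIG slice
```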

Time-Slicing Based Partitioning

Time-Slicing divides GPU computing time into slices, allowing multiple Pods to share the same GPU. Unlike MIG, it works on all NVIDIA GPUs, but it provides no memory isolation between workloads, and performance degrades under concurrent execution.
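The scheduling effect is simple multiplication: the Device Plugin advertises `replicas` schedulable `nvidia.com/gpu` units per physical GPU, while memory remains unpartitioned. A sketch of that capacity math:

```python
# Sketch of time-slicing capacity math: the Device Plugin advertises
# replicas x physical GPUs as schedulable nvidia.com/gpu units. Memory is
# NOT partitioned -- every Pod still sees the whole GPU's memory.
def advertised_gpus(physical_gpus: int, replicas: int) -> int:
    return physical_gpus * replicas

# Example: a node with 8 physical GPUs and replicas: 4 exposes 32 units,
# so up to 32 Pods requesting nvidia.com/gpu: 1 can land on it.
print(advertised_gpus(8, 4))  # 32
```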

Configure Time-Slicing via GPU Operator's ClusterPolicy or ConfigMap.

# Time-Slicing ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
          - name: nvidia.com/gpu
            replicas: 4  # Each GPU shared by 4 Pods

---

# Enable Time-Slicing in ClusterPolicy
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  devicePlugin:
    enabled: true
    config:
      name: time-slicing-config  # Reference the ConfigMap
      default: any

---

# Use a Time-Sliced GPU in a Pod (same as a regular GPU request)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-timeslice-replicas
  namespace: ai-inference
spec:
  replicas: 3  # 3 Pods share the same GPU
  selector:
    matchLabels:
      app: vllm-slice
  template:
    metadata:
      labels:
        app: vllm-slice
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          resources:
            requests:
              nvidia.com/gpu: 1  # A GPU slice is allocated when Time-Slicing is enabled
              memory: "8Gi"
              cpu: "2"
            limits:
              nvidia.com/gpu: 1

Time-Slicing Performance Considerations:

  • Context Switching Overhead: Minimal, around the 1% level
  • Concurrent Execution Slowdown: GPU memory and compute are shared, so per-workload performance can degrade 50-100% depending on the number of concurrent workloads
  • No Memory Isolation: Unlike MIG, GPU memory is not isolated between workloads, so one workload's OOM affects the others

| Use Case | Suitability | Reason |
|---|---|---|
| Batch inference, non-urgent tasks | Suitable | Sequential execution minimizes performance impact |
| Development/test environments | Suitable | GPU cost savings; performance guarantees unnecessary |
| Real-time inference (with SLA) | Unsuitable | Unpredictable latency under concurrent workloads |
| High-performance training | Unsuitable | Requires full GPU memory utilization |

NVIDIA Dynamo: Datacenter-Scale Inference Optimization

Overview

NVIDIA Dynamo is an open-source framework that optimizes LLM inference at datacenter scale. It supports vLLM, SGLang, and TensorRT-LLM as backends, achieving up to 7x performance improvement over existing solutions in the SemiAnalysis InferenceX benchmark.

Dynamo v1.0 (2026.03 GA)
  • Supported Backends: vLLM, SGLang, TensorRT-LLM
  • Serving Modes: Both Aggregated + Disaggregated equally supported
  • Core Technologies: Flash Indexer (radix tree KV indexing), NIXL (common KV transfer), KAI Scheduler (GPU-aware Pod placement), Planner (SLO-based autoscaling), EPP (Gateway API integration)
  • Deployment: Kubernetes Operator + CRD based
  • License: Apache 2.0

Core Architecture

Dynamo supports both Aggregated Serving and Disaggregated Serving equally. In Disaggregated mode, it separates Prefill (prompt processing) and Decode (token generation) so each stage can scale independently. The latest release introduces the radix tree-based Flash Indexer, which indexes the KV cache per worker to optimize prefix matching.
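The routing idea behind prefix matching can be sketched in a few lines. This is a conceptual illustration only, not Dynamo's actual Flash Indexer (which uses a radix tree for sublinear lookup); it routes a request to the worker whose cached sequences share the longest common token prefix:

```python
# Conceptual sketch of prefix-aware KV cache routing (NOT Dynamo's real
# implementation): send the request to the worker with the longest cached
# common prefix, maximizing KV cache reuse and reducing TTFT.
def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(request_tokens, worker_prefixes):
    """worker_prefixes: {worker_name: [cached token sequences]}"""
    def best_hit(prefixes):
        return max((common_prefix_len(request_tokens, p) for p in prefixes),
                   default=0)
    return max(worker_prefixes, key=lambda w: best_hit(worker_prefixes[w]))

workers = {
    "decode-0": [[1, 2, 3, 4]],          # cached one conversation
    "decode-1": [[9, 8, 7], [1, 2, 5]],  # cached two others
}
print(route([1, 2, 3, 99], workers))  # decode-0 (3-token prefix hit)
```

A radix tree makes this lookup proportional to the matched prefix length instead of scanning every cached sequence, which is what matters at datacenter scale.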

Core Components

| Component | Role | Benefits |
|---|---|---|
| Disaggregated Serving | Separate Prefill/Decode workers (Aggregated also supported) | Independent scaling per stage, maximized GPU utilization |
| KV Cache Routing | Prefix-aware request routing | Higher KV Cache hit rate, lower TTFT |
| Flash Indexer | Radix tree-based KV cache indexing per worker | Optimized prefix matching, maximized KV reuse rate |
| KVBM (KV Block Manager) | GPU → CPU → SSD 3-tier cache | Memory efficiency, large-context support |
| NIXL | NVIDIA Inference Transfer Library (common KV transfer engine) | Ultra-fast KV Cache transfer between GPUs (NVLink/RDMA); also used by llm-d, production-stack, aibrix, and most other projects |
| KAI Scheduler | GPU-aware K8s Pod scheduler | GPU topology and MIG slice-aware Pod placement; depends on ClusterPolicy |
| Planner | SLO-based autoscaling | Profiling results feed the Planner, which scales automatically toward SLO targets |
| EPP (Endpoint Picker Protocol) | Gateway API integration | Dynamo's own EPP implementation for native K8s Gateway API integration |

EKS Deployment

Dynamo is deployed to EKS using the Kubernetes Operator pattern.

Installation Steps:

# 1. Monitoring stack (Prometheus + Grafana)
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace

# 2. GPU Operator (skip if already installed)
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace

# 3. Dynamo Platform (Operator + etcd + NATS)
helm install dynamo-platform nvidia/dynamo-platform \
  --namespace dynamo-system --create-namespace

# 4. Deploy the Dynamo vLLM workload
kubectl apply -f dynamo-vllm-deployment.yaml

DynamoGraphDeploymentRequest (DGDR) CRD-based deployment:

apiVersion: dynamo.nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: llama-70b-disagg
  namespace: ai-inference
spec:
  graph:
    name: disaggregated-llm
    engine: vllm
    model: meta-llama/Llama-3.1-70B-Instruct
  serving:
    mode: disaggregated  # aggregated | disaggregated
    prefill:
      replicas: 2
      resources:
        nvidia.com/gpu: 4
    decode:
      replicas: 4
      resources:
        nvidia.com/gpu: 2
  routing:
    strategy: prefix-aware
    kvCacheRouting: true
  sla:
    maxTTFT: 500ms
    maxITL: 50ms
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 16
    targetUtilization: 70

AIConfigurator

Dynamo's AIConfigurator automatically recommends optimal Tensor Parallelism (TP) and Pipeline Parallelism (PP) settings based on model and hardware configuration.

| Feature | Description |
|---|---|
| Automatic TP/PP Recommendation | Optimal parallelization based on model size, GPU memory, network topology |
| Pareto Frontier | Finds optimal points on the throughput-latency tradeoff |
| Hardware Profiling | Auto-detects GPU-to-GPU bandwidth and NVLink topology |
| SLA-based Optimization | Configuration recommendations based on target TTFT/ITL |
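The Pareto-frontier step can be illustrated concretely. A minimal sketch, not AIConfigurator's actual algorithm; the candidate configurations and their throughput/TTFT numbers are invented:

```python
# Sketch of Pareto-frontier selection over parallelism configurations:
# keep only candidates that no other candidate strictly beats on BOTH
# throughput (higher is better) and TTFT latency (lower is better).
# All numbers below are invented for illustration.
candidates = [  # (config, throughput tok/s, TTFT ms)
    ("TP2",     1200, 450),
    ("TP4",     2100, 380),
    ("TP8",     2600, 520),
    ("TP4-PP2", 2000, 600),
]

def pareto_frontier(points):
    frontier = []
    for name, tput, lat in points:
        dominated = any(t2 > tput and l2 < lat
                        for n2, t2, l2 in points if n2 != name)
        if not dominated:
            frontier.append(name)
    return frontier

print(pareto_frontier(candidates))  # ['TP4', 'TP8']
```

Given a target SLA (say, TTFT <= 500ms), the recommendation would then be the highest-throughput frontier point meeting the constraint, here TP4.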

llm-d vs Dynamo Selection Guide

Both llm-d and NVIDIA Dynamo handle LLM inference routing/scheduling, but they are alternatives: in most cases you choose one rather than combining them.

Feature Comparison

| Item | llm-d | NVIDIA Dynamo |
|---|---|---|
| Architecture | Aggregated + Disaggregated | Aggregated + Disaggregated (equally supported) |
| KV Cache Routing | Prefix-aware routing | Prefix-aware + Flash Indexer (radix tree) |
| KV Cache Transfer | NIXL (network also supported) | NIXL (NVLink/RDMA ultra-fast transfer) |
| Routing | Gateway API + Envoy EPP | Dynamo Router + own EPP (Gateway API integration) |
| Pod Scheduling | Default K8s scheduler (no built-in) | KAI Scheduler (GPU-aware Pod placement) |
| Autoscaling | HPA/KEDA integration | Planner (SLO-based: profiling → autoscale) + KEDA/HPA |
| vLLM Backend | Supported | Supported (also SGLang, TRT-LLM) |
| Kubernetes Integration | Gateway API native | Operator + CRD (DGDR) + Gateway API EPP |
| Complexity | Low; add a router to existing vLLM | High; replace the entire serving stack |
| Performance Gain | Lower TTFT via prefix hits | Flash Indexer + Disaggregated for up to 7x throughput |
| Maturity | v0.5+ | v1.0 GA (2026.03) |

Why Difficult to Use Together

Both act as routers that decide which backend to send requests to. Since they compete at the routing layer, connecting two routers serially makes no sense.

llm-d alone:  Client → llm-d Router → vLLM Workers (Aggregated or Prefill/Decode separated)
Dynamo alone: Client → Dynamo Router → Prefill Workers → (NIXL) → Decode Workers

Dynamo + llm-d Integration Possibility

Dynamo 1.0 can integrate llm-d as an internal component. In that case, llm-d acts not as an independent router but as Dynamo's KV Cache-aware routing layer. In this sense the two are not strict alternatives; Dynamo can be viewed as a superset containing llm-d.

Selection Criteria

| Scenario | Recommendation |
|---|---|
| Add routing only to an existing vLLM deployment | llm-d |
| Small to medium scale (8 GPUs or fewer) | llm-d |
| Gateway API-based K8s-native routing | llm-d |
| Large scale (16+ GPUs), maximize throughput | Dynamo |
| NIXL-based ultra-fast KV transfer between GPUs | Dynamo |
| Long-context (128K+) workloads | Dynamo (NIXL + 3-tier KV cache) |
| Fast adoption, low operational complexity | llm-d |

Migration Path

Starting with llm-d and transitioning to Dynamo as scale grows is practical. Both use vLLM as backend and leverage NIXL for KV transfer. Key differences are Dynamo's Flash Indexer (radix tree KV indexing), KAI Scheduler (GPU-aware Pod placement), and Planner (SLO-based autoscaling).


Summary

The NVIDIA GPU software stack consists of three layers: GPU Operator (infrastructure automation), DCGM (monitoring), and Run:ai (orchestration). GPUs can be efficiently partitioned through MIG and Time-Slicing, and NVIDIA Dynamo can be used to optimize LLM inference at datacenter scale.
