
NVIDIA GPU Stack

The NVIDIA GPU software stack is organized in a layered structure for operating GPUs in Kubernetes environments.

| Layer | Role | Core Component |
|---|---|---|
| Infrastructure Automation | Declaratively manage GPU drivers, runtimes, and plugins | GPU Operator (ClusterPolicy CRD) |
| Monitoring | Collect GPU state and expose Prometheus metrics | DCGM, DCGM Exporter |
| Partitioning | Share a single GPU across multiple workloads | MIG, Time-Slicing |
| Inference Optimization | Datacenter-scale LLM serving | Dynamo, KAI Scheduler |

This document covers the architecture and design decision criteria for each component. For GPU node provisioning (Karpenter), scaling (KEDA), and cost optimization, see GPU Resource Management.


GPU Operator Architecture

Concept

GPU Operator is an orchestration layer that bundles the entire GPU stack under a single ClusterPolicy CRD. Each component can be independently enabled/disabled, and GPU environments are automatically configured when nodes are added.
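A minimal sketch of what these toggles look like in the ClusterPolicy CR (abridged and illustrative; in practice the CR is usually rendered from the GPU Operator Helm chart values, and most fields are omitted here):

# ClusterPolicy sketch: per-component toggles (illustrative, abridged)
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  driver:
    enabled: true       # set to false when the AMI already ships the driver (see below)
  toolkit:
    enabled: true       # likewise false on AL2023/Bottlerocket
  devicePlugin:
    enabled: true
  dcgmExporter:
    enabled: true
  gfd:
    enabled: true
  migManager:
    enabled: true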

GPU Operator v25.10.1 (as of 2026.03)
| Component | Version | Role |
|---|---|---|
| GPU Operator | v25.10.1 | GPU stack lifecycle management |
| NVIDIA Driver | 580.126.18 | GPU kernel driver |
| DCGM | v4.5.2 | GPU monitoring engine |
| DCGM Exporter | v4.5.2-4.8.1 | Prometheus metrics exposure |
| Device Plugin | v0.19.0 | K8s GPU resource registration |
| GFD | v0.19.0 | GPU node labeling |
| MIG Manager | v0.13.1 | MIG partition auto-management |
| Container Toolkit (CDI) | v1.17.5 | Container GPU runtime |

v25.10.1 Key New Features: Blackwell (B200/GB200) support, HPC Job Mapping, CDMM (Confidential Computing), CDI (Container Device Interface)

Component Structure

Component Roles:

  • Driver DaemonSet: Installs GPU kernel driver on nodes. Set enabled: false for AL2023/Bottlerocket as it's pre-installed in the AMI
  • Container Toolkit (CDI): Injects GPU devices into container runtime. CDI (Container Device Interface)-based for runtime independence
  • Device Plugin: Registers nvidia.com/gpu extended resource with kubelet. Enables kube-scheduler to place GPU Pods
  • GFD (GPU Feature Discovery): Exposes GPU model, driver version, and MIG profiles as node labels. Used for nodeSelector/nodeAffinity (see the example after this list)
  • NFD (Node Feature Discovery): Exposes hardware features (CPU, PCIe, NUMA, etc.) as node labels
  • MIG Manager: Auto-applies MIG profiles based on ConfigMap. Reconfigures on node label changes
  • DCGM Exporter: Exposes DCGM metrics in Prometheus format
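For example, GFD's labels can be used to pin a workload to a specific GPU model. A hedged Pod spec fragment (nvidia.com/gpu.product is a standard GFD label; the product value and image name are illustrative):

# Pod spec fragment: schedule onto A100 nodes via a GFD label
spec:
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB    # illustrative value
  containers:
    - name: inference
      image: registry.example.com/llm-server:latest  # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1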

GPU Operator Configuration per EKS Environment

| Environment | Driver | Toolkit | Device Plugin | MIG | Notes |
|---|---|---|---|---|---|
| EKS Auto Mode | ❌ (AWS auto) | ❌ (AWS auto) | ❌ (disabled via label) |  | DCGM/NFD/GFD work normally |
| Karpenter (Self-Managed) | ❌ (AL2023 AMI) | ❌ (AL2023 AMI) | ✅ | ✅ | Full support |
| Managed Node Group | ❌ (AL2023 AMI) | ❌ (AL2023 AMI) | ✅ | ✅ | Full support |
| Hybrid Node (On-premises) | ✅ (required) | ✅ (required) | ✅ | ✅ | GPU Operator required |

AMI-specific GPU Driver Constraints
  • AL2023 / Bottlerocket: GPU driver pre-installed in AMI. Both driver and toolkit must be enabled: false
  • EKS Auto Mode: AWS auto-manages drivers. Device Plugin disabled via node label nvidia.com/gpu.deploy.device-plugin: "false"

GPU Operator on EKS Auto Mode

On Auto Mode, AWS manages GPU drivers and Device Plugin, but GPU Operator installation is still useful when:

  • DCGM Exporter: GPU metrics collection (Auto Mode itself does not provide DCGM)
  • GFD/NFD: Per-GPU-model node labeling for nodeSelector usage
  • KAI Scheduler: Compatibility with projects depending on ClusterPolicy
# Auto Mode NodePool — Only Device Plugin disabled via label
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-auto-mode
spec:
  template:
    metadata:
      labels:
        nvidia.com/gpu.deploy.device-plugin: "false"
    spec:
      requirements:
        - key: eks.amazonaws.com/instance-family
          operator: In
          values: ["p5", "p4d"]
      nodeClassRef:
        group: eks.amazonaws.com
        kind: NodeClass
        name: default

DCGM Monitoring

Overview

NVIDIA DCGM (Data Center GPU Manager) is a monitoring engine that collects GPU state and exposes metrics to Prometheus. GPU Operator automatically deploys DCGM Exporter as a DaemonSet.

Deployment Method Selection

The DaemonSet deployment that GPU Operator manages by default suits most environments:

| Item | Details |
|---|---|
| Resource Efficiency | 1 instance per node — minimal overhead |
| Management | Auto-managed by GPU Operator |
| Metrics Scope | Collects all GPU metrics on the node |
| Suitable Environment | Production environments (most cases) |

Key GPU Metrics

| Metric | Description | Usage |
|---|---|---|
| DCGM_FI_DEV_GPU_UTIL | GPU core utilization (%) | HPA/KEDA trigger |
| DCGM_FI_DEV_MEM_COPY_UTIL | Memory bandwidth utilization (%) | Memory bottleneck detection |
| DCGM_FI_DEV_FB_USED / FB_FREE | Framebuffer used/free (MB) | OOM prevention, capacity planning |
| DCGM_FI_DEV_POWER_USAGE | Power usage (W) | Cost and thermal management |
| DCGM_FI_DEV_GPU_TEMP | GPU temperature (°C) | Thermal throttling prevention |
| DCGM_FI_DEV_SM_CLOCK | SM clock speed (MHz) | Performance monitoring |
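KEDA itself is covered in GPU Resource Management, but purely to illustrate how a DCGM metric is consumed downstream, here is a hedged ScaledObject sketch that scales an inference Deployment on DCGM_FI_DEV_GPU_UTIL via Prometheus (Deployment name, Prometheus address, namespace label, and threshold are illustrative):

# KEDA ScaledObject sketch: scale an inference Deployment on GPU utilization
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-server-gpu-scaler
spec:
  scaleTargetRef:
    name: llm-server                     # hypothetical Deployment
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090  # adjust to your Prometheus
        query: avg(DCGM_FI_DEV_GPU_UTIL{namespace="llm"})     # pod/namespace labels require DCGM_EXPORTER_KUBERNETES=true
        threshold: "70"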

Prometheus Integration Concept

DCGM Exporter exposes Prometheus-format metrics at the :9400/metrics endpoint. Setting dcgmExporter.serviceMonitor.enabled=true during GPU Operator installation auto-creates the ServiceMonitor.

Collection chain:

GPU Hardware → DCGM Engine → DCGM Exporter (:9400) → Prometheus → Grafana/KEDA

Key design decisions:

  • Collection interval: 15s (default). For LLM serving, 10s recommended
  • Metrics filtering: Control cardinality by collecting only needed metrics via /etc/dcgm-exporter/dcp-metrics-included.csv
  • Pod-GPU mapping: Setting DCGM_EXPORTER_KUBERNETES=true adds pod, namespace, container labels to metrics
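These decisions map to the dcgmExporter section of the GPU Operator Helm values. A hedged sketch (field names follow the chart layout; verify against the chart version in use):

# GPU Operator Helm values sketch: dcgmExporter section
dcgmExporter:
  enabled: true
  serviceMonitor:
    enabled: true
    interval: 10s                        # tighter scrape for LLM serving (default 15s)
  env:
    - name: DCGM_EXPORTER_KUBERNETES
      value: "true"                      # adds pod/namespace/container labels to metrics
    - name: DCGM_EXPORTER_COLLECTORS
      value: /etc/dcgm-exporter/dcp-metrics-included.csv   # restrict collected fields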

GPU Partitioning Strategies

MIG (Multi-Instance GPU)

MIG partitions Ampere/Hopper/Blackwell architecture GPUs (A100, H100, H200, B200) into up to 7 hardware-isolated GPU instances. Each MIG instance has independent memory, cache, and SM (Streaming Multiprocessor), guaranteeing stable performance without inter-workload interference.

MIG Core Value:

  • Hardware isolation: Memory, SM, L2 cache completely separated for QoS guarantee
  • Concurrent execution: Multiple inference workloads run simultaneously without performance degradation
  • GPU Operator auto-management: MIG Manager auto-applies profiles based on ConfigMap

A100 40GB MIG Profiles:

| Profile | Memory | SM Count | Use Case | Expected Throughput |
|---|---|---|---|---|
| 1g.5gb | 5GB | 14 | Small models (3B and below) | ~20 tok/s |
| 1g.10gb | 10GB | 14 | Small models (3B-7B) | ~25 tok/s |
| 2g.10gb | 10GB | 28 | Medium models (7B-13B) | ~50 tok/s |
| 3g.20gb | 20GB | 42 | Medium-large models (13B-30B) | ~100 tok/s |
| 4g.20gb | 20GB | 56 | Large models (13B-30B) | ~130 tok/s |
| 7g.40gb | 40GB | 98 | Full GPU (70B+) | ~200 tok/s |

MIG Profile Management:

GPU Operator's MIG Manager watches node labels (nvidia.com/mig.config) and auto-applies MIG profiles. Define profiles in a ConfigMap, and MIG Manager reconfigures GPUs when node labels change.

# MIG Profile ConfigMap (mig-parted format)
apiVersion: v1
kind: ConfigMap
metadata:
  name: default-mig-parted-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-1g.5gb: # 7 small instances
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.5gb": 7
      mixed-balanced: # Mixed configuration
        - devices: all
          mig-enabled: true
          mig-devices:
            "3g.20gb": 1
            "2g.10gb": 1
            "1g.5gb": 2
      single-7g: # Single large
        - devices: all
          mig-enabled: true
          mig-devices:
            "7g.40gb": 1

Pods request MIG devices using nvidia.com/mig-<profile> resources.

resources:
  requests:
    nvidia.com/mig-1g.5gb: 1
  limits:
    nvidia.com/mig-1g.5gb: 1

Time-Slicing

Time-Slicing shares GPU computing time across multiple Pods based on time division. Unlike MIG, it's available on all NVIDIA GPUs but lacks inter-workload memory isolation.

Configuration:

GPU Operator's ClusterPolicy references a ConfigMap to enable Time-Slicing.

# Time-Slicing ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4 # Each GPU shared by 4 Pods
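The ConfigMap takes effect only once the ClusterPolicy's devicePlugin section references it. A hedged fragment (config.name and config.default are the ClusterPolicy fields used for this; names match the ConfigMap above):

# ClusterPolicy fragment: point the Device Plugin at the Time-Slicing ConfigMap
spec:
  devicePlugin:
    config:
      name: time-slicing-config   # ConfigMap name
      default: any                # key within the ConfigMap data to apply by default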

Pods request nvidia.com/gpu: 1 as usual. On Time-Slicing-enabled nodes, a GPU slice is allocated.

MIG vs Time-Slicing Comparison

| Item | MIG | Time-Slicing |
|---|---|---|
| Isolation level | Hardware isolation (memory, SM, cache) | Software time-sharing (no isolation) |
| Supported GPUs | A100, H100, H200, B200 | All NVIDIA GPUs |
| Max partitions | 7 instances | Unlimited (performance degrades proportionally) |
| Performance predictability | Guaranteed (QoS) | Varies with concurrent workload count |
| Memory safety | OOM does not affect other instances | OOM affects other workloads |
| Suitable environment | Production inference, multi-tenant | Dev/test, batch inference |

Time-Slicing Performance Characteristics
  • Context switching overhead: ~1%, negligible
  • Concurrent execution degradation: 50-100% performance drop as GPU memory and compute are shared
  • No memory isolation: One workload's OOM affects others
  • Suitable: Batch inference, dev/test environments | Not suitable: Real-time inference (SLA), high-performance training

Dynamo: Datacenter-Scale Inference Optimization

Overview

NVIDIA Dynamo is an open-source framework that optimizes datacenter-scale LLM inference. It supports vLLM, SGLang, and TensorRT-LLM as backends, achieving up to 7x performance improvement over baselines.

Dynamo v1.0 GA (2026.03)
  • Serving modes: Aggregated + Disaggregated equally supported
  • Core technologies: Flash Indexer, NIXL, KAI Scheduler, Planner, EPP
  • Deployment: Kubernetes Operator + CRD (DGDR)
  • License: Apache 2.0

Core Architecture

Dynamo supports both Aggregated Serving and Disaggregated Serving. In Disaggregated mode, Prefill (prompt processing) and Decode (token generation) are separated for independent scaling.

Core Components

| Component | Role | Benefit |
|---|---|---|
| Disaggregated Serving | Separate Prefill/Decode workers | Independent per-phase scaling, maximize GPU utilization |
| Flash Indexer | Radix tree-based per-worker KV cache indexing | Prefix matching optimization, maximize KV reuse |
| KVBM | GPU → CPU → SSD 3-tier cache | Maximize memory efficiency, support large-scale contexts |
| NIXL | NVIDIA Inference Transfer Library | Ultra-fast GPU-to-GPU KV Cache transfer (NVLink/RDMA). Shared by Dynamo, llm-d, production-stack, aibrix |
| Planner | SLO-based autoscaling | Profiling → SLO target-based automatic Prefill/Decode scaling |
| EPP | Endpoint Picker Protocol | Native integration with K8s Gateway API |
| AIConfigurator | Auto TP/PP recommendation | Optimal parallelization based on model size, GPU memory, network topology |

llm-d Selection Guide

llm-d and Dynamo both handle LLM inference routing/scheduling and compete at the routing layer, so you choose one.

llm-d:  Client → llm-d Router → vLLM Workers
Dynamo: Client → Dynamo Router → Prefill Workers → (NIXL) → Decode Workers

| Item | llm-d | Dynamo |
|---|---|---|
| Architecture | Aggregated + Disaggregated | Aggregated + Disaggregated (equal support) |
| KV Cache Routing | Prefix-aware | Prefix-aware + Flash Indexer (radix tree) |
| KV Cache Transfer | NIXL | NIXL (NVLink/RDMA) |
| Pod Scheduling | K8s default scheduler | KAI Scheduler (GPU-aware) |
| Autoscaling | HPA/KEDA integration | Planner (SLO-based) + KEDA/HPA |
| Backend | vLLM | vLLM, SGLang, TRT-LLM |
| Complexity | Low — add router to existing vLLM | High — replace entire serving stack |
| Maturity | v0.5+ | v1.0 GA |

| Scenario | Recommendation |
|---|---|
| Add routing to existing vLLM | llm-d |
| Small-medium scale (8 GPUs or less) | llm-d |
| Gateway API-based K8s native | llm-d |
| Large scale (16+ GPUs), maximize throughput | Dynamo |
| Long context (128K+) workloads | Dynamo (3-tier KV cache) |
| Fast adoption, low operational complexity | llm-d |

Migration Path

Starting with llm-d and transitioning to Dynamo as scale grows is practical. Both share the vLLM backend and NIXL KV transfer; the key differences are Dynamo's Flash Indexer, KAI Scheduler, and Planner. Dynamo 1.0 can also integrate llm-d as an internal component, so it is better viewed as a superset than as a direct alternative.


KAI Scheduler

KAI Scheduler is NVIDIA's GPU-aware Kubernetes Pod scheduler. Unlike the default kube-scheduler, it takes GPU topology (NVLink, PCIe) and MIG slices into account and supports Gang Scheduling when determining Pod placement.

Core Features

| Feature | Description |
|---|---|
| GPU Topology Awareness | Minimizes communication cost by recognizing NVLink/PCIe connection structure |
| MIG-aware Scheduling | Recognizes MIG slices as individual scheduling units |
| Gang Scheduling | Guarantees all Pods of a distributed training job are placed simultaneously |
| Fair-share Scheduling | Per-namespace/team GPU quota management |
| Preemption | Priority-based eviction of lower-priority Pods |

Design Considerations

  • ClusterPolicy dependency: KAI Scheduler requires GPU Operator's ClusterPolicy to be installed
  • EKS Auto Mode: KAI Scheduler usable after installing GPU Operator with Device Plugin disabled via label
  • Relationship with kube-scheduler: KAI Scheduler does not replace kube-scheduler; it operates as a Secondary Scheduler delegated only for GPU workloads
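In practice, GPU workloads opt into KAI Scheduler explicitly via schedulerName, while everything else keeps using kube-scheduler. A hedged Pod fragment (the queue label key follows the upstream KAI Scheduler examples; queue name and image are illustrative):

# Pod fragment: delegate this GPU workload to KAI Scheduler
metadata:
  labels:
    kai.scheduler/queue: team-a          # fair-share queue assignment (illustrative)
spec:
  schedulerName: kai-scheduler
  containers:
    - name: trainer
      image: registry.example.com/train-job:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1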
KAI Scheduler ≠ Autoscaling

KAI Scheduler is a scheduler that decides which node to place Pods on. It is separate from autoscaling (KEDA/HPA) that increases Pod count or provisioning (Karpenter) that adds nodes.


References