NVIDIA GPU Stack
The NVIDIA GPU software stack is organized in a layered structure for operating GPUs in Kubernetes environments.
| Layer | Role | Core Component |
|---|---|---|
| Infrastructure Automation | Declaratively manage GPU drivers, runtimes, and plugins | GPU Operator (ClusterPolicy CRD) |
| Monitoring | Collect GPU state and expose Prometheus metrics | DCGM, DCGM Exporter |
| Partitioning | Share a single GPU across multiple workloads | MIG, Time-Slicing |
| Inference Optimization | Datacenter-scale LLM serving | Dynamo, KAI Scheduler |
This document covers the architecture and design decision criteria for each component. For GPU node provisioning (Karpenter), scaling (KEDA), and cost optimization, see GPU Resource Management.
GPU Operator Architecture
Concept
GPU Operator is an orchestration layer that bundles the entire GPU stack under a single ClusterPolicy CRD. Each component can be independently enabled/disabled, and GPU environments are automatically configured when nodes are added.
| Component | Version | Role |
|---|---|---|
| GPU Operator | v25.10.1 | GPU stack lifecycle management |
| NVIDIA Driver | 580.126.18 | GPU kernel driver |
| DCGM | v4.5.2 | GPU monitoring engine |
| DCGM Exporter | v4.5.2-4.8.1 | Prometheus metrics exposure |
| Device Plugin | v0.19.0 | K8s GPU resource registration |
| GFD | v0.19.0 | GPU node labeling |
| MIG Manager | v0.13.1 | MIG partition auto-management |
| Container Toolkit (CDI) | v1.17.5 | Container GPU runtime |
v25.10.1 Key New Features: Blackwell (B200/GB200) support, HPC Job Mapping, CDMM (Confidential Computing), CDI (Container Device Interface)
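Component toggling is expressed directly on the ClusterPolicy spec. The following abbreviated sketch shows the idea; field names follow the GPU Operator CRD, and the values are illustrative rather than a recommended configuration.
# Abbreviated ClusterPolicy sketch: each component carries its own enabled flag
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  driver:
    enabled: true        # false on AMIs that ship the driver (see the EKS table below)
  toolkit:
    enabled: true        # false when the container toolkit is pre-installed
  devicePlugin:
    enabled: true
  gfd:
    enabled: true
  dcgmExporter:
    enabled: true
  migManager:
    enabled: true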
Component Structure
Component Roles:
- Driver DaemonSet: Installs the GPU kernel driver on nodes. Set enabled: false for AL2023/Bottlerocket, as the driver is pre-installed in the AMI
- Container Toolkit (CDI): Injects GPU devices into the container runtime. CDI (Container Device Interface)-based for runtime independence
- Device Plugin: Registers the nvidia.com/gpu extended resource with kubelet, enabling kube-scheduler to place GPU Pods
- GFD (GPU Feature Discovery): Exposes GPU model, driver version, and MIG profiles as node labels. Used for nodeSelector/nodeAffinity (see the example after this list)
- NFD (Node Feature Discovery): Exposes hardware features (CPU, PCIe, NUMA, etc.) as node labels
- MIG Manager: Auto-applies MIG profiles based on ConfigMap. Reconfigures on node label changes
- DCGM Exporter: Exposes DCGM metrics in Prometheus format
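To illustrate how GFD labels are consumed, a Pod can pin itself to a specific GPU model via nodeSelector. The nvidia.com/gpu.product label key is published by GFD; the product string and container image below are illustrative assumptions.
# Pinning a Pod to a GPU model using a GFD-published node label
apiVersion: v1
kind: Pod
metadata:
  name: inference-a100
spec:
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB   # example GFD label value
  containers:
    - name: app
      image: nvcr.io/nvidia/pytorch:24.08-py3        # illustrative image tag
      resources:
        limits:
          nvidia.com/gpu: 1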
GPU Operator Configuration per EKS Environment
| Environment | Driver | Toolkit | Device Plugin | MIG | Notes |
|---|---|---|---|---|---|
| EKS Auto Mode | ❌ (AWS auto) | ❌ (AWS auto) | ❌ (disabled via label) | ❌ | DCGM/NFD/GFD work normally |
| Karpenter (Self-Managed) | ❌ (AL2023 AMI) | ❌ (AL2023 AMI) | ✅ | ✅ | Full support |
| Managed Node Group | ❌ (AL2023 AMI) | ❌ (AL2023 AMI) | ✅ | ✅ | Full support |
| Hybrid Node (On-premises) | ✅ (required) | ✅ (required) | ✅ | ✅ | GPU Operator required |
- AL2023 / Bottlerocket: GPU driver pre-installed in the AMI. Both driver and toolkit must be set to enabled: false (see the values sketch below)
- EKS Auto Mode: AWS auto-manages drivers. Device Plugin disabled via the node label nvidia.com/gpu.deploy.device-plugin: "false"
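A minimal Helm values sketch covering the AL2023 / Managed Node Group rows above (key paths follow the GPU Operator chart; verify against the chart version you deploy):
# values.yaml for GPU Operator on EKS nodes whose AMI already ships driver and toolkit
driver:
  enabled: false       # pre-installed in the AL2023/Bottlerocket GPU AMI
toolkit:
  enabled: false       # pre-installed in the AL2023/Bottlerocket GPU AMI
devicePlugin:
  enabled: true
dcgmExporter:
  enabled: true
  serviceMonitor:
    enabled: true      # requires Prometheus Operator CRDs in the cluster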
GPU Operator on EKS Auto Mode
On Auto Mode, AWS manages the GPU driver and Device Plugin, but installing GPU Operator is still useful for:
- DCGM Exporter: GPU metrics collection (Auto Mode itself does not provide DCGM)
- GFD/NFD: Per-GPU-model node labeling for nodeSelector usage
- KAI Scheduler: Compatibility with projects depending on ClusterPolicy
# Auto Mode NodePool — Only Device Plugin disabled via label
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: gpu-auto-mode
spec:
template:
metadata:
labels:
nvidia.com/gpu.deploy.device-plugin: "false"
spec:
requirements:
- key: eks.amazonaws.com/instance-family
operator: In
values: ["p5", "p4d"]
nodeClassRef:
group: eks.amazonaws.com
kind: NodeClass
name: default
DCGM Monitoring
Overview
NVIDIA DCGM (Data Center GPU Manager) is a monitoring engine that collects GPU state and exposes metrics to Prometheus. GPU Operator automatically deploys DCGM Exporter as a DaemonSet.
Deployment Method Selection
DaemonSet (Recommended):
| Item | Details |
|---|---|
| Resource Efficiency | 1 instance per node — minimal overhead |
| Management | Auto-managed by GPU Operator |
| Metrics Scope | Collects all GPU metrics on the node |
| Suitable Environment | Production environments (most cases) |
Sidecar (Special Use):
| Item | Details |
|---|---|
| Resource Efficiency | 1 instance per Pod — higher overhead |
| Metrics Scope | Collects only the Pod's GPU metrics |
| Suitable Environment | Multi-tenant billing, when per-Pod isolation is needed |
Sidecar Containers (init containers with restartPolicy: Always), stable since K8s 1.33, allow the exporter to run alongside the main container for the full Pod lifecycle.
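A minimal sketch of the sidecar pattern, assuming the native sidecar mechanism (init container with restartPolicy: Always) and the public dcgm-exporter image. Additional device/IPC access configuration is typically required for the exporter to see the GPU and is omitted here.
# Sidecar-style DCGM Exporter sketch (image tag is illustrative; GPU/IPC access omitted)
apiVersion: v1
kind: Pod
metadata:
  name: llm-serving
spec:
  initContainers:
    - name: dcgm-exporter
      image: nvcr.io/nvidia/k8s/dcgm-exporter:4.5.2-4.8.1-ubuntu22.04
      restartPolicy: Always            # native sidecar: runs for the full Pod lifetime
      env:
        - name: DCGM_EXPORTER_KUBERNETES
          value: "true"                # adds pod/namespace/container labels to metrics
      ports:
        - name: metrics
          containerPort: 9400
  containers:
    - name: vllm
      image: vllm/vllm-openai:latest
      resources:
        limits:
          nvidia.com/gpu: 1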
Key GPU Metrics
| Metric | Description | Usage |
|---|---|---|
| DCGM_FI_DEV_GPU_UTIL | GPU core utilization (%) | HPA/KEDA trigger |
| DCGM_FI_DEV_MEM_COPY_UTIL | Memory bandwidth utilization (%) | Memory bottleneck detection |
| DCGM_FI_DEV_FB_USED / FB_FREE | Framebuffer used/free (MB) | OOM prevention, capacity planning |
| DCGM_FI_DEV_POWER_USAGE | Power usage (W) | Cost and thermal management |
| DCGM_FI_DEV_GPU_TEMP | GPU temperature (°C) | Thermal throttling prevention |
| DCGM_FI_DEV_SM_CLOCK | SM clock speed (MHz) | Performance monitoring |
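As an example of putting these metrics to work, the sketch below defines a Prometheus alert on framebuffer usage; the API group is the Prometheus Operator's, and the threshold is an illustrative assumption rather than a recommended value.
# Alert on DCGM framebuffer metrics (threshold is illustrative)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-alerts
  namespace: monitoring
spec:
  groups:
    - name: gpu
      rules:
        - alert: GpuFramebufferNearlyFull
          expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "GPU framebuffer above 90% used"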
Prometheus Integration Concept
DCGM Exporter exposes Prometheus-format metrics at the :9400/metrics endpoint. Setting dcgmExporter.serviceMonitor.enabled=true during GPU Operator installation auto-creates the ServiceMonitor.
Collection chain:
GPU Hardware → DCGM Engine → DCGM Exporter (:9400) → Prometheus → Grafana/KEDA
Key design decisions:
- Collection interval: 15s (default). For LLM serving, 10s recommended
- Metrics filtering: Control cardinality by collecting only needed metrics via /etc/dcgm-exporter/dcp-metrics-included.csv
- Pod-GPU mapping: Setting DCGM_EXPORTER_KUBERNETES=true adds pod, namespace, container labels to metrics
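These decisions map onto GPU Operator chart values roughly as follows; the key paths reflect the chart's dcgmExporter section and should be verified against the chart version in use.
# GPU Operator values sketch for the decisions above
dcgmExporter:
  serviceMonitor:
    enabled: true
    interval: 10s          # tighter than the 15s default, for LLM serving
  env:
    - name: DCGM_EXPORTER_KUBERNETES
      value: "true"        # adds pod/namespace/container labels for Pod-GPU mapping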
GPU Partitioning Strategies
MIG (Multi-Instance GPU)
MIG partitions Ampere/Hopper/Blackwell architecture GPUs (A100, H100, H200, B200) into up to 7 hardware-isolated GPU instances. Each MIG instance has independent memory, cache, and SM (Streaming Multiprocessor), guaranteeing stable performance without inter-workload interference.
MIG Core Value:
- Hardware isolation: Memory, SM, L2 cache completely separated for QoS guarantee
- Concurrent execution: Multiple inference workloads run simultaneously without performance degradation
- GPU Operator auto-management: MIG Manager auto-applies profiles based on ConfigMap
A100 40GB MIG Profiles:
| Profile | Memory | SM Count | Use Case | Expected Throughput |
|---|---|---|---|---|
| 1g.5gb | 5GB | 14 | Small models (3B and below) | ~20 tok/s |
| 1g.10gb | 10GB | 14 | Small models (3B-7B) | ~25 tok/s |
| 2g.10gb | 10GB | 28 | Medium models (7B-13B) | ~50 tok/s |
| 3g.20gb | 20GB | 42 | Medium-large models (13B-30B) | ~100 tok/s |
| 4g.20gb | 20GB | 56 | Large models (13B-30B) | ~130 tok/s |
| 7g.40gb | 40GB | 84 | Full GPU (70B+) | ~200 tok/s |
MIG Profile Management:
GPU Operator's MIG Manager watches node labels (nvidia.com/mig.config) and auto-applies MIG profiles. Define profiles in a ConfigMap, and MIG Manager reconfigures GPUs when node labels change.
# MIG Profile ConfigMap (mig-parted format)
apiVersion: v1
kind: ConfigMap
metadata:
name: default-mig-parted-config
namespace: gpu-operator
data:
config.yaml: |
version: v1
mig-configs:
all-1g.5gb: # 7 small instances
- devices: all
mig-enabled: true
mig-devices:
"1g.5gb": 7
mixed-balanced: # Mixed configuration
- devices: all
mig-enabled: true
mig-devices:
"3g.20gb": 1
"2g.10gb": 1
"1g.5gb": 2
single-7g: # Single large
- devices: all
mig-enabled: true
mig-devices:
"7g.40gb": 1
Pods request MIG devices using nvidia.com/mig-<profile> resources.
resources:
requests:
nvidia.com/mig-1g.5gb: 1
limits:
nvidia.com/mig-1g.5gb: 1
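To have MIG Manager apply one of the layouts above to newly provisioned nodes, the nvidia.com/mig.config label can be set at provisioning time, for example on a Karpenter NodePool. The sketch below mirrors the Auto Mode example earlier; names and the instance family are illustrative.
# Karpenter NodePool labeling GPU nodes with the desired MIG layout
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-mig
spec:
  template:
    metadata:
      labels:
        nvidia.com/mig.config: all-1g.5gb   # MIG Manager reconfigures the node to this layout
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["p4d"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default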
Time-Slicing
Time-Slicing shares GPU computing time across multiple Pods based on time division. Unlike MIG, it's available on all NVIDIA GPUs but lacks inter-workload memory isolation.
Configuration:
GPU Operator's ClusterPolicy references a ConfigMap to enable Time-Slicing.
# Time-Slicing ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
name: time-slicing-config
namespace: gpu-operator
data:
any: |-
version: v1
sharing:
timeSlicing:
resources:
- name: nvidia.com/gpu
replicas: 4 # Each GPU shared by 4 Pods
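For this ConfigMap to take effect, the Device Plugin must be pointed at it from the ClusterPolicy (or the equivalent Helm values). A sketch of the relevant fields:
# ClusterPolicy excerpt wiring the Device Plugin to the Time-Slicing ConfigMap
devicePlugin:
  config:
    name: time-slicing-config   # ConfigMap defined above (gpu-operator namespace)
    default: any                # key within the ConfigMap applied by default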
Pods request nvidia.com/gpu: 1 as usual. On Time-Slicing-enabled nodes, a GPU slice is allocated.
MIG vs Time-Slicing Comparison
| Item | MIG | Time-Slicing |
|---|---|---|
| Isolation level | Hardware isolation (memory, SM, cache) | Software time-sharing (no isolation) |
| Supported GPUs | A100, H100, H200, B200 | All NVIDIA GPUs |
| Max partitions | 7 instances | Unlimited (performance degrades proportionally) |
| Performance predictability | Guaranteed (QoS) | Varies with concurrent workload count |
| Memory safety | OOM does not affect other instances | OOM affects other workloads |
| Suitable environment | Production inference, multi-tenant | Dev/test, batch inference |
Time-Slicing caveats:
- Context switching overhead: around 1%, negligible
- Concurrent execution degradation: per-workload performance can drop 50-100% since GPU memory and compute are shared
- No memory isolation: one workload's OOM affects the others
- Suitable: batch inference, dev/test environments
- Not suitable: real-time inference (SLA-bound), high-performance training
Dynamo: Datacenter-Scale Inference Optimization
Overview
NVIDIA Dynamo is an open-source framework that optimizes datacenter-scale LLM inference. It supports vLLM, SGLang, and TensorRT-LLM as backends, achieving up to 7x performance improvement over baselines.
- Serving modes: Aggregated + Disaggregated equally supported
- Core technologies: Flash Indexer, NIXL, KAI Scheduler, Planner, EPP
- Deployment: Kubernetes Operator + CRD (DGDR)
- License: Apache 2.0
Core Architecture
Dynamo supports both Aggregated Serving and Disaggregated Serving. In Disaggregated mode, Prefill (prompt processing) and Decode (token generation) are separated for independent scaling.
Core Components
| Component | Role | Benefit |
|---|---|---|
| Disaggregated Serving | Separate Prefill/Decode workers | Independent per-phase scaling, maximize GPU utilization |
| Flash Indexer | Radix tree-based per-worker KV cache indexing | Prefix matching optimization, maximize KV reuse |
| KVBM | GPU → CPU → SSD 3-tier cache | Maximize memory efficiency, support large-scale contexts |
| NIXL | NVIDIA Inference Transfer Library | Ultra-fast GPU-to-GPU KV Cache transfer (NVLink/RDMA). Shared by Dynamo, llm-d, production-stack, aibrix |
| Planner | SLO-based autoscaling | Profiling → SLO target-based automatic Prefill/Decode scaling |
| EPP | Endpoint Picker Protocol | Native integration with K8s Gateway API |
| AIConfigurator | Auto TP/PP recommendation | Optimal parallelization based on model size, GPU memory, network topology |
llm-d Selection Guide
llm-d and Dynamo both handle LLM inference routing/scheduling and compete at the routing layer, so you choose one.
llm-d: Client → llm-d Router → vLLM Workers
Dynamo: Client → Dynamo Router → Prefill Workers → (NIXL) → Decode Workers
| Item | llm-d | Dynamo |
|---|---|---|
| Architecture | Aggregated + Disaggregated | Aggregated + Disaggregated (equal support) |
| KV Cache Routing | Prefix-aware | Prefix-aware + Flash Indexer (radix tree) |
| KV Cache Transfer | NIXL | NIXL (NVLink/RDMA) |
| Pod Scheduling | K8s default scheduler | KAI Scheduler (GPU-aware) |
| Autoscaling | HPA/KEDA integration | Planner (SLO-based) + KEDA/HPA |
| Backend | vLLM | vLLM, SGLang, TRT-LLM |
| Complexity | Low — add router to existing vLLM | High — replace entire serving stack |
| Maturity | v0.5+ | v1.0 GA |
| Scenario | Recommendation |
|---|---|
| Add routing to existing vLLM | llm-d |
| Small-medium scale (8 GPUs or less) | llm-d |
| Gateway API-based K8s native | llm-d |
| Large scale (16+ GPUs), maximize throughput | Dynamo |
| Long context (128K+) workloads | Dynamo (3-tier KV cache) |
| Fast adoption, low operational complexity | llm-d |
Starting with llm-d and transitioning to Dynamo as scale grows is a practical path. Both share the vLLM backend and NIXL KV transfer; the key differences are Dynamo's Flash Indexer, KAI Scheduler, and Planner. Dynamo 1.0 can integrate llm-d as an internal component, so it is better viewed as a superset than as a mutually exclusive alternative.
KAI Scheduler
KAI Scheduler is NVIDIA's GPU-aware Kubernetes Pod scheduler. Unlike the default kube-scheduler, it understands GPU topology (NVLink, PCIe) and MIG slices and supports Gang Scheduling, which lets it determine optimal Pod placement.
Core Features
| Feature | Description |
|---|---|
| GPU Topology Awareness | Minimizes communication cost by recognizing NVLink/PCIe connection structure |
| MIG-aware Scheduling | Recognizes MIG slices as individual scheduling units |
| Gang Scheduling | Guarantees all Pods are placed simultaneously in distributed training |
| Fair-share Scheduling | Per-namespace/team GPU quota management |
| Preemption | Priority-based Pod replacement |
Design Considerations
- ClusterPolicy dependency: KAI Scheduler requires GPU Operator's ClusterPolicy to be installed
- EKS Auto Mode: KAI Scheduler usable after installing GPU Operator with Device Plugin disabled via label
- Relationship with kube-scheduler: KAI Scheduler does not replace kube-scheduler; it operates as a Secondary Scheduler delegated only for GPU workloads
KAI Scheduler is a scheduler that decides which node to place Pods on. It is separate from autoscaling (KEDA/HPA) that increases Pod count or provisioning (Karpenter) that adds nodes.
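A minimal sketch of delegating a single GPU Pod to KAI Scheduler, assuming the scheduler is registered under the name kai-scheduler and that queues are selected via the kai.scheduler/queue label; verify both against the KAI Scheduler release you deploy.
# GPU Pod delegated to KAI Scheduler (scheduler name and queue label are assumptions)
apiVersion: v1
kind: Pod
metadata:
  name: training-worker
  labels:
    kai.scheduler/queue: team-a        # fair-share queue; name is illustrative
spec:
  schedulerName: kai-scheduler         # only this Pod bypasses the default kube-scheduler
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:24.08-py3   # illustrative image tag
      resources:
        limits:
          nvidia.com/gpu: 1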
Related Documents
- GPU Resource Management — Karpenter, KEDA, DRA, cost optimization
- EKS GPU Node Strategy — Auto Mode + Karpenter + Hybrid Node configuration
- vLLM Model Serving — vLLM-based inference engine
- llm-d EKS Auto Mode — llm-d detailed architecture