NVIDIA GPU Software Stack

📅 Created: 2026-03-20 | Updated: 2026-03-20 | ⏱️ Reading Time: ~10 minutes

Overview

The NVIDIA GPU software stack is structured in three layers for efficient GPU operations in Kubernetes environments. GPU Operator (driver and infrastructure automation) connects GPUs to Kubernetes, DCGM (Data Center GPU Manager) monitors GPU status, and Run:ai handles GPU orchestration at the top layer. This document covers the configuration and operation of each layer, along with MIG/Time-Slicing partitioning strategies and the NVIDIA Dynamo distributed inference framework.

GPU Operator Architecture

GPU Operator Latest Version (v25.10.1, as of March 2026)

Component	Version	Role
GPU Operator	v25.10.1	Full GPU stack lifecycle management
NVIDIA Driver	580.126.18	GPU kernel driver
DCGM	v4.5.2	GPU monitoring engine
DCGM Exporter	v4.5.2-4.8.1	Prometheus metrics exposure
Device Plugin	v0.19.0	K8s GPU resource registration
GFD (GPU Feature Discovery)	v0.19.0	GPU node labeling
MIG Manager	v0.13.1	MIG partition auto-management
Container Toolkit (CDI)	v1.17.5	Container GPU runtime

v25.10.1 Key Features:

Blackwell Architecture Support: Full support for B200/GB200 GPUs
HPC Job Mapping: GPU job-level metrics collection and accounting
CDMM (Coherent Driver-based Memory Management): Driver-managed GPU memory on hardware-coherent platforms such as GH200/GB200
CDI (Container Device Interface): Container runtime-independent device management

3-Layer Architecture

Role of each layer:

GPU Operator (Orchestrator): Orchestration layer that bundles the entire GPU stack through the ClusterPolicy CRD. Each component (Driver, Container Toolkit, Device Plugin, DCGM Exporter, NFD, GFD, MIG Manager) can be enabled or disabled independently. It can also be installed on EKS Auto Mode: only the Device Plugin is disabled via node labels, while the other components (DCGM Exporter, NFD, GFD, and so on) operate normally.
DCGM (Sensor): Monitoring engine that reads GPU status. It collects SM utilization, Tensor Core activity, memory, power, temperature, ECC errors, and related metrics.
Run:ai (Control Tower): Scheduling and management layer that runs on top of GPU Operator and DCGM. It provides Fractional GPU, Dynamic MIG, Gang Scheduling, and quota management.

Dependencies

Combination	Possible	Use Case
GPU Operator Only	Yes	Basic GPU inference, manual MIG setup, DCGM metrics
GPU Operator + Run:ai	Yes	Enterprise GPU cluster management (recommended)
DCGM Only (manual driver install)	Yes	Bare metal, single server monitoring
Run:ai Only (without GPU Operator)	No	GPU Operator ClusterPolicy is a required dependency
EKS Auto Mode + Run:ai	Yes	Install GPU Operator, disable Device Plugin via label

GPU Management by EKS Environment

Node Type	GPU Driver	GPU Operator	MIG Support	Run:ai Support
Auto Mode	AWS auto-install	Installable (Device Plugin disabled via label)	Not supported	Supported (Device Plugin disabled via label)
Karpenter (Self-Managed)	GPU Operator install	Full support	Full support	Full support
Managed Node Group	GPU Operator install	Full support	Full support	Full support
Hybrid Node (on-premises)	GPU Operator required	Required	Full support	Full support

Node Strategy Detailed Guide

For detailed information on hybrid configuration of EKS Auto Mode and Karpenter, GPU Operator installation methods, and Hybrid Node GPU farm setup, refer to EKS GPU Node Strategy.

GPU Operator Component Details

GPU Operator manages the entire GPU stack declaratively through the ClusterPolicy CRD.

GPU Driver Constraints by AMI

AL2023 / Bottlerocket: GPU drivers are pre-installed in the AMI, so GPU Operator's driver component must be set to enabled: false.
AL2 (Custom AMI): GPU Operator can install drivers directly.
EKS Auto Mode: AWS manages drivers automatically, so both driver and toolkit must be set to enabled: false.
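
The AMI rules above can be captured in a small lookup. The following Python sketch is illustrative only: the `helm_set_flags` function and the node-type keys are hypothetical names, and it emits only the `--set` overrides that this section states explicitly, leaving every other value at the chart default.

```python
# Hypothetical helper: maps a node/AMI type to the GPU Operator Helm
# overrides described in this section. The keys are illustrative names,
# not identifiers used by EKS or by the GPU Operator chart.
DRIVER_RULES = {
    "al2023": {"driver.enabled": False},          # driver pre-installed in AMI
    "bottlerocket": {"driver.enabled": False},    # driver pre-installed in AMI
    "al2-custom": {"driver.enabled": True},       # operator installs the driver
    "eks-auto-mode": {"driver.enabled": False, "toolkit.enabled": False},
}


def helm_set_flags(node_type: str) -> list[str]:
    """Return the --set arguments for the given node type."""
    try:
        overrides = DRIVER_RULES[node_type]
    except KeyError:
        raise ValueError(f"unknown node type: {node_type!r}") from None
    return [f"--set {key}={str(value).lower()}" for key, value in overrides.items()]


if __name__ == "__main__":
    for node_type in DRIVER_RULES:
        print(node_type, " ".join(helm_set_flags(node_type)))
```

Keeping the rules in one table makes it harder for a chart upgrade or a new node type to drift from the constraints listed above.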

Helm Installation Example (Karpenter + Self-Managed):

# Add NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install GPU Operator (AL2023/Bottlerocket — disable driver/toolkit)
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set driver.enabled=false \
  --set toolkit.enabled=false \
  --set dcgmExporter.serviceMonitor.enabled=true \
  --set migManager.enabled=true \
  --set gfd.enabled=true \
  --set nfd.enabled=true

ClusterPolicy CRD Example:

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  operator:
    defaultRuntime: containerd
  driver:
    enabled: false          # AL2023/Bottlerocket: pre-installed in AMI
  toolkit:
    enabled: false          # AL2023/Bottlerocket: pre-installed in AMI
  devicePlugin:
    enabled: true
  dcgmExporter:
    enabled: true
    version: "4.5.2-4.8.1"
    serviceMonitor:
      enabled: true
  migManager:
    enabled: true
    config:
      name: default-mig-parted-config
  gfd:
    enabled: true
  nfd:
    enabled: true
  nodeStatusExporter:
    enabled: false

EKS Auto Mode NodePool Labels (Device Plugin Disabled):

On Auto Mode, install GPU Operator but disable only the Device Plugin via node labels. GPU Operator installation is required for projects like KAI Scheduler that depend on ClusterPolicy.

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-auto-mode
spec:
  template:
    metadata:
      labels:
        nvidia.com/gpu.deploy.device-plugin: "false"  # Disable Device Plugin
    spec:
      requirements:
        - key: eks.amazonaws.com/instance-family
          operator: In
          values: ["p5", "p4d"]
      nodeClassRef:
        group: eks.amazonaws.com
        kind: NodeClass
        name: default

GPU Operator Configuration by EKS Environment

Environment	GPU Operator	Driver Management	MIG	Limitations
EKS Auto Mode	Installable (Device Plugin disabled)	AWS automatic (AMI pre-installed)	Not supported	Device Plugin disabled via label, DCGM/NFD/GFD operate normally
EKS + Karpenter	Helm install	Operator managed	Full support	GPU AMI required in NodePool
EKS Managed Node Group	Helm install	Operator managed	Full support	Node group-level management
EKS Hybrid Nodes	Helm install (required)	Operator required	Full support	On-premises GPU farm, network setup required

DCGM Monitoring

NVIDIA DCGM (Data Center GPU Manager) is a core component that monitors GPU status and exposes metrics to Prometheus.

Deployment Method Selection

DCGM Exporter can be deployed as a DaemonSet or as a sidecar. The DaemonSet is recommended for most production environments.

DaemonSet (Recommended)

Item	Description
Resource Efficiency	1 instance per node -- minimal overhead
Management	Centralized, auto-managed by GPU Operator
Metrics Scope	Collects all GPU metrics on the node
Security	Only DaemonSet needs `SYS_ADMIN`
Suitable Environment	Production environments (most cases)

Sidecar (Special Purpose)

Item	Description
Resource Efficiency	1 instance per Pod -- high overhead
Management	Included in Pod spec, individual management
Metrics Scope	Collects only that Pod's GPU metrics
Security	All GPU Pods need `SYS_ADMIN`
Suitable Environment	Multi-tenant isolation, per-Pod billing tracking

Valid Sidecar Scenarios:

Multi-tenant billing: Need to precisely track GPU usage by tenant at Pod level
Cannot install DaemonSet: Environments with limited node access like EKS Auto Mode
Pod isolation: Need to independently monitor only specific Pod's GPU metrics
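
To make the overhead difference between the two deployment methods concrete, here is a rough Python sketch that counts exporter instances and their total resource requests. The function name `exporter_footprint` is hypothetical, the default per-exporter requests (100m CPU, 128Mi memory) are taken from the sidecar example in this document, and applying the same requests to the DaemonSet exporter is an assumption made only for comparison.

```python
def exporter_footprint(nodes: int, gpu_pods_per_node: int,
                       cpu_millicores: int = 100, memory_mib: int = 128) -> dict:
    """Compare total DCGM Exporter requests: DaemonSet vs sidecar."""
    if nodes < 0 or gpu_pods_per_node < 0:
        raise ValueError("counts must be non-negative")

    def totals(instances: int) -> dict:
        return {
            "instances": instances,
            "cpu_millicores": instances * cpu_millicores,
            "memory_mib": instances * memory_mib,
        }

    return {
        "daemonset": totals(nodes),                     # one exporter per node
        "sidecar": totals(nodes * gpu_pods_per_node),   # one exporter per GPU Pod
    }


if __name__ == "__main__":
    # Illustrative cluster: 10 GPU nodes, 4 GPU Pods on each node.
    for mode, numbers in exporter_footprint(nodes=10, gpu_pods_per_node=4).items():
        print(mode, numbers)
```

In this illustrative cluster the sidecar approach runs four times as many exporters, and each of them needs `SYS_ADMIN`, which is why it is reserved for the special cases listed above.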

DaemonSet Deployment (Recommended)

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: gpu-monitoring
  labels:
    app: dcgm-exporter
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: dcgm-exporter
          image: nvcr.io/nvidia/k8s/dcgm-exporter:4.5.2-4.8.1-ubuntu22.04
          ports:
            - name: metrics
              containerPort: 9400
          env:
            - name: DCGM_EXPORTER_LISTEN
              value: ":9400"
            - name: DCGM_EXPORTER_KUBERNETES
              value: "true"
            - name: DCGM_EXPORTER_COLLECTORS
              value: "/etc/dcgm-exporter/dcp-metrics-included.csv"
          volumeMounts:
            - name: pod-resources
              mountPath: /var/lib/kubelet/pod-resources
              readOnly: true
          securityContext:
            runAsNonRoot: false
            runAsUser: 0
            capabilities:
              add: ["SYS_ADMIN"]
      volumes:
        - name: pod-resources
          hostPath:
            path: /var/lib/kubelet/pod-resources

DCGM Exporter 3.3+ Features

DCGM Exporter 3.3+ provides the following enhanced features:

H100/H200 Support: Latest GPU metrics collection
Enhanced Metrics: More granular GPU status monitoring
Performance Improvements: Metrics collection with lower overhead

Sidecar Deployment (Special Purpose)

Use native sidecar containers, which are stable as of Kubernetes 1.33, to collect GPU metrics at the Pod level.

apiVersion: v1
kind: Pod
metadata:
  name: vllm-with-monitoring
  namespace: ai-inference
spec:
  initContainers:
    # DCGM Exporter running as Sidecar (special purpose)
    - name: dcgm-sidecar
      image: nvcr.io/nvidia/k8s/dcgm-exporter:4.5.2-4.8.1-ubuntu22.04
      restartPolicy: Always  # K8s 1.33+ Sidecar feature
      ports:
        - name: metrics
          containerPort: 9400
      securityContext:
        capabilities:
          add: ["SYS_ADMIN"]
      resources:
        requests:
          cpu: "100m"
          memory: "128Mi"
        limits:
          cpu: "200m"
          memory: "256Mi"
  containers:
    - name: vllm
      image: vllm/vllm-openai:latest
      resources:
        requests:
          nvidia.com/gpu: 2
        limits:
          nvidia.com/gpu: 2

Key GPU Metrics

Core metrics collected by DCGM Exporter.

Metric Name	Description	Scaling Usage
DCGM_FI_DEV_GPU_UTIL	GPU core utilization (%)	HPA trigger criterion
DCGM_FI_DEV_MEM_COPY_UTIL	Memory bandwidth utilization (%)	Memory bottleneck detection
DCGM_FI_DEV_FB_USED	Framebuffer usage (MB)	OOM prevention
DCGM_FI_DEV_FB_FREE	Framebuffer available (MB)	Capacity planning
DCGM_FI_DEV_POWER_USAGE	Power usage (W)	Cost monitoring
DCGM_FI_DEV_SM_CLOCK	SM clock speed (MHz)	Performance monitoring
DCGM_FI_DEV_GPU_TEMP	GPU temperature (C)	Thermal management
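
When `DCGM_FI_DEV_GPU_UTIL` is used as an HPA trigger, the replica count follows the standard Kubernetes HPA rule: desired = ceil(current × currentMetric / targetMetric), with no change while the ratio stays within a tolerance (0.1 by default). The Python sketch below reproduces that arithmetic so thresholds can be sanity-checked offline; the function name and the example utilization values are illustrative assumptions, not values from this document.

```python
import math


def desired_replicas(current_replicas: int, current_util: float, target_util: float,
                     min_replicas: int = 1, max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Apply the Kubernetes HPA scaling rule to an average GPU utilization."""
    if target_util <= 0:
        raise ValueError("target_util must be positive")
    ratio = current_util / target_util
    # Inside the tolerance band the HPA leaves the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 4 replicas averaging 90% GPU utilization against a 60% target.
    print(desired_replicas(4, current_util=90, target_util=60))
```

GPU utilization alone can mislead for LLM serving, so this calculation is best read as a starting point to combine with the memory metrics in the table.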

Prometheus ServiceMonitor

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: gpu-monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
  namespaceSelector:
    matchNames:
      - gpu-monitoring
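
Once the exporter is scraped, the `/metrics` payload is plain Prometheus text. The Python sketch below parses such a payload and derives a per-GPU framebuffer usage ratio from `DCGM_FI_DEV_FB_USED` and `DCGM_FI_DEV_FB_FREE`. The sample payload, its label set, and the `framebuffer_usage` function are hypothetical; real output carries more labels, and used + free ignores any reserved memory.

```python
import re
from collections import defaultdict

# Hypothetical sample of exporter output; real payloads contain more labels.
SAMPLE = """\
# TYPE DCGM_FI_DEV_FB_USED gauge
DCGM_FI_DEV_FB_USED{gpu="0",Hostname="gpu-node-01"} 30000
DCGM_FI_DEV_FB_FREE{gpu="0",Hostname="gpu-node-01"} 10000
DCGM_FI_DEV_FB_USED{gpu="1",Hostname="gpu-node-01"} 8000
DCGM_FI_DEV_FB_FREE{gpu="1",Hostname="gpu-node-01"} 32000
"""

LINE_RE = re.compile(r'^(?P<name>\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
GPU_RE = re.compile(r'gpu="(?P<gpu>[^"]+)"')


def framebuffer_usage(text: str) -> dict[str, float]:
    """Return {gpu index: used / (used + free)} from Prometheus text format."""
    raw: dict[str, dict[str, float]] = defaultdict(dict)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue
        gpu = GPU_RE.search(match["labels"])
        if gpu is None:
            continue
        raw[gpu["gpu"]][match["name"]] = float(match["value"])

    usage = {}
    for gpu_id, metrics in raw.items():
        used = metrics.get("DCGM_FI_DEV_FB_USED")
        free = metrics.get("DCGM_FI_DEV_FB_FREE")
        if used is None or free is None or used + free == 0:
            continue
        usage[gpu_id] = used / (used + free)
    return usage


if __name__ == "__main__":
    for gpu_id, ratio in framebuffer_usage(SAMPLE).items():
        print(f"GPU {gpu_id}: {ratio:.0%} framebuffer in use")
```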

GPU Partitioning Strategies

MIG (Multi-Instance GPU) Based Partitioning

MIG divides H100, A100, H200, and other Ampere/Hopper/Blackwell architecture GPUs into up to 7 independent GPU instances. Each MIG instance has isolated memory, cache, and SM (Streaming Multiprocessor), ensuring stable performance without workload interference.

GPU Operator's MIG Manager applies MIG profiles automatically based on a ConfigMap.

# MIG Profile ConfigMap (mig-parted format)
apiVersion: v1
kind: ConfigMap
metadata:
  name: default-mig-parted-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      # 7 small instances: multi-serving of small models
      all-1g.5gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.5gb": 7

      # Mixed configuration: simultaneous large + small operation
      mixed-balanced:
        - devices: all
          mig-enabled: true
          mig-devices:
            "3g.20gb": 1
            "2g.10gb": 1
            "1g.5gb": 2

      # Single full-GPU instance: largest model that fits in 40GB
      single-7g:
        - devices: all
          mig-enabled: true
          mig-devices:
            "7g.40gb": 1

---

# Apply MIG profile (select by node label)
# kubectl label node gpu-node-01 nvidia.com/mig.config=mixed-balanced

---

# Use MIG device in Pod
apiVersion: v1
kind: Pod
metadata:
  name: vllm-mig-inference
  namespace: ai-inference
spec:
  containers:
    - name: vllm
      image: vllm/vllm-openai:latest
      command: ["python", "-m", "vllm.entrypoints.openai.api_server"]
      args:
        - "--model"
        - "meta-llama/Llama-2-7b-hf"
        - "--gpu-memory-utilization"
        - "0.9"
      resources:
        requests:
          memory: "4Gi"
          cpu: "4"
          # MIG resource names require the mixed MIG strategy.
          # 7B FP16 weights need ~14GB, so a 1g.5gb slice is too small.
          nvidia.com/mig-3g.20gb: 1
        limits:
          nvidia.com/mig-3g.20gb: 1

A100 40GB MIG Profiles:

Profile	Memory	SM Count	Use Case	Expected Throughput
1g.5gb	5GB	14	Small models (3B or less)	~20 tok/s
1g.10gb	10GB	14	Small models (3B-7B)	~25 tok/s
2g.10gb	10GB	28	Medium models (7B-13B)	~50 tok/s
3g.20gb	20GB	42	Medium-large models (13B-30B)	~100 tok/s
4g.20gb	20GB	56	Large models (13B-30B)	~130 tok/s
7g.40gb	40GB	98	Largest models that fit in 40GB	~200 tok/s
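
A quick way to sanity-check a `mig-devices` entry before applying it is to add up the compute slices and memory encoded in the profile names. The Python sketch below does this for an A100 40GB (7 compute slices, 40GB). The function names are hypothetical, and the check is necessary but not sufficient: real MIG placement rules are stricter than a simple sum, so a configuration that passes here can still be rejected by the driver.

```python
import re

PROFILE_RE = re.compile(r"^(?P<slices>\d+)g\.(?P<mem>\d+)gb$")


def parse_profile(name: str) -> tuple[int, int]:
    """Split a profile such as '3g.20gb' into (compute slices, memory in GB)."""
    match = PROFILE_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized MIG profile: {name!r}")
    return int(match["slices"]), int(match["mem"])


def check_mig_config(mig_devices: dict[str, int],
                     max_slices: int = 7, max_memory_gb: int = 40) -> dict:
    """Sum slices and memory for a mig-devices mapping and compare to the GPU."""
    slices = 0
    memory = 0
    for profile, count in mig_devices.items():
        profile_slices, profile_mem = parse_profile(profile)
        slices += profile_slices * count
        memory += profile_mem * count
    return {
        "compute_slices": slices,
        "memory_gb": memory,
        "fits": slices <= max_slices and memory <= max_memory_gb,
    }


if __name__ == "__main__":
    configs = {
        "all-1g.5gb": {"1g.5gb": 7},
        "mixed-balanced": {"3g.20gb": 1, "2g.10gb": 1, "1g.5gb": 2},
        "single-7g": {"7g.40gb": 1},
    }
    for name, devices in configs.items():
        print(name, check_mig_config(devices))
```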

Time-Slicing Based Partitioning

Time-Slicing shares a GPU by giving each workload a slice of GPU compute time in turn, which lets multiple Pods use the same GPU. Unlike MIG, it is available on all NVIDIA GPUs, but it provides no memory isolation between workloads, and performance degrades when workloads run concurrently.

Configure Time-Slicing via GPU Operator's ClusterPolicy or ConfigMap.

# Time-Slicing ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
          - name: nvidia.com/gpu
            replicas: 4  # Each GPU shared by 4 Pods

---

# Enable Time-Slicing in ClusterPolicy
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  devicePlugin:
    enabled: true
    config:
      name: time-slicing-config  # Reference ConfigMap
      default: any

---

# Use Time-Sliced GPU in Pod (same as regular GPU request)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-timeslice-replicas
  namespace: ai-inference
spec:
  replicas: 3  # 3 Pods share the same GPU
  selector:
    matchLabels:
      app: vllm-slice
  template:
    metadata:
      labels:
        app: vllm-slice
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          resources:
            requests:
              nvidia.com/gpu: 1  # GPU slice allocated when Time-Slicing enabled
              memory: "8Gi"
              cpu: "2"
            limits:
              nvidia.com/gpu: 1

Time-Slicing Performance Characteristics:

Context Switching Overhead: Minimal, at roughly 1%
Concurrent Execution Performance Degradation: Workloads share GPU memory and compute, so performance degrades by 50-100% depending on the number of concurrent workloads
No Memory Isolation: Unlike MIG, GPU memory is not isolated between workloads, so an OOM in one workload affects the others

Use Case	Suitability	Reason
Batch inference, non-urgent tasks	Suitable	Sequential execution minimizes performance impact
Development/test environments	Suitable	GPU cost savings, performance guarantee unnecessary
Real-time inference (with SLA)	Unsuitable	Unpredictable latency during concurrent workloads
High-performance training	Unsuitable	Requires full GPU memory utilization
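
The capacity arithmetic behind time-slicing is simple but easy to get wrong. With `replicas: 4`, each physical GPU is advertised as 4 schedulable `nvidia.com/gpu` resources, and because memory is not isolated, every Pod has to cap its own usage, for example through vLLM's `--gpu-memory-utilization`. The Python sketch below works through both numbers; the function names and the 10% headroom are assumptions for illustration.

```python
def advertised_gpus(physical_gpus: int, replicas: int) -> int:
    """Number of nvidia.com/gpu resources a node advertises with time-slicing."""
    if physical_gpus < 0 or replicas < 1:
        raise ValueError("physical_gpus must be >= 0 and replicas >= 1")
    return physical_gpus * replicas


def per_pod_memory_fraction(replicas: int, headroom: float = 0.1) -> float:
    """Largest memory fraction each Pod can claim if all replicas share one GPU.

    headroom is the share of GPU memory left unallocated (an assumed safety margin).
    """
    if replicas < 1:
        raise ValueError("replicas must be >= 1")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    return (1.0 - headroom) / replicas


if __name__ == "__main__":
    print("schedulable GPUs on an 8-GPU node:", advertised_gpus(8, 4))
    print("max --gpu-memory-utilization per Pod:", round(per_pod_memory_fraction(4), 3))
```

A Pod that keeps a higher memory setting can exhaust GPU memory for its neighbors, which is the failure mode described under No Memory Isolation.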

NVIDIA Dynamo: Datacenter-Scale Inference Optimization

Overview

NVIDIA Dynamo is an open-source framework that optimizes LLM inference at datacenter scale. It supports vLLM, SGLang, and TensorRT-LLM as backends, achieving up to 7x performance improvement over existing solutions in the SemiAnalysis InferenceX benchmark.

Dynamo v1.0 (2026.03 GA)

Supported Backends: vLLM, SGLang, TensorRT-LLM
Serving Modes: Both Aggregated + Disaggregated equally supported
Core Technologies: Flash Indexer (radix tree KV indexing), NIXL (common KV transfer), KAI Scheduler (GPU-aware Pod placement), Planner (SLO-based autoscaling), EPP (Gateway API integration)
Deployment: Kubernetes Operator + CRD based
License: Apache 2.0

Core Architecture

Dynamo equally supports both Aggregated Serving and Disaggregated Serving. In Disaggregated mode, it separates Prefill (prompt processing) and Decode (token generation) for independent scaling per stage. The latest release introduces radix tree-based Flash Indexer for indexing KV cache per worker to optimize prefix matching.
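
The idea of prefix-aware routing over a KV cache index can be sketched in a few lines of Python. This is not Dynamo's Flash Indexer: it is a plain (uncompressed) trie keyed by fixed-size token blocks, and the class name, block size, and worker names are all hypothetical. It only illustrates how a router can pick the worker that already holds the longest cached prefix of a request.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    children: dict = field(default_factory=dict)   # token block -> Node
    workers: set = field(default_factory=set)      # workers caching this prefix


class PrefixIndex:
    """Toy index of which workers cache which token-block prefixes."""

    def __init__(self, block_size: int = 4):
        self.block_size = block_size
        self.root = Node()

    def _blocks(self, tokens):
        # Only complete blocks are indexed; a trailing partial block is ignored.
        full = len(tokens) - len(tokens) % self.block_size
        return [tuple(tokens[i:i + self.block_size])
                for i in range(0, full, self.block_size)]

    def insert(self, worker: str, tokens) -> None:
        """Record that `worker` holds KV cache for this token sequence."""
        node = self.root
        for block in self._blocks(tokens):
            node = node.children.setdefault(block, Node())
            node.workers.add(worker)

    def best_worker(self, tokens):
        """Return (worker, matched token count) for the longest cached prefix."""
        node = self.root
        best, depth = None, 0
        for i, block in enumerate(self._blocks(tokens), start=1):
            node = node.children.get(block)
            if node is None:
                break
            best, depth = sorted(node.workers)[0], i  # deterministic tie-break
        return best, depth * self.block_size


if __name__ == "__main__":
    index = PrefixIndex(block_size=2)
    index.insert("worker-a", [1, 2, 3, 4, 5, 6])
    index.insert("worker-b", [1, 2, 9, 9])
    print(index.best_worker([1, 2, 3, 4, 7, 8]))
```

A production router would also weigh worker load and cache eviction; the sketch covers only the prefix-matching step.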

Core Components

Component	Role	Benefits
Disaggregated Serving	Separate Prefill/Decode workers (Aggregated also supported)	Independent scaling per stage, maximize GPU utilization
KV Cache Routing	Prefix-aware request routing	Improve KV Cache hit rate, reduce TTFT
Flash Indexer	Radix tree-based KV cache indexing per worker	Optimize prefix matching, maximize KV reuse rate
KVBM (KV Block Manager)	GPU → CPU → SSD 3-tier cache	Maximize memory efficiency, support large contexts
NIXL	NVIDIA Inference Xfer Library (common KV transfer engine)	Ultra-fast KV Cache transfer between GPUs (NVLink/RDMA). Used by Dynamo, llm-d, production-stack, aibrix, and other inference projects
KAI Scheduler	GPU-aware K8s Pod scheduler	GPU topology, MIG slice-aware Pod placement. Depends on ClusterPolicy
Planner	SLO-based autoscaling	Run profiling → supply results to Planner → automatic scaling based on SLO targets
EPP (Endpoint Picker Protocol)	Gateway API integration	Dynamo's own EPP implementation for native K8s Gateway API integration
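
The GPU → CPU → SSD tiering listed for KVBM can be pictured as a chain of LRU caches in which an entry evicted from a faster tier is demoted to the next one. The Python sketch below is a toy model under that assumption, not KVBM's actual policy or API; the class name, tier capacities, and keys are hypothetical.

```python
from collections import OrderedDict

_MISSING = object()


class TieredKVCache:
    """Toy multi-tier LRU cache: evictions cascade from fast to slow tiers."""

    def __init__(self, capacities):
        # capacities: ordered list of (tier name, max entries), fastest first.
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]

    def _remove(self, key):
        for _, _, store in self.tiers:
            if key in store:
                return store.pop(key)
        return _MISSING

    def _insert(self, level, key, value):
        if level >= len(self.tiers):
            return  # fell off the slowest tier: entry is dropped
        _, capacity, store = self.tiers[level]
        store[key] = value
        store.move_to_end(key)
        if len(store) > capacity:
            old_key, old_value = store.popitem(last=False)  # least recently used
            self._insert(level + 1, old_key, old_value)     # demote one tier

    def put(self, key, value):
        self._remove(key)
        self._insert(0, key, value)

    def get(self, key):
        """Return the value and promote it to the fastest tier, or None."""
        value = self._remove(key)
        if value is _MISSING:
            return None
        self._insert(0, key, value)
        return value

    def location(self, key):
        for name, _, store in self.tiers:
            if key in store:
                return name
        return None


if __name__ == "__main__":
    cache = TieredKVCache([("gpu", 2), ("cpu", 1), ("ssd", 1)])
    for block in ["a", "b", "c", "d"]:
        cache.put(block, f"kv-{block}")
    print({block: cache.location(block) for block in "abcd"})
```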

EKS Deployment

Dynamo is deployed to EKS using the Kubernetes Operator pattern.

Installation Steps:

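# 1. Monitoring stack (Prometheus + Grafana)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace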

# 2. GPU Operator (skip if already installed)
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace

# 3. Dynamo Platform (Operator + etcd + NATS)
helm install dynamo-platform nvidia/dynamo-platform \
  --namespace dynamo-system --create-namespace

# 4. Deploy Dynamo vLLM workload
kubectl apply -f dynamo-vllm-deployment.yaml

DynamoGraphDeploymentRequest (DGDR) CRD-based deployment:

apiVersion: dynamo.nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: llama-70b-disagg
  namespace: ai-inference
spec:
  graph:
    name: disaggregated-llm
    engine: vllm
    model: meta-llama/Llama-3.1-70B-Instruct
  serving:
    mode: disaggregated  # aggregated | disaggregated
    prefill:
      replicas: 2
      resources:
        nvidia.com/gpu: 4
    decode:
      replicas: 4
      resources:
        nvidia.com/gpu: 2
  routing:
    strategy: prefix-aware
    kvCacheRouting: true
  sla:
    maxTTFT: 500ms
    maxITL: 50ms
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 16
    targetUtilization: 70

AIConfigurator

Dynamo's AIConfigurator automatically recommends optimal Tensor Parallelism (TP) and Pipeline Parallelism (PP) settings based on model and hardware configuration.

Feature	Description
Automatic TP/PP Recommendation	Optimal parallelization based on model size, GPU memory, network topology
Pareto Frontier	Find optimal points on throughput-latency tradeoff
Hardware Profiling	Auto-detect GPU-to-GPU bandwidth, NVLink topology
SLA-based Optimization	Configuration recommendations based on target TTFT/ITL
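
The Pareto frontier and SLA-based rows can be illustrated with a short Python sketch: given measured (throughput, latency) points for candidate TP/PP settings, keep only the non-dominated ones, then pick the highest-throughput candidate that meets a latency target. This is not AIConfigurator's code; the candidate names and all numbers are made up for illustration.

```python
from typing import NamedTuple, Optional


class Candidate(NamedTuple):
    name: str
    throughput: float   # tokens/s, higher is better
    latency_ms: float   # lower is better


def pareto_frontier(candidates: list[Candidate]) -> list[Candidate]:
    """Return candidates not dominated on (throughput, latency)."""
    frontier = []
    for c in candidates:
        dominated = any(
            o.throughput >= c.throughput and o.latency_ms <= c.latency_ms
            and (o.throughput > c.throughput or o.latency_ms < c.latency_ms)
            for o in candidates
        )
        if not dominated:
            frontier.append(c)
    return sorted(frontier, key=lambda c: c.latency_ms)


def pick_under_sla(candidates: list[Candidate],
                   max_latency_ms: float) -> Optional[Candidate]:
    """Highest-throughput frontier point that meets the latency target."""
    feasible = [c for c in pareto_frontier(candidates)
                if c.latency_ms <= max_latency_ms]
    return max(feasible, key=lambda c: c.throughput) if feasible else None


# Hypothetical profiling results, for illustration only.
CANDIDATES = [
    Candidate("TP1-PP4", 1200, 90),
    Candidate("TP2-PP2", 1500, 60),
    Candidate("TP4-PP1", 1400, 40),
    Candidate("TP8-PP1", 1000, 45),
    Candidate("TP2-PP4", 1500, 70),
]

if __name__ == "__main__":
    print([c.name for c in pareto_frontier(CANDIDATES)])
    print(pick_under_sla(CANDIDATES, max_latency_ms=50))
```

With these made-up numbers, tightening the latency target moves the choice from the highest-throughput point to a lower-latency one, which is the tradeoff the frontier is meant to expose.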

llm-d vs Dynamo Selection Guide

Both llm-d and NVIDIA Dynamo handle LLM inference routing and scheduling. Because their roles overlap, they are generally treated as alternatives: choose one rather than running both side by side.

Feature Comparison

Item	llm-d	NVIDIA Dynamo
Architecture	Aggregated + Disaggregated	Aggregated + Disaggregated (equally supported)
KV Cache Routing	Prefix-aware routing	Prefix-aware + Flash Indexer (radix tree)
KV Cache Transfer	NIXL (network also supported)	NIXL (NVLink/RDMA ultra-fast transfer)
Routing	Gateway API + Envoy EPP	Dynamo Router + own EPP (Gateway API integration)
Pod Scheduling	Default K8s scheduler (no built-in)	KAI Scheduler (GPU-aware Pod placement)
Autoscaling	HPA/KEDA integration	Planner (SLO-based: profiling → autoscale) + KEDA/HPA
vLLM Backend	Supported	Supported (also SGLang, TRT-LLM)
Kubernetes Integration	Gateway API native	Operator + CRD (DGDR) + Gateway API EPP
Complexity	Low -- add router to existing vLLM	High -- replace entire serving stack
Performance Gain	Reduce TTFT via prefix hit	Flash Indexer + Disaggregated for up to 7x throughput
Maturity	v0.5+	v1.0 GA (2026.03)

Why Difficult to Use Together

Both act as routers that decide which backend to send requests to. Since they compete at the routing layer, connecting two routers serially makes no sense.

llm-d alone:    Client → llm-d Router → vLLM Workers (Aggregated or Prefill/Decode separated)
Dynamo alone:   Client → Dynamo Router → Prefill Workers → (NIXL) → Decode Workers

Dynamo + llm-d Integration Possibility

Dynamo 1.0 can integrate llm-d as an internal component. In this case, llm-d acts not as an independent router but as Dynamo's KV Cache-aware routing layer. In that configuration the two are not strict alternatives; Dynamo can be viewed as a superset that contains llm-d.

Selection Criteria

Scenario	Recommendation
Add routing only to existing vLLM deployment	llm-d
Small to medium scale (8 GPUs or fewer)	llm-d
Gateway API-based K8s native routing	llm-d
Large scale (16+ GPUs), maximize throughput	Dynamo
Need NIXL-based ultra-fast KV transfer between GPUs	Dynamo
Long context (128K+) workloads	Dynamo (NIXL + 3-tier KV cache)
Fast adoption, low operational complexity	llm-d

Migration Path

Starting with llm-d and transitioning to Dynamo as scale grows is practical. Both use vLLM as backend and leverage NIXL for KV transfer. Key differences are Dynamo's Flash Indexer (radix tree KV indexing), KAI Scheduler (GPU-aware Pod placement), and Planner (SLO-based autoscaling).

Summary

The NVIDIA GPU software stack consists of three layers: GPU Operator (infrastructure automation), DCGM (monitoring), and Run:ai (orchestration). GPUs can be efficiently partitioned through MIG and Time-Slicing, and NVIDIA Dynamo can be used to optimize LLM inference at datacenter scale.

Next Steps

EKS GPU Resource Management -- Karpenter, KEDA, DRA, cost optimization
EKS GPU Node Strategy -- Auto Mode + Karpenter + Hybrid Node configuration
vLLM Model Serving -- vLLM-based inference engine
