14 docs tagged with "monitoring"

EKS Node Monitoring Agent

Architecture, deployment strategies, limitations, and best practices for the AWS EKS Node Monitoring Agent that automatically detects and reports node health issues

Eval Gate · Registry · KPI

Threshold verification of trained checkpoints, kgateway-based gradual canary deployment, MLflow Registry version management, automatic rollback on regression, and cost and quality KPI dashboard configuration

Integrations & Cost

SageMaker hybrid integration, observability stack deployment, and coding tools cost analysis

NVIDIA GPU Stack

Architecture and EKS integration for GPU Operator, DCGM, MIG, Time-Slicing, and Dynamo

Operations & Governance

AI platform monitoring, observability, evaluation, compliance, and domain-specific operations guide

Reference Architecture

Production deployment and configuration reference architecture for the Agentic AI Platform