EKS Node Monitoring Agent
📅 Written: 2025-08-26 | Last Modified: 2026-02-13 | ⏱️ Reading Time: ~6 min
Overview
The EKS Node Monitoring Agent (NMA) is a node state monitoring tool provided by AWS. It automatically detects and reports hardware and system-level issues on nodes in an EKS cluster. It became generally available in 2024 and works alongside Node Auto Repair to improve cluster stability.
Problem Statement
Traditional EKS cluster operations had the following challenges:
- Lack of early detection of hardware failures
- Manual monitoring of system-level issues required
- Delayed response to node state changes
- Lack of integration between problem detection and automatic recovery
NMA was designed to address these issues.
What is EKS Node Monitoring Agent?
Key Features
- Log-based Problem Detection: Real-time analysis of system logs with pattern matching
- Automatic Event Generation: Automatically creates Kubernetes Events and Node Conditions when problems are detected
- CloudWatch Integration: Sends detected issues to CloudWatch for centralized monitoring
- EKS Add-on Support: Easy installation and management
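To make the log-based detection idea concrete, here is a minimal sketch of pattern matching over log lines. The rule names, the `logRule` type, and the regular expressions are illustrative assumptions and are not taken from NMA's actual rule set; only the kernel message shapes are typical Linux output.

```go
package main

import (
	"fmt"
	"regexp"
)

// logRule pairs a regular expression with the problem it signals.
// Hypothetical type: NMA's real rule definitions may look different.
type logRule struct {
	Reason  string
	Pattern *regexp.Regexp
}

// exampleRules is an assumed, deliberately tiny rule set.
var exampleRules = []logRule{
	{Reason: "SoftLockup", Pattern: regexp.MustCompile(`watchdog: BUG: soft lockup - CPU#\d+ stuck for \d+s`)},
	{Reason: "FilesystemError", Pattern: regexp.MustCompile(`EXT4-fs error \(device [^)]+\)`)},
}

// matchLine returns the reason of the first rule that matches the line,
// or an empty string when nothing matches.
func matchLine(line string) string {
	for _, r := range exampleRules {
		if r.Pattern.MatchString(line) {
			return r.Reason
		}
	}
	return ""
}

func main() {
	lines := []string{
		"kernel: watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [kworker/3:1:123]",
		"kernel: EXT4-fs error (device nvme0n1p1): ext4_find_entry:1455: inode #2: comm ls",
		"systemd[1]: Started Session 42 of user ec2-user.",
	}
	for _, l := range lines {
		if reason := matchLine(l); reason != "" {
			fmt.Printf("detected %s: %s\n", reason, l)
		}
	}
}
```

In a real agent, a match would feed the event or condition reporting path rather than standard output.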
NMA is useful for automatically detecting node state issues, but it is not a complete monitoring solution on its own. Set expectations accordingly and pair it with complementary tools, as outlined below.
✅ Recommended Usage
- Use NMA as a node state detection layer
- Supplement with Container Insights or Prometheus for metrics collection
- Use with Node Auto Repair to implement automatic recovery
- Adjust thresholds to match environment-specific characteristics
❌ Usage to Avoid
- Depending on NMA alone for complete monitoring
- Expecting NMA to respond to sudden hardware failures
1. Design Goals
1.1 Comprehensive Node State Monitoring
NMA monitors various system components of EKS nodes:
- Container Runtime: Verifying Docker/containerd status
- Storage System: Monitoring disk space and I/O performance
- Networking: Validating network connectivity and configuration
- Kernel: Checking kernel modules and system state
- Accelerated Hardware: GPU (NVIDIA) and Neuron chip state (when hardware is detected)
1.2 Kubernetes Native Integration
NMA integrates tightly with Kubernetes using controller-runtime:
mgr, err := controllerruntime.NewManager(controllerruntime.GetConfigOrDie(), controllerruntime.Options{
	Logger:                 log.FromContext(ctx),
	Scheme:                 scheme.Scheme,
	HealthProbeBindAddress: controllerHealthProbeAddress,
	BaseContext:            func() context.Context { return ctx },
	Metrics:                server.Options{BindAddress: controllerMetricsAddress},
})
1.3 Support for Diverse EKS Environments
As evident from the REST configuration logic, NMA supports various EKS environments:
- EKS Auto: Uses special user impersonation flow
- Legacy RBAC: Supports existing authorization model
- Standard: Pod-based authentication
2. Architecture and Operation Principles
2.1 Agent Startup and Initialization Flow
The following diagram shows the NMA startup process and the complete flow of the monitoring loop.
2.2 Monitor Registration and Management
NMA manages each subsystem through monitor configuration. The following shows the structure of monitor registration.
var monitorConfigs = []monitorConfig{
	{
		Monitor:       &runtime.RuntimeMonitor{},
		ConditionType: rules.ContainerRuntimeReady,
	},
	{
		Monitor:       storage.NewStorageMonitor(),
		ConditionType: rules.StorageReady,
	},
	// ... additional monitors
}
Each monitor is connected to its corresponding Node Condition and reports state.
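The link between monitors and Node Conditions can be sketched as follows. This is a simplified illustration under stated assumptions: the `Monitor` interface, `staticMonitor`, `evaluate`, and `errDegraded` are hypothetical names introduced here, and plain strings stand in for the condition type constants from the `rules` package.

```go
package main

import (
	"errors"
	"fmt"
)

// Monitor is a hypothetical interface; the real NMA interface may differ.
type Monitor interface {
	Name() string
	Check() error // nil means the subsystem looks healthy
}

// monitorConfig mirrors the shape of the registration entries above,
// with a plain string standing in for the condition type.
type monitorConfig struct {
	Monitor       Monitor
	ConditionType string
}

// staticMonitor is a stand-in monitor that always returns a fixed result.
type staticMonitor struct {
	name string
	err  error
}

func (m staticMonitor) Name() string { return m.name }
func (m staticMonitor) Check() error { return m.err }

// errDegraded is an example failure used only for this demonstration.
var errDegraded = errors.New("I/O errors observed")

// evaluate runs every registered monitor once and maps each condition type
// to true (healthy) or false (problem detected).
func evaluate(configs []monitorConfig) map[string]bool {
	status := make(map[string]bool, len(configs))
	for _, c := range configs {
		status[c.ConditionType] = c.Monitor.Check() == nil
	}
	return status
}

func main() {
	configs := []monitorConfig{
		{Monitor: staticMonitor{name: "runtime"}, ConditionType: "ContainerRuntimeReady"},
		{Monitor: staticMonitor{name: "storage", err: errDegraded}, ConditionType: "StorageReady"},
	}
	for cond, healthy := range evaluate(configs) {
		fmt.Printf("%s=%v\n", cond, healthy)
	}
}
```

Keeping the registration as a flat slice means adding a subsystem is a one-entry change, which is presumably why the original uses the same shape.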
2.3 Node Condition-Based State Reporting
NMA leverages the Kubernetes Node Condition mechanism to report the state of each subsystem:
- ContainerRuntimeReady: Container runtime state
- StorageReady: Storage system state
- NetworkingReady: Networking state
- KernelReady: Kernel state
- AcceleratedHardwareReady: GPU/Neuron hardware state (conditional)
2.4 Real-time Diagnostic Capability
NMA supports on-demand diagnostic execution through the NodeDiagnostic CRD:
diagnosticController := controllers.NewNodeDiagnosticController(mgr.GetClient(), hostname, runtimeContext)
This allows operators to run diagnostic commands in real-time on specific nodes.
2.5 Observability
NMA provides observability through various endpoints:
- Health Probe (:8081): Kubernetes health checks
- Metrics (:8080): Prometheus metrics exposure
- PProf (:8082): Go profiling (optional)
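As a rough illustration of what a health probe endpoint does, the sketch below builds a handler with the standard library and exercises it in memory. In NMA these endpoints are served by controller-runtime; the `/healthz` path, `newHealthMux`, `probe`, and `errStalled` are assumptions made for this example.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// errStalled is an example failure used only for this demonstration.
var errStalled = errors.New("monitor stalled")

// newHealthMux returns a mux with a /healthz endpoint that answers 200 while
// check returns nil and 503 otherwise. The path name is an assumption.
func newHealthMux(check func() error) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		if err := check(); err != nil {
			http.Error(w, err.Error(), http.StatusServiceUnavailable)
			return
		}
		fmt.Fprintln(w, "ok")
	})
	return mux
}

// probe issues an in-memory GET against the handler and returns the status code.
func probe(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	healthy := newHealthMux(func() error { return nil })
	failing := newHealthMux(func() error { return errStalled })
	fmt.Println(probe(healthy, "/healthz"), probe(failing, "/healthz")) // prints: 200 503
	// A long-running process would instead serve the mux on the probe port,
	// for example http.ListenAndServe(":8081", healthy).
}
```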
2.6 Console Diagnostic Logging
When the -console-diagnostics flag is enabled, system information is periodically recorded to /dev/console:
if enableConsoleDiagnostics {
	startConsoleDiagnostics(ctx)
}
This provides visibility at the instance level.
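A periodic diagnostics loop of this kind might look like the sketch below. It is not the body of `startConsoleDiagnostics`, which the source does not show; `runDiagnostics`, `demoRun`, and the `uptime=ok` snapshot are hypothetical. The tick channel and the writer are injected so the loop can be exercised without a real timer or access to /dev/console.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"io"
	"time"
)

// runDiagnostics writes one snapshot to w for every tick until ctx is canceled.
// In production, ticks could come from time.NewTicker(interval).C and w could
// be an open handle on /dev/console; both are assumptions of this sketch.
func runDiagnostics(ctx context.Context, w io.Writer, ticks <-chan time.Time, collect func() string) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticks:
			fmt.Fprintln(w, collect())
		}
	}
}

// demoRun drives the loop with n manual ticks and returns what was written.
func demoRun(n int) string {
	ctx, cancel := context.WithCancel(context.Background())
	ticks := make(chan time.Time) // unbuffered: each tick is consumed before the next send
	var buf bytes.Buffer
	done := make(chan struct{})
	go func() {
		defer close(done)
		runDiagnostics(ctx, &buf, ticks, func() string { return "uptime=ok" })
	}()
	for i := 0; i < n; i++ {
		ticks <- time.Now()
	}
	cancel()
	<-done // wait for the loop to exit before reading buf
	return buf.String()
}

func main() {
	fmt.Print(demoRun(3)) // three lines of "uptime=ok"
}
```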
2.7 Deployment and Operations Characteristics
2.7.1 DaemonSet-Based Deployment
As seen in agent.tpl.yaml, NMA is deployed as a DaemonSet running on all worker nodes:
kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: eks-node-monitoring-agent
  namespace: kube-system
2.7.2 Node Selection and Constraints
The affinity configuration in values.yaml restricts execution to specific node types:
- Fargate nodes excluded
- EKS Auto compute types excluded
- HyperPod nodes excluded
- AMD64/ARM64 architectures only
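The effect of these constraints can be expressed as a small predicate over node labels. This is a sketch, not the chart's affinity definition: `kubernetes.io/arch` is a well-known Kubernetes label, while the `eks.amazonaws.com/compute-type` key and its values `fargate`, `auto`, and `hyperpod` are assumptions about how the exclusions are encoded.

```go
package main

import "fmt"

const (
	archLabel        = "kubernetes.io/arch"             // well-known Kubernetes node label
	computeTypeLabel = "eks.amazonaws.com/compute-type" // assumed label key
)

// Assumed label values for the excluded node types listed above.
var excludedComputeTypes = map[string]bool{
	"fargate":  true,
	"auto":     true,
	"hyperpod": true,
}

var allowedArchs = map[string]bool{"amd64": true, "arm64": true}

// eligible reports whether a node with the given labels would satisfy the
// constraints: a supported architecture and no excluded compute type.
func eligible(labels map[string]string) bool {
	if !allowedArchs[labels[archLabel]] {
		return false
	}
	return !excludedComputeTypes[labels[computeTypeLabel]]
}

func main() {
	nodes := []map[string]string{
		{archLabel: "arm64"},
		{archLabel: "amd64", computeTypeLabel: "fargate"},
		{archLabel: "s390x"},
	}
	for _, n := range nodes {
		fmt.Printf("%v eligible=%v\n", n, eligible(n))
	}
}
```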
2.7.3 Permission Management
The RBAC configuration in agent.tpl.yaml applies the principle of least privilege:
rules:
  # monitoring permissions
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["create", "patch"]
  # nodediagnostic permissions
  - apiGroups: ["eks.amazonaws.com"]
    resources: ["nodediagnostics"]
    verbs: ["get", "watch", "list"]
2.7.4 Resource Efficiency
NMA is lightweight. The resource requests and limits are defined in values.yaml:
resources:
  requests:
    cpu: 10m
    memory: 30Mi
  limits:
    cpu: 250m
    memory: 100Mi
2.8 Detectable Problem Types
2.8.1 Conditions (Auto-Recovery Targets)
- DiskPressure: Insufficient disk space
- MemoryPressure: Insufficient memory
- PIDPressure: Process ID exhaustion
- NetworkUnavailable: Network interface issues
- KubeletUnhealthy: Kubelet service anomalies
- ContainerRuntimeUnhealthy: Docker/containerd issues
2.8.2 Events (Warning Purposes)
- Kernel soft lockup
- I/O delays
- Filesystem errors
- Network packet loss
- Hardware error symptoms (Network, Storage, GPU, CPU, Memory)
3. Differences by Deployment Method
3.1 Manual Mode (DaemonSet)
Advantages:
- Flexible version management
- ConfigMap-based configuration changes
- Custom configuration possible
Disadvantages:
- High kubelet dependency
- Delays during node bootstrap
- Affected by kubelet failures
3.2 EKS Auto Mode
Advantages:
- Embedded directly in AMI
- Independent of kubelet execution
- Higher availability
- Faster problem detection
Disadvantages:
- AMI replacement needed for updates
- Limited customization
4. Technical Limitations
4.1 Metrics Collection Limitations
- NMA is not a metrics collection tool: Cannot collect performance metrics (CPU, memory usage, etc.)
- Log parsing approach: Does not use cAdvisor; purely log analysis-based
- Prometheus endpoint: Exposes only limited health state metrics (port 8080)
4.2 Constraints When Using Alternative Backends
- No native ADOT integration
- Prometheus metrics scope very limited
- No configuration change options
- Lack of official documentation and support
4.3 Hardware Failure Detection Limitations
Can Detect:
- ✅ Gradual performance degradation
- ✅ Increased I/O errors
- ✅ Memory ECC errors
Cannot Detect:
- ❌ Sudden power loss
- ❌ Immediate hardware failure
- ❌ Complete network disconnection
5. Recommended Implementation Strategy
5.1 Multi-Layer Monitoring Architecture
Integrated Monitoring Stack:
├── L1: State Detection (NMA)
│ └── Early node problem detection
├── L2: Metrics Collection (Container Insights/Prometheus)
│ └── Detailed performance data
├── L3: Automatic Response (Node Auto Repair)