Innovating K8s Operations with AI — AIOps Strategy Guide

📅 Written: 2026-02-12 | Updated: 2026-02-14 | ⏱️ Reading time: About 48 minutes


1. Overview

AIOps (Artificial Intelligence for IT Operations) is an operational paradigm that applies machine learning and big data analytics to IT operations, automating incident detection, diagnosis, and recovery while dramatically reducing the complexity of infrastructure management.

The Kubernetes platform provides powerful capabilities such as declarative APIs, auto-scaling, and self-healing, but its complexity places a significant burden on operations teams. AIOps uses AI to maximize these capabilities and the platform's scalability while reducing complexity and accelerating innovation.

What This Document Covers

  • AWS open-source strategy and the evolution of EKS
  • Core AIOps architecture based on Kiro + Hosted MCP
  • Comparison of programmatic and instruction-based operations
  • Paradigm differences between traditional monitoring and AIOps
  • AIOps core capabilities and EKS application scenarios
  • AWS AIOps service map and maturity model
  • ROI evaluation framework
Learning Path

This document is the first in the AIOps & AIDLC series. Complete learning path:

  1. AIOps Strategy Guide (current document) → 2. Intelligent Observability Stack → 3. AIDLC Framework → 4. Predictive Scaling and Auto-Remediation

2. AWS Open-Source Strategy and the Evolution of EKS

AWS's container strategy has consistently evolved in the direction of transforming open-source into K8s-native managed services. The core of this strategy is to maintain the strengths of the K8s ecosystem while eliminating operational complexity.

2.1 Managed Add-ons: Eliminating Operational Complexity

EKS Managed Add-ons are extension modules where AWS directly manages core K8s cluster functionality. Currently, more than 22 Managed Add-ons are available (see AWS official list).

EKS Managed Add-ons Categories (installed with a one-line aws eks create-addon; AWS manages versions and security patches)

| Category | Add-ons | Role |
|---|---|---|
| 🌐 Networking | VPC CNI, CoreDNS, kube-proxy | Pod networking, DNS, service proxy |
| 💾 Storage | EBS CSI, EFS CSI, FSx CSI, Mountpoint for S3, Snapshot Controller | Block/file/object storage, snapshots |
| 📊 Observability | ADOT, CloudWatch Agent, Node Monitoring, NFM Agent | Metrics/logs/traces, Container Network Observability |
| 🔒 Security | GuardDuty Agent, Pod Identity Agent, Private CA Connector | Runtime security, IAM auth, certificates |
| 🤖 ML | SageMaker HyperPod (Task Governance, Observability, Training, Inference) | ML training/inference workload management |

Key: AWS manages installation, upgrades, and security patches for Managed Add-ons. A single aws eks create-addon --addon-name <name> command deploys to production.
# Managed Add-on installation example — deploy and manage with a single command
aws eks create-addon \
  --cluster-name my-cluster \
  --addon-name adot \
  --addon-version v0.40.0-eksbuild.1

# Check installed Add-on list
aws eks list-addons --cluster-name my-cluster

2.2 Community Add-ons Catalog (2025.03)

The Community Add-ons Catalog launched in March 2025 enables one-click deployment of community tools such as metrics-server, cert-manager, and external-dns from the EKS console. Tools that previously required manual installation and management via Helm or kubectl have been incorporated into the AWS management framework.
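
Community add-ons go through the same Add-on API as AWS Managed Add-ons. As a minimal sketch, the boto3 calls below are real EKS APIs; the cluster name is a placeholder, and metrics-server availability varies by region:

# Minimal sketch: installing a community add-on through the same EKS Add-on API.
# "my-cluster" is a placeholder cluster name.
import boto3

eks = boto3.client("eks")

# Discover available versions of the community add-on
versions = eks.describe_addon_versions(addonName="metrics-server")
print([v["addonVersion"] for v in versions["addons"][0]["addonVersions"]])

# Install it exactly like an AWS Managed Add-on
eks.create_addon(clusterName="my-cluster", addonName="metrics-server")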

2.3 Managed Open-Source Services — Reduce Operational Burden, Avoid Vendor Lock-in

AWS's open-source strategy has two core objectives:

  1. Eliminate operational burden: AWS handles operational tasks such as patching, scaling, HA configuration, and backups
  2. Prevent vendor lock-in: Since standard open-source APIs (PromQL, Grafana Dashboard JSON, OpenTelemetry SDK, etc.) are used as-is, you can switch to self-managed operations when needed

This strategy is not limited to observability. It provides fully managed versions of major open-source projects across the entire infrastructure spectrum, including databases, streaming, search & analytics, and ML.

AWS Managed Open Source Services: keep open-source flexibility, delegate operations to AWS

| Domain | Services (open-source base) |
|---|---|
| 🗄️ Database | DocumentDB (MongoDB), ElastiCache (Redis/Valkey), MemoryDB (Redis), Keyspaces (Cassandra), Neptune (Graph) |
| 📡 Streaming · Messaging | MSK (Kafka), MQ (ActiveMQ/RabbitMQ) |
| 🔍 Search · Analytics | OpenSearch (Elasticsearch), EMR (Spark/Flink), MWAA (Airflow) |
| 📊 Observability | AMP (Prometheus), AMG (Grafana), ADOT (OpenTelemetry) |
| 📦 Container | EKS (Kubernetes), ECR (OCI Registry), App Mesh (Envoy) |
| 🤖 ML · AI | SageMaker (PyTorch/TF), Bedrock (Foundation Models) |

18+ managed open-source services across 6 domains: no vendor lock-in, OSS APIs with AWS-managed operations.

Among this broad managed open-source portfolio, the projects and services directly related to Kubernetes are organized as follows:

K8s Open Source Projects & Managed Services Map: open-source projects and their AWS managed counterparts in the Kubernetes ecosystem. Managed Add-ons are K8s extensions with an AWS-managed lifecycle.

| Open Source | AWS Managed Add-on | Role |
|---|---|---|
| Kubernetes VPC CNI | vpc-cni | Pod networking, Security Groups for Pods, Network Policy |
| CoreDNS | coredns | In-cluster DNS service |
| kube-proxy | kube-proxy | K8s service network proxy |
| OpenTelemetry Collector | adot | Metrics/logs/traces collection (vendor-neutral backend export) |
| EBS CSI Driver | aws-ebs-csi-driver | EBS block storage provisioning |
| EFS CSI Driver | aws-efs-csi-driver | EFS file storage mounting |
| Mountpoint for S3 CSI | aws-mountpoint-s3-csi-driver | Exposes S3 object storage as a filesystem |
| Snapshot Controller | snapshot-controller | PV snapshot management |
| GuardDuty Agent | aws-guardduty-agent | K8s runtime threat detection |
| Pod Identity Agent | eks-pod-identity-agent | Pod-level IAM role mapping |
| CloudWatch Observability | amazon-cloudwatch-observability | Container Insights Enhanced, Application Signals, 1-click onboarding |
| Node Monitoring Agent | eks-node-monitoring-agent | Node hardware and OS-level anomaly detection |
| Network Flow Monitor Agent | aws-network-flow-monitoring-agent | Container Network Observability data collection, Pod flows, cross-AZ visibility |

Total: 13+ Managed Add-ons + 5+ Community + 3+ Capabilities + 3+ Managed Services + 4 OSS Controllers = 28 total K8s-related open-source & managed services

2.3.1 Real-World Examples of Vendor Lock-in Prevention

The core value of AWS's managed open-source strategy is reducing operational burden without vendor lock-in. Since standard open-source APIs are used as-is, you can switch to a different backend when needed.

ADOT-Based Observability Backend Switching Pattern

ADOT (AWS Distro for OpenTelemetry) is based on OpenTelemetry, allowing you to freely switch observability backends without modifying application code.

Switchable backends:

| Backend | Type | Scope of Change When Switching |
|---|---|---|
| CloudWatch | AWS native | Only the ADOT Collector exporter configuration |
| Datadog | 3rd party SaaS | Only the ADOT Collector exporter configuration |
| Splunk | 3rd party (SaaS/on-prem) | Only the ADOT Collector exporter configuration |
| Grafana Cloud | Managed open-source | Only the ADOT Collector exporter configuration |
| Self-hosted Prometheus | Self-managed | Only the ADOT Collector exporter configuration |

Core Value of ADOT

When using ADOT (OpenTelemetry-based), there is no need to modify application code even when switching observability backends. This is the core value of AWS's open-source strategy. Applications generate metrics/traces/logs using the OpenTelemetry SDK, and the ADOT Collector collects and forwards them to the desired backend.

ADOT Collector Configuration Example: CloudWatch → Datadog Switch

# Using CloudWatch backend (existing)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:

exporters:
  awsemf:                 # ADOT's CloudWatch metrics exporter (EMF format)
    namespace: MyApp
    region: us-east-1

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [awsemf]

# Switching to Datadog backend (only exporter changed)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:

exporters:
  datadog:
    api:
      site: datadoghq.com
      key: ${DATADOG_API_KEY}

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [datadog]   # ← Only this part changed

Application code remains unchanged:

# Python application — no code modification needed when switching backends
from opentelemetry import metrics

meter = metrics.get_meter(__name__)
request_counter = meter.create_counter("http_requests_total")

def handle_request():
    request_counter.add(1)  # ← Same code regardless of backend

AMP/AMG → Self-hosted Migration Considerations

When migrating from AWS Managed Prometheus (AMP) and Grafana (AMG) to self-managed operations, the following should be considered.

AMP → Self-hosted Prometheus Migration:

| Item | AMP (Managed) | Self-hosted Prometheus |
|---|---|---|
| PromQL Compatibility | 100% compatible | 100% compatible (same queries usable) |
| Data Migration | Remote Write → self-hosted | Long-term storage must be built with Thanos/Cortex |
| Scaling | Automatically managed by AWS | Horizontal scaling must be built with Thanos/Cortex |
| High Availability | Automatically guaranteed by AWS | Clustering and replication must be configured manually |
| Operational Burden | None | Upgrades, patches, monitoring, backups required |
| Cost | Pay per ingestion/storage/query | Infrastructure cost + operational personnel cost |
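
The "100% compatible" row is concrete. A minimal sketch, with the AMP workspace ID and region as placeholders: the PromQL string is identical for both backends, and only the endpoint and SigV4 signing differ.

# Same PromQL against self-hosted Prometheus and AMP; only endpoint/auth change.
import boto3
import requests
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest

QUERY = 'sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)'

# Self-hosted Prometheus: plain HTTP
r1 = requests.get("http://prometheus.internal:9090/api/v1/query",
                  params={"query": QUERY})

# AMP: same query API path, plus SigV4 signing (workspace ID is a placeholder)
url = ("https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-EXAMPLE"
       "/api/v1/query?query=" + requests.utils.quote(QUERY))
creds = boto3.Session().get_credentials().get_frozen_credentials()
req = AWSRequest(method="GET", url=url)
SigV4Auth(creds, "aps", "us-east-1").add_auth(req)
r2 = requests.get(url, headers=dict(req.headers))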

AMG → Self-hosted Grafana Migration:

| Item | AMG (Managed) | Self-hosted Grafana |
|---|---|---|
| Dashboard Compatibility | 100% compatible | 100% compatible (JSON export/import) |
| IAM Integration | AWS IAM native | SAML/OAuth must be configured manually |
| Plugins | AWS data sources pre-installed | Manual installation and version management |
| Upgrades | Automatically performed by AWS | Must be planned and executed manually |
| High Availability | Automatically guaranteed by AWS | Load balancer and session store must be configured |
Comparison: AWS Managed vs Self-hosted vs 3rd Party

| Criteria | AWS Managed (AMP/AMG) | Self-hosted (Prometheus/Grafana) | 3rd Party (Datadog/Splunk) |
|---|---|---|---|
| Operational Complexity | Low (AWS manages) | High (self-managed) | Low (vendor manages) |
| Initial Setup | Simple (AWS Console/CLI) | Complex (cluster configuration) | Simple (SaaS registration) |
| Scaling | Automatic | Manual (Thanos/Cortex needed) | Automatic |
| Long-term Storage | AMP defaults to 150 days | Must be configured manually (S3 + Thanos, etc.) | Per vendor policy |
| Cost Structure | Usage-based | Infrastructure + personnel | Usage- or host-based |
| Data Sovereignty | Within the AWS Region | Full control | Vendor infrastructure |
| Customization | Limited | Full freedom | Within vendor-provided scope |
| Migration Ease | High (standard APIs) | High (standard open-source) | Medium (varies by vendor) |
Recommendations by Migration Scenario

AWS → Self-hosted migration: Consider when data sovereignty, customization, and cost optimization (large-scale environments) are the primary reasons. However, operational capability and personnel are essential.

AWS → 3rd Party migration: Consider when integrated observability platforms (APM, logs, infrastructure monitoring integration), advanced AI/ML capabilities, or multi-cloud integration are needed.

Self-hosted → AWS migration: Useful when reducing operational burden, automating high availability, and quick startup are needed. Particularly suitable for teams lacking observability expertise.

Key Message: Even when using AWS managed services, since standard open-source APIs (PromQL, OpenTelemetry, Grafana Dashboard JSON, etc.) are used as-is, you can migrate without technical lock-in when needed. This is the key differentiating point of AWS's open-source strategy.

2.4 Key Message of the Evolution

Evolution of AWS Open Source Strategy: remove complexity → strengthen automation → AI ops → autonomous ops

Stage 1: Remove Operational Complexity (build the foundation with K8s-native managed services)
  • Managed Add-ons (22+): VPC CNI, CoreDNS, ADOT, GuardDuty, EBS/EFS CSI, etc.
  • Managed Open Source: AMP (Prometheus), AMG (Grafana), ADOT (OpenTelemetry), MSK (Kafka), OpenSearch, etc.
  • Community Catalog: one-click deploy of metrics-server, cert-manager, external-dns, etc.
  → Keep OSS flexibility while delegating the operational burden to AWS

Stage 2: Strengthen Core Automation Components (EKS Capabilities + K8s-native automation)
  • Managed Argo CD: AWS-managed GitOps (HA, auto upgrades, IAM integration)
  • ACK: declarative management of 50+ AWS services via K8s CRDs
  • KRO: ResourceGroup CRD for composite resource deployment
  • LBC v3: Gateway API GA, JWT validation, header transformation
  • Karpenter: automatic node provisioning and instance optimization (built into EKS Auto Mode)
  → EKS evolves into the core automation component

Stage 3: Leverage AI for Efficient Operations (programmatic automation with Kiro + Hosted MCP)
  • Kiro: Spec-driven development (requirements → design → tasks → code)
  • Hosted MCP Servers: AI direct access to EKS, Serverless, Cost, Docs
  • Programmatic automation: shift from directing to code-based operations and debugging
  → Define a Spec once, execute repeatedly: cost-efficient, fast response

Stage 4: Expand to Autonomous Operations with AI Agents (Q Developer (GA) + Strands (OSS) + Kagent (early), adopted gradually)
  • Kagent: K8s-native AI agent, MCP integration (kmcp)
  • Strands Agents: AWS production-validated, Agent SOPs (natural-language workflows)
  • Amazon Q Developer: CloudWatch Investigations, EKS troubleshooting
  → Multi-source insights plus granular and broad control

Cumulative evolution model: each stage builds on the previous one.
Key Insight

EKS is the core executor of AWS's open-source strategy. It eliminates operational complexity with managed services, strengthens automation components with EKS Capabilities, enables efficient AI-powered operations with Kiro+MCP, and extends to autonomous operations with AI Agents — a cumulative evolution model where each stage builds upon the previous one.


3. The Core of AIOps: AWS Automation → MCP Integration → AI Tools → Kiro Orchestration

The AWS open-source strategy (Managed Add-ons, managed services, EKS Capabilities) explored in Section 2 provides the foundation for K8s operations. AIOps is a layered architecture that integrates automation tools with MCP, connects them with AI tools, and orchestrates everything with Kiro on top of this foundation.

[Layer 1] AWS Automation Tools — Foundation
Managed Add-ons · AMP/AMG/ADOT · CloudWatch · EKS Capabilities (Argo CD, ACK, KRO)

[Layer 2] MCP Servers — Unified Interface
50+ individual MCP servers expose each AWS service as AI-accessible tools

[Layer 3] AI Tools — Infrastructure Control via MCP
Q Developer · Claude Code · GitHub Copilot etc. directly query/control AWS services via MCP

[Layer 4] Kiro — Spec-Driven Unified Orchestration
requirements → design → tasks → code generation, native MCP integration for entire workflow

[Layer 5] AI Agent — Autonomous Operations (Extension)
Kagent · Strands · Q Developer autonomously detect, decide, and execute based on events

3.1 MCP — Unified Interface for AWS Automation Tools

The Managed Add-ons, AMP/AMG, CloudWatch, and EKS Capabilities from Section 2 are each powerful automation tools, but AI needs a standardized interface to access them. MCP (Model Context Protocol) fills this role. AWS provides more than 50 MCP servers as open-source, exposing each AWS service as a tool that AI tools can invoke.
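
As a sketch of what "AI-accessible tools" means mechanically, the MCP Python SDK can connect to one of these servers and enumerate its tools. The uvx package name follows the awslabs naming convention and should be adjusted to the server you actually use:

# Sketch: list the tools an AWS MCP server exposes, via the MCP Python SDK.
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    server = StdioServerParameters(command="uvx",
                                   args=["awslabs.eks-mcp-server@latest"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            for tool in tools.tools:
                print(tool.name, "-", tool.description)

asyncio.run(main())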

AWS MCP Servers — 50+ Service Ecosystem
AWS service map directly controlled by AI tools (Kiro, Q Developer, Claude Code)

🏗️ Infrastructure · IaC (8)
  • EKS MCP: cluster status, resource management
  • ECS MCP: service deployment, task management
  • IaC MCP: CloudFormation, CDK, security validation
  • Terraform MCP: plan/apply, security scan
  • Cloud Control API MCP: direct AWS resource management
  • Serverless MCP: Lambda, API Gateway, SAM
  • Lambda Tool MCP: execute Lambda as an AI tool
  • IAM MCP: roles/policies, least privilege

📊 Observability · Operations (4)
  • CloudWatch MCP: metrics, alarms, logs, troubleshooting
  • Managed Prometheus MCP: PromQL queries, metric lookup
  • CloudTrail MCP: API activity, change tracking
  • Support MCP: AWS Support case management

🤖 AI · ML (5)
  • Bedrock Knowledge Bases MCP: enterprise RAG search
  • Bedrock AgentCore MCP: AgentCore platform API
  • SageMaker AI MCP: ML resource management, development
  • Nova Canvas MCP: AI image generation
  • Q Business MCP: enterprise AI assistant

🗄️ Data · Messaging (6)
  • DynamoDB MCP: tables, CRUD, data modeling
  • Aurora PostgreSQL/MySQL MCP: DB operations via RDS Data API
  • Neptune MCP: graph DB (openCypher/Gremlin)
  • SNS/SQS MCP: messaging, queue management
  • Step Functions MCP: workflow execution
  • MSK MCP: Kafka cluster management

💰 Cost · Dev Tools (4)
  • Cost Explorer MCP: cost analysis, reporting
  • Pricing MCP: pre-deployment cost estimation
  • Documentation MCP: AWS official docs search
  • Knowledge MCP: code samples, content (GA, Remote)

🛡️ Security · Utilities (4)
  • Git Repo Research MCP: semantic code search, analysis
  • Diagram MCP: architecture diagram generation
  • Frontend MCP: React, web dev guide
  • Finch MCP: local container build, ECR integration

Plus 21+ additional servers (Aurora DSQL, DocumentDB, Redshift, ElastiCache, AppSync, IoT SiteWise, etc.) — see GitHub for the full list.

Hosting Evolution:
  • Individual Local (50+ servers, GA): install via npx/uvx, runs as an IDE process
  • Fully Managed (EKS/ECS, Preview): hosted in the AWS cloud, IAM and CloudTrail integration
  • Unified (Preview): 15,000+ APIs behind a single endpoint, Agent SOPs built in

Recommended path: start with Individual Local (GA), adopt Fully Managed for security/audit requirements, and use Unified for complex multi-service operations.

Full list: github.com/awslabs/mcp | Continuously updated with new servers

Detailed Comparison of 3 Hosting Methods

3 AWS MCP Server Deployment Options: Individual Local (GA) · Fully Managed (Preview) · Unified Server (Preview)

| Item | Individual Local MCP Server (50+, GA) | Fully Managed MCP Server (EKS/ECS, Preview) | AWS MCP Server, Unified (15,000+ APIs, Preview) |
|---|---|---|---|
| Release | 2024~ | 2025.11 | 2025.11 |
| Location | Local (npx/pip) | AWS Cloud (remote) | AWS Cloud (remote) |
| Scope | 1 server per service | Cloud-hosted version per service | All AWS APIs in a single server |
| Key Features | Service-specific deep tools (kubectl, PromQL, etc.) | IAM integration, CloudTrail auditing, auto patching, best-practice KB | API execution + AWS docs + Agent SOPs (workflow guides) |
| Install/Connect | npx @awslabs/mcp-server-eks | Remote connection from Kiro/IDE | Remote connection from Kiro/IDE |
| Use Case | Direct control of individual AWS services from Kiro/IDE | Environments with enterprise security and audit requirements | Multi-service complex tasks, natural-language AWS operations |

Recommended Start: begin with the Individual Local MCP Server (GA) to validate Kiro+MCP patterns, then migrate to Fully Managed based on enterprise security requirements. The unified AWS MCP Server is ideal for multi-service complex tasks.

Individual MCP vs Unified Server — Complementary, Not Replacement

The three methods are complementary, not replacement relationships. The key difference is depth vs breadth.

Individual MCP servers (EKS MCP, CloudWatch MCP, etc.) are specialized tools that understand the native concepts of their respective services. For example, EKS MCP provides Kubernetes-specific features such as kubectl execution, Pod log analysis, and K8s event-based troubleshooting. Fully Managed versions (EKS/ECS) host these same capabilities in the AWS cloud, adding enterprise requirements such as IAM authentication, CloudTrail auditing, and automatic patching.

The unified AWS MCP Server generically calls 15,000+ AWS APIs from a single server, combining AWS Knowledge MCP and AWS API MCP in one. For EKS, it can make AWS API-level calls such as eks:DescribeCluster and eks:ListNodegroups, but it does not provide specialized features like Pod log analysis or K8s event interpretation. Its strengths are instead multi-service composite operations (e.g., S3 + Lambda + CloudFront combinations) and Agent SOPs (pre-built workflows).

Practical Combined Usage Pattern
EKS specialized tasks  → Individual EKS MCP (or Fully Managed)
"Analyze the cause of Pod CrashLoopBackOff"

Multi-service tasks → AWS MCP Server unified
"Deploy a static site to S3 and connect CloudFront"

Operational insights → Individual CloudWatch MCP + Cost Explorer MCP
"Analyze the cause of last week's cost spike and metric anomalies"

By connecting both individual MCP and unified servers to your IDE, AI tools automatically select the appropriate server based on task characteristics.

3.1.1 Amazon Bedrock AgentCore Integration Pattern

Amazon Bedrock AgentCore is a fully managed platform for safely deploying and managing AI Agents in production environments. By integrating with MCP servers, you can build enterprise-grade Agents that automate EKS monitoring and operational tasks.

Bedrock AgentCore Overview

Bedrock AgentCore provides the following capabilities:

| Capability | Description | Value in EKS Operations |
|---|---|---|
| Agent Orchestration | Automatic execution of complex multi-step workflows | Autonomous execution of EKS incident response scenarios |
| Knowledge Bases | RAG-based context retrieval | Learning from past incident response history |
| Action Groups | External API/tool integration | EKS control via MCP servers |
| Guardrails | Safety mechanisms and filtering | Automatic blocking of dangerous operational commands |
| Audit Logging | CloudTrail-integrated audit trail | Compliance and security auditing |

Bedrock Agent Architecture Pattern for EKS Monitoring/Operations

Architecture:

[CloudWatch Alarms / EventBridge Events]

[Bedrock Agent Trigger]

[Bedrock AgentCore Orchestration]
├─ Knowledge Base: Search past incident response history
├─ Action Group 1: EKS MCP Server (Pod status query, log collection)
├─ Action Group 2: CloudWatch MCP (metric analysis)
├─ Action Group 3: X-Ray MCP (trace analysis)
└─ Guardrails: Dangerous command filtering (production deletion prevention)

[Autonomous Diagnosis and Recovery Execution]

[CloudTrail Audit Log Recording]

Practical Example: Automated Pod CrashLoopBackOff Response Agent

# Bedrock Agent definition (illustrative Terraform sketch — resource names follow
# the AWS provider's aws_bedrockagent_* / aws_bedrock_guardrail resources;
# verify attributes against your provider version)
resource "aws_bedrockagent_agent" "eks_incident_responder" {
  agent_name              = "eks-incident-responder"
  agent_resource_role_arn = aws_iam_role.agent_role.arn  # IAM role defined elsewhere
  foundation_model        = "anthropic.claude-3-5-sonnet-20241022-v2:0"
  instruction             = <<EOF
You are an EKS operations expert responsible for diagnosing and resolving
Kubernetes incidents. When a Pod enters CrashLoopBackOff state:
1. Collect Pod logs and events
2. Analyze error patterns
3. Check related resources (ConfigMaps, Secrets, Services)
4. Suggest remediation or auto-fix if safe
EOF

  # Guardrails: dangerous command blocking
  guardrail_configuration {
    guardrail_identifier = aws_bedrock_guardrail.production_safety.guardrail_id
    guardrail_version    = "1"
  }
}

# Action Group: EKS MCP Server integration (a separate resource in the provider)
resource "aws_bedrockagent_agent_action_group" "eks_operations" {
  action_group_name = "eks-operations"
  agent_id          = aws_bedrockagent_agent.eks_incident_responder.agent_id
  agent_version     = "DRAFT"
  description       = "EKS cluster operations via MCP"

  action_group_executor {
    lambda = aws_lambda_function.eks_mcp_proxy.arn
  }

  api_schema {
    payload = jsonencode({
      openapi = "3.0.0"
      info    = { title = "EKS MCP Actions", version = "1.0" }
      paths = {
        "/getPodLogs" = {
          post = {
            operationId = "getPodLogs"
            parameters = [
              { name = "cluster", in = "query", required = true, schema = { type = "string" } },
              { name = "namespace", in = "query", required = true, schema = { type = "string" } },
              { name = "pod", in = "query", required = true, schema = { type = "string" } }
            ]
          }
        }
        "/getPodEvents" = {
          post = {
            operationId = "getPodEvents"
            parameters = [
              { name = "cluster", in = "query", required = true, schema = { type = "string" } },
              { name = "namespace", in = "query", required = true, schema = { type = "string" } },
              { name = "pod", in = "query", required = true, schema = { type = "string" } }
            ]
          }
        }
      }
    })
  }
}

# Guardrails definition: production environment protection
resource "aws_bedrock_guardrail" "production_safety" {
  name                      = "production-safety"
  blocked_input_messaging   = "This operation is blocked by the production safety policy."
  blocked_outputs_messaging = "This response is blocked by the production safety policy."

  # Content filters (note: these target harmful content categories;
  # operational restrictions are expressed via the topic policy below)
  content_policy_config {
    filters_config {
      input_strength  = "HIGH"
      output_strength = "HIGH"
      type            = "VIOLENCE"
    }
  }

  # Sensitive data filtering
  sensitive_information_policy_config {
    pii_entities_config {
      action = "BLOCK"
      type   = "AWS_ACCESS_KEY"
    }
    pii_entities_config {
      action = "BLOCK"
      type   = "AWS_SECRET_KEY"
    }
  }

  # Deny destructive operations by topic
  topic_policy_config {
    topics_config {
      name       = "blocked_operations"
      type       = "DENY"
      definition = "Pod deletion in production namespace"
      examples   = ["Delete all pods in the production namespace"]
    }
  }
}

AgentCore + MCP Server Integration Workflow

Step 1: Lambda Proxy Calls MCP Server

# Lambda Function: Bedrock Agent Action → EKS MCP Server proxy
import json
import requests

def lambda_handler(event, context):
    # Bedrock Agent passes parameters as a list of {name, type, value} dicts
    action = event['actionGroup']
    api_path = event['apiPath']
    params = {p['name']: p['value'] for p in event.get('parameters', [])}

    # EKS MCP Server call (Hosted MCP endpoint)
    mcp_endpoint = "https://mcp-eks.aws.example.com"

    if api_path == "/getPodLogs":
        response = requests.post(f"{mcp_endpoint}/tools/get-pod-logs", json={
            "cluster": params['cluster'],
            "namespace": params['namespace'],
            "pod": params['pod'],
            "tail": 100
        })
        logs = response.json()['logs']

        return {
            'messageVersion': '1.0',
            'response': {
                'actionGroup': action,
                'apiPath': api_path,
                'httpMethod': 'POST',
                'httpStatusCode': 200,
                'responseBody': {
                    'application/json': {
                        'body': json.dumps({'logs': logs})
                    }
                }
            }
        }

Step 2: Automatic Agent Trigger with EventBridge Rule

{
  "source": ["aws.eks"],
  "detail-type": ["EKS Pod State Change"],
  "detail": {
    "status": ["CrashLoopBackOff"]
  }
}
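
Registering this rule with boto3 is straightforward. Note that EKS does not emit Pod-level state changes to EventBridge out of the box, so this sketch assumes a controller or alarm publishes a matching event; the Lambda ARN is a placeholder:

# Sketch: wire the event pattern above to the agent-trigger Lambda.
import json
import boto3

events = boto3.client("events")

events.put_rule(
    Name="eks-crashloop-to-bedrock-agent",
    EventPattern=json.dumps({
        "source": ["aws.eks"],
        "detail-type": ["EKS Pod State Change"],
        "detail": {"status": ["CrashLoopBackOff"]},
    }),
    State="ENABLED",
)

events.put_targets(
    Rule="eks-crashloop-to-bedrock-agent",
    Targets=[{
        "Id": "bedrock-agent-trigger",
        # Placeholder ARN for the Lambda that invokes the Bedrock Agent
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:agent-trigger",
    }],
)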

Bedrock Agent vs Kagent vs Strands Comparison

| Item | Bedrock Agent (AgentCore) | Kagent | Strands |
|---|---|---|---|
| Maturity | GA (production-ready) | Early stage (alpha) | Stabilizing (beta) |
| Hosting | Fully managed (AWS) | Self-hosted (K8s) | Self-hosted or cloud |
| MCP Integration | Lambda proxy required | Native MCP client | Direct MCP tool calls |
| Guardrails | Built in (AWS Guardrails) | Custom implementation required | Python decorator implementation |
| Audit Trail | CloudTrail auto-integration | Manual logging implementation required | Logging plugin configuration |
| Knowledge Base | Bedrock Knowledge Bases (RAG) | External vector DB integration | LangChain RAG integration |
| Cost Structure | Per-API-call billing | Infrastructure cost (K8s) | Infrastructure cost |
| Suitable Scenarios | Enterprise compliance, production automation | K8s-native integration, experimental AI operations | General-purpose Agent workflows, rapid prototyping |
| Advantages | Zero operational burden, enterprise-grade security | K8s CRD integration, native observability | Flexible workflows, rich tool ecosystem |
| Disadvantages | Lambda proxy required, AWS dependency | Early stage, may be unstable | Self-hosting required, operational burden |

Suitable Scenarios for Each Framework

When to choose Bedrock Agent:

  • When compliance and audit trails are essential in enterprise environments
  • When you don't want to manage AI Agent infrastructure yourself
  • When safety mechanisms must be enforced with AWS Guardrails
  • When past incident history needs to be learned through RAG

When to choose Kagent:

  • When K8s native integration is the top priority (CRD, Operator patterns)
  • When you want to quickly experiment with AI operations
  • When using non-AWS cloud or on-premises K8s clusters
  • When you can tolerate instability of early-stage projects

When to choose Strands:

  • When flexible Agent workflows and tool integration are needed
  • When you want to integrate with the Python ecosystem (LangChain, CrewAI, etc.)
  • When automating various tasks beyond EKS as a general-purpose AI Agent platform
  • When prioritizing prototyping and rapid experimentation
Practical Recommended Strategy

Production environment: Start with Bedrock Agent to meet enterprise requirements (security, auditing, Guardrails), then experimentally test Kagent/Strands in development/staging environments — a hybrid strategy is recommended. Bedrock Agent provides immediate stability, while Kagent/Strands lay the foundation for future transition to K8s-native autonomous operations.

3.2 AI Tools — Infrastructure Control via MCP

Once MCP exposes AWS services as AI-accessible interfaces, various AI tools can directly query and control infrastructure through them.

AI Tools Leveraging MCP
Three AI tools directly controlling AWS infrastructure via MCP
| AI Tool | MCP Usage | Strengths |
|---|---|---|
| Amazon Q Developer | Performs Investigations via CloudWatch MCP, EKS troubleshooting | Most mature production pattern, AWS console native integration |
| Claude Code | Multi-service ops analysis via concurrent MCP connections | Terminal-based, large context handling, autonomous agent loops |
| GitHub Copilot | References infra state during coding via MCP extension | IDE-native integration, code autocomplete + MCP combination |
Key: Each AI tool accesses AWS services via MCP, but with different usage patterns and integration levels. Q Developer: AWS console integration, Claude Code: autonomous agents, Copilot: IDE integration.

At this stage, AI tools perform individual tasks according to human instructions. They respond based on real-time data via MCP to prompts like "Check Pod status" or "Analyze costs." Useful, but limited in that each task is independent and requires human instruction each time.

3.3 Kiro — Spec-Driven Unified Orchestration

Kiro is an orchestration layer that goes beyond the limitations of individual AI tools, defining entire workflows as Specs and executing them consistently through MCP. Designed as MCP-native, it integrates directly with AWS MCP servers.

Kiro's Spec-driven workflow:

  1. requirements.md → Define requirements as structured Specs
  2. design.md → Document architectural decisions
  3. tasks.md → Automatically decompose implementation tasks
  4. Code generation → Generate code, IaC, and configuration files reflecting actual infrastructure data collected via MCP

If individual AI tools work in a "ask and answer" fashion, Kiro chains multiple MCP server calls from a single Spec definition to reach the final deliverable.

[1] Spec Definition (requirements.md)
"Optimize EKS cluster Pod auto-scaling based on traffic patterns"

[2] Collect Current State via MCP
├─ EKS MCP → Cluster configuration, HPA settings, node status
├─ CloudWatch MCP → Traffic patterns over the past 2 weeks, CPU/memory trends
└─ Cost Explorer MCP → Current cost structure, spending by instance type

[3] Context-Based Code Generation
Kiro generates based on collected data:
├─ Karpenter NodePool YAML (instance types matching actual traffic)
├─ HPA configuration (target values based on measured metrics)
└─ CloudWatch alarms (thresholds based on actual baselines)

[4] Deployment and Verification
Deploy via Managed Argo CD with GitOps → Real-time deployment result verification via MCP

The key to this workflow is that AI generates code based on actual infrastructure data, not abstract guesses. Without MCP, AI can only suggest general Best Practices, but with MCP, it creates customized deliverables reflecting the actual state of the current cluster.

Kiro + MCP Architecture (Agent Extensible)
Observability backends (AWS · OSS · 3rd party) → MCP abstraction → AI tools → automation actions (→ Agent extension)

Observability Data Sources (AWS native, OSS, and 3rd party all supported)
  • 📈 Metrics: AMP, CloudWatch, Datadog, etc.
  • 🔗 Traces: X-Ray, Jaeger, Datadog APM, etc.
  • 📋 Logs: OpenSearch, CloudWatch, Sumo Logic, etc.
  • ☸️ K8s API: events, status, resources

MCP Integration Layer (50+ servers): a single interface regardless of observability backend
  • ☸️ EKS MCP: cluster control
  • 📈 CloudWatch MCP: metrics, alarms, logs
  • 💰 Cost Explorer MCP: cost analysis
  • 🔒 IAM MCP: security management
  • 📖 Core MCP: 50+ server orchestration

AI Tools (Production Ready)
  • 🤖 Q Developer: CloudWatch Investigations, troubleshooting (GA)
  • 🔧 Kiro: Spec-driven development, MCP native
  • 💻 AI IDE: Claude Code, GitHub Copilot, etc.

Agent Extension (Gradual Adoption)
  • 📋 Strands SDK: Agent SOPs, natural-language workflows (OSS)
  • ⚙️ Kagent: K8s-native Agent, kmcp (early stage)

Automation Actions: auto incident response, deployment validation, resource optimization, cost reduction, root cause analysis

3.4 Extension to AI Agents — Autonomous Operations

If Kiro + MCP is orchestration where "humans define Specs and AI executes," AI Agent frameworks are the next stage where AI autonomously detects, decides, and executes based on events. On the same infrastructure interface provided by MCP, Agents run their own loops without human intervention.

AI Agent Framework Comparison: three frameworks for autonomous operations

| Tool | Nature | Maturity |
|---|---|---|
| Amazon Q Developer | AI assistant: CloudWatch Investigations, code review, security scan | GA, production-ready |
| Strands Agents SDK | AWS OSS Agent framework: define natural-language workflows via Agent SOPs | Open source, used internally at AWS |
| Kagent | CNCF community K8s-native AI Agent: CRD-based, MCP integration via kmcp | Early stage, experimental |
Recommended Approach: Start with Q Developer (GA) → Automate workflows with Strands (OSS) → Explore K8s-native autonomous ops with Kagent (early)

3.5 Amazon Q Developer & Q Business Latest Features

Amazon Q Developer and Q Business are AWS's representative AI-based operational tools. The two products are designed for different purposes but are used complementarily in the AIOps context.

Amazon Q Developer vs Q Business

Amazon Q Developer is a developer productivity tool specializing in code writing, infrastructure automation, and troubleshooting. Amazon Q Business is a business data analysis tool used for operational log and metric analysis and business insight generation. In AIOps, Q Developer is used for code/infrastructure automation, and Q Business for generating insights based on operational logs/metrics.

Amazon Q Developer Latest Features (2025-2026)

1. Real-time Code Build and Test (February 2025)

Q Developer now automatically builds and tests code changes before the developer reviews them.

Features:

  • Immediate build execution after code generation
  • Automatic unit test execution and result reporting
  • Automatic fix suggestions on build failures
  • Quality verification completed before developer review

Usage in EKS Environments:

Developer: "Add resource limits to the Deployment YAML and set up HPA"

Q Developer:
1. Modify Deployment YAML (add requests/limits)
2. Generate HPA YAML
3. Validate syntax with kubectl apply --dry-run=client
4. Present changes to developer (already verified)


2. CloudWatch Investigations Integration — AI-Based Root Cause Analysis

Q Developer integrates with CloudWatch Investigations to explain the root cause of operational incidents in natural language.

Workflow:

1. CloudWatch alarm triggered (e.g., memory usage spike in EKS Pod)
2. Ask Q Developer: "Why did Pod memory spike in the production namespace?"
3. Q Developer auto-analyzes:
├─ CloudWatch metrics: Memory usage patterns
├─ X-Ray traces: Memory leak suspected in specific API call
├─ EKS logs: OutOfMemory error logs
└─ Recent deployment history: New version deployed 2 hours ago
4. Q Developer response:
"A memory leak occurred due to a cache invalidation logic bug in v2.3.1
deployed 2 hours ago. Cache accumulates on /api/users endpoint calls.
Recommended action: Roll back to v2.3.0 or set a cache TTL."

3. Cost Explorer Integration — Automatic Cost Optimization Suggestions

Q Developer integrates with AWS Cost Explorer to automatically analyze cost spike causes and suggest optimization measures.

EKS Cost Optimization Scenario:

Developer: "Tell me why EKS costs spiked last week"

Q Developer Analysis:
├─ Cost Explorer: EC2 instance costs increased 40%
├─ CloudWatch metrics: Average CPU utilization 25% (over-provisioned)
├─ Karpenter logs: Mostly using c5.4xlarge instances
└─ Workload pattern: Memory-intensive, not CPU-intensive

Q Developer Recommendations:
1. Change c5.4xlarge → r5.2xlarge (memory-optimized instances)
2. Add Spot instance priority to Karpenter NodePool
3. Adjust HPA settings from CPU-based to memory-based
Estimated savings: $1,200/month (approximately 30%)
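
The underlying data for this kind of analysis is available programmatically. A minimal sketch of the Cost Explorer query such an answer relies on, grouping EC2 cost by instance type (dates and filters are illustrative):

# Sketch: daily EC2 cost by instance type via Cost Explorer.
import boto3

ce = boto3.client("ce")

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2026-02-01", "End": "2026-02-14"},
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    Filter={"Dimensions": {"Key": "SERVICE",
                           "Values": ["Amazon Elastic Compute Cloud - Compute"]}},
    GroupBy=[{"Type": "DIMENSION", "Key": "INSTANCE_TYPE"}],
)

for day in resp["ResultsByTime"]:
    for group in day["Groups"]:
        cost = float(group["Metrics"]["UnblendedCost"]["Amount"])
        if cost > 0:
            print(day["TimePeriod"]["Start"], group["Keys"][0], round(cost, 2))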

4. Direct Console Troubleshooting — Natural Language Queries for EKS Cluster Issues

You can invoke Q Developer from the AWS console to immediately query the current state of EKS clusters.

Examples:

Invoking Q Developer from the console:

Question: "Are there any Pods in CrashLoopBackOff state in this cluster?"
Answer: "The api-server Pod in the production namespace is in CrashLoopBackOff state.
Cause: ConfigMap 'api-config' does not exist."

Question: "What alarms are currently active?"
Answer: "Currently 3 CloudWatch alarms are in ALARM state:
1. EKS-HighMemoryUsage (exceeded 80% threshold)
2. EKS-FailedPods (more than 5 failures)
3. EKS-DiskPressure (node disk 90% used)"

5. Security Scan Auto-Fix Suggestions

Q Developer automatically scans code and IaC (Infrastructure as Code) for security vulnerabilities and suggests fixes.

Kubernetes YAML Security Scan Example:

# Deployment written by developer
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: app
          image: myapp:latest
          securityContext:
            runAsUser: 0   # ⚠️ Security issue: running as root

# Q Developer suggestion:
# "Running a container as root user is a security risk.
#  Modify as follows:"

          securityContext:
            runAsNonRoot: true
            runAsUser: 1000
            readOnlyRootFilesystem: true
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL

Amazon Q Business — Actionable Insights from Logs

Amazon Q Business specializes in analyzing business data (logs, metrics, documents) to generate action items.

CloudWatch Logs → Q Business Workflow:

1. Store EKS application logs in CloudWatch Logs
2. Connect CloudWatch Logs as a data source to Q Business
3. Natural language queries:
"What was the most frequent error in the last 24 hours?"
"What time period had the highest error rate and why?"
"What was the failure with the greatest customer impact?"

4. Q Business response:
- Error frequency chart by type
- Estimated number of affected users
- Root cause analysis (e.g., DB timeout on a specific API endpoint)
- Action item generation (e.g., "Need to increase DB connection pool size")

Operational Insight Auto-Generation Examples:

| Query | Q Business Response |
|---|---|
| "How did error rates change after this week's deployment?" | "Error rate increased from 15% to 22% after Monday's deployment. Main cause: /api/checkout endpoint timeout. Recommendation: increase the timeout from 5 seconds to 10 seconds." |
| "Which service costs the most?" | "The api-gateway service accounts for 40% of total costs. Main cause: unnecessary log storage (Debug level). Recommendation: changing the log level to Info can save $800/month." |
| "What feature has the most customer complaints?" | "3 timeout incidents occurred in the payment feature last week. Impact: approximately 200 customers experienced payment failures. Recommendation: adjust HPA settings for the payment service and optimize DB queries." |

Q Developer vs Q Business Usage Comparison:

| Scenario | Q Developer | Q Business |
|---|---|---|
| Code debugging | ✅ Recommended | - |
| IaC creation/modification | ✅ Recommended | - |
| Infrastructure troubleshooting | ✅ Recommended | - |
| Log pattern analysis | Possible | ✅ Recommended |
| Business insights | - | ✅ Recommended |
| Executive report generation | - | ✅ Recommended |

Practical Usage Patterns

Development teams use Q Developer for code writing, IaC management, and immediate troubleshooting. Operations teams use Q Developer for infrastructure issue resolution and Q Business for long-term trend analysis and cost optimization insights. Executives use Q Business to generate operational status reports in natural language.



Practical Application Guide
  • Start now: Introduce AI-based troubleshooting with Q Developer + CloudWatch MCP combination
  • Developer productivity: Build Spec-driven development workflows with Kiro + EKS/IaC/Terraform MCP
  • Gradual expansion: Codify repetitive operational scenarios as Strands Agent SOPs
  • Future exploration: Transition to autonomous operations when K8s-native Agent frameworks like Kagent mature
Core Value

The core value of this layered architecture is that each layer is independently valuable, while the level of automation increases as layers are stacked. Just connecting MCP allows direct infrastructure queries from AI tools; adding Kiro enables Spec-driven workflows; and adding Agents extends to autonomous operations. Regardless of which observability stack you use — AMP/CloudWatch/Datadog — MCP abstracts it as a single interface, so AI tools and Agents operate identically regardless of the backend.


4. Operations Automation Patterns: Human-Directed, Programmatically-Executed

The core of AIOps is the "Human-Directed, Programmatically-Executed" model where humans define intent and guardrails, and systems execute programmatically. This model is implemented as a spectrum of three patterns in the industry.

4.1 Prompt-Driven (Interactive) Operations

A pattern where humans give natural language prompt instructions at each step, and AI performs a single task. ChatOps and AI assistant-based operations fall into this category.

Operator: "Check the Pod status of the current production namespace"
AI: (Executes kubectl get pods -n production and returns results)
Operator: "Show me the logs of the Pod in CrashLoopBackOff state"
AI: (Executes kubectl logs and returns results)
Operator: "It seems to be out of memory, increase the limits"
AI: (Executes kubectl edit)

Suitable situations: exploratory debugging, analysis of new failure types, one-off actions
Limitations: a human is involved in every step of the loop (Human-in-the-Loop); inefficient for repetitive scenarios

4.2 Spec-Driven (Codified) Operations

A pattern where operational scenarios are declaratively defined as specifications (Specs) or code, and systems execute them programmatically. IaC (Infrastructure as Code), GitOps, and Runbook-as-Code fall into this category.

[Intent Definition]  Declare operational scenarios via requirements.md / SOP documents

[Code Generation] Generate automation code with Kiro + MCP (IaC, runbooks, tests)

[Verification] Automated tests + Policy-as-Code verification

[Deployment] Declarative deployment via GitOps (Managed Argo CD)

[Monitoring] Observability stack continuously tracks execution results

Suitable situations: repetitive deployments, infrastructure provisioning, standardized operational procedures
Core value: define the Spec once → repeated execution at no additional cost, guaranteed consistency, Git-based audit trail

4.3 Agent-Driven (Autonomous) Operations

A pattern where AI Agents detect events, collect and analyze context, and autonomously respond within predefined guardrails. Human-on-the-Loop — humans set guardrails and policies, and Agents execute.

[Event Detection]    Observability stack → Alert trigger

[Context Collection] Unified query of metrics + traces + logs + K8s state via MCP

[Analysis/Decision] AI performs root cause analysis + determines response plan

[Autonomous Execution] Auto-recovery within guardrail scope (Kagent/Strands SOPs)

[Feedback Learning] Record results and continuously improve response patterns

Suitable situations: automated incident response, cost optimization, predictive scaling and auto-remediation (document 4 in this series)
Core value: second-level response, 24/7 unmanned operations, context-based intelligent decision-making

4.4 Pattern Comparison: EKS Cluster Issue Response Scenario

Operation Pattern Comparison: EKS Cluster Issue Response (Prompt-Driven · Spec-Driven · Agent-Driven)

| Item | Prompt-Driven | Spec-Driven | Agent-Driven |
|---|---|---|---|
| Human Role | Directs each step (Human-in-the-Loop) | Defines intent + reviews results | Sets guardrails + handles exceptions (Human-on-the-Loop) |
| Response Start | Operator checks alert, then directs AI | Pre-defined pipeline trigger | Agent auto-starts after receiving alert |
| Data Collection | Requested one by one via prompts | Auto-collects data defined in the Spec | Concurrent multi-source collection via MCP |
| Analysis | Operator reviews results, directs next step | Executes pre-defined validation logic | AI auto-analyzes down to the root cause |
| Recovery | AI executes after operator approval | Declarative rollback/change via GitOps | Autonomous recovery within guardrails |
| Learning | Relies on the operator's personal experience | Organizational knowledge via Spec version history | Auto-learns from result feedback |
| Response Time | Minutes to hours | Minutes | Seconds to minutes |
| Representative Tools | Q Developer, ChatOps | Kiro + GitOps + Argo CD | Kagent, Strands SOPs |

Real-world Combination: the three patterns are complementary. Explore new failures with Prompt-Driven, codify recurring patterns with Spec-Driven, and finally automate with Agent-Driven in a gradual maturity process.
Combining Patterns in Practice

The three patterns are not mutually exclusive but complementary. In actual operations, you go through a gradual maturation process of exploring and analyzing new failure types with Prompt-Driven, codifying repeatable patterns with Spec-Driven, and ultimately automating with Agent-Driven. The key is to automate repetitive operational scenarios so that operations teams can focus on strategic work.


5. Traditional Monitoring vs AIOps

⚖️ Traditional Monitoring vs AIOps: paradigm shift comparison

| Dimension | Traditional Monitoring | AIOps |
|---|---|---|
| Data Analysis | Rule-based thresholds | ML-based pattern recognition |
| Anomaly Detection | Static threshold alerts | Dynamic-baseline anomaly detection |
| Root Cause Analysis | Manual log analysis | AI automatic correlation analysis |
| Alerting | Alert storms (alert fatigue) | Intelligent alert grouping/suppression |
| Automation | Limited, script-based | AI Agent autonomous response |
| Scalability | Manual configuration management | Automatic adaptive scaling |
| Cost Efficiency | Over-provisioning | AI-powered right-sizing |

The Core of the Paradigm Shift

Traditional monitoring is a model where humans define rules and systems execute rules. AIOps is a transition to a model where systems learn patterns from data and humans make strategic decisions.

Why this transition is particularly important in EKS environments:

  1. Microservice complexity: Dozens to hundreds of services interact, making it difficult to manually identify all dependencies
  2. Dynamic infrastructure: Infrastructure continuously changes with Karpenter-based automatic node provisioning
  3. Multi-dimensional data: Metrics, logs, traces, K8s events, and AWS service events occur simultaneously
  4. Speed requirements: Frequent deployments based on GitOps diversify failure causes

6. AIOps Core Capabilities

Let's examine the four core capabilities of AIOps along with EKS environment scenarios.

6.1 Anomaly Detection

Detects anomalies using ML-based dynamic baselines rather than static thresholds.

EKS Scenario: Gradual Memory Leak

Traditional approach:
Memory usage > 80% → Alert → Operator checks → OOMKilled already occurred

AIOps approach:
ML model detects slope change in memory usage pattern
→ "Memory usage is showing an abnormal increasing trend compared to normal"
→ Preemptive alert before OOMKilled occurs
→ Agent automatically collects memory profiling data

Applied services: DevOps Guru (ML anomaly detection), CloudWatch Anomaly Detection (metric bands)
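
On the CloudWatch side, the dynamic baseline is a one-call setup. A minimal sketch using the anomaly-detection alarm API; the Container Insights metric and cluster name are assumptions about your setup:

# Sketch: anomaly-detection alarm on pod memory instead of a static threshold.
import boto3

cw = boto3.client("cloudwatch")

cw.put_metric_alarm(
    AlarmName="eks-pod-memory-anomaly",
    ComparisonOperator="GreaterThanUpperThreshold",
    EvaluationPeriods=3,
    ThresholdMetricId="band",
    Metrics=[
        {
            "Id": "m1",
            "MetricStat": {
                "Metric": {
                    "Namespace": "ContainerInsights",
                    "MetricName": "pod_memory_utilization",
                    "Dimensions": [{"Name": "ClusterName", "Value": "my-cluster"}],
                },
                "Period": 300,
                "Stat": "Average",
            },
        },
        # ML-learned band: 2 standard deviations around the learned baseline
        {"Id": "band", "Expression": "ANOMALY_DETECTION_BAND(m1, 2)"},
    ],
)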

6.2 Root Cause Analysis

Automatically identifies root causes through correlation analysis of multiple data sources.

EKS Scenario: Intermittent Timeouts

Symptom: Intermittent 504 timeouts in API service

Traditional approach:
Check API Pod logs → Normal → Check DB connections → Normal
→ Check network → Check CoreDNS → Cause unknown → Hours spent

AIOps approach:
CloudWatch Investigations auto-analyzes:
├─ X-Ray traces: Latency in DB connections for a specific AZ
├─ Network Flow Monitor: Increased packet drops in that AZ's subnet
└─ K8s events: ENI allocation failures on nodes in that AZ
→ Root cause: Subnet IP exhaustion
→ Recommended action: Expand subnet CIDR or enable Prefix Delegation

Applied services: CloudWatch Investigations, Q Developer, Kiro + EKS MCP
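
A sketch of the X-Ray query behind the trace step, using the get_trace_summaries API; the filter expression follows X-Ray's filter syntax, and the one-hour window is arbitrary:

# Sketch: pull faulted, slow traces for the last hour from X-Ray.
import boto3
from datetime import datetime, timedelta, timezone

xray = boto3.client("xray")

end = datetime.now(timezone.utc)
resp = xray.get_trace_summaries(
    StartTime=end - timedelta(hours=1),
    EndTime=end,
    FilterExpression="fault = true AND responsetime > 5",
)

for summary in resp["TraceSummaries"]:
    print(summary["Id"], summary.get("ResponseTime"))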

6.3 Predictive Analytics

Learns from past patterns to predict future states and take preemptive action.

EKS Scenario: Traffic Spike Prediction

Data: Request volume patterns by time of day over the past 4 weeks

ML prediction:
2.5x traffic spike expected Monday at 09:00 (weekly pattern)
→ Preemptive node provisioning in Karpenter NodePool
→ Pre-adjust HPA minReplicas
→ Accommodate traffic without Cold Start

Applied services: CloudWatch metrics + Prophet/ARIMA models + Karpenter

For detailed implementation methods, see 4. Predictive Scaling and Auto-Remediation.
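
As a compact sketch of the loop under stated assumptions: request-rate history already exported with Prophet's expected columns (ds, y), roughly 100 req/s handled per replica, and an HPA named api-hpa in the production namespace:

# Sketch: forecast the next peak and pre-adjust HPA minReplicas.
import pandas as pd
from prophet import Prophet
from kubernetes import client, config

history = pd.read_csv("requests_per_sec.csv")  # columns: ds (timestamp), y (req/s)

model = Prophet(weekly_seasonality=True)
model.fit(history)
forecast = model.predict(model.make_future_dataframe(periods=12, freq="h"))
peak = forecast.tail(12)["yhat"].max()

# Assumption: one replica comfortably serves ~100 req/s
min_replicas = max(2, int(peak / 100))

config.load_kube_config()
client.AutoscalingV2Api().patch_namespaced_horizontal_pod_autoscaler(
    name="api-hpa", namespace="production",
    body={"spec": {"minReplicas": min_replicas}},
)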

6.4 Auto-Remediation

Autonomously recovers within predefined safety boundaries for detected anomalies.

EKS Scenario: Pod Eviction Due to Disk Pressure

Detection: DiskPressure condition activated on Node

AI Agent Response:
1. Clean container image cache on the node (crictl rmi --prune)
2. Clean temporary files
3. Verify DiskPressure condition resolved
4. If not resolved:
├─ Cordon the node (block new Pod scheduling)
├─ Drain existing Pods to other nodes
└─ Karpenter auto-provisions new nodes
5. Escalation: Alert operations team if recurring + recommend root volume size increase

Applied services: Kagent + Strands SOPs, EventBridge + Lambda

Safety Mechanism Design

When implementing auto-remediation, always set up guardrails:

  • Phased execution in production environments (canary → progressive)
  • Save current state snapshot before recovery execution
  • Automatic rollback on recovery failure
  • Limit the number of identical recoveries within a time window (infinite-loop prevention; see the sketch below)
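
A minimal sketch of that last guardrail, assuming a DynamoDB table named remediation-attempts so the counter survives across Lambda invocations; the window and limit values are illustrative:

# Sketch: cap identical (target, action) remediations per time window.
import time
import boto3

table = boto3.resource("dynamodb").Table("remediation-attempts")
WINDOW_SECONDS = 3600
MAX_ATTEMPTS = 3

def allow_remediation(target: str, action: str) -> bool:
    """Return False once the same (target, action) pair exceeds the window limit."""
    now = int(time.time())
    resp = table.update_item(
        Key={"pk": f"{target}#{action}"},
        UpdateExpression=("SET attempts = if_not_exists(attempts, :z) + :one, "
                          "window_start = if_not_exists(window_start, :now)"),
        ExpressionAttributeValues={":z": 0, ":one": 1, ":now": now},
        ReturnValues="ALL_NEW",
    )
    item = resp["Attributes"]
    if now - int(item["window_start"]) > WINDOW_SECONDS:
        # Window expired: reset the counter and allow this attempt
        table.put_item(Item={"pk": f"{target}#{action}",
                             "attempts": 1, "window_start": now})
        return True
    return int(item["attempts"]) <= MAX_ATTEMPTS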

6.5 Node Readiness Controller and Declarative Node Management

Node Readiness Controller (NRC) is a feature introduced as alpha in Kubernetes 1.32 that declaratively manages node Readiness state using CRDs (Custom Resource Definitions). This is an important example showing that the K8s ecosystem is evolving from imperative node management to declarative node management.

Node Readiness Controller from an AIOps Perspective

Limitations of the traditional approach:

Node anomaly detected → Manually run kubectl cordon/drain
Problems:
- Manual intervention required (response delay)
- Inconsistent responses (different procedures per operator)
- Difficult to track node state changes (no audit trail)

NRC-based declarative management:

apiVersion: node.k8s.io/v1alpha1
kind: NodeReadinessRule
metadata:
  name: disk-pressure-auto-taint
spec:
  selector:
    matchExpressions:
      - key: node.kubernetes.io/disk-pressure
        operator: Exists
  taints:
    - key: node.kubernetes.io/disk-pressure
      effect: NoSchedule
    - key: node.kubernetes.io/disk-pressure
      effect: NoExecute
      tolerationSeconds: 300   # Pod eviction after a 5-minute grace period

Now when a DiskPressure condition occurs, NRC automatically adds taints to block new Pod scheduling, and existing Pods are evicted after 5 minutes. Node isolation is possible through declarative policies alone without manual operator intervention.

AIOps Integration Scenario: AI-Based Predictive Node Management

NRC enables proactive node management when combined with AI-based predictive analytics.

Scenario: Preemptive Node Isolation Based on Hardware Failure Prediction

[Phase 1] Anomaly Detection
CloudWatch Agent → Collect node hardware metrics
├─ Gradual decrease in disk IOPS (30% degradation vs normal)
├─ Increase in memory ECC errors (5 occurrences in last hour)
└─ Rising CPU temperature trend (45°C → 62°C)

ML model analysis: "85% probability of hardware failure within 72 hours"

[Phase 2] AI Agent Updates Node Condition
Kagent/Strands Agent sets custom Node Condition:
kubectl annotate node ip-10-0-1-42 predicted-failure=high-risk

[Phase 3] NRC Automatically Manages Taints
NodeReadinessRule detects the Condition → Automatically adds taints
├─ Block new Pod scheduling (NoSchedule)
├─ Existing workloads continue normal operation (grace period)
└─ Karpenter provisions replacement nodes

[Phase 4] Gradual Workload Migration
AI Agent determines priority by workload characteristics:
1. Migrate stateless applications first (no downtime)
2. Stateful workloads wait for maintenance window
3. Remove node after all workloads are migrated

Core Value:

| Traditional Approach | NRC + AIOps Approach |
|---|---|
| Response after failure occurs | Preemptive action before failure occurs |
| Manual cordon/drain | Automatic processing based on declarative policies |
| Inconsistent responses | Standardized responses via CRD |
| Difficult audit trail | Policy version control via Git |
| Potential downtime | Zero downtime through gradual workload migration |

DevOps Agent Integration Patterns

Pattern 1: Node Problem Detector + NRC

Node Problem Detector detects hardware anomaly
→ Node Condition update (DiskPressure, MemoryPressure, etc.)
→ NRC automatically adds taints
→ Karpenter provisions replacement nodes

Pattern 2: AI Prediction + NRC (Proactive)

CloudWatch Agent collects metrics
→ AI model predicts failure
→ DevOps Agent sets custom Node Condition
→ NRC applies declarative policies
→ Zero-downtime workload migration

Pattern 3: Security Event-Based Automatic Isolation

GuardDuty detects abnormal process on node
→ EventBridge → Lambda → Adds security-risk Condition to Node
→ NRC immediately applies NoExecute taint
→ All Pods evicted (preventing security incident spread)
→ Node maintained in isolated state for forensic analysis
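
A sketch of the Lambda step that applies the isolating taint, using the official Kubernetes Python client; it assumes the function already has cluster credentials and receives the node name resolved from the GuardDuty finding:

# Sketch: NoExecute-taint a node so Pods are evicted and none are scheduled.
from kubernetes import client, config

def isolate_node(node_name: str) -> None:
    config.load_incluster_config()  # or config.load_kube_config() when run externally
    v1 = client.CoreV1Api()
    node = v1.read_node(node_name)
    node.spec.taints = (node.spec.taints or []) + [
        client.V1Taint(key="security-risk", value="guardduty", effect="NoExecute")
    ]
    v1.patch_node(node_name, node)  # client serializes the modified object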

Position in the AIOps Maturity Model

| Maturity Level | Node Management Approach | NRC Utilization |
|---|---|---|
| Level 0 (Manual) | Manual cordon/drain | Not applied |
| Level 1 (Reactive) | Node Problem Detector + manual response | Not applied |
| Level 2 (Declarative) | Condition-based automatic taint management with NRC | NRC adoption |
| Level 3 (Predictive) | AI predicts node failure + preemptive isolation with NRC | ✅ AI + NRC integration |
| Level 4 (Autonomous) | DevOps Agent + NRC for fully autonomous node lifecycle management | ✅ Agent + NRC automation |

Evolution of the K8s Ecosystem

Node Readiness Controller demonstrates that the Kubernetes ecosystem is evolving from imperative to declarative, and from reactive to predictive. When NRC is combined with AI-based predictive analytics, workloads can be preemptively migrated before node failures occur, enabling zero-downtime operations. This is an implementation of AIOps' core value — "AI solves problems before humans need to intervene" — in the node management domain.

6.6 Multi-Cluster AIOps Management

Large organizations operate multiple EKS clusters for development, staging, production, and more. To effectively implement AIOps in multi-cluster environments, unified observability, centralized AI insights, and organization-wide governance are required.

Multi-Cluster AIOps Strategy

Key Challenges:

| Challenge | Description | Solution |
|---|---|---|
| Distributed observability | Independent monitoring stacks per cluster | Centralize with CloudWatch Cross-Account Observability |
| Duplicate alerts | The same issue generates individual alerts across multiple clusters | Correlation analysis and unified insights with Amazon Q Developer |
| Inconsistent responses | Different incident response procedures per cluster | Standardized workflows with Bedrock Agent + Strands SOPs |
| Lack of governance | Policy inconsistency across clusters | Unified policies with AWS Organizations + OPA/Kyverno |
| Insufficient cost visibility | Difficult to compare costs across clusters | Integrated dashboard with CloudWatch + Cost Explorer |

1. Centralized Monitoring with CloudWatch Cross-Account Observability

CloudWatch Cross-Account Observability consolidates metrics, logs, and traces from multiple AWS accounts into a single observability account.

Architecture:

[Development Account]      [Staging Account]         [Production Account]
EKS Cluster A              EKS Cluster B             EKS Cluster C
├─ CloudWatch Agent        ├─ CloudWatch Agent       ├─ CloudWatch Agent
└─ ADOT Collector          └─ ADOT Collector         └─ ADOT Collector
        ↓                          ↓                          ↓
        └──────────────────────────┴──────────────────────────┘
                                   ↓
[Observability Account (Central)]
├─ Amazon Managed Prometheus (AMP)
├─ Amazon Managed Grafana (AMG)
├─ CloudWatch Logs Insights (unified logs)
├─ X-Ray (unified traces)
└─ Amazon Q Developer (unified insights)

Setup Method:

# Step 1: Configure Monitoring Account in the Observability account
aws oam create-sink \
--name multi-cluster-observability \
--tags Key=Environment,Value=Production

# Step 2: Create Link from each source account (dev/staging/prod)
aws oam create-link \
--resource-types "AWS::CloudWatch::Metric" \
"AWS::Logs::LogGroup" \
"AWS::XRay::Trace" \
--sink-identifier "arn:aws:oam:us-east-1:123456789012:sink/sink-id" \
--label-template '$AccountName-$Region'

# Step 3: Create unified dashboard in AMG (consolidate all cluster metrics)

Unified Dashboard Example (AMG):

# Grafana Dashboard JSON — Multi-cluster Pod status overview
{
  "title": "Multi-Cluster EKS Overview",
  "panels": [
    {
      "title": "Pod Status Across All Clusters",
      "targets": [
        {
          "expr": "sum by (cluster, namespace, phase) (kube_pod_status_phase{cluster=~\".*\"})",
          "datasource": "AMP-Cross-Account"
        }
      ]
    },
    {
      "title": "Node Health by Cluster",
      "targets": [
        {
          "expr": "sum by (cluster, condition) (kube_node_status_condition{condition=\"Ready\",cluster=~\".*\"})",
          "datasource": "AMP-Cross-Account"
        }
      ]
    }
  ]
}

2. Multi-Cluster Insights with Amazon Q Developer

Amazon Q Developer performs cross-cluster correlation analysis based on unified observability data.

Use Cases:

| Question | Q Developer Analysis | Value |
|---|---|---|
| "Why did latency increase simultaneously across multiple clusters yesterday at 3 PM?" | Analyzes X-Ray traces to identify a CPU spike on a shared RDS instance | No per-cluster investigation needed, immediate root-cause identification |
| "Why is the cost difference between production and staging clusters so large?" | Analyzes Cost Explorer data to discover excessive NAT Gateway costs in production | Cost optimization opportunity discovery |
| "Are we applying the same security policies across all clusters?" | Compares GuardDuty findings to detect weak RBAC settings in the development cluster | Security governance strengthening |

Practical Example: Multi-Cluster Failure Correlation Analysis

Developer: "All production clusters simultaneously had Pods go into CrashLoopBackOff state at 10 AM today. Why?"

Q Developer analysis:
1. Unified log analysis across all clusters with CloudWatch Logs Insights
→ Common pattern: "Failed to pull image: registry.example.com/app:v2.1"

2. Image registry access analysis with X-Ray traces
→ registry.example.com DNS lookup failure (Route 53)

3. Route 53 health check verification with CloudWatch metrics
→ registry.example.com health check changed to UNHEALTHY at 9:58 AM

4. Root cause identification
→ Image registry server TLS certificate expiration

5. Recommended action
→ Renew certificate then restart Pods across all clusters

3. Organization-Wide AIOps Governance Framework

In multi-cluster environments, consistent policy enforcement and standardized response procedures are essential.

Governance Layers
[Layer 1] AWS Organizations — Define account and cluster hierarchy

[Layer 2] Service Control Policies (SCPs) — Organization-wide security policies

[Layer 3] OPA/Kyverno — Per-cluster K8s policies (Pod Security, Network Policy)

[Layer 4] Bedrock Agent Guardrails — AI auto-response safety mechanisms

[Layer 5] CloudTrail + CloudWatch Logs — Audit trail and compliance verification
Standardized Incident Response Workflow

Multi-cluster response automation with Bedrock Agent + Strands SOPs:

# Strands SOP: Multi-cluster Pod CrashLoopBackOff response
from strands import sop  # illustrative SOP decorator, as assumed by this example

# analyze_common_pattern, identify_shared_dependency, fix_shared_dependency,
# restart_affected_pods, and verify_recovery are helpers implemented elsewhere.

@sop(name="multi_cluster_crash_response")
def handle_multi_cluster_crash(event):
    """
    Unified response when the same issue occurs across multiple clusters
    """
    affected_clusters = event['clusters']  # e.g., ['dev', 'staging', 'prod']

    # Step 1: Verify the same pattern across all clusters
    common_error = analyze_common_pattern(affected_clusters)

    if common_error:
        # Step 2: Identify the common root cause (e.g., external dependency failure)
        root_cause = identify_shared_dependency(common_error)

        # Step 3: Resolve the root cause centrally
        fix_shared_dependency(root_cause)

        # Step 4: Propagate automatic recovery to all clusters
        for cluster in affected_clusters:
            restart_affected_pods(cluster)
            verify_recovery(cluster)

        return {
            'status': 'resolved',
            'root_cause': root_cause,
            'affected_clusters': affected_clusters
        }
    else:
        # Step 5: Individual per-cluster response needed
        return {
            'status': 'escalated',
            'message': 'No common pattern found, escalating to ops team'
        }

Multi-Cluster Policy Standardization (OPA)
# OPA Policy: Apply identical Pod Security Standards across all clusters
package kubernetes.admission

deny[msg] {
    input.request.kind.kind == "Pod"
    not input.request.object.spec.securityContext.runAsNonRoot

    msg := sprintf("Pod %v must run as non-root user (Organization Policy)", [input.request.object.metadata.name])
}

deny[msg] {
    input.request.kind.kind == "Pod"
    container := input.request.object.spec.containers[_]
    not container.securityContext.allowPrivilegeEscalation == false

    msg := sprintf("Container %v must set allowPrivilegeEscalation to false (Organization Policy)", [container.name])
}

4. Multi-Cluster Cost Optimization

CloudWatch + Cost Explorer integrated analysis:

-- CloudWatch Logs Insights: Cost driver analysis by cluster
fields @timestamp, cluster_name, namespace, pod_name, node_type, cost_per_hour
| filter event_type = "pod_usage"
| stats sum(cost_per_hour) as total_cost by cluster_name, namespace
| sort total_cost desc
| limit 10
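
The same query can also be run programmatically, for example from a scheduled Lambda that feeds a cost report. A minimal boto3 sketch follows; the log group name and the pod_usage event schema are assumptions carried over from the query above:

# Sketch: run the per-cluster cost-driver query with boto3.
# The log group name and the pod_usage/cost_per_hour schema are assumptions
# matching the Logs Insights query above.
import time
import boto3

logs = boto3.client("logs")

QUERY = """
fields @timestamp, cluster_name, namespace, pod_name, node_type, cost_per_hour
| filter event_type = "pod_usage"
| stats sum(cost_per_hour) as total_cost by cluster_name, namespace
| sort total_cost desc
| limit 10
"""

start = logs.start_query(
    logGroupName="/aws/containerinsights/cost-events",  # hypothetical log group
    startTime=int(time.time()) - 7 * 24 * 3600,         # last 7 days
    endTime=int(time.time()),
    queryString=QUERY,
)

# Poll until the query completes, then print the top cost drivers
while True:
    result = logs.get_query_results(queryId=start["queryId"])
    if result["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in result.get("results", []):
    print({f["field"]: f["value"] for f in row})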

AI-based cost optimization insights (Q Developer):

Question: "Analyze cost growth rates by cluster for the past month and suggest optimization opportunities"

Q Developer analysis:
1. Cost Explorer data analysis
- Cluster A (dev): +5% (normal range)
- Cluster B (staging): +120% (abnormal surge)
- Cluster C (prod): +15% (within expected range due to traffic growth)

2. Cost surge root cause analysis for Cluster B
- CloudWatch metrics: GPU instance (g5.xlarge) usage spike
- Log analysis: ML team running experimental workloads long-term in staging

3. Optimization recommendations
- Switch ML workloads to Spot Instances (estimated 70% cost reduction)
- Apply Karpenter to staging cluster for automatic idle node removal
- Auto scale-down development cluster during off-hours (nights/weekends)
Core Value

The key to multi-cluster AIOps is managing distributed infrastructure with a unified perspective. By centralizing data with CloudWatch Cross-Account Observability, analyzing cross-cluster correlations with Amazon Q Developer, and implementing standardized automated responses with Bedrock Agent and Strands, operational complexity does not increase linearly even as the number of clusters grows.

6.7 EventBridge-Based AI Auto-Response Patterns

Amazon EventBridge is a serverless event bus that connects events from AWS services, applications, and SaaS providers to build event-driven architectures. By integrating with EKS, you can build AI Agent workflows that automatically respond to cluster events.

EventBridge + EKS Event Integration Architecture

You can trigger automated response workflows by forwarding Kubernetes events from EKS clusters to EventBridge. Note that EKS does not publish Pod-level Kubernetes events to EventBridge natively, so this pattern typically relies on a custom event forwarder running in the cluster (or on rules over CloudWatch-collected events).

[EKS Cluster]
├─ Pod state changes (CrashLoopBackOff, OOMKilled, ImagePullBackOff)
├─ Node state changes (NotReady, DiskPressure, MemoryPressure)
├─ Scaling events (HPA scale up/down, Karpenter node add/remove)
└─ Security alerts (GuardDuty Findings, abnormal API calls)

[EventBridge Event Bus]
Event collection and routing

[EventBridge Rules]
Event pattern matching + filtering

[Response Workflows]
├─ Lambda → Kagent/Strands Agent invocation → Automatic diagnosis/recovery
├─ Step Functions → Multi-stage automated response workflows
├─ SNS/SQS → Notifications or asynchronous processing
└─ CloudWatch Logs → Audit and analysis

Key Event Types and Response Patterns

| Event Type | Detection Condition | Auto-Response Pattern |
|---|---|---|
| Pod CrashLoopBackOff | Pod restart count > 5 | AI Agent analyzes logs → identifies root cause → automatic rollback or config fix |
| Node NotReady | Node state change | Karpenter trigger → new node provisioning, existing Pod drain |
| OOMKilled | Pod terminated due to memory shortage | AI Agent analyzes memory usage patterns → auto-adjusts HPA/VPA settings |
| ImagePullBackOff | Image pull failure | Lambda verifies ECR permissions → auto-fix or alert |
| DiskPressure | Node disk usage > 85% | Lambda cleans image cache → deletes temp files |
| GuardDuty Finding | Security threat detected | Step Functions → Pod isolation → forensic data collection → alert |

AI Agent Integration Patterns

Pattern 1: EventBridge → Lambda → AI Agent (Kagent/Strands)

Workflow:

1. EKS event occurs: Pod CrashLoopBackOff

2. EventBridge Rule match: "Pod.status.phase == 'CrashLoopBackOff'"

3. Lambda function execution:
- Collect Pod logs via EKS MCP
- Collect metrics via CloudWatch MCP
- Collect traces via X-Ray MCP

4. Kagent/Strands Agent invocation:
- AI analyzes collected context
- Root cause identification (e.g., missing ConfigMap, environment variable error)
- Execute automatic recovery or alert operations team

5. Result recording:
- Save diagnosis results to CloudWatch Logs
- Close event on successful recovery
- Escalate on recovery failure

Lambda Function Example (Python):

import json
from kagent import KagentClient  # hypothetical Kagent SDK client

# get_pod_logs, apply_remediation, and notify_ops_team are helpers
# implemented elsewhere in the Lambda package.
kagent = KagentClient()

def lambda_handler(event, context):
    # Extract Pod information from the EventBridge event
    detail = event['detail']
    pod_name = detail['pod_name']
    namespace = detail['namespace']
    cluster_name = detail['cluster_name']

    # Collect Pod logs (last 100 lines)
    logs = get_pod_logs(cluster_name, namespace, pod_name, tail=100)

    # Request diagnosis from Kagent
    diagnosis = kagent.diagnose(
        context={
            'pod_name': pod_name,
            'namespace': namespace,
            'logs': logs,
            'event_type': 'CrashLoopBackOff'
        },
        instruction="Analyze the root cause and suggest remediation"
    )

    # Execute AI-suggested remediation actions
    if diagnosis.confidence > 0.8:
        apply_remediation(diagnosis.remediation_steps)
        return {'status': 'auto_remediated', 'diagnosis': diagnosis}
    else:
        # Alert operations team when confidence is low
        notify_ops_team(diagnosis)
        return {'status': 'escalated', 'diagnosis': diagnosis}

Pattern 2: EventBridge → Step Functions → Multi-Stage Auto-Response

Workflow (Node NotReady Response):

{
  "Comment": "EKS node failure automatic recovery workflow",
  "StartAt": "VerifyNodeStatus",
  "States": {
    "VerifyNodeStatus": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:function:VerifyNodeStatus",
      "Next": "IsNodeRecoverable"
    },
    "IsNodeRecoverable": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.recoverable",
          "BooleanEquals": true,
          "Next": "AttemptNodeRestart"
        },
        {
          "Variable": "$.recoverable",
          "BooleanEquals": false,
          "Next": "CordonAndDrainNode"
        }
      ]
    },
    "AttemptNodeRestart": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:function:RestartNode",
      "Next": "WaitForNodeReady"
    },
    "WaitForNodeReady": {
      "Type": "Wait",
      "Seconds": 60,
      "Next": "CheckNodeRecovered"
    },
    "CheckNodeRecovered": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:function:CheckNodeStatus",
      "Next": "NodeRecovered"
    },
    "NodeRecovered": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.status",
          "StringEquals": "Ready",
          "Next": "Success"
        },
        {
          "Variable": "$.status",
          "StringEquals": "NotReady",
          "Next": "CordonAndDrainNode"
        }
      ]
    },
    "CordonAndDrainNode": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:function:CordonAndDrain",
      "Next": "TriggerKarpenter"
    },
    "TriggerKarpenter": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:function:TriggerNodeReplacement",
      "Next": "Success"
    },
    "Success": {
      "Type": "Succeed"
    }
  }
}
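
To register this definition as a state machine, a short boto3 call is enough. A minimal sketch, assuming the JSON above is saved as node-recovery.json and an appropriate execution role already exists:

# Sketch: register the node-recovery workflow as a Step Functions state machine.
# The file name and role ARN are placeholders; the role must allow
# states.amazonaws.com to invoke the referenced Lambda functions.
import boto3

sfn = boto3.client("stepfunctions")

with open("node-recovery.json") as f:
    definition = f.read()

response = sfn.create_state_machine(
    name="eks-node-recovery",
    definition=definition,
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsEKSRecoveryRole",
    type="STANDARD",  # long-running recovery flows fit the Standard workflow type
)
print("State machine ARN:", response["stateMachineArn"])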

ML Inference Workload Network Performance Observability

ML inference workloads (Ray, vLLM, Triton, PyTorch, etc.) have different network characteristics from general workloads due to GPU-to-GPU communication, model parallelization, and distributed inference.

Unique Observability Requirements for ML Workloads:

| Metric | General Workloads | ML Inference Workloads |
|---|---|---|
| Network bandwidth | Medium (API calls) | Very high (model weights, tensor transfers) |
| Latency sensitivity | High (user-facing) | Very high (real-time inference SLA) |
| Packet drop impact | Recovery after retransmission | Inference failure or timeout |
| East-West traffic | Low (mostly North-South) | Very high (inter-GPU node communication) |
| Network pattern | Request-response | Burst + sustained (model loading, inference, result aggregation) |

Container Network Observability Data Utilization:

EKS Container Network Observability collects the following network metrics:

  • Pod-to-Pod network throughput (bytes/sec)
  • Network latency (p50, p99)
  • Packet drop rate
  • Retransmission rate
  • TCP connection state

ML Inference Workload Monitoring Example:

# Prometheus alert rules — detecting network bottlenecks in vLLM workloads
apiVersion: v1
kind: ConfigMap
metadata:
  name: ml-network-alerts
data:
  alerts.yaml: |
    groups:
    - name: ml_inference_network
      rules:
      # Abnormal inter-GPU node network latency
      - alert: HighInterGPULatency
        expr: |
          container_network_latency_p99{
            workload="vllm-inference",
            direction="pod-to-pod"
          } > 10
        for: 5m
        annotations:
          summary: "Inter-GPU node network latency spike"
          description: "Inter-node latency for the vLLM inference workload has exceeded 10ms. This may affect model parallelization performance."

      # Network bandwidth saturation
      - alert: NetworkBandwidthSaturation
        expr: |
          rate(container_network_transmit_bytes{
            workload="ray-cluster"
          }[5m]) > 9e9  # 9 GB/s — set this to ~90% of your instance's network bandwidth
        for: 2m
        annotations:
          summary: "Ray cluster network bandwidth saturation"
          description: "Network bandwidth has exceeded ~90% of capacity. Consider enabling ENA Express or EFA."

EventBridge Rule: ML Network Anomaly Auto-Response

{
  "source": ["aws.cloudwatch"],
  "detail-type": ["CloudWatch Alarm State Change"],
  "detail": {
    "alarmName": ["HighInterGPULatency", "NetworkBandwidthSaturation"],
    "state": {
      "value": ["ALARM"]
    }
  }
}

Auto-response actions:

  1. Lambda function: Analyze Container Network Observability data → identify bottleneck segments
  2. AI Agent: Root cause diagnosis (CNI configuration, ENI allocation, cross-AZ communication, etc.)
  3. Automatic optimization: Enable ENA Express, configure Prefix Delegation, adjust Pod topology
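
Wiring this up requires the rule to have a target, and the target Lambda must be invocable by EventBridge. A minimal boto3 sketch with placeholder names and ARNs:

# Sketch: create the EventBridge rule for the ML network alarms and attach
# the analyzer Lambda as its target. Rule name, Lambda ARN, and statement ID
# are placeholders.
import json
import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

RULE_NAME = "ml-network-anomaly"
LAMBDA_ARN = "arn:aws:lambda:us-east-1:123456789012:function:ml-network-analyzer"

rule = events.put_rule(
    Name=RULE_NAME,
    EventPattern=json.dumps({
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
        "detail": {
            "alarmName": ["HighInterGPULatency", "NetworkBandwidthSaturation"],
            "state": {"value": ["ALARM"]},
        },
    }),
    State="ENABLED",
)

events.put_targets(
    Rule=RULE_NAME,
    Targets=[{"Id": "ml-network-analyzer", "Arn": LAMBDA_ARN}],
)

# EventBridge also needs permission to invoke the function
lambda_client.add_permission(
    FunctionName=LAMBDA_ARN,
    StatementId="eventbridge-ml-network-anomaly",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)
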
GPU Workload Specifics

For GPU-based ML inference workloads, the network is often the primary performance bottleneck. Model weights (several GB), intermediate tensors (hundreds of MB), and result aggregation can demand 10-100x more network bandwidth than general workloads. Container Network Observability makes these patterns visible, and EventBridge-based auto-optimization enables real-time response.

EventBridge Rule Example: Pod CrashLoopBackOff Auto-Response

EventBridge Rule Definition (JSON — the "EKS Pod State Change" detail-type assumes the custom event forwarder described above):

{
  "source": ["aws.eks"],
  "detail-type": ["EKS Pod State Change"],
  "detail": {
    "clusterName": ["production-cluster"],
    "namespace": ["default", "production"],
    "eventType": ["Warning"],
    "reason": ["BackOff", "CrashLoopBackOff"],
    "involvedObject": {
      "kind": ["Pod"]
    }
  }
}

Response Workflow (Lambda + AI Agent):

# Lambda function: EKS event → AI Agent automatic diagnosis
import json
from strands import StrandsAgent  # illustrative Strands SDK entry point

# log_remediation and notify_ops_team are helpers implemented elsewhere
# in the Lambda package.

def lambda_handler(event, context):
    detail = event['detail']

    # Extract event information
    cluster_name = detail['clusterName']
    namespace = detail['namespace']
    pod_name = detail['involvedObject']['name']
    reason = detail['reason']

    # Initialize Strands Agent (MCP integration)
    agent = StrandsAgent(
        mcp_servers=['eks-mcp', 'cloudwatch-mcp', 'xray-mcp']
    )

    # Request diagnosis from the AI Agent
    diagnosis_result = agent.run(
        sop_name="eks_pod_crashloop_diagnosis",
        context={
            'cluster': cluster_name,
            'namespace': namespace,
            'pod': pod_name,
            'reason': reason
        }
    )

    # Auto-remediate or escalate based on the diagnosis result
    if diagnosis_result.auto_remediable:
        # Execute auto-remediation
        remediation_result = agent.run(
            sop_name="eks_pod_auto_remediation",
            context=diagnosis_result.remediation_plan
        )

        # Record results in CloudWatch Logs
        log_remediation(diagnosis_result, remediation_result)

        return {
            'statusCode': 200,
            'body': json.dumps({
                'status': 'auto_remediated',
                'diagnosis': diagnosis_result.summary,
                'remediation': remediation_result.summary
            })
        }
    else:
        # Alert operations team (SNS)
        notify_ops_team(diagnosis_result)

        return {
            'statusCode': 200,
            'body': json.dumps({
                'status': 'escalated',
                'diagnosis': diagnosis_result.summary,
                'reason': diagnosis_result.escalation_reason
            })
        }

Strands Agent SOP Example (YAML):

# eks_pod_crashloop_diagnosis.yaml
name: eks_pod_crashloop_diagnosis
description: "EKS Pod CrashLoopBackOff automatic diagnosis"
version: "1.0"

steps:
  - name: collect_pod_logs
    action: mcp_call
    mcp_server: eks-mcp
    tool: get_pod_logs
    params:
      cluster: "{{context.cluster}}"
      namespace: "{{context.namespace}}"
      pod: "{{context.pod}}"
      tail_lines: 100
    output: pod_logs

  - name: collect_pod_events
    action: mcp_call
    mcp_server: eks-mcp
    tool: get_pod_events
    params:
      cluster: "{{context.cluster}}"
      namespace: "{{context.namespace}}"
      pod: "{{context.pod}}"
    output: pod_events

  - name: collect_metrics
    action: mcp_call
    mcp_server: cloudwatch-mcp
    tool: get_pod_metrics
    params:
      cluster: "{{context.cluster}}"
      namespace: "{{context.namespace}}"
      pod: "{{context.pod}}"
      duration: "15m"
    output: pod_metrics

  - name: analyze_root_cause
    action: llm_analyze
    model: claude-opus-4
    prompt: |
      Analyze the following EKS Pod CrashLoopBackOff incident:

      Pod Logs:
      {{pod_logs}}

      Pod Events:
      {{pod_events}}

      Metrics:
      {{pod_metrics}}

      Identify the root cause and suggest remediation.
      Format: JSON with fields 'root_cause', 'confidence', 'remediation_steps', 'auto_remediable'
    output: diagnosis

  - name: return_result
    action: return
    value: "{{diagnosis}}"

Value of EventBridge + AI Agent

EventBridge-based auto-response patterns enable detecting, diagnosing, and recovering from incidents in seconds without human intervention. When integrated with AI Agents (Kagent, Strands), it goes beyond simple rule-based responses to enable intelligent automation that understands context and identifies root causes. This is the key difference between traditional automation (Runbook-as-Code) and AIOps.


7. AWS AIOps Service Map

🗺️ AWS AIOps Services Map

| Service | Category | Description | Key Capabilities |
|---|---|---|---|
| DevOps Guru | Detection | ML anomaly detection, EKS resource group analysis | ML anomaly detection · EKS resource groups · auto alerts |
| CloudWatch Application Signals | Observability | Zero-code instrumentation, auto SLI/SLO setup | Zero-code instrumentation · SLI/SLO · auto dashboards |
| CloudWatch Investigations | Analysis | AI root cause analysis, auto incident investigation | AI root cause analysis · auto incident investigation · correlation analysis |
| Amazon Q Developer | Automation | EKS troubleshooting, code generation/review | EKS troubleshooting · code generation · auto review |
| CloudWatch AI NL Querying | Analysis | Natural language metric/log queries | Natural language query · metric analysis · log search |
| AWS Hosted MCP Servers | Automation | EKS/Cost/Serverless MCP, AI tool integration | EKS MCP · Cost MCP · Serverless MCP · AI tool integration |

Integration Flow Between Services

AWS AIOps services provide value independently, but synergy is maximized when used together:

  1. CloudWatch Observability Agent → Metrics/logs/traces collection
  2. Application Signals → Service map + automatic SLI/SLO generation
  3. DevOps Guru → ML anomaly detection + recommended actions
  4. CloudWatch Investigations → AI root cause analysis
  5. Q Developer → Natural language-based troubleshooting
  6. Hosted MCP → Direct AWS resource access from AI tools
When Using 3rd Party Observability Stacks

Even in environments using 3rd party solutions like Datadog, Sumo Logic, or Splunk, you can send the same data to 3rd party backends by using ADOT (OpenTelemetry) as the collection layer. Since the MCP integration layer abstracts backend selection, AI tools and Agents work identically regardless of which observability stack is used.

7.7 CloudWatch Generative AI Observability

Announced: July 2025 Preview, October 2025 GA

Core Value: Goes beyond the traditional 3-Pillar observability (Metrics/Logs/Traces) by adding a fourth Pillar: AI workload-specific observability.

LLM and AI Agent Workload Monitoring

CloudWatch Generative AI Observability provides unified monitoring for LLM and AI Agent workloads running on any infrastructure — Amazon Bedrock, EKS, ECS, on-premises, and more.

Key Features:

| Feature | Description |
|---|---|
| Token consumption tracking | Real-time tracking of prompt tokens, completion tokens, and total token usage |
| Latency analysis | Latency measurement for LLM calls, Agent tool execution, and the full chain |
| End-to-end tracing | Flow tracking across the entire AI stack (prompt → LLM → tool calls → response) |
| Hallucination risk path detection | Identification of execution paths with high hallucination risk |
| Retrieval miss identification | Detection of knowledge base search failures in RAG pipelines |
| Rate-limit retry monitoring | Tracking retry patterns caused by API rate limits |
| Model switching decision tracking | Visibility into model selection logic in multi-model strategies |

Amazon Bedrock AgentCore and External Framework Compatibility

Native integration:

  • Amazon Bedrock Data Automation MCP Server integration
  • Automatic instrumentation through AgentCore Gateway
  • Automatic observability data injection into PRs via GitHub Actions

External framework support:

  • LangChain
  • LangGraph
  • CrewAI
  • Other OpenTelemetry-based Agent frameworks

Unique Requirements for AI Observability

Unlike traditional application monitoring, AI workloads require the following unique metrics:

Traditional monitoring:
CPU/Memory/Network → Request count → Response time → Error rate

AI workload monitoring:
Above items + Token consumption + Model latency + Tool execution success rate +
Retrieval accuracy + Hallucination frequency + Context window utilization

Usage Scenario on EKS:

# AI Agent workload running on EKS
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-customer-support-agent
spec:
  selector:
    matchLabels:
      app: ai-customer-support-agent
  template:
    metadata:
      labels:
        app: ai-customer-support-agent
    spec:
      containers:
      - name: agent
        image: my-ai-agent:latest
        env:
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: "http://adot-collector:4317"
        - name: CLOUDWATCH_AI_OBSERVABILITY_ENABLED
          value: "true"

Once the Agent is running, CloudWatch automatically collects:

  • Full trace from customer inquiry → LLM call → knowledge base search → response generation
  • Token consumption and cost at each step
  • Paths with high hallucination probability (e.g., LLM answering with general knowledge after a Retrieval Miss)
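
Under the hood, this data arrives as OpenTelemetry traces. The sketch below shows what minimal agent-side instrumentation might look like, exporting to the ADOT collector endpoint from the Deployment above; the gen_ai.* attribute names follow the still-evolving OpenTelemetry GenAI semantic conventions, and call_llm is a hypothetical LLM client call:

# Sketch: emit an LLM-call span with token-usage attributes to the ADOT
# collector configured in the Deployment above. Attribute names follow the
# (still-evolving) OpenTelemetry GenAI semantic conventions.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://adot-collector:4317"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("ai-customer-support-agent")

def answer_inquiry(prompt: str) -> str:
    with tracer.start_as_current_span("llm.call") as span:
        span.set_attribute("gen_ai.request.model", "claude-sonnet")
        # response, in_tokens, out_tokens = call_llm(prompt)  # hypothetical client
        response, in_tokens, out_tokens = "...", 812, 244
        span.set_attribute("gen_ai.usage.input_tokens", in_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", out_tokens)
        return response
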
AI Observability is Key to Cost Optimization

LLM API calls are billed per token. CloudWatch Gen AI Observability visualizes which prompts consume excessive tokens and which tool combinations are inefficient, enabling 20-40% cost reduction for AI workloads.


7.8 GuardDuty Extended Threat Detection — EKS Security Observability

Announced: June 2025 EKS support, December 2025 EC2/ECS expansion

Core Value: Integrates security anomaly detection with operational anomaly detection to achieve holistic observability.

AI/ML-Based Multi-Stage Attack Detection

GuardDuty Extended Threat Detection correlates multiple data sources to detect sophisticated attacks that are easily missed by traditional security monitoring.

Correlated Data Sources:

| Data Source | Detection Content |
|---|---|
| EKS audit logs | Abnormal API call patterns (e.g., privilege escalation attempts, unauthorized Secret access) |
| Runtime behavior | Abnormal process execution within containers, unexpected network connections |
| Malware execution | Detection of known/unknown malware signatures |
| AWS API activity | Temporal correlation analysis between CloudTrail events and EKS activity |

Attack Sequence Findings — Multi-Resource Threat Identification

Limitations of single event detection:

Traditional security monitoring:
Event 1: Pod connects to external IP → Alert
Event 2: IAM role temporary credential request → Alert
Event 3: S3 bucket object listing → Alert

Problem: Each event may appear normal individually → False positives

Attack Sequence Findings approach:

GuardDuty AI analysis:
Event 1 + Event 2 + Event 3 connected temporally and logically
→ "Data Exfiltration attack sequence" detected
→ Single Critical Severity Finding generated

GuardDuty automatically identifies attack chains spanning multiple resources (Pods, nodes, IAM roles, S3 buckets) and data sources (EKS logs, CloudTrail, VPC Flow Logs).

Real-World Case: November 2025 Cryptomining Campaign Detection

Background: A large-scale cryptomining attack campaign targeting Amazon EC2 and ECS began on November 2, 2025.

Attack Sequence:

  1. Initial intrusion: Exploiting publicly available vulnerable container images
  2. Privilege acquisition: IAM credential theft via IMDS (Instance Metadata Service)
  3. Lateral movement: Starting other EC2 instances/ECS tasks with acquired credentials
  4. Cryptomining execution: Deploying mining software on high-performance instances

GuardDuty Detection Mechanism:

| Detection Stage | Method |
|---|---|
| Abnormal behavior identification | Container attempting unexpected connections to external mining pools |
| Credential misuse detection | Surge in IMDS call frequency + API calls at abnormal hours |
| Resource spike correlation analysis | 100% CPU usage + known mining process signatures |
| Attack chain reconstruction | Connecting events in temporal order to present the complete attack scenario |

Result: GuardDuty detected the attack automatically, AWS warned affected customers, and potential losses of millions of dollars were avoided.


AIOps Perspective: Integration of Security Observability

Traditional separation model:

Security team → GuardDuty, Security Hub
Operations team → CloudWatch, Prometheus
Result: Security anomalies and operational anomalies reported separately → Delayed correlation

AIOps integrated model:

GuardDuty Extended Threat Detection (security anomalies)
        ↓
CloudWatch Investigations (AI root cause analysis)
        ↓
Operational metrics (CPU, memory, network) + security events analyzed together
        ↓
Automatic determination: "CPU spike caused by cryptomining, not normal traffic"

Usage on EKS:

# Enable GuardDuty Extended Threat Detection for EKS
aws guardduty create-detector \
--enable \
--features '[{"Name":"EKS_RUNTIME_MONITORING","Status":"ENABLED"}]'

# Forward detected threats to CloudWatch Events
aws events put-rule \
--name guardduty-eks-threats \
--event-pattern '{"source":["aws.guardduty"],"detail-type":["GuardDuty Finding"]}'

Once enabled, GuardDuty continuously monitors all workloads in the EKS cluster, and AI automatically performs the first analysis stage, significantly reducing the operations team's response time.
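
To complete the pipeline from the response-pattern table in 6.7 (GuardDuty Finding → Step Functions → Pod isolation), the guardduty-eks-threats rule created above still needs a target. A minimal boto3 sketch with placeholder ARNs:

# Sketch: attach a Step Functions isolation workflow as the target of the
# guardduty-eks-threats rule created above. Both ARNs are placeholders;
# the role must allow events.amazonaws.com to start executions.
import boto3

events = boto3.client("events")

events.put_targets(
    Rule="guardduty-eks-threats",
    Targets=[{
        "Id": "pod-isolation-workflow",
        "Arn": "arn:aws:states:us-east-1:123456789012:stateMachine:IsolateCompromisedPod",
        "RoleArn": "arn:aws:iam::123456789012:role/EventBridgeInvokeStepFunctions",
    }],
)
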

Security Observability = Operational Observability

Security anomalies (e.g., cryptomining) often manifest first as operational anomalies (e.g., CPU spikes, network traffic anomalies). When GuardDuty Extended Threat Detection is integrated with CloudWatch, the operations team can immediately get the answer "security threat" to the question "Why is this Pod's CPU at 100%?"


For detailed observability stack construction methods and stack selection patterns, refer to 2. Intelligent Observability Stack.


8. AIOps Maturity Model

📊 AIOps Maturity Model
Evolution: Level 0 (Manual) → Level 4 (Autonomous)

| Level | Stage | Characteristics | Key Tools |
|---|---|---|---|
| Level 0 | Manual | Manual monitoring, kubectl-based, reactive to failures | kubectl · manual dashboards · manual alerts |
| Level 1 | Reactive | Managed Add-ons + AMP/AMG, dashboard-based alerting | Managed Add-ons · AMP · AMG · dashboard alerts |
| Level 2 | Declarative | Managed Argo CD + ACK + KRO, GitOps declarative automation | Argo CD · ACK · KRO · GitOps |
| Level 3 | Predictive | CloudWatch AI + Q Developer, ML anomaly detection + predictive analytics | CloudWatch AI · Q Developer · ML anomaly detection · predictive analytics |
| Level 4 | Autonomous | Kiro + MCP + AI Agent expansion, autonomous operations | Kiro · MCP · Q Developer · Strands · Kagent |

Maturity Level Transition Guide

Level 0 → Level 1 Transition (Fastest ROI)

You can establish an observability foundation just by adopting Managed Add-ons and AMP/AMG. Deploy ADOT and CloudWatch Observability Agent with the aws eks create-addon command, and build centralized dashboards with AMP/AMG.

# Level 1 start: Deploy core observability add-ons
aws eks create-addon --cluster-name my-cluster --addon-name adot
aws eks create-addon --cluster-name my-cluster --addon-name amazon-cloudwatch-observability
aws eks create-addon --cluster-name my-cluster --addon-name eks-node-monitoring-agent

Level 1 → Level 2 Transition (Automation Foundation)

Introduce GitOps with Managed Argo CD, and declaratively manage AWS resources as K8s CRDs with ACK. Configuring composite resources as single deployment units with KRO greatly improves infrastructure change consistency and traceability.

Level 2 → Level 3 Transition (Intelligent Analysis)

Enable CloudWatch AI and DevOps Guru to start ML-based anomaly detection and predictive analytics. Introduce AI root cause analysis with CloudWatch Investigations, and leverage natural language-based troubleshooting with Q Developer.

Level 3 → Level 4 Transition (Autonomous Operations)

Build a programmatic operations framework with Kiro + Hosted MCP, and deploy Kagent/Strands Agents to enable AI to autonomously handle incident response, deployment verification, and resource optimization.

Gradual Adoption Recommended

Do not attempt to leap from Level 0 to Level 4 all at once. It is more likely to succeed when you accumulate sufficient operational experience and data at each level before transitioning to the next. The safety verification of AI autonomous recovery is especially critical for the Level 3 → Level 4 transition.


9. ROI Assessment

💰 AIOps ROI Key Metrics

| Improvement | Metric | Before → After |
|---|---|---|
| -81% | MTTR | 4 hours → 45 min |
| -90% | MTTD | 30 min → 3 min |
| -90% | Alert noise | 500/day → 50/day |
| -35% | Cost | Over-provisioning → AI right-sizing |

ROI Assessment Framework

A framework for systematically evaluating the ROI of AIOps adoption.

Quantitative Metrics

AIOps ROI Quantitative Metrics — measurable improvement results:

| Metric | Measurement Method | Target Improvement |
|---|---|---|
| MTTD (Mean Time to Detect) | Time from anomaly occurrence to detection | 80-90% reduction |
| MTTR (Mean Time to Resolve) | Time from detection to resolution | 70-80% reduction |
| Alert Noise Reduction | Ratio of daily alerts requiring action | 80-90% reduction |
| Incident Recurrence Rate | Recurrence rate of the same incident type | 60-70% reduction |
| Cost Efficiency | Actual utilization vs infrastructure cost | 30-40% improvement |

Measurement Baseline: Calculate improvement rate by comparing 3-month average before vs after AIOps adoption. Also track qualitative metrics (ops team satisfaction, deployment confidence, etc.).

Qualitative Metrics

  • Operations team satisfaction: Reduced repetitive tasks, focus on strategic work
  • Deployment confidence: Improved deployment quality through automated verification
  • Incident response quality: Increased root cause resolution rate
  • Knowledge management: AI Agents learn response patterns to accumulate organizational knowledge

Cost Structure Considerations

AIOps Cost Structure Considerations — key cost items and optimization methods:

| Cost Item | Description | Optimization Method |
|---|---|---|
| AMP ingestion | Based on metric sample count | Filter unnecessary metrics, adjust collection frequency |
| AMG users | Based on active user count | SSO integration, viewer/editor role separation |
| DevOps Guru | Based on analyzed resource count | Enable only core resource groups |
| CloudWatch | Based on log/metric volume | Filter logs, adjust metric resolution |

Cost Optimization Strategy: Initially validate value with full activation, then gradually optimize costs by filtering unnecessary metrics and logs based on data. Analyze cost structure using AWS Cost Explorer and CloudWatch Contributor Insights.

9.1 AIOps ROI In-Depth Analysis Model

An in-depth analysis model for quantitatively and qualitatively evaluating the value of AIOps adoption. It goes beyond simple cost reduction to encompass improvements in organizational agility and innovation capabilities.

Quantitative ROI Calculation Formulas

1. Incident Response Cost Reduction

Annual savings from MTTR reduction = (Previous MTTR - New MTTR) × Annual incident count × Hourly response cost

Practical example:
- Previous MTTR: Average 2 hours
- MTTR after AIOps adoption: Average 20 minutes (0.33 hours)
- Annual P1/P2 incidents: 120
- Hourly response cost: $150 (operations team of 3 × $50/hour)

Savings = (2 - 0.33) × 120 × $150 = $30,060/year

2. Business Loss Reduction from Outages

Annual downtime loss reduction = (Previous annual downtime - New annual downtime) × Hourly revenue loss

Practical example:
- Previous annual downtime: 8 hours (MTTR 2 hours × 4 major outages per year)
- After AIOps adoption: 1.3 hours (MTTR 20 minutes × the same 4 outages)
- Hourly revenue loss: $50,000 (assuming e-commerce platform)

Loss reduction = (8 - 1.3) × $50,000 = $335,000/year

3. Personnel Efficiency Gains from Operations Automation

Operations team productivity improvement value = Saved repetitive task hours × Hourly labor cost × Strategic work value multiplier

Practical example:
- Automated repetitive tasks: 40 hours per week (4 people × 10 hours/week)
- Hourly labor cost: $50
- Strategic work value multiplier: 1.5x (strategic work is 50% more valuable than repetitive tasks)

Annual value = 40 × 52 × $50 × 1.5 = $156,000/year

4. Infrastructure Cost Reduction from Predictive Scaling

Annual infrastructure cost savings = Unnecessary over-provisioning cost - Cost after prediction-based optimization

Practical example:
- Previous: Always 3x over-provisioned for peak handling → $30,000/month
- AIOps predictive scaling: Auto scale-up 5 minutes before peak → average 1.2x provisioning → $12,000/month

Savings = ($30,000 - $12,000) × 12 = $216,000/year

Comprehensive Quantitative ROI:

| Item | Annual Savings/Value |
|---|---|
| Incident response cost reduction | $30,060 |
| Downtime loss reduction | $335,000 |
| Operations team productivity improvement | $156,000 |
| Infrastructure cost reduction | $216,000 |
| Total annual value | $737,060 |

AIOps Adoption Costs:

| Item | Annual Cost |
|---|---|
| AWS managed services (AMP/AMG/DevOps Guru) | $50,000 |
| Bedrock Agent API call costs | $20,000 |
| Additional CloudWatch log/metric storage | $10,000 |
| Initial implementation consulting (one-time) | $30,000 |
| Total annual cost | $110,000 |

ROI Calculation:

ROI = (Total annual value - Total annual cost) / Total annual cost × 100%
= ($737,060 - $110,000) / $110,000 × 100%
= 570%

Payback period = Total annual cost / Monthly average value
= $110,000 / ($737,060 / 12)
= 1.8 months
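
The four formulas fold naturally into a small calculator so an organization can plug in its own numbers. A sketch whose defaults reproduce the worked example above:

# Sketch: AIOps ROI calculator implementing the four formulas above.
# Defaults reproduce the worked example (~$737K value, 570% ROI).
from dataclasses import dataclass

@dataclass
class RoiInputs:
    mttr_before_h: float = 2.0          # previous average MTTR (hours)
    mttr_after_h: float = 0.33          # MTTR after AIOps (hours)
    incidents_per_year: int = 120       # annual P1/P2 incidents
    response_cost_per_h: float = 150    # ops response cost ($/hour)
    downtime_before_h: float = 8.0      # previous annual downtime (hours)
    downtime_after_h: float = 1.3       # annual downtime after AIOps (hours)
    revenue_loss_per_h: float = 50_000  # hourly revenue loss ($)
    automated_hours_per_week: float = 40
    labor_cost_per_h: float = 50
    strategic_multiplier: float = 1.5
    infra_cost_before_m: float = 30_000  # monthly infra cost before
    infra_cost_after_m: float = 12_000   # monthly infra cost after
    annual_aiops_cost: float = 110_000

def annual_value(i: RoiInputs) -> float:
    incident = (i.mttr_before_h - i.mttr_after_h) * i.incidents_per_year * i.response_cost_per_h
    downtime = (i.downtime_before_h - i.downtime_after_h) * i.revenue_loss_per_h
    productivity = i.automated_hours_per_week * 52 * i.labor_cost_per_h * i.strategic_multiplier
    infra = (i.infra_cost_before_m - i.infra_cost_after_m) * 12
    return incident + downtime + productivity + infra

def roi_percent(i: RoiInputs) -> float:
    return (annual_value(i) - i.annual_aiops_cost) / i.annual_aiops_cost * 100

def payback_months(i: RoiInputs) -> float:
    return i.annual_aiops_cost / (annual_value(i) / 12)

inputs = RoiInputs()
print(f"Annual value: ${annual_value(inputs):,.0f}")    # ≈ $737,060
print(f"ROI: {roi_percent(inputs):.0f}%")               # ≈ 570%
print(f"Payback: {payback_months(inputs):.1f} months")  # ≈ 1.8
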
Cautions for ROI Calculation

The formulas above are examples assuming a mid-sized organization (100-500 employees, $50M-$200M annual revenue). Actual ROI varies significantly based on:

  • Organization size and incident frequency
  • Actual impact of business downtime (e-commerce vs SaaS vs internal tools)
  • Existing operational maturity (starting from Level 0 vs Level 2)
  • Number and complexity of clusters

Small startups (<50 employees) may have smaller absolute amounts but relatively higher ROI, while large enterprises (>1000 employees) may see absolute amounts 10x or more larger.

Qualitative Value: Reduced Team Burnout, Improved Developer Experience

Qualitative values that are difficult to measure with quantitative metrics but have a decisive impact on long-term organizational performance.

1. Reduced Operations Team Burnout

| Metric | Before AIOps | After AIOps | Improvement |
|---|---|---|---|
| Night alert frequency | Average 8 per week | Average 1 per week | 85% reduction via AI Agent auto-response |
| Weekend emergency responses | Average 4 per month | Average 0.5 per month | Preemptive action via predictive analytics |
| Repetitive task ratio | 60% of work hours | 15% of work hours | 45pp reduction via automation |
| Operations team turnover rate | 25% annually | 8% annually | Improved job satisfaction |
| On-call stress score | 7.8/10 (high) | 3.2/10 (low) | Stress significantly reduced via autonomous recovery |

Business Impact:

  • Reduced operations expert turnover → Annual recruitment/training cost savings: $120,000 (assuming 40% of average salary)
  • Prevention of productivity decline from burnout → Difficult to quantify but improves organizational health

2. Developer Experience (DX) Improvement

| Metric | Before AIOps | After AIOps | Improvement |
|---|---|---|---|
| Deployment confidence | 50% (high anxiety) | 90% (high trust) | Automated verification and rollback |
| Failure root cause identification time | Average 45 minutes | Average 5 minutes | AI root cause analysis |
| Infrastructure inquiry response time | Average 2 hours | Instant (Q Developer) | Self-service enabled |
| Deployment frequency | 2 per week | 3 per day | More frequent deployments due to improved safety |
| Developer satisfaction | 6.2/10 | 8.7/10 | Infrastructure complexity abstracted away |

Business Impact:

  • Increased deployment frequency → Faster feature delivery → Strengthened market competitiveness
  • Developers focus on business logic instead of infrastructure debugging → Improved product quality

3. Knowledge Management and Organizational Learning

| Metric | Before AIOps | After AIOps | Improvement |
|---|---|---|---|
| Incident response pattern documentation | Manual, incomplete | AI Agent auto-learning | Knowledge loss prevention |
| New operator onboarding period | 3 months | 1 month | AI assistant provides real-time guidance |
| Recurring failure rate | 40% | 5% | Learned response patterns applied automatically |
| Best practices adoption rate | 30% | 85% | AI applies them automatically |

Business Impact:

  • Organizational knowledge accumulates in the system → Reduced dependency on key personnel
  • New team members achieve productivity quickly → Increased organizational scalability

4. Innovation Capacity Improvement

When the operations team is freed from repetitive tasks through AIOps adoption, they can focus on strategic work.

| Redirected Time Usage | Organizational Value |
|---|---|
| New service experimentation | 2x improvement in new feature delivery speed |
| Architecture optimization | 20% improvement in infrastructure efficiency |
| Security hardening | 70% reduction in vulnerability response time |
| Cost optimization analysis | 15% annual infrastructure cost reduction |
| Team capability development | Enhanced cloud-native expertise |

Actual Impact of Qualitative Value

Netflix's Chaos Engineering team invested 60% of the time saved through operations automation into improving system resilience, ultimately improving annual uptime from 99.9% to 99.99% (Netflix case study). This is a representative example of qualitative investment converting to quantitative results.

Investment vs. Impact Analysis by Stage (Per Maturity Level)

Analyzing investment scale and expected impact for each level of the AIOps maturity model (Section 8).

Level 0 → Level 1 Transition

Investment Items:

| Item | Cost | Notes |
|---|---|---|
| Managed Add-ons deployment (ADOT, CloudWatch Agent) | $0 | Add-ons themselves are free; only data collection costs apply |
| AMP/AMG initial configuration | $5,000 | Dashboard construction consulting |
| CloudWatch log/metric increase | $3,000/month | Observability data collection costs |
| Total initial investment | $5,000 + $3,000/month | |

Expected Impact:

| Impact | Measurement Metric | Expected Improvement |
|---|---|---|
| Observability visibility | Metric coverage | 30% → 95% |
| Incident detection time | Failure awareness speed | Average 30 min → 5 min |
| Dashboard construction time | New service monitoring | 2 days → 2 hours (using AMG templates) |

ROI: Payback period approximately 3-4 months. Eliminating blind spots from lack of observability is the core value.

Level 1 → Level 2 Transition

Investment Items:

| Item | Cost | Notes |
|---|---|---|
| Managed Argo CD configuration | $2,000 | GitOps workflow construction |
| ACK + KRO adoption | $3,000 | IaC transition consulting |
| Converting existing manual deployments to IaC | $10,000 | Terraform/Pulumi migration |
| Total initial investment | $15,000 | |

Expected Impact:

| Impact | Measurement Metric | Expected Improvement |
|---|---|---|
| Deployment time reduction | Infrastructure change duration | Average 2 hours → 10 min |
| Deployment error reduction | Failures from config inconsistencies | 3 per month → 0.2 per month |
| Rollback speed | Recovery time on issues | Average 45 min → 5 min |

ROI: Payback period approximately 2-3 months. Deployment automation drastically reduces human errors.

Level 2 → Level 3 Transition

Investment Items:

| Item | Cost | Notes |
|---|---|---|
| CloudWatch AI + DevOps Guru activation | $8,000/month | ML anomaly detection service billing |
| Q Developer integration | $5,000 | Initial setup and MCP integration |
| Kiro + EKS MCP server construction | $15,000 | Spec-driven workflow construction |
| Total initial investment | $20,000 + $8,000/month | |

Expected Impact:

| Impact | Measurement Metric | Expected Improvement |
|---|---|---|
| Root cause analysis speed | RCA duration | Average 2 hours → 10 min |
| Prediction accuracy | Pre-failure detection rate | 0% → 60% |
| Incident response MTTR | Average recovery time | 2 hours → 30 min |

ROI: Payback period approximately 4-6 months. ML-based predictive analytics is the core value.

Level 3 → Level 4 Transition

Investment Items:

| Item | Cost | Notes |
|---|---|---|
| Bedrock Agent construction | $25,000 | Autonomous operations Agent development |
| Strands/Kagent SOPs development | $20,000 | Auto-recovery scenario implementation |
| Bedrock Agent API call costs | $10,000/month | Production workload billing |
| Safety verification and testing | $15,000 | Thorough verification before production deployment |
| Total initial investment | $60,000 + $10,000/month | |

Expected Impact:

| Impact | Measurement Metric | Expected Improvement |
|---|---|---|
| Auto-recovery rate | Agent autonomous resolution rate | 0% → 70% |
| Incident response MTTR | Average recovery time | 30 min → 5 min |
| Night/weekend alerts | On-call burden | 8 per week → 1 per week |

ROI: Payback period approximately 6-9 months. Initial investment is large, but autonomous operations deliver the greatest long-term cost savings.

Cumulative ROI Comparison by Level:

| Maturity Level | Cumulative Initial Investment | Monthly Operating Cost | Annual Savings/Value | Payback Period |
|---|---|---|---|---|
| Level 1 | $5,000 | $3,000 | $100,000 | 3-4 months |
| Level 2 | $20,000 | $3,000 | $250,000 | 2-3 months (cumulative) |
| Level 3 | $40,000 | $11,000 | $500,000 | 4-6 months (cumulative) |
| Level 4 | $100,000 | $21,000 | $737,000 | 6-9 months (cumulative) |

Gradual Investment Strategy

Level 0 → Level 1 can be started immediately with fast ROI and low risk. Level 2 → Level 3 should proceed after the organization has developed some automation capabilities, and Level 4 should be adopted after sufficient data accumulation and safety verification. We recommend accumulating at least 6 months of operational experience at each level before transitioning to the next stage.


10. Conclusion

AIOps is an operational paradigm that maximizes the powerful capabilities and extensibility of the K8s platform with AI, while reducing operational complexity and accelerating innovation.

Key Summary

  1. AWS Open Source Strategy: Managed Add-ons + Managed Open Source (AMP/AMG/ADOT) → Eliminates operational complexity
  2. EKS Capabilities: Managed Argo CD + ACK + KRO → Core components of declarative automation
  3. Kiro + Hosted MCP: Spec-driven programmatic operations → Cost-effective and rapid response
  4. AI Agent Extension: Q Developer (GA) + Strands (OSS) + Kagent (Early) → Gradual autonomous operations

Next Steps

AIOps & AIDLC series learning path:

| Order | Document | Key Content |
|---|---|---|
| Next | 2. Intelligent Observability Stack | Build an integrated architecture with ADOT, AMP, AMG, CloudWatch AI |
| Then | 3. AIDLC Framework | Kiro spec-driven development, EKS Capabilities GitOps integration |
| Finally | 4. Predictive Scaling and Auto-Remediation | ML predictive scaling, AI Agent auto incident response |

Learning Tip: Each document builds on previous content, so sequential learning is recommended. For actual implementation, proceed in order: build the observability stack → apply AIDLC → expand predictive operations.
