Building an EKS Intelligent Observability Stack

Written: 2026-02-12 | Updated: 2026-02-14 | Reading time: ~45 min

1. Overview

In modern distributed systems, Observability goes beyond simple monitoring — it refers to the ability to understand a system's internal state through its external outputs. In EKS environments, the combination of hundreds of Pods, complex service meshes, and dynamic scaling makes it difficult to identify the root cause of problems with traditional monitoring alone.

1.1 3-Pillar Observability + AI Analysis Layer

Combining the three pillars of observability with an AI analysis layer enables truly intelligent operations.

Pillar	Role	AWS Services
Metrics	Numerical time-series data	AMP (Amazon Managed Prometheus), CloudWatch Metrics
Logs	Event-based text data	CloudWatch Logs, OpenSearch
Traces	Distributed request tracing	AWS X-Ray, ADOT
AI Analysis	ML-based anomaly detection and insights	DevOps Guru, CloudWatch AI, Q Developer

Scope of This Document

This document covers the entire process of building an intelligent observability stack in an EKS environment, from Managed Add-on based observability foundations to the AI analysis layer. It focuses on strategies where AWS operates open-source observability tools as managed services to eliminate complexity while maximizing K8s-native observability. While this document is based on the AWS native stack, the same architecture can be applied with 3rd-party backends by using ADOT (OpenTelemetry) as the collection layer.

1.3 Observability Stack Selection Patterns

In real-world EKS operational environments, three major observability stack patterns are used depending on organizational requirements and existing investments:

Pattern	Collection Layer	Backend	Suitable Environment
AWS Native	CloudWatch Observability Agent	CloudWatch Logs/Metrics, X-Ray	Teams with high AWS service dependency preferring single console management
OSS-Centric	ADOT (OpenTelemetry)	AMP (Prometheus), AMG (Grafana), X-Ray	Prefer K8s-native tools, multi-cloud strategy, minimize vendor lock-in
3rd Party	ADOT or vendor-specific agents	Datadog, Sumo Logic, Splunk, New Relic, etc.	Organizations with existing 3rd party investments or preferring unified SaaS dashboards

💡 Key Point: Using ADOT (OpenTelemetry) as the collection layer allows flexible backend switching. This is why AWS provides OpenTelemetry as a Managed Add-on alongside its own CloudWatch agent.

The Key to the Collection Layer: ADOT (OpenTelemetry)

Regardless of which backend you choose, using ADOT (OpenTelemetry) for the collection layer gives you the freedom to switch backends. Since OpenTelemetry is a CNCF standard, it can export data to most backends including Prometheus, Jaeger, Datadog, Sumo Logic, and more. This is why AWS provides OpenTelemetry as a Managed Add-on (ADOT) rather than relying only on a proprietary agent.

This document explains configurations based on the AWS Native and OSS-centric patterns. If you use a 3rd-party backend, you can leverage the same collection pipeline by simply changing the ADOT Collector's exporter settings.

1.2 Why Observability Matters in EKS

Observability in EKS environments is essential for the following reasons:

Dynamic infrastructure: Pods are constantly created/deleted, and nodes are dynamically provisioned by Karpenter
Microservice complexity: Complex inter-service call chains make it difficult to identify single points of failure
Multi-layer issues: Multiple layers including applications, container runtime, nodes, network, and AWS services
Cost optimization: Right-sizing based on resource usage pattern analysis
Regulatory compliance: Audit logs, access records, and other compliance requirements

2. Managed Add-ons Based Observability Foundation

EKS Managed Add-ons eliminate operational complexity by having AWS manage the installation, upgrades, and patches of observability agents. Each add-on is installed with a single aws eks create-addon command, so a production-grade observability foundation takes only a handful of commands.

EKS Managed Add-ons: Observability Layer

Add-on	Status	Layer	Description	Collection Targets	Key Features	Install Command
ADOT	GA	Application	OpenTelemetry-based metrics/traces/logs collection	Metrics, Traces, Logs	OTel standard, built-in SigV4 auth, multi-backend support	aws eks create-addon --addon-name adot
CloudWatch Agent	GA	Application	Container Insights Enhanced + Application Signals	Metrics, Logs, Traces (App Signals)	Auto-instrumentation, SLI/SLO, service map	aws eks create-addon --addon-name amazon-cloudwatch-observability
Node Monitoring	GA	Infrastructure	Node-level hardware/OS monitoring	NVMe, Memory, Kernel, OOM	Proactive hardware failure detection, EDAC events	aws eks create-addon --addon-name eks-node-monitoring-agent
NFM Agent	GA	Network	Container Network Observability: Pod-level network metrics	Network Flows, Cross-AZ Traffic	K8s context mapping, Cross-AZ cost visibility	aws eks create-addon --addon-name aws-network-flow-monitoring-agent
GuardDuty Agent	GA	Security	Runtime security threat detection	Runtime Events, Syscalls	ML-based threat detection, crypto mining detection	aws eks create-addon --addon-name aws-guardduty-agent

💡 Recommendation: Enabling all 5 Add-ons provides observability across all layers: infrastructure, network, application, and security. AWS manages version control and security patches for all Add-ons.

2.1 ADOT (AWS Distro for OpenTelemetry) Add-on

ADOT is the AWS distribution of OpenTelemetry that collects metrics, logs, and traces with a single agent.

# Install ADOT Add-on
aws eks create-addon \
  --cluster-name my-cluster \
  --addon-name adot \
  --addon-version v0.40.0-eksbuild.1 \
  --service-account-role-arn arn:aws:iam::ACCOUNT_ID:role/adot-collector-role

# Verify installation
aws eks describe-addon \
  --cluster-name my-cluster \
  --addon-name adot \
  --query 'addon.status'

ADOT vs Self-managed OpenTelemetry Deployment

Using the ADOT Add-on automatically installs the OpenTelemetry Operator with built-in AWS service authentication (SigV4). This significantly reduces operational overhead compared to self-managed deployments, and EKS version compatibility is guaranteed by AWS.

2.2 CloudWatch Observability Agent Add-on

The CloudWatch Observability Agent provides an integrated offering of Container Insights Enhanced, Application Signals, and CloudWatch Logs.

# CloudWatch Observability Agent Add-on
aws eks create-addon \
  --cluster-name my-cluster \
  --addon-name amazon-cloudwatch-observability \
  --service-account-role-arn arn:aws:iam::ACCOUNT_ID:role/cloudwatch-agent-role

# Verify configuration
kubectl get pods -n amazon-cloudwatch

2.3 Node Monitoring Agent Add-on (2025)

The Node Monitoring Agent detects hardware and OS-level issues on EC2 nodes.

# Node Monitoring Agent Add-on
aws eks create-addon \
  --cluster-name my-cluster \
  --addon-name eks-node-monitoring-agent

Key detection items:

NVMe disk errors: Proactive detection of EBS volume performance degradation
Memory hardware errors: EDAC (Error Detection and Correction) events
Kernel soft lockups: A CPU stuck in kernel code for an abnormally long time
OOM (Out of Memory): Process termination due to memory exhaustion

2.3.1 Integration of Node Readiness Controller with Observability

Node Readiness Controller (NRC) is a controller introduced as Beta in Kubernetes 1.32 that automatically manages node taints based on node issues reported by Node Problem Detector (NPD). This is a core component of the Closed-Loop Observability pattern that connects observability data to automatic remediation.

Role in the Observability Pipeline:

Collection: Node Monitoring Agent Add-on detects hardware/OS issues
Reporting: NPD reports status to the K8s API as Node Conditions
Detection: NRC monitors Node Condition changes
Action: NRC automatically applies/removes the node.kubernetes.io/unschedulable taint
Observation: CloudWatch Container Insights and AMP track taint change events
Alerting: SNS/EventBridge notifies the operations team of node state changes

CloudWatch Container Insights Integration:

# Query NRC-related node taint change events with CloudWatch Logs Insights
aws logs start-query \
  --log-group-name /aws/containerinsights/my-cluster/application \
  --start-time $(date -u -d '1 hour ago' +%s) \
  --end-time $(date -u +%s) \
  --query-string '
fields @timestamp, kubernetes.node_name, message
| filter message like /NoSchedule/
| filter message like /node.kubernetes.io\/unschedulable/
| sort @timestamp desc
'

# Example output:
# 2026-02-12 10:23:45 | node-abc123 | Taint added: node.kubernetes.io/unschedulable:NoSchedule (DiskPressure)
# 2026-02-12 10:28:12 | node-abc123 | Taint removed: node.kubernetes.io/unschedulable (DiskPressure resolved)
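To turn query results like the example output above into a number you can alert on, one option is to pair each "Taint added" line with the following "Taint removed" line and measure how long the node stayed tainted. The sketch below is a minimal illustration in Python: the " | " separated layout is taken from the example output, while the function name and the idea of post-processing the results this way are assumptions, not part of any AWS tooling.

```python
from datetime import datetime

# Sample lines copied from the example output above; the " | " layout is assumed.
SAMPLE = """\
2026-02-12 10:23:45 | node-abc123 | Taint added: node.kubernetes.io/unschedulable:NoSchedule (DiskPressure)
2026-02-12 10:28:12 | node-abc123 | Taint removed: node.kubernetes.io/unschedulable (DiskPressure resolved)
"""

def taint_durations(text):
    """Return {node: seconds spent tainted} for completed added/removed pairs."""
    added_at = {}  # node -> timestamp of the most recent "Taint added"
    totals = {}    # node -> accumulated tainted seconds
    for line in text.splitlines():
        parts = [p.strip() for p in line.split(" | ", 2)]
        if len(parts) != 3:
            continue  # skip lines that do not match the assumed layout
        ts = datetime.strptime(parts[0], "%Y-%m-%d %H:%M:%S")
        node, message = parts[1], parts[2]
        if message.startswith("Taint added"):
            added_at[node] = ts
        elif message.startswith("Taint removed") and node in added_at:
            delta = (ts - added_at.pop(node)).total_seconds()
            totals[node] = totals.get(node, 0.0) + delta
    return totals

durations = taint_durations(SAMPLE)
print(durations)
```

A node that was tainted but never untainted within the query window is left out of the result, which is usually the case you want to flag separately.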

Prometheus Metrics Collection:

NRC operates as part of the kube-controller-manager and exposes the following metrics:

# Collect NRC metrics with ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: node-readiness-controller
  namespace: kube-system
spec:
  selector:
    matchLabels:
      component: kube-controller-manager
  endpoints:
  - port: metrics
    path: /metrics
    interval: 30s

# Key metrics:
# - node_readiness_controller_reconcile_total: Number of NRC reconciliation executions
# - node_readiness_controller_reconcile_duration_seconds: Reconciliation processing time
# - node_readiness_controller_taint_changes_total: Number of taint applies/removals

AMG (Amazon Managed Grafana) Dashboard Visualization:

{
  "dashboard": {
    "title": "Node Readiness & Health",
    "panels": [
      {
        "title": "Nodes with Unschedulable Taints",
        "targets": [{
          "expr": "count(kube_node_spec_taint{key='node.kubernetes.io/unschedulable'})"
        }]
      },
      {
        "title": "NRC Reconciliation Rate",
        "targets": [{
          "expr": "rate(node_readiness_controller_reconcile_total[5m])"
        }]
      },
      {
        "title": "Node Condition Changes (24h)",
        "targets": [{
          "expr": "increase(node_readiness_controller_taint_changes_total[24h])"
        }]
      }
    ]
  }
}

EventBridge-based Alert Automation:

# EventBridge Rule: SNS alert on NRC taint changes
apiVersion: v1
kind: ConfigMap
metadata:
  name: eventbridge-rule
data:
  rule.json: |
    {
      "source": ["aws.eks"],
      "detail-type": ["EKS Node Taint Change"],
      "detail": {
        "taintKey": ["node.kubernetes.io/unschedulable"],
        "action": ["added", "removed"]
      }
    }
---
# Send alerts to SNS topic
# Alert example:
# [ALERT] Node ip-10-0-1-45.ap-northeast-2.compute.internal
# Taint added: node.kubernetes.io/unschedulable:NoSchedule
# Reason: DiskPressure detected by Node Monitoring Agent
# Action: Pods will not be scheduled until condition resolves

Utilizing Dry-run Mode (Pre-production Validation):

NRC supports three modes:

Mode	Description	When to Use
`dry-run`	Simulates taint changes only (no actual application)	Assess impact scope before production deployment
`bootstrap-only`	Applies taints only during cluster boot	Use only during initial node preparation phase
`continuous`	Continuously monitors node state and manages taints	Production environment (recommended)

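# Enable NRC in dry-run mode (impact scope simulation)
# Note: on EKS the control plane, including kube-controller-manager, is managed by AWS
# and cannot be patched this way; this command applies only where you run the controller yourself.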
kubectl patch deployment kube-controller-manager \
  -n kube-system \
  --type='json' \
  -p='[{
    "op": "add",
    "path": "/spec/template/spec/containers/0/command/-",
    "value": "--feature-gates=NodeReadinessController=true"
  },{
    "op": "add",
    "path": "/spec/template/spec/containers/0/command/-",
    "value": "--node-readiness-controller-mode=dry-run"
  }]'

# Analyze dry-run results with CloudWatch Logs Insights
aws logs start-query \
  --log-group-name /aws/containerinsights/my-cluster/application \
  --start-time $(date -u -d '1 hour ago' +%s) \
  --end-time $(date -u +%s) \
  --query-string '
fields @timestamp, message
| filter message like /dry-run/
| filter message like /would add taint/
| stats count() by kubernetes.node_name
'

# Output: Confirm the number of taints to be applied per node
# -> Decide to switch to continuous mode after assessing impact scope

Gradual Rollout Strategy:

Dry-run phase: Monitor simulation results in observability dashboards
Bootstrap-only phase: Apply taints only during node boot to assess initial impact
Continuous phase: Fully activate in production environment with continuous monitoring
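The difference between the three modes, and why the rollout moves through them in that order, can be illustrated with a toy model. The Python sketch below is not the controller's implementation: the Node fields and the decide function are hypothetical names chosen for illustration, and node health is reduced to a single boolean.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    healthy: bool  # hypothetical summary of the node's reported conditions
    booting: bool  # True while the node is still in its initial bootstrap

def decide(mode, node):
    """Return (apply_taint, log_message) for one node under the given mode."""
    if mode not in ("dry-run", "bootstrap-only", "continuous"):
        raise ValueError(f"unknown mode: {mode}")
    if node.healthy:
        return False, f"{node.name}: healthy, no taint"
    if mode == "dry-run":
        # Only report what would happen; never change the node.
        return False, f"{node.name}: dry-run, would add taint"
    if mode == "bootstrap-only" and not node.booting:
        # Past bootstrap, this mode no longer acts on the node.
        return False, f"{node.name}: past bootstrap, ignored"
    return True, f"{node.name}: taint added"

nodes = [
    Node("node-a", healthy=True, booting=False),
    Node("node-b", healthy=False, booting=False),
    Node("node-c", healthy=False, booting=True),
]

for mode in ("dry-run", "bootstrap-only", "continuous"):
    tainted = [n.name for n in nodes if decide(mode, n)[0]]
    print(mode, tainted)
```

Each phase taints a superset of the nodes tainted in the previous one, which is what makes the staged rollout a way to limit blast radius.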

Best Practice for Observability to Auto-Remediation

NRC is an excellent example of the Closed-Loop Observability pattern that performs automatic actions based on observability data. When the Node Monitoring Agent detects a problem, NRC automatically isolates the node to maintain workload stability. This is a core component of Self-Healing Infrastructure where the system recovers on its own without human intervention.

Reference

Kubernetes Blog: Introducing Node Readiness Controller

2.4 Container Network Observability (2025.11)

Container Network Observability, announced at re:Invent in November 2025, provides network visibility with K8s context in EKS environments. While traditional VPC Flow Logs only showed IP-level traffic, Container Network Observability provides Pod-to-Pod, Pod-to-Service, Pod-to-external service level network flows along with K8s metadata (namespace, service name, Pod labels).

# Install Network Flow Monitoring Agent Add-on
aws eks create-addon \
  --cluster-name my-cluster \
  --addon-name aws-network-flow-monitoring-agent

# Enable network policy support in the VPC CNI add-on
# (enableNetworkPolicy controls network policy enforcement, not flow monitoring itself)
aws eks update-addon \
  --cluster-name my-cluster \
  --addon-name vpc-cni \
  --configuration-values '{"enableNetworkPolicy":"true"}'

Key features:

Pod-level network metrics: Track TCP retransmissions, packet drops, and connection latency at the Pod/Service level
Cross-AZ traffic visibility: Measure cross-AZ data transfer volumes per service to identify unnecessary Cross-AZ costs
K8s context network map: Automatically map namespace, service name, and Pod labels to network flows
AWS service communication tracking: Analyze traffic patterns from Pods to AWS services like S3, RDS, DynamoDB
Preferred observability stack integration: Send metrics to any backend including AMP/Grafana, CloudWatch, Datadog
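As a rough illustration of the Cross-AZ cost visibility described above, the following Python sketch aggregates flow records per service and estimates the transfer charge. Everything here is an assumption made for the example: the record fields are not the agent's actual schema, and the price constant is a placeholder that should be checked against current AWS pricing for your region.

```python
from collections import defaultdict

# Assumed price: USD per GB, charged in each direction for cross-AZ traffic.
# This is a placeholder; verify it against current pricing before use.
PRICE_PER_GB_EACH_WAY = 0.01

# Hypothetical flow records; field names are illustrative only.
flows = [
    {"service": "checkout", "src_az": "a", "dst_az": "b", "gb": 120.0},
    {"service": "checkout", "src_az": "a", "dst_az": "a", "gb": 300.0},
    {"service": "search", "src_az": "b", "dst_az": "c", "gb": 40.0},
]

def cross_az_cost(records, price=PRICE_PER_GB_EACH_WAY):
    """Estimate cross-AZ transfer cost in USD per service."""
    gb_by_service = defaultdict(float)
    for r in records:
        if r["src_az"] != r["dst_az"]:  # same-AZ traffic is not counted
            gb_by_service[r["service"]] += r["gb"]
    # Charged on both the sending and the receiving side, hence the factor 2.
    return {svc: round(gb * price * 2, 2) for svc, gb in gb_by_service.items()}

print(cross_az_cost(flows))
```

Services with a high ratio of cross-AZ to same-AZ traffic are the natural candidates for topology-aware routing.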

Enhanced Network Security Policies (2025.12)

Along with Container Network Observability, EKS also introduced Enhanced Network Security Policies. These allow centralized application of network access filters across the entire cluster and fine-grained control of external traffic with DNS-based egress policies. They operate on top of VPC CNI's Network Policy capabilities.

Key Message

With just 5 observability Managed Add-ons, you establish the observability foundation across all layers: infrastructure (Node Monitoring), network (NFM Agent -> Container Network Observability), application (ADOT, CloudWatch Agent), and security (GuardDuty Agent). Each is deployed with a single aws eks create-addon command, and version management and security patches are handled by AWS.

2.6 CloudWatch Generative AI Observability

CloudWatch Generative AI Observability, which started as Preview in July 2025 and reached GA in October, provides a new observability dimension for AI/ML workloads. It adds AI workload-specific observability to the existing 3-Pillar observability (Metrics, Logs, Traces), ushering in the era of 4-Pillar observability.

2.6.1 Core Features

LLM and AI Agent Monitoring:

Monitor LLMs and AI Agents running on all infrastructure including Amazon Bedrock, EKS, ECS, and on-premises
Token consumption tracking (input/output token counts, cost per token)
Inference latency analysis (request-response time, P50/P90/P99 latency)
End-to-end tracing for full AI stack visibility
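To make the latency percentiles and token counts above concrete, here is a small Python sketch that derives P50/P90/P99 and token totals from per-request records. The records are invented sample data, and the nearest-rank method is one common percentile definition chosen for simplicity; it is not claimed to be what CloudWatch computes.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical per-request records: (latency in ms, input tokens, output tokens).
requests = [
    (400, 800, 150), (420, 760, 180), (450, 900, 210), (480, 820, 160),
    (500, 640, 190), (520, 700, 220), (600, 950, 300), (700, 880, 260),
    (900, 1200, 400), (2300, 3000, 1500),
]

latencies = [r[0] for r in requests]
for p in (50, 90, 99):
    print(f"P{p}: {percentile(latencies, p)} ms")

total_in = sum(r[1] for r in requests)
total_out = sum(r[2] for r in requests)
print(f"tokens in/out: {total_in}/{total_out}")
```

With only ten samples the P99 is simply the slowest request, which is why tail percentiles are only meaningful over a reasonably large window.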

AI Workflow-Specific Observability:

Hallucination risk path detection: Identify paths where the model is likely to generate incorrect information
Retrieval miss identification: Track search failures in RAG (Retrieval-Augmented Generation) systems
Rate-limit retry monitoring: Analyze retry patterns due to API limits
Model-switch decision tracking: Monitor logic for switching between multiple models

Amazon Bedrock AgentCore Integration:

Provides ready-to-use views for Agent workflows, Knowledge Base, and Tool usage
Cross-tool prompt flow visibility
External framework support (LangChain, LangGraph, CrewAI)

2.6.2 4-Pillar Observability Architecture

Differentiators of AI Observability

Traditional 3-Pillar observability observes a system's behavior, while AI observability observes the model's decision-making and quality. For example, API latency (traditional) and inference quality (AI-specific) are different observation targets.

2.6.3 Activation Method

# Enable CloudWatch Generative AI Observability (EKS workloads)
# Add AI Observability Exporter to ADOT Collector
kubectl apply -f - <<EOF
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: adot-ai-observability
  namespace: observability
spec:
  mode: deployment
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: "0.0.0.0:4317"

    processors:
      batch:
        timeout: 10s

    exporters:
      awsxray:
        region: ap-northeast-2
        indexed_attributes:
          - "gen_ai.system"
          - "gen_ai.request.model"
          - "gen_ai.usage.input_tokens"
          - "gen_ai.usage.output_tokens"

      # CloudWatch metrics are exported in Embedded Metric Format via the awsemf exporter
      awsemf:
        region: ap-northeast-2
        namespace: "GenAI/Observability"
        metric_declarations:
          - dimensions:
              - ["service.name", "gen_ai.request.model"]
            metric_name_selectors:
              - "gen_ai.usage.input_tokens"
              - "gen_ai.usage.output_tokens"
              - "gen_ai.request.duration"

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [awsxray]
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [awsemf]
EOF

2.6.4 MCP Integration and Automation

CloudWatch Generative AI Observability integrates with the Bedrock Data Automation MCP server to allow direct querying of AI observability data from AI clients like Kiro and Amazon Q Developer.

[Scenario: LLM Inference Latency Increase]

Kiro + MCP Auto Analysis:
1. CloudWatch MCP: query_ai_metrics("inference_latency") -> P99 500ms -> 2.3s increase
2. CloudWatch MCP: get_ai_traces(service="recommendation-llm") -> Token count spike confirmed
3. CloudWatch MCP: check_hallucination_risk() -> High risk for certain prompt patterns
4. Bedrock MCP: get_model_config() -> Excessive max_tokens model parameter setting

-> Kiro automatically:
   - Creates PR to optimize max_tokens limit
   - Suggests prompt engineering improvements
   - Adds alternative model (smaller model) usage logic

GitHub Action Integration

CloudWatch Generative AI Observability provides a GitHub Action to automatically add AI observability data to PRs. It automatically displays token consumption, latency changes, and hallucination risk changes on model change PRs to assess impact before deployment.

2.6.5 Real-World Use Cases

Case 1: RAG System Search Quality Monitoring

[Problem Discovery]
Retrieval miss rate: 15% -> 35% spike (within 2 hours)

[CloudWatch AI Observability Analysis]
- Knowledge Base index not updated for 7 days
- Pattern detected: queries for latest documents failing
- Embedding model version mismatch confirmed

[Auto Remediation]
-> Knowledge Base re-indexing triggered
-> Embedding model synchronized
-> Retrieval miss rate restored to 15%

Case 2: Token Cost Optimization

[Cost Anomaly Detection]
Daily token cost: $500 -> $2,300 (4.6x, a 360% increase)

[Root Cause Analysis]
- Specific prompt template outputting an average of 5,000 tokens (normal: 500)
- Repetitive prompt chains maintaining unnecessarily long context

[Optimization Result]
-> Prompt template refactored
-> Dynamic context window adjustment
-> Cost reduced to $600/day (74% savings)
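The percentages in Case 2 are easy to get wrong because the increase is measured against the baseline while the savings are measured against the spike. A quick Python check of the arithmetic, using only the dollar figures from the case (the helper name is illustrative):

```python
def pct_change(before, after):
    """Percentage change from before to after, relative to before."""
    return (after - before) / before * 100

baseline, spike, optimized = 500, 2300, 600

increase = pct_change(baseline, spike)   # relative to the baseline
savings = -pct_change(spike, optimized)  # relative to the spike

print(f"{spike / baseline:.1f}x baseline, +{increase:.0f}%")
print(f"savings vs spike: {savings:.0f}%")
```

So $2,300 is 4.6 times the baseline, which is a 360% increase over it, and the drop to $600 is a saving of about 74% relative to the spike.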


3. Overall Architecture

The EKS intelligent observability stack consists of 5 layers.

🏗️ Observability Architecture Layers

Collection → Transport → Storage → Analysis → Action

Layer	Purpose	Components
Collection	Generate and collect observability data	ADOT Collector, CloudWatch Agent, Fluent Bit, Node Monitoring Agent
Transport	Send collected data to backends	OTLP/gRPC, Prometheus Remote Write, CloudWatch API, X-Ray API
Storage	Long-term storage of observability data	AMP (Prometheus), CloudWatch Logs/Metrics, X-Ray Traces, S3
Analysis	Query and visualize data	AMG (Grafana), CloudWatch AI, DevOps Guru, Q Developer
Action	Insight-driven automation	Kiro + MCP, AI Agents, Auto-remediation, Escalation

3.1 Data Flow Summary

Roles and components of the 5 layers

Layer	Components	Role
Collection	ADOT, CW Agent, Fluent Bit, Node Monitor, Flow Monitor	Collect metrics/logs/traces/events
Transport	OTLP, Remote Write, CW API, X-Ray API	Deliver data via standard protocols
Storage	AMP, CloudWatch Logs/Metrics, X-Ray	Time-series storage and indexing
Analysis	AMG, CloudWatch AI, DevOps Guru, Application Signals	AI/ML-based analysis and visualization
Action	Hosted MCP, Kiro, Q Developer, Kagent	AI-based auto-response and remediation

4. ADOT Collector Deployment

4.1 OpenTelemetryCollector CRD

Installing the ADOT Add-on also deploys the OpenTelemetry Operator, allowing declarative management of collectors through the OpenTelemetryCollector CRD.

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: adot-collector
  namespace: observability
spec:
  mode: deployment
  replicas: 2
  resources:
    limits:
      cpu: "1"
      memory: 2Gi
    requests:
      cpu: 200m
      memory: 512Mi
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: "0.0.0.0:4317"
          http:
            endpoint: "0.0.0.0:4318"
      prometheus:
        config:
          scrape_configs:
            - job_name: 'kubernetes-pods'
              kubernetes_sd_configs:
                - role: pod
              relabel_configs:
                - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
                  action: keep
                  regex: true
    processors:
      batch:
        timeout: 10s
        send_batch_size: 1024
      memory_limiter:
        check_interval: 1s
        limit_mib: 512
        spike_limit_mib: 128
      resource:
        attributes:
          - key: cluster.name
            value: "my-eks-cluster"
            action: upsert
          - key: aws.region
            value: "ap-northeast-2"
            action: upsert
      filter:
        metrics:
          exclude:
            match_type: regexp
            metric_names:
              - "go_.*"
              - "process_.*"
    exporters:
      prometheusremotewrite:
        endpoint: "https://aps-workspaces.ap-northeast-2.amazonaws.com/workspaces/ws-xxxxx/api/v1/remote_write"
        auth:
          authenticator: sigv4auth
        resource_to_telemetry_conversion:
          enabled: true
      awsxray:
        region: ap-northeast-2
        indexed_attributes:
          - "otel.resource.service.name"
          - "otel.resource.deployment.environment"
      awscloudwatchlogs:
        region: ap-northeast-2
        log_group_name: "/eks/my-cluster/application"
        log_stream_name: "otel-logs"
    extensions:
      sigv4auth:
        region: ap-northeast-2
        service: aps
      health_check:
        endpoint: "0.0.0.0:13133"
    service:
      extensions: [sigv4auth, health_check]
      pipelines:
        metrics:
          receivers: [otlp, prometheus]
          processors: [memory_limiter, filter, batch, resource]
          exporters: [prometheusremotewrite]
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch, resource]
          exporters: [awsxray]
        logs:
          receivers: [otlp]
          processors: [memory_limiter, batch, resource]
          exporters: [awscloudwatchlogs]

4.2 DaemonSet Mode Deployment

Use DaemonSet mode when per-node metric collection is needed.

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: adot-node-collector
  namespace: observability
spec:
  mode: daemonset
  hostNetwork: true
  volumes:
    - name: hostfs
      hostPath:
        path: /
  volumeMounts:
    - name: hostfs
      mountPath: /hostfs
      readOnly: true
  env:
    - name: K8S_NODE_NAME
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName
  config:
    receivers:
      hostmetrics:
        root_path: /hostfs
        collection_interval: 30s
        scrapers:
          cpu: {}
          disk: {}
          filesystem: {}
          load: {}
          memory: {}
          network: {}
      kubeletstats:
        collection_interval: 30s
        auth_type: serviceAccount
        endpoint: "https://${env:K8S_NODE_NAME}:10250"
        insecure_skip_verify: true
    processors:
      batch:
        timeout: 30s
      resourcedetection:
        detectors: [env, eks]
    exporters:
      prometheusremotewrite:
        endpoint: "https://aps-workspaces.ap-northeast-2.amazonaws.com/workspaces/ws-xxxxx/api/v1/remote_write"
        auth:
          authenticator: sigv4auth
    extensions:
      sigv4auth:
        region: ap-northeast-2
        service: aps
    service:
      extensions: [sigv4auth]
      pipelines:
        metrics:
          receivers: [hostmetrics, kubeletstats]
          processors: [resourcedetection, batch]
          exporters: [prometheusremotewrite]

Deployment vs DaemonSet Selection Criteria

Deployment mode: Application metrics/traces collection (OTLP reception), centralized processing
DaemonSet mode: Node-level metrics collection (hostmetrics, kubeletstats), network efficient
Sidecar mode: Collect logs/traces for specific Pods only, when isolation is needed

4.3 Pipeline Configuration Principles

The ADOT Collector pipeline processes data in the order receivers -> processors -> exporters.

+---------------+    +----------------+    +---------------+
|  Receivers    |--->|  Processors    |--->|  Exporters    |
|               |    |                |    |               |
| - otlp        |    | - memory_      |    | - prometheus  |
| - prometheus  |    |   limiter      |    |   remotewrite |
| - hostmetrics |    | - batch        |    | - awsxray     |
| - kubelet     |    | - filter       |    | - cwlogs      |
|   stats       |    | - resource     |    |               |
+---------------+    +----------------+    +---------------+
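The receivers -> processors -> exporters flow can be mimicked in a few lines of Python to show what each processor stage does to the data. This is a conceptual model only, not how the Collector is implemented: the function names are made up, metric points are plain dictionaries, and the exclude patterns are treated as matching the whole metric name, which is a simplification.

```python
import re

# Patterns mirror the filter processor settings used in this document.
EXCLUDE = [re.compile(p) for p in (r"go_.*", r"process_.*")]

def filter_metrics(points):
    """Drop points whose metric name matches an excluded pattern."""
    return [p for p in points if not any(rx.fullmatch(p["name"]) for rx in EXCLUDE)]

def add_resource(points, attrs):
    """Attach resource attributes (such as the cluster name) to every point."""
    return [{**p, **attrs} for p in points]

def batch(points, size):
    """Group points into fixed-size batches before export."""
    return [points[i:i + size] for i in range(0, len(points), size)]

def run_pipeline(points, size=2):
    kept = filter_metrics(points)
    tagged = add_resource(kept, {"cluster.name": "my-eks-cluster"})
    return batch(tagged, size)

received = [
    {"name": "go_goroutines", "value": 42},
    {"name": "http_requests_total", "value": 1200},
    {"name": "process_cpu_seconds_total", "value": 3.5},
    {"name": "container_cpu_usage_seconds_total", "value": 88.1},
    {"name": "node_load1", "value": 0.7},
]

for i, b in enumerate(run_pipeline(received)):
    print(i, [p["name"] for p in b])
```

Filtering before batching means dropped metrics never consume batch capacity or network bandwidth, which is the reason to place the filter early in the processor list.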

Key Processor Settings:

Processor	Role	Recommended Settings
memory_limiter	Prevent OOM	limit_mib: 512, spike_limit_mib: 128
batch	Network efficiency	timeout: 10s, send_batch_size: 1024
filter	Remove unnecessary metrics	Exclude go_*, process_*
resource	Add metadata	Attach cluster.name, aws.region
resourcedetection	Auto-detect environment	Enable EKS, EC2 detectors

5. AMP + AMG Integration

5.1 AMP (Amazon Managed Prometheus)

AMP is a Prometheus-compatible managed service that stores and queries metrics at scale without infrastructure management.

# Create AMP workspace
aws amp create-workspace \
  --alias my-eks-observability \
  --tags Environment=production

# Check workspace ID
aws amp list-workspaces \
  --query 'workspaces[?alias==`my-eks-observability`].workspaceId' \
  --output text

5.2 Remote Write Configuration

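Remote write configuration for sending metrics to AMP from a self-managed Prometheus server. When the ADOT Collector is the sender, the equivalent settings live in the prometheusremotewrite exporter shown in section 4.1.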

# Prometheus remote_write configuration
remoteWrite:
  - url: "https://aps-workspaces.ap-northeast-2.amazonaws.com/workspaces/ws-xxxxx/api/v1/remote_write"
    sigv4:
      region: ap-northeast-2
    queue_config:
      max_samples_per_send: 1000
      max_shards: 200
      capacity: 2500
    write_relabel_configs:
      - source_labels: [__name__]
        regex: "go_.*|process_.*"
        action: drop

Remote Write Cost Optimization

AMP charges based on the number of ingested metric samples. Dropping unnecessary metrics (go_*, process_*) via write_relabel_configs can reduce ingestion costs by roughly 30-50%, depending on how much of your volume those metrics represent. Additionally, increasing the scrape_interval from 15s to 30s halves the number of samples.
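The effect of both levers on ingested sample volume can be estimated with simple arithmetic. The Python sketch below uses a hypothetical count of 100,000 active series and assumes 40% of them are droppable, a figure picked from within the range quoted above purely for illustration.

```python
SECONDS_PER_DAY = 86_400

def samples_per_day(active_series, scrape_interval_s):
    """Each active series produces one sample per scrape."""
    return active_series * SECONDS_PER_DAY // scrape_interval_s

series = 100_000             # hypothetical number of active series
kept = series * 60 // 100    # assume 40% of series are dropped by relabeling

baseline = samples_per_day(series, 15)
slower_scrape = samples_per_day(series, 30)
both = samples_per_day(kept, 30)

print(f"15s, all series:  {baseline:,}")
print(f"30s, all series:  {slower_scrape:,}")
print(f"30s, 60% series:  {both:,} ({both / baseline:.0%} of baseline)")
```

The two savings multiply: halving the scrape frequency and keeping 60% of the series leaves 30% of the original sample volume.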

5.3 AMG (Amazon Managed Grafana) Data Source Connection

Add AMP as a data source in AMG.

# Create AMG workspace
aws grafana create-workspace \
  --workspace-name my-eks-grafana \
  --account-access-type CURRENT_ACCOUNT \
  --authentication-providers AWS_SSO \
  --permission-type SERVICE_MANAGED \
  --workspace-data-sources PROMETHEUS CLOUDWATCH XRAY

# Create a Grafana service account (its token can be used with the Grafana API to add the AMP data source)
aws grafana create-workspace-service-account \
  --workspace-id g-xxxxxxxxxx \
  --grafana-role ADMIN \
  --name amp-datasource

Once the AMP data source is added in AMG, the PromQL queries in the next section cover the most common operational questions.

5.4 Essential PromQL Queries

# Top 10 Pod CPU usage
topk(10,
  sum(rate(container_cpu_usage_seconds_total{namespace!="kube-system"}[5m])) by (pod)
)

# Memory usage per node
100 * (1 - (
  node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
))

# HTTP request error rate (5xx)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100

# P99 latency
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)

# Pod restart count (last 1 hour)
increase(kube_pod_container_status_restarts_total[1h])

# Karpenter node provisioning wait time
histogram_quantile(0.95,
  sum(rate(karpenter_provisioner_scheduling_duration_seconds_bucket[10m])) by (le)
)
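The P99 and P95 queries above rely on histogram_quantile, which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The Python sketch below reproduces that idea for classic histograms under simplifying assumptions: buckets are given as sorted (upper bound, cumulative count) pairs ending in +Inf, the lowest bucket is assumed to start at 0, and edge cases handled by Prometheus itself are omitted.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate quantile q from (upper_bound, cumulative_count) pairs ending in +Inf."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least two buckets, the last one being +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for i, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # Target falls in the +Inf bucket: report the highest finite bound.
                return buckets[i - 1][0]
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan

# Hypothetical request-duration histogram: 100 requests in total.
buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.99, buckets))
print(histogram_quantile(0.50, buckets))
```

Because the result is interpolated, its accuracy depends on bucket boundaries: a P99 that lands in a wide bucket can be off by a large fraction of that bucket's width.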

Core Value of AMP + AMG

AWS handles all infrastructure management for Prometheus and Grafana (scaling, patching, high availability, backups). Teams can focus solely on dashboard configuration and query writing, allowing them to concentrate on the essential value of observability. This is the core of AWS's strategy of "maintaining the benefits of open source while eliminating complexity."

5.5 Grafana Alloy: Next-Generation Collector Pattern

Grafana Alloy is the successor to Grafana Agent, officially announced in April 2024. It supports both OpenTelemetry and Prometheus collection and enables more flexible pipeline configuration based on Flow mode.

5.5.1 Grafana Alloy vs ADOT Comparison

Feature	ADOT (AWS Perspective)	Grafana Alloy	Recommended Scenario
Management	EKS Managed Add-on	Self-deployed (Helm)	ADOT: When AWS integration is priority
Backend Focus	AWS services (AMP, CloudWatch, X-Ray)	Grafana Cloud, Prometheus, Loki	Alloy: When centered on Grafana ecosystem
OpenTelemetry Support	Native (OTEL Collector based)	Native (OTEL Receiver built-in)	Equal
Prometheus Collection	Yes (prometheus receiver)	Yes (prometheus.scrape)	Alloy is lighter and faster
Log Collection	CloudWatch Logs, S3	Loki, CloudWatch Logs	Alloy: Loki optimized
Tracing	X-Ray, OTLP	Tempo, Jaeger, OTLP	Alloy: Tempo optimized
Configuration	YAML (OTEL Collector standard)	River language (declarative + dynamic)	Alloy is more intuitive
AWS IAM Integration	SigV4 built-in	Manual setup required	ADOT is much simpler
Resource Usage	Medium (Go-based)	Low (optimized Go)	Alloy uses ~30% less

ADOT vs Grafana Alloy Selection Guide

Choose ADOT when:

You want the convenience of AWS Managed Add-on
You primarily use AMP + CloudWatch + X-Ray as backends
You want automatic AWS IAM authentication handling
You want AWS-guaranteed EKS version compatibility

Choose Grafana Alloy when:

You use Grafana Cloud or a self-hosted Grafana stack
You're building a complete open-source stack with Loki + Tempo + Mimir
Lighter resource usage is important (cost-sensitive)
You need dynamic configuration features of River language

5.5.2 Deploying Grafana Alloy on EKS

# Add Helm repository
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Install Grafana Alloy
helm install grafana-alloy grafana/alloy \
  --namespace observability \
  --create-namespace \
  --set alloy.configMap.content='
logging {
  level = "info"
  format = "logfmt"
}

// Prometheus metrics collection
prometheus.scrape "kubernetes_pods" {
  targets = discovery.kubernetes.pods.targets
  forward_to = [prometheus.remote_write.amp.receiver]

  clustering {
    enabled = true
  }
}

// Kubernetes Pod auto-discovery
discovery.kubernetes "pods" {
  role = "pod"

  selectors {
    role  = "pod"
    field = "spec.nodeName=" + env("HOSTNAME")
  }
}

// Send metrics to AMP (SigV4 authentication)
prometheus.remote_write "amp" {
  endpoint {
    url = "https://aps-workspaces.ap-northeast-2.amazonaws.com/workspaces/ws-xxxxx/api/v1/remote_write"

    sigv4 {
      region = "ap-northeast-2"
    }
  }
}

// Send logs to Loki
loki.source.kubernetes "pods" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "https://logs-prod-012.grafana.net/loki/api/v1/push"

    basic_auth {
      username = env("LOKI_USERNAME")
      password = env("LOKI_PASSWORD")
    }
  }
}

// Receive OpenTelemetry traces
otelcol.receiver.otlp "default" {
  grpc {
    endpoint = "0.0.0.0:4317"
  }

  http {
    endpoint = "0.0.0.0:4318"
  }

  output {
    traces  = [otelcol.exporter.otlp.tempo.input]
  }
}

otelcol.exporter.otlp "tempo" {
  client {
    endpoint = "tempo.grafana.net:443"

    auth = otelcol.auth.basic.tempo.handler
  }
}

otelcol.auth.basic "tempo" {
  username = env("TEMPO_USERNAME")
  password = env("TEMPO_PASSWORD")
}
'

5.5.3 AMP + Alloy Combination vs AMP + ADOT Combination

Scenario 1: AMP + Grafana Alloy

Pros:
- 30% reduction in resource usage (CPU/Memory)
- Excellent Prometheus collection performance (100K samples/second)
- Dynamic configuration with River language (config changes without redeployment)

Cons:
- Manual AWS IAM authentication setup required (SigV4 credential management)
- No EKS Managed Add-on support (manual upgrades)
- Complex CloudWatch Logs integration (additional setup required)

Scenario 2: AMP + ADOT

Pros:
- Fully automated management as EKS Managed Add-on
- AWS IAM integration (automatic SigV4, IRSA support)
- Native CloudWatch + X-Ray integration
- AWS support and compatibility guarantee

Cons:
- Slightly higher resource usage than Alloy
- YAML-centric configuration (not as flexible as River)

Practical Recommendation

Hybrid approach: It's also possible to collect metrics with Grafana Alloy and send them to AMP, while collecting traces and logs with ADOT and sending them to X-Ray and CloudWatch. This is a strategy that leverages each tool's strengths.

5.5.4 Integration with Grafana Cloud

When using Grafana Cloud, Alloy can configure a complete observability stack with Loki + Tempo + Mimir.

// Grafana Cloud integration example (alloy-config.river)
prometheus.remote_write "grafana_cloud" {
  endpoint {
    url = "https://prometheus-prod-01-eu-west-0.grafana.net/api/prom/push"

    basic_auth {
      username = env("GRAFANA_CLOUD_PROMETHEUS_USERNAME")
      password = env("GRAFANA_CLOUD_API_KEY")
    }
  }
}

loki.write "grafana_cloud" {
  endpoint {
    url = "https://logs-prod-eu-west-0.grafana.net/loki/api/v1/push"

    basic_auth {
      username = env("GRAFANA_CLOUD_LOKI_USERNAME")
      password = env("GRAFANA_CLOUD_API_KEY")
    }
  }
}

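otelcol.exporter.otlp "grafana_cloud_traces" {
  client {
    endpoint = "tempo-prod-04-prod-eu-west-0.grafana.net:443"
    auth     = otelcol.auth.basic.grafana_cloud.handler
  }
}

// Credentials referenced by the exporter above.
// GRAFANA_CLOUD_TEMPO_USERNAME is an assumed variable name.
otelcol.auth.basic "grafana_cloud" {
  username = env("GRAFANA_CLOUD_TEMPO_USERNAME")
  password = env("GRAFANA_CLOUD_API_KEY")
}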

Advantages of Grafana Cloud:

Fully managed: No infrastructure management for Loki, Tempo, Mimir
Unified view: Explore metrics, logs, and traces in a single Grafana UI
Free tier: 10K time series, 50GB logs, 50GB traces per month free
Global high availability: Automatic replication across multiple regions

Cost Comparison (monthly, small-to-medium EKS cluster):

Item	AMP + AMG	Grafana Cloud	Self-hosted Grafana
Metrics (100K samples/sec)	$50-80	$60-100	$150-200 (EC2 cost)
Logs (50GB/month)	$25 (CloudWatch)	$30 (Loki)	$100 (EBS + instances)
Traces (10K spans/sec)	$15 (X-Ray)	$20 (Tempo)	$50 (EBS + instances)
Management overhead	Low	Very low	High
Total estimated cost	$90-120	$110-150	$300-350

6. CloudWatch Cross-Account Observability

6.1 The Need for Multi-Account Observability

In large organizations, AWS accounts are separated for security, isolation, and cost management. However, when observability data is distributed across accounts, the following problems arise:

No unified view: Metrics/logs from multiple accounts must be checked in separate consoles
Difficult correlation analysis: Cross-account service call tracing is impossible
Alert management complexity: Duplicate alert configuration management per account
Reduced operational efficiency: Navigating between multiple accounts to identify root causes during incidents

AWS provides centralized observability through CloudWatch Cross-Account Observability.

6.2 Cross-Account Architecture

+-------------------------------------------------------------+
|                   Monitoring Account                         |
|  +--------------------------------------------------------+ |
|  |         CloudWatch (Centralized View)                   | |
|  |  - Unified metrics/logs/traces from all accounts        | |
|  |  - Unified dashboards and alerts                        | |
|  +--------------------------------------------------------+ |
|                          ^                                   |
|                    OAM Links                                 |
+---------------------------+----------------------------------+
                            |
        +-------------------+-------------------+
        |                   |                   |
+-------v------+  +---------v-----+  +---------v-----+
| Source Acct A |  | Source Acct B  |  | Source Acct C  |
| (EKS Dev)    |  | (EKS Staging) |  | (EKS Prod)    |
|              |  |               |  |               |
| ADOT         |  | ADOT          |  | ADOT          |
| Container    |  | Container     |  | Container     |
| Insights     |  | Insights      |  | Insights      |
+--------------+  +---------------+  +---------------+

6.3 OAM (Observability Access Manager) Configuration

6.3.1 Create Sink in Monitoring Account

# Execute in Monitoring account
aws oam create-sink \
  --name central-observability-sink \
  --tags Key=Environment,Value=production

# Check Sink ARN (used in Source accounts)
SINK_ARN=$(aws oam list-sinks \
  --query 'Items[0].Arn' \
  --output text)

echo $SINK_ARN
# arn:aws:oam:ap-northeast-2:MONITORING_ACCOUNT_ID:sink/sink-id

6.3.2 Sink Policy Configuration (Access Authorization)

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:iam::SOURCE_ACCOUNT_A:root",
          "arn:aws:iam::SOURCE_ACCOUNT_B:root",
          "arn:aws:iam::SOURCE_ACCOUNT_C:root"
        ]
      },
      "Action": [
        "oam:CreateLink",
        "oam:UpdateLink"
      ],
      "Resource": "arn:aws:oam:ap-northeast-2:MONITORING_ACCOUNT_ID:sink/*",
      "Condition": {
        "ForAllValues:StringEquals": {
          "oam:ResourceTypes": [
            "AWS::CloudWatch::Metric",
            "AWS::Logs::LogGroup",
            "AWS::XRay::Trace"
          ]
        }
      }
    }
  ]
}

# Apply Sink Policy
aws oam put-sink-policy \
  --sink-identifier $SINK_ARN \
  --policy file://sink-policy.json
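
The sink policy above lists each source account by hand. As a minimal sketch, the same document could be generated from a list of account IDs so that onboarding an account is a one-line change. The helper name `build_sink_policy` and the example account IDs are illustrative assumptions, not part of any AWS SDK.

```python
import json

# Resource types shared through the link, matching the policy above.
RESOURCE_TYPES = [
    "AWS::CloudWatch::Metric",
    "AWS::Logs::LogGroup",
    "AWS::XRay::Trace",
]


def build_sink_policy(source_account_ids, region, monitoring_account_id):
    """Return the sink policy as a dict (hypothetical helper, not an AWS API)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "AWS": [f"arn:aws:iam::{acct}:root" for acct in source_account_ids]
                },
                "Action": ["oam:CreateLink", "oam:UpdateLink"],
                "Resource": f"arn:aws:oam:{region}:{monitoring_account_id}:sink/*",
                "Condition": {
                    "ForAllValues:StringEquals": {"oam:ResourceTypes": RESOURCE_TYPES}
                },
            }
        ],
    }


if __name__ == "__main__":
    # Example account IDs are placeholders.
    policy = build_sink_policy(
        ["111111111111", "222222222222", "333333333333"],
        region="ap-northeast-2",
        monitoring_account_id="444444444444",
    )
    print(json.dumps(policy, indent=2))
```

Redirecting the printed JSON to sink-policy.json produces the file that the `aws oam put-sink-policy` command above reads.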

6.3.3 Create Link in Source Accounts

# Execute in each Source account A, B, C
aws oam create-link \
  --label-template '$AccountName' \
  --resource-types "AWS::CloudWatch::Metric" \
                   "AWS::Logs::LogGroup" \
                   "AWS::XRay::Trace" \
  --sink-identifier arn:aws:oam:ap-northeast-2:MONITORING_ACCOUNT_ID:sink/sink-id \
  --tags Key=Account,Value=dev

# Check Link status
aws oam list-links \
  --query 'Items[*].[Label,ResourceTypes,SinkArn]' \
  --output table

How OAM Links Work

An OAM Link gives the Monitoring account read access to observability data held in a Source account. The data stays stored in the Source account and is not copied; the Monitoring account queries it in place to present a unified view. The link is a logical connection, not data replication.

6.4 Unified Dashboard Configuration

Configure all accounts' data into a single dashboard from CloudWatch in the Monitoring account.

{
  "widgets": [
    {
      "type": "metric",
      "properties": {
        "metrics": [
          [ { "accountId": "SOURCE_ACCOUNT_A", "expression": "SELECT AVG(pod_cpu_utilization) FROM SCHEMA(\"ContainerInsights\", ClusterName,Namespace,PodName) WHERE ClusterName = 'dev-cluster'" } ],
          [ { "accountId": "SOURCE_ACCOUNT_B", "expression": "SELECT AVG(pod_cpu_utilization) FROM SCHEMA(\"ContainerInsights\", ClusterName,Namespace,PodName) WHERE ClusterName = 'staging-cluster'" } ],
          [ { "accountId": "SOURCE_ACCOUNT_C", "expression": "SELECT AVG(pod_cpu_utilization) FROM SCHEMA(\"ContainerInsights\", ClusterName,Namespace,PodName) WHERE ClusterName = 'prod-cluster'" } ]
        ],
        "view": "timeSeries",
        "region": "ap-northeast-2",
        "title": "Pod CPU Usage Across All Environments",
        "period": 300
      }
    }
  ]
}
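
With more than a few accounts, hand-editing one expression per cluster becomes error-prone. The sketch below generates the same dashboard body from a list of (account ID, cluster name) pairs. The function name `build_cpu_dashboard` and the example account IDs are assumptions for illustration; the query text is copied from the dashboard above.

```python
import json

# Metrics Insights query from the dashboard above; {cluster} is filled per entry.
QUERY_TEMPLATE = (
    "SELECT AVG(pod_cpu_utilization) "
    'FROM SCHEMA("ContainerInsights", ClusterName,Namespace,PodName) '
    "WHERE ClusterName = '{cluster}'"
)


def build_cpu_dashboard(clusters, region="ap-northeast-2"):
    """Build a dashboard body from (account_id, cluster_name) pairs."""
    metrics = [
        [{"accountId": account_id, "expression": QUERY_TEMPLATE.format(cluster=name)}]
        for account_id, name in clusters
    ]
    widget = {
        "type": "metric",
        "properties": {
            "metrics": metrics,
            "view": "timeSeries",
            "region": region,
            "title": "Pod CPU Usage Across All Environments",
            "period": 300,
        },
    }
    return {"widgets": [widget]}


if __name__ == "__main__":
    # Placeholder account IDs and cluster names.
    body = build_cpu_dashboard(
        [
            ("111111111111", "dev-cluster"),
            ("222222222222", "staging-cluster"),
            ("333333333333", "prod-cluster"),
        ]
    )
    print(json.dumps(body, indent=2))
```

The printed JSON can be saved to a file and supplied as the dashboard body when creating the dashboard in the Monitoring account.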

6.5 Cross-Account X-Ray Tracing

Cross-Account X-Ray configuration is needed to trace inter-service calls in multi-account environments.

# Source account ADOT Collector settings
exporters:
  awsxray:
    region: ap-northeast-2
    # Enable Cross-Account tracing
    role_arn: arn:aws:iam::MONITORING_ACCOUNT_ID:role/XRayCrossAccountRole
    indexed_attributes:
      - "aws.account_id"
      - "otel.resource.service.name"

Trust policy for the Monitoring account IAM role (XRayCrossAccountRole):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::SOURCE_ACCOUNT_A:root"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

6.6 Cost Considerations

Cross-Account Observability incurs additional costs for data transfer and storage.

Cost Item	Description	Estimated Monthly Cost (per cluster)
OAM Link	Free (only data transfer costs apply)	$0
Cross-Region transfer	When sending to Monitoring account in a different region	$0.01/GB (~$50-150)
CloudWatch storage	Metrics storage in central account	Same as existing costs
X-Ray traces	Cross-Account trace storage	$5.00/million traces recorded

Cost Optimization Strategies

Same-Region configuration: Place the Monitoring account in the same region as Source accounts to eliminate data transfer costs
Metric filtering: Select only necessary resources when creating OAM Links (e.g., include X-Ray only for production)
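Sampling: Lower the X-Ray sampling rate. The default rule records the first request each second (the reservoir) plus 5% of additional requests; reducing the fixed rate, for example to 1%, lowers trace volume and cost.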
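
To see how much sampling matters, the following sketch estimates monthly trace cost from request rate and sampling settings. It assumes steady traffic, a 30-day month, the $5.00 per million traces price from the table above, and X-Ray's default rule of a 1 request/second reservoir plus a 5% fixed rate. It ignores any free tier, and the function names are illustrative.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 30-day month (assumption)
PRICE_PER_MILLION_TRACES = 5.00     # USD, from the cost table above


def sampled_traces_per_second(requests_per_second, reservoir=1, fixed_rate=0.05):
    """Traces recorded per second under a reservoir + fixed-rate sampling rule."""
    reserved = min(requests_per_second, reservoir)
    remainder = max(requests_per_second - reservoir, 0)
    return reserved + remainder * fixed_rate


def monthly_trace_cost(requests_per_second, reservoir=1, fixed_rate=0.05):
    """Estimated monthly cost in USD for recorded traces."""
    traces = sampled_traces_per_second(requests_per_second, reservoir, fixed_rate)
    return traces * SECONDS_PER_MONTH / 1_000_000 * PRICE_PER_MILLION_TRACES


if __name__ == "__main__":
    rps = 200  # hypothetical steady request rate
    print(f"default rule (5%): ${monthly_trace_cost(rps):.2f}/month")
    print(f"fixed rate 1%:     ${monthly_trace_cost(rps, fixed_rate=0.01):.2f}/month")
```

At a hypothetical 200 requests/second, the two settings differ by roughly a factor of four, which is why the fixed rate is usually the first knob to adjust.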

6.7 Production Operation Patterns

Pattern 1: Environment-Based Account Separation + Centralized Observability

Dev Account (111111111111)
  +-- EKS Cluster: dev-cluster
       +-- OAM Link -> Monitoring Account

Staging Account (222222222222)
  +-- EKS Cluster: staging-cluster
       +-- OAM Link -> Monitoring Account

Prod Account (333333333333)
  +-- EKS Cluster: prod-cluster
       +-- OAM Link -> Monitoring Account

Monitoring Account (444444444444)
  +-- CloudWatch Unified Dashboard
  +-- Unified Alerts (SNS -> Slack)
  +-- X-Ray Service Map (All Environments)

Pattern 2: Team-Based Account Separation + Shared Observability

Team-A Account (Frontend)
Team-B Account (Backend)
Team-C Account (Data)
  +-- Each team's EKS + ADOT
       +-- OAM Link -> Shared Monitoring Account

Shared Monitoring Account
  +-- Per-team filtered dashboards
  +-- Per-team alert routing

7. CloudWatch Container Insights Enhanced

6.1 Enhanced Container Insights Features

On EKS 1.28+, Enhanced Container Insights provides deep observability including Control Plane metrics.

# Install CloudWatch Observability Operator (Helm)
helm install amazon-cloudwatch-observability \
  oci://public.ecr.aws/cloudwatch-agent/amazon-cloudwatch-observability \
  --namespace amazon-cloudwatch --create-namespace \
  --set clusterName=my-cluster \
  --set region=ap-northeast-2 \
  --set containerInsights.enhanced=true \
  --set containerInsights.acceleratedCompute=true

6.2 Collected Metrics Scope

Scope of metrics collected by Enhanced Container Insights:

[Figure: Enhanced Container Insights Metrics Scope (deep observability including the EKS 1.28+ Control Plane)]

6.3 EKS Control Plane Metrics

Control Plane metrics automatically collected on EKS 1.28+ are essential for understanding cluster health.

# Check whether API server control plane logging is enabled (this query inspects the cluster logging configuration)
aws eks describe-cluster \
  --name my-cluster \
  --query 'cluster.logging.clusterLogging[?types[?contains(@, `api`)]]'

Key Control Plane metrics:

API Server: apiserver_request_total, apiserver_request_duration_seconds -- API server load and latency
etcd: etcd_db_total_size_in_bytes, etcd_server_slow_apply_total -- etcd health and performance
Scheduler: scheduler_schedule_attempts_total, scheduler_scheduling_duration_seconds -- Scheduling efficiency
Controller Manager: workqueue_depth, workqueue_adds_total -- Controller queue status

Cost Considerations

Enhanced Container Insights collects a large volume of metrics, which increases CloudWatch costs. Production clusters may incur an additional $50-200/month. It's recommended to use basic Container Insights for dev/staging environments and enable Enhanced only for production.

6.4 Windows Workload Container Insights Support

On August 5, 2025, AWS announced CloudWatch Container Insights for EKS Windows Workloads Monitoring. This is an important development that provides a unified observability experience for EKS clusters running mixed Linux and Windows workloads.

6.4.1 Mixed Cluster Observability Strategy

Many enterprises run legacy .NET Framework applications and new Linux-based microservices on the same EKS cluster. Container Insights' Windows support enables building a single observability platform for such mixed environments.

# Deploy Container Insights Agent to Windows nodes
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cloudwatch-agent-windows
  namespace: amazon-cloudwatch
spec:
  selector:
    matchLabels:
      name: cloudwatch-agent-windows
  template:
    metadata:
      labels:
        name: cloudwatch-agent-windows
    spec:
      nodeSelector:
        kubernetes.io/os: windows
      serviceAccountName: cloudwatch-agent
      containers:
        - name: cloudwatch-agent
          image: public.ecr.aws/cloudwatch-agent/cloudwatch-agent:latest-windows
          env:
            - name: HOST_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
            - name: HOST_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: K8S_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          volumeMounts:
            - name: cwagentconfig
              mountPath: C:\ProgramData\Amazon\CloudWatch\cwagentconfig.json
              subPath: cwagentconfig.json
            - name: rootfs
              mountPath: C:\rootfs
              readOnly: true
      volumes:
        - name: cwagentconfig
          configMap:
            name: cwagent-config-windows
        - name: rootfs
          hostPath:
            path: C:\
            type: Directory

6.4.2 Windows-Specific Metrics

Container Insights collects Windows-specific performance counters and system metrics on Windows nodes:

Metric Category	Key Metrics	Description
.NET CLR	`dotnet_clr_memory_heap_size_bytes`	Managed heap size of .NET applications
	`dotnet_clr_gc_collections_total`	Garbage collection count (Gen 0/1/2)
	`dotnet_clr_exceptions_thrown_total`	Total number of exceptions thrown
IIS	`iis_current_connections`	Active HTTP connection count
	`iis_requests_total`	Total HTTP requests processed
	`iis_request_errors_total`	HTTP error response count (4xx, 5xx)
Windows System	`windows_cpu_processor_utility`	CPU usage (%)
	`windows_memory_available_bytes`	Available memory
	`windows_net_bytes_total`	Network bytes sent/received
Container	`container_memory_working_set_bytes`	Windows container memory working set
	`container_cpu_usage_seconds_total`	Container CPU usage time

# Windows-specific metrics collection configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: cwagent-config-windows
  namespace: amazon-cloudwatch
data:
  cwagentconfig.json: |
    {
      "metrics": {
        "namespace": "ContainerInsights",
        "metrics_collected": {
          "statsd": {
            "service_address": ":8125",
            "metrics_collection_interval": 60,
            "metrics_aggregation_interval": 60
          },
          "Performance Counters": {
            "metrics_collection_interval": 60,
            "counters": [
              {
                "counter_name": "\\Processor(_Total)\\% Processor Time",
                "metric_name": "windows_cpu_processor_utility"
              },
              {
                "counter_name": "\\Memory\\Available Bytes",
                "metric_name": "windows_memory_available_bytes"
              },
              {
                "counter_name": "\\.NET CLR Memory(_Global_)\\# Bytes in all Heaps",
                "metric_name": "dotnet_clr_memory_heap_size_bytes"
              },
              {
                "counter_name": "\\.NET CLR Exceptions(_Global_)\\# of Exceps Thrown",
                "metric_name": "dotnet_clr_exceptions_thrown_total"
              },
              {
                "counter_name": "\\Web Service(_Total)\\Current Connections",
                "metric_name": "iis_current_connections"
              },
              {
                "counter_name": "\\Web Service(_Total)\\Total Method Requests",
                "metric_name": "iis_requests_total"
              }
            ]
          }
        }
      }
    }

6.4.3 Mixed Cluster Dashboard Configuration

Recommended dashboard configuration for unified monitoring of Linux and Windows nodes from the CloudWatch console:

{
  "widgets": [
    {
      "type": "metric",
      "properties": {
        "title": "Cluster CPU Usage (by OS)",
        "metrics": [
          [ "ContainerInsights", "node_cpu_utilization",
            "ClusterName", "my-cluster", "NodeOS", "linux",
            { "stat": "Average", "label": "Linux Nodes" }
          ],
          [ "ContainerInsights", "windows_cpu_processor_utility",
            "ClusterName", "my-cluster", "NodeOS", "windows",
            { "stat": "Average", "label": "Windows Nodes" }
          ]
        ],
        "period": 300,
        "region": "ap-northeast-2"
      }
    },
    {
      "type": "metric",
      "properties": {
        "title": ".NET Application Garbage Collection",
        "metrics": [
          [ "ContainerInsights", "dotnet_clr_gc_collections_total",
            "ClusterName", "my-cluster", "Generation", "0" ],
          [ "...", "1" ],
          [ "...", "2" ]
        ],
        "period": 60
      }
    },
    {
      "type": "log",
      "properties": {
        "title": "Windows Container Error Logs",
        "query": "SOURCE '/aws/containerinsights/my-cluster/application'\n| fields @timestamp, kubernetes.pod_name, log\n| filter kubernetes.host like /windows/\n| filter log like /ERROR|Exception/\n| sort @timestamp desc\n| limit 50",
        "region": "ap-northeast-2"
      }
    }
  ]
}

Core Value of CloudWatch Container Insights Windows Support

CloudWatch Container Insights has officially supported Windows workloads since August 2025. The ability to monitor Linux and Windows nodes in the same dashboard greatly reduces mixed cluster operational complexity. Windows-specific metrics like .NET CLR and IIS performance counters are automatically collected, establishing the observability foundation for Kubernetes migration of legacy .NET Framework applications.

Mixed Cluster Operation Recommendations

Node pool separation: Separate Windows and Linux workloads into distinct node pools (Karpenter NodePool) while monitoring them under the same Container Insights namespace. This allows selecting optimized instance types for each OS while maintaining observability on a single platform.

Alert strategy: Configure Windows-specific metric alerts (e.g., .NET GC Gen 2 frequency increase) and Linux metric alerts separately, but route them to the same SNS topic so the operations team receives all alerts through a single channel.

7. CloudWatch Application Signals

Application Signals automatically generates service maps, SLI/SLO, and call graphs for applications with zero-code instrumentation.

7.1 Supported Languages and Instrumentation Methods

Application Signals Supported Languages

Zero-code instrumentation support status

Language	Instrumentation Method
Java	ADOT Java Agent auto-injection
Python	ADOT Python Auto-instrumentation
.NET	ADOT .NET Auto-instrumentation
Node.js	ADOT Node.js Auto-instrumentation

💡 Zero-code Instrumentation: Define an Instrumentation resource (CRD), then annotate Pods to reference it. The instrumentation agent is injected automatically, and service maps and SLIs/SLOs are generated without code changes.

7.2 Activation Method

# Enable zero-code instrumentation with Instrumentation CRD
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: app-signals
  namespace: my-app
spec:
  exporter:
    endpoint: http://adot-collector.observability:4317
  propagators:
    - tracecontext
    - baggage
    - xray
  java:
    image: public.ecr.aws/aws-observability/adot-autoinstrumentation-java:latest
    env:
      - name: OTEL_AWS_APPLICATION_SIGNALS_ENABLED
        value: "true"
      - name: OTEL_METRICS_EXPORTER
        value: "none"
  python:
    image: public.ecr.aws/aws-observability/adot-autoinstrumentation-python:latest

Adding an annotation to the Pod automatically injects the instrumentation agent:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-java-app
spec:
  template:
    metadata:
      annotations:
        instrumentation.opentelemetry.io/inject-java: "app-signals"
    spec:
      containers:
        - name: app
          image: my-java-app:latest

7.3 Automatic Service Map Generation

When Application Signals is enabled, the following are automatically generated:

Service Map: Visualizes inter-service call relationships, displays error rates/latency
Automatic SLI Configuration: Automatically measures availability (error rate), latency (P99), and throughput
SLO Configuration: Sets targets based on SLIs (e.g., 99.9% availability, P99 < 500ms)
Call Graph: Traces inter-service call paths for individual requests
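
To make the SLI and SLO terms above concrete, here is a small sketch that computes availability and P99 latency from raw request data and checks them against the example targets (99.9% availability, P99 < 500ms). It illustrates the definitions only; it is not how Application Signals computes them internally, and all function names are hypothetical.

```python
def availability(total_requests, error_count):
    """Fraction of requests that did not fail."""
    if total_requests == 0:
        return 1.0
    return 1 - error_count / total_requests


def p99(latencies_ms):
    """99th percentile latency using the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("at least one latency sample is required")
    ordered = sorted(latencies_ms)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) with integer math
    return ordered[rank - 1]


def evaluate_slo(latencies_ms, error_count,
                 availability_target=0.999, p99_target_ms=500):
    """Compare measured SLIs against SLO targets."""
    avail = availability(len(latencies_ms), error_count)
    latency = p99(latencies_ms)
    return {
        "availability": avail,
        "p99_ms": latency,
        "availability_ok": avail >= availability_target,
        "latency_ok": latency < p99_target_ms,
    }


if __name__ == "__main__":
    # Hypothetical window: 10,000 requests, 150 of them slow, 5 failed.
    latencies = [120] * 9850 + [800] * 150
    print(evaluate_slo(latencies, error_count=5))
```

In this made-up window the availability target is met, but the latency SLO is violated because more than 1% of requests were slow.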

Application Signals + DevOps Guru Integration

When DevOps Guru analyzes Application Signals SLI data, service-level anomaly detection becomes possible. For example, you can receive service-context alerts such as "Payment service P99 latency has increased 3x compared to normal."

8. DevOps Guru EKS Integration

Amazon DevOps Guru uses ML to automatically detect operational anomalies and analyze root causes.

8.1 Resource Group Configuration

# Enable DevOps Guru for resources tagged with the EKS cluster name.
# Tag keys used for DevOps Guru resource collections must start with "devops-guru-".
aws devops-guru update-resource-collection \
  --action ADD \
  --resource-collection '{
    "Tags": [
      {
        "AppBoundaryKey": "devops-guru-eks-cluster",
        "TagValues": ["my-cluster"]
      }
    ]
  }'

8.2 How ML Anomaly Detection Works

DevOps Guru's anomaly detection operates in the following stages:

Learning Period (1-2 weeks): Learns normal operational patterns with ML models
Anomaly Detection: Detects metric changes deviating from learned patterns
Correlation Analysis: Groups simultaneously occurring anomalous metrics
Root Cause Inference: Analyzes causal relationships between anomalous metrics
Insight Generation: Sends alerts with recommended actions
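
The first three stages can be illustrated with a deliberately simple statistical stand-in. The sketch below learns a mean and standard deviation per metric, flags values that deviate by three or more standard deviations, and groups metrics that are anomalous in the same snapshot. This is an assumption-laden toy, not DevOps Guru's actual model, whose internals are not described here; the metric names and sample values are invented.

```python
from statistics import mean, stdev


def learn_baseline(history):
    """Learning period: summarize each metric as (mean, standard deviation)."""
    return {name: (mean(values), stdev(values)) for name, values in history.items()}


def detect_anomalies(baseline, current, z_threshold=3.0):
    """Anomaly detection: flag metrics far from their learned baseline."""
    anomalies = {}
    for name, value in current.items():
        mu, sigma = baseline[name]
        if sigma == 0:
            # A metric that never varied: any change counts as a deviation.
            z = 0.0 if value == mu else float("inf")
        else:
            z = abs(value - mu) / sigma
        if z >= z_threshold:
            anomalies[name] = z
    return anomalies


def correlate(anomalies, min_metrics=2):
    """Correlation: group metrics that are anomalous in the same snapshot."""
    if len(anomalies) < min_metrics:
        return None
    return sorted(anomalies, key=anomalies.get, reverse=True)


if __name__ == "__main__":
    history = {  # invented daily samples for one node
        "node_memory_utilization": [62, 65, 70, 68, 72, 66, 74],
        "pod_eviction_count": [0, 0, 0, 0, 0, 0, 0],
        "container_restart_count": [2, 1, 2, 3, 2, 2, 2],
        "node_cpu_utilization": [40, 42, 41, 39, 43, 40, 41],
    }
    current = {
        "node_memory_utilization": 98,
        "pod_eviction_count": 5,
        "container_restart_count": 18,
        "node_cpu_utilization": 42,
    }
    found = detect_anomalies(learn_baseline(history), current)
    print(correlate(found))
```

Root cause inference and recommended actions are the parts that a managed service adds on top of this kind of per-metric detection.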

8.3 Real Anomaly Detection Scenario

Scenario: EKS Node Memory Pressure

[DevOps Guru Insight]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Severity: HIGH
Type: Reactive Anomaly

Related Metrics (Correlation Analysis):
  ✦ node_memory_utilization: 92% → 98% (abnormal increase)
  ✦ pod_eviction_count: 0 → 5 (abnormal increase)
  ✦ container_restart_count: 2 → 18 (abnormal increase)
  ✦ kube_node_status_condition{condition="MemoryPressure"}: 0 → 1

Root Cause Analysis:
  → Memory utilization of node i-0abc123 exceeded the normal range
    (60-75%), entering MemoryPressure state
  → Pods without memory requests set are consuming excessive memory

Recommended Actions:
  1. Identify Pods without memory requests/limits set
  2. Set namespace default limits through LimitRange
  3. Add memory-based scaling configuration to Karpenter NodePool
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

8.4 Cost and Activation Tips

DevOps Guru Cost and Activation

ML Anomaly Detection Service Pricing Structure

Item	Description
Billing Criteria	Based on number of analyzed AWS resources (per hour)
Estimated Cost	~$200/month for 100 resources (100 × 720 hours × $0.0028 per resource-hour)
Free Tier	First 3 months free trial
Activation Recommendation	Enable only on production clusters

8.5 DevOps Guru Cost Structure and Optimization

Understanding Amazon DevOps Guru's billing model accurately allows you to maximize the benefits of ML-based anomaly detection without exceeding budget.

8.5.1 Billing Model Details

DevOps Guru uses a Resource-Hour billing model. This is based on the time AWS resources being analyzed are monitored by DevOps Guru.

Monthly Cost = Number of analyzed resources × Hours × Regional hourly rate

Regional hourly rate (ap-northeast-2):
- $0.0028 per resource-hour

Cost Estimation Examples:

[Scenario 1: Small Production Cluster]
Analyzed resources:
- EKS Cluster: 1
- EC2 Nodes: 10
- RDS Instances: 2
- Lambda Functions: 5
- DynamoDB Tables: 3
- ALB: 2
Total Resources: 23

Monthly Cost:
23 resources × 24 hours × 30 days × $0.0028 = $46.37/month

[Scenario 2: Medium Production Cluster]
Analyzed resources:
- EKS Cluster: 1
- EC2 Nodes: 50
- RDS Instances: 5
- Lambda Functions: 20
- DynamoDB Tables: 10
- ALB/NLB: 5
- ElastiCache: 3
Total Resources: 94

Monthly Cost:
94 resources × 24 hours × 30 days × $0.0028 = $189.50/month

[Scenario 3: Large Production Environment]
Analyzed resources:
- EKS Clusters: 3
- EC2 Nodes: 200
- RDS Instances: 15
- Lambda Functions: 100
- DynamoDB Tables: 30
- Other Resources: 50
Total Resources: 398

Monthly Cost:
398 resources × 24 hours × 30 days × $0.0028 = $802.37/month
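
The billing formula is easy to script for other resource counts. A minimal sketch, assuming the $0.0028 per resource-hour rate quoted above and a 30-day month; the function name is illustrative.

```python
HOURLY_RATE_USD = 0.0028   # per resource-hour, rate quoted above for ap-northeast-2
HOURS_PER_MONTH = 24 * 30  # 30-day month, as in the scenarios above


def monthly_cost(resource_count, hourly_rate=HOURLY_RATE_USD, hours=HOURS_PER_MONTH):
    """Monthly cost = analyzed resources x hours x hourly rate, rounded to cents."""
    return round(resource_count * hours * hourly_rate, 2)


if __name__ == "__main__":
    scenarios = {"small": 23, "medium": 94, "large": 398}
    for name, count in scenarios.items():
        print(f"{name:<6} {count:>4} resources: ${monthly_cost(count):,.2f}/month")
```

Because the cost is linear in the resource count, halving the analyzed resources halves the bill.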

8.5.2 Cost Optimization Strategies

Strategy 1: Selective Activation by Environment

# Enable DevOps Guru only for production environment.
# The tag key must start with "devops-guru-", so tag resources as
# devops-guru-environment=production.
aws devops-guru update-resource-collection \
  --action ADD \
  --resource-collection '{
    "Tags": [
      {
        "AppBoundaryKey": "devops-guru-environment",
        "TagValues": ["production"]
      }
    ]
  }'

# Exclude development/staging environments
# → Can reduce resource count by 60-70%

Strategy 2: CloudFormation Stack-Based Scoping

# Analyze only specific CloudFormation stacks
aws devops-guru update-resource-collection \
  --action ADD \
  --resource-collection '{
    "CloudFormation": {
      "StackNames": [
        "eks-production-cluster",
        "rds-production-database",
        "api-gateway-production"
      ]
    }
  }'

# Advantage: Focus costs on monitoring only core infrastructure
# Expected savings: 40-50%

Strategy 3: Tag-Based Resource Grouping

# Tag strategy example
Resource Type: EKS Node
Tags:
  - Environment: production
  - devops-guru-criticality: high
  - DevOpsGuru: enabled

# DevOps Guru configuration (the tag key must start with "devops-guru-")
aws devops-guru update-resource-collection \
  --action ADD \
  --resource-collection '{
    "Tags": [
      {
        "AppBoundaryKey": "devops-guru-criticality",
        "TagValues": ["high", "critical"]
      }
    ]
  }'

Strategy 4: Priority Setting by Resource Type

[High Priority - Must Monitor]
✓ EKS Cluster (Control Plane)
✓ RDS Instances (Database)
✓ DynamoDB Tables (NoSQL)
✓ ALB/NLB (Traffic Entry)
✓ Lambda (Serverless Functions)

[Medium Priority - Selective Monitoring]
△ EC2 Nodes (Managed by Karpenter)
△ ElastiCache (Cache Layer)
△ S3 Buckets (Storage)

[Low Priority - Can Exclude]
✗ Development environment resources
✗ Test Lambda functions
✗ Temporary EC2 instances

8.5.3 DevOps Guru vs CloudWatch Anomaly Detection Comparison

These two services are optimized for different use cases, and understanding the cost-feature tradeoffs is important.

Item	DevOps Guru	CloudWatch Anomaly Detection
Billing Model	Per resource-hour ($0.0028/resource-hour)	Per metric analysis count ($0.30/thousand metrics)
Analysis Scope	Complex resource correlation analysis	Single metric anomaly detection
Root Cause Analysis	AI-based automatic analysis	Not provided
Learning Period	1-2 weeks	2 weeks
Insight Quality	Very high (multi-layer analysis)	Medium (single metric)
Recommended Scenario	Complex system failure detection	Specific metric threshold detection

Cost Comparison Example:

[Scenario: 50 resources, average 10 metrics per resource]

DevOps Guru:
50 resources × 24 hours × 30 days × $0.0028 = $100.80/month
→ All 500 metrics analyzed, including correlations

CloudWatch Anomaly Detection:
500 metrics × 1,000 analyses/month × ($0.30/1,000) = $150/month
→ Single metrics only, no correlations

[Conclusion]
DevOps Guru offers better value for cost (when complex analysis is needed)
CloudWatch AD is suitable for single metric threshold monitoring
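
The comparison above depends on how many metrics each resource exposes. Under the same pricing assumptions used in this section ($0.0028 per resource-hour, $0.30 per thousand metric analyses, 1,000 analyses per metric per month), the break-even point can be sketched as follows. The function names are illustrative.

```python
HOURS_PER_MONTH = 24 * 30


def devops_guru_cost(resources, hourly_rate=0.0028, hours=HOURS_PER_MONTH):
    """Monthly DevOps Guru cost for a number of analyzed resources."""
    return resources * hours * hourly_rate


def cloudwatch_ad_cost(metrics, analyses_per_metric=1000, price_per_thousand=0.30):
    """Monthly CloudWatch Anomaly Detection cost under this section's assumptions."""
    return metrics * analyses_per_metric / 1000 * price_per_thousand


def break_even_metrics_per_resource(hourly_rate=0.0028, hours=HOURS_PER_MONTH,
                                    analyses_per_metric=1000, price_per_thousand=0.30):
    """Metrics per resource at which both options cost the same."""
    per_resource = hours * hourly_rate
    per_metric = analyses_per_metric / 1000 * price_per_thousand
    return per_resource / per_metric


if __name__ == "__main__":
    print(f"DevOps Guru, 50 resources:  ${devops_guru_cost(50):.2f}")
    print(f"CloudWatch AD, 500 metrics: ${cloudwatch_ad_cost(500):.2f}")
    print(f"Break-even: {break_even_metrics_per_resource():.2f} metrics per resource")
```

With these inputs, DevOps Guru is the cheaper option once a resource has about seven or more metrics under anomaly detection; below that, per-metric detection costs less.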

Feature/Cost Tradeoff Decision Matrix:

Complexity │ Recommended Solution
──────────┼─────────────────────────────────────
Very High  │ DevOps Guru (Full Stack Analysis)
   ↑       │
High       │ DevOps Guru (Core Resources Only)
   │       │
Medium     │ CloudWatch AD + Partial DevOps Guru
   │       │
Low        │ CloudWatch AD (Specific Metrics)
   │       │
Very Low   │ CloudWatch Alarms (Static Thresholds)
   ↓       │
──────────┴─────────────────────────────────────
       Low                              High
              Expected Monthly Cost →

8.5.4 Practical Cost Optimization Cases

Case 1: 80% Cost Reduction Through Phased Adoption

[Before]
- Entire AWS account enabled (500+ resources)
- Monthly cost: $1,008/month

[After - Step-by-Step Optimization]
Phase 1: Enable only production environment
  → Resources: 500 → 150
  → Monthly cost: $302.40/month (70% reduction)

Phase 2: Criticality tag-based filtering
  → Resources: 150 → 80
  → Monthly cost: $161.28/month (84% reduction)

Phase 3: Mixed use with CloudWatch AD
  → DevOps Guru: 50 core resources
  → CloudWatch AD: 30 simple metrics
  → Total cost: $100.80 + $45 = $145.80/month (86% reduction)

Case 2: ROI-Based Justification

[DevOps Guru Cost]
$189.50/month (94 resources)

[Prevented Incident Cases (3 months)]
1. RDS connection pool saturation early detection
   → Prevented downtime: 2 hours
   → Prevented revenue loss: $50,000

2. Lambda cold start surge early warning
   → Prevented performance degradation: 4 hours
   → Prevented customer complaints: Immeasurable

3. DynamoDB read capacity exceeded prediction
   → Prevented service outage: 1 hour
   → Prevented revenue loss: $25,000

[ROI Calculation]
3-month cost: $189.50 × 3 = $568.50
Prevented losses: $75,000+
ROI: ~13,093% ((75,000 − 568.50) ÷ 568.50)
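
The ROI figure follows from the usual formula, (benefit − cost) ÷ cost. A short sketch using the numbers from this case; the prevented-loss values are the estimates stated above, and the function name is illustrative.

```python
def roi_percent(cost, prevented_loss):
    """Return on investment as a percentage: (benefit - cost) / cost * 100."""
    return (prevented_loss - cost) / cost * 100


if __name__ == "__main__":
    monthly_cost = 189.50              # 94 resources, from the case above
    cost = monthly_cost * 3            # three months of DevOps Guru
    prevented_loss = 50_000 + 25_000   # incidents 1 and 3; incident 2 is unquantified
    print(f"3-month cost: ${cost:,.2f}")
    print(f"ROI: {roi_percent(cost, prevented_loss):,.0f}%")
```

The result is sensitive to the loss estimates, so it is worth recomputing with conservative figures before using it in a budget discussion.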

Cost Monitoring is Essential

DevOps Guru costs scale linearly with the number of resources. Check "DevOps Guru" service costs weekly in AWS Cost Explorer, and immediately apply tag filtering or stack-based scope adjustments if costs exceed expectations. Especially in environments where resources dynamically increase through Auto Scaling, you should estimate costs based on the maximum resource count.

Recommended Strategies by Scenario

DevOps Guru Usage Recommendations by Scenario:

When complex anomaly detection is needed → DevOps Guru (Full Stack)
- Example: Correlation analysis of "RDS connection count increase + Lambda timeout increase + API Gateway 5xx increase"
Single metric threshold monitoring → CloudWatch Anomaly Detection
- Example: "CPU utilization is higher than usual" (unrelated to other metrics)
When budget constraints exist → DevOps Guru for core resources only + CloudWatch Alarms for the rest
- Example: DevOps Guru only for production RDS + EKS control plane
Initial adoption phase → Utilize the 1-month free trial, enable fully and evaluate insight quality
- After 1 month, measure value vs. cost and adjust scope

8.5.5 Cost Alert Configuration

# Set up DevOps Guru cost alerts with AWS Budgets
aws budgets create-budget \
  --account-id ACCOUNT_ID \
  --budget '{
    "BudgetName": "DevOpsGuru-Monthly-Budget",
    "BudgetLimit": {
      "Amount": "200",
      "Unit": "USD"
    },
    "TimeUnit": "MONTHLY",
    "BudgetType": "COST",
    "CostFilters": {
      "Service": ["Amazon DevOps Guru"]
    }
  }' \
  --notifications-with-subscribers '[
    {
      "Notification": {
        "NotificationType": "ACTUAL",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": 80
      },
      "Subscribers": [
        {
          "SubscriptionType": "EMAIL",
          "Address": "ops-team@example.com"
        }
      ]
    }
  ]'

7.5 GuardDuty Extended Threat Detection — EKS Security Observability

Amazon GuardDuty Extended Threat Detection started with EKS support in June 2025, then expanded to EC2 and ECS in December, establishing a new standard for container security observability. AI/ML-based multi-stage attack detection goes beyond the limitations of traditional security monitoring.

7.5.1 Announcement History and Expansion

June 17, 2025 - EKS Support Announcement:

Correlates EKS audit logs, runtime behavior, malware execution, and AWS API activity
Integrates with EKS Runtime Monitoring for container-level threat detection

December 2, 2025 - EC2, ECS Expansion:

Extended Threat Detection expanded to EC2 instances and ECS tasks
Evolved into a unified threat detection platform

7.5.2 Core Features

AI/ML-Based Multi-Stage Attack Detection:

Attack Sequence Findings: Automatically identifies attack sequences spanning multiple resources and data sources
Correlation Analysis Engine: Unified analysis of EKS audit logs + runtime behavior + malware execution + API activity
Automatic Critical Severity Classification: Distinguishes real threats from false positives, highlighting only Critical threats
Dramatically Reduced Initial Analysis Time: 90%+ time savings compared to manual log analysis

EKS-Specific Detection Patterns:

[Detection Scenario 1: Cryptomining Attack]
→ Abnormal container image pull (external registry)
→ High CPU utilization Pod execution
→ Outbound connection to known mining pool
→ Abnormal authentication attempts against API server
→ GuardDuty connects these 4 stages to generate an Attack Sequence Finding

[Detection Scenario 2: Privilege Escalation]
→ Abnormal ServiceAccount token access
→ ClusterRole binding modification attempt
→ Mass Secrets query
→ New privileged Pod creation
→ Automatically classified as Critical severity, immediate alert

7.5.3 Real Case: November 2025 Cryptomining Campaign Detection

This is a real threat detection case documented in detail on the AWS Security Blog (November 2025):

Attack Scenario:

[Started 2025-11-02]
1. Attacker infiltrated EKS worker node through exposed Docker API
2. Deployed cryptomining workload with normal-looking container names
3. Ran without CPU resource limits, exhausting node resources
4. Maintained outbound connections to mining pools

[GuardDuty Extended Threat Detection Discovery]
→ Runtime Monitoring detected abnormal CPU patterns
→ Network analysis identified connections to known mining pools
→ Audit Log analysis confirmed unauthorized container creation
→ Attack Sequence Finding generated (Critical severity)
→ Less than 15 minutes from detection to alert

[Result]
→ Automatic isolation action (Lambda + EventBridge)
→ Immediate replacement of affected nodes (Karpenter)
→ Prevention of recurrence: Network Policy + Pod Security Standards hardening (PodSecurityPolicy was removed in Kubernetes 1.25)

Lessons from Real Threats

This cryptomining campaign targeted hundreds of AWS accounts. Without GuardDuty Extended Threat Detection, most organizations would not have been aware of the attack until receiving their end-of-month bill. Security observability is the first step in cost optimization.

7.5.4 Observability Stack Integration

GuardDuty Extended Threat Detection integrates seamlessly with the existing observability stack:

CloudWatch Integration Example:

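# Query GuardDuty finding counts in CloudWatch
# Assumption: finding counts are published to CloudWatch as a custom metric
# (for example by an EventBridge rule). The namespace, metric name, and
# dimension below are placeholders to adapt to that setup.
aws cloudwatch get-metric-statistics \
  --namespace Custom/GuardDuty \
  --metric-name FindingCount \
  --dimensions Name=Severity,Value=CRITICAL \
  --start-time 2026-02-01T00:00:00Z \
  --end-time 2026-02-12T23:59:59Z \
  --period 3600 \
  --statistics Sum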

# Automatic connection with CloudWatch Investigations
# GuardDuty Finding → Investigation auto-created → Root cause analysis

7.5.5 Activation Configuration

Step 1: Enable GuardDuty EKS Runtime Monitoring

# Enable EKS Runtime Monitoring with automated agent management
# (EKS_ADDON_MANAGEMENT is an additional configuration of the feature, not a feature itself)
aws guardduty update-detector \
  --detector-id <detector-id> \
  --features '[{
    "Name": "EKS_RUNTIME_MONITORING",
    "Status": "ENABLED",
    "AdditionalConfiguration": [
      {"Name": "EKS_ADDON_MANAGEMENT", "Status": "ENABLED"}
    ]
  }]'

# With EKS_ADDON_MANAGEMENT enabled, GuardDuty deploys its security agent as an EKS add-on.
# Verify the add-on on the cluster:
aws eks describe-addon \
  --cluster-name my-cluster \
  --addon-name aws-guardduty-agent

Step 2: Enable Extended Threat Detection (Automatic)

# Extended Threat Detection is automatically enabled when EKS Runtime Monitoring is activated
# No additional cost, no API call required

# Verify
aws guardduty get-detector --detector-id <detector-id> \
  --query 'Features[?Name==`EKS_RUNTIME_MONITORING`].Status' \
  --output text

Step 3: Configure EventBridge Automated Response

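# GuardDuty Finding → Automatic Isolation
# Schematic rule definition: the apiVersion/kind below are illustrative only.
# Create the actual rule with your IaC tool (CloudFormation, Terraform, or similar).
apiVersion: events.amazonaws.com/v1
kind: Rule
metadata:
  name: guardduty-critical-finding
spec:
  eventPattern:
    source:
      - aws.guardduty
    detail-type:
      - GuardDuty Finding
    detail:
      severity:
        # Numeric matching also catches fractional severities such as 8.5
        - numeric: [">=", 7]  # HIGH (7.0-8.9), CRITICAL (9.0-10.0)
      resource:
        resourceType:
          - EKSCluster
  targets:
    - arn: arn:aws:lambda:ap-northeast-2:ACCOUNT_ID:function:isolate-pod
    - arn: arn:aws:sns:ap-northeast-2:ACCOUNT_ID:security-alerts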

GuardDuty Extended Threat Detection Prerequisites

Extended Threat Detection's full threat detection capabilities only work when EKS Runtime Monitoring is enabled. Without Runtime Monitoring, Attack Sequence Findings cannot be generated, and only simple API-based detection is available.
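As an illustration of what the isolate-pod Lambda target from Step 3 might do, the sketch below turns a finding event into a deny-all NetworkPolicy manifest. It is not the author's implementation. The event field path (detail.resource.kubernetesDetails.kubernetesWorkloadDetails), the quarantine label, and the policy name are assumptions to verify against real findings in your account. Applying the manifest to the cluster is left out.

```python
from typing import Optional

SEVERITY_THRESHOLD = 7.0  # HIGH and above, matching the rule in Step 3


def build_quarantine_policy(event: dict) -> Optional[dict]:
    """Return a deny-all NetworkPolicy for the affected namespace, or None.

    Field names under "detail" are assumed; check them against a real finding.
    """
    detail = event.get("detail", {})
    if float(detail.get("severity", 0)) < SEVERITY_THRESHOLD:
        return None

    workload = (
        detail.get("resource", {})
        .get("kubernetesDetails", {})
        .get("kubernetesWorkloadDetails", {})
    )
    namespace = workload.get("namespace")
    if not namespace:
        return None

    # Selecting pods by label while listing both policy types with no
    # ingress/egress rules blocks all traffic for the selected pods.
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "quarantine-deny-all", "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": {"quarantine": "true"}},  # hypothetical label
            "policyTypes": ["Ingress", "Egress"],
        },
    }


sample_event = {
    "detail": {
        "severity": 8,
        "resource": {
            "kubernetesDetails": {
                "kubernetesWorkloadDetails": {"name": "miner", "namespace": "payments"}
            }
        },
    }
}
policy = build_quarantine_policy(sample_event)
print(policy["metadata"])
```

Keeping the manifest construction a pure function makes the response logic testable without cluster access. The handler would then label the affected Pod and apply the policy.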

7.5.6 Cost Structure

GuardDuty EKS Runtime Monitoring:

Billed per vCPU-hour: $0.008/vCPU-hour
Estimated cost for 30 days, 100 vCPU cluster: ~$576/month

Extended Threat Detection:

No additional cost when Runtime Monitoring is enabled
Attack Sequence Finding generation automatically included

ROI Analysis:

[Cryptomining Attack Prevention Case]
GuardDuty cost: $576/month
Blocked mining cost: $3,456/month (100 vCPU × 24 hours × 30 days × $0.096/vCPU-hr × 50% utilization)
Net savings: $2,880/month
ROI: 500%

MCP Integration: Security Observability Automation

GuardDuty Findings can be queried directly from Kiro and Q Developer through the CloudWatch MCP server:

[Kiro + MCP Security Automation]
Kiro: "Are there any current Critical security threats?"
→ CloudWatch MCP: get_guardduty_findings(severity="CRITICAL")
→ Finding: "Unauthorized Pod creation from external IP"
→ Kiro: Automatically creates Network Policy + Pod isolation + incident report

This is the fully automated loop of Observe → Analyze → Respond.

9. CloudWatch AI Natural Language Query + Investigations

9.1 CloudWatch AI Natural Language Query

CloudWatch AI NL Query is a feature that allows you to analyze metrics and logs using natural language. You can ask questions in natural language without knowing PromQL or CloudWatch Logs Insights query syntax.

Actual Query Examples:

# Natural Language Query → Automatic Conversion

Question: "Which EKS nodes had CPU utilization exceeding 80% in the last hour?"
→ Automatically generates CloudWatch Metrics Insights query

Question: "What time period had the most 5xx errors in payment-service?"
→ Automatically generates CloudWatch Logs Insights query

Question: "Which services have slower API response times today compared to yesterday?"
→ Automatically generates comparison analysis query

9.2 CloudWatch Investigations

CloudWatch Investigations is an AI-based root cause analysis tool that automatically collects and analyzes related metrics, logs, and traces when an alert occurs.

Analysis Process:

Alert Trigger: CloudWatch Alarm or DevOps Guru insight occurs
Context Collection: Automatically collects related metrics, logs, traces, and configuration change history
AI Analysis: AI analyzes collected data to infer root causes
Timeline Generation: Organizes event occurrence order by time period
Recommended Actions: Presents specific resolution approaches

[CloudWatch Investigation Result]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Investigation Summary: payment-service latency increase

Timeline:
  14:23 - RDS connection pool utilization surged (70% → 95%)
  14:25 - payment-service P99 latency 500ms → 2.3s
  14:27 - Downstream order-service also started being affected
  14:30 - CloudWatch Alarm triggered

Root Cause:
  Connection count of RDS instance (db.r5.large) approached
  max_connections, delaying new connection creation

Recommended Actions:
  1. Upgrade RDS instance class or adjust max_connections
  2. Optimize connection pooling library (HikariCP/PgBouncer) settings
  3. Consider introducing RDS Proxy
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Investigation + Hosted MCP

CloudWatch Investigations results can be queried directly from Kiro through the Hosted MCP server. "Are there any ongoing Investigations?" → MCP returns Investigation status → Kiro automatically generates response code. This is the complete loop of AI Analysis → Automated Response.

9.1.3 Regional Availability and Cross-Region Considerations

CloudWatch AI Natural Language Query has been available in 10 regions since its GA in August 2025. Regional constraints affect latency, data residency, and cost, so check them before adopting the feature.

Supported Regions (as of August 2025):

Region Code	Region Name	Query Processing Location
`us-east-1`	US East (N. Virginia)	Local
`us-east-2`	US East (Ohio)	Local
`us-west-2`	US West (Oregon)	Local
`ap-southeast-1`	Asia Pacific (Singapore)	Local
`ap-southeast-2`	Asia Pacific (Sydney)	Local
`ap-northeast-1`	Asia Pacific (Tokyo)	Local
`ap-east-1`	Asia Pacific (Hong Kong)	Cross-Region (US)
`eu-central-1`	Europe (Frankfurt)	Local
`eu-west-1`	Europe (Ireland)	Local
`eu-north-1`	Europe (Stockholm)	Local

Cross-Region Prompt Processing

When using natural language queries in the Hong Kong (ap-east-1) region, Cross-Region calls to the US region occur for prompt processing. This means:

Increased query response time (network latency)
Prompt text is transmitted across region boundaries (data residency considerations needed)
Possible Cross-Region data transfer costs

If you have data residency requirements: Use direct CloudWatch Logs Insights query syntax instead of natural language queries in the Hong Kong region.

Alternative Approaches for Unsupported Regions:

# When querying from an unsupported region (e.g., ap-northeast-2, Seoul)

# Natural language query not available
# "Generate query" button does not appear in CloudWatch console

# Alternative 1: Generate query in a supported region's console and copy
# 1. Generate query with natural language in the us-west-2 console
# 2. Copy the generated Logs Insights query
# 3. Run the query directly in the ap-northeast-2 console

# Alternative 2: Query metrics directly via AWS CLI (metrics only)
# cluster_failed_node_count is a Container Insights metric, so it lives in the
# ContainerInsights namespace and needs the ClusterName dimension
aws cloudwatch get-metric-statistics \
  --region ap-northeast-2 \
  --namespace ContainerInsights \
  --metric-name cluster_failed_node_count \
  --dimensions Name=ClusterName,Value=my-cluster \
  --start-time 2026-02-01T00:00:00Z \
  --end-time 2026-02-12T23:59:59Z \
  --period 300 \
  --statistics Average

# Alternative 3: Direct CloudWatch Metrics Insights query (local execution)
SELECT AVG(cluster_failed_node_count)
FROM SCHEMA("ContainerInsights", ClusterName)
WHERE ClusterName = 'my-cluster'

Considerations for Cross-Region Metric Analysis:

# Scenario: Multi-region EKS cluster unified monitoring

# Incorrect approach (inefficient)
# Accessing each region's console individually to run natural language queries
# → Time-consuming, no unified view

# Correct approach
# 1. Select a central region (e.g., us-west-2)
# 2. Create a monitoring sink with CloudWatch Observability Access Manager (OAM)
aws oam create-sink \
  --name central-monitoring-sink \
  --region us-west-2

# 3. Allow source accounts to link to the sink
#    (use the sink ARN returned by create-sink; SINK_ID is a placeholder)
aws oam put-sink-policy \
  --region us-west-2 \
  --sink-identifier arn:aws:oam:us-west-2:ACCOUNT_ID:sink/SINK_ID \
  --policy '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {"AWS": "ACCOUNT_ID"},
      "Action": ["oam:CreateLink","oam:UpdateLink"],
      "Resource": "*"
    }]
  }'

# 4. Connect sources
#    Caveat: OAM sinks and links are regional resources. Confirm in the OAM
#    documentation whether a link can target a sink in another region before
#    relying on this loop; otherwise create one sink per region.
for region in ap-northeast-2 eu-central-1 us-east-1; do
  aws oam create-link \
    --region "$region" \
    --label-template '$AccountName' \
    --resource-types "AWS::CloudWatch::Metric" "AWS::Logs::LogGroup" \
    --sink-identifier arn:aws:oam:us-west-2:ACCOUNT_ID:sink/SINK_ID
done

# 5. Run unified natural language queries from the us-west-2 console
# "Show me all EKS clusters with high CPU across all regions"

Cost Structure:

Item	Billing Method	Estimated Cost
Natural language query generation	Per query	$0.01/query (first 1,000 free)
Logs Insights execution	Based on data scanned	$0.005/GB scanned
Cross-Region data transfer	Per GB	$0.02/GB (inter-region)
Cross-Region Observability	No additional cost	-

Actual Cost Example:

[Monthly Usage Pattern]
- Natural language queries: 500 (within first 1,000 free)
- Logs Insights scans: 100GB
- Cross-Region transfer: 10GB (unified monitoring)

[Monthly Cost]
Natural language queries: $0
Logs Insights: 100GB × $0.005 = $0.50
Cross-Region transfer: 10GB × $0.02 = $0.20
Total: $0.70/month
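The same calculation as a small Python function, using the rates from the cost table above as defaults. The function and parameter names are hypothetical, and the rates should be checked against current pricing.

```python
def nl_query_monthly_cost(
    queries: int,
    scanned_gb: float,
    transfer_gb: float,
    free_queries: int = 1000,
    per_query: float = 0.01,
    per_gb_scanned: float = 0.005,
    per_gb_transfer: float = 0.02,
) -> float:
    """Monthly cost in USD for natural language queries, scans, and transfer."""
    billable_queries = max(0, queries - free_queries)
    total = (
        billable_queries * per_query
        + scanned_gb * per_gb_scanned
        + transfer_gb * per_gb_transfer
    )
    return round(total, 2)


# Usage pattern from the example above: 500 queries, 100 GB scanned, 10 GB transferred
print(nl_query_monthly_cost(queries=500, scanned_gb=100, transfer_gb=10))
```

Raising the query count past the free tier shows how quickly the per-query charge starts to dominate the total.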

Region Selection Strategy

Production Environment Recommendations:

If the primary region is a supported region: Use natural language queries locally
If the primary region is an unsupported region:
- Development/Test: Generate queries in a supported region's console and copy
- Production: Centralize with CloudWatch Cross-Region Observability
If data residency requirements exist: Do not use natural language queries, use direct query syntax
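The recommendations above can be written as a small decision function. This is a sketch with hypothetical names. The region sets come from the supported-regions table earlier in this section and will go stale as availability changes.

```python
# Regions from the supported-regions table (as of August 2025)
LOCAL_REGIONS = {
    "us-east-1", "us-east-2", "us-west-2",
    "ap-southeast-1", "ap-southeast-2", "ap-northeast-1",
    "eu-central-1", "eu-west-1", "eu-north-1",
}
CROSS_REGION_PROCESSING = {"ap-east-1"}  # prompts are processed in a US region


def nl_query_strategy(region: str, data_residency: bool, production: bool) -> str:
    """Pick a query strategy following the recommendations above."""
    if data_residency:
        return "direct-query-syntax"
    if region in LOCAL_REGIONS:
        return "local-natural-language"
    if region in CROSS_REGION_PROCESSING:
        return "natural-language-with-cross-region-processing"
    if production:
        return "centralize-with-cross-region-observability"
    return "generate-in-supported-region-and-copy"


print(nl_query_strategy("ap-northeast-2", data_residency=False, production=True))
```

The data residency check comes first because it overrides every other consideration in the list.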

Future Outlook:

AWS continues to expand regional availability for CloudWatch AI Natural Language Query, and regions that are currently unsupported, such as Seoul (ap-northeast-2), may gain local support over time. For the latest regional availability, refer to the official AWS documentation.

10. MCP Server-Based Unified Analysis

10.1 Changes MCP Brings to Observability

Previously, you had to open the CloudWatch console, Grafana dashboards, and X-Ray console separately to diagnose issues. With AWS MCP servers (50+ individual servers, offered either as locally run GA releases or as Fully Managed previews), you can query all observability data from Kiro or Q Developer through one interface.

10.2 EKS MCP Server Tools

Key tools provided by the EKS MCP server:

EKS MCP Server Tools

EKS integration tools available in Kiro/Q Developer

Tool	Function	Use Case Scenario
get_cluster_status	Query overall cluster status	Regular health checks
list_pods	Pod list and status	Identify failing Pods
get_pod_logs	Query Pod logs	Error log analysis
describe_node	Node detailed information	Diagnose node resource issues
get_events	Query K8s events	Recent event analysis
list_deployments	Deployment status	Check deployment status

💡 Unified Analysis: Query CloudWatch, X-Ray, and EKS API through a single MCP interface. AI agents automatically analyze root causes without switching between multiple consoles.

10.3 Unified Analysis Scenario

Scenario: Report that "payment-service is slow"

Process of unified analysis through MCP in Kiro:

[Kiro + MCP Unified Analysis]

1. EKS MCP: list_pods(namespace="payment") → 3/3 Running, 0 Restarts ✓
2. EKS MCP: get_pod_logs(pod="payment-xxx", tail=100) → Multiple DB timeout errors
3. CloudWatch MCP: query_metrics("RDSConnections") → Connection count at 98%
4. CloudWatch MCP: get_insights(service="payment") → DevOps Guru insight exists
5. CloudWatch MCP: get_investigation("INV-xxxx") → RDS connection pool saturation confirmed

→ Kiro automatically:
   - Generates RDS Proxy adoption IaC code
   - Creates HikariCP connection pool optimization PR
   - Adjusts Karpenter NodePool (memory-based scaling)

Operational Insights Based on Diverse Data Sources

The core value of MCP is unifying multiple data sources into a single interface. AI agents can access CloudWatch metrics, X-Ray traces, EKS API, and DevOps Guru insights all at once, enabling faster and more accurate diagnostics than manually navigating between multiple consoles.

10.4 Programmatic Observability Automation

Observability through MCP enables programmatic automation:

[Directing Approach] - Manual, repetitive
  "Open CloudWatch console and check payment-service metrics"
  → "Find traces for that time period in X-Ray"
  → "Also check RDS metrics"
  → "So what's the cause?"

[Programmatic Approach] - Automated, systematic
  Kiro Spec: "Automatically diagnose when payment-service latency is abnormal"
  → MCP queries CloudWatch + X-Ray + EKS API in unified manner
  → AI analyzes root cause
  → Automatically generates fix code + PR

11. Alert Optimization and SLO/SLI

11.1 Alert Fatigue Problem

Alert fatigue is a serious operational issue in EKS environments:

Average EKS cluster: 50-200 alerts per day
Alerts actually requiring action: 10-15% of total
Alert Fatigue result: Important alerts ignored, delayed incident response

11.2 SLO-Based Alert Strategy

Configuring alerts based on SLO (Service Level Objectives) can significantly reduce Alert Fatigue.

# SLO-based alert example - Based on Error Budget burn rate
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payment-service-slo
spec:
  groups:
    - name: slo.payment-service
      rules:
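        # SLI: Error rate over a short (5m) and a long (1h) window
        - record: sli:payment_error_rate:5m
          expr: |
            sum(rate(http_requests_total{service="payment",status=~"5.."}[5m]))
            / sum(rate(http_requests_total{service="payment"}[5m]))
        - record: sli:payment_error_rate:1h
          expr: |
            sum(rate(http_requests_total{service="payment",status=~"5.."}[1h]))
            / sum(rate(http_requests_total{service="payment"}[1h]))

        # Error Budget burn rate: 14.4x over 1 hour, confirmed by the 5-minute window
        - alert: PaymentErrorBudgetBurn
          expr: |
            sli:payment_error_rate:1h > (1 - 0.999) * 14.4
            and
            sli:payment_error_rate:5m > (1 - 0.999) * 14.4
          for: 5m
          labels:
            severity: critical
            service: payment
          annotations:
            summary: "Payment service Error Budget burning rapidly"
            description: "Current error rate is burning Error Budget at 14.4x speed (1-hour window)"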

11.3 Error Budget Concept

Error Budget Concept

SLO-based acceptable error rate and downtime

SLO	Monthly Error Budget	Allowed Downtime
99.9%	0.1%	43.2 min
99.95%	0.05%	21.6 min
99.99%	0.01%	4.32 min

💡 Error Budget-based Alerts: Alerting based on Error Budget burn rate instead of simple thresholds can reduce Alert Fatigue by 70%.
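The numbers in the Error Budget table and the 14.4x burn rate used in the alert rule can be derived as follows. This is a minimal sketch that assumes a 30-day month; the function names are hypothetical.

```python
def allowed_downtime_minutes(slo: float, days: int = 30) -> float:
    """Minutes of full downtime the error budget permits over the period."""
    return (1 - slo) * days * 24 * 60


def budget_consumed(burn_rate: float, window_hours: float, days: int = 30) -> float:
    """Fraction of the period's error budget used at a given burn rate."""
    return burn_rate * window_hours / (days * 24)


def error_rate_threshold(slo: float, burn_rate: float) -> float:
    """Error rate at which the budget burns `burn_rate` times faster than allowed."""
    return (1 - slo) * burn_rate


for slo in (0.999, 0.9995, 0.9999):
    print(f"SLO {slo:.2%}: {allowed_downtime_minutes(slo):.2f} min allowed")

# A 14.4x burn sustained for one hour uses 2% of a 30-day budget
print(f"{budget_consumed(14.4, 1):.1%} of budget consumed")
print(f"alert threshold: {error_rate_threshold(0.999, 14.4):.4f}")
```

This makes the choice of 14.4 concrete: at that rate the whole monthly budget would be gone in about two days, which is why the alert pages rather than opening a ticket.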

11.4 CloudWatch Composite Alarms

Logically combine multiple alarms to reduce noise.

# Composite Alarm: Alert only when both CPU AND Memory are simultaneously high
aws cloudwatch put-composite-alarm \
  --alarm-name "EKS-Node-Resource-Pressure" \
  --alarm-rule 'ALARM("EKS-Node-HighCPU") AND ALARM("EKS-Node-HighMemory")' \
  --alarm-actions "arn:aws:sns:ap-northeast-2:ACCOUNT_ID:ops-team" \
  --alarm-description "Alert only when node CPU and memory are simultaneously high"

📊 Observability Services Comparison

AWS Native vs Managed OSS vs AI Services

Service	Type	Cost Model	Best For
AMP	Managed OSS	Based on ingested metrics	Long-term storage of Prometheus-compatible metrics
AMG	Managed OSS	Based on users/workspaces	Unified dashboards + alerts
CloudWatch	AWS Native	Based on logs/metrics/requests	Integrated AWS service monitoring
X-Ray	AWS Native	Based on trace sampling	Distributed tracing
DevOps Guru	AWS AI	Based on analyzed resources	ML anomaly detection
Application Signals	AWS Native	Included in CloudWatch pricing	zero-code APM

11.5 Alert Optimization Checklist

Alert Optimization Checklist

Strategies and Effects for Solving Alert Fatigue

Item	Strategy	Expected Effect
SLO-based Alerts	Alert based on Error Budget burn rate	70% reduction in alert volume
Composite Alarms	Filter noise with composite conditions	50% reduction in false positives
DevOps Guru	ML auto-detects normal/anomalous patterns	80% reduction in false positives after learning
Alert Routing	Separate channels by severity (PagerDuty, Slack)	40% faster response time
Auto-Remediation	Alert → EventBridge → Lambda auto-response	60% reduction in manual intervention

💡 Alert Fatigue Problem: A typical EKS cluster generates 50-200 alerts per day, but only 10-15% require actual action. Combining SLO-based alerts with ML anomaly detection can significantly reduce noise.

11.6 Cost-Optimized Log Pipeline

EKS clusters generate tens to hundreds of GB of logs per day. CloudWatch Logs is convenient but costs can accumulate easily. This section covers strategies to optimize log costs while maintaining analysis capabilities.

11.6.1 CloudWatch Logs Cost Structure

Cost Item	Price (ap-northeast-2)	Example Cost (50-node cluster)
Ingestion - Standard	$0.50/GB	100GB/day → $1,500/month
Ingestion - Infrequent Access	$0.25/GB	100GB/day → $750/month
Storage (both classes)	$0.03/GB/month	30-day retention → $90/month
Analysis (Insights queries)	$0.005/GB scanned	10 queries/day → $150/month

Problem:

CloudWatch Logs cost for production EKS cluster: $1,500-3,000/month
Most logs are never queried (over 90%)
S3 is 10x+ cheaper for long-term log retention
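To see where the monthly bill comes from, the Standard-class rates above can be put into a simple cost model. This is a sketch: the function name is hypothetical, and the 100 GB scanned per query is an assumption chosen so that 10 queries per day match the $150/month example.

```python
def cloudwatch_logs_monthly_cost(
    gb_per_day: float,
    retention_days: int = 30,
    queries_per_day: int = 10,
    gb_scanned_per_query: float = 100.0,  # assumption, see lead-in
    ingest_per_gb: float = 0.50,
    storage_per_gb_month: float = 0.03,
    scan_per_gb: float = 0.005,
    days_in_month: int = 30,
) -> dict:
    """Break the monthly CloudWatch Logs cost into its three components."""
    ingestion = gb_per_day * days_in_month * ingest_per_gb
    storage = gb_per_day * retention_days * storage_per_gb_month
    analysis = queries_per_day * days_in_month * gb_scanned_per_query * scan_per_gb
    return {
        "ingestion": round(ingestion, 2),
        "storage": round(storage, 2),
        "analysis": round(analysis, 2),
        "total": round(ingestion + storage + analysis, 2),
    }


print(cloudwatch_logs_monthly_cost(gb_per_day=100))
```

Ingestion dominates the total, so reducing volume before it reaches CloudWatch usually saves more than tuning retention.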

11.6.2 CloudWatch Logs Infrequent Access Class

In November 2023, AWS announced the Infrequent Access log class. It allows storing infrequently queried logs at a lower cost.

# The log class is chosen when a log group is created and cannot be changed afterwards,
# so create a new log group in the Infrequent Access class
# (the "-ia" suffix is only an example name)
aws logs create-log-group \
  --log-group-name /eks/my-cluster/application-ia \
  --log-group-class INFREQUENT_ACCESS

# Set the retention period
aws logs put-retention-policy \
  --log-group-name /eks/my-cluster/application-ia \
  --retention-in-days 30

Infrequent Access Class Characteristics:

Characteristic	Standard	Infrequent Access
Ingestion Cost	$0.50/GB	$0.25/GB (50% reduction)
Storage Cost	$0.03/GB/month	$0.03/GB/month (same)
Query Cost	$0.005/GB scanned	$0.005/GB scanned (same)
Minimum Retention Period	None	None
Suitable Scenario	Real-time analysis	Audit, compliance

Infrequent Access Utilization Strategy

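2-Tier Log Strategy:

Logs needed for real-time troubleshooting: Standard class log group with short retention (for example, 7 days)
Logs kept mainly for audit or compliance: Infrequent Access class log group with longer retention (for example, 90 days)

Because the log class is fixed per log group, the split is made by routing at the collector, not by log age. This approach roughly halves the ingestion cost of the logs sent to the Infrequent Access group while still allowing fast querying of the Standard group.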

11.6.3 S3 + Athena-Based Long-Term Log Analysis

For long-term retention beyond 90 days, configure a CloudWatch Logs → S3 → Athena pipeline.

# CloudWatch Logs Export to S3 (EventBridge-based automation)
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  LogExportBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: eks-logs-archive
      LifecycleConfiguration:
        Rules:
          - Id: TransitionToIA
            Status: Enabled
            Transitions:
              - TransitionInDays: 30
                StorageClass: STANDARD_IA
              - TransitionInDays: 90
                StorageClass: GLACIER_IR
      VersioningConfiguration:
        Status: Enabled

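  LogExportRole:
    Type: AWS::IAM::Role
    Properties:
      # CloudFormation expects a trust policy document here (AssumedBy is a CDK property)
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: logs.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: S3WriteAccess
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - s3:PutObject
                Resource: !Sub '${LogExportBucket.Arn}/*'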

  DailyExportRule:
    Type: AWS::Events::Rule
    Properties:
      ScheduleExpression: 'cron(0 1 * * ? *)'  # Daily at 1:00 AM
      State: ENABLED
      Targets:
        - Arn: !GetAtt ExportLambda.Arn
          Id: TriggerExport

  ExportLambda:
    Type: AWS::Lambda::Function
    Properties:
      Runtime: python3.11
      Handler: index.handler
      Role: !GetAtt LambdaExecutionRole.Arn
      Code:
        ZipFile: |
          import boto3
          from datetime import datetime, timedelta, timezone

          logs = boto3.client('logs')

          def handler(event, context):
              log_group_name = '/eks/my-cluster/application'
              destination_bucket = 'eks-logs-archive'

              # Yesterday's date range in UTC (midnight to just before the next midnight)
              today = datetime.now(timezone.utc).replace(hour=0, minute=0, second=0, microsecond=0)
              yesterday = today - timedelta(days=1)
              start_time = int(yesterday.timestamp() * 1000)
              end_time = int(today.timestamp() * 1000) - 1

              # Start CloudWatch Logs Export
              # Hive-style prefix (year=/month=/day=) so Athena can discover partitions
              response = logs.create_export_task(
                  logGroupName=log_group_name,
                  fromTime=start_time,
                  to=end_time,
                  destination=destination_bucket,
                  destinationPrefix=yesterday.strftime('eks-logs/year=%Y/month=%m/day=%d')
              )

              return {
                  'statusCode': 200,
                  'body': f'Export task created: {response["taskId"]}'
              }

Athena Query Table Creation:

-- Query logs stored in S3 with Athena
CREATE EXTERNAL TABLE eks_logs (
  `timestamp` BIGINT,  -- reserved word in Athena DDL, so it must be escaped with backticks
  message STRING,
  log_stream STRING,
  log_group STRING,
  kubernetes_pod_name STRING,
  kubernetes_namespace STRING,
  kubernetes_container_name STRING
)
PARTITIONED BY (year STRING, month STRING, day STRING)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://eks-logs-archive/eks-logs/'
TBLPROPERTIES ('has_encrypted_data'='false');

-- Add partitions (automate daily)
MSCK REPAIR TABLE eks_logs;

-- Query example: Analyze yesterday's error logs
SELECT
  kubernetes_namespace,
  kubernetes_pod_name,
  COUNT(*) as error_count
FROM eks_logs
WHERE year = '2026'
  AND month = '02'
  AND day = '12'
  AND message LIKE '%ERROR%'
GROUP BY kubernetes_namespace, kubernetes_pod_name
ORDER BY error_count DESC
LIMIT 10;

Cost Comparison (90-day retention):

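Storage Method	Monthly Storage Cost (100GB/day, 9,000GB retained)	Notes
CloudWatch Standard	$270	$0.03/GB/month
CloudWatch IA	$270	Same storage price as Standard; savings come from ingestion
S3 Standard	$207	$0.023/GB/month, 23% reduction vs CloudWatch (before compression)
S3 Standard-IA	$112.50	$0.0125/GB/month, 58% reduction vs CloudWatch
S3 Glacier IR	$36	$0.004/GB/month, 87% reduction vs CloudWatch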

11.6.4 Log Filtering Strategy: Cost Reduction by Dropping Unnecessary Logs

Not all logs are valuable. Filtering at the ingestion stage can significantly reduce costs.

Fluent Bit Filter Example (Fluent Bit runs as its own DaemonSet alongside ADOT):

# Fluent Bit ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: observability
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush         5
        Daemon        off
        Log_Level     info

    [INPUT]
        Name              tail
        Path              /var/log/containers/*.log
        Parser            docker
        Tag               kube.*
        Refresh_Interval  5
        Mem_Buf_Limit     50MB

    [FILTER]
        Name    grep
        Match   kube.*
        # Exclude DEBUG logs
        Exclude log DEBUG

    [FILTER]
        Name    grep
        Match   kube.*
        # Exclude health check logs
        Exclude log /healthz

    [FILTER]
        Name        kubernetes
        Match       kube.*
        # Enrich records with Pod metadata (required by the two grep filters below)
        Merge_Log   On

    [FILTER]
        Name    grep
        Match   kube.*
        # Exclude Kubernetes system logs (kube-system namespace)
        Exclude $kubernetes['namespace_name'] kube-system

    [FILTER]
        Name    grep
        Match   kube.*
        # Exclude Istio proxy access logs (can be replaced with metrics)
        Exclude $kubernetes['container_name'] istio-proxy

    [FILTER]
        Name    modify
        Match   kube.*
        # Drop top-level fields that may carry sensitive information
        Remove  password
        Remove  token
        Remove  api_key

    [OUTPUT]
        Name                cloudwatch_logs
        Match               kube.*
        region              ap-northeast-2
        log_group_name      /eks/my-cluster/application
        log_stream_prefix   ${HOSTNAME}-
        auto_create_group   true

Filtering Effect:

Filtering Item	Log Volume Reduction	Monthly Cost Savings (100GB/day basis)
Exclude DEBUG logs	30-40%	$450-600
Exclude health check logs	10-15%	$150-225
Exclude kube-system	5-10%	$75-150
Exclude Istio access logs	15-20%	$225-300
Total Savings	60-85%	$900-1,275

Filtering Caveats

Log filtering can sacrifice problem analysis capability. Follow these principles:

Production environment: Send only ERROR, WARN levels to CloudWatch
Development/Staging: Collect all logs (7-day retention)
Audit logs: Never filter (regulatory compliance)
Sampling: Apply 1/10 sampling for high-traffic services
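The 1/10 sampling principle can be sketched as a deterministic, hash-based decision that always keeps ERROR and WARN records. The function name and the choice of a request or trace ID as the sampling key are assumptions. In practice the logic would live in the collector (for example as a Fluent Bit Lua filter) or in the application logger.

```python
import zlib

ALWAYS_KEEP = {"ERROR", "WARN"}


def keep_log(level: str, sample_key: str, sample_rate: int = 10) -> bool:
    """Decide whether to ship a log record.

    ERROR/WARN records are always kept. Other records are kept when the CRC32
    of the sampling key falls into one of `sample_rate` buckets, so every line
    sharing a key (for example a request ID) is kept or dropped together.
    """
    if level.upper() in ALWAYS_KEEP:
        return True
    return zlib.crc32(sample_key.encode("utf-8")) % sample_rate == 0


records = [("INFO", f"req-{i}") for i in range(10_000)]
kept = sum(keep_log(level, key) for level, key in records)
print(f"kept {kept} of {len(records)} INFO records")
```

Hashing a stable key instead of calling a random number generator keeps all lines of a sampled request together, which preserves enough context to debug it.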

11.6.5 Log Routing Optimization with Data Firehose

Amazon Data Firehose (formerly Kinesis Data Firehose) can route and transform logs to multiple destinations in real time.

# CloudWatch Logs → Firehose → S3/OpenSearch/Redshift
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  LogDeliveryStream:
    Type: AWS::KinesisFirehose::DeliveryStream
    Properties:
      DeliveryStreamName: eks-logs-delivery
      DeliveryStreamType: DirectPut
      ExtendedS3DestinationConfiguration:
        BucketARN: !GetAtt LogArchiveBucket.Arn
        RoleARN: !GetAtt FirehoseRole.Arn
        # Dynamic partitioning requires a partition key in the prefix;
        # the key is returned by the transformation Lambda below
        Prefix: 'logs/namespace=!{partitionKeyFromLambda:namespace}/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/'
        ErrorOutputPrefix: 'errors/'
        BufferingHints:
          SizeInMBs: 128
          IntervalInSeconds: 300
        CompressionFormat: GZIP
        # Data transformation (JSON normalization via Lambda), followed by a
        # newline delimiter appended to each delivered record.
        # Both processors belong to a single ProcessingConfiguration block.
        ProcessingConfiguration:
          Enabled: true
          Processors:
            - Type: Lambda
              Parameters:
                - ParameterName: LambdaArn
                  ParameterValue: !GetAtt LogTransformLambda.Arn
            - Type: AppendDelimiterToRecord
              Parameters:
                - ParameterName: Delimiter
                  ParameterValue: "\\n"
        # Dynamic Partitioning (automatic classification by namespace)
        DynamicPartitioningConfiguration:
          Enabled: true
          RetryOptions:
            DurationInSeconds: 300

  # CloudWatch Logs Subscription Filter
  LogSubscriptionFilter:
    Type: AWS::Logs::SubscriptionFilter
    Properties:
      LogGroupName: /eks/my-cluster/application
      FilterPattern: ''  # All logs
      DestinationArn: !GetAtt LogDeliveryStream.Arn
      RoleArn: !GetAtt CloudWatchLogsRole.Arn

  # Log transformation Lambda
  LogTransformLambda:
    Type: AWS::Lambda::Function
    Properties:
      Runtime: python3.11
      Handler: index.handler
      # Assumption: LogTransformLambdaRole is defined elsewhere in the template,
      # like FirehoseRole and CloudWatchLogsRole
      Role: !GetAtt LogTransformLambdaRole.Arn
      Timeout: 60
      Code:
        ZipFile: |
          import base64
          import gzip
          import json

          def handler(event, context):
              output = []

              for record in event['records']:
                  # Decode CloudWatch Logs data (base64-encoded, gzip-compressed JSON)
                  payload = base64.b64decode(record['data'])
                  log_data = json.loads(gzip.decompress(payload))

                  # Firehose expects exactly one result per recordId, so
                  # non-data (control) messages are marked as Dropped
                  if log_data.get('messageType') != 'DATA_MESSAGE':
                      output.append({'recordId': record['recordId'], 'result': 'Dropped'})
                      continue

                  # Second-to-last segment of the log group name (my-cluster for
                  # /eks/my-cluster/application); adjust this to wherever the
                  # namespace is encoded in your log group naming
                  namespace = log_data['logGroup'].split('/')[-2]

                  lines = []
                  for log_event in log_data['logEvents']:
                      # JSON parsing and normalization
                      try:
                          parsed = json.loads(log_event['message'])
                      except json.JSONDecodeError:
                          parsed = None
                      if not isinstance(parsed, dict):
                          # Keep the original text on parse failure
                          parsed = {'message': log_event['message']}
                      lines.append(json.dumps({
                          'timestamp': log_event['timestamp'],
                          'level': parsed.get('level', 'INFO'),
                          'message': parsed.get('message', ''),
                          'namespace': namespace,
                          'pod': log_data['logStream']
                      }))

                  # One output record per input record: events joined as JSON Lines
                  output.append({
                      'recordId': record['recordId'],
                      'result': 'Ok',
                      'data': base64.b64encode(
                          '\n'.join(lines).encode('utf-8')
                      ).decode('utf-8'),
                      'metadata': {'partitionKeys': {'namespace': namespace}}
                  })

              return {'records': output}
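
A transformation handler like the one above is easier to check locally with a synthetic event before wiring it to a delivery stream. The following is a minimal sketch, assuming the standard CloudWatch Logs subscription payload (base64-encoded, gzip-compressed JSON). The helper names `make_firehose_event` and `decode_record_data`, the record ID, and the sample messages are hypothetical and exist only for illustration.

```python
import base64
import gzip
import json


def make_firehose_event(log_group, log_stream, messages):
    """Build a Firehose transformation event holding one CloudWatch Logs batch."""
    payload = {
        "messageType": "DATA_MESSAGE",
        "logGroup": log_group,
        "logStream": log_stream,
        "logEvents": [
            {"id": str(i), "timestamp": 1700000000000 + i, "message": message}
            for i, message in enumerate(messages)
        ],
    }
    # CloudWatch Logs gzips the JSON batch; Firehose hands it to Lambda base64-encoded
    data = base64.b64encode(gzip.compress(json.dumps(payload).encode("utf-8")))
    return {"records": [{"recordId": "rec-1", "data": data.decode("utf-8")}]}


def decode_record_data(data):
    """Reverse the encoding: base64 -> gzip -> JSON."""
    return json.loads(gzip.decompress(base64.b64decode(data)))


event = make_firehose_event(
    "/eks/my-cluster/application",
    "sample-pod",
    ['{"level": "ERROR", "message": "boom"}', "plain text line"],
)
decoded = decode_record_data(event["records"][0]["data"])
print(decoded["logGroup"], len(decoded["logEvents"]))
```

Passing `event` to the handler and decoding the returned `data` field makes it possible to confirm that exactly one result comes back per `recordId`.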

Advantages of Firehose-Based Pipeline:

Flexible destinations: Deliver logs to S3, OpenSearch, or Redshift. Each delivery stream has one primary destination (plus an optional S3 backup), so fan-out to several destinations uses one stream and subscription filter per destination
Real-time transformation: JSON normalization and sensitive information masking via Lambda
Automatic compression and format conversion: GZIP or Snappy compression, with optional conversion to Parquet (around 70% storage savings)
Dynamic Partitioning: Automatic classification by namespace, Pod, and date
Cost efficiency: 60-80% storage cost reduction compared to CloudWatch Logs

Cost Comparison (Including Firehose):

Item	CloudWatch Only	Firehose + S3	Savings
Ingestion	$1,500/month	$1,500/month	-
CloudWatch Storage (7 days)	$210/month	$7/month	97% reduction
Firehose Processing	-	$150/month	-
S3 Storage (90 days)	-	$23/month	-
Total Cost	$1,710/month	$1,680/month	2% reduction
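
The totals and savings in the table can be reproduced with a few lines of arithmetic. This sketch only re-adds the monthly figures copied from the table; the dictionary keys and function names are illustrative, not part of any AWS pricing API.

```python
# Monthly figures copied from the cost comparison table (USD)
cloudwatch_only = {"ingestion": 1500, "cloudwatch_storage": 210}
firehose_s3 = {
    "ingestion": 1500,
    "cloudwatch_storage": 7,
    "firehose_processing": 150,
    "s3_storage": 23,
}


def total(costs):
    """Sum all monthly line items."""
    return sum(costs.values())


def savings_pct(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100


baseline = total(cloudwatch_only)
pipeline = total(firehose_s3)
storage_saving = savings_pct(
    cloudwatch_only["cloudwatch_storage"], firehose_s3["cloudwatch_storage"]
)

print(f"CloudWatch only: ${baseline}/month")    # $1710/month
print(f"Firehose + S3:   ${pipeline}/month")    # $1680/month
print(f"CloudWatch storage reduction: {storage_saving:.1f}%")              # 96.7%
print(f"Total reduction: {savings_pct(baseline, pipeline):.1f}%")          # 1.8%
```

Because ingestion dominates both columns, the total moves by only about 2% even though the CloudWatch storage line drops by about 97%.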

The True Value of Firehose

Short-term cost savings are not significant, but in long-term retention scenarios (e.g., 1 year), it saves over 80% compared to CloudWatch. Additionally, logs stored in S3 can be utilized by various analysis tools such as Athena, Redshift Spectrum, and EMR, greatly enhancing analysis flexibility.

11.7 IaC MCP Server-Based Observability Stack Automated Deployment

The AWS Infrastructure as Code (IaC) MCP Server announced on November 28, 2025, fundamentally changes how observability stacks are deployed. With just a natural language request, it automatically generates CDK or CloudFormation templates, performs pre-deployment validation, and automatically applies best practices.

11.6.1 IaC MCP Server Overview

The AWS IaC MCP Server is a tool that implements the Model Context Protocol, enabling AI clients (Kiro, Amazon Q Developer) to understand and generate infrastructure code.

Core Features:

Feature	Description	Observability Stack Application
Documentation Search	Real-time CDK/CloudFormation official documentation lookup	Automatic search for AMP, AMG, ADOT Collector configuration examples
Template Generation	Natural language → IaC code automatic conversion	"Deploy EKS observability stack" → Full stack code generation
Syntax Validation	Pre-deployment IaC template validation	CloudFormation Linter, CDK synth automatic execution
Best Practice Application	Automatic insertion of AWS Well-Architected patterns	Tag strategy, IAM least privilege, encryption enabled by default
Troubleshooting	Deployment failure root cause analysis and resolution suggestions	"AMP workspace creation failed" → Automatic permission issue diagnosis

11.6.2 Kiro + IaC MCP Server Automated Deployment Workflow

11.6.3 Practical Usage Examples

Scenario 1: Fully Automated Observability Stack Deployment

// Request to Kiro: "Deploy an observability stack for EKS cluster my-cluster"
// → IaC MCP Server automatically generates the following CDK code:

import * as cdk from 'aws-cdk-lib';
import * as aps from 'aws-cdk-lib/aws-aps';
import * as grafana from 'aws-cdk-lib/aws-grafana';
import * as iam from 'aws-cdk-lib/aws-iam';
import * as eks from 'aws-cdk-lib/aws-eks';

export class EksObservabilityStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // 1. Create AMP Workspace
    const ampWorkspace = new aps.CfnWorkspace(this, 'ObservabilityWorkspace', {
      alias: 'my-cluster-observability',
      tags: [
        { key: 'Environment', value: 'production' },
        { key: 'ManagedBy', value: 'Kiro-IaC-MCP' }
      ]
    });

    // 2. Create AMG Workspace
    const amgWorkspace = new grafana.CfnWorkspace(this, 'GrafanaWorkspace', {
      accountAccessType: 'CURRENT_ACCOUNT',
      authenticationProviders: ['AWS_SSO'],
      permissionType: 'SERVICE_MANAGED',
      dataSources: ['PROMETHEUS', 'CLOUDWATCH', 'XRAY'],
      name: 'my-cluster-grafana',
      roleArn: this.createGrafanaRole().roleArn
    });

    // 3. ADOT Collector IAM Role
    // Assumes EKS Pod Identity (Pod Identity Agent add-on installed): pods assume the
    // role via pods.eks.amazonaws.com, which needs sts:AssumeRole and sts:TagSession.
    // The eks.amazonaws.com principal is for the cluster service role, not for pods.
    const adotRole = new iam.Role(this, 'AdotCollectorRole', {
      assumedBy: new iam.SessionTagsPrincipal(
        new iam.ServicePrincipal('pods.eks.amazonaws.com')
      ),
      managedPolicies: [
        iam.ManagedPolicy.fromAwsManagedPolicyName('CloudWatchAgentServerPolicy'),
        iam.ManagedPolicy.fromAwsManagedPolicyName('AWSXRayDaemonWriteAccess')
      ],
      inlinePolicies: {
        'AMPRemoteWrite': new iam.PolicyDocument({
          statements: [
            new iam.PolicyStatement({
              effect: iam.Effect.ALLOW,
              actions: ['aps:RemoteWrite', 'aps:GetSeries', 'aps:GetLabels', 'aps:GetMetricMetadata'],
              resources: [ampWorkspace.attrArn]
            })
          ]
        })
      }
    });

    // 4. ADOT Collector manifest, emitted as a stack output to apply with kubectl
    // (the adot-collector service account also needs a Pod Identity association with adotRole).
    // attrPrometheusEndpoint already ends with '/', so the remote write path is appended directly.
    new cdk.CfnOutput(this, 'AdotInstallCommand', {
      value: `
        # ADOT Collector Installation
        kubectl apply -f - <<EOF
        apiVersion: opentelemetry.io/v1beta1
        kind: OpenTelemetryCollector
        metadata:
          name: adot-collector
          namespace: observability
        spec:
          mode: deployment
          serviceAccount: adot-collector
          config:
            receivers:
              prometheus:
                config:
                  scrape_configs:
                    - job_name: 'kubernetes-pods'
                      kubernetes_sd_configs:
                        - role: pod
            exporters:
              prometheusremotewrite:
                endpoint: ${ampWorkspace.attrPrometheusEndpoint}api/v1/remote_write
                auth:
                  authenticator: sigv4auth
            extensions:
              sigv4auth:
                region: ${this.region}
                service: aps
            service:
              extensions: [sigv4auth]
              pipelines:
                metrics:
                  receivers: [prometheus]
                  exporters: [prometheusremotewrite]
        EOF
      `
    });

    // 5. Enable Container Insights
    // The role is attached through a Pod Identity association, matching the trust policy above
    new cdk.CfnOutput(this, 'ContainerInsightsCommand', {
      value: `
        aws eks create-addon \\
          --cluster-name my-cluster \\
          --addon-name amazon-cloudwatch-observability \\
          --pod-identity-associations serviceAccount=cloudwatch-agent,roleArn=${adotRole.roleArn}
      `
    });
  }

  private createGrafanaRole(): iam.Role {
    return new iam.Role(this, 'GrafanaRole', {
      assumedBy: new iam.ServicePrincipal('grafana.amazonaws.com'),
      managedPolicies: [
        iam.ManagedPolicy.fromAwsManagedPolicyName('AmazonPrometheusQueryAccess'),
        iam.ManagedPolicy.fromAwsManagedPolicyName('CloudWatchReadOnlyAccess'),
        iam.ManagedPolicy.fromAwsManagedPolicyName('AWSXRayReadOnlyAccess')
      ]
    });
  }
}

Scenario 2: Step-by-Step Deployment and Verification

[Kiro + IaC MCP Server Automated Workflow]

Step 1: Natural Language Request
User: "Deploy an EKS observability stack"

Step 2: IaC MCP Server Documentation Search
→ resolve_cfn_resource("AWS::APS::Workspace")
→ resolve_cfn_resource("AWS::Grafana::Workspace")
→ search_cdk_docs("ADOT Collector CDK")

Step 3: CDK Template Generation
→ Automatically generates the TypeScript code above
→ IAM least privilege principle automatically applied
→ Tag strategy automatically inserted (Environment, ManagedBy, CostCenter)

Step 4: Pre-Deployment Validation (IaC MCP Server Built-in)
→ cdk synth (syntax validation)
→ cfn-lint (CloudFormation best practice check)
→ IAM Policy Simulator (permission verification)
→ Result: All checks passed

Step 5: GitOps Deployment via Managed Argo CD
→ Commit code to Git repository
→ Argo CD automatically syncs
→ Changes are trackable

Step 6: Post-Deployment Automatic Verification
→ AMP workspace status check (ACTIVE)
→ AMG datasource connection test (SUCCESS)
→ ADOT Collector Pod status (Running 2/2)
→ First metric collection confirmed (within 30 seconds)

Complete: "Observability stack has been successfully deployed."
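
The first post-deployment check in Step 6 (AMP workspace status) can be approximated with a small polling loop. This is a sketch under stated assumptions rather than what the IaC MCP Server runs internally: `amp_client` is expected to behave like `boto3.client("amp")`, whose `describe_workspace` call returns the status under `workspace.status.statusCode`. The `StubAmpClient` class, the workspace ID `ws-example`, and the timeout values are hypothetical, included so the example runs without AWS credentials.

```python
import time


def wait_for_workspace_active(amp_client, workspace_id, timeout_s=300, poll_s=10,
                              sleep=time.sleep):
    """Poll DescribeWorkspace until the workspace is ACTIVE, fails, or the timeout passes."""
    waited = 0
    while True:
        response = amp_client.describe_workspace(workspaceId=workspace_id)
        status = response["workspace"]["status"]["statusCode"]
        if status == "ACTIVE":
            return True
        if status == "CREATION_FAILED" or waited >= timeout_s:
            return False
        sleep(poll_s)
        waited += poll_s


class StubAmpClient:
    """Stand-in for boto3.client("amp") so the sketch runs without AWS credentials."""

    def __init__(self, statuses):
        self._statuses = list(statuses)

    def describe_workspace(self, workspaceId):
        # Return the next canned status; repeat the last one once the list runs out
        code = self._statuses.pop(0) if len(self._statuses) > 1 else self._statuses[0]
        return {"workspace": {"workspaceId": workspaceId, "status": {"statusCode": code}}}


stub = StubAmpClient(["CREATING", "CREATING", "ACTIVE"])
print(wait_for_workspace_active(stub, "ws-example", sleep=lambda _: None))  # True
```

Against a real account, the stub would be replaced by `boto3.client("amp")` and the actual workspace ID. The `sleep` parameter is injectable only to keep local runs instant.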

11.6.4 Key Advantages of IaC MCP Server

1. Reduced Manual YAML Writing Time

[Before - Manual Writing]
- AMP workspace creation: 15 min (documentation reference + YAML writing)
- IAM role setup: 30 min (policy document writing + permission testing)
- ADOT Collector configuration: 45 min (Helm values writing + debugging)
- AMG connection: 20 min (datasource setup)
Total work time: 110 min

[After - IaC MCP Server]
- Natural language request: 1 min
- Code generation and validation: 2 min
- Deployment: 5 min
Total work time: 8 min

→ 93% time reduction

2. Automatic Best Practice Application

The IaC MCP Server automatically applies observability best practices from the AWS Well-Architected Framework:

Best Practice	Automatically Applied Content
Security	IAM least privilege principle, SigV4 authentication automatic setup
Reliability	AMP/AMG high availability configuration enabled by default
Performance	ADOT Collector resource limits automatically set
Cost Optimization	Metric filtering (unnecessary go_* and process_* metrics removed)
Operational Excellence	Tag strategy automatically applied, CloudWatch alerts configured by default

3. Configuration Error Prevention

# Common manual configuration error examples

# Incorrect configuration (common mistake in manual writing)
exporters:
  prometheusremotewrite:
    endpoint: "https://aps-workspaces.ap-northeast-2.amazonaws.com/workspaces/ws-xxxxx/api/v1/remote_write"
    # Problem: SigV4 authentication missing → 403 Forbidden

# IaC MCP Server auto-generated (correct configuration)
exporters:
  prometheusremotewrite:
    endpoint: "https://aps-workspaces.ap-northeast-2.amazonaws.com/workspaces/ws-xxxxx/api/v1/remote_write"
    auth:
      authenticator: sigv4auth
    resource_to_telemetry_conversion:
      enabled: true
extensions:
  sigv4auth:
    region: ap-northeast-2
    service: aps
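
The missing-SigV4 mistake shown above can also be caught by a simple pre-deployment check on the parsed collector configuration. The sketch below assumes the configuration has already been loaded into a Python dictionary (for example with a YAML parser). The function name `find_remote_write_issues` and the rule that an AMP endpoint contains `aps-workspaces.` are illustrative assumptions, not part of ADOT or the IaC MCP Server.

```python
def find_remote_write_issues(config):
    """Return human-readable problems found in prometheusremotewrite exporters."""
    issues = []
    extensions = config.get("extensions") or {}
    for name, exporter in (config.get("exporters") or {}).items():
        # Exporter keys may carry a suffix such as prometheusremotewrite/amp
        if name.split("/")[0] != "prometheusremotewrite":
            continue
        exporter = exporter or {}
        if "aps-workspaces." not in exporter.get("endpoint", ""):
            continue  # Not an AMP endpoint, so SigV4 is not required
        authenticator = (exporter.get("auth") or {}).get("authenticator")
        if not authenticator:
            issues.append(f"{name}: auth.authenticator is missing (AMP returns 403)")
        elif authenticator not in extensions:
            issues.append(f"{name}: authenticator '{authenticator}' is not defined under extensions")
    return issues


endpoint = "https://aps-workspaces.ap-northeast-2.amazonaws.com/workspaces/ws-xxxxx/api/v1/remote_write"

incorrect = {"exporters": {"prometheusremotewrite": {"endpoint": endpoint}}}
correct = {
    "exporters": {
        "prometheusremotewrite": {
            "endpoint": endpoint,
            "auth": {"authenticator": "sigv4auth"},
        }
    },
    "extensions": {"sigv4auth": {"region": "ap-northeast-2", "service": "aps"}},
}

print(find_remote_write_issues(incorrect))
print(find_remote_write_issues(correct))  # []
```

A check like this fits naturally into the PR pipeline, so a configuration without SigV4 is rejected before it reaches the cluster.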

11.6.5 GitOps Integration with Managed Argo CD

Code generated by the IaC MCP Server is deployed via GitOps through the EKS Capability Managed Argo CD.

# ArgoCD Application auto-generation example
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: eks-observability-stack
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/eks-infra
    targetRevision: HEAD
    path: observability-stack
    helm:
      valueFiles:
        - values-production.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: observability
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas  # When replicas are managed by an autoscaler such as HPA (Karpenter scales nodes, not replicas)

Advantages of GitOps Deployment:

Change History Tracking: Track all infrastructure changes through Git commit history
Easy Rollback: Restore to previous state with a single git revert
PR-Based Review: Apply code review process to infrastructure changes as well
Multi-Cluster Deployment: Consistently deploy the same observability stack across multiple clusters

11.6.6 Practical Usage Tips

Tip 1: Minimize Risk with Step-by-Step Deployment

[Recommended Deployment Order]
Phase 1: Test IaC MCP Server generated code on development cluster
Phase 2: Commit generated code to Git, create PR
Phase 3: Deploy to staging cluster after team review
Phase 4: Validate on staging for 1 week
Phase 5: Deploy to production cluster

Tip 2: When Customization is Needed

User: "Deploy an EKS observability stack, but set AMP metric retention period to 90 days"

→ IaC MCP Server automatically adds retention configuration:

const ampWorkspace = new aps.CfnWorkspace(this, 'ObservabilityWorkspace', {
  alias: 'my-cluster-observability',
  loggingConfiguration: {
    // logGroup is assumed to be a logs.LogGroup defined elsewhere in the stack
    logGroupArn: logGroup.logGroupArn
  },
  // Customized retention period (a tag alone would not change retention)
  workspaceConfiguration: {
    retentionPeriodInDays: 90
  }
});

Tip 3: Automatic Cost Optimization Settings

User: "Deploy an EKS observability stack, but include cost optimization settings"

→ IaC MCP Server automatically:
  - Filters unnecessary metrics (go_*, process_*)
  - Changes scrape interval from 15s → 30s
  - Reduces network requests with batch processor
  - Excludes DevOps Guru for development/staging environments
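
The effect of the first two optimizations can be estimated offline. This is a rough sketch, assuming a drop rule equivalent to a fully anchored Prometheus relabel regex `(go|process)_.*`; the sample metric names are made up for illustration.

```python
import re

# Prometheus relabel regexes are fully anchored, hence fullmatch below
DROP_PATTERN = re.compile(r"(go|process)_.*")


def keep_metric(name):
    """True when the metric survives the go_* / process_* drop rule."""
    return DROP_PATTERN.fullmatch(name) is None


def samples_per_day(scrape_interval_s):
    """Samples produced per time series per day at a given scrape interval."""
    return 86400 // scrape_interval_s


scraped = [
    "go_goroutines",
    "process_cpu_seconds_total",
    "http_requests_total",
    "container_memory_working_set_bytes",
    "gopher_count",  # Kept: the prefix must be exactly go_ or process_
]
kept = [name for name in scraped if keep_metric(name)]

print(kept)
print(samples_per_day(15), samples_per_day(30))  # 5760 2880
```

Doubling the scrape interval halves the samples ingested per series, and the drop rule removes whole series on top of that.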

Core Value of IaC MCP Server

The AWS IaC MCP Server is more than a code generator: it references official AWS documentation in real time, applies best practices, and validates templates before deployment, acting as an intelligent infrastructure code assistant. Even a configuration that connects multiple services (AMP, AMG, ADOT, Container Insights, DevOps Guru), such as an observability stack, can be requested with a single line of natural language.

Synergy of Kiro + IaC MCP Server Combination

Kiro leverages the IaC MCP Server to automate not only infrastructure deployment but also continuous improvement:

Observability Data Analysis: Query metrics via CloudWatch MCP Server
Problem Detection: Detect "ADOT Collector CPU utilization is high"
Solution Derivation: Generate resource limit adjustment code via IaC MCP Server
PR Creation: Automatically submit changes as a Git PR
Deployment: Managed Argo CD automatically deploys after approval

This is the fully automated loop of Observe → Analyze → Improve.

12. Conclusion

12.1 Build Order Summary

The following order is recommended for building an intelligent observability stack:

Phase 1: Deploy Managed Add-ons
  └── ADOT + CloudWatch Observability + Node Monitoring + Flow Monitor

Phase 2: Connect AMP + AMG
  └── Remote Write configuration + Grafana dashboard setup

Phase 3: Enable Application Signals
  └── Zero-code instrumentation + Automatic SLI/SLO configuration

Phase 4: Enable DevOps Guru
  └── ML anomaly detection + Root cause analysis

Phase 5: CloudWatch AI + MCP Integration
  └── Natural language queries + Kiro/Q Developer integration

Phase 6: Alert Optimization
  └── SLO-based alerts + Composite Alarms + Automated recovery

12.2 Next Steps

Based on this observability stack, study the following topics:

3. AIDLC Framework: AI-driven development lifecycle and the development feedback loop with observability data
4. Predictive Scaling and Automated Recovery: ML prediction and automated recovery patterns based on observability data
1. AIOps Strategy Guide: Overall AIOps strategy and the role of observability

12.3 Learning Path

[Current Document] 2. Building an Intelligent Observability Stack
     ↓
[Next] 3. AIDLC Framework — AI Development Automation Using Observability Data
     ↓
[Advanced] 4. Predictive Scaling and Automated Recovery — Predictive Operations Based on Observability
