Skip to main content

Observability and Monitoring

Observability Stack Architecture

Container Insights Setup

# Install Container Insights Add-on
aws eks create-addon \
--cluster-name <cluster-name> \
--addon-name amazon-cloudwatch-observability

# Verify installation
kubectl get pods -n amazon-cloudwatch

Metric Debugging: PromQL Queries

CPU Throttling Detection

sum(rate(container_cpu_cfs_throttled_periods_total{namespace="production"}[5m]))
/ sum(rate(container_cpu_cfs_periods_total{namespace="production"}[5m])) > 0.25
CPU Throttling Threshold

Throttling above 25% causes performance degradation. Consider removing or increasing CPU limits. Many organizations adopt a strategy of setting only CPU requests without limits.

OOMKilled Detection

kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} > 0

Pod Restart Rate

sum(rate(kube_pod_container_status_restarts_total[15m])) by (namespace, pod) > 0

Node CPU Utilization (Warn Above 80%)

100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80

Node Memory Utilization (Warn Above 85%)

(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85

Log Debugging: CloudWatch Logs Insights

Error Log Analysis

fields @timestamp, @message, kubernetes.container_name, kubernetes.pod_name
| filter @message like /ERROR|FATAL|Exception/
| sort @timestamp desc
| limit 50

Latency Analysis

fields @timestamp, @message
| filter @message like /latency|duration|elapsed/
| parse @message /latency[=:]\s*(?<latency_ms>\d+)/
| stats avg(latency_ms), max(latency_ms), p99(latency_ms) by bin(5m)

Error Pattern Analysis for a Specific Pod

fields @timestamp, @message
| filter kubernetes.pod_name like /api-server/
| filter @message like /error|Error|ERROR/
| stats count() by bin(1m)
| sort bin asc

OOMKilled Event Tracking

fields @timestamp, @message
| filter @message like /OOMKilled|oom-kill|Out of memory/
| sort @timestamp desc
| limit 20

Container Restart Events

fields @timestamp, @message, kubernetes.pod_name
| filter @message like /Back-off restarting failed container|CrashLoopBackOff/
| stats count() by kubernetes.pod_name
| sort count desc

Alert Rules: PrometheusRule Example

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: kubernetes-alerts
spec:
groups:
- name: kubernetes-pods
rules:
- alert: PodCrashLooping
expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 5 > 0
for: 1h
labels:
severity: warning
annotations:
summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
description: "Pod {{ $labels.pod }} has been restarting over the last 15 minutes."

- alert: PodOOMKilled
expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} > 0
for: 0m
labels:
severity: critical
annotations:
summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} OOMKilled"
description: "Pod {{ $labels.pod }} was terminated due to memory exhaustion. Adjust memory limits."

- name: kubernetes-nodes
rules:
- alert: NodeNotReady
expr: kube_node_status_condition{condition="Ready",status="true"} == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Node {{ $labels.node }} is NotReady"

- alert: NodeHighCPU
expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 10m
labels:
severity: warning
annotations:
summary: "Node {{ $labels.instance }} CPU usage above 80%"

- alert: NodeHighMemory
expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85
for: 10m
labels:
severity: warning
annotations:
summary: "Node {{ $labels.instance }} memory usage above 85%"

ADOT (AWS Distro for OpenTelemetry) Debugging

ADOT is the AWS-managed distribution of OpenTelemetry that collects traces, metrics, and logs and forwards them to various AWS services (X-Ray, CloudWatch, AMP, etc.).

# Check ADOT Add-on status
aws eks describe-addon --cluster-name $CLUSTER \
--addon-name adot --query 'addon.{status:status,version:addonVersion}'

# Check ADOT Collector Pods
kubectl get pods -n opentelemetry-operator-system
kubectl logs -n opentelemetry-operator-system -l app.kubernetes.io/name=opentelemetry-operator --tail=50

# Check OpenTelemetryCollector CR
kubectl get otelcol -A
kubectl describe otelcol -n $NAMESPACE $COLLECTOR_NAME

Common ADOT Issues

SymptomCauseResolution
Operator Pod CrashLoopBackOffCertManager not installedADOT operator requires CertManager for webhook certificates. kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml
Collector fails to send to AMPInsufficient IAM permissionsAdd aps:RemoteWrite to IRSA/Pod Identity
X-Ray traces not receivedInsufficient IAM permissionsAdd xray:PutTraceSegments, xray:PutTelemetryRecords to IRSA/Pod Identity
CloudWatch metrics not receivedInsufficient IAM permissionsAdd cloudwatch:PutMetricData to IRSA/Pod Identity
Collector Pod OOMKilledInsufficient resourcesIncrease Collector resources.limits.memory when collecting high volumes of traces/metrics
ADOT Permission Separation

AMP remote write, X-Ray, and CloudWatch each require different IAM permissions. When the Collector sends data to multiple backends, ensure the IAM Role includes every required permission.


Incident Detection Mechanisms and Logging Architecture

Incident Detection Strategy Overview

To detect incidents quickly in EKS environments, a 4-layer pipeline of data source → collection → analysis & detection → alerting & response must be systematically configured. The layers must connect cohesively to minimize MTTD (Mean Time To Detect).

Four-Layer Architecture

LayerRoleCore Components
Data SourcesGenerate all observable signals in the clusterControl Plane Logs, Data Plane Logs, Metrics, Traces
Collection LayerStandardize data from various sources and ship centrallyFluent Bit, CloudWatch Agent, ADOT Collector
Analysis & DetectionAnalyze collected data and detect anomaliesCloudWatch Logs Insights, AMP, OpenSearch, Anomaly Detection
Alerting & ResponseNotify appropriate channels about detected incidents and run auto-remediationCloudWatch Alarms, Alertmanager, SNS → Lambda, PagerDuty/Slack

Option A: AWS-Native Stack (Small to Medium Clusters)

An architecture centered on AWS managed services to minimize operational overhead.

LayerComponentPurpose
CollectionFluent Bit (DaemonSet)Node/container log collection
TransportCloudWatch LogsCentral log store
AnalysisCloudWatch Logs InsightsQuery-based analysis
DetectionCloudWatch Anomaly DetectionML-based anomaly detection
AlertingCloudWatch Alarms → SNSThreshold/anomaly-based alerts

Fluent Bit DaemonSet deployment example:

apiVersion: apps/v1
kind: DaemonSet
metadata:
name: fluent-bit
namespace: amazon-cloudwatch
labels:
app.kubernetes.io/name: fluent-bit
spec:
selector:
matchLabels:
app.kubernetes.io/name: fluent-bit
template:
metadata:
labels:
app.kubernetes.io/name: fluent-bit
spec:
serviceAccountName: fluent-bit
containers:
- name: fluent-bit
image: public.ecr.aws/aws-observability/aws-for-fluent-bit:2.32.0
resources:
limits:
memory: 200Mi
requests:
cpu: 100m
memory: 100Mi
volumeMounts:
- name: varlog
mountPath: /var/log
readOnly: true
- name: varlogpods
mountPath: /var/log/pods
readOnly: true
- name: fluent-bit-config
mountPath: /fluent-bit/etc/
volumes:
- name: varlog
hostPath:
path: /var/log
- name: varlogpods
hostPath:
path: /var/log/pods
- name: fluent-bit-config
configMap:
name: fluent-bit-config
Fluent Bit vs Fluentd

Fluent Bit uses more than 10× less memory than Fluentd (~10MB vs ~100MB). Deploying Fluent Bit as a DaemonSet is the standard pattern in EKS. The amazon-cloudwatch-observability Add-on installs Fluent Bit automatically.

Option B: Open-Source-Based Stack (Large / Multi-Cluster)

An architecture combining open-source tools with AWS managed services for scalability and flexibility at scale.

LayerComponentPurpose
CollectionFluent Bit + ADOT CollectorUnified collection of logs/metrics/traces
MetricsAmazon Managed Prometheus (AMP)Time-series metrics store
LogsAmazon OpenSearch ServiceLarge-scale log analysis
TracesAWS X-Ray / JaegerDistributed tracing
VisualizationAmazon Managed GrafanaUnified dashboards
AlertingAlertmanager + PagerDuty/SlackAdvanced routing, grouping, silencing
Multi-Cluster Architecture

In multi-cluster environments, a hub-and-spoke pattern where each cluster's ADOT Collector sends metrics to a central AMP workspace is recommended. Grafana can then monitor all clusters from a single dashboard.

Incident Detection Patterns

Pattern 1: Threshold-Based Detection

The simplest detection method. Alerts fire when a predefined threshold is exceeded.

# PrometheusRule - threshold-based alert example
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: eks-threshold-alerts
namespace: monitoring
spec:
groups:
- name: eks-thresholds
rules:
- alert: HighPodRestartRate
expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
for: 10m
labels:
severity: warning
annotations:
summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} restart count increasing"
description: "{{ $value }} restarts occurred within the last hour"

- alert: NodeMemoryPressure
expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.85
for: 5m
labels:
severity: critical
annotations:
summary: "Node {{ $labels.instance }} memory utilization above 85%"

- alert: PVCNearlyFull
expr: kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.9
for: 15m
labels:
severity: warning
annotations:
summary: "PVC {{ $labels.persistentvolumeclaim }} utilization above 90%"

Pattern 2: Anomaly Detection

ML-based approaches learn normal patterns and detect deviations. Useful when thresholds are hard to define in advance.

# Configure CloudWatch Anomaly Detection
aws cloudwatch put-anomaly-detector \
--single-metric-anomaly-detector '{
"Namespace": "ContainerInsights",
"MetricName": "pod_cpu_utilization",
"Dimensions": [
{"Name": "ClusterName", "Value": "'$CLUSTER'"},
{"Name": "Namespace", "Value": "production"}
],
"Stat": "Average"
}'

# Create alarm based on Anomaly Detection
aws cloudwatch put-metric-alarm \
--alarm-name "eks-cpu-anomaly" \
--alarm-description "Detect EKS CPU utilization anomalies" \
--evaluation-periods 3 \
--comparison-operator LessThanLowerOrGreaterThanUpperThreshold \
--threshold-metric-id ad1 \
--metrics '[
{
"Id": "m1",
"MetricStat": {
"Metric": {
"Namespace": "ContainerInsights",
"MetricName": "pod_cpu_utilization",
"Dimensions": [
{"Name": "ClusterName", "Value": "'$CLUSTER'"}
]
},
"Period": 300,
"Stat": "Average"
}
},
{
"Id": "ad1",
"Expression": "ANOMALY_DETECTION_BAND(m1, 2)"
}
]' \
--alarm-actions $SNS_TOPIC_ARN
Anomaly Detection Training Period

Anomaly Detection requires at least 2 weeks of training data. Use threshold-based alerts alongside it immediately after deploying new services.

Pattern 3: Composite Alarms

Combine multiple individual alarms logically to reduce noise and detect real incidents accurately.

# Combine individual alarms with AND/OR
aws cloudwatch put-composite-alarm \
--alarm-name "eks-service-degradation" \
--alarm-rule 'ALARM("high-error-rate") AND (ALARM("high-latency") OR ALARM("pod-restart-spike"))' \
--alarm-actions $SNS_TOPIC_ARN \
--alarm-description "Service degradation: error rate increase + latency increase or Pod restart spike"
Composite Alarm Tips

Individual alarms generate many false positives. Composite alarms that combine multiple signals can reliably detect real incidents. Example: "error rate increase AND latency increase" indicates a service outage; "error rate increase AND Pod restarts" indicates an application crash.

Pattern 4: Log-Based Metric Filters

Detect specific patterns in CloudWatch Logs, convert them to metrics, and set alerts.

# Convert OOMKilled events into metrics
aws logs put-metric-filter \
--log-group-name "/aws/eks/$CLUSTER/cluster" \
--filter-name "OOMKilledEvents" \
--filter-pattern '{ $.reason = "OOMKilled" || $.reason = "OOMKilling" }' \
--metric-transformations \
metricName=OOMKilledCount,metricNamespace=EKS/Custom,metricValue=1,defaultValue=0

# Detect 403 Forbidden events (security threat)
aws logs put-metric-filter \
--log-group-name "/aws/eks/$CLUSTER/cluster" \
--filter-name "UnauthorizedAccess" \
--filter-pattern '{ $.responseStatus.code = 403 }' \
--metric-transformations \
metricName=ForbiddenAccessCount,metricNamespace=EKS/Security,metricValue=1,defaultValue=0

Incident Detection Maturity Model

Classify organizational detection capability into four levels to diagnose the current level and define a roadmap to the next.

LevelStageDetection MethodToolsTarget MTTD
Level 1BasicManual monitoring + basic alarmsCloudWatch Alarms< 30 min
Level 2StandardThresholds + log metric filtersCloudWatch + Prometheus< 10 min
Level 3AdvancedAnomaly detection + Composite AlarmsAnomaly Detection + AMP< 5 min
Level 4AutomatedAuto-detection + auto-remediationLambda + EventBridge + FIS< 1 min
MTTD (Mean Time To Detect)

Average time from incident occurrence to detection. The goal is to continuously reduce MTTD as an organization matures from Level 1 to Level 4. Choose the appropriate level based on the organization's SLOs.

Auto-Remediation Pattern

A pattern where EventBridge and Lambda run automatic remediation actions when a specific incident is detected.

# EventBridge rule: trigger Lambda on Pod OOMKilled detection
aws events put-rule \
--name "eks-oom-auto-remediation" \
--event-pattern '{
"source": ["aws.cloudwatch"],
"detail-type": ["CloudWatch Alarm State Change"],
"detail": {
"alarmName": ["eks-oom-killed-alarm"],
"state": {"value": ["ALARM"]}
}
}'
Auto-Remediation Caution

Apply auto-remediation to production only after thorough testing. Incorrect auto-remediation logic can worsen an incident. Start by running in DRY_RUN mode to validate logic by receiving alerts only, then expand the automation scope gradually.

Set appropriate alert channels and response SLAs by severity to prevent alert fatigue and focus on important incidents.

SeverityAlert ChannelsResponse SLAExamples
P1 (Critical)PagerDuty + Phone CallWithin 15 minFull service outage, risk of data loss
P2 (High)Slack DM + PagerDutyWithin 30 minPartial outage, severe performance degradation
P3 (Medium)Slack channelWithin 4 hoursIncreased Pod restarts, resource utilization warnings
P4 (Low)Email / Jira ticketNext business dayDisk usage increase, certificate nearing expiration
Alert Fatigue

Too many alerts lead ops teams to ignore them (alert fatigue). Route P3/P4 alerts only to Slack channels and send only genuine incidents (P1/P2) to PagerDuty. Review alert rules periodically and remove false positives.