Skip to main content

Autonomous Response

📅 Date: 2026-04-07 | ⏱️ Reading time: About 12 minutes


1. Overview

Autonomous Response is an operational paradigm where AI Agents detect incidents, collect and analyze context, and autonomously execute recovery within predefined guardrails.

Three Stages of Autonomous Response

[Detection]
CloudWatch Alarm · DevOps Guru · K8s Event

[Decision]
Collect context with MCP → AI root cause analysis → Determine response

[Execution]
Automatic recovery within safe bounds OR escalation

Why Autonomous Response is Needed

  • MTTR Reduction: Manual response average 2 hours → AI autonomous response average 5 minutes
  • 24/7 Unmanned Operations: Significantly reduced burden of nighttime/weekend alerts
  • Consistency: Eliminate human judgment variance, standardized response
  • Learning Effect: Continuously improve accuracy by learning response patterns

2. Operations Automation Patterns: Human-Directed, Programmatically-Executed

The core of AIOps is a model where humans define Intent and guardrails, and the system executes programmatically.

2.1 Three Pattern Spectrum

Prompt-Driven (Interactive)

  • Humans direct each step in natural language
  • AI performs single tasks
  • Suitable for: Exploratory debugging, new types of failures
  • Limitation: Human-in-the-Loop, inefficient for repetitive scenarios

Spec-Driven (Codified)

  • Declaratively define operational scenarios as Specs
  • System executes programmatically
  • Suitable for: Repetitive deployments, standardized operational procedures
  • Core: Define Spec once → unlimited repeated execution

Agent-Driven (Autonomous)

  • AI Agent detects events → collects context → autonomous response
  • Human-on-the-Loop (humans set guardrails)
  • Suitable for: Automatic incident response, cost optimization, predictive scaling
  • Core: Second-level response, intelligent context-based judgment

2.2 Pattern Comparison: EKS Pod CrashLoopBackOff Response

Operation Pattern Comparison: EKS Cluster Issue Response
Prompt-Driven · Spec-Driven · Agent-Driven
항목
Prompt-Driven
Spec-Driven
Agent-Driven
Human Role
Direct each step (Human-in-the-Loop)
Define intent + Review results
Set guardrails + Handle exceptions (Human-on-the-Loop)
Response Start
Operator checks alert, then directs AI
Pre-defined pipeline trigger
Agent auto-starts after receiving alert
Data Collection
Request one by one via prompts
Auto-collect data defined in spec
Concurrent multi-source collection via MCP
Analysis
Operator reviews results, directs next step
Execute pre-defined validation logic
AI auto-analyzes to root cause
Recovery
AI executes after operator approval
Declarative rollback/change via GitOps
Autonomous recovery within guardrails
Learning
Relies on operator personal experience
Org knowledge via spec version history
Auto-learn from result feedback
Response Time
Minutes~Hours
Minutes
Seconds~Minutes
Representative Tools
Q Developer, ChatOps
Kiro + GitOps + Argo CD
Kagent, Strands SOPs
Real-world Combination: The three patterns are complementary. Explore new failures with Prompt-Driven, codify recurring patterns with Spec-Driven, and finally automate with Agent-Driven in a gradual maturity process.
Combining Patterns in Practice

The three patterns are not mutually exclusive but complementary. Explore and analyze new failure types with Prompt-Driven, codify repeatable patterns as Spec-Driven, and finally automate as Agent-Driven in a progressive maturity process.


3. AI Agent Incident Response

3.1 Limitations of Traditional Automation

EventBridge + Lambda-based automation is rule-based and has limitations:

[Traditional Method: Rule-Based Automation]
CloudWatch Alarm → EventBridge Rule → Lambda → Fixed Action

Problems:
✗ "Scale out if CPU > 80%" — cause could be memory leak
✗ "Alert if Pod restarts > 5" — response differs by cause
✗ Cannot handle complex failures
✗ Cannot adapt to new patterns

3.2 AI Agent-Based Autonomous Response

🚨 Incident Response Pattern Comparison

Traditional Response vs AI Agent Response

Traditional Response (Traditional)
1CloudWatch alarm triggered
2EventBridge rule matching
3Lambda function execution
4Static runbook execution (restart/scale)
5Manual escalation
Limitations:
Static rules, limited context, root cause unresolved
AI Agent Response (AI Agent)
1CloudWatch alerts + K8s events received
2Integrated metrics+logs+traces+events via MCP
3AI root cause analysis
4Context-based dynamic runbook generation
5Safe automated recovery execution
6Recovery verification + feedback learning
Advantages:
Multiple data sources, root cause resolution, self-learning

AI Agents respond autonomously with context-based judgment.

3.3 Kagent Automatic Incident Response

Kagent is a Kubernetes Native AI Agent that defines automatic responses through CRDs.

# Kagent: Automatic incident response agent
apiVersion: kagent.dev/v1alpha1
kind: Agent
metadata:
name: incident-responder
namespace: kagent-system
spec:
description: "EKS incident automatic response agent"
modelConfig:
provider: bedrock
model: anthropic.claude-sonnet
region: ap-northeast-2
systemPrompt: |
You are an EKS incident response agent.

## Response Principles
1. Safety first: Escalate risky changes to humans
2. Root cause priority: Respond to causes, not symptoms
3. Minimal intervention: Perform only necessary minimum actions
4. Record all actions: Automatically report to Slack and JIRA

## Allowed Automatic Actions
- Pod restart (CrashLoopBackOff, 5+ times)
- HPA min/max adjustment (within ±50% of current value)
- Deployment rollback (to previous version)
- Node drain (MemoryPressure/DiskPressure)

## Escalation Targets
- Actions with potential data loss
- Impact to 50%+ of replicas
- StatefulSet related changes
- Network policy changes

tools:
- name: kubectl
type: kmcp
config:
allowedVerbs: ["get", "describe", "logs", "top", "rollout", "scale", "delete"]
deniedResources: ["secrets", "configmaps"]
- name: cloudwatch
type: kmcp
config:
actions: ["GetMetricData", "DescribeAlarms", "GetInsight"]
- name: slack
type: mcp
config:
webhook_url: "${SLACK_WEBHOOK}"
channel: "#incidents"

triggers:
- type: cloudwatch-alarm
filter:
severity: ["CRITICAL", "HIGH"]
- type: kubernetes-event
filter:
reason: ["CrashLoopBackOff", "OOMKilled", "FailedScheduling"]

3.4 Strands Agent SOP: Complex Failure Response

Strands is a Python-based OSS Agent framework that defines SOPs (Standard Operating Procedures) as code.

# Strands Agent: Automatic complex failure response
from strands import Agent
from strands.tools import eks_tool, cloudwatch_tool, slack_tool, jira_tool

incident_agent = Agent(
name="complex-incident-handler",
model="bedrock/anthropic.claude-sonnet",
tools=[eks_tool, cloudwatch_tool, slack_tool, jira_tool],
sop="""
## Complex Failure Response SOP

### Phase 1: Situation Assessment (within 30 seconds)
1. Query CloudWatch alarms and DevOps Guru insights
2. Check related service Pod status
3. Check node status and resource utilization
4. Check recent deployment history (changes within 10 minutes)

### Phase 2: Root Cause Analysis (within 2 minutes)
1. Extract error patterns from logs
2. Metric correlation analysis (CPU, Memory, Network, Disk)
3. Analyze temporal correlation with deployment changes
4. Check dependent service status

### Phase 3: Automatic Response
Automatic actions by cause:

**Deployment-Related Failures:**
- If deployment exists within last 10 minutes → automatic rollback
- Check status after rollback → complete if normalized

**Resource Shortage:**
- If CPU/Memory > 90% → adjust HPA or add Karpenter nodes
- If Disk > 85% → clean unnecessary logs/images

**Dependent Service Failures:**
- RDS connection failure → check connection pool settings, restart if needed
- SQS delays → check DLQ, scale out consumers

**Unknown Cause:**
- Escalate to human
- Share all collected data on Slack

### Phase 4: Post-Processing
1. Create incident timeline
2. Create JIRA incident ticket
3. Post report to Slack #incidents channel
4. Save as learning data (feedback loop)
"""
)
Core Value of AI Agents

Beyond EventBridge+Lambda, context-based autonomous response is possible. By unifying multiple data sources (CloudWatch, EKS API, X-Ray, deployment history) through MCP integration queries, even complex failures that cannot be handled by rules can be analyzed for root causes and appropriate actions automatically performed.

3.5 Amazon Q Developer Integration

Amazon Q Developer simplifies operations with a natural language interface:

[User Question]
"Find Pods experiencing OOM in this cluster"

[Amazon Q Developer Response]
OOM events found:
- payment-service-7d8f9c4b-xyz (namespace: payment)
└─ 3 recent OOMKilled (last hour)
└─ Memory limits: 512Mi, actual usage: 520Mi
└─ Recommendation: increase memory limits to 1Gi

Commands executed:
$ kubectl get events --all-namespaces --field-selector reason=OOMKilled
$ kubectl top pod -n payment payment-service-7d8f9c4b-xyz

Which action would you like?
1. Automatically adjust memory limits (apply VPA)
2. Start detailed memory profiling
3. Analyze all related logs

4. Tribal Knowledge Utilization

AI Agents learn the team's operational history (Tribal Knowledge) to automate recovery.

4.1 Learning Past Incidents

# Learning past incident response patterns
from strands import Agent

knowledge_base = {
"incident_patterns": [
{
"symptom": "payment-service 500 error surge",
"root_cause": "RDS connection pool exhaustion",
"solution": "Increase maxPoolSize or fix connection leak",
"frequency": 5,
"last_occurrence": "2026-03-15"
},
{
"symptom": "API Gateway 504 timeout",
"root_cause": "Lambda cold start + VPC ENI allocation delay",
"solution": "Enable Provisioned Concurrency",
"frequency": 3,
"last_occurrence": "2026-02-20"
}
]
}

# AI Agent references past patterns
tribal_agent = Agent(
name="tribal-knowledge-responder",
model="bedrock/anthropic.claude-sonnet",
tools=[eks_tool, knowledge_base_tool],
sop="""
## Tribal Knowledge-Based Response

1. Analyze current symptoms
2. Search for similar past patterns
3. Apply validated solutions first
4. If new pattern, explore then update Knowledge Base
"""
)

4.2 Automatic Knowledge Base Updates

# Automatic learning after incident response
apiVersion: batch/v1
kind: Job
metadata:
name: update-knowledge-base
spec:
template:
spec:
containers:
- name: learner
image: my-registry/incident-learner:latest
env:
- name: INCIDENT_ID
value: "INC-2026-04-07-001"
- name: KNOWLEDGE_BASE_S3
value: "s3://my-bucket/tribal-knowledge.json"

5. Kiro Programmatic Debugging

5.1 Directive vs Programmatic Response Comparison

[Directive-Based Response] — Manual, repetitive, high cost
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Operator: "payment-service 500 error occurring"
AI: "Which Pod is it occurring in?"
Operator: "payment-xxx Pod"
AI: "Show me the logs"
Operator: (execute kubectl logs and copy-paste)
AI: "Looks like DB connection error. Check RDS status"
...repeat...

Total time: 15-30 minutes, many manual tasks

[Programmatic Response] — Automatic, systematic, cost-efficient
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Alert: "payment-service 500 error occurring"

Kiro Spec:
1. Query Pod status with EKS MCP
2. Collect and analyze error logs
3. Check related AWS services (RDS, SQS) status
4. Diagnose root cause
5. Generate automatic fix code
6. Create PR and verify

Total time: 2-5 minutes, automated

5.2 Kiro + MCP Debugging Workflow

5.3 Specific Scenario: OOMKilled Automatic Response

[Kiro Programmatic Debugging: OOMKilled]

1. Detection: payment-service Pod OOMKilled event

2. Kiro Spec execution:
→ EKS MCP: get_events(namespace="payment", reason="OOMKilled")
→ EKS MCP: get_pod_logs(pod="payment-xxx", previous=true)
→ CloudWatch MCP: query_metrics("pod_memory_utilization", last="1h")

3. AI Analysis:
"Detected memory leak pattern where payment-service memory usage
increases by 256Mi every 2 hours after startup.
Confirmed Redis connections not properly closed in logs."

4. Automatic Fix:
- memory limits 256Mi → 512Mi (temporary measure)
- Generate Redis connection pool cleanup code patch
- Add memory profiling configuration

5. PR Creation:
Title: "fix: payment-service Redis connection leak"
- deployment.yaml: adjust memory limits
- redis_client.go: add defer conn.Close()
- monitoring: add memory usage dashboard
Core of Programmatic Debugging

Programmatically analyze and resolve issues through Kiro + EKS MCP. Compared to directive-style manual response, this enables cost-efficient and faster automation, and learned Specs can be reused when the same issue repeats.


6. Chaos Engineering + AI

6.1 AWS FIS EKS Action Types

AWS Fault Injection Service (FIS) provides EKS-specific fault injection actions:

💥 Chaos Engineering Experiment Results

AWS FIS-based Fault Injection and AI Learning

ExperimentInjected FaultSystem ReactionAI Learning
Pod Termination
Terminate 2/3 podsHPA recovery after 30s
"Pod termination → HPA response pattern"
Node Failure
Drain 1 nodeKarpenter replacement after 2 min
"Node failure → Karpenter response time"
Network Latency
Add 100ms latencyTimeout errors spike
"Network latency → timeout threshold"
CPU Stress
90% CPU loadThrottling occurs
"CPU stress → throttling pattern"
Memory Leak
Gradual memory increaseOOMKilled occurs
"Memory leak pattern → proactive detection rule"
Feedback Loop: As FIS injects faults and AI learns system response patterns, the AI Agent's automatic response capabilities continuously improve. The virtuous cycle of "fault injection → observation → learning → response improvement" is key to autonomous operations.

6.2 AI-Based Failure Pattern Learning

AI learns Chaos Engineering experiment results to improve response capabilities.

# AI learning data collection after FIS experiments
from strands import Agent

chaos_analyzer = Agent(
name="chaos-pattern-analyzer",
model="bedrock/anthropic.claude-sonnet",
sop="""
## Chaos Engineering Result Analysis

1. Collect FIS experiment results
- Injected failure type
- System reaction time
- Recovery time
- Impact scope

2. Pattern analysis
- Map failure propagation paths
- Identify vulnerable points
- Identify recovery bottlenecks

3. Update response rules
- Add learning content to existing SOP
- Create response rules for new patterns
- Adjust escalation thresholds

4. Generate report
- Experiment summary
- Discovered vulnerabilities
- Recommended improvements
"""
)

6.3 FIS Experiment Example: Pod Deletion with SLO Protection

{
"description": "EKS Pod fault injection with SLO protection",
"targets": {
"eks-payment-pods": {
"resourceType": "aws:eks:pod",
"selectionMode": "COUNT(2)",
"resourceTags": {
"app": "payment-service"
},
"parameters": {
"clusterIdentifier": "my-cluster",
"namespace": "payment"
}
}
},
"actions": {
"delete-pod-safely": {
"actionId": "aws:eks:pod-delete",
"parameters": {
"kubernetesServiceAccount": "fis-experiment-role",
"maxPodsToDelete": "2",
"podDeletionMode": "one-at-a-time"
},
"targets": {
"Pods": "eks-payment-pods"
}
}
},
"stopConditions": [
{
"source": "aws:cloudwatch:alarm",
"value": "arn:aws:cloudwatch:ap-northeast-2:ACCOUNT_ID:alarm:PaymentService-ErrorRate-SLO"
},
{
"source": "aws:cloudwatch:alarm",
"value": "arn:aws:cloudwatch:ap-northeast-2:ACCOUNT_ID:alarm:PaymentService-Latency-P99-SLO"
}
]
}

Safety Measures:

  • PodDisruptionBudget compliance: Guarantee minimum availability
  • stopConditions: Automatic stop on SLO violations
  • Progressive expansion: Gradual expansion from 1 → 10% → 25%
Chaos Engineering + AI Feedback Loop

When failures are injected with FIS and AI learns system reaction patterns, AI Agent's automatic response capabilities continuously improve. The feedback loop of "Failure injection → Observation → Learning → Response improvement" is the core of autonomous operations.


7. Feedback Loop — Operations to Ontology

7.1 Outer Loop: Operations → Ontology

Patterns learned during autonomous response are fed back to ontology for continuous improvement.

[Inner Loop: Real-time Incident Response]
Detection → Analysis → Recovery (seconds to minutes)

[Outer Loop: Ontology Feedback]
Operational patterns → Ontology update → Harness improvement (days to weeks)

Feedback Items:

ItemCollected DataOntology Reflection
Failure PatternsRoot cause, symptoms, recovery methodAdd new response rules
Recovery TimeMTTR, auto/manual response ratioAdjust automation priorities
Vulnerable PointsRepeatedly failing services/componentsArchitecture improvement recommendations
Response AccuracyAI judgment accuracy, escalation rateModel retraining, threshold adjustment

7.2 AgenticOps → AIDLC Cycle

Practical Example:

# Ontology feedback automation
apiVersion: batch/v1
kind: CronJob
metadata:
name: ontology-feedback
spec:
schedule: "0 2 * * 0" # Every Sunday 02:00
jobTemplate:
spec:
template:
spec:
containers:
- name: feedback-collector
image: my-registry/ontology-feedback:latest
env:
- name: INCIDENT_DB
value: "dynamodb://incidents-table"
- name: ONTOLOGY_REPO
value: "git://ontology-repo.git"
- name: FEEDBACK_THRESHOLD
value: "5" # Add to ontology after 5+ occurrences

Details: Ontology Engineering, Harness Engineering


8. AIOps Maturity Model

📊 AIOps Maturity Model
Evolution: Level 0 (Manual) → Level 4 (Autonomous)
Level 0
Manual
Manual monitoring, kubectl-based, reactive to failures
kubectlManual dashboardsManual alerts
Level 1
Reactive
Managed Add-ons + AMP/AMG, dashboard-based alerting
Managed Add-onsAMPAMGDashboard alerts
Level 2
Declarative
Managed Argo CD + ACK + KRO, GitOps declarative automation
Argo CDACKKROGitOps
Level 3
Predictive
CloudWatch AI + Q Developer, ML anomaly detection + predictive analytics
CloudWatch AIQ DeveloperML Anomaly DetectionPredictive Analytics
Level 4
Autonomous
Kiro + MCP + AI Agent expansion, autonomous operations
KiroMCPQ DeveloperStrandsKagent

Maturity Levels and Autonomous Response

LevelAutonomous Response LevelImplementation Method
Level 0Manual responseHumans directly execute kubectl
Level 1Alert-basedCloudWatch Alarm → Call human
Level 2Reactive automationEventBridge → Lambda → Fixed scripts
Level 3Predictive automationML prediction + preemptive actions
Level 4Autonomous operationsAI Agent context-based autonomous response
Gradual Adoption Recommended

Don't try to leap from Level 0 to Level 4 at once. Transitioning to the next level after accumulating sufficient operational experience and data at each level has higher success probability. Especially for Level 3 → Level 4 transition, validating AI autonomous recovery safety is key.


9. Conclusion

Core Summary

  1. Operations Automation Patterns: Progressive transition from Prompt-Driven → Spec-Driven → Agent-Driven
  2. AI Agent Frameworks: Kagent (K8s Native), Strands (Python OSS), Q Developer (AWS Managed)
  3. Tribal Knowledge: Response automation through past incident learning
  4. Kiro Programmatic Debugging: MCP-based automatic diagnosis, fix, PR generation
  5. Chaos Engineering + AI: FIS experiments → AI learning → Response capability improvement
  6. Outer Loop Feedback: Operational patterns → Ontology update → Harness improvement

Next Steps

StepActionReference Documentation
1Build observability stackObservability Stack
2Adopt ML predictive scalingPredictive Operations
3Deploy Kagent/Strands AgentThis document Section 3
4Chaos Engineering experimentsThis document Section 6
5Build ontology feedback loopOntology Engineering

References