Kagent - Kubernetes AI Agent Management

This document covers how to efficiently deploy and manage AI agents in Kubernetes environments using Kagent. Kagent is an open-source tool that enables declarative management of AI agent lifecycles based on the Kubernetes Operator pattern.

Overview

Kagent is a reference architecture for managing AI agents in a Kubernetes-native manner. It allows declarative definition of agents, tools, and workflows through Custom Resource Definitions (CRDs), with an Operator that automatically deploys and manages them.

Kagent Project Status

Kagent is a reference architecture and design pattern for Kubernetes-based AI agent management. As an official open-source project is not yet publicly available, the examples in this document are based on conceptual implementation. For production environments, consider proven alternatives such as Bedrock AgentCore, KubeAI, and LangGraph Platform.

Alternative Solutions Comparison

| Solution | Features | Suitable Use Cases |
|---|---|---|
| Kagent (Reference) | AI agent-specific CRDs, workflow orchestration | Multi-agent systems, complex workflows |
| KubeAI | Lightweight LLM serving, OpenAI-compatible API | Simple model serving, rapid prototyping |
| Bedrock AgentCore | AWS managed Agent runtime, MCP/A2A native, auto-scaling | AWS-native Agent deployment, managed infrastructure preferred |
| LangGraph Platform | Agent workflow framework, state management, LangSmith native integration | Complex multi-step agents, stateful workflows |

Key Features

  • Declarative Agent Management: YAML-based agent definition and deployment
  • Tool Registry: Centralized management of tools used by agents via CRD
  • Auto-Scaling: Dynamic scaling through HPA/KEDA integration
  • Multi-Agent Orchestration: Agent collaboration for complex workflows
  • Observability Integration: Native integration with Langfuse/LangSmith and OpenTelemetry

Target Audience

This document is intended for Kubernetes administrators, platform engineers, and MLOps engineers. Understanding of basic Kubernetes concepts (Pod, Deployment, CRD) is required.

re:Invent 2025 Related Session

CNS421: Streamline Amazon EKS Operations with Agentic AI — A code talk session covering automated EKS cluster management using AI agents like Kagent, real-time issue diagnosis, and automatic recovery methods.

Key Content:

  • Model Context Protocol (MCP): Standard protocol for AI agents to integrate with AWS services
  • Automated Incident Response: Automatic diagnosis and recovery of Pod failures, resource shortages, network issues
  • AWS Service Integration: Native integration with CloudWatch, Systems Manager, EKS API
  • Live Demo: Real-time cluster problem-solving demonstration

Kagent Architecture

Kagent follows the Kubernetes Operator pattern and consists of a controller, a set of CRDs, an admission webhook, and a metrics server.

Component Description

| Component | Role | Description |
|---|---|---|
| Kagent Controller | Reconciliation loop | Detects CRD changes and reconciles resources to the desired state |
| Admission Webhook | Validation/Mutation | Validates resources and sets defaults on CRD creation/modification |
| Metrics Server | Metrics collection | Exposes agent state and performance metrics |
| Agent CRD | Agent definition | Spec, model, and tool configuration for AI agents |
| Tool CRD | Tool definition | Defines tools (API, search, etc.) for agent use |
| Workflow CRD | Workflow definition | Defines multi-agent collaboration workflows |

Component Interaction

Kagent Installation

Prerequisites

  • Kubernetes cluster (v1.25 or later)
  • kubectl CLI tool
  • Helm v3 (for Helm installation)
  • cert-manager (for Webhook TLS certificate management)

cert-manager Required

Kagent's Admission Webhook requires TLS certificates. cert-manager must be installed in the cluster before installation.
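The values.yaml shown later in this guide points the webhook at an Issuer named kagent-selfsigned-issuer. The guide does not show that Issuer, so the following is a minimal sketch of what it could look like, assuming a self-signed certificate is acceptable and that the Issuer lives in the kagent-system namespace. Only the name and kind come from this document; the namespace placement is an assumption.

```yaml
# Sketch only: a self-signed cert-manager Issuer for the webhook certificate.
# The name matches issuerRef.name in values.yaml; the namespace is an assumption.
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: kagent-selfsigned-issuer
  namespace: kagent-system
spec:
  selfSigned: {}
```

An Issuer is namespaced, so it has to be applied after the kagent-system namespace exists. For anything beyond a test cluster, a CA-backed Issuer or ClusterIssuer is usually a better fit than a self-signed one.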

Helm Chart Installation

Helm is the recommended installation method.

1. Add Helm Repository

# Add Kagent Helm repository
helm repo add kagent https://kagent-dev.github.io/kagent
helm repo update

# Check available versions
helm search repo kagent --versions

2. Create Namespace

# Create Kagent system namespace
kubectl create namespace kagent-system

# Create namespace for agent deployment
kubectl create namespace ai-agents

3. Configure values.yaml

# values.yaml
controller:
  # Controller replica count (high availability)
  replicaCount: 2

  # Resource configuration
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 500m
      memory: 512Mi

  # Log level
  logLevel: info

  # Metrics configuration
  metrics:
    enabled: true
    port: 8080

webhook:
  # Enable webhook
  enabled: true

  # Certificate configuration (using cert-manager)
  certManager:
    enabled: true
    issuerRef:
      name: kagent-selfsigned-issuer
      kind: Issuer

# Monitoring configuration
monitoring:
  # Create ServiceMonitor (Prometheus Operator)
  serviceMonitor:
    enabled: true
    namespace: observability
    interval: 30s

# RBAC configuration
rbac:
  create: true

# Service account
serviceAccount:
  create: true
  name: kagent-controller

# Node selector
nodeSelector:
  kubernetes.io/os: linux

# Tolerations
tolerations: []

# Affinity
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: app.kubernetes.io/name
                operator: In
                values:
                  - kagent
          topologyKey: kubernetes.io/hostname
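For a development cluster, a smaller override file layered on top of values.yaml may be enough. The sketch below uses only keys that appear in the values.yaml above. The file name values-dev.yaml and the debug log level are assumptions, and whether the chart accepts these values depends on the chart itself.

```yaml
# values-dev.yaml (hypothetical file name)
# Sketch of a development override; every key is taken from values.yaml above.
controller:
  # A single replica is usually sufficient outside production
  replicaCount: 1
  # Assumes the chart accepts "debug" as a log level
  logLevel: debug

monitoring:
  serviceMonitor:
    # Disable when the Prometheus Operator CRDs are not installed in the cluster
    enabled: false
```

Helm merges multiple values files in order, with later files taking precedence, so passing --values values.yaml --values values-dev.yaml applies only these overrides on top of the base configuration.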

4. Execute Helm Installation

# Install Kagent
helm install kagent kagent/kagent \
  --namespace kagent-system \
  --values values.yaml \
  --wait

# Check installation status
helm status kagent -n kagent-system

Manifest Installation

If not using Helm, you can apply manifests directly.

1. Install CRD

# Download and apply CRD manifest
kubectl apply -f https://github.com/kagent-dev/kagent/releases/latest/download/crds.yaml

# Verify CRD installation
kubectl get crds | grep kagent

Expected output:

agents.kagent.dev         2025-02-05T00:00:00Z
tools.kagent.dev          2025-02-05T00:00:00Z
workflows.kagent.dev      2025-02-05T00:00:00Z
memorystores.kagent.dev   2025-02-05T00:00:00Z

2. Deploy Controller

# kagent-controller.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: kagent-system
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kagent-controller
  namespace: kagent-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kagent-controller-role
rules:
  - apiGroups: ["kagent.dev"]
    resources: ["agents", "tools", "workflows", "memorystores"]
    verbs: ["*"]
  - apiGroups: ["kagent.dev"]
    resources: ["agents/status", "workflows/status"]
    verbs: ["get", "update", "patch"]
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["*"]
  - apiGroups: [""]
    resources: ["services", "configmaps", "secrets", "pods"]
    verbs: ["*"]
  - apiGroups: ["autoscaling"]
    resources: ["horizontalpodautoscalers"]
    verbs: ["*"]
  - apiGroups: ["keda.sh"]
    resources: ["scaledobjects"]
    verbs: ["*"]
  # Lease access for leader election (needed because --leader-elect=true
  # is set below; assumes the controller uses Lease-based election)
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kagent-controller-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kagent-controller-role
subjects:
  - kind: ServiceAccount
    name: kagent-controller
    namespace: kagent-system
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kagent-controller
  namespace: kagent-system
spec:
  replicas: 2
  selector:
    matchLabels:
      app: kagent-controller
  template:
    metadata:
      labels:
        app: kagent-controller
    spec:
      serviceAccountName: kagent-controller
      containers:
        - name: controller
          image: ghcr.io/kagent-dev/kagent-controller:latest
          args:
            - --leader-elect=true
            - --metrics-bind-address=:8080
            - --health-probe-bind-address=:8081
          ports:
            - containerPort: 8080
              name: metrics
            - containerPort: 8081
              name: health
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8081
            initialDelaySeconds: 15
            periodSeconds: 20
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8081
            initialDelaySeconds: 5
            periodSeconds: 10

# Deploy controller
kubectl apply -f kagent-controller.yaml

Installation Verification

After installation is complete, verify status with the following commands.

# Check controller pod status
kubectl get pods -n kagent-system

# Expected output:
# NAME                                 READY   STATUS    RESTARTS   AGE
# kagent-controller-5d4f6b7c8d-abc12   1/1     Running   0          2m
# kagent-controller-5d4f6b7c8d-def34   1/1     Running   0          2m

# Check CRD
kubectl get crds | grep kagent.dev

# Check controller logs
kubectl logs -n kagent-system -l app=kagent-controller --tail=50

# Check webhook status (Helm installation)
kubectl get validatingwebhookconfigurations | grep kagent
kubectl get mutatingwebhookconfigurations | grep kagent

Installation Troubleshooting

If the controller fails to start:

  1. Check events with kubectl describe pod -n kagent-system <pod-name>
  2. Verify RBAC permissions are correctly configured
  3. Verify cert-manager is functioning properly (when using Webhook)

Agent CRD Definition

The Agent CRD declaratively defines all settings for an AI agent.

Agent Resource Spec

apiVersion: kagent.dev/v1alpha1
kind: Agent
metadata:
  name: customer-support-agent
  namespace: ai-agents
  labels:
    app: customer-support
    team: support
    environment: production
spec:
  # Agent basic information
  displayName: "Customer Support Agent"
  description: "AI agent that responds to customer inquiries and creates tickets"

  # Model configuration
  model:
    provider: openai # openai, anthropic, bedrock, vllm
    name: gpt-4-turbo
    endpoint: "" # Custom endpoint (vLLM, etc.)
    temperature: 0.7
    maxTokens: 4096
    topP: 0.9
    frequencyPenalty: 0.0
    presencePenalty: 0.0
    # API key reference
    apiKeySecretRef:
      name: openai-api-key
      key: api-key

  # System prompt
  systemPrompt: |
    You are a friendly and professional customer support agent.

    ## Role
    - Provide accurate and helpful responses to customer inquiries
    - Search knowledge base when needed to verify information
    - Create tickets for unresolved issues

    ## Guidelines
    - Always maintain a polite and empathetic attitude
    - Honestly acknowledge when you don't know something
    - Guide identity verification process when sensitive information is requested

  # List of tools to use
  tools:
    - name: search-knowledge-base
    - name: create-ticket
    - name: get-customer-info

  # Memory configuration
  memory:
    type: redis
    config:
      host: redis-master.ai-data.svc.cluster.local
      port: 6379
      database: 0
      ttl: 3600 # Session TTL (seconds)
      maxHistory: 50 # Maximum conversation history count
    secretRef:
      name: redis-credentials
      key: password

  # Scaling configuration
  scaling:
    minReplicas: 2
    maxReplicas: 10
    metrics:
      - type: cpu
        target:
          type: Utilization
          averageUtilization: 70
      - type: memory
        target:
          type: Utilization
          averageUtilization: 80
    # KEDA scaling (optional)
    keda:
      enabled: true
      triggers:
        - type: prometheus
          metadata:
            serverAddress: http://prometheus.observability.svc:9090
            metricName: agent_active_sessions
            threshold: "50"
            query: sum(agent_active_sessions{agent="customer-support"})

  # Resource limits
  resources:
    requests:
      memory: "512Mi"
      cpu: "250m"
    limits:
      memory: "1Gi"
      cpu: "500m"

  # Environment variables
  env:
    - name: LOG_LEVEL
      value: "info"
    - name: LANGFUSE_ENABLED
      value: "true"
    - name: LANGFUSE_HOST
      value: "http://langfuse.observability.svc:3000"

  # Observability configuration
  observability:
    tracing:
      enabled: true
      provider: langfuse # langfuse, langsmith, cloudwatch
      sampleRate: 1.0
    metrics:
      enabled: true
      port: 9090
    # CloudWatch Generative AI Observability configuration (optional)
    cloudwatch:
      enabled: false
      region: ap-northeast-2
      namespace: AgenticAI/Agents

  # Health check
  healthCheck:
    enabled: true
    path: /health
    port: 8080
    initialDelaySeconds: 10
    periodSeconds: 30
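To make the keda block in the spec above more concrete, the following sketch shows a KEDA ScaledObject that a controller could plausibly render from it. This is an illustration, not Kagent's actual output. It assumes the generated Deployment shares the Agent's name, and it expresses the CPU and memory targets as KEDA triggers rather than as a separate HPA.

```yaml
# Sketch only: a ScaledObject derived from the Agent's scaling settings.
# The Deployment name is an assumption.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: customer-support-agent
  namespace: ai-agents
spec:
  scaleTargetRef:
    name: customer-support-agent # Deployment assumed to be created by the controller
  minReplicaCount: 2
  maxReplicaCount: 10
  triggers:
    # Session-based trigger copied from spec.scaling.keda.triggers
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.observability.svc:9090
        query: sum(agent_active_sessions{agent="customer-support"})
        threshold: "50"
    # CPU and memory targets expressed as KEDA triggers
    - type: cpu
      metricType: Utilization
      metadata:
        value: "70"
    - type: memory
      metricType: Utilization
      metadata:
        value: "80"
```

KEDA creates and manages its own HorizontalPodAutoscaler for the target, so a separately defined HPA should not point at the same Deployment. That is why the resource metrics are folded into the trigger list in this sketch.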

Tool CRD Definition

The Tool CRD defines the tools that agents can use.

apiVersion: kagent.dev/v1alpha1
kind: Tool
metadata:
  name: search-knowledge-base
  namespace: ai-agents
  labels:
    category: retrieval
spec:
  # Tool type: api, retrieval, code, human
  type: retrieval

  # Tool description (referenced by LLM when selecting tools)
  displayName: "Search Knowledge Base"
  description: |
    Search for relevant documents in the company's knowledge base.
    Use this to find information to answer customer inquiries.

  # Retrieval configuration
  retrieval:
    vectorStore:
      type: milvus
      host: milvus-proxy.ai-data.svc.cluster.local
      port: 19530
      collection: support-knowledge
    embedding:
      provider: openai
      model: text-embedding-3-small
      dimension: 1536
    search:
      topK: 5
      scoreThreshold: 0.7
      filter: ""

  # Input parameter definition
  parameters:
    - name: query
      type: string
      required: true
      description: "Question or keywords to search"
    - name: category
      type: string
      required: false
      description: "Document category filter (e.g., faq, manual, policy)"
      enum: ["faq", "manual", "policy", "all"]
      default: "all"

  # Output schema
  output:
    type: array
    items:
      type: object
      properties:
        content:
          type: string
          description: "Document content"
        score:
          type: number
          description: "Similarity score"
        metadata:
          type: object
          description: "Document metadata"
---
apiVersion: kagent.dev/v1alpha1
kind: Tool
metadata:
  name: create-ticket
  namespace: ai-agents
  labels:
    category: api
spec:
  type: api

  displayName: "Create Ticket"
  description: |
    Create a customer inquiry as a ticket.
    Use when the agent cannot resolve the issue directly.

  # API configuration
  api:
    endpoint: http://ticketing-service.support.svc:8080/api/v1/tickets
    method: POST
    timeout: 30s
    retries: 3
    headers:
      Content-Type: application/json
    # Authentication configuration
    authentication:
      type: bearer
      secretRef:
        name: ticketing-api-token
        key: token

  parameters:
    - name: title
      type: string
      required: true
      description: "Ticket title"
      maxLength: 200
    - name: description
      type: string
      required: true
      description: "Detailed problem description"
    - name: priority
      type: string
      required: false
      description: "Priority"
      enum: ["low", "medium", "high", "urgent"]
      default: "medium"
    - name: category
      type: string
      required: true
      description: "Inquiry category"
      enum: ["billing", "technical", "general", "complaint"]
    - name: customer_id
      type: string
      required: true
      description: "Customer ID"

  output:
    type: object
    properties:
      ticket_id:
        type: string
        description: "Created ticket ID"
      status:
        type: string
        description: "Ticket status"
      created_at:
        type: string
        description: "Creation time"
---
apiVersion: kagent.dev/v1alpha1
kind: Tool
metadata:
  name: get-customer-info
  namespace: ai-agents
  labels:
    category: api
spec:
  type: api

  displayName: "Retrieve Customer Information"
  description: |
    Retrieve customer information by customer ID.
    Use when customer verification is needed.

  api:
    endpoint: http://customer-service.crm.svc:8080/api/v1/customers/{customer_id}
    method: GET
    timeout: 10s
    authentication:
      type: bearer
      secretRef:
        name: crm-api-token
        key: token

  parameters:
    - name: customer_id
      type: string
      required: true
      description: "Customer ID to retrieve"
      pattern: "^[A-Z0-9]{8}$"

  output:
    type: object
    properties:
      id:
        type: string
      name:
        type: string
      email:
        type: string
      tier:
        type: string
      created_at:
        type: string

Memory Configuration

The MemoryStore resource configures where agents store conversation context and state.

apiVersion: kagent.dev/v1alpha1
kind: MemoryStore
metadata:
  name: agent-memory-redis
  namespace: ai-agents
spec:
  # Memory type: redis, postgres, in-memory
  type: redis

  # Redis configuration
  redis:
    host: redis-master.ai-data.svc.cluster.local
    port: 6379
    database: 0
    # TLS configuration
    tls:
      enabled: true
      secretRef:
        name: redis-tls-cert
    # Authentication
    auth:
      secretRef:
        name: redis-credentials
        passwordKey: password

  # Memory policy
  policy:
    # Session TTL
    sessionTTL: 3600
    # Maximum conversation history
    maxConversationHistory: 100
    # Memory compression (summarize long conversations)
    compression:
      enabled: true
      threshold: 50
      model: gpt-3.5-turbo
    # Long-term memory configuration
    longTermMemory:
      enabled: true
      vectorStore:
        type: milvus
        collection: agent-memories

Scaling Configuration

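The scaling block below shows the detailed auto-scaling options for an agent. Its behavior and metrics fields mirror the Kubernetes HorizontalPodAutoscaler (autoscaling/v2) API.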

# HPA-based scaling
scaling:
  minReplicas: 2
  maxReplicas: 20

  # Scaling behavior configuration
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
        - type: Pods
          value: 4
          periodSeconds: 15
      selectPolicy: Max

  # Metrics-based scaling
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    # Custom metrics
    - type: Pods
      pods:
        metric:
          name: agent_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"
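As a point of reference, the scaling block above maps almost one-to-one onto a standard HorizontalPodAutoscaler. The sketch below shows what a controller might generate from it. The object name and the target Deployment name are assumptions, and only part of the behavior and metrics configuration is repeated to keep it short.

```yaml
# Sketch only: an autoscaling/v2 HPA equivalent to the scaling block above.
# The metadata.name and scaleTargetRef.name values are assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: customer-support-agent
  namespace: ai-agents
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: customer-support-agent
  minReplicas: 2
  maxReplicas: 20
  behavior:
    scaleDown:
      # Wait 5 minutes before acting on a lower recommendation
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    # Pods metrics are served by the custom metrics API
    - type: Pods
      pods:
        metric:
          name: agent_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"
```

The HPA computes the desired replica count as ceil(currentReplicas × currentMetricValue / targetValue). For example, 4 replicas averaging 90% CPU against the 70% target yield ceil(4 × 90 / 70) = 6 replicas. The Pods metric only works if a custom metrics adapter (for example, Prometheus Adapter) exposes agent_requests_per_second to the cluster.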

Agent Lifecycle Management

Agent Deployment Process

1. Preparation

# Create namespace (skip if it was already created during installation)
kubectl create namespace ai-agents

# Create API key secret
kubectl create secret generic openai-api-key \
  --namespace ai-agents \
  --from-literal=api-key='sk-your-api-key-here'

# Create Redis authentication secret
kubectl create secret generic redis-credentials \
  --namespace ai-agents \
  --from-literal=password='your-redis-password'
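The Tool resources defined earlier reference two further Secrets, ticketing-api-token and crm-api-token, each read through the key token. The commands above do not create them. A declarative sketch follows; the token values are placeholders, and in a GitOps setup the real values would normally come from a secret manager rather than a committed manifest.

```yaml
# Sketch only: Secrets referenced by the create-ticket and get-customer-info Tools.
# Token values are placeholders.
apiVersion: v1
kind: Secret
metadata:
  name: ticketing-api-token
  namespace: ai-agents
type: Opaque
stringData:
  token: "replace-with-ticketing-api-token"
---
apiVersion: v1
kind: Secret
metadata:
  name: crm-api-token
  namespace: ai-agents
type: Opaque
stringData:
  token: "replace-with-crm-api-token"
```

Using stringData lets the values be written in plain text; the API server stores them base64-encoded under data.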

2. Deploy Tool Resources

# Apply Tool resources
kubectl apply -f tools/search-knowledge-base.yaml
kubectl apply -f tools/create-ticket.yaml
kubectl apply -f tools/get-customer-info.yaml

# Check tool status
kubectl get tools -n ai-agents

3. Deploy Agent Resources

# Apply Agent resource
kubectl apply -f agents/customer-support-agent.yaml

# Check deployment status
kubectl get agents -n ai-agents

# Check detailed status
kubectl describe agent customer-support-agent -n ai-agents

4. Deployment Verification

# Check created resources
kubectl get deployments -n ai-agents
kubectl get services -n ai-agents
kubectl get hpa -n ai-agents

# Check pod status
kubectl get pods -n ai-agents -l app=customer-support-agent

# Check logs
kubectl logs -n ai-agents -l app=customer-support-agent --tail=100

# Test agent endpoint
kubectl port-forward svc/customer-support-agent 8080:8080 -n ai-agents

# Test in another terminal
curl -X POST http://localhost:8080/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "Hello, I have a billing inquiry."}'

Update Procedure

Procedure for changing and updating agent configuration.

Change Configuration and Apply

# Check current configuration
kubectl get agent customer-support-agent -n ai-agents -o yaml

# Modify configuration (using editor)
kubectl edit agent customer-support-agent -n ai-agents

# Or apply after modifying file
kubectl apply -f agents/customer-support-agent.yaml

Monitor Rolling Update

# Monitor update status
kubectl rollout status deployment/customer-support-agent -n ai-agents

# Check pod replacement status
kubectl get pods -n ai-agents -l app=customer-support-agent -w

# Check events
kubectl get events -n ai-agents --sort-by='.lastTimestamp' | grep customer-support

Canary Deployment (Optional)

apiVersion: kagent.dev/v1alpha1
kind: Agent
metadata:
  name: customer-support-agent-canary
  namespace: ai-agents
  labels:
    app: customer-support
    version: canary
spec:
  # Test with new model or configuration
  model:
    provider: openai
    name: gpt-4o # New model
    temperature: 0.5

  # Test with minimum replicas
  scaling:
    minReplicas: 1
    maxReplicas: 2

  # Rest of configuration same...

Rollback Procedure

Procedure for rolling back to previous version when issues occur.

Deployment Rollback

# Check rollout history
kubectl rollout history deployment/customer-support-agent -n ai-agents

# Check specific revision details
kubectl rollout history deployment/customer-support-agent -n ai-agents --revision=2

# Rollback to previous version
kubectl rollout undo deployment/customer-support-agent -n ai-agents

# Rollback to specific revision
kubectl rollout undo deployment/customer-support-agent -n ai-agents --to-revision=2

# Check rollback status
kubectl rollout status deployment/customer-support-agent -n ai-agents

Agent CRD Rollback

# Apply the previous version of the Agent manifest
kubectl apply -f agents/customer-support-agent-v1.yaml

# Or restore previous version from Git
git checkout HEAD~1 -- agents/customer-support-agent.yaml
kubectl apply -f agents/customer-support-agent.yaml

Rollback Precautions

  • Back up current state before rollback
  • Check data compatibility if database schema changes occurred
  • Test all functions work properly after rollback

Multi-Agent Orchestration

Define workflows where multiple agents collaborate to handle complex tasks.

Agent-to-Agent Communication

Workflow Definition

Define multi-agent workflows using the Workflow CRD.

apiVersion: kagent.dev/v1alpha1
kind: Workflow
metadata:
  name: research-report-workflow
  namespace: ai-agents
spec:
  displayName: "Research Report Generation Workflow"
  description: "Perform research on a topic and generate an analysis report"

  # Workflow input
  input:
    - name: topic
      type: string
      required: true
      description: "Research topic"
    - name: depth
      type: string
      required: false
      default: "standard"
      enum: ["quick", "standard", "deep"]

  # Define workflow steps
  steps:
    # Step 1: Information gathering
    - name: research
      agent: research-agent
      input:
        topic: "{{ .input.topic }}"
        sources: ["web", "academic", "news"]
      output:
        - name: research_data
          path: ".result.data"
      timeout: 300s
      retries: 2

    # Step 2: Data analysis (parallel execution)
    - name: analyze-trends
      agent: analysis-agent
      dependsOn: [research]
      input:
        data: "{{ .steps.research.output.research_data }}"
        analysis_type: "trend"
      output:
        - name: trend_analysis
          path: ".result"
      parallel: true

    - name: analyze-sentiment
      agent: analysis-agent
      dependsOn: [research]
      input:
        data: "{{ .steps.research.output.research_data }}"
        analysis_type: "sentiment"
      output:
        - name: sentiment_analysis
          path: ".result"
      parallel: true

    # Step 3: Write report
    - name: write-report
      agent: writer-agent
      dependsOn: [analyze-trends, analyze-sentiment]
      input:
        research: "{{ .steps.research.output.research_data }}"
        trends: "{{ .steps.analyze-trends.output.trend_analysis }}"
        sentiment: "{{ .steps.analyze-sentiment.output.sentiment_analysis }}"
        format: "markdown"
      output:
        - name: report
          path: ".result.document"

    # Step 4: Review and revise
    - name: review
      agent: reviewer-agent
      dependsOn: [write-report]
      input:
        document: "{{ .steps.write-report.output.report }}"
        criteria: ["accuracy", "clarity", "completeness"]
      output:
        - name: final_report
          path: ".result.reviewed_document"

  # Workflow output
  output:
    report: "{{ .steps.review.output.final_report }}"
    metadata:
      research_sources: "{{ .steps.research.output.research_data.sources }}"
      analysis_summary: "{{ .steps.analyze-trends.output.trend_analysis.summary }}"

  # Error handling
  errorHandling:
    # Action on step failure
    onStepFailure: retry
    maxRetries: 3
    # Action on workflow failure
    onWorkflowFailure: notify
    notificationChannel:
      type: slack
      webhook:
        secretRef:
          name: slack-webhook
          key: url

  # Timeout configuration
  timeout: 1800s # 30 minutes

  # Concurrent execution limit
  concurrency:
    maxConcurrent: 5
    policy: queue # queue, reject, replace

Workflow Execution

# Apply workflow definition
kubectl apply -f workflows/research-report-workflow.yaml

# Execute workflow (create WorkflowRun)
cat <<EOF | kubectl apply -f -
apiVersion: kagent.dev/v1alpha1
kind: WorkflowRun
metadata:
  name: research-run-001
  namespace: ai-agents
spec:
  workflowRef:
    name: research-report-workflow
  input:
    topic: "2024 AI Trends Analysis"
    depth: "deep"
EOF

# Check execution status
kubectl get workflowruns -n ai-agents

# Check detailed status
kubectl describe workflowrun research-run-001 -n ai-agents

# Check execution logs
kubectl logs -n ai-agents -l workflow-run=research-run-001 --tail=100

Workflow Monitoring

# Example WorkflowRun status, as returned by:
#   kubectl get workflowrun research-run-001 -n ai-agents -o yaml
apiVersion: kagent.dev/v1alpha1
kind: WorkflowRun
metadata:
  name: research-run-001
status:
  phase: Running # Pending, Running, Succeeded, Failed
  startTime: "2025-02-05T10:00:00Z"
  steps:
    - name: research
      phase: Succeeded
      startTime: "2025-02-05T10:00:00Z"
      completionTime: "2025-02-05T10:03:00Z"
    - name: analyze-trends
      phase: Running
      startTime: "2025-02-05T10:03:00Z"
    - name: analyze-sentiment
      phase: Running
      startTime: "2025-02-05T10:03:00Z"
    - name: write-report
      phase: Pending
    - name: review
      phase: Pending
  conditions:
    - type: Initialized
      status: "True"
    - type: Running
      status: "True"

Operations Guide

Monitoring Configuration

# ServiceMonitor for Prometheus
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kagent-agents
  namespace: observability
spec:
  selector:
    matchLabels:
      kagent.dev/monitored: "true"
  namespaceSelector:
    matchNames:
      - ai-agents
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
---
# PrometheusRule for Alerts
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kagent-alerts
  namespace: observability
spec:
  groups:
    - name: kagent-agent-alerts
      rules:
        - alert: AgentHighErrorRate
          expr: |
            sum(rate(agent_request_errors_total[5m])) by (agent) /
            sum(rate(agent_request_total[5m])) by (agent) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Agent {{ $labels.agent }} error rate increased"
            description: "Error rate exceeded 5%. Current: {{ $value | humanizePercentage }}"

        - alert: AgentHighLatency
          expr: |
            histogram_quantile(0.99,
              sum(rate(agent_request_duration_seconds_bucket[5m])) by (agent, le)
            ) > 30
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Agent {{ $labels.agent }} response delay"
            description: "P99 latency exceeded 30 seconds"

        - alert: AgentPodNotReady
          expr: |
            kube_deployment_status_replicas_ready{deployment=~".*-agent"} /
            kube_deployment_status_replicas{deployment=~".*-agent"} < 0.5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Agent pod availability degraded"
            description: "Ready pods for {{ $labels.deployment }} are below 50%"
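A ServiceMonitor selects Services, not Pods, and its endpoints.port field refers to a port name on that Service. For the ServiceMonitor above to discover an agent, the agent's Service therefore needs the kagent.dev/monitored label and a port named metrics. The sketch below shows such a Service for the customer support agent. Whether Kagent's controller would add this label and port automatically is not specified here, so treat the manifest as an assumption; the pod selector reuses the app=customer-support-agent label from the kubectl commands in this guide, and 9090 is the metrics port from the Agent spec.

```yaml
# Sketch only: a Service shaped so the kagent-agents ServiceMonitor can scrape it.
apiVersion: v1
kind: Service
metadata:
  name: customer-support-agent
  namespace: ai-agents
  labels:
    # Matches spec.selector.matchLabels in the ServiceMonitor
    kagent.dev/monitored: "true"
spec:
  selector:
    app: customer-support-agent
  ports:
    - name: http
      port: 8080
      targetPort: 8080
    # The port name must match endpoints[].port in the ServiceMonitor
    - name: metrics
      port: 9090
      targetPort: 9090
```

If no targets appear in Prometheus, checking the Service labels and the port name against the ServiceMonitor is usually the quickest first step.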

Logging Configuration

# Agent logging ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: agent-logging-config
  namespace: ai-agents
data:
  logging.yaml: |
    version: 1
    disable_existing_loggers: false
    formatters:
      json:
        class: pythonjsonlogger.jsonlogger.JsonFormatter
        format: "%(asctime)s %(name)s %(levelname)s %(message)s"
    handlers:
      console:
        class: logging.StreamHandler
        formatter: json
        stream: ext://sys.stdout
    loggers:
      kagent:
        level: INFO
        handlers: [console]
        propagate: false
      langchain:
        level: WARNING
        handlers: [console]
        propagate: false
    root:
      level: INFO
      handlers: [console]

Troubleshooting

Common Problem Resolution

| Issue | Cause | Solution | Severity |
|---|---|---|---|
| Pod CrashLoopBackOff | API key error, insufficient memory | Verify secrets, increase resources | High |
| High latency | Model response delay, network issues | Adjust timeout, change model | Medium |
| Tool execution failure | Endpoint error, auth failure | Verify tool config, refresh secret | High |
| Scaling not working | Metric collection failed, HPA config error | Check Prometheus connection, validate HPA | Medium |

Debugging Commands

# Check detailed agent status
kubectl describe agent <agent-name> -n ai-agents

# Check pod events
kubectl get events -n ai-agents --field-selector involvedObject.name=<pod-name>

# Check logs from the previous (terminated) container instance
kubectl logs <pod-name> -n ai-agents --previous

# Stream logs in real-time
kubectl logs -f -l app=<agent-name> -n ai-agents

# Access inside pod
kubectl exec -it <pod-name> -n ai-agents -- /bin/sh

# Test network connectivity
kubectl run debug --rm -it --image=curlimages/curl -- \
  curl -v http://customer-support-agent.ai-agents.svc:8080/health

Conclusion

Using Kagent enables declarative management of AI agents in Kubernetes environments. Key benefits include:

  • Declarative Management: GitOps workflow support with YAML-based agent definitions
  • Automated Operations: Automatic recovery and scaling through Operator pattern
  • Standardization: Agent definition standardization through CRD
  • Scalability: Leveraging Kubernetes-native scaling mechanisms
  • Observability: Integrated monitoring and tracking support