AI Agent Monitoring and Operations
This document covers the monitoring architecture, key metric design, and alerting strategy for Agentic AI applications at a conceptual level.
For Langfuse Helm deployment, AMP/AMG configuration, ServiceMonitor YAML, and Grafana dashboard JSON, see the Monitoring Stack Setup Guide.
1. Overview
Agentic AI applications perform complex reasoning chains and various tool calls, making it difficult to achieve sufficient visibility with traditional APM (Application Performance Monitoring) tools alone. LLM-specialized observability tools like Langfuse and LangSmith provide the following core capabilities:
- Trace tracking: Full flow tracking of LLM calls, tool execution, and agent reasoning processes
- Token usage analysis: Input/output token counts and cost calculation
- Quality evaluation: Response quality scoring and feedback collection
- Debugging: Problem diagnosis through prompt and response content review
This document is intended for platform operators, MLOps engineers, and AI developers. Basic understanding of Kubernetes and Python is required.
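The capabilities above rest on a simple data model: one trace per end-to-end agent request, containing nested observations such as LLM generations and tool spans. As a minimal stdlib sketch (the field names here are illustrative, not the actual Langfuse schema), the kind of record an instrumented agent accumulates looks like this:

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class Generation:
    """One LLM call inside a trace (illustrative fields, not the Langfuse schema)."""
    model: str
    input_tokens: int
    output_tokens: int
    latency_s: float

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens

@dataclass
class Trace:
    """One end-to-end agent request with its nested LLM calls."""
    name: str
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    generations: list[Generation] = field(default_factory=list)

    def total_tokens(self) -> int:
        return sum(g.total_tokens for g in self.generations)

trace = Trace(name="support-agent")
trace.generations.append(Generation("some-balanced-model", 1200, 300, 1.8))
trace.generations.append(Generation("some-balanced-model", 400, 120, 0.9))
print(trace.total_tokens())  # 2020
```

In a real deployment the SDK (e.g. Langfuse's decorators) emits these records automatically; the point is that every downstream metric in this document (tokens, cost, latency) is an aggregation over structures like this.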
2. Monitoring Architecture
Langfuse Architecture Overview
Langfuse v3.162.0+ consists of the following components (per the Langfuse self-hosting documentation):
- Web: serves the Langfuse UI and the ingestion/query APIs
- Worker: asynchronous event processing
- PostgreSQL: transactional application data
- ClickHouse: trace and observation analytics storage
- Redis/Valkey: queue and cache
- S3-compatible blob storage: raw event payloads
AMP/AMG Integrated Data Flow
Prometheus-format metrics from the layers below are scraped into Amazon Managed Service for Prometheus (AMP) and visualized in Amazon Managed Grafana (AMG), while Langfuse ingests trace data through its own pipeline.
Monitoring Data Layers
| Layer | Collection Tool | Metric Pattern | Visible Items |
|---|---|---|---|
| LLM Inference | Langfuse | trace, generation | Token usage, cost, TTFT, per-user patterns |
| Model Server | vLLM Prometheus | vllm_* | Request count, batch size, KV cache utilization, TPS |
| GPU | DCGM Exporter | DCGM_FI_DEV_* | GPU utilization, temperature, power, memory usage |
| Infrastructure | Node Exporter | node_* | CPU, memory, network, disk I/O |
| Gateway | kgateway | envoy_* | Request count, latency, error rate, upstream status |
3. Langfuse vs LangSmith Comparison
| Feature | Langfuse | LangSmith |
|---|---|---|
| License | Open source (MIT) | Commercial (free tier) |
| Deployment | Self-hosted / Cloud | Cloud only |
| Data Sovereignty | Full control | LangChain servers |
| Integration | Multiple frameworks | LangChain optimized |
| Cost | Infrastructure only | Usage-based pricing |
| Scalability | Kubernetes native | Managed |
- Langfuse: When data sovereignty is important or cost optimization is needed
- LangSmith: When LangChain-based development is the focus and quick start is needed
AWS Native Observability: CloudWatch Generative AI Observability
Amazon CloudWatch Generative AI Observability is an AWS-native solution for LLM and AI agent monitoring:
- Infrastructure-agnostic monitoring: Supports AI workloads across Bedrock, EKS, ECS, on-premises, and more
- Agent/tool tracking: Built-in views for agents, knowledge bases, and tool calls
- End-to-end tracing: Tracking across the entire AI stack
- Framework compatibility: Support for external frameworks like LangChain, LangGraph, CrewAI
Using Langfuse v3.x (self-hosted data sovereignty) together with CloudWatch Gen AI Observability (AWS-native integration) provides the most comprehensive observability.
4. Key Monitoring Metrics
This section defines the key metrics to track in Agentic AI applications.
Metric Categories
Latency Metrics
| Metric | Description | Target | Alert Threshold |
|---|---|---|---|
| agent_request_duration_seconds | Total request processing time | P95 < 5s | P99 > 10s |
| llm_inference_duration_seconds | LLM inference time | P95 < 3s | P99 > 8s |
| tool_execution_duration_seconds | Tool execution time | P95 < 1s | P99 > 3s |
| vector_search_duration_seconds | Vector search time | P95 < 200ms | P99 > 500ms |
Token Usage Metrics
| Metric | Description | Monitoring Purpose |
|---|---|---|
| llm_input_tokens_total | Total input tokens | Prompt optimization |
| llm_output_tokens_total | Total output tokens | Response length analysis |
| llm_total_tokens_total | Total tokens | Cost tracking |
| llm_cost_dollars_total | Estimated cost (USD) | Budget management |
Error Rate Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| agent_errors_total | Total agent errors | Error rate > 5% |
| llm_rate_limit_errors_total | Rate Limit errors | > 10 per minute |
| tool_execution_errors_total | Tool execution errors | Error rate > 10% |
| agent_timeout_total | Timeout occurrences | > 5 per minute |
5. PromQL Query Reference
GPU Metrics
```promql
# Overall GPU average utilization
avg(DCGM_FI_DEV_GPU_UTIL)

# Per-node GPU utilization
avg(DCGM_FI_DEV_GPU_UTIL) by (Hostname)

# GPU memory utilization (used / total, %)
avg(DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) * 100) by (gpu)
```
vLLM Metrics
```promql
# Overall TPS (tokens generated per second)
sum(rate(vllm_generation_tokens_total[5m]))

# Per-model TPS
sum(rate(vllm_generation_tokens_total[5m])) by (model)

# TTFT P99 (Time to First Token)
histogram_quantile(0.99, rate(vllm_time_to_first_token_seconds_bucket[5m]))

# TTFT P95
histogram_quantile(0.95, rate(vllm_time_to_first_token_seconds_bucket[5m]))

# E2E Latency P99
histogram_quantile(0.99, rate(vllm_e2e_request_latency_seconds_bucket[5m]))

# Average batch size (concurrently running requests)
avg(vllm_num_requests_running)
```
Gateway Metrics
```promql
# 5xx error rate (%)
sum(rate(envoy_http_downstream_rq_xx{envoy_response_code_class="5"}[5m]))
/
sum(rate(envoy_http_downstream_rq_total[5m])) * 100

# Upstream connect failure rate per cluster
sum(rate(envoy_cluster_upstream_cx_connect_fail[5m])) by (envoy_cluster_name)
```
Cost Metrics
```promql
# Daily total cost
sum(increase(llm_cost_dollars_total[24h]))

# Per-tenant daily cost
sum(increase(llm_cost_dollars_total[24h])) by (tenant_id)

# Per-model cost ratio
sum(increase(llm_cost_dollars_total[24h])) by (model)
  / ignoring(model) group_left
sum(increase(llm_cost_dollars_total[24h]))

# Budget utilization (monthly)
sum(increase(llm_cost_dollars_total[30d])) by (tenant_id)
  / on(tenant_id) group_left
tenant_monthly_budget_usd
```
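The per-model cost ratio query divides each model's spend by total spend; the same arithmetic is easy to verify locally. A sketch with made-up tier names and spend figures:

```python
# Made-up daily spend per model tier, mirroring the per-model cost ratio query.
daily_cost_by_model = {"frontier": 62.0, "balanced": 30.0, "cheap": 8.0}

total = sum(daily_cost_by_model.values())
share = {model: cost / total for model, cost in daily_cost_by_model.items()}

# Shares of total spend per tier, formatted as percentages.
print({model: f"{s:.0%}" for model, s in share.items()})
```

This is the typical shape of the result: a small number of frontier-tier requests dominating the bill, which is exactly what the cascade-routing optimization in section 7 targets.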
6. Alerting Strategy
Alert Threshold Design
| Alert | Condition | Severity | Duration |
|---|---|---|---|
| Agent High Latency | P99 latency > 10s | Warning | 5 min |
| Agent High Error Rate | Error rate > 5% | Critical | 5 min |
| LLM Rate Limit | Rate limit errors > 10/5min | Warning | 2 min |
| Daily Cost Budget | Daily cost > $100 | Warning | Immediate |
| GPU High Temperature | GPU temp > 85C | Warning | 5 min |
| GPU Memory Full | GPU memory > 95% | Critical | 3 min |
| vLLM High Latency | P99 E2E latency > 30s | Warning | 5 min |
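The Duration column above corresponds to a Prometheus `for:` clause: the condition must hold continuously before the alert fires, which suppresses one-off spikes. A minimal sketch of that semantics (the class and its parameters are illustrative, not a real alerting API):

```python
class DurationAlert:
    """Fires only after the condition has held for `required` consecutive checks,
    mimicking a Prometheus `for:` clause (e.g. 5 checks at 1-min intervals ~ 5 min)."""

    def __init__(self, threshold: float, required: int):
        self.threshold = threshold
        self.required = required
        self.streak = 0

    def observe(self, value: float) -> bool:
        # Any reading at or below threshold resets the streak.
        self.streak = self.streak + 1 if value > self.threshold else 0
        return self.streak >= self.required

# Agent High Latency: P99 > 10s sustained for 5 consecutive evaluations.
alert = DurationAlert(threshold=10.0, required=5)
readings = [12.0, 11.5, 9.0, 12.1, 12.2, 12.3, 12.4, 12.5]
fired = [alert.observe(v) for v in readings]
print(fired)  # the dip to 9.0 resets the streak; only the last reading fires
```

The trade-off is detection delay: a 5-minute duration means a genuine outage is noticed 5 minutes late, which is why the Daily Cost Budget alert above fires immediately instead.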
Alert Hierarchy
- Infrastructure layer: GPU temperature, memory, power anomalies
- Model server layer: vLLM latency increase, KV cache shortage
- Application layer: Agent error rate, Rate limit
- Business layer: Cost overrun, SLA violations
- Cross-layer metric correlation: Analyze correlations — LLM request increase -> GPU utilization rise -> infrastructure load increase
- Anomaly detection: When P99 latency suddenly increases, simultaneously check GPU temperature and memory usage
- Capacity planning: Consider provisioning additional GPU nodes when average GPU utilization exceeds 70%
- Latency optimization: Prefer models with lower TTFT to improve user experience and increase throughput
7. Cost Tracking
Cost Tracking Concepts
Track LLM usage costs by the following criteria:
- Per-model: Total cost and request count per model, identifying the most expensive models
- Per-tenant: Per-tenant/team daily token usage and budget utilization
- Per-time: Peak time analysis, cost trends
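All three breakdowns are group-by aggregations over the same usage records. A stdlib sketch with hypothetical tenants and records, keeping costs in integer cents to avoid float drift:

```python
from collections import defaultdict

# Hypothetical usage records as a cost-tracking pipeline might receive them.
records = [
    {"tenant": "team-a", "model": "balanced", "cost_cents": 6},
    {"tenant": "team-a", "model": "cheap", "cost_cents": 1},
    {"tenant": "team-b", "model": "balanced", "cost_cents": 12},
]

cost_by_tenant: dict[str, int] = defaultdict(int)
cost_by_model: dict[str, int] = defaultdict(int)
for rec in records:
    cost_by_tenant[rec["tenant"]] += rec["cost_cents"]
    cost_by_model[rec["model"]] += rec["cost_cents"]

print(dict(cost_by_tenant))  # {'team-a': 7, 'team-b': 12}
print(dict(cost_by_model))   # {'balanced': 18, 'cheap': 1}
```

In practice Prometheus does this aggregation via labels (the `by (tenant_id)` and `by (model)` queries in section 5); the sketch just makes the grouping explicit.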
Per-Model Cost Reference (2026-04 baseline)
| Tier | Model | Input ($/1M tok) | Output ($/1M tok) | Features |
|---|---|---|---|---|
| Frontier | Claude Opus 4.7 | $15 | $75 | Highest quality reasoning |
| Frontier | GPT-4.1 / o3 | $10 | $30 | Complex reasoning |
| Frontier | Gemini 2.5 Pro | $1.25 | $5 | Enhanced multimodal |
| Balanced | Claude Sonnet 4.6 | $3 | $15 | Quality-cost balance |
| Balanced | GPT-4.1 mini | $0.40 | $1.60 | Fast inference |
| Balanced | Gemini 2.5 Flash | $0.10 | $0.40 | High throughput |
| Fast/Cheap | Claude Haiku 4.5 | $0.80 | $4 | Simple tasks |
| Fast/Cheap | GPT-4.1 nano / o4-mini | $0.15 | $0.60 | Ultra-low cost |
| Fast/Cheap | Gemini 2.5 Flash-Lite | $0.05 | $0.20 | Minimal latency |
| Open-weight | DeepSeek V3.1 | Self-hosted | Self-hosted | Open license |
| Open-weight | Llama 4 Scout | Self-hosted | Self-hosted | Meta official |
| Open-weight | Qwen3-72B | Self-hosted | Self-hosted | Alibaba Cloud |
- Model selection optimization: Use cheaper models (GPT-4.1 nano, Haiku 4.5, Gemini 2.5 Flash-Lite) for simple tasks
- Prompt optimization: Reduce input tokens by removing unnecessary context
- Caching: Cache responses for repetitive queries (Prompt Caching, Semantic Caching)
- Cascade Routing: Try low-cost model first, fallback to high-performance model on failure — 66% cost savings possible
- Open-weight models: Convert to fixed costs with DeepSeek V3.1, Llama 4, Qwen3 when self-hosting
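The cascade routing idea can be sketched in a few lines: try the cheap tier first, and escalate only when it declines or fails. Both model functions below are stand-in stubs, not a real provider API:

```python
from typing import Callable, Optional

def cascade(prompt: str,
            cheap: Callable[[str], Optional[str]],
            strong: Callable[[str], str]) -> tuple[str, str]:
    """Try the cheap model first; escalate to the strong model when the cheap
    tier declines (returns None) or raises. Returns (answer, tier used)."""
    try:
        answer = cheap(prompt)
        if answer is not None:
            return answer, "cheap"
    except Exception:
        pass  # treat a cheap-tier failure as a miss and escalate
    return strong(prompt), "strong"

def cheap_model(prompt: str) -> Optional[str]:
    # Stub: pretend the cheap tier only handles short prompts confidently.
    return f"cheap:{prompt}" if len(prompt) < 20 else None

def strong_model(prompt: str) -> str:
    return f"strong:{prompt}"

print(cascade("short question", cheap_model, strong_model))
print(cascade("a much longer, more complex question", cheap_model, strong_model))
```

The savings depend entirely on the escalation criterion: a real implementation would gate on a confidence score, an answerability check, or task classification rather than prompt length.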
8. Operations Checklist
Daily Checks
| Check Item | How to Check | Normal Status |
|---|---|---|
| GPU Status | `kubectl get nodes -l nvidia.com/gpu.present=true` | All nodes Ready |
| Model Pods | `kubectl get pods -n inference` | Running state |
| Error Rate | Grafana dashboard | < 1% |
| Response Time | P99 latency | < 5 seconds |
| GPU Utilization | DCGM metrics | 40-80% |
| Memory Usage | GPU memory | < 90% |
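The pod check above is easy to automate by parsing `kubectl get pods` output. A hedged sketch assuming the default column layout (NAME READY STATUS RESTARTS AGE); the sample output and pod names are made up:

```python
def unhealthy_pods(kubectl_output: str) -> list[str]:
    """Return names of pods whose STATUS is not Running or Completed,
    given plain `kubectl get pods` output in the default column layout."""
    bad = []
    for line in kubectl_output.strip().splitlines()[1:]:  # skip header row
        cols = line.split()
        if len(cols) >= 3 and cols[2] not in ("Running", "Completed"):
            bad.append(cols[0])
    return bad

sample = """\
NAME      READY   STATUS             RESTARTS   AGE
vllm-0    1/1     Running            0          3d
vllm-1    0/1     CrashLoopBackOff   12         3d
embed-0   1/1     Running            0          3d
"""
print(unhealthy_pods(sample))  # ['vllm-1']
```

For anything beyond a quick check, prefer `kubectl get pods -o json` and parse the structured output instead of column positions, which can shift with flags like `-o wide`.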
Weekly Checks
| Check Item | How to Check | Action |
|---|---|---|
| Cost Analysis | Kubecost report | Identify anomalous costs |
| Capacity Planning | Resource trends | Plan scaling |
| Security Patches | Image scan | Patch vulnerabilities |
| Backup Validation | Recovery test | Verify backup policy |
9. Monitoring Maturity Model
10. Next Steps
- Monitoring Stack Setup Guide - AMP/AMG deployment, Langfuse Helm installation, ServiceMonitor, Grafana dashboard production setup
- LLMOps Observability Comparison Guide - In-depth comparison of Langfuse vs LangSmith vs Helicone
- Agentic AI Platform Architecture - Overall platform design
- RAG Evaluation Framework - Quality evaluation with Ragas
References
- Langfuse Documentation
- LangSmith Documentation
- CloudWatch Generative AI Observability
- OpenTelemetry Documentation
- Prometheus Monitoring