
AI Agent Monitoring and Operations

This document covers the monitoring architecture, key metric design, and alerting strategy for Agentic AI applications at a conceptual level.

Production Deployment Guide

For Langfuse Helm deployment, AMP/AMG configuration, ServiceMonitor YAML, and Grafana dashboard JSON, see the Monitoring Stack Setup Guide.

1. Overview

Agentic AI applications execute complex reasoning chains and diverse tool calls, making it difficult to gain sufficient visibility with traditional APM (Application Performance Monitoring) tools alone. LLM-specialized observability tools such as Langfuse and LangSmith provide the following core capabilities:

  • Trace tracking: Full flow tracking of LLM calls, tool execution, and agent reasoning processes
  • Token usage analysis: Input/output token counts and cost calculation
  • Quality evaluation: Response quality scoring and feedback collection
  • Debugging: Problem diagnosis through prompt and response content review
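
As a concrete example, instrumenting an agent with the Langfuse Python SDK produces one trace per request, with nested spans for tool execution and generations for model calls. The following is a minimal sketch, assuming a recent SDK version (pip install langfuse) with the LANGFUSE_* environment variables set; the function bodies are illustrative stubs:

from langfuse import observe

@observe()  # nested span: tool execution
def search_documents(query: str) -> list[str]:
    return ["doc-1", "doc-2"]  # stub for a real vector search

@observe(as_type="generation")  # recorded as a generation (model, tokens, latency)
def call_llm(prompt: str) -> str:
    return "stubbed completion"  # stub for a real model call

@observe()  # root trace: one per agent request
def handle_request(question: str) -> str:
    context = search_documents(question)
    return call_llm(f"Context: {context}\nQuestion: {question}")

print(handle_request("What does TTFT measure?"))
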
Target Audience

This document is intended for platform operators, MLOps engineers, and AI developers. Basic understanding of Kubernetes and Python is required.


2. Monitoring Architecture

Langfuse Architecture Overview

Langfuse v3.162.0+ consists of the following components:

  • langfuse-web: API and UI server
  • langfuse-worker: asynchronous ingestion and event processing
  • PostgreSQL: transactional data (projects, users, configuration)
  • ClickHouse: analytical store for traces, observations, and scores
  • Redis: cache and ingestion queue
  • S3-compatible object storage: raw events and multimodal attachments

AMP/AMG Integrated Data Flow

Prometheus-format metrics from the layers below are scraped into Amazon Managed Service for Prometheus (AMP) and visualized in Amazon Managed Grafana (AMG); Langfuse traces are ingested and viewed through Langfuse's own backend and UI.

Monitoring Data Layers

| Layer | Collection Tool | Metric Pattern | Visible Items |
|---|---|---|---|
| LLM Inference | Langfuse | trace, generation | Token usage, cost, TTFT, per-user patterns |
| Model Server | vLLM Prometheus | vllm_* | Request count, batch size, KV cache utilization, TPS |
| GPU | DCGM Exporter | DCGM_FI_DEV_* | GPU utilization, temperature, power, memory usage |
| Infrastructure | Node Exporter | node_* | CPU, memory, network, disk I/O |
| Gateway | kgateway | envoy_* | Request count, latency, error rate, upstream status |

3. Langfuse vs LangSmith Comparison

| Feature | Langfuse | LangSmith |
|---|---|---|
| License | Open source (MIT) | Commercial (free tier) |
| Deployment | Self-hosted / Cloud | Cloud only |
| Data Sovereignty | Full control | LangChain servers |
| Integration | Multiple frameworks | LangChain optimized |
| Cost | Infrastructure only | Usage-based pricing |
| Scalability | Kubernetes native | Managed |
Selection Guide
  • Langfuse: When data sovereignty is important or cost optimization is needed
  • LangSmith: When development centers on LangChain and a quick start is needed

AWS Native Observability: CloudWatch Generative AI Observability

Amazon CloudWatch Generative AI Observability is an AWS-native solution for LLM and AI agent monitoring:

  • Infrastructure-agnostic monitoring: Supports AI workloads across Bedrock, EKS, ECS, on-premises, and more
  • Agent/tool tracking: Built-in views for agents, knowledge bases, and tool calls
  • End-to-end tracing: Tracking across the entire AI stack
  • Framework compatibility: Support for external frameworks like LangChain, LangGraph, CrewAI

Using Langfuse v3.x (self-hosted, full data sovereignty) together with CloudWatch Generative AI Observability (AWS-native integration) provides the most comprehensive coverage.


4. Key Monitoring Metrics

This section defines the key metrics to track in Agentic AI applications.

Metric Categories

Latency Metrics

| Metric | Description | Target | Alert Threshold |
|---|---|---|---|
| agent_request_duration_seconds | Total request processing time | P95 < 5s | P99 > 10s |
| llm_inference_duration_seconds | LLM inference time | P95 < 3s | P99 > 8s |
| tool_execution_duration_seconds | Tool execution time | P95 < 1s | P99 > 3s |
| vector_search_duration_seconds | Vector search time | P95 < 200ms | P99 > 500ms |

Token Usage Metrics

| Metric | Description | Monitoring Purpose |
|---|---|---|
| llm_input_tokens_total | Total input tokens | Prompt optimization |
| llm_output_tokens_total | Total output tokens | Response length analysis |
| llm_total_tokens_total | Total tokens | Cost tracking |
| llm_cost_dollars_total | Estimated cost (USD) | Budget management |

Error Rate Metrics

| Metric | Description | Alert Threshold |
|---|---|---|
| agent_errors_total | Total agent errors | Error rate > 5% |
| llm_rate_limit_errors_total | Rate limit errors | > 10 per minute |
| tool_execution_errors_total | Tool execution errors | Error rate > 10% |
| agent_timeout_total | Timeout occurrences | > 5 per minute |
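
How these metrics are emitted depends on the application; the following is a minimal sketch using the prometheus_client library (pip install prometheus-client). The metric names follow the tables above, while the label sets, port, and stubbed logic are illustrative assumptions:

import time

from prometheus_client import Counter, Histogram, start_http_server

AGENT_REQUEST_DURATION = Histogram(
    "agent_request_duration_seconds",
    "Total request processing time",
    buckets=(0.1, 0.5, 1, 2, 5, 10, 30),
)
LLM_INPUT_TOKENS = Counter("llm_input_tokens_total", "Total input tokens", ["model"])
LLM_OUTPUT_TOKENS = Counter("llm_output_tokens_total", "Total output tokens", ["model"])
AGENT_ERRORS = Counter("agent_errors_total", "Total agent errors", ["error_type"])

def handle_request(question: str) -> str:
    with AGENT_REQUEST_DURATION.time():  # observes duration when the block exits
        try:
            answer = "stubbed response"  # placeholder for the real agent pipeline
            LLM_INPUT_TOKENS.labels(model="example-model").inc(1200)  # illustrative counts
            LLM_OUTPUT_TOKENS.labels(model="example-model").inc(300)
            return answer
        except Exception as exc:
            AGENT_ERRORS.labels(error_type=type(exc).__name__).inc()
            raise

if __name__ == "__main__":
    start_http_server(9100)  # exposes /metrics for Prometheus to scrape
    while True:
        handle_request("ping")
        time.sleep(5)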

5. PromQL Query Reference

GPU Metrics

# Overall GPU average utilization
avg(DCGM_FI_DEV_GPU_UTIL)

# Per-node GPU utilization
avg(DCGM_FI_DEV_GPU_UTIL) by (Hostname)

# GPU memory utilization (%)
avg(DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) * 100) by (gpu)

vLLM Metrics

# Overall TPS (tokens generated per second)
sum(rate(vllm_generation_tokens_total[5m]))

# Per-model TPS
sum(rate(vllm_generation_tokens_total[5m])) by (model)

# TTFT P99 (Time to First Token)
histogram_quantile(0.99, rate(vllm_time_to_first_token_seconds_bucket[5m]))

# TTFT P95
histogram_quantile(0.95, rate(vllm_time_to_first_token_seconds_bucket[5m]))

# E2E Latency P99
histogram_quantile(0.99, rate(vllm_e2e_request_latency_seconds_bucket[5m]))

# Average batch size
avg(vllm_num_requests_running)

Gateway Metrics

# 5xx error rate (%)
sum(rate(envoy_http_downstream_rq_xx{envoy_response_code_class="5"}[5m]))
/
sum(rate(envoy_http_downstream_rq_total[5m])) * 100

# Upstream health check failure rate
sum(rate(envoy_cluster_upstream_cx_connect_fail[5m])) by (envoy_cluster_name)

Cost Metrics

# Daily total cost
sum(increase(llm_cost_dollars_total[24h]))

# Per-tenant daily cost
sum(increase(llm_cost_dollars_total[24h])) by (tenant_id)

# Per-model cost ratio
sum(increase(llm_cost_dollars_total[24h])) by (model)
/ ignoring(model) group_left
sum(increase(llm_cost_dollars_total[24h]))

# Budget utilization (monthly)
sum(increase(llm_cost_dollars_total[30d])) by (tenant_id)
/ on(tenant_id) group_left
tenant_monthly_budget_usd
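
The queries above can also be executed programmatically via the Prometheus HTTP API, for example to generate daily cost reports or to prototype alert thresholds. A minimal sketch; the endpoint URL is an assumption to be replaced with your AMP workspace or in-cluster Prometheus address:

import requests

PROM_URL = "http://prometheus:9090"  # assumption: replace with your endpoint

def instant_query(promql: str) -> list[dict]:
    """Run an instant query and return the result vector."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=10)
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Daily total cost across all tenants
result = instant_query("sum(increase(llm_cost_dollars_total[24h]))")
if result:
    print(f"LLM cost over the last 24h: ${float(result[0]['value'][1]):.2f}")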

6. Alerting Strategy

Alert Threshold Design

| Alert | Condition | Severity | Duration |
|---|---|---|---|
| Agent High Latency | P99 latency > 10s | Warning | 5 min |
| Agent High Error Rate | Error rate > 5% | Critical | 5 min |
| LLM Rate Limit | Rate limit errors > 10/5min | Warning | 2 min |
| Daily Cost Budget | Daily cost > $100 | Warning | Immediate |
| GPU High Temperature | GPU temp > 85°C | Warning | 5 min |
| GPU Memory Full | GPU memory > 95% | Critical | 3 min |
| vLLM High Latency | P99 E2E latency > 30s | Warning | 5 min |

Alert Hierarchy

  1. Infrastructure layer: GPU temperature, memory, power anomalies
  2. Model server layer: vLLM latency increase, KV cache shortage
  3. Application layer: Agent error rate, Rate limit
  4. Business layer: Cost overrun, SLA violations
Monitoring Best Practices
  1. Cross-layer metric correlation: Analyze how load propagates across layers (LLM request increase → GPU utilization rise → infrastructure load increase)
  2. Anomaly detection: When P99 latency suddenly increases, simultaneously check GPU temperature and memory usage
  3. Capacity planning: Consider provisioning additional GPU nodes when average GPU utilization exceeds 70%
  4. Cost optimization: Prioritize models with lower TTFT to improve user experience and increase throughput

7. Cost Tracking

Cost Tracking Concepts

Track LLM usage costs by the following criteria:

  • Per-model: Total cost and request count per model, identifying the most expensive models
  • Per-tenant: Per-tenant/team daily token usage and budget utilization
  • Per-time: Peak time analysis, cost trends

Per-Model Cost Reference (2026-04 baseline)¹

| Tier | Model | Input ($/1M tok) | Output ($/1M tok) | Features |
|---|---|---|---|---|
| Frontier | Claude Opus 4.7 | $15 | $75 | Highest quality reasoning |
| Frontier | GPT-4.1 / o3 | $10 | $30 | Complex reasoning |
| Frontier | Gemini 2.5 Pro | $1.25 | $5 | Enhanced multimodal |
| Balanced | Claude Sonnet 4.6 | $3 | $15 | Quality-cost balance |
| Balanced | GPT-4.1 mini | $0.40 | $1.60 | Fast inference |
| Balanced | Gemini 2.5 Flash | $0.10 | $0.40 | High throughput |
| Fast/Cheap | Claude Haiku 4.5 | $0.80 | $4 | Simple tasks |
| Fast/Cheap | GPT-4.1 nano / o4-mini | $0.15 | $0.60 | Ultra-low cost |
| Fast/Cheap | Gemini 2.5 Flash-Lite | $0.05 | $0.20 | Minimal latency |
| Open-weight | DeepSeek V3.1 | Self-hosted | Self-hosted | Open license |
| Open-weight | Llama 4 Scout | Self-hosted | Self-hosted | Meta official |
| Open-weight | Qwen3-72B | Self-hosted | Self-hosted | Alibaba Cloud |
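
The reference table translates directly into per-request cost estimation, which is the number behind llm_cost_dollars_total. A minimal sketch; the model keys are illustrative, and in practice the price map should live in configuration so it can track the official pricing pages:

PRICE_PER_MTOK: dict[str, tuple[float, float]] = {
    # model key (illustrative): (input $/1M tok, output $/1M tok), 2026-04 reference values
    "claude-opus-4.7": (15.00, 75.00),
    "claude-sonnet-4.6": (3.00, 15.00),
    "gpt-4.1-mini": (0.40, 1.60),
    "gemini-2.5-flash-lite": (0.05, 0.20),
}

def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request, as added to llm_cost_dollars_total."""
    input_price, output_price = PRICE_PER_MTOK[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Example: 2,000 input + 500 output tokens on Claude Sonnet 4.6
# (2000 * 3.00 + 500 * 15.00) / 1e6 = $0.0135
print(f"${estimate_cost_usd('claude-sonnet-4.6', 2000, 500):.4f}")
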
Cost Optimization Tips
  1. Model selection optimization: Use cheaper models (GPT-4.1 nano, Haiku 4.5, Gemini 2.5 Flash-Lite) for simple tasks
  2. Prompt optimization: Reduce input tokens by removing unnecessary context
  3. Caching: Cache responses for repetitive queries (Prompt Caching, Semantic Caching)
  4. Cascade Routing: Try a low-cost model first and fall back to a high-performance model on failure (up to 66% cost savings; see the sketch after this list)
  5. Open-weight models: Convert to fixed costs with DeepSeek V3.1, Llama 4, Qwen3 when self-hosting
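
A minimal sketch of cascade routing (tip 4), assuming hypothetical call_model and is_acceptable helpers; most of the design effort goes into the acceptance check (schema validation, a verifier model, or a confidence score):

def call_model(model: str, prompt: str) -> str:
    # Hypothetical placeholder: replace with the provider SDK call for `model`.
    return f"[{model}] answer to: {prompt}"

def is_acceptable(answer: str) -> bool:
    # Hypothetical placeholder: replace with schema validation, a verifier
    # model, or a confidence threshold.
    return bool(answer.strip())

def cascade(prompt: str) -> str:
    """Try the cheapest model first; escalate only when the check fails."""
    answer = ""
    for model in ("gemini-2.5-flash-lite", "claude-sonnet-4.6", "claude-opus-4.7"):
        answer = call_model(model, prompt)
        if is_acceptable(answer):
            return answer
    return answer  # fall back to the strongest model's response

print(cascade("Summarize today's error budget status."))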

8. Operations Checklist

Daily Checks

| Check Item | How to Check | Normal Status |
|---|---|---|
| GPU Status | `kubectl get nodes -l nvidia.com/gpu.present=true` | All nodes Ready |
| Model Pods | `kubectl get pods -n inference` | Running state |
| Error Rate | Grafana dashboard | < 1% |
| Response Time | P99 latency | < 5 seconds |
| GPU Utilization | DCGM metrics | 40-80% |
| Memory Usage | GPU memory | < 90% |
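
The GPU node check in the first row can be scripted for a daily cron job. A minimal sketch that shells out to kubectl; it assumes kubectl is installed and configured for the target cluster:

import json
import subprocess

def gpu_nodes_ready() -> bool:
    """Return True if every node labeled nvidia.com/gpu.present=true is Ready."""
    out = subprocess.run(
        ["kubectl", "get", "nodes", "-l", "nvidia.com/gpu.present=true", "-o", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    nodes = json.loads(out)["items"]
    for node in nodes:
        conditions = {c["type"]: c["status"] for c in node["status"]["conditions"]}
        if conditions.get("Ready") != "True":
            print(f"NOT READY: {node['metadata']['name']}")
            return False
    return bool(nodes)  # also fail if no GPU nodes matched the label

if __name__ == "__main__":
    print("All GPU nodes Ready" if gpu_nodes_ready() else "GPU node check failed")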

Weekly Checks

| Check Item | How to Check | Action |
|---|---|---|
| Cost Analysis | Kubecost report | Identify anomalous costs |
| Capacity Planning | Resource trends | Plan scaling |
| Security Patches | Image scan | Patch vulnerabilities |
| Backup Validation | Recovery test | Verify backup policy |

9. Monitoring Maturity Model

| Level | Stage | Capabilities |
|---|---|---|
| Level 1 | Basic | Log collection, basic metrics |
| Level 2 | Standard | Langfuse/LangSmith tracing, Grafana dashboards |
| Level 3 | Advanced | Cost tracking, quality assessment, automated alerts |
| Level 4 | Optimized | A/B testing, auto-tuning, predictive analytics |

10. Next Steps

For hands-on deployment of the stack described here, including Langfuse Helm deployment, AMP/AMG configuration, ServiceMonitor YAML, and Grafana dashboard JSON, see the Monitoring Stack Setup Guide.

Footnotes

  1. As of 2026-04-17. For the latest pricing, see the official pricing pages: OpenAI Pricing, Anthropic Pricing, Google AI Pricing