AI Agent Monitoring and Operations
This document covers the monitoring architecture, key metric design, and alerting strategy for Agentic AI applications at a conceptual level.
For Langfuse Helm deployment, AMP/AMG configuration, ServiceMonitor YAML, and Grafana dashboard JSON, see the Monitoring Stack Setup Guide.
1. Overview
Agentic AI applications perform complex reasoning chains and various tool calls, making it difficult to achieve sufficient visibility with traditional APM (Application Performance Monitoring) tools alone. LLM-specialized observability tools like Langfuse and LangSmith provide the following core capabilities:
- Trace tracking: Full flow tracking of LLM calls, tool execution, and agent reasoning processes
- Token usage analysis: Input/output token counts and cost calculation
- Quality evaluation: Response quality scoring and feedback collection
- Debugging: Problem diagnosis through prompt and response content review
This document is intended for platform operators, MLOps engineers, and AI developers. Basic understanding of Kubernetes and Python is required.
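The capabilities above rest on a simple data model: one trace per end-to-end agent request, containing nested observations such as LLM generations and tool spans. As a minimal stdlib sketch (the field names here are illustrative, not the actual Langfuse schema), the kind of record an instrumented agent accumulates looks like this:

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class Generation:
    """One LLM call inside a trace (illustrative fields, not the Langfuse schema)."""
    model: str
    input_tokens: int
    output_tokens: int
    latency_s: float

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens

@dataclass
class Trace:
    """One end-to-end agent request with its nested LLM calls."""
    name: str
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    generations: list[Generation] = field(default_factory=list)

    def total_tokens(self) -> int:
        return sum(g.total_tokens for g in self.generations)

trace = Trace(name="support-agent")
trace.generations.append(Generation("some-balanced-model", 1200, 300, 1.8))
trace.generations.append(Generation("some-balanced-model", 400, 120, 0.9))
print(trace.total_tokens())  # 2020
```

In a real deployment the SDK (e.g. Langfuse's decorators) emits these records automatically; the point is that every downstream metric in this document (tokens, cost, latency) is an aggregation over structures like this.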
2. Monitoring Architecture
Langfuse Architecture Overview
Langfuse v3.162.0+ consists of the following components (per the Langfuse self-hosting documentation):
- Web: serves the Langfuse UI and the ingestion/query APIs
- Worker: asynchronous event processing
- PostgreSQL: transactional application data
- ClickHouse: trace and observation analytics storage
- Redis/Valkey: queue and cache
- S3-compatible blob storage: raw event payloads
AMP/AMG Integrated Data Flow
Prometheus-format metrics from the layers below are scraped into Amazon Managed Service for Prometheus (AMP) and visualized in Amazon Managed Grafana (AMG), while Langfuse ingests trace data through its own pipeline.
Monitoring Data Layers
| Layer | Collection Tool | Metric Pattern | Visible Items |
|---|---|---|---|
| LLM Inference | Langfuse | trace, generation | Token usage, cost, TTFT, per-user patterns |
| Model Server | vLLM Prometheus | vllm_* | Request count, batch size, KV cache utilization, TPS |
| GPU | DCGM Exporter | DCGM_FI_DEV_* | GPU utilization, temperature, power, memory usage |
| Infrastructure | Node Exporter | node_* | CPU, memory, network, disk I/O |
| Gateway | kgateway | envoy_* | Request count, latency, error rate, upstream status |
3. Langfuse vs LangSmith Comparison
| Feature | Langfuse | LangSmith |
|---|---|---|
| License | Open source (MIT) | Commercial (free tier) |
| Deployment | Self-hosted / Cloud | Cloud only |
| Data Sovereignty | Full control | LangChain servers |
| Integration | Multiple frameworks | LangChain optimized |
| Cost | Infrastructure only | Usage-based pricing |
| Scalability | Kubernetes native | Managed |
- Langfuse: When data sovereignty is important or cost optimization is needed
- LangSmith: When LangChain-based development is the focus and quick start is needed
AWS Native Observability: CloudWatch Generative AI Observability
Amazon CloudWatch Generative AI Observability is an AWS-native solution for LLM and AI agent monitoring:
- Infrastructure-agnostic monitoring: Supports AI workloads across Bedrock, EKS, ECS, on-premises, and more
- Agent/tool tracking: Built-in views for agents, knowledge bases, and tool calls
- End-to-end tracing: Tracking across the entire AI stack
- Framework compatibility: Support for external frameworks like LangChain, LangGraph, CrewAI
Using Langfuse v3.x (self-hosted data sovereignty) together with CloudWatch Gen AI Observability (AWS-native integration) provides the most comprehensive observability.
4. Key Monitoring Metrics
This section defines the key metrics to track in Agentic AI applications.
Metric Categories
Latency Metrics
| Metric | Description | Target | Alert Threshold |
|---|---|---|---|
| agent_request_duration_seconds | Total request processing time | P95 < 5s | P99 > 10s |
| llm_inference_duration_seconds | LLM inference time | P95 < 3s | P99 > 8s |
| tool_execution_duration_seconds | Tool execution time | P95 < 1s | P99 > 3s |
| vector_search_duration_seconds | Vector search time | P95 < 200ms | P99 > 500ms |
Token Usage Metrics
| Metric | Description | Monitoring Purpose |
|---|---|---|
| llm_input_tokens_total | Total input tokens | Prompt optimization |
| llm_output_tokens_total | Total output tokens | Response length analysis |
| llm_total_tokens_total | Total tokens | Cost tracking |
| llm_cost_dollars_total | Estimated cost (USD) | Budget management |
Error Rate Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| agent_errors_total | Total agent errors | Error rate > 5% |
| llm_rate_limit_errors_total | Rate Limit errors | > 10 per minute |
| tool_execution_errors_total | Tool execution errors | Error rate > 10% |
| agent_timeout_total | Timeout occurrences | > 5 per minute |
5. PromQL Query Reference
GPU Metrics
```promql
# Overall GPU average utilization
avg(DCGM_FI_DEV_GPU_UTIL)

# Per-node GPU utilization
avg(DCGM_FI_DEV_GPU_UTIL) by (Hostname)

# GPU memory utilization (used / total, %)
avg(DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) * 100) by (gpu)
```
vLLM Metrics
```promql
# Overall TPS (tokens generated per second)
sum(rate(vllm_generation_tokens_total[5m]))

# Per-model TPS
sum(rate(vllm_generation_tokens_total[5m])) by (model)

# TTFT P99 (Time to First Token)
histogram_quantile(0.99, rate(vllm_time_to_first_token_seconds_bucket[5m]))

# TTFT P95
histogram_quantile(0.95, rate(vllm_time_to_first_token_seconds_bucket[5m]))

# E2E Latency P99
histogram_quantile(0.99, rate(vllm_e2e_request_latency_seconds_bucket[5m]))

# Average batch size (concurrently running requests)
avg(vllm_num_requests_running)
```
Gateway Metrics
```promql
# 5xx error rate (%)
sum(rate(envoy_http_downstream_rq_xx{envoy_response_code_class="5"}[5m]))
/
sum(rate(envoy_http_downstream_rq_total[5m])) * 100

# Upstream connect failure rate per cluster
sum(rate(envoy_cluster_upstream_cx_connect_fail[5m])) by (envoy_cluster_name)
```
Cost Metrics
```promql
# Daily total cost
sum(increase(llm_cost_dollars_total[24h]))

# Per-tenant daily cost
sum(increase(llm_cost_dollars_total[24h])) by (tenant_id)

# Per-model cost ratio
sum(increase(llm_cost_dollars_total[24h])) by (model)
  / ignoring(model) group_left
sum(increase(llm_cost_dollars_total[24h]))

# Budget utilization (monthly)
sum(increase(llm_cost_dollars_total[30d])) by (tenant_id)
  / on(tenant_id) group_left
tenant_monthly_budget_usd
```
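The per-model cost ratio query divides each model's spend by total spend; the same arithmetic is easy to verify locally. A sketch with made-up tier names and spend figures:

```python
# Made-up daily spend per model tier, mirroring the per-model cost ratio query.
daily_cost_by_model = {"frontier": 62.0, "balanced": 30.0, "cheap": 8.0}

total = sum(daily_cost_by_model.values())
share = {model: cost / total for model, cost in daily_cost_by_model.items()}

# Shares of total spend per tier, formatted as percentages.
print({model: f"{s:.0%}" for model, s in share.items()})
```

This is the typical shape of the result: a small number of frontier-tier requests dominating the bill, which is exactly what the cascade-routing optimization in section 7 targets.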
6. Alerting Strategy
Alert Threshold Design
| Alert | Condition | Severity | Duration |
|---|---|---|---|
| Agent High Latency | P99 latency > 10s | Warning | 5 min |
| Agent High Error Rate | Error rate > 5% | Critical | 5 min |
| LLM Rate Limit | Rate limit errors > 10/5min | Warning | 2 min |
| Daily Cost Budget | Daily cost > $100 | Warning | Immediate |
| GPU High Temperature | GPU temp > 85C | Warning | 5 min |
| GPU Memory Full | GPU memory > 95% | Critical | 3 min |
| vLLM High Latency | P99 E2E latency > 30s | Warning | 5 min |
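The Duration column above corresponds to a Prometheus `for:` clause: the condition must hold continuously before the alert fires, which suppresses one-off spikes. A minimal sketch of that semantics (the class and its parameters are illustrative, not a real alerting API):

```python
class DurationAlert:
    """Fires only after the condition has held for `required` consecutive checks,
    mimicking a Prometheus `for:` clause (e.g. 5 checks at 1-min intervals ~ 5 min)."""

    def __init__(self, threshold: float, required: int):
        self.threshold = threshold
        self.required = required
        self.streak = 0

    def observe(self, value: float) -> bool:
        # Any reading at or below threshold resets the streak.
        self.streak = self.streak + 1 if value > self.threshold else 0
        return self.streak >= self.required

# Agent High Latency: P99 > 10s sustained for 5 consecutive evaluations.
alert = DurationAlert(threshold=10.0, required=5)
readings = [12.0, 11.5, 9.0, 12.1, 12.2, 12.3, 12.4, 12.5]
fired = [alert.observe(v) for v in readings]
print(fired)  # the dip to 9.0 resets the streak; only the last reading fires
```

The trade-off is detection delay: a 5-minute duration means a genuine outage is noticed 5 minutes late, which is why the Daily Cost Budget alert above fires immediately instead.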
Alert Hierarchy
- Infrastructure layer: GPU temperature, memory, power anomalies
- Model server layer: vLLM latency increase, KV cache shortage
- Application layer: Agent error rate, Rate limit
- Business layer: Cost overrun, SLA violations
- Cross-layer metric correlation: Analyze correlations — LLM request increase -> GPU utilization rise -> infrastructure load increase
- Anomaly detection: When P99 latency suddenly increases, simultaneously check GPU temperature and memory usage
- Capacity planning: Consider provisioning additional GPU nodes when average GPU utilization exceeds 70%
- Latency optimization: Prefer models with lower TTFT to improve user experience and increase throughput
7. Cost Tracking
Cost Tracking Concepts
Track LLM usage costs by the following criteria:
- Per-model: Total cost and request count per model, identifying the most expensive models
- Per-tenant: Per-tenant/team daily token usage and budget utilization
- Per-time: Peak time analysis, cost trends
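All three breakdowns are group-by aggregations over the same usage records. A stdlib sketch with hypothetical tenants and records, keeping costs in integer cents to avoid float drift:

```python
from collections import defaultdict

# Hypothetical usage records as a cost-tracking pipeline might receive them.
records = [
    {"tenant": "team-a", "model": "balanced", "cost_cents": 6},
    {"tenant": "team-a", "model": "cheap", "cost_cents": 1},
    {"tenant": "team-b", "model": "balanced", "cost_cents": 12},
]

cost_by_tenant: dict[str, int] = defaultdict(int)
cost_by_model: dict[str, int] = defaultdict(int)
for rec in records:
    cost_by_tenant[rec["tenant"]] += rec["cost_cents"]
    cost_by_model[rec["model"]] += rec["cost_cents"]

print(dict(cost_by_tenant))  # {'team-a': 7, 'team-b': 12}
print(dict(cost_by_model))   # {'balanced': 18, 'cheap': 1}
```

In practice Prometheus does this aggregation via labels (the `by (tenant_id)` and `by (model)` queries in section 5); the sketch just makes the grouping explicit.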
Per-Model Cost Reference (2026-04 baseline)
| Tier | Model | Input ($/1M tok) | Output ($/1M tok) | Features |
|---|---|---|---|---|
| Frontier | Claude Opus 4.7 | $15 | $75 | Highest quality reasoning |
| Frontier | GPT-4.1 / o3 | $10 | $30 | Complex reasoning |
| Frontier | Gemini 2.5 Pro | $1.25 | $5 | Enhanced multimodal |
| Balanced | Claude Sonnet 4.6 | $3 | $15 | Quality-cost balance |
| Balanced | GPT-4.1 mini | $0.40 | $1.60 | Fast inference |
| Balanced | Gemini 2.5 Flash | $0.10 | $0.40 | High throughput |
| Fast/Cheap | Claude Haiku 4.5 | $0.80 | $4 | Simple tasks |
| Fast/Cheap | GPT-4.1 nano / o4-mini | $0.15 | $0.60 | Ultra-low cost |
| Fast/Cheap | Gemini 2.5 Flash-Lite | $0.05 | $0.20 | Minimal latency |
| Open-weight | DeepSeek V3.1 | Self-hosted | Self-hosted | Open license |
| Open-weight | Llama 4 Scout | Self-hosted | Self-hosted | Meta official |
| Open-weight | Qwen3-72B | Self-hosted | Self-hosted | Alibaba Cloud |
- Model selection optimization: Use cheaper models (GPT-4.1 nano, Haiku 4.5, Gemini 2.5 Flash-Lite) for simple tasks
- Prompt optimization: Reduce input tokens by removing unnecessary context
- Caching: Cache responses for repetitive queries (Prompt Caching, Semantic Caching)
- Cascade Routing: Try low-cost model first, fallback to high-performance model on failure — 66% cost savings possible
- Open-weight models: Convert to fixed costs with DeepSeek V3.1, Llama 4, Qwen3 when self-hosting
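The cascade routing idea can be sketched in a few lines: try the cheap tier first, and escalate only when it declines or fails. Both model functions below are stand-in stubs, not a real provider API:

```python
from typing import Callable, Optional

def cascade(prompt: str,
            cheap: Callable[[str], Optional[str]],
            strong: Callable[[str], str]) -> tuple[str, str]:
    """Try the cheap model first; escalate to the strong model when the cheap
    tier declines (returns None) or raises. Returns (answer, tier used)."""
    try:
        answer = cheap(prompt)
        if answer is not None:
            return answer, "cheap"
    except Exception:
        pass  # treat a cheap-tier failure as a miss and escalate
    return strong(prompt), "strong"

def cheap_model(prompt: str) -> Optional[str]:
    # Stub: pretend the cheap tier only handles short prompts confidently.
    return f"cheap:{prompt}" if len(prompt) < 20 else None

def strong_model(prompt: str) -> str:
    return f"strong:{prompt}"

print(cascade("short question", cheap_model, strong_model))
print(cascade("a much longer, more complex question", cheap_model, strong_model))
```

The savings depend entirely on the escalation criterion: a real implementation would gate on a confidence score, an answerability check, or task classification rather than prompt length.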
8. Operations Checklist
Daily Checks
| Check Item | How to Check | Normal Status |
|---|---|---|
| GPU Status | `kubectl get nodes -l nvidia.com/gpu.present=true` | All nodes Ready |
| Model Pods | `kubectl get pods -n inference` | Running state |
| Error Rate | Grafana dashboard | < 1% |
| Response Time | P99 latency | < 5 seconds |
| GPU Utilization | DCGM metrics | 40-80% |
| Memory Usage | GPU memory | < 90% |
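The pod check above is easy to automate by parsing `kubectl get pods` output. A hedged sketch assuming the default column layout (NAME READY STATUS RESTARTS AGE); the sample output and pod names are made up:

```python
def unhealthy_pods(kubectl_output: str) -> list[str]:
    """Return names of pods whose STATUS is not Running or Completed,
    given plain `kubectl get pods` output in the default column layout."""
    bad = []
    for line in kubectl_output.strip().splitlines()[1:]:  # skip header row
        cols = line.split()
        if len(cols) >= 3 and cols[2] not in ("Running", "Completed"):
            bad.append(cols[0])
    return bad

sample = """\
NAME      READY   STATUS             RESTARTS   AGE
vllm-0    1/1     Running            0          3d
vllm-1    0/1     CrashLoopBackOff   12         3d
embed-0   1/1     Running            0          3d
"""
print(unhealthy_pods(sample))  # ['vllm-1']
```

For anything beyond a quick check, prefer `kubectl get pods -o json` and parse the structured output instead of column positions, which can shift with flags like `-o wide`.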
Weekly Checks
| Check Item | How to Check | Action |
|---|---|---|
| Cost Analysis | Kubecost report | Identify anomalous costs |
| Capacity Planning | Resource trends | Plan scaling |
| Security Patches | Image scan | Patch vulnerabilities |
| Backup Validation | Recovery test | Verify backup policy |
9. Monitoring Maturity Model
10. Next Steps
- Monitoring Stack Setup Guide - AMP/AMG deployment, Langfuse Helm installation, ServiceMonitor, Grafana dashboard production setup
- LLMOps Observability Comparison Guide - In-depth comparison of Langfuse vs LangSmith vs Helicone
- Agentic AI Platform Architecture - Overall platform design
- RAG Evaluation Framework - Quality evaluation with Ragas
References
- Langfuse Documentation
- LangSmith Documentation
- CloudWatch Generative AI Observability
- OpenTelemetry Documentation
- Prometheus Monitoring