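AI Agent Monitoring and Operations
This document covers how to use LangFuse and LangSmith to track and monitor the performance and behavior of Agentic AI applications. It is a complete operations guide for practical use, from deployment on Kubernetes through Grafana dashboard configuration and alerting to troubleshooting.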
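Overview
Agentic AI applications run complex reasoning chains and call many different tools, so traditional APM (Application Performance Monitoring) tools alone rarely provide enough visibility. LangFuse and LangSmith, observability tools built specifically for LLMs, provide the following core capabilities:
- Trace tracking: follow the full flow of LLM calls, tool executions, and agent reasoning steps
- Token usage analysis: input/output token counts and cost calculation
- Quality evaluation: score response quality and collect feedback
- Debugging: diagnose problems by reviewing prompt and response content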
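As an illustration of the token usage analysis item above, the following is a minimal sketch of how a per-call cost could be derived from token counts. The `TokenUsage` type, the `estimate_cost_usd` function, and the prices are hypothetical and exist only for this example; they are not part of the LangFuse or LangSmith APIs, and real per-token prices depend on the model provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenUsage:
    """Token counts reported for a single LLM call."""
    input_tokens: int
    output_tokens: int


def estimate_cost_usd(usage: TokenUsage,
                      input_price_per_1k: float,
                      output_price_per_1k: float) -> float:
    """Estimate the cost of one call from token counts and per-1K-token prices."""
    input_cost = usage.input_tokens / 1000 * input_price_per_1k
    output_cost = usage.output_tokens / 1000 * output_price_per_1k
    return input_cost + output_cost


# Hypothetical prices, chosen only to make the arithmetic easy to follow.
usage = TokenUsage(input_tokens=1200, output_tokens=300)
cost = estimate_cost_usd(usage, input_price_per_1k=0.5, output_price_per_1k=1.5)
print(f"estimated cost: {cost:.4f} USD")
```

Input and output tokens are priced separately because most providers charge different rates for each, which is why observability tools record both counts per call.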
This document is intended for platform operators, MLOps engineers, and AI developers. A basic understanding of Kubernetes and Python is required.
LangFuse vs. LangSmith Comparison
| Feature | LangFuse | LangSmith |
|---|---|---|
| License | Open source (MIT) | Commercial (free tier) |
| Deployment | Self-hosted / Cloud | Cloud (self-hosting on the Enterprise plan only) |
| Data Sovereignty | Full control | LangChain servers |
| Integration | Multiple frameworks | LangChain optimized |
| Cost | Infrastructure only | Usage-based pricing |
| Scalability | Kubernetes native | Managed |
- LangFuse: when data sovereignty matters or cost optimization is required
- LangSmith: when development is primarily LangChain-based and a quick start is needed
Deploying LangFuse on Kubernetes
Architecture Overview
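A LangFuse v2.75.0 or later deployment, as set up in this guide, consists of the following components, each configured in the sections below: a PostgreSQL database for metadata, the LangFuse application Deployment and Service, an ALB Ingress for external access, an HPA for autoscaling, and optional S3 storage.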
PostgreSQL Deployment
Deploy PostgreSQL to store LangFuse metadata.
# langfuse-postgres.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: observability
  labels:
    app.kubernetes.io/part-of: langfuse
---
apiVersion: v1
kind: Secret
metadata:
  name: langfuse-postgres-secret
  namespace: observability
type: Opaque
stringData:
  POSTGRES_USER: langfuse
  POSTGRES_PASSWORD: "your-secure-password-here"  # Use a secrets manager in production
  POSTGRES_DB: langfuse
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: langfuse-postgres-pvc
  namespace: observability
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3
  resources:
    requests:
      storage: 100Gi
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: langfuse-postgres
  namespace: observability
spec:
  serviceName: langfuse-postgres
  replicas: 1
  selector:
    matchLabels:
      app: langfuse-postgres
  template:
    metadata:
      labels:
        app: langfuse-postgres
    spec:
      containers:
        - name: postgres
          image: postgres:15-alpine
          ports:
            - containerPort: 5432
          envFrom:
            - secretRef:
                name: langfuse-postgres-secret
          env:
            # Keep the data directory in a subdirectory: a freshly formatted
            # volume contains lost+found, which makes initdb refuse the mount root
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
          volumeMounts:
            - name: postgres-data
              mountPath: /var/lib/postgresql/data
          resources:
            requests:
              memory: "1Gi"
              cpu: "500m"
            limits:
              memory: "2Gi"
              cpu: "1000m"
          livenessProbe:
            exec:
              command:
                - pg_isready
                - -U
                - langfuse
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            exec:
              command:
                - pg_isready
                - -U
                - langfuse
            initialDelaySeconds: 5
            periodSeconds: 5
      volumes:
        - name: postgres-data
          persistentVolumeClaim:
            claimName: langfuse-postgres-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: langfuse-postgres
  namespace: observability
spec:
  selector:
    app: langfuse-postgres
  ports:
    - port: 5432
      targetPort: 5432
  clusterIP: None
LangFuse Deployment
Deploy the LangFuse application.
# langfuse-deployment.yaml
apiVersion: v1
kind: Secret
metadata:
  name: langfuse-secret
  namespace: observability
type: Opaque
stringData:
  # Required environment variables
  DATABASE_URL: "postgresql://langfuse:your-secure-password-here@langfuse-postgres:5432/langfuse"
  NEXTAUTH_SECRET: "your-nextauth-secret-32-chars-min"  # openssl rand -base64 32
  SALT: "your-salt-value-here"  # openssl rand -base64 32
  ENCRYPTION_KEY: "0000000000000000000000000000000000000000000000000000000000000000"  # 64 hex chars (openssl rand -hex 32)
  # Optional environment variables
  NEXTAUTH_URL: "https://langfuse.your-domain.com"
  LANGFUSE_ENABLE_EXPERIMENTAL_FEATURES: "true"
  # S3 settings (optional)
  S3_ENDPOINT: "https://s3.ap-northeast-2.amazonaws.com"
  S3_ACCESS_KEY_ID: "your-access-key"
  S3_SECRET_ACCESS_KEY: "your-secret-key"
  S3_BUCKET_NAME: "langfuse-traces"
  S3_REGION: "ap-northeast-2"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: langfuse
  namespace: observability
  labels:
    app: langfuse
spec:
  replicas: 2
  selector:
    matchLabels:
      app: langfuse
  template:
    metadata:
      labels:
        app: langfuse
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "3000"
        prometheus.io/path: "/api/public/metrics"
    spec:
      containers:
        - name: langfuse
          image: langfuse/langfuse:2.75.0
          ports:
            - containerPort: 3000
              name: http
          envFrom:
            - secretRef:
                name: langfuse-secret
          env:
            - name: NODE_ENV
              value: "production"
            - name: PORT
              value: "3000"
            - name: HOSTNAME
              value: "0.0.0.0"
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "1Gi"
              cpu: "500m"
          livenessProbe:
            httpGet:
              path: /api/public/health
              port: 3000
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
          readinessProbe:
            httpGet:
              path: /api/public/health
              port: 3000
            initialDelaySeconds: 10
            periodSeconds: 5
            timeoutSeconds: 3
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: langfuse
                topologyKey: kubernetes.io/hostname
---
apiVersion: v1
kind: Service
metadata:
  name: langfuse
  namespace: observability
spec:
  selector:
    app: langfuse
  ports:
    - port: 80
      targetPort: 3000
      name: http
  type: ClusterIP
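After applying the manifests above, it can help to confirm that the application answers on the same `/api/public/health` path the probes use. The sketch below polls a URL until it returns HTTP 200. The function names `http_status` and `wait_until_healthy` are hypothetical helpers written for this example, and the sketch assumes the Service has been made reachable locally, for instance through a port-forward.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code of a GET request, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 503.
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return None


def wait_until_healthy(url, attempts=30, delay_seconds=2.0,
                       get_status=http_status, sleep=time.sleep):
    """Poll url until it returns 200.

    Returns the number of the successful attempt, or None if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        if get_status(url) == 200:
            return attempt
        if attempt < attempts:
            sleep(delay_seconds)
    return None
```

For example, with `kubectl port-forward svc/langfuse 8080:80 -n observability` running in another terminal, calling `wait_until_healthy("http://localhost:8080/api/public/health")` returns the attempt number on which the endpoint first responded with 200. The `get_status` and `sleep` parameters are injectable so the polling logic can be exercised without a network.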
Ingress Configuration
Configure an Ingress for external access.
# langfuse-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: langfuse-ingress
  namespace: observability
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:ap-northeast-2:XXXXXXXXXXXX:certificate/xxx
    # The HTTP 80 listener is needed for ssl-redirect to have requests to redirect
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP":80},{"HTTPS":443}]'
    alb.ingress.kubernetes.io/ssl-redirect: "443"
    alb.ingress.kubernetes.io/healthcheck-path: /api/public/health
    alb.ingress.kubernetes.io/healthcheck-interval-seconds: "15"
    alb.ingress.kubernetes.io/healthcheck-timeout-seconds: "5"
    alb.ingress.kubernetes.io/healthy-threshold-count: "2"
    alb.ingress.kubernetes.io/unhealthy-threshold-count: "2"
spec:
  ingressClassName: alb
  rules:
    - host: langfuse.your-domain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: langfuse
                port:
                  number: 80
HPA Configuration
Configure automatic scaling based on traffic.
# langfuse-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: langfuse-hpa
  namespace: observability
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: langfuse
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
        - type: Pods
          value: 4
          periodSeconds: 15
      selectPolicy: Max
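To make the scaling behavior of the HPA above easier to reason about, here is a simplified sketch of the replica calculation. It uses the documented Kubernetes formula `desired = ceil(current * currentMetric / targetMetric)` with the default 10% tolerance, and models the two scale-up policies combined by `selectPolicy: Max`. The function names are hypothetical, and the sketch leaves out details of the real controller such as stabilization windows, pod readiness, and missing metrics.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas=2, max_replicas=10, tolerance=0.1):
    """Approximate the HPA replica count for one metric."""
    ratio = current_utilization / target_utilization
    # Within the tolerance band the controller leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


def scale_up_limit(current_replicas, percent=100, pods=4):
    """Upper bound on replicas after one 15-second period under selectPolicy: Max."""
    by_percent = math.ceil(current_replicas * (1 + percent / 100))
    by_pods = current_replicas + pods
    return max(by_percent, by_pods)


# With 2 replicas at 140% CPU against the 70% target, the formula asks for 4.
print(desired_replicas(2, current_utilization=140, target_utilization=70))
# From 2 replicas, the Pods policy (+4) allows more than the Percent policy (+100%).
print(scale_up_limit(2))
```

When several metrics are configured, as with CPU and memory above, the controller evaluates each metric separately and uses the largest resulting replica count.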
- Always set NEXTAUTH_SECRET, SALT, and ENCRYPTION_KEY to secure random values
- In production, manage secrets with AWS Secrets Manager or HashiCorp Vault
- Amazon RDS for PostgreSQL is strongly recommended for the database (high availability, automated backups, managed updates)
- Use the StatefulSet PostgreSQL only in development and test environments
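The random values called for above can also be generated with Python's standard `secrets` module instead of `openssl`. The sketch below mirrors the formats noted in the manifest comments: base64-encoded 32 random bytes for NEXTAUTH_SECRET and SALT, and 64 hex characters for ENCRYPTION_KEY. The function name `generate_langfuse_secrets` is a hypothetical helper for this example.

```python
import base64
import secrets


def generate_langfuse_secrets():
    """Generate fresh random values for the three secret settings."""
    return {
        # Equivalent in shape to: openssl rand -base64 32
        "NEXTAUTH_SECRET": base64.b64encode(secrets.token_bytes(32)).decode("ascii"),
        "SALT": base64.b64encode(secrets.token_bytes(32)).decode("ascii"),
        # 32 random bytes rendered as 64 hexadecimal characters
        "ENCRYPTION_KEY": secrets.token_hex(32),
    }


generated = generate_langfuse_secrets()
# Print only the names and lengths so the values do not end up in logs.
for name, value in generated.items():
    print(f"{name}: {len(value)} characters")
```

The values should be written straight into the secret store rather than into a manifest that is committed to version control.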
AWS Secrets Manager Integration (Recommended)
In production, use AWS Secrets Manager with the External Secrets Operator instead of plain Kubernetes Secrets:
# Install the External Secrets Operator
helm repo add external-secrets https://charts.external-secrets.io
helm install external-secrets external-secrets/external-secrets -n external-secrets-system --create-namespace
# SecretStore configuration
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: aws-secrets-manager
  namespace: observability
spec:
  provider:
    aws:
      service: SecretsManager
      region: ap-northeast-2
      auth:
        jwt:
          serviceAccountRef:
            # Assumes a ServiceAccount named langfuse whose IAM role can read
            # these secrets; it is not created by the manifests in this guide
            name: langfuse
---
# ExternalSecret configuration
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: langfuse-secret
  namespace: observability
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: SecretStore
  target:
    name: langfuse-secret
    creationPolicy: Owner
  data:
    - secretKey: DATABASE_URL
      remoteRef:
        key: langfuse/database-url
    - secretKey: NEXTAUTH_SECRET
      remoteRef:
        key: langfuse/nextauth-secret
    - secretKey: SALT
      remoteRef:
        key: langfuse/salt
    - secretKey: ENCRYPTION_KEY
      remoteRef:
        key: langfuse/encryption-key
Using Amazon RDS for PostgreSQL (Recommended)
In production, use Amazon RDS instead of the StatefulSet PostgreSQL:
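# RDS PostgreSQL connection settings
# Note: the LangFuse Deployment reads DATABASE_URL from langfuse-secret,
# so the same value must be set there for the application to use RDS
apiVersion: v1
kind: Secret
metadata:
  name: langfuse-postgres-secret
  namespace: observability
type: Opaque
stringData:
  DATABASE_URL: "postgresql://langfuse:password@langfuse-db.xxxxxxxxxxxx.ap-northeast-2.rds.amazonaws.com:5432/langfuse?sslmode=require"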
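A connection string like the DATABASE_URL above breaks if the password contains characters such as `@`, `:`, or `/`, because they are also URL delimiters. The following sketch assembles the URL with the credentials percent-encoded. The `build_database_url` function and the host name used in the example are hypothetical.

```python
from urllib.parse import quote, unquote, urlsplit


def build_database_url(user, password, host, database, port=5432, sslmode="require"):
    """Build a PostgreSQL connection URL with percent-encoded credentials."""
    # safe="" forces encoding of every reserved character, including "/" and ":".
    userinfo = f"{quote(user, safe='')}:{quote(password, safe='')}"
    return (
        f"postgresql://{userinfo}@{host}:{port}/"
        f"{quote(database, safe='')}?sslmode={sslmode}"
    )


# Hypothetical host and password, for illustration only.
url = build_database_url("langfuse", "p@ss:w/rd", "db.example.internal", "langfuse")
parts = urlsplit(url)
# The encoded password round-trips back to the original value.
print(parts.hostname, parts.port, unquote(parts.password) == "p@ss:w/rd")
```

Generating a password from URL-safe characters only avoids the problem entirely, but encoding is the safer default when the password comes from an external generator.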
Advantages of RDS:
- Automated backups and point-in-time recovery
- Multi-AZ high availability
- Automated patching and updates
- Performance Insights and monitoring
- Read replica support
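LangSmith Integration
LangSmith is the managed observability platform provided by LangChain. Self-hosting is not available outside the Enterprise plan, but integration with LangChain-based applications is very simple.
Environment Setup
Set the environment variables required to use LangSmith.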
# langsmith-config.yaml
apiVersion: v1
kind: Secret
metadata:
  name: langsmith-config
  namespace: ai-agents
type: Opaque
stringData:
  LANGCHAIN_TRACING_V2: "true"
  LANGCHAIN_ENDPOINT: "https://api.smith.langchain.com"
  LANGCHAIN_API_KEY: "ls__your-api-key-here"
  LANGCHAIN_PROJECT: "agentic-ai-production"
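Because tracing is driven entirely by environment variables, a misnamed or empty variable tends to fail silently. The sketch below is a small startup check that an agent process could run to confirm the variables from the Secret above were actually injected. The `missing_langsmith_vars` function is a hypothetical helper, and the list of expected names is taken from this guide's Secret rather than from a definition of what LangSmith strictly requires.

```python
import os

# The variable names set by the langsmith-config Secret in this guide.
EXPECTED_VARS = (
    "LANGCHAIN_TRACING_V2",
    "LANGCHAIN_ENDPOINT",
    "LANGCHAIN_API_KEY",
    "LANGCHAIN_PROJECT",
)


def missing_langsmith_vars(env=None):
    """Return the expected variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in EXPECTED_VARS if not env.get(name, "").strip()]


missing = missing_langsmith_vars()
if missing:
    print("LangSmith tracing is not fully configured; missing:", ", ".join(missing))
else:
    print("All expected LangSmith variables are set")
```

Passing a dictionary as `env` makes the check easy to unit test without touching the real process environment.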