
Ragas RAG Evaluation Framework

Ragas (RAG Assessment) is an open-source framework for objectively evaluating the quality of RAG (Retrieval-Augmented Generation) pipelines. It is essential for measuring and continuously improving RAG system performance in Agentic AI platforms.

1. Overview

Why RAG Evaluation Is Needed

RAG systems consist of multiple components (retrieval, generation, context processing), which makes overall quality hard to capture with a single number. Each stage needs its own metrics so that failures can be traced back to either the retriever or the generator.

Ragas vs AWS Bedrock RAG Evaluation

AWS Bedrock RAG Evaluation GA

AWS Bedrock RAG Evaluation became GA in March 2025. With Bedrock native integration, RAG evaluation can be performed without additional setup.

| Comparison Item | Ragas (Open Source) | AWS Bedrock RAG Evaluation |
|---|---|---|
| Deployment Method | Self-hosted | AWS Managed |
| Evaluation LLM | External API (OpenAI, etc.) | Bedrock models |
| Metrics | 5 core metrics | 4 core metrics |
| Customization | High (Python code) | Medium (API parameters) |
| Cost | LLM API cost | Bedrock call cost |
| Integration | Manual integration required | Bedrock native |
| Best For | Fine-grained control needed | Fast production deployment |

AWS Bedrock RAG Evaluation Metrics:

  • Context Relevance: Whether retrieved context is relevant to the question
  • Coverage: Whether the answer covers all aspects of the question
  • Correctness: Whether the answer is accurate (compared to ground truth)
  • Faithfulness: Whether the answer is faithful to the context

Ragas Core Metrics

| Metric | Category | Description |
|---|---|---|
| Faithfulness | Generation Quality | Whether the answer is faithful to the context |
| Answer Relevancy | Generation Quality | Whether the answer is relevant to the question |
| Context Precision | Retrieval Quality | Precision of the retrieved context |
| Context Recall | Retrieval Quality | Whether the required information is retrieved |
| Answer Correctness | Overall Quality | Accuracy of the answer |

Ragas 0.2+ API Changes

In Ragas 0.2+, the context_relevancy metric has been removed. Use a combination of context_precision and context_recall for context quality evaluation.

2. Installation and Basic Setup

Python Environment Setup

# Install Ragas (0.2+ recommended)
pip install "ragas>=0.2" langchain-openai datasets

# Additional dependencies
pip install pandas numpy

Basic Evaluation Code

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Prepare evaluation dataset
eval_data = {
    "question": [
        "How is GPU scheduling done in Kubernetes?",
        "What are Karpenter's key features?",
    ],
    "answer": [
        "GPU scheduling in Kubernetes is performed through the NVIDIA Device Plugin...",
        "Karpenter provides automatic node provisioning, consolidation, and drift detection...",
    ],
    "contexts": [
        ["GPU scheduling is done through Device Plugin...", "NVIDIA GPU Operator..."],
        ["Karpenter is a Kubernetes node auto-scaler...", "Through NodePool CRD..."],
    ],
    "ground_truth": [
        "GPU resources are scheduled using NVIDIA Device Plugin and GPU Operator.",
        "Karpenter provides automatic node provisioning, consolidation, drift detection, and disruption handling.",
    ],
}

dataset = Dataset.from_dict(eval_data)

# Run evaluation (with error handling)
try:
    results = evaluate(
        dataset,
        metrics=[
            faithfulness,
            answer_relevancy,
            context_precision,
            context_recall,
        ],
    )
    print(results)
except Exception as e:
    print(f"Error during evaluation: {e}")
    # Logging or retry logic

3. Core Metric Details

1. Faithfulness

Measures how faithful the answer is to the provided context. A key metric for detecting hallucination.

from ragas.metrics import faithfulness

# Faithfulness calculation process:
# 1. Decompose answer into individual claims
# 2. Verify each claim is inferable from context
# 3. Verified claims / Total claims = Faithfulness score

# Score interpretation:
# 1.0: All claims supported by context
# 0.5: Only half of claims supported by context
# 0.0: No claims supported by context (severe hallucination)
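
The scoring itself is simple arithmetic once the claims are labeled. A toy illustration of that last step (not Ragas' internal code; the claims and labels below are made up):

# Toy illustration of the faithfulness formula (not Ragas' internal code):
# the evaluation LLM decomposes the answer into claims and labels each one
# as supported or unsupported by the retrieved context.
claims = [
    {"text": "GPU scheduling uses the NVIDIA Device Plugin", "supported": True},
    {"text": "The Device Plugin is installed by the GPU Operator", "supported": True},
    {"text": "GPUs can be time-sliced without any configuration", "supported": False},
]

faithfulness_score = sum(c["supported"] for c in claims) / len(claims)
print(f"Faithfulness: {faithfulness_score:.2f}")  # 0.67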

2. Answer Relevancy

Measures how relevant the answer is to the question.

from ragas.metrics import answer_relevancy

# Answer Relevancy calculation process:
# 1. Generate questions from the answer in reverse
# 2. Calculate similarity between generated and original questions
# 3. Repeat multiple times and calculate average

# Score interpretation:
# High score: Answer directly relates to question
# Low score: Answer contains content unrelated to question
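
A minimal sketch of the similarity-averaging step, using made-up embedding vectors (in Ragas the embeddings come from the configured embedding model and the questions are regenerated by the evaluation LLM):

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings of the original question and of questions
# regenerated from the answer by the evaluation LLM.
original_question_emb = np.array([0.9, 0.1, 0.2])
regenerated_question_embs = [
    np.array([0.85, 0.15, 0.25]),
    np.array([0.80, 0.20, 0.30]),
    np.array([0.70, 0.40, 0.10]),
]

# Answer relevancy is the mean similarity between the original question
# and each regenerated question.
answer_relevancy_score = np.mean(
    [cosine_similarity(original_question_emb, e) for e in regenerated_question_embs]
)
print(f"Answer relevancy: {answer_relevancy_score:.2f}")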

3. Context Precision

Measures the proportion of actually useful information among retrieved contexts.

from ragas.metrics import context_precision

# Context Precision calculation:
# - Identify context needed to generate ground truth answer
# - Check if useful information exists in top-ranked context
# - Higher score when relevant context is in higher ranks
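
The rank weighting can be shown with a small example. This is a sketch of the standard precision@k aggregation; the relevance labels are made up here, while in Ragas they come from the evaluation LLM:

# Toy illustration of rank-weighted context precision (not Ragas' internal code).
# relevance[k] = 1 if the chunk at rank k+1 was useful for the ground truth answer.
relevance = [1, 0, 1, 0]  # 4 retrieved chunks, ranks 1..4

precisions_at_relevant_ranks = []
hits = 0
for k, rel in enumerate(relevance, start=1):
    hits += rel
    if rel:
        precisions_at_relevant_ranks.append(hits / k)  # precision@k

context_precision = sum(precisions_at_relevant_ranks) / max(sum(relevance), 1)
print(f"Context precision: {context_precision:.2f}")  # (1/1 + 2/3) / 2 ≈ 0.83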

4. Context Recall

Measures whether the information needed to generate the correct answer is included in the retrieved context.

from ragas.metrics import context_recall

# Context Recall calculation:
# 1. Decompose ground truth into individual sentences
# 2. Check if each sentence is inferable from retrieved context
# 3. Inferable sentences / Total sentences = Recall score

4. Comprehensive Evaluation Pipeline

Full RAG System Evaluation

import os
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    answer_correctness,
)
from datasets import Dataset
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# LLM configuration (for evaluation)
os.environ["OPENAI_API_KEY"] = "your-api-key"

def evaluate_rag_pipeline(questions, rag_chain, ground_truths):
    """Comprehensive RAG pipeline evaluation"""

    answers = []
    contexts = []

    for question in questions:
        # Execute RAG chain
        result = rag_chain.invoke({"query": question})
        answers.append(result["result"])
        contexts.append([doc.page_content for doc in result["source_documents"]])

    # Construct evaluation dataset
    eval_dataset = Dataset.from_dict({
        "question": questions,
        "answer": answers,
        "contexts": contexts,
        "ground_truth": ground_truths,
    })

    # Evaluate with all metrics
    results = evaluate(
        eval_dataset,
        metrics=[
            faithfulness,
            answer_relevancy,
            context_precision,
            context_recall,
            answer_correctness,
        ],
    )

    return results

# Usage example
questions = [
    "How to configure Karpenter on EKS?",
    "How to configure GPU node auto-scaling?",
    "How to set up Inference Gateway dynamic routing?",
]

ground_truths = [
    "Karpenter is installed via Helm chart and configured by defining NodePool CRD.",
    "Configure GPU usage-based scaling by integrating DCGM Exporter metrics with KEDA.",
    "Use Gateway API's HTTPRoute to configure weight-based traffic distribution.",
]

# Run evaluation
results = evaluate_rag_pipeline(questions, rag_chain, ground_truths)
print(results.to_pandas())

Evaluation Result Analysis

import pandas as pd
import matplotlib.pyplot as plt

def analyze_evaluation_results(results):
    """Analyze and visualize evaluation results"""

    df = results.to_pandas()

    # Average score per metric
    metrics_summary = df.mean(numeric_only=True)
    print("=== Average Score per Metric ===")
    print(metrics_summary)

    # Identify problem areas
    print("\n=== Areas Needing Improvement ===")
    for metric, score in metrics_summary.items():
        if score < 0.7:
            print(f"Warning {metric}: {score:.2f} - Needs improvement")
        elif score < 0.85:
            print(f"Info {metric}: {score:.2f} - Good")
        else:
            print(f"Success {metric}: {score:.2f} - Excellent")

    # Visualization
    fig, ax = plt.subplots(figsize=(10, 6))
    metrics_summary.plot(kind='bar', ax=ax, color=['#4285f4', '#34a853', '#fbbc04', '#ea4335', '#9c27b0', '#00bcd4'])
    ax.set_ylabel('Score')
    ax.set_title('RAG Pipeline Evaluation Results')
    ax.set_ylim(0, 1)
    ax.axhline(y=0.7, color='r', linestyle='--', label='Minimum Threshold')
    ax.legend()
    plt.tight_layout()
    plt.savefig('rag_evaluation_results.png')

    return metrics_summary

# Run analysis
summary = analyze_evaluation_results(results)

5. CI/CD Pipeline Integration

GitHub Actions Workflow

# .github/workflows/rag-evaluation.yml
name: RAG Pipeline Evaluation

on:
  push:
    paths:
      - 'src/rag/**'
      - 'data/knowledge_base/**'
  pull_request:
    paths:
      - 'src/rag/**'
  schedule:
    - cron: '0 0 * * *'  # Daily at midnight

jobs:
  evaluate:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install ragas langchain-openai datasets pandas

      - name: Run RAG Evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          python scripts/evaluate_rag.py --output results/evaluation.json

      - name: Check Quality Gates
        run: |
          python scripts/check_quality_gates.py results/evaluation.json

      - name: Upload Results
        uses: actions/upload-artifact@v4
        with:
          name: evaluation-results
          path: results/

      - name: Comment PR with Results
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const results = JSON.parse(fs.readFileSync('results/evaluation.json'));

            let comment = '## RAG Evaluation Results\n\n';
            comment += '| Metric | Score | Status |\n';
            comment += '|--------|-------|--------|\n';

            for (const [metric, score] of Object.entries(results.metrics)) {
              const status = score >= 0.7 ? 'Pass' : 'Warning';
              comment += `| ${metric} | ${score.toFixed(2)} | ${status} |\n`;
            }

            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: comment
            });

Quality Gate Script

# scripts/check_quality_gates.py
import json
import sys

QUALITY_GATES = {
    "faithfulness": 0.8,
    "answer_relevancy": 0.75,
    "context_precision": 0.7,
    "context_recall": 0.7,
}

def check_quality_gates(results_file):
    with open(results_file) as f:
        results = json.load(f)

    failed_gates = []

    for metric, threshold in QUALITY_GATES.items():
        score = results["metrics"].get(metric, 0)
        if score < threshold:
            failed_gates.append({
                "metric": metric,
                "score": score,
                "threshold": threshold,
            })

    if failed_gates:
        print("Quality gates failed:")
        for gate in failed_gates:
            print(f"  - {gate['metric']}: {gate['score']:.2f} < {gate['threshold']}")
        sys.exit(1)
    else:
        print("All quality gates passed!")
        sys.exit(0)

if __name__ == "__main__":
    check_quality_gates(sys.argv[1])

6. Kubernetes Job for Regular Evaluation

Evaluation Job Definition

apiVersion: batch/v1
kind: CronJob
metadata:
  name: rag-evaluation
  namespace: genai-platform
spec:
  schedule: "0 6 * * *"  # Daily at 6 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: evaluator
              image: your-registry/rag-evaluator:latest
              env:
                - name: OPENAI_API_KEY
                  valueFrom:
                    secretKeyRef:
                      name: openai-credentials
                      key: api-key
                - name: MILVUS_HOST
                  value: "milvus-proxy.ai-data.svc.cluster.local"
                - name: RESULTS_BUCKET
                  value: "s3://rag-evaluation-results"
              command:
                - python
                - /app/evaluate.py
                - --config=/app/config/evaluation.yaml
                - --output=s3
              resources:
                requests:
                  cpu: "1"
                  memory: "2Gi"
                limits:
                  cpu: "2"
                  memory: "4Gi"
          restartPolicy: OnFailure
          serviceAccountName: rag-evaluator

Evaluation Configuration ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: rag-evaluation-config
  namespace: genai-platform
data:
  evaluation.yaml: |
    evaluation:
      metrics:
        - faithfulness
        - answer_relevancy
        - context_precision
        - context_recall

      test_sets:
        - name: "general_knowledge"
          path: "s3://test-data/general.json"
          weight: 0.4
        - name: "technical_docs"
          path: "s3://test-data/technical.json"
          weight: 0.6

      quality_gates:
        faithfulness: 0.8
        answer_relevancy: 0.75
        context_precision: 0.7
        context_recall: 0.7

      alerts:
        slack_webhook: "https://hooks.slack.com/..."
        threshold_drop: 0.1  # Alert on 10%+ drop
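
How the evaluation job might act on the alerts block above is sketched below. The helper function and the score dictionaries are illustrative; only the webhook URL and threshold_drop value come from the ConfigMap:

import requests

def alert_on_score_drop(current: dict, previous: dict, slack_webhook: str, threshold_drop: float = 0.1):
    """Post a Slack message when any metric drops by threshold_drop or more
    compared to the previous evaluation run (illustrative helper, not part of Ragas)."""
    drops = {
        metric: previous[metric] - score
        for metric, score in current.items()
        if metric in previous and previous[metric] - score >= threshold_drop
    }
    if drops:
        lines = [f"- {m}: {previous[m]:.2f} -> {current[m]:.2f}" for m in drops]
        requests.post(slack_webhook, json={"text": "RAG evaluation score drop detected:\n" + "\n".join(lines)})

# Example: faithfulness fell from 0.85 to 0.70 (a 0.15 drop), so an alert is sent.
alert_on_score_drop(
    current={"faithfulness": 0.70, "context_recall": 0.72},
    previous={"faithfulness": 0.85, "context_recall": 0.74},
    slack_webhook="https://hooks.slack.com/...",
    threshold_drop=0.1,
)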

7. Evaluation Result Interpretation and Improvement Guide

Cost Optimization Strategies

RAG evaluation requires LLM API calls, so costs are incurred. Optimize costs with the following strategies:

| Strategy | Description | Estimated Cost Savings |
|---|---|---|
| Sampling Evaluation | Evaluate only representative samples instead of the full dataset | 50-80% |
| Caching | Cache and reuse results for identical question-answer pairs | 30-50% |
| Batch Processing | Bundle multiple evaluations into batches | 20-30% |
| Use Cheaper Model | Use GPT-3.5 instead of GPT-4 (accuracy trade-off) | 90% |
| Incremental Evaluation | Re-evaluate only changed portions | 60-90% |
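
The sampling strategy can be as simple as scoring a fixed-size random subset on each run; a minimal sketch (the sample size of 50 and the seed are arbitrary choices):

import random
from datasets import Dataset

def sample_dataset(dataset: Dataset, sample_size: int = 50, seed: int = 42) -> Dataset:
    """Return a representative random subset to evaluate instead of the full dataset."""
    if len(dataset) <= sample_size:
        return dataset
    random.seed(seed)
    indices = random.sample(range(len(dataset)), sample_size)
    return dataset.select(indices)

# eval_subset = sample_dataset(full_dataset, sample_size=50)
# results = evaluate(eval_subset, metrics=[faithfulness, answer_relevancy])

For caching, the evaluator below hashes each question/answer/contexts triple and reuses previously computed scores across runs:
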
import hashlib
import json

from datasets import Dataset
from ragas import evaluate


class CachedEvaluator:
    """Cost-optimized evaluator with caching"""

    def __init__(self, cache_file='eval_cache.json'):
        self.cache_file = cache_file
        self.cache = self._load_cache()

    def _load_cache(self):
        try:
            with open(self.cache_file, 'r') as f:
                return json.load(f)
        except FileNotFoundError:
            return {}

    def _save_cache(self):
        with open(self.cache_file, 'w') as f:
            # default=float handles numpy scores coming from the results DataFrame
            json.dump(self.cache, f, default=float)

    def _get_cache_key(self, question, answer, contexts):
        """Generate unique key for evaluation item"""
        content = f"{question}|{answer}|{'|'.join(contexts)}"
        return hashlib.md5(content.encode()).hexdigest()

    def evaluate_with_cache(self, dataset, metrics):
        """Evaluation with caching"""
        cached_results = []
        new_items = []

        for item in dataset:
            cache_key = self._get_cache_key(
                item['question'],
                item['answer'],
                item['contexts']
            )

            if cache_key in self.cache:
                cached_results.append(self.cache[cache_key])
            else:
                new_items.append(item)

        # Evaluate only new items
        if new_items:
            new_dataset = Dataset.from_dict({
                k: [item[k] for item in new_items]
                for k in new_items[0].keys()
            })

            new_results = evaluate(new_dataset, metrics=metrics)

            # Keep only the numeric metric columns so cached entries stay JSON-serializable
            scores_df = new_results.to_pandas()
            new_rows = scores_df.select_dtypes("number").to_dict("records")

            # Update cache
            for item, row in zip(new_items, new_rows):
                cache_key = self._get_cache_key(
                    item['question'],
                    item['answer'],
                    item['contexts']
                )
                self.cache[cache_key] = row

            self._save_cache()
            cached_results.extend(new_rows)

        return cached_results


# Usage example
evaluator = CachedEvaluator()
results = evaluator.evaluate_with_cache(dataset, metrics)

AWS Bedrock RAG Evaluation Usage

AWS Bedrock RAG Evaluation provides simpler evaluation through Bedrock native integration. Jobs are created with the create_evaluation_job API on the bedrock client; the snippet below is a sketch, and the nested configuration shapes (dataset, metric names, RAG source) should be verified against the current Bedrock documentation:

import boto3

# RAG evaluation jobs are managed through the Bedrock control-plane client
bedrock = boto3.client('bedrock')

# Create a RAG evaluation job (a sketch: the nested dataset/metric/RAG-source
# configuration is abbreviated and should be checked against the Bedrock docs)
response = bedrock.create_evaluation_job(
    jobName='rag-eval-2026-02-13',
    roleArn='arn:aws:iam::<account-id>:role/<bedrock-eval-role>',  # placeholder
    applicationType='RagEvaluation',
    evaluationConfig={
        'automated': {
            'datasetMetricConfigs': [{
                'taskType': 'QuestionAndAnswer',
                'dataset': {
                    'name': 'rag-eval-dataset',
                    'datasetLocation': {'s3Uri': 's3://my-bucket/eval-dataset.jsonl'},
                },
                # Built-in metrics covering context relevance, coverage,
                # correctness, and faithfulness (verify current identifiers)
                'metricNames': [
                    'Builtin.ContextRelevance',
                    'Builtin.ContextCoverage',
                    'Builtin.Correctness',
                    'Builtin.Faithfulness',
                ],
            }],
            'evaluatorModelConfig': {
                'bedrockEvaluatorModels': [
                    {'modelIdentifier': 'anthropic.claude-3-sonnet-20240229-v1:0'}
                ]
            },
        }
    },
    inferenceConfig={
        # References the Knowledge Base / RAG source under evaluation; omitted here
        'ragConfigs': []
    },
    outputDataConfig={'s3Uri': 's3://my-bucket/eval-results/'},
)

job_arn = response['jobArn']

# Query evaluation job status; per-metric scores are written to the S3 output location
result = bedrock.get_evaluation_job(jobIdentifier=job_arn)
print(f"Status: {result['status']}")

Bedrock RAG Evaluation Advantages:

  • Native integration with Bedrock models
  • S3-based large-scale batch evaluation
  • Automatic CloudWatch metric publishing
  • IAM-based access control
  • No separate infrastructure required

Cost Comparison (per 1000 evaluations):

| Method | Expected Cost | Setup Complexity |
|---|---|---|
| Ragas + OpenAI GPT-4 | $50-100 | Medium |
| Ragas + OpenAI GPT-3.5 | $5-10 | Medium |
| Bedrock RAG Eval (Claude 3 Sonnet) | $20-40 | Low |
| Bedrock RAG Eval (Claude 3 Haiku) | $5-15 | Low |

Per-Metric Improvement Directions

Improvement Checklist

| Symptom | Possible Cause | Solution |
|---|---|---|
| Faithfulness < 0.7 | LLM ignores context | Emphasize "use context only" in the prompt (see the sketch below) |
| Context Precision < 0.6 | Poor retrieval quality | Upgrade embedding model, add re-ranking |
| Context Recall < 0.6 | Relevant docs missing from retrieval | Increase k, use hybrid search |
| Answer Relevancy < 0.7 | Rambling answers | Structure the prompt, specify the output format |
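
For the low-faithfulness case, a context-only answer prompt along these lines usually helps; the wording and variable names are illustrative, not a fixed recipe:

# Illustrative RAG answer prompt that instructs the LLM to rely on the
# retrieved context only, which typically raises the faithfulness score.
CONTEXT_ONLY_PROMPT = """You are a technical assistant.
Answer the question using ONLY the information in the context below.
If the context does not contain the answer, say "I don't know" instead of guessing.

Context:
{context}

Question:
{question}

Answer:"""

prompt = CONTEXT_ONLY_PROMPT.format(
    context="Karpenter is installed via Helm chart and configured with a NodePool CRD.",
    question="How is Karpenter configured on EKS?",
)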

Recommendations
  • Include at least 50 diverse questions in evaluation datasets
  • Use ground truths verified by domain experts
  • Track quality changes over time through regular evaluation
Cautions
  • Ragas evaluation requires LLM API calls, incurring costs
  • Use batch processing and caching for large-scale evaluations
  • Evaluation results may vary depending on the LLM used