
RAG Evaluation Framework

You cannot improve what you cannot measure. RAG evaluation is uniquely challenging because you need to assess both retrieval quality and generation quality independently. This lesson covers the metrics, tools, and testing strategies that production RAG teams use to measure and improve system quality.

The Four Core RAG Metrics

Every RAG system should be measured on these four dimensions. Together, they tell you whether the right documents are being retrieved and whether the generated answer faithfully represents those documents.

| Metric | What It Measures | Low Score Means | How to Fix |
|---|---|---|---|
| Faithfulness | Is the answer supported by the retrieved context? | The LLM is hallucinating (making up information) | Improve the system prompt, add self-consistency checks |
| Answer Relevancy | Does the answer actually address the user's question? | The answer is off-topic or incomplete | Improve retrieval so better chunks are found |
| Context Precision | How much of the retrieved context is actually relevant? | The retriever is returning irrelevant chunks | Add re-ranking, improve chunking, tune hybrid search |
| Context Recall | Did the retriever find all the relevant documents? | Relevant documents exist but were not retrieved | Increase top-K, add multi-query, improve embeddings |
💡 Apply at work: When your RAG system gives bad answers, check these metrics in order: (1) Context Recall — are the right docs being found? (2) Context Precision — are irrelevant docs diluting the context? (3) Faithfulness — is the model staying true to the context? This order lets you isolate whether the problem is in retrieval or in generation.
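To build intuition for the two retrieval-side metrics, both reduce to simple set arithmetic when you already know which chunks are relevant for a query. This is a deliberate simplification (RAGAS uses an LLM judge rather than chunk-ID labels, and the IDs below are made up for illustration):

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved chunks that are actually relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for c in retrieved if c in relevant) / len(retrieved)

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of relevant chunks that the retriever found."""
    if not relevant:
        return 1.0
    return sum(1 for c in relevant if c in set(retrieved)) / len(relevant)

# The retriever returned 4 chunks; 2 of the 3 relevant chunks are among them.
retrieved = ["doc1#c2", "doc3#c1", "doc7#c4", "doc2#c9"]
relevant = {"doc1#c2", "doc3#c1", "doc5#c0"}

print(context_precision(retrieved, relevant))  # 0.5  (2 of 4 retrieved are relevant)
print(context_recall(retrieved, relevant))     # ≈0.667 (2 of 3 relevant were found)
```

Note the tension: raising top-K tends to lift recall while dragging precision down, which is why the table above recommends different fixes for each.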

Automated Evaluation with RAGAS

RAGAS (Retrieval-Augmented Generation Assessment) is the standard open-source framework for evaluating RAG systems. It uses LLMs to compute the four core metrics automatically, without human annotation.

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Prepare your evaluation dataset
eval_data = {
    "question": [
        "What is the refund policy for enterprise plans?",
        "How do I reset my API key?",
        "What are the SLA guarantees?",
    ],
    "answer": [
        "Enterprise plans offer a 30-day full refund...",
        "Navigate to Settings > API Keys > Regenerate...",
        "We guarantee 99.9% uptime for enterprise tier...",
    ],
    "contexts": [
        ["Enterprise refund policy: Full refund within 30 days..."],
        ["API Key Management: Go to Settings > API Keys..."],
        ["SLA Terms: 99.9% uptime guarantee for enterprise..."],
    ],
    "ground_truth": [
        "Enterprise plans have a 30-day full refund window.",
        "Go to Settings, then API Keys, click Regenerate.",
        "Enterprise SLA guarantees 99.9% uptime.",
    ],
}

dataset = Dataset.from_dict(eval_data)

# Run evaluation
results = evaluate(
    dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
    ],
)

print(results)
# {'faithfulness': 0.92, 'answer_relevancy': 0.89,
#  'context_precision': 0.85, 'context_recall': 0.88}
📝 Production benchmark targets: For most enterprise RAG systems, aim for Faithfulness > 0.90, Answer Relevancy > 0.85, Context Precision > 0.80, and Context Recall > 0.85. These targets are achievable with the advanced retrieval strategies from Lesson 4.
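Aggregate scores hide which questions fail. A small triage helper can flag the cases that fall below your targets for manual review (the per-question score dicts below are illustrative; in practice you would export per-sample scores from your evaluation run, e.g. via RAGAS's pandas export):

```python
# Thresholds mirror the production benchmark targets above.
THRESHOLDS = {"faithfulness": 0.90, "answer_relevancy": 0.85}

def flag_failures(per_question_scores: list[dict]) -> list[dict]:
    """Return the test cases that fall below any metric threshold."""
    failures = []
    for row in per_question_scores:
        failed = {m: row[m] for m, t in THRESHOLDS.items() if row[m] < t}
        if failed:
            failures.append({"question": row["question"],
                             "failed_metrics": failed})
    return failures

scores = [
    {"question": "What is the refund policy?",
     "faithfulness": 0.95, "answer_relevancy": 0.91},
    {"question": "What are the SLA guarantees?",
     "faithfulness": 0.62, "answer_relevancy": 0.88},
]
print(flag_failures(scores))
# [{'question': 'What are the SLA guarantees?',
#   'failed_metrics': {'faithfulness': 0.62}}]
```

Reviewing the worst-scoring questions by hand is usually the fastest way to spot systematic problems such as a bad chunk boundary or a misleading document.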

Building an Evaluation Dataset

The hardest part of RAG evaluation is creating a high-quality evaluation dataset. Here are three approaches:

Approach 1: Manual Curation

Have domain experts write 50–100 question-answer pairs with the expected source documents. This gives the highest quality dataset but is slow and expensive.
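Hand-written cases benefit from a lightweight schema check before they enter the eval set. A sketch (the field names and rules here are illustrative, not a RAGAS requirement):

```python
REQUIRED_FIELDS = {"question", "ground_truth", "source_doc_ids"}

def validate_case(case: dict) -> list[str]:
    """Return a list of problems with one hand-written eval case."""
    problems = []
    missing = REQUIRED_FIELDS - case.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if not case.get("question", "").strip().endswith("?"):
        problems.append("question should be phrased as a question")
    if len(case.get("ground_truth", "")) < 10:
        problems.append("ground_truth is too short to grade against")
    return problems

case = {
    "question": "What are the SLA guarantees?",
    "ground_truth": "Enterprise SLA guarantees 99.9% uptime.",
    "source_doc_ids": ["sla-terms-v3"],
}
print(validate_case(case))  # [] — the case passes all checks
```

Running a validator like this over the whole set catches inconsistencies early, while the domain expert is still available to fix them.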

Approach 2: LLM-Generated Test Cases

from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

generator = TestsetGenerator.from_langchain(
    generator_llm=ChatOpenAI(model="gpt-4o"),
    critic_llm=ChatOpenAI(model="gpt-4o"),
    embeddings=OpenAIEmbeddings(),
)

# Generate test questions from your actual documents
testset = generator.generate_with_langchain_docs(
    documents=your_documents,  # The docs in your RAG knowledge base
    test_size=50,
    distributions={
        simple: 0.3,         # Simple factual questions
        reasoning: 0.4,      # Questions requiring multi-hop reasoning
        multi_context: 0.3,  # Questions requiring multiple source docs
    },
)

test_df = testset.to_pandas()
# Columns: question, ground_truth, contexts, evolution_type

Approach 3: Production Query Sampling

Log real user queries, sample representative ones, and have domain experts annotate the expected answers. This gives you the most realistic test set.

import random

def build_eval_set_from_production(query_log: list[dict], sample_size=50):
    """Sample production queries and prepare for annotation."""
    sampled = random.sample(query_log, min(sample_size, len(query_log)))

    eval_set = []
    for entry in sampled:
        eval_set.append({
            "question": entry["query"],
            "rag_answer": entry["response"],
            "retrieved_contexts": entry["contexts"],
            "ground_truth": None,  # To be filled by domain expert
            "annotator_notes": "",
        })

    return eval_set  # Export to CSV for annotation
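The CSV export mentioned in the comment above takes only a few lines with the standard library. A sketch, assuming the record shape produced by `build_eval_set_from_production` (the filename and the `" ||| "` context separator are arbitrary choices):

```python
import csv

def export_for_annotation(eval_set: list[dict],
                          path: str = "eval_batch.csv") -> None:
    """Write sampled queries to a CSV that annotators fill in."""
    fieldnames = ["question", "rag_answer", "retrieved_contexts",
                  "ground_truth", "annotator_notes"]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for row in eval_set:
            # Join context chunks so each query stays one CSV record.
            row = dict(row,
                       retrieved_contexts=" ||| ".join(row["retrieved_contexts"]))
            writer.writerow(row)
```

Annotators fill the empty `ground_truth` column; the separator lets you split the contexts back out when you re-import the file.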

A/B Testing RAG Systems

When you change a retrieval strategy, chunking approach, or embedding model, you need to measure the impact on real users. A/B testing lets you compare two RAG configurations in production.

import hashlib
import logging

class RAGABTest:
    def __init__(self, config_a: dict, config_b: dict, split_ratio=0.5):
        self.config_a = config_a  # e.g., {"retriever": "naive", "top_k": 5}
        self.config_b = config_b  # e.g., {"retriever": "hybrid+rerank", "top_k": 10}
        self.split_ratio = split_ratio
        self.logger = logging.getLogger("rag_ab_test")

    def route_query(self, query: str, user_id: str) -> str:
        """Deterministically route user to A or B based on user_id."""
        # Hash-based routing ensures the same user always gets the same
        # variant. Python's built-in hash() is salted per process, so use
        # a stable hash that survives restarts.
        bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
        variant = "A" if bucket < self.split_ratio * 100 else "B"

        config = self.config_a if variant == "A" else self.config_b
        answer, metrics = self.execute_rag(query, config)

        self.logger.info({
            "user_id": user_id,
            "variant": variant,
            "query": query,
            "latency_ms": metrics["latency_ms"],
            "num_chunks_retrieved": metrics["num_chunks"],
            "faithfulness_score": metrics.get("faithfulness"),
        })

        return answer

    def analyze_results(self, days=7) -> dict:
        """Compare metrics between variants A and B."""
        # Pull logs, compute: avg faithfulness, avg relevancy,
        # user satisfaction (thumbs up/down), latency p50/p99
        pass
💡 Apply at work: Run A/B tests for at least one week with at least 200 queries per variant before drawing conclusions. Track both automated metrics (faithfulness, latency) and user signals (thumbs up/down, follow-up questions, escalation rate).
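Before declaring a winner, check that the difference in user signals is larger than chance. A minimal sketch of a two-proportion z-test on thumbs-up counts, in pure standard-library Python (real analyses would typically reach for scipy or statsmodels; the counts below are invented):

```python
import math

def two_proportion_z_test(wins_a: int, n_a: int,
                          wins_b: int, n_b: int) -> tuple[float, float]:
    """Compare thumbs-up rates between variants; returns (z, two-sided p-value)."""
    p_a, p_b = wins_a / n_a, wins_b / n_b
    # Pooled rate under the null hypothesis that A and B are equal
    p_pool = (wins_a + wins_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Variant B got 150/200 thumbs up vs 120/200 for variant A.
z, p = two_proportion_z_test(120, 200, 150, 200)
print(f"z={z:.2f}, p={p:.4f}")
```

With 200 queries per variant (the minimum suggested above), only fairly large differences in satisfaction reach significance, which is another argument for running the test a full week rather than stopping early.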

Regression Testing

Every change to your RAG system (new documents, updated embeddings, prompt changes) can degrade quality. Regression testing catches these regressions before they reach users.

# regression_test.py - Run on every deployment
import json
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

def run_regression_tests(rag_pipeline, test_file="eval_dataset.json"):
    """Run evaluation suite and fail if quality drops below thresholds."""
    with open(test_file) as f:
        test_cases = json.load(f)

    # Run each test case through the current RAG pipeline
    results = []
    for case in test_cases:
        answer, contexts = rag_pipeline.query(case["question"])
        results.append({
            "question": case["question"],
            "answer": answer,
            "contexts": [c["text"] for c in contexts],
            "ground_truth": case["ground_truth"],
        })

    # Compute metrics
    scores = evaluate(Dataset.from_list(results),
                      metrics=[faithfulness, answer_relevancy])

    # Assert quality thresholds
    assert scores["faithfulness"] >= 0.85, \
        f"Faithfulness dropped to {scores['faithfulness']:.2f} (threshold: 0.85)"
    assert scores["answer_relevancy"] >= 0.80, \
        f"Relevancy dropped to {scores['answer_relevancy']:.2f} (threshold: 0.80)"

    print(f"Regression tests passed: faithfulness={scores['faithfulness']:.2f}, "
          f"relevancy={scores['answer_relevancy']:.2f}")

# Add to CI/CD pipeline:
# python regression_test.py --test-file golden_set.json
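Fixed thresholds can miss gradual drift: quality can decay a little on every release while always staying above the bar. A variant worth considering (a sketch; the baseline filename and tolerance are arbitrary) compares each run against the last accepted run's scores:

```python
import json
import os

TOLERANCE = 0.03  # allow small run-to-run noise from the LLM judge

def check_against_baseline(scores: dict,
                           baseline_path: str = "baseline_scores.json") -> list[str]:
    """Flag any metric that dropped more than TOLERANCE below the stored baseline."""
    if not os.path.exists(baseline_path):
        # First run establishes the baseline.
        with open(baseline_path, "w") as f:
            json.dump(scores, f)
        return []
    with open(baseline_path) as f:
        baseline = json.load(f)
    return [f"{m}: {scores[m]:.2f} vs baseline {baseline[m]:.2f}"
            for m in baseline
            if scores.get(m, 0.0) < baseline[m] - TOLERANCE]
```

On a clean run you would overwrite the baseline with the new scores, so the comparison always tracks the most recent accepted release rather than an increasingly stale snapshot.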

Key Takeaways

  • Measure four core metrics: faithfulness, answer relevancy, context precision, and context recall.
  • Use RAGAS for automated evaluation — it computes all four metrics without human annotation.
  • Build evaluation datasets from three sources: manual curation, LLM-generated tests, and production query sampling.
  • A/B test retrieval changes with hash-based user routing and at least 200 queries per variant.
  • Add regression tests to your CI/CD pipeline to catch quality drops before deployment.
  • Target benchmarks: faithfulness >0.90, relevancy >0.85, precision >0.80, recall >0.85.