RAG Evaluation Framework
You cannot improve what you cannot measure. RAG evaluation is uniquely challenging because you need to assess both retrieval quality and generation quality independently. This lesson covers the metrics, tools, and testing strategies that production RAG teams use to measure and improve system quality.
The Four Core RAG Metrics
Every RAG system should be measured on these four dimensions. Together, they tell you whether the right documents are being retrieved and whether the generated answer faithfully represents those documents.
| Metric | What It Measures | Low Score Means | How to Fix |
|---|---|---|---|
| Faithfulness | Is the answer supported by the retrieved context? | The LLM is hallucinating (making up information) | Improve system prompt, add self-consistency checks |
| Answer Relevancy | Does the answer actually address the user's question? | The answer is off-topic or incomplete | Tighten the prompt so the model answers the question directly; verify retrieval surfaces on-topic chunks |
| Context Precision | How much of the retrieved context is actually relevant? | The retriever is returning irrelevant chunks | Add re-ranking, improve chunking, tune hybrid search |
| Context Recall | Did the retriever find all the relevant documents? | Relevant documents exist but were not retrieved | Increase top-K, add multi-query, improve embeddings |
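Three of the four metrics reduce to simple ratios. The sketch below shows those ratios with hypothetical claim/chunk counts; it is a simplification, not the RAGAS implementation, which uses LLM judgments to do the counting (answer relevancy is different again, based on similarity between the question and questions regenerated from the answer).

```python
def faithfulness_score(supported_claims: int, total_claims: int) -> float:
    """Fraction of claims in the answer that the retrieved context supports."""
    return supported_claims / total_claims if total_claims else 0.0

def context_precision_score(relevant_chunks: int, retrieved_chunks: int) -> float:
    """Fraction of retrieved chunks that are actually relevant to the question."""
    return relevant_chunks / retrieved_chunks if retrieved_chunks else 0.0

def context_recall_score(retrieved_relevant: int, total_relevant: int) -> float:
    """Fraction of all relevant documents that the retriever actually found."""
    return retrieved_relevant / total_relevant if total_relevant else 0.0

# Example: the answer makes 5 claims and the context supports 4 of them
print(faithfulness_score(4, 5))  # 0.8
```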
Automated Evaluation with RAGAS
RAGAS (Retrieval-Augmented Generation Assessment) is the standard open-source framework for evaluating RAG systems. It uses LLMs to compute the four core metrics automatically, without human annotation.
```python
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Prepare your evaluation dataset
eval_data = {
    "question": [
        "What is the refund policy for enterprise plans?",
        "How do I reset my API key?",
        "What are the SLA guarantees?",
    ],
    "answer": [
        "Enterprise plans offer a 30-day full refund...",
        "Navigate to Settings > API Keys > Regenerate...",
        "We guarantee 99.9% uptime for enterprise tier...",
    ],
    "contexts": [
        ["Enterprise refund policy: Full refund within 30 days..."],
        ["API Key Management: Go to Settings > API Keys..."],
        ["SLA Terms: 99.9% uptime guarantee for enterprise..."],
    ],
    "ground_truth": [
        "Enterprise plans have a 30-day full refund window.",
        "Go to Settings, then API Keys, click Regenerate.",
        "Enterprise SLA guarantees 99.9% uptime.",
    ],
}
dataset = Dataset.from_dict(eval_data)

# Run evaluation
results = evaluate(
    dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
    ],
)
print(results)
# {'faithfulness': 0.92, 'answer_relevancy': 0.89,
#  'context_precision': 0.85, 'context_recall': 0.88}
```
Building an Evaluation Dataset
The hardest part of RAG evaluation is creating a high-quality evaluation dataset. Here are three approaches:
Approach 1: Manual Curation
Have domain experts write 50–100 question-answer pairs with the expected source documents. This gives the highest quality dataset but is slow and expensive.
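A single curated test case can be kept as a small record plus a validation check so malformed entries never reach the evaluation run. The field names and content below are illustrative, not a required schema:

```python
REQUIRED_FIELDS = {"question", "ground_truth", "expected_sources"}

# Hypothetical example of one expert-written test case
curated_case = {
    "question": "What payment methods are accepted for enterprise invoices?",
    "ground_truth": "Enterprise invoices can be paid by wire transfer or ACH.",
    "expected_sources": ["billing_policy.md", "enterprise_faq.md"],
}

def validate_case(case: dict) -> bool:
    """Reject cases that are missing required fields or have an empty answer."""
    return REQUIRED_FIELDS <= case.keys() and bool(case["ground_truth"].strip())

print(validate_case(curated_case))  # True
```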
Approach 2: LLM-Generated Test Cases
```python
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

generator = TestsetGenerator.from_langchain(
    generator_llm=ChatOpenAI(model="gpt-4o"),
    critic_llm=ChatOpenAI(model="gpt-4o"),
    embeddings=OpenAIEmbeddings(),
)

# Generate test questions from your actual documents
testset = generator.generate_with_langchain_docs(
    documents=your_documents,  # The docs in your RAG knowledge base
    test_size=50,
    distributions={
        simple: 0.3,         # Simple factual questions
        reasoning: 0.4,      # Questions requiring multi-hop reasoning
        multi_context: 0.3,  # Questions requiring multiple source docs
    },
)

test_df = testset.to_pandas()
# Columns: question, ground_truth, contexts, evolution_type
```
Approach 3: Production Query Sampling
Log real user queries, sample representative ones, and have domain experts annotate the expected answers. This gives you the most realistic test set.
```python
import random

def build_eval_set_from_production(query_log: list[dict], sample_size: int = 50) -> list[dict]:
    """Sample production queries and prepare them for annotation."""
    sampled = random.sample(query_log, min(sample_size, len(query_log)))
    eval_set = []
    for entry in sampled:
        eval_set.append({
            "question": entry["query"],
            "rag_answer": entry["response"],
            "retrieved_contexts": entry["contexts"],
            "ground_truth": None,  # To be filled in by a domain expert
            "annotator_notes": "",
        })
    return eval_set  # Export to CSV for annotation
```
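The comment above mentions exporting for annotation; a minimal sketch of that step using the standard library `csv` module (the file name and `" | "` context separator are assumptions):

```python
import csv

def export_for_annotation(eval_set: list[dict], path: str = "eval_annotation.csv") -> None:
    """Write sampled queries to CSV so domain experts can fill in ground truth."""
    fieldnames = ["question", "rag_answer", "retrieved_contexts",
                  "ground_truth", "annotator_notes"]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for row in eval_set:
            # Flatten the list of context strings into one cell
            writer.writerow({**row,
                             "retrieved_contexts": " | ".join(row["retrieved_contexts"])})
```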
A/B Testing RAG Systems
When you change a retrieval strategy, chunking approach, or embedding model, you need to measure the impact on real users. A/B testing lets you compare two RAG configurations in production.
```python
import hashlib
import logging

class RAGABTest:
    def __init__(self, config_a: dict, config_b: dict, split_ratio: float = 0.5):
        self.config_a = config_a  # e.g., {"retriever": "naive", "top_k": 5}
        self.config_b = config_b  # e.g., {"retriever": "hybrid+rerank", "top_k": 10}
        self.split_ratio = split_ratio
        self.logger = logging.getLogger("rag_ab_test")

    def route_query(self, query: str, user_id: str) -> str:
        """Deterministically route a user to variant A or B based on user_id."""
        # Hash-based routing ensures the same user always gets the same variant.
        # Note: Python's built-in hash() is salted per process, so use a
        # cryptographic hash that is stable across restarts.
        bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
        variant = "A" if bucket < self.split_ratio * 100 else "B"
        config = self.config_a if variant == "A" else self.config_b
        answer, metrics = self.execute_rag(query, config)
        self.logger.info({
            "user_id": user_id,
            "variant": variant,
            "query": query,
            "latency_ms": metrics["latency_ms"],
            "num_chunks_retrieved": metrics["num_chunks"],
            "faithfulness_score": metrics.get("faithfulness"),
        })
        return answer

    def analyze_results(self, days: int = 7) -> dict:
        """Compare metrics between variants A and B."""
        # Pull logs, compute: avg faithfulness, avg relevancy,
        # user satisfaction (thumbs up/down), latency p50/p99
        ...
```
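The `analyze_results` method above is left as a stub. One way to flesh out its aggregation step, assuming the logs have been loaded into a list of dicts with `variant`, `latency_ms`, and `faithfulness_score` keys (the helper name and log shape are assumptions):

```python
from statistics import mean, quantiles

def summarize_variants(log_entries: list[dict]) -> dict:
    """Aggregate per-variant quality and latency from A/B test logs."""
    summary = {}
    for variant in ("A", "B"):
        rows = [e for e in log_entries if e["variant"] == variant]
        latencies = [e["latency_ms"] for e in rows]
        summary[variant] = {
            "n": len(rows),
            "avg_faithfulness": mean(e["faithfulness_score"] for e in rows),
            # quantiles(..., n=100) yields 99 cut points: index 49 = p50, 98 = p99
            "latency_p50": quantiles(latencies, n=100)[49],
            "latency_p99": quantiles(latencies, n=100)[98],
        }
    return summary
```

Before declaring a winner, also run a significance test on the per-variant score distributions; a raw difference in averages from small samples is often noise.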
Regression Testing
Every change to your RAG system (new documents, updated embeddings, prompt changes) can degrade quality. Regression testing catches these regressions before they reach users.
```python
# regression_test.py - Run on every deployment
import json
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

def run_regression_tests(rag_pipeline, test_file="eval_dataset.json"):
    """Run the evaluation suite and fail if quality drops below thresholds."""
    with open(test_file) as f:
        test_cases = json.load(f)

    # Run each test case through the current RAG pipeline
    results = []
    for case in test_cases:
        answer, contexts = rag_pipeline.query(case["question"])
        results.append({
            "question": case["question"],
            "answer": answer,
            "contexts": [c["text"] for c in contexts],
            "ground_truth": case["ground_truth"],
        })

    # Compute metrics
    scores = evaluate(
        Dataset.from_list(results),
        metrics=[faithfulness, answer_relevancy],
    )

    # Assert quality thresholds
    assert scores["faithfulness"] >= 0.85, \
        f"Faithfulness dropped to {scores['faithfulness']:.2f} (threshold: 0.85)"
    assert scores["answer_relevancy"] >= 0.80, \
        f"Relevancy dropped to {scores['answer_relevancy']:.2f} (threshold: 0.80)"
    print(f"Regression tests passed: faithfulness={scores['faithfulness']:.2f}, "
          f"relevancy={scores['answer_relevancy']:.2f}")

# Add to CI/CD pipeline:
# python regression_test.py --test-file golden_set.json
```
Key Takeaways
- Measure four core metrics: faithfulness, answer relevancy, context precision, and context recall.
- Use RAGAS for automated evaluation — it computes all four metrics without human annotation.
- Build evaluation datasets from three sources: manual curation, LLM-generated tests, and production query sampling.
- A/B test retrieval changes with hash-based user routing and at least 200 queries per variant.
- Add regression tests to your CI/CD pipeline to catch quality drops before deployment.
- Target benchmarks: faithfulness >0.90, relevancy >0.85, precision >0.80, recall >0.85.
Lilly Tech Systems