Advanced

RAG Evaluation

Measure and improve RAG quality with industry-standard metrics and evaluation frameworks.

Core RAG Metrics

Metric            | What It Measures                                  | Low Score Means
Faithfulness      | Is the answer supported by the retrieved context? | Model is hallucinating beyond the context
Answer Relevancy  | Does the answer actually address the question?    | Answer is off-topic or incomplete
Context Precision | Are the retrieved documents relevant?             | Retriever is returning irrelevant chunks
Context Recall    | Did retrieval find all relevant documents?        | Important documents are being missed
📚
Two categories: Faithfulness and Answer Relevancy measure generation quality. Context Precision and Recall measure retrieval quality. If generation is bad, fix your prompt. If retrieval is bad, fix chunking, embeddings, or retrieval strategy.
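As a toy illustration of the retrieval-side metrics (not how RAGAS computes them — real frameworks use an LLM judge to produce the relevance labels), context precision and recall reduce to ordinary precision/recall over relevance judgments:

```python
# Toy illustration of the retrieval-side metrics. Real frameworks
# (RAGAS, TruLens) derive the relevance labels with an LLM judge;
# here we assume they are already known.

def context_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved chunks that are actually relevant."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids if doc_id in relevant_ids)
    return hits / len(retrieved_ids)

def context_recall(retrieved_ids, relevant_ids):
    """Fraction of relevant chunks that were retrieved."""
    if not relevant_ids:
        return 1.0  # nothing to find
    hits = sum(1 for doc_id in relevant_ids if doc_id in retrieved_ids)
    return hits / len(relevant_ids)

retrieved = ["doc1", "doc7", "doc9"]   # what the retriever returned
relevant = {"doc1", "doc2"}            # ground-truth relevant chunks

print(context_precision(retrieved, relevant))  # 1 of 3 retrieved is relevant -> ~0.33
print(context_recall(retrieved, relevant))     # 1 of 2 relevant was found -> 0.5
```

High precision with low recall suggests raising top-k; low precision with high recall suggests reranking or better filtering.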

RAGAS Framework

RAGAS (Retrieval Augmented Generation Assessment) is the most popular open-source evaluation framework for RAG:

Python - RAGAS Evaluation
pip install ragas datasets

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall
)
from datasets import Dataset

# Prepare evaluation dataset
eval_data = {
    "question": [
        "What is the refund policy?",
        "How do I reset my password?",
    ],
    "answer": [
        "Returns are accepted within 30 days...",
        "Go to Settings > Security > Reset Password...",
    ],
    "contexts": [
        ["Our return policy allows 30-day returns..."],
        ["To reset your password, navigate to..."],
    ],
    "ground_truth": [
        "30-day return policy for unused items.",
        "Settings > Security > Reset Password.",
    ]
}

dataset = Dataset.from_dict(eval_data)

# Run evaluation
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy,
             context_precision, context_recall]
)

print(results)
# {'faithfulness': 0.92, 'answer_relevancy': 0.88,
#  'context_precision': 0.85, 'context_recall': 0.90}

TruLens

TruLens provides real-time monitoring and evaluation of RAG applications:

Python - TruLens
pip install trulens-eval

from trulens_eval import Tru, TruChain, Feedback
from trulens_eval.feedback.provider import OpenAI

tru = Tru()
provider = OpenAI()

# Define feedback functions
context = TruChain.select_context(rag_chain)
f_relevance = Feedback(provider.relevance).on_input_output()
f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons)
    .on(context.collect())  # ground against the retrieved context
    .on_output()
)

# Wrap your RAG chain for monitoring
tru_chain = TruChain(
    rag_chain,
    app_id="my_rag_v1",
    feedbacks=[f_relevance, f_groundedness]
)

# Run queries - metrics are automatically recorded
with tru_chain as recording:
    response = rag_chain.invoke("How do I deploy?")

# View results in dashboard
tru.run_dashboard()

LangSmith

LangSmith by LangChain provides tracing, evaluation, and monitoring for RAG pipelines:

Python - LangSmith
# Set environment variables
import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "ls_..."

# All LangChain calls are now automatically traced
# View traces at smith.langchain.com

# Create evaluation dataset
from langsmith import Client

client = Client()
dataset = client.create_dataset("rag-eval")

# Add examples
client.create_example(
    inputs={"question": "What is the refund policy?"},
    outputs={"answer": "30-day return policy..."},
    dataset_id=dataset.id
)
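LangSmith can run evaluators against a dataset server-side, but the core loop is simple enough to sketch in plain Python. In this sketch, `rag_answer` is a hypothetical stand-in for your RAG chain, and the keyword-overlap scorer is a naive placeholder for an LLM judge:

```python
# Minimal offline evaluation loop. `rag_answer` is a hypothetical
# stand-in for a real RAG chain; keyword overlap is a naive
# placeholder for an LLM-based scorer.

def rag_answer(question: str) -> str:
    # Placeholder: in practice, call your RAG pipeline here.
    canned = {
        "What is the refund policy?": "We accept returns within 30 days.",
    }
    return canned.get(question, "I don't know.")

def keyword_score(answer: str, ground_truth: str) -> float:
    """Fraction of ground-truth keywords present in the answer."""
    keywords = {w.lower().strip(".,") for w in ground_truth.split()}
    answer_words = {w.lower().strip(".,") for w in answer.split()}
    return len(keywords & answer_words) / len(keywords)

examples = [
    {"question": "What is the refund policy?",
     "ground_truth": "30-day returns within 30 days"},
]

for ex in examples:
    score = keyword_score(rag_answer(ex["question"]), ex["ground_truth"])
    print(f"{ex['question']!r}: {score:.2f}")
```

The same loop structure applies whether the scorer is keyword overlap, embedding similarity, or an LLM judge — only `keyword_score` changes.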

Building Evaluation Datasets

  1. Collect Real Questions

    Gather actual questions from users, support tickets, or search logs. These are the most realistic test cases.

  2. Create Ground Truth Answers

    Have domain experts write correct answers for each question. Include the source documents that contain the answer.

  3. Cover Edge Cases

    Include questions that should return "I don't know," questions with multiple valid answers, and questions that require information from multiple documents.

  4. Aim for 50-200 Examples

    Start with 50 diverse examples. Expand to 200+ for comprehensive evaluation. Quality matters more than quantity.
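Following these steps, an evaluation set is just a file of structured records. A minimal JSONL sketch — field names are illustrative, mirroring the RAGAS-style schema used earlier, and the content is made up:

```python
import json

# Illustrative eval records following the steps above. Field names
# mirror the RAGAS-style schema used earlier; content is made up.
examples = [
    {   # a normal question answered by a single document
        "question": "What is the refund policy?",
        "ground_truth": "30-day return policy for unused items.",
        "source_docs": ["policies/returns.md"],
    },
    {   # edge case: the answer is not in the corpus at all
        "question": "What is the CEO's home address?",
        "ground_truth": "I don't know.",
        "source_docs": [],
    },
    {   # edge case: requires combining multiple documents
        "question": "Can I get a refund on a discounted item?",
        "ground_truth": "Yes, within 30 days, at the discounted price.",
        "source_docs": ["policies/returns.md", "policies/pricing.md"],
    },
]

with open("rag_eval.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

Storing the source documents per example is what makes context recall measurable later.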

Debugging Poor Retrieval

🔍

Symptom: Irrelevant Results

Fix: Try different embedding models, add hybrid search, adjust chunk sizes, or add metadata filtering.

🚫

Symptom: Missing Documents

Fix: Increase top-k, add multi-query retrieval, or check whether the document was properly ingested and chunked.

💬

Symptom: Hallucinations

Fix: Strengthen the prompt to only use context, add reranking, reduce temperature, or use a more capable model.

📋

Symptom: Incomplete Answers

Fix: Increase context window budget, use parent-child chunking, or add a summarization step for long contexts.
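Most of these fixes start with the same step: look at what the retriever actually returned for a failing query. A minimal inspection helper — the lexical-overlap scorer below is a stand-in for your real embedding similarity:

```python
# Debugging helper: show what the retriever returned for a failing
# query, with scores. The overlap scorer is a stand-in for real
# embedding similarity.

def overlap_score(query: str, chunk: str) -> float:
    """Fraction of query words that appear in the chunk."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0

def inspect_retrieval(query: str, chunks: list[str], top_k: int = 3):
    """Print the top-k chunks for a query, ranked by score."""
    scored = sorted(chunks, key=lambda ch: overlap_score(query, ch),
                    reverse=True)
    for rank, chunk in enumerate(scored[:top_k], start=1):
        print(f"{rank}. score={overlap_score(query, chunk):.2f}  {chunk[:60]}")
    return scored[:top_k]

chunks = [
    "Our return policy allows 30-day returns on unused items.",
    "To reset your password, navigate to Settings > Security.",
    "Shipping takes 3-5 business days within the US.",
]
inspect_retrieval("what is the return policy", chunks)
```

If the right chunk never appears even at high top-k, the problem is ingestion or embeddings; if it appears but ranks low, reranking or hybrid search is the likelier fix.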

Evaluation is iterative: Run evaluations, identify the weakest metric, fix it, re-evaluate. Repeat until you reach your quality targets. Focus on one metric at a time.

What's Next?

The final lesson covers best practices for production RAG systems, optimization, scaling, and common pitfalls.