Advanced

RAG Evaluation

Measure and improve RAG quality with industry-standard metrics and evaluation frameworks.

Core RAG Metrics

Metric            | What It Measures                                  | Low Score Means
Faithfulness      | Is the answer supported by the retrieved context? | Model is hallucinating beyond the context
Answer Relevancy  | Does the answer actually address the question?    | Answer is off-topic or incomplete
Context Precision | Are the retrieved documents relevant?             | Retriever is returning irrelevant chunks
Context Recall    | Did retrieval find all relevant documents?        | Important documents are being missed
📚
Two categories: Faithfulness and Answer Relevancy measure generation quality. Context Precision and Recall measure retrieval quality. If generation is bad, fix your prompt. If retrieval is bad, fix chunking, embeddings, or retrieval strategy.
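As a toy illustration of the retrieval-side metrics (not how RAGAS computes them — real frameworks use an LLM judge to produce the relevance labels), context precision and recall reduce to ordinary precision/recall over relevance judgments:

```python
# Toy illustration of the retrieval-side metrics. Real frameworks
# (RAGAS, TruLens) derive the relevance labels with an LLM judge;
# here we assume they are already known.

def context_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved chunks that are actually relevant."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids if doc_id in relevant_ids)
    return hits / len(retrieved_ids)

def context_recall(retrieved_ids, relevant_ids):
    """Fraction of relevant chunks that were retrieved."""
    if not relevant_ids:
        return 1.0  # nothing to find
    hits = sum(1 for doc_id in relevant_ids if doc_id in retrieved_ids)
    return hits / len(relevant_ids)

retrieved = ["doc1", "doc7", "doc9"]   # what the retriever returned
relevant = {"doc1", "doc2"}            # ground-truth relevant chunks

print(context_precision(retrieved, relevant))  # 1 of 3 retrieved is relevant -> ~0.33
print(context_recall(retrieved, relevant))     # 1 of 2 relevant was found -> 0.5
```

High precision with low recall suggests raising top-k; low precision with high recall suggests reranking or better filtering.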

RAGAS Framework

RAGAS (Retrieval Augmented Generation Assessment) is the most popular open-source evaluation framework for RAG:

Python - RAGAS Evaluation
pip install ragas datasets

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall
)
from datasets import Dataset

# Prepare evaluation dataset
eval_data = {
    "question": [
        "What is the refund policy?",
        "How do I reset my password?",
    ],
    "answer": [
        "Returns are accepted within 30 days...",
        "Go to Settings > Security > Reset Password...",
    ],
    "contexts": [
        ["Our return policy allows 30-day returns..."],
        ["To reset your password, navigate to..."],
    ],
    "ground_truth": [
        "30-day return policy for unused items.",
        "Settings > Security > Reset Password.",
    ]
}

dataset = Dataset.from_dict(eval_data)

# Run evaluation
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy,
             context_precision, context_recall]
)

print(results)
# {'faithfulness': 0.92, 'answer_relevancy': 0.88,
#  'context_precision': 0.85, 'context_recall': 0.90}

TruLens

TruLens provides real-time monitoring and evaluation of RAG applications:

Python - TruLens
pip install trulens-eval

from trulens_eval import Tru, TruChain, Feedback
from trulens_eval.feedback.provider import OpenAI

tru = Tru()
provider = OpenAI()

# Define feedback functions
context = TruChain.select_context(rag_chain)
f_relevance = Feedback(provider.relevance).on_input_output()
f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons)
    .on(context.collect())  # ground against the retrieved context
    .on_output()
)

# Wrap your RAG chain for monitoring
tru_chain = TruChain(
    rag_chain,
    app_id="my_rag_v1",
    feedbacks=[f_relevance, f_groundedness]
)

# Run queries - metrics are automatically recorded
with tru_chain as recording:
    response = rag_chain.invoke("How do I deploy?")

# View results in dashboard
tru.run_dashboard()

LangSmith

LangSmith by LangChain provides tracing, evaluation, and monitoring for RAG pipelines:

Python - LangSmith
# Set environment variables
import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "ls_..."

# All LangChain calls are now automatically traced
# View traces at smith.langchain.com

# Create evaluation dataset
from langsmith import Client

client = Client()
dataset = client.create_dataset("rag-eval")

# Add examples
client.create_example(
    inputs={"question": "What is the refund policy?"},
    outputs={"answer": "30-day return policy..."},
    dataset_id=dataset.id
)
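LangSmith can run evaluators against a dataset server-side, but the core loop is simple enough to sketch in plain Python. In this sketch, `rag_answer` is a hypothetical stand-in for your RAG chain, and the keyword-overlap scorer is a naive placeholder for an LLM judge:

```python
# Minimal offline evaluation loop. `rag_answer` is a hypothetical
# stand-in for a real RAG chain; keyword overlap is a naive
# placeholder for an LLM-based scorer.

def rag_answer(question: str) -> str:
    # Placeholder: in practice, call your RAG pipeline here.
    canned = {
        "What is the refund policy?": "We accept returns within 30 days.",
    }
    return canned.get(question, "I don't know.")

def keyword_score(answer: str, ground_truth: str) -> float:
    """Fraction of ground-truth keywords present in the answer."""
    keywords = {w.lower().strip(".,") for w in ground_truth.split()}
    answer_words = {w.lower().strip(".,") for w in answer.split()}
    return len(keywords & answer_words) / len(keywords)

examples = [
    {"question": "What is the refund policy?",
     "ground_truth": "30-day returns within 30 days"},
]

for ex in examples:
    score = keyword_score(rag_answer(ex["question"]), ex["ground_truth"])
    print(f"{ex['question']!r}: {score:.2f}")
```

The same loop structure applies whether the scorer is keyword overlap, embedding similarity, or an LLM judge — only `keyword_score` changes.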

Building Evaluation Datasets

  1. Collect Real Questions

    Gather actual questions from users, support tickets, or search logs. These are the most realistic test cases.

  2. Create Ground Truth Answers

    Have domain experts write correct answers for each question. Include the source documents that contain the answer.

  3. Cover Edge Cases

    Include questions that should return "I don't know," questions with multiple valid answers, and questions that require information from multiple documents.

  4. Aim for 50-200 Examples

    Start with 50 diverse examples. Expand to 200+ for comprehensive evaluation. Quality matters more than quantity.
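Following these steps, an evaluation set is just a file of structured records. A minimal JSONL sketch — field names are illustrative, mirroring the RAGAS-style schema used earlier, and the content is made up:

```python
import json

# Illustrative eval records following the steps above. Field names
# mirror the RAGAS-style schema used earlier; content is made up.
examples = [
    {   # a normal question answered by a single document
        "question": "What is the refund policy?",
        "ground_truth": "30-day return policy for unused items.",
        "source_docs": ["policies/returns.md"],
    },
    {   # edge case: the answer is not in the corpus at all
        "question": "What is the CEO's home address?",
        "ground_truth": "I don't know.",
        "source_docs": [],
    },
    {   # edge case: requires combining multiple documents
        "question": "Can I get a refund on a discounted item?",
        "ground_truth": "Yes, within 30 days, at the discounted price.",
        "source_docs": ["policies/returns.md", "policies/pricing.md"],
    },
]

with open("rag_eval.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

Storing the source documents per example is what makes context recall measurable later.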

Debugging Poor Retrieval

🔍

Symptom: Irrelevant Results

Fix: Try different embedding models, add hybrid search, adjust chunk sizes, or add metadata filtering.

🚫

Symptom: Missing Documents

Fix: Increase top-k, add multi-query retrieval, or check whether the document was properly ingested and chunked.

💬

Symptom: Hallucinations

Fix: Strengthen the prompt to only use context, add reranking, reduce temperature, or use a more capable model.

📋

Symptom: Incomplete Answers

Fix: Increase context window budget, use parent-child chunking, or add a summarization step for long contexts.
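Most of these fixes start with the same step: look at what the retriever actually returned for a failing query. A minimal inspection helper — the lexical-overlap scorer below is a stand-in for your real embedding similarity:

```python
# Debugging helper: show what the retriever returned for a failing
# query, with scores. The overlap scorer is a stand-in for real
# embedding similarity.

def overlap_score(query: str, chunk: str) -> float:
    """Fraction of query words that appear in the chunk."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0

def inspect_retrieval(query: str, chunks: list[str], top_k: int = 3):
    """Print the top-k chunks for a query, ranked by score."""
    scored = sorted(chunks, key=lambda ch: overlap_score(query, ch),
                    reverse=True)
    for rank, chunk in enumerate(scored[:top_k], start=1):
        print(f"{rank}. score={overlap_score(query, chunk):.2f}  {chunk[:60]}")
    return scored[:top_k]

chunks = [
    "Our return policy allows 30-day returns on unused items.",
    "To reset your password, navigate to Settings > Security.",
    "Shipping takes 3-5 business days within the US.",
]
inspect_retrieval("what is the return policy", chunks)
```

If the right chunk never appears even at high top-k, the problem is ingestion or embeddings; if it appears but ranks low, reranking or hybrid search is the likelier fix.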

Evaluation is iterative: Run evaluations, identify the weakest metric, fix it, re-evaluate. Repeat until you reach your quality targets. Focus on one metric at a time.

What's Next?

The final lesson covers best practices for production RAG systems, optimization, scaling, and common pitfalls.