# RAG Evaluation
Measure and improve RAG quality with industry-standard metrics and evaluation frameworks.
## Core RAG Metrics
| Metric | What It Measures | Low Score Means |
|---|---|---|
| Faithfulness | Is the answer supported by the retrieved context? | Model is hallucinating beyond the context |
| Answer Relevancy | Does the answer actually address the question? | Answer is off-topic or incomplete |
| Context Precision | Are the retrieved documents relevant? | Retriever is returning irrelevant chunks |
| Context Recall | Did retrieval find all relevant documents? | Important documents are being missed |
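Faithfulness and relevancy are usually scored by an LLM judge, but when you already know which chunks are relevant, context precision and recall can be spot-checked with plain set arithmetic. A minimal sketch (the bag-of-words `cosine_sim` is a toy stand-in for real embedding similarity, not a production relevancy metric):

```python
import math
from collections import Counter

def cosine_sim(a: str, b: str) -> float:
    """Cosine similarity over bag-of-words vectors (toy stand-in for embeddings)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved chunks that are actually relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for c in retrieved if c in relevant) / len(retrieved)

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of relevant chunks that retrieval actually found."""
    if not relevant:
        return 1.0
    return sum(1 for c in retrieved if c in relevant) / len(relevant)
```

Low precision points at the retriever returning junk; low recall points at relevant documents never making it into the candidate set.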
## RAGAS Framework

RAGAS (Retrieval Augmented Generation Assessment) is a widely used open-source evaluation framework for RAG:
```bash
pip install ragas
```

```python
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Prepare evaluation dataset
eval_data = {
    "question": [
        "What is the refund policy?",
        "How do I reset my password?",
    ],
    "answer": [
        "Returns are accepted within 30 days...",
        "Go to Settings > Security > Reset Password...",
    ],
    "contexts": [
        ["Our return policy allows 30-day returns..."],
        ["To reset your password, navigate to..."],
    ],
    "ground_truth": [
        "30-day return policy for unused items.",
        "Settings > Security > Reset Password.",
    ],
}
dataset = Dataset.from_dict(eval_data)

# Run evaluation
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(results)
# {'faithfulness': 0.92, 'answer_relevancy': 0.88,
#  'context_precision': 0.85, 'context_recall': 0.90}
```
## TruLens
TruLens provides real-time monitoring and evaluation of RAG applications:
```bash
pip install trulens-eval
```

```python
from trulens_eval import Tru, TruChain, Feedback
from trulens_eval.app import App
from trulens_eval.feedback.provider import OpenAI

tru = Tru()
provider = OpenAI()

# Define feedback functions
f_relevance = Feedback(provider.relevance).on_input_output()

# Groundedness compares the answer against the retrieved context,
# so it needs selectors pointing at both
context = App.select_context(rag_chain)
f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons)
    .on(context.collect())
    .on_output()
)

# Wrap your RAG chain for monitoring
tru_chain = TruChain(
    rag_chain,
    app_id="my_rag_v1",
    feedbacks=[f_relevance, f_groundedness],
)

# Run queries - metrics are automatically recorded
with tru_chain as recording:
    response = tru_chain.invoke("How do I deploy?")

# View results in dashboard
tru.run_dashboard()
```
## LangSmith
LangSmith by LangChain provides tracing, evaluation, and monitoring for RAG pipelines:
```python
# Set environment variables
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "ls_..."

# All LangChain calls are now automatically traced
# View traces at smith.langchain.com

# Create evaluation dataset
from langsmith import Client

client = Client()
dataset = client.create_dataset("rag-eval")

# Add examples
client.create_example(
    inputs={"question": "What is the refund policy?"},
    outputs={"answer": "30-day return policy..."},
    dataset_id=dataset.id,
)
```
## Building Evaluation Datasets
1. **Collect Real Questions.** Gather actual questions from users, support tickets, or search logs. These are the most realistic test cases.
2. **Create Ground Truth Answers.** Have domain experts write correct answers for each question. Include the source documents that contain the answer.
3. **Cover Edge Cases.** Include questions that should return "I don't know," questions with multiple valid answers, and questions that require information from multiple documents.
4. **Aim for 50-200 Examples.** Start with 50 diverse examples. Expand to 200+ for comprehensive evaluation. Quality matters more than quantity.
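The steps above translate into a simple record format. A minimal sketch of one possible schema (field names, file paths, and the `validate` helper are illustrative, not part of any framework):

```python
# Each example records the question, the expert-written answer, and which
# source documents contain it; "I don't know" cases have no source docs.
eval_examples = [
    {
        "question": "What is the refund policy?",
        "ground_truth": "30-day return policy for unused items.",
        "source_docs": ["policies/returns.md"],  # hypothetical path
        "expect_answer": True,
    },
    # Edge case: the knowledge base has no answer; the system should abstain
    {
        "question": "What is the CEO's home address?",
        "ground_truth": "I don't know.",
        "source_docs": [],
        "expect_answer": False,
    },
    # Edge case: answer requires combining multiple documents
    {
        "question": "Can I get a refund after resetting my password?",
        "ground_truth": "Yes; the return and account policies are independent.",
        "source_docs": ["policies/returns.md", "help/password_reset.md"],
        "expect_answer": True,
    },
]

def validate(examples: list[dict]) -> int:
    """Basic sanity checks before running an evaluation."""
    for ex in examples:
        assert ex["question"] and ex["ground_truth"]
        # Answerable questions must cite at least one source document
        assert ex["source_docs"] or not ex["expect_answer"]
    return len(examples)
```

Recording `source_docs` alongside each question is what later lets you score retrieval (did the right documents come back?) separately from generation (was the answer faithful to them?).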
## Debugging Poor Retrieval
- **Symptom: irrelevant results.** Fix: try a different embedding model, add hybrid search, adjust chunk sizes, or add metadata filtering.
- **Symptom: missing documents.** Fix: increase top-k, add multi-query retrieval, and check that the document was properly ingested and chunked.
- **Symptom: hallucinations.** Fix: strengthen the prompt to use only the retrieved context, add reranking, reduce temperature, or use a more capable model.
- **Symptom: incomplete answers.** Fix: increase the context window budget, use parent-child chunking, or add a summarization step for long contexts.
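A quick way to distinguish a retrieval problem from a generation problem is to measure recall@k directly against your eval set. A sketch, assuming a hypothetical `retriever(question, k)` callable that returns a list of document IDs and eval examples with a `source_docs` field:

```python
def recall_at_k(retriever, eval_set: list[dict], k: int) -> float:
    """Fraction of eval questions whose expected document shows up in the top-k."""
    hits = 0
    for ex in eval_set:
        retrieved = retriever(ex["question"], k)
        if any(doc in retrieved for doc in ex["source_docs"]):
            hits += 1
    return hits / len(eval_set)

# If recall@20 is high but recall@4 is low, the documents are being found
# but ranked too low: add a reranker rather than swapping the embedder.
```

Running this at several values of k turns the "missing documents" symptom into a number you can track across retriever changes.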
## What's Next?
The final lesson covers best practices for production RAG systems, optimization, scaling, and common pitfalls.