RAG Best Practices & Checklist
This final lesson distills everything from the course into actionable checklists, common failure modes with fixes, a debugging playbook, and answers to the most frequently asked questions from RAG engineers in production.
Production RAG Deployment Checklist
Use this checklist before deploying any RAG system to production. Each item maps back to a lesson in this course.
Ingestion (Lesson 2)
- Chunking strategy selected and tested — Recursive splitting at 512 tokens with 50-token overlap as baseline, adjusted based on evaluation
- Document parsers tested with real documents — PDFs with tables, scanned docs, HTML with complex layouts
- Metadata attached to every chunk — source, page, section, date, access level at minimum
- Deduplication implemented — Content hashing prevents duplicate chunks
- Incremental update pipeline built — New/modified/deleted docs sync without full re-index
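The deduplication item above can be sketched in a few lines. This is a minimal illustration, not the course's reference implementation: it assumes each chunk is a dict with a `text` field, and normalizes whitespace and case before hashing so trivially different copies collapse to one.

```python
import hashlib

def dedupe_chunks(chunks: list[dict]) -> list[dict]:
    """Drop chunks whose normalized text has already been seen."""
    seen: set[str] = set()
    unique = []
    for chunk in chunks:
        # Normalize whitespace and case so near-identical copies hash the same
        normalized = " ".join(chunk["text"].split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique

chunks = [
    {"text": "Refunds are processed within 5 days."},
    {"text": "Refunds are  processed within 5 days."},  # duplicate, extra space
    {"text": "Shipping takes 2 business days."},
]
print(len(dedupe_chunks(chunks)))  # → 2
```

In production you would store the digests alongside the chunks so deduplication also works across ingestion runs, not just within one batch.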
Embedding & Indexing (Lesson 3)
- Embedding model selected and benchmarked — Tested against your specific domain, not just MTEB scores
- Vector database deployed with backups — Automated backups, tested restore procedure
- HNSW index configured — ef_construction and M parameters tuned for your dataset size
- Hybrid search enabled — Vector + BM25 with alpha parameter tuned via evaluation
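The alpha-weighted hybrid search item can be sketched as a score-fusion step. This is an illustrative sketch, not a specific library's API: it assumes vector similarity is already in [0, 1], min-max normalizes raw BM25 scores into the same range, and blends with `alpha` (1.0 = pure vector, 0.0 = pure keyword).

```python
def hybrid_score(vector_score: float, bm25_score: float, alpha: float = 0.7) -> float:
    """Blend a [0, 1] vector similarity with a normalized BM25 score."""
    return alpha * vector_score + (1 - alpha) * bm25_score

def fuse(vector_hits: dict, bm25_hits: dict, alpha: float = 0.7) -> list:
    """Merge two {doc_id: score} result sets into one ranked list."""
    # Min-max normalize BM25 scores into [0, 1] so the two scales are comparable
    if bm25_hits:
        lo, hi = min(bm25_hits.values()), max(bm25_hits.values())
        span = (hi - lo) or 1.0
        bm25_hits = {d: (s - lo) / span for d, s in bm25_hits.items()}
    ids = set(vector_hits) | set(bm25_hits)
    scored = [(d, hybrid_score(vector_hits.get(d, 0.0), bm25_hits.get(d, 0.0), alpha))
              for d in ids]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranked = fuse({"a": 0.9, "b": 0.5}, {"b": 12.0, "c": 3.0}, alpha=0.7)
print(ranked[0][0])  # → b  (strong on both signals beats strong on one)
```

Tune `alpha` against your evaluation set rather than guessing; the best value depends on how keyword-heavy your queries are.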
Retrieval (Lesson 4)
- Re-ranking implemented — Cross-encoder or Cohere Rerank on top-20 candidates
- Metadata filtering configured — Tenant isolation, access control, date-based filtering
- Top-K value optimized — Tested values from 3 to 20, measured impact on answer quality
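The metadata filtering and top-K items combine into a filter-then-rank step. The sketch below uses a hypothetical in-memory hit shape (`score`, `tenant`, `published`, `text`); in a real deployment the filter is pushed down into the vector database query so non-matching vectors are never scored.

```python
from datetime import date

def filter_candidates(hits: list[dict], tenant: str, min_date: date) -> list[dict]:
    """Apply tenant isolation and date-based filtering before ranking."""
    return [h for h in hits
            if h["tenant"] == tenant and h["published"] >= min_date]

hits = [
    {"score": 0.9, "tenant": "acme",  "published": date(2024, 1, 1), "text": "..."},
    {"score": 0.8, "tenant": "other", "published": date(2024, 6, 1), "text": "..."},
    {"score": 0.7, "tenant": "acme",  "published": date(2020, 1, 1), "text": "..."},
]
top_k = sorted(filter_candidates(hits, "acme", date(2023, 1, 1)),
               key=lambda h: h["score"], reverse=True)[:5]
print(len(top_k))  # → 1  (wrong tenant and stale doc are excluded)
```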
Generation (Lesson 5)
- System prompt constrains answers to context — Model says "I don't know" when context is insufficient
- Citation format defined and tested — Inline citations with source verification
- Streaming enabled — Time-to-first-token under 500ms
- Context window budget managed — 10–15K tokens of context, not the full window
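The context budget item can be enforced with a simple greedy cutoff. This is a sketch under two stated assumptions: chunks arrive already ranked best-first, and a rough 1 token ≈ 0.75 words heuristic stands in for a real tokenizer.

```python
def fit_context(chunks: list[str], budget_tokens: int = 12_000) -> list[str]:
    """Keep chunks (ranked best-first) until the token budget would be exceeded."""
    kept, used = [], 0
    for chunk in chunks:
        est = int(len(chunk.split()) / 0.75)  # rough token estimate; use a tokenizer in prod
        if used + est > budget_tokens:
            break
        kept.append(chunk)
        used += est
    return kept

chunk = "word " * 75          # 75 words ≈ 100 tokens under the heuristic
print(len(fit_context([chunk] * 5, budget_tokens=250)))  # → 2
```

Because the input is ranked, cutting from the tail drops the least relevant material first, which is exactly the behavior you want under a budget.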
Evaluation (Lesson 6)
- Evaluation dataset created — 50+ question-answer pairs covering edge cases
- RAGAS metrics baseline established — Faithfulness, relevancy, precision, recall measured
- Regression tests in CI/CD pipeline — Quality thresholds enforced on every deployment
- User feedback loop implemented — Thumbs up/down, with flagged answers reviewed weekly
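The CI/CD regression-test item reduces to a threshold gate over the evaluation metrics. The threshold values below are illustrative, not prescriptive; set them from your own RAGAS baseline.

```python
# Illustrative floors — derive real values from your measured baseline
THRESHOLDS = {"faithfulness": 0.90, "answer_relevancy": 0.85,
              "context_precision": 0.80, "context_recall": 0.80}

def check_regression(metrics: dict) -> list[str]:
    """Return the metrics below threshold; an empty list means the deploy may proceed."""
    return [name for name, floor in THRESHOLDS.items()
            if metrics.get(name, 0.0) < floor]

failures = check_regression({"faithfulness": 0.93, "answer_relevancy": 0.82,
                             "context_precision": 0.85, "context_recall": 0.88})
print(failures)  # → ['answer_relevancy']
```

In CI, a non-empty result fails the build, so a chunking or prompt change that silently degrades relevancy never reaches production.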
Operations (Lesson 7)
- Caching layers deployed — Semantic cache, embedding cache, LLM response cache
- Monitoring dashboard live — Latency, similarity scores, cache hit rate, cost, error rate
- Alerting configured — Low similarity scores, high error rates, cost spikes
- Cost tracking per query — Budget forecasting based on actual usage patterns
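The per-query cost tracking item is simple arithmetic once you log token counts. The prices below are placeholders (USD per 1M tokens); substitute your provider's current rates.

```python
def query_cost(prompt_tokens: int, completion_tokens: int,
               price_in_per_m: float = 2.50, price_out_per_m: float = 10.00) -> float:
    """Estimate USD cost of one query. Prices are illustrative per 1M tokens."""
    return (prompt_tokens * price_in_per_m
            + completion_tokens * price_out_per_m) / 1_000_000

# A typical RAG query: ~12K tokens of context in, ~500 tokens out
print(query_cost(12_000, 500))  # → 0.035
```

Logging this per query, rather than reading a monthly invoice, is what makes budget forecasting and cost-spike alerting possible.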
Common Failure Modes and Fixes
These are the problems you will encounter in every production RAG system. Knowing them in advance saves weeks of debugging.
| Failure Mode | Symptom | Root Cause | Fix |
|---|---|---|---|
| Irrelevant retrieval | Answer is factually correct but about the wrong topic | Chunks are too generic, embedding model does not capture domain nuance | Add metadata filtering, try domain-specific embeddings, implement re-ranking |
| Stale answers | Answer cites outdated information | Documents updated but index not refreshed | Implement incremental indexing on a schedule, add date metadata filtering |
| Hallucination | Answer contains claims not in the retrieved context | System prompt is too permissive, or context is insufficient | Strengthen prompt constraints, add faithfulness evaluation, implement confidence scoring |
| Lost in the middle | Model ignores relevant chunks in the middle of the context | LLMs attend more to the beginning and end of long contexts | Reduce context to 5–7 chunks, put most relevant chunks first and last |
| Table/structured data failure | Cannot answer questions about data in tables | Chunking destroyed table structure | Extract tables as complete Markdown chunks, do not split across rows |
| Multi-hop question failure | Cannot connect information across multiple documents | Each chunk is retrieved independently, no connection between them | Use multi-query retrieval, increase top-K, consider knowledge graphs |
| Cost explosion | Monthly LLM bill 10x higher than expected | No caching, large context windows, expensive model for every query | Add semantic caching, implement tiered model routing, compress context |
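The "lost in the middle" fix from the table — most relevant chunks first and last — can be sketched as a reordering pass. This is one simple interleaving scheme, assuming the input list is already sorted best-first.

```python
def reorder_for_attention(chunks: list[str]) -> list[str]:
    """Place top-ranked chunks at the start and end of the context, pushing
    the weakest toward the middle, where LLMs attend least."""
    front, back = [], []
    for i, chunk in enumerate(chunks):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

print(reorder_for_attention(["1st", "2nd", "3rd", "4th", "5th"]))
# → ['1st', '3rd', '5th', '4th', '2nd']  (best chunk first, second-best last)
```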
Debugging RAG Quality Issues
When your RAG system gives a bad answer, follow this systematic debugging process:
Step 1: Inspect Retrieved Chunks
Before looking at the generated answer, examine the retrieved chunks. Are they relevant to the question? If not, the problem is in retrieval (chunking, embeddings, or search). If they are relevant, the problem is in generation (prompt, model, or context formatting).
Step 2: Check Similarity Scores
Look at the cosine similarity scores of retrieved chunks. If the top results have scores below 0.3, the embedding model cannot represent this query well. Try rephrasing the query or switching embedding models.
Step 3: Verify Chunk Content
Read the actual chunk text. Is the relevant information present but split across chunk boundaries? If so, increase chunk overlap or switch to structure-aware chunking.
Step 4: Test the Prompt Directly
Take the exact system prompt + context + question and paste it into a chat interface. Does the model give a good answer? If yes, the problem may be in context truncation or formatting in your pipeline.
```python
# Debug utility: inspect the full RAG pipeline for a single query
def debug_rag_query(query: str, pipeline) -> dict:
    """Run a query and return all intermediate results for debugging."""
    # Step 1: See what the retriever found
    query_embedding = pipeline.embed(query)
    raw_results = pipeline.vector_store.search(
        query_vector=query_embedding,
        limit=20,  # get more than usual for debugging
    )

    # Step 2: Log similarity scores
    retrieval_debug = []
    for i, result in enumerate(raw_results):
        retrieval_debug.append({
            "rank": i + 1,
            "score": result.score,
            "source": result.payload.get("source"),
            "page": result.payload.get("page"),
            "text_preview": result.payload["text"][:200],
        })

    # Step 3: See what re-ranking changed
    # (assumes pipeline.rerank returns dicts with "original_rank" and "score" keys)
    reranked = pipeline.rerank(query, raw_results[:20])
    rerank_debug = [
        {"original_rank": r["original_rank"], "new_rank": i + 1,
         "rerank_score": r["score"]}
        for i, r in enumerate(reranked[:5])
    ]

    # Step 4: See the exact prompt sent to the LLM
    context = pipeline.format_context(reranked[:5])
    prompt = pipeline.build_prompt(query, context)

    # Step 5: Get the answer
    answer = pipeline.generate(prompt)

    return {
        "query": query,
        "retrieval": retrieval_debug,
        "reranking": rerank_debug,
        "prompt_length_tokens": int(len(prompt.split()) * 1.3),  # rough estimate
        "answer": answer,
    }
```
Frequently Asked Questions
**What chunk size and strategy should I start with?**
Start with 512 tokens and 50-token overlap using recursive character splitting. This works well for 80% of use cases. If your documents have clear structure (headers, sections), use structure-aware chunking instead. Only switch to semantic chunking if evaluation shows fixed-size chunks are the bottleneck — and measure the tradeoff against the added cost of embedding-based splitting.
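The recommended baseline can be sketched without any library. The version below splits on words as a stand-in for tokens (a real pipeline would use a tokenizer and recursive separators, e.g. paragraphs before sentences before words):

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Fixed-size chunking with overlap, using words as a stand-in for tokens."""
    words = text.split()
    step = chunk_size - overlap          # each window starts 462 words after the last
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), step)]

doc = " ".join(["w"] * 1000)            # a 1000-word document
print(len(chunk_text(doc)))             # → 3
```

The 50-token overlap ensures a sentence that straddles a chunk boundary appears whole in at least one chunk.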
**How many chunks should I retrieve?**
Retrieve 10–20 chunks, then re-rank to the top 3–5 for the LLM context. Retrieving too few misses relevant documents. Retrieving too many and passing them all to the LLM dilutes relevance and triggers the "lost in the middle" problem. The re-ranking step is what makes this work — it lets you cast a wide net (high recall) and then narrow down (high precision).
**Which vector database should I use?**
If you already have PostgreSQL and need fewer than 1M vectors, use pgvector — it avoids adding infrastructure. For dedicated vector workloads, use a managed service (Pinecone, Weaviate Cloud, Qdrant Cloud) unless you have specific compliance requirements that mandate self-hosting. The operational overhead of running a vector database is significant and rarely worth it unless you have a dedicated infrastructure team.
**How do I keep my index in sync with changing documents?**
Implement incremental indexing with content hashing (covered in Lesson 7). Store a hash of each document. On each sync, compare hashes to detect new, modified, and deleted documents. Only re-process changed documents. Run the sync on a schedule that matches your freshness requirements — every 15 minutes for support docs, daily for policy documents, weekly for archival content.
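The hash-comparison sync described above can be sketched as a pure diff function. This assumes you persist a `{doc_id: hash}` map from the previous run; the function names and shapes here are illustrative.

```python
import hashlib

def doc_hash(text: str) -> str:
    """Stable content hash for change detection."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def diff_docs(stored: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare stored {doc_id: hash} against the current corpus {doc_id: text}.
    Returns the doc ids to index, re-process, and delete."""
    current_hashes = {doc_id: doc_hash(text) for doc_id, text in current.items()}
    return {
        "new":      [d for d in current_hashes if d not in stored],
        "modified": [d for d in current_hashes
                     if d in stored and stored[d] != current_hashes[d]],
        "deleted":  [d for d in stored if d not in current_hashes],
    }

stored = {"a": doc_hash("v1"), "b": doc_hash("keep"), "d": doc_hash("gone")}
delta = diff_docs(stored, {"a": "v2", "b": "keep", "c": "brand new"})
print(delta)  # → {'new': ['c'], 'modified': ['a'], 'deleted': ['d']}
```

Only `new` and `modified` documents go back through chunking and embedding, which is what keeps scheduled syncs cheap.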
**What is the biggest mistake teams make with RAG?**
Building a naive RAG demo, getting 70% accuracy, and concluding that "RAG doesn't work for our use case." The gap between demo RAG and production RAG is significant. The techniques in this course — proper chunking, hybrid search, re-ranking, evaluation metrics, caching — typically push accuracy from 70% to 90%+. The second biggest mistake is not having an evaluation framework. Without metrics, you cannot tell whether your changes are helping or hurting.
**Can I build RAG entirely with open-source models?**
Yes. RAG works with any LLM that can process a prompt with context. Open-source models like LLaMA 3, Mistral, and Qwen work well as generators. For embeddings, nomic-embed-text and bge-large are strong open-source options. The main tradeoff is that smaller open-source models may follow RAG prompt constraints less reliably than GPT-4o or Claude, so you may need more prompt engineering to prevent hallucination.
**How do I handle multilingual documents?**
Use a multilingual embedding model (Cohere embed-v4 or multilingual-e5-large). These models map texts in different languages to the same vector space, so a query in English can retrieve a document in French. For the generator, use a model that supports multiple languages (GPT-4o, Claude). Test your evaluation metrics across languages — retrieval quality often varies significantly between languages.
**When should I combine RAG with fine-tuning?**
Combine them when you need both custom behavior AND private knowledge. Fine-tune the model to produce outputs in a specific format, tone, or reasoning style (e.g., always respond in bullet points, always include a confidence level, always use medical terminology correctly). Then use RAG to inject the specific facts and data the model needs to answer each question. This hybrid approach gives you the best of both worlds but is more complex to maintain.
Course Summary
You now have the knowledge to design, build, and operate a production-grade RAG system. Here is the path from zero to production:
| Week | Focus | Deliverable |
|---|---|---|
| 1 | Ingestion + Embedding | Documents chunked, embedded, and indexed in vector DB |
| 2 | Basic RAG pipeline | End-to-end query-to-answer working with naive retrieval |
| 3 | Advanced retrieval | Hybrid search + re-ranking deployed, 15%+ accuracy improvement |
| 4 | Evaluation | RAGAS metrics baselined, 50+ test cases, regression tests in CI |
| 5 | Production hardening | Caching, monitoring, alerting, incremental indexing deployed |
| 6 | Launch | Production deployment with A/B testing and user feedback loop |
Lilly Tech Systems