RAG Best Practices & Checklist
This final lesson distills everything from the course into actionable checklists, common failure modes with fixes, a debugging playbook, and answers to the most frequently asked questions from RAG engineers in production.
Production RAG Deployment Checklist
Use this checklist before deploying any RAG system to production. Each item maps back to a lesson in this course.
Ingestion (Lesson 2)
- Chunking strategy selected and tested — Recursive splitting at 512 tokens with 50-token overlap as baseline, adjusted based on evaluation
- Document parsers tested with real documents — PDFs with tables, scanned docs, HTML with complex layouts
- Metadata attached to every chunk — source, page, section, date, access level at minimum
- Deduplication implemented — Content hashing prevents duplicate chunks
- Incremental update pipeline built — New/modified/deleted docs sync without full re-index
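The deduplication item above can be sketched in a few lines. This is a minimal illustration, not the course's reference implementation: it assumes each chunk is a dict with a `text` field, and normalizes whitespace and case before hashing so trivially different copies collapse to one.

```python
import hashlib

def dedupe_chunks(chunks: list[dict]) -> list[dict]:
    """Drop chunks whose normalized text has already been seen."""
    seen: set[str] = set()
    unique = []
    for chunk in chunks:
        # Normalize whitespace and case so near-identical copies hash the same
        normalized = " ".join(chunk["text"].split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique

chunks = [
    {"text": "Refunds are processed within 5 days."},
    {"text": "Refunds are  processed within 5 days."},  # duplicate, extra space
    {"text": "Shipping takes 2 business days."},
]
print(len(dedupe_chunks(chunks)))  # → 2
```

In production you would store the digests alongside the chunks so deduplication also works across ingestion runs, not just within one batch.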
Embedding & Indexing (Lesson 3)
- Embedding model selected and benchmarked — Tested against your specific domain, not just MTEB scores
- Vector database deployed with backups — Automated backups, tested restore procedure
- HNSW index configured — ef_construction and M parameters tuned for your dataset size
- Hybrid search enabled — Vector + BM25 with alpha parameter tuned via evaluation
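The alpha-weighted hybrid search item can be sketched as a score-fusion step. This is an illustrative sketch, not a specific library's API: it assumes vector similarity is already in [0, 1], min-max normalizes raw BM25 scores into the same range, and blends with `alpha` (1.0 = pure vector, 0.0 = pure keyword).

```python
def hybrid_score(vector_score: float, bm25_score: float, alpha: float = 0.7) -> float:
    """Blend a [0, 1] vector similarity with a normalized BM25 score."""
    return alpha * vector_score + (1 - alpha) * bm25_score

def fuse(vector_hits: dict, bm25_hits: dict, alpha: float = 0.7) -> list:
    """Merge two {doc_id: score} result sets into one ranked list."""
    # Min-max normalize BM25 scores into [0, 1] so the two scales are comparable
    if bm25_hits:
        lo, hi = min(bm25_hits.values()), max(bm25_hits.values())
        span = (hi - lo) or 1.0
        bm25_hits = {d: (s - lo) / span for d, s in bm25_hits.items()}
    ids = set(vector_hits) | set(bm25_hits)
    scored = [(d, hybrid_score(vector_hits.get(d, 0.0), bm25_hits.get(d, 0.0), alpha))
              for d in ids]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranked = fuse({"a": 0.9, "b": 0.5}, {"b": 12.0, "c": 3.0}, alpha=0.7)
print(ranked[0][0])  # → b  (strong on both signals beats strong on one)
```

Tune `alpha` against your evaluation set rather than guessing; the best value depends on how keyword-heavy your queries are.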
Retrieval (Lesson 4)
- Re-ranking implemented — Cross-encoder or Cohere Rerank on top-20 candidates
- Metadata filtering configured — Tenant isolation, access control, date-based filtering
- Top-K value optimized — Tested values from 3 to 20, measured impact on answer quality
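The metadata filtering and top-K items combine into a filter-then-rank step. The sketch below uses a hypothetical in-memory hit shape (`score`, `tenant`, `published`, `text`); in a real deployment the filter is pushed down into the vector database query so non-matching vectors are never scored.

```python
from datetime import date

def filter_candidates(hits: list[dict], tenant: str, min_date: date) -> list[dict]:
    """Apply tenant isolation and date-based filtering before ranking."""
    return [h for h in hits
            if h["tenant"] == tenant and h["published"] >= min_date]

hits = [
    {"score": 0.9, "tenant": "acme",  "published": date(2024, 1, 1), "text": "..."},
    {"score": 0.8, "tenant": "other", "published": date(2024, 6, 1), "text": "..."},
    {"score": 0.7, "tenant": "acme",  "published": date(2020, 1, 1), "text": "..."},
]
top_k = sorted(filter_candidates(hits, "acme", date(2023, 1, 1)),
               key=lambda h: h["score"], reverse=True)[:5]
print(len(top_k))  # → 1  (wrong tenant and stale doc are excluded)
```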
Generation (Lesson 5)
- System prompt constrains answers to context — Model says "I don't know" when context is insufficient
- Citation format defined and tested — Inline citations with source verification
- Streaming enabled — Time-to-first-token under 500ms
- Context window budget managed — 10–15K tokens of context, not the full window
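The context budget item can be enforced with a simple greedy cutoff. This is a sketch under two stated assumptions: chunks arrive already ranked best-first, and a rough 1 token ≈ 0.75 words heuristic stands in for a real tokenizer.

```python
def fit_context(chunks: list[str], budget_tokens: int = 12_000) -> list[str]:
    """Keep chunks (ranked best-first) until the token budget would be exceeded."""
    kept, used = [], 0
    for chunk in chunks:
        est = int(len(chunk.split()) / 0.75)  # rough token estimate; use a tokenizer in prod
        if used + est > budget_tokens:
            break
        kept.append(chunk)
        used += est
    return kept

chunk = "word " * 75          # 75 words ≈ 100 tokens under the heuristic
print(len(fit_context([chunk] * 5, budget_tokens=250)))  # → 2
```

Because the input is ranked, cutting from the tail drops the least relevant material first, which is exactly the behavior you want under a budget.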
Evaluation (Lesson 6)
- Evaluation dataset created — 50+ question-answer pairs covering edge cases
- RAGAS metrics baseline established — Faithfulness, relevancy, precision, recall measured
- Regression tests in CI/CD pipeline — Quality thresholds enforced on every deployment
- User feedback loop implemented — Thumbs up/down, with flagged answers reviewed weekly
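The CI/CD regression-test item reduces to a threshold gate over the evaluation metrics. The threshold values below are illustrative, not prescriptive; set them from your own RAGAS baseline.

```python
# Illustrative floors — derive real values from your measured baseline
THRESHOLDS = {"faithfulness": 0.90, "answer_relevancy": 0.85,
              "context_precision": 0.80, "context_recall": 0.80}

def check_regression(metrics: dict) -> list[str]:
    """Return the metrics below threshold; an empty list means the deploy may proceed."""
    return [name for name, floor in THRESHOLDS.items()
            if metrics.get(name, 0.0) < floor]

failures = check_regression({"faithfulness": 0.93, "answer_relevancy": 0.82,
                             "context_precision": 0.85, "context_recall": 0.88})
print(failures)  # → ['answer_relevancy']
```

In CI, a non-empty result fails the build, so a chunking or prompt change that silently degrades relevancy never reaches production.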
Operations (Lesson 7)
- Caching layers deployed — Semantic cache, embedding cache, LLM response cache
- Monitoring dashboard live — Latency, similarity scores, cache hit rate, cost, error rate
- Alerting configured — Low similarity scores, high error rates, cost spikes
- Cost tracking per query — Budget forecasting based on actual usage patterns
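The per-query cost tracking item is simple arithmetic once you log token counts. The prices below are placeholders (USD per 1M tokens); substitute your provider's current rates.

```python
def query_cost(prompt_tokens: int, completion_tokens: int,
               price_in_per_m: float = 2.50, price_out_per_m: float = 10.00) -> float:
    """Estimate USD cost of one query. Prices are illustrative per 1M tokens."""
    return (prompt_tokens * price_in_per_m
            + completion_tokens * price_out_per_m) / 1_000_000

# A typical RAG query: ~12K tokens of context in, ~500 tokens out
print(query_cost(12_000, 500))  # → 0.035
```

Logging this per query, rather than reading a monthly invoice, is what makes budget forecasting and cost-spike alerting possible.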
Common Failure Modes and Fixes
These are the problems you will encounter in every production RAG system. Knowing them in advance saves weeks of debugging.
| Failure Mode | Symptom | Root Cause | Fix |
|---|---|---|---|
| Irrelevant retrieval | Answer is factually correct but about the wrong topic | Chunks are too generic, embedding model does not capture domain nuance | Add metadata filtering, try domain-specific embeddings, implement re-ranking |
| Stale answers | Answer cites outdated information | Documents updated but index not refreshed | Implement incremental indexing on a schedule, add date metadata filtering |
| Hallucination | Answer contains claims not in the retrieved context | System prompt is too permissive, or context is insufficient | Strengthen prompt constraints, add faithfulness evaluation, implement confidence scoring |
| Lost in the middle | Model ignores relevant chunks in the middle of the context | LLMs attend more to the beginning and end of long contexts | Reduce context to 5–7 chunks, put most relevant chunks first and last |
| Table/structured data failure | Cannot answer questions about data in tables | Chunking destroyed table structure | Extract tables as complete Markdown chunks, do not split across rows |
| Multi-hop question failure | Cannot connect information across multiple documents | Each chunk is retrieved independently, no connection between them | Use multi-query retrieval, increase top-K, consider knowledge graphs |
| Cost explosion | Monthly LLM bill 10x higher than expected | No caching, large context windows, expensive model for every query | Add semantic caching, implement tiered model routing, compress context |
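The "lost in the middle" fix from the table — most relevant chunks first and last — can be sketched as a reordering pass. This is one simple interleaving scheme, assuming the input list is already sorted best-first.

```python
def reorder_for_attention(chunks: list[str]) -> list[str]:
    """Place top-ranked chunks at the start and end of the context, pushing
    the weakest toward the middle, where LLMs attend least."""
    front, back = [], []
    for i, chunk in enumerate(chunks):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

print(reorder_for_attention(["1st", "2nd", "3rd", "4th", "5th"]))
# → ['1st', '3rd', '5th', '4th', '2nd']  (best chunk first, second-best last)
```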
Debugging RAG Quality Issues
When your RAG system gives a bad answer, follow this systematic debugging process:
Step 1: Inspect Retrieved Chunks
Before looking at the generated answer, examine the retrieved chunks. Are they relevant to the question? If not, the problem is in retrieval (chunking, embeddings, or search). If they are relevant, the problem is in generation (prompt, model, or context formatting).
Step 2: Check Similarity Scores
Look at the cosine similarity scores of retrieved chunks. If the top results have scores below 0.3, the embedding model cannot represent this query well. Try rephrasing the query or switching embedding models.
Step 3: Verify Chunk Content
Read the actual chunk text. Is the relevant information present but split across chunk boundaries? If so, increase chunk overlap or switch to structure-aware chunking.
Step 4: Test the Prompt Directly
Take the exact system prompt + context + question and paste it into a chat interface. Does the model give a good answer? If yes, the problem may be in context truncation or formatting in your pipeline.
```python
# Debug utility: inspect the full RAG pipeline for a single query
def debug_rag_query(query: str, pipeline) -> dict:
    """Run a query and return all intermediate results for debugging."""
    # Step 1: See what the retriever found
    query_embedding = pipeline.embed(query)
    raw_results = pipeline.vector_store.search(
        query_vector=query_embedding,
        limit=20,  # get more than usual for debugging
    )

    # Step 2: Log similarity scores
    retrieval_debug = []
    for i, result in enumerate(raw_results):
        retrieval_debug.append({
            "rank": i + 1,
            "score": result.score,
            "source": result.payload.get("source"),
            "page": result.payload.get("page"),
            "text_preview": result.payload["text"][:200],
        })

    # Step 3: See what re-ranking changed
    # (assumes pipeline.rerank returns dicts with "original_rank" and "score" keys)
    reranked = pipeline.rerank(query, raw_results[:20])
    rerank_debug = [
        {"original_rank": r["original_rank"], "new_rank": i + 1,
         "rerank_score": r["score"]}
        for i, r in enumerate(reranked[:5])
    ]

    # Step 4: See the exact prompt sent to the LLM
    context = pipeline.format_context(reranked[:5])
    prompt = pipeline.build_prompt(query, context)

    # Step 5: Get the answer
    answer = pipeline.generate(prompt)

    return {
        "query": query,
        "retrieval": retrieval_debug,
        "reranking": rerank_debug,
        "prompt_length_tokens": int(len(prompt.split()) * 1.3),  # rough estimate
        "answer": answer,
    }
```
Frequently Asked Questions
**What chunk size and strategy should I start with?**
Start with 512 tokens and 50-token overlap using recursive character splitting. This works well for 80% of use cases. If your documents have clear structure (headers, sections), use structure-aware chunking instead. Only switch to semantic chunking if evaluation shows fixed-size chunks are the bottleneck — and measure the tradeoff against the added cost of embedding-based splitting.
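The recommended baseline can be sketched without any library. The version below splits on words as a stand-in for tokens (a real pipeline would use a tokenizer and recursive separators, e.g. paragraphs before sentences before words):

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Fixed-size chunking with overlap, using words as a stand-in for tokens."""
    words = text.split()
    step = chunk_size - overlap          # each window starts 462 words after the last
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), step)]

doc = " ".join(["w"] * 1000)            # a 1000-word document
print(len(chunk_text(doc)))             # → 3
```

The 50-token overlap ensures a sentence that straddles a chunk boundary appears whole in at least one chunk.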
**How many chunks should I retrieve?**
Retrieve 10–20 chunks, then re-rank to the top 3–5 for the LLM context. Retrieving too few misses relevant documents. Retrieving too many and passing them all to the LLM dilutes relevance and triggers the "lost in the middle" problem. The re-ranking step is what makes this work — it lets you cast a wide net (high recall) and then narrow down (high precision).
**Which vector database should I use?**
If you already have PostgreSQL and need fewer than 1M vectors, use pgvector — it avoids adding infrastructure. For dedicated vector workloads, use a managed service (Pinecone, Weaviate Cloud, Qdrant Cloud) unless you have specific compliance requirements that mandate self-hosting. The operational overhead of running a vector database is significant and rarely worth it unless you have a dedicated infrastructure team.
**How do I keep my index in sync with changing documents?**
Implement incremental indexing with content hashing (covered in Lesson 7). Store a hash of each document. On each sync, compare hashes to detect new, modified, and deleted documents. Only re-process changed documents. Run the sync on a schedule that matches your freshness requirements — every 15 minutes for support docs, daily for policy documents, weekly for archival content.
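The hash-comparison sync described above can be sketched as a pure diff function. This assumes you persist a `{doc_id: hash}` map from the previous run; the function names and shapes here are illustrative.

```python
import hashlib

def doc_hash(text: str) -> str:
    """Stable content hash for change detection."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def diff_docs(stored: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare stored {doc_id: hash} against the current corpus {doc_id: text}.
    Returns the doc ids to index, re-process, and delete."""
    current_hashes = {doc_id: doc_hash(text) for doc_id, text in current.items()}
    return {
        "new":      [d for d in current_hashes if d not in stored],
        "modified": [d for d in current_hashes
                     if d in stored and stored[d] != current_hashes[d]],
        "deleted":  [d for d in stored if d not in current_hashes],
    }

stored = {"a": doc_hash("v1"), "b": doc_hash("keep"), "d": doc_hash("gone")}
delta = diff_docs(stored, {"a": "v2", "b": "keep", "c": "brand new"})
print(delta)  # → {'new': ['c'], 'modified': ['a'], 'deleted': ['d']}
```

Only `new` and `modified` documents go back through chunking and embedding, which is what keeps scheduled syncs cheap.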
**What is the biggest mistake teams make with RAG?**
Building a naive RAG demo, getting 70% accuracy, and concluding that "RAG doesn't work for our use case." The gap between demo RAG and production RAG is significant. The techniques in this course — proper chunking, hybrid search, re-ranking, evaluation metrics, caching — typically push accuracy from 70% to 90%+. The second biggest mistake is not having an evaluation framework. Without metrics, you cannot tell whether your changes are helping or hurting.
**Can I build RAG entirely with open-source models?**
Yes. RAG works with any LLM that can process a prompt with context. Open-source models like LLaMA 3, Mistral, and Qwen work well as generators. For embeddings, nomic-embed-text and bge-large are strong open-source options. The main tradeoff is that smaller open-source models may follow RAG prompt constraints less reliably than GPT-4o or Claude, so you may need more prompt engineering to prevent hallucination.
**How do I handle multilingual documents?**
Use a multilingual embedding model (Cohere embed-v4 or multilingual-e5-large). These models map texts in different languages to the same vector space, so a query in English can retrieve a document in French. For the generator, use a model that supports multiple languages (GPT-4o, Claude). Test your evaluation metrics across languages — retrieval quality often varies significantly between languages.
**When should I combine RAG with fine-tuning?**
Combine them when you need both custom behavior AND private knowledge. Fine-tune the model to produce outputs in a specific format, tone, or reasoning style (e.g., always respond in bullet points, always include a confidence level, always use medical terminology correctly). Then use RAG to inject the specific facts and data the model needs to answer each question. This hybrid approach gives you the best of both worlds but is more complex to maintain.
Course Summary
You now have the knowledge to design, build, and operate a production-grade RAG system. Here is the path from zero to production:
| Week | Focus | Deliverable |
|---|---|---|
| 1 | Ingestion + Embedding | Documents chunked, embedded, and indexed in vector DB |
| 2 | Basic RAG pipeline | End-to-end query-to-answer working with naive retrieval |
| 3 | Advanced retrieval | Hybrid search + re-ranking deployed, 15%+ accuracy improvement |
| 4 | Evaluation | RAGAS metrics baselined, 50+ test cases, regression tests in CI |
| 5 | Production hardening | Caching, monitoring, alerting, incremental indexing deployed |
| 6 | Launch | Production deployment with A/B testing and user feedback loop |
Lilly Tech Systems