RAG Pattern — Retrieval-Augmented Generation
The most widely deployed AI design pattern in production. Learn how to ground LLM responses in your own data using retrieval, chunking, reranking, and query transformation — from naive RAG to advanced self-RAG architectures.
The Problem: LLM Knowledge Limitations
Large language models have a fundamental limitation: they only know what was in their training data. This creates three critical problems for production applications:
- Knowledge cutoff: The model does not know about events, documents, or data created after its training date. Ask GPT-4 about something that happened last week and it cannot answer.
- No access to private data: The model has never seen your company's internal documentation, customer records, product specs, or proprietary knowledge base.
- Hallucination: When the model lacks knowledge, it often generates plausible-sounding but completely fabricated information instead of saying "I do not know."
The Solution: Retrieve, Then Generate
The RAG pattern solves this by adding a retrieval step before generation. Instead of relying solely on the model's internal knowledge, RAG retrieves relevant documents from an external knowledge base and includes them in the prompt context.
RAG Architecture Diagram
RAG ARCHITECTURE
================
User Query: "What is our refund policy for enterprise plans?"
|
v
+------------------+
| 1. EMBED QUERY | Convert query to vector using embedding model
+------------------+
|
v
+------------------+
| 2. RETRIEVE | Search vector DB for similar document chunks
| Vector Database | Return top-k most relevant chunks
+------------------+
|
v
+------------------+
| 3. RERANK | (Optional) Re-score results with cross-encoder
| Cross-Encoder | Improve precision of top results
+------------------+
|
v
+------------------+
| 4. AUGMENT | Insert retrieved chunks into LLM prompt
| Prompt Builder | "Based on these documents: {chunks}..."
+------------------+
|
v
+------------------+
| 5. GENERATE | LLM generates answer grounded in retrieved docs
| LLM (Claude/GPT) | Includes citations to source documents
+------------------+
|
v
Answer: "Enterprise plans have a 30-day refund policy..."
[Source: refund-policy.md, section 4.2]
Three Generations of RAG
RAG has evolved significantly since its introduction. Understanding the three generations helps you choose the right complexity level for your use case.
Naive RAG (Generation 1)
The simplest form: embed, retrieve, generate. No preprocessing of the query, no reranking, and fixed chunking.
# Naive RAG - The simplest implementation
import anthropic
import chromadb

client = anthropic.Anthropic()
chroma = chromadb.PersistentClient(path="./chroma_db")
collection = chroma.get_or_create_collection("documents")

def naive_rag(query: str) -> str:
    # Step 1: Retrieve (ChromaDB handles embedding internally)
    results = collection.query(
        query_texts=[query],
        n_results=5
    )

    # Step 2: Build context from retrieved chunks
    context = "\n\n---\n\n".join(results["documents"][0])

    # Step 3: Generate with context
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"""Answer the question based on the following context.
If the context does not contain enough information, say so.

Context:
{context}

Question: {query}"""
        }]
    )
    return response.content[0].text
Advanced RAG (Generation 2)
Adds pre-retrieval processing (query transformation), improved retrieval (hybrid search), and post-retrieval processing (reranking, compression).
# Advanced RAG - With query transformation, hybrid search, and reranking
import anthropic
import chromadb
from sentence_transformers import CrossEncoder

client = anthropic.Anthropic()
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
chroma = chromadb.PersistentClient(path="./chroma_db")
collection = chroma.get_or_create_collection("documents")

def transform_query(original_query: str) -> list[str]:
    """Generate multiple search queries from the original question."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": f"""Generate 3 different search queries that would help
answer this question. Return only the queries, one per line.

Question: {original_query}"""
        }]
    )
    queries = response.content[0].text.strip().split("\n")
    return [original_query] + queries  # Include original

def hybrid_retrieve(queries: list[str], k: int = 20) -> list[dict]:
    """Retrieve from multiple queries and deduplicate."""
    all_results = {}
    for query in queries:
        results = collection.query(
            query_texts=[query],
            n_results=k,
            include=["documents", "metadatas", "distances"]
        )
        for i, doc_id in enumerate(results["ids"][0]):
            if doc_id not in all_results:
                all_results[doc_id] = {
                    "id": doc_id,
                    "text": results["documents"][0][i],
                    "metadata": results["metadatas"][0][i],
                    "best_distance": results["distances"][0][i]
                }
            else:
                # Keep the best distance score
                existing = all_results[doc_id]["best_distance"]
                new_dist = results["distances"][0][i]
                if new_dist < existing:
                    all_results[doc_id]["best_distance"] = new_dist
    return list(all_results.values())

def rerank(query: str, candidates: list[dict], top_k: int = 5) -> list[dict]:
    """Re-score candidates with cross-encoder for better precision."""
    pairs = [(query, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)
    for i, candidate in enumerate(candidates):
        candidate["rerank_score"] = float(scores[i])
    candidates.sort(key=lambda x: x["rerank_score"], reverse=True)
    return candidates[:top_k]

def advanced_rag(query: str) -> str:
    # Step 1: Query transformation
    queries = transform_query(query)

    # Step 2: Hybrid retrieval (multiple queries)
    candidates = hybrid_retrieve(queries, k=20)

    # Step 3: Reranking
    top_chunks = rerank(query, candidates, top_k=5)

    # Step 4: Generate with high-quality context
    context = "\n\n---\n\n".join([
        f"[Source: {c['metadata'].get('source', 'unknown')}]\n{c['text']}"
        for c in top_chunks
    ])
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"""Answer the question using ONLY the provided context.
Cite your sources using [Source: filename] format.
If the context does not contain the answer, say "I don't have
enough information to answer this question."

Context:
{context}

Question: {query}"""
        }]
    )
    return response.content[0].text
Modular RAG (Generation 3)
Treats each RAG component as a pluggable module. The pipeline dynamically selects which modules to activate based on the query type.
# Modular RAG - Pluggable components with adaptive pipeline
class RAGModule:
    """Base class for all RAG modules."""
    def process(self, state: dict) -> dict:
        raise NotImplementedError

class QueryClassifier(RAGModule):
    """Classify query to determine which pipeline to use."""
    def process(self, state: dict) -> dict:
        query = state["query"]
        # Use LLM to classify query type
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=50,
            messages=[{
                "role": "user",
                "content": f"""Classify this query into one category:
- FACTUAL: needs specific facts from documents
- ANALYTICAL: needs synthesis across multiple documents
- CONVERSATIONAL: general chat, no retrieval needed

Query: {query}
Category:"""
            }]
        )
        state["query_type"] = response.content[0].text.strip()
        return state

class AdaptiveRetriever(RAGModule):
    """Choose retrieval strategy based on query type."""
    def process(self, state: dict) -> dict:
        if state["query_type"] == "CONVERSATIONAL":
            state["chunks"] = []  # No retrieval needed
        elif state["query_type"] == "ANALYTICAL":
            # Broader retrieval for analytical queries
            state["chunks"] = hybrid_retrieve(
                transform_query(state["query"]), k=30
            )
        else:
            # Focused retrieval for factual queries
            state["chunks"] = hybrid_retrieve(
                [state["query"]], k=10
            )
        return state

class SelfReflection(RAGModule):
    """Check if retrieved context is sufficient to answer."""
    def process(self, state: dict) -> dict:
        if not state["chunks"]:
            state["needs_retrieval"] = False
            return state
        context = "\n".join([c["text"][:200] for c in state["chunks"][:5]])
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=50,
            messages=[{
                "role": "user",
                "content": f"""Can this context answer the question?

Context (preview): {context}
Question: {state['query']}
Answer YES or NO:"""
            }]
        )
        state["context_sufficient"] = "YES" in response.content[0].text
        return state

class ModularRAGPipeline:
    """Compose modules into an adaptive pipeline."""
    def __init__(self):
        self.classifier = QueryClassifier()
        self.retriever = AdaptiveRetriever()
        self.reflection = SelfReflection()

    def run(self, query: str) -> str:
        state = {"query": query}
        state = self.classifier.process(state)
        state = self.retriever.process(state)
        state = self.reflection.process(state)
        if not state.get("context_sufficient", True) and state["chunks"]:
            # Retry with expanded retrieval
            state["query_type"] = "ANALYTICAL"
            state = self.retriever.process(state)
        return self._generate(state)

    def _generate(self, state: dict) -> str:
        # ... generate response from final state
        pass
Chunking Strategies
How you split documents into chunks has an enormous impact on RAG quality. The wrong chunking strategy is the most common cause of poor RAG performance.
| Strategy | How It Works | Best For | Chunk Size |
|---|---|---|---|
| Fixed-size | Split every N characters/tokens with overlap | Unstructured text, quick prototyping | 500-1000 tokens |
| Sentence-based | Split on sentence boundaries | Well-structured prose, articles | 3-5 sentences |
| Semantic | Use embedding similarity to find topic boundaries | Long documents with topic shifts | Variable |
| Recursive | Split by headers, then paragraphs, then sentences | Structured docs (markdown, HTML) | Variable |
| Document-based | Each document is one chunk | Short documents (emails, tickets) | Entire doc |
| Parent-child | Index small chunks, retrieve parent sections | Precision retrieval with full context | Small index, large retrieval |
# Chunking implementation examples
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    TokenTextSplitter,
)

# Fixed-size chunking with overlap
fixed_splitter = TokenTextSplitter(
    chunk_size=500,
    chunk_overlap=50  # 10% overlap prevents cutting mid-thought
)

# Recursive chunking (best general-purpose choice)
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],  # Paragraphs first, then lines, then sentences
    length_function=len
)

# Parent-child chunking strategy
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

def parent_child_chunk(document: str):
    """Index small chunks, but retrieve their parent sections."""
    parent_chunks = parent_splitter.split_text(document)
    index_entries = []
    for parent_idx, parent in enumerate(parent_chunks):
        children = child_splitter.split_text(parent)
        for child in children:
            index_entries.append({
                "text": child,          # Indexed for search
                "parent_text": parent,  # Returned on retrieval
                "parent_id": parent_idx
            })
    return index_entries
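The table above also lists semantic chunking, which the snippets do not cover. Below is a minimal sketch: it splits on sentence boundaries, then starts a new chunk wherever the similarity between adjacent sentence embeddings drops below a threshold. Here `embed_fn` is a placeholder for any callable mapping a list of sentences to a NumPy array (e.g. `SentenceTransformer.encode`), and the 0.75 threshold is an illustrative default, not a tuned value.

```python
import re
import numpy as np

def semantic_chunk(text: str, embed_fn, threshold: float = 0.75) -> list[str]:
    """Group consecutive sentences; break where cosine similarity between
    adjacent sentences falls below `threshold` (a likely topic boundary)."""
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    if not sentences:
        return []
    embeddings = np.asarray(embed_fn(sentences), dtype=float)
    # Normalize rows so the dot product below equals cosine similarity
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    embeddings = embeddings / np.clip(norms, 1e-10, None)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = float(np.dot(embeddings[i - 1], embeddings[i]))
        if similarity < threshold:
            chunks.append(" ".join(current))  # Topic shift: close the chunk
            current = [sentences[i]]
        else:
            current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```

In practice you would also cap chunk length, since a long run of similar sentences can otherwise produce an oversized chunk.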
Retrieval Strategies
How you search for relevant chunks is the second most important factor in RAG quality (after chunking).
Dense Retrieval (Vector Search)
Convert query and documents to vectors, find nearest neighbors. Best for semantic similarity.
# Dense retrieval with sentence-transformers
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

def dense_retrieve(query: str, corpus_embeddings, corpus_texts, top_k=5):
    # Normalize so the dot product below equals cosine similarity
    # (corpus_embeddings must be normalized the same way at index time)
    query_embedding = model.encode([query], normalize_embeddings=True)
    similarities = np.dot(corpus_embeddings, query_embedding.T).flatten()
    top_indices = np.argsort(similarities)[-top_k:][::-1]
    return [(corpus_texts[i], similarities[i]) for i in top_indices]
Sparse Retrieval (BM25)
Keyword-based scoring using term frequency. Best for exact term matching.
# Sparse retrieval with BM25
from rank_bm25 import BM25Okapi
import numpy as np
import re

def tokenize(text):
    return re.findall(r'\w+', text.lower())

corpus_tokens = [tokenize(doc) for doc in corpus_texts]
bm25 = BM25Okapi(corpus_tokens)

def sparse_retrieve(query: str, top_k=5):
    query_tokens = tokenize(query)
    scores = bm25.get_scores(query_tokens)
    top_indices = np.argsort(scores)[-top_k:][::-1]
    return [(corpus_texts[i], scores[i]) for i in top_indices]
Hybrid Retrieval (Dense + Sparse)
Combine both approaches using Reciprocal Rank Fusion (RRF) for the best of both worlds.
# Hybrid retrieval with Reciprocal Rank Fusion
def reciprocal_rank_fusion(
    dense_results: list,
    sparse_results: list,
    k: int = 60
) -> list:
    """Combine dense and sparse results using RRF."""
    scores = {}
    for rank, (text, _) in enumerate(dense_results):
        scores[text] = scores.get(text, 0) + 1.0 / (k + rank + 1)
    for rank, (text, _) in enumerate(sparse_results):
        scores[text] = scores.get(text, 0) + 1.0 / (k + rank + 1)
    sorted_results = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return sorted_results

# Usage
dense = dense_retrieve(query, embeddings, texts, top_k=20)
sparse = sparse_retrieve(query, top_k=20)
hybrid = reciprocal_rank_fusion(dense, sparse)
final_chunks = [text for text, score in hybrid[:5]]
Query Transformation Techniques
The user's raw query is often not the best search query. These techniques transform it for better retrieval.
HyDE (Hypothetical Document Embedding)
# HyDE: Generate a hypothetical answer, then search with THAT
def hyde_retrieve(query: str, collection, top_k=5):
    # Step 1: Generate hypothetical answer
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"Write a short paragraph that would answer: {query}"
        }]
    )
    hypothetical_doc = response.content[0].text

    # Step 2: Search using the hypothetical document as the query
    # The hypothetical doc is closer in embedding space to real docs
    results = collection.query(
        query_texts=[hypothetical_doc],
        n_results=top_k
    )
    return results
Step-Back Prompting
# Step-back: Ask a broader question first for better context
def step_back_retrieve(query: str, collection, top_k=5):
    # Generate a more general version of the question
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": f"""Given this specific question, generate a broader
"step-back" question that would help provide context.

Specific: {query}
Step-back question:"""
        }]
    )
    broad_query = response.content[0].text.strip()

    # Retrieve for both queries
    specific_results = collection.query(query_texts=[query], n_results=top_k)
    broad_results = collection.query(query_texts=[broad_query], n_results=top_k)

    # Combine and deduplicate
    return merge_results(specific_results, broad_results)
Query Decomposition
# Decompose complex queries into sub-questions
def decompose_and_retrieve(query: str, collection, top_k=3):
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": f"""Break this complex question into 2-4 simpler
sub-questions that can be answered independently.
Return one question per line.

Complex question: {query}"""
        }]
    )
    sub_questions = response.content[0].text.strip().split("\n")

    all_chunks = []
    for sub_q in sub_questions:
        results = collection.query(query_texts=[sub_q.strip()], n_results=top_k)
        all_chunks.extend(zip(results["documents"][0], results["ids"][0]))

    # Deduplicate by ID
    seen = set()
    unique_chunks = []
    for text, doc_id in all_chunks:
        if doc_id not in seen:
            seen.add(doc_id)
            unique_chunks.append(text)
    return unique_chunks
Self-RAG: Adaptive Retrieval
Self-RAG lets the model decide whether retrieval is even necessary, and critically evaluates the relevance and support of retrieved documents.
# Self-RAG: The model decides when and how to retrieve
def self_rag(query: str, collection) -> str:
    # Step 1: Does this query need retrieval?
    need_retrieval = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=10,
        messages=[{
            "role": "user",
            "content": f"""Does answering this question require looking up
specific factual information? Answer YES or NO.

Question: {query}"""
        }]
    )
    if "NO" in need_retrieval.content[0].text:
        # Answer directly without retrieval
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            messages=[{"role": "user", "content": query}]
        )
        return response.content[0].text

    # Step 2: Retrieve
    results = collection.query(query_texts=[query], n_results=5)
    chunks = results["documents"][0]

    # Step 3: Evaluate relevance of each chunk
    relevant_chunks = []
    for chunk in chunks:
        relevance = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=10,
            messages=[{
                "role": "user",
                "content": f"""Is this document relevant to the question?

Document: {chunk[:500]}
Question: {query}
Answer RELEVANT or IRRELEVANT:"""
            }]
        )
        # Check IRRELEVANT first: "RELEVANT" is a substring of "IRRELEVANT",
        # so a plain `"RELEVANT" in text` check would accept every chunk
        verdict = relevance.content[0].text
        if "IRRELEVANT" not in verdict and "RELEVANT" in verdict:
            relevant_chunks.append(chunk)

    if not relevant_chunks:
        return "I could not find relevant information to answer this question."

    # Step 4: Generate with validated context
    context = "\n\n---\n\n".join(relevant_chunks)
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"""Answer based on the verified context below.

Context:\n{context}\n\nQuestion: {query}"""
        }]
    )
    return response.content[0].text
Complete RAG Pipeline with LangChain
# Production-ready RAG with LangChain + Claude + ChromaDB
from langchain_anthropic import ChatAnthropic
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_community.document_loaders import DirectoryLoader

# 1. Load documents
loader = DirectoryLoader("./docs/", glob="**/*.md")
documents = loader.load()
print(f"Loaded {len(documents)} documents")

# 2. Chunk documents
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n## ", "\n### ", "\n\n", "\n", ". ", " "]
)
chunks = splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks")

# 3. Create vector store
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5")
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

# 4. Create retriever with search kwargs
retriever = vectorstore.as_retriever(
    search_type="mmr",  # Maximal Marginal Relevance for diversity
    search_kwargs={"k": 5, "fetch_k": 20}
)

# 5. Define prompt template
prompt = PromptTemplate(
    template="""You are a helpful assistant. Answer the question based ONLY
on the following context. If the context does not contain the answer,
say "I don't have enough information to answer this."
Always cite your sources.

Context:
{context}

Question: {question}

Answer:""",
    input_variables=["context", "question"]
)

# 6. Create chain
llm = ChatAnthropic(model="claude-sonnet-4-20250514", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True
)

# 7. Query
result = qa_chain.invoke({"query": "What is our refund policy?"})
print(result["result"])
for doc in result["source_documents"]:
    print(f"  Source: {doc.metadata['source']}")
RAG Anti-Patterns (Common Mistakes)
Avoid these common RAG mistakes:
- Chunks too large (2000+ tokens): Large chunks dilute relevance. The retrieved chunk contains your answer but also 80% irrelevant text that confuses the LLM.
- Chunks too small (50-100 tokens): Small chunks lose context. A sentence fragment about "the refund policy" is useless without knowing which product it applies to.
- No overlap between chunks: Sentences that span chunk boundaries get cut in half. Always use 10-20% overlap.
- Skipping reranking: Vector search is approximate. Reranking with a cross-encoder improves top-5 precision by 15-30% with minimal latency cost.
- Ignoring metadata: Storing only text without source, date, author, or section info makes it impossible to filter, cite, or debug.
- Not testing retrieval separately: Measure retrieval quality (recall@k, precision@k) independently from generation quality. Poor retrieval cannot be fixed by a better LLM.
- Stuffing too many chunks: Including 20 chunks in the prompt fills the context window with noise. 3-5 high-quality chunks outperform 20 mediocre ones.
When NOT to Use RAG
RAG is not always the right pattern. Do not use RAG when:
- The LLM already knows the answer: For general knowledge questions (e.g., "explain photosynthesis"), RAG adds latency and cost without improving quality.
- Your data fits in the context window: If your entire knowledge base is under 100K tokens, just include it all in the prompt. No retrieval needed.
- You need real-time data: RAG works with indexed data. If you need live database queries or API calls, use an Agent pattern with tool use instead.
- The task is creative: Writing fiction, brainstorming ideas, or generating creative content does not benefit from document retrieval.
- You need structured data queries: "Show me all orders over $500 from last month" requires SQL, not semantic search. Use text-to-SQL instead of RAG.
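For the context-window case above, a quick heuristic can decide whether to skip RAG entirely. This sketch assumes a rough 4-characters-per-token ratio for English text; actual counts depend on the tokenizer, so use your provider's token-counting utility for exact figures.

```python
# Heuristic: skip retrieval when the whole corpus fits in the context window.
# chars_per_token=4 is a rough English-text approximation, not an exact count.
def fits_in_context(documents: list[str], max_tokens: int = 100_000,
                    chars_per_token: int = 4) -> bool:
    total_chars = sum(len(doc) for doc in documents)
    return total_chars / chars_per_token <= max_tokens
```

If this returns True, concatenating the documents into the prompt is simpler, cheaper to build, and immune to retrieval errors.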
RAG Evaluation Metrics
| Metric | What It Measures | Target |
|---|---|---|
| Retrieval Recall@k | % of relevant docs found in top-k results | > 85% |
| Retrieval Precision@k | % of top-k results that are actually relevant | > 70% |
| Answer Faithfulness | Is the answer supported by retrieved context? | > 90% |
| Answer Relevance | Does the answer address the question asked? | > 85% |
| Context Relevance | Are retrieved chunks relevant to the question? | > 75% |
| Hallucination Rate | % of responses with unsupported claims | < 5% |
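The two retrieval-side metrics in the table reduce to simple set arithmetic once you have a labeled set of relevant document IDs per query. A minimal sketch (function names are my own):

```python
def recall_at_k(retrieved_ids: list, relevant_ids: list, k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def precision_at_k(retrieved_ids: list, relevant_ids: list, k: int) -> float:
    """Fraction of the top-k results that are actually relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    return hits / len(top_k)
```

Averaging these over an evaluation set of queries gives the Recall@k and Precision@k figures to compare against the targets above; the generation-side metrics (faithfulness, relevance, hallucination rate) typically need an LLM or human judge.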
What's Next
In the next lesson, we explore the Agent and ReAct pattern — how to build AI systems that can reason about tasks, use tools, and take multi-step actions autonomously. You will learn the observe-think-act loop, tool calling, multi-agent architectures, and when agents are the right choice versus simpler patterns like prompt chaining.
Lilly Tech Systems