RAG Pattern — Retrieval-Augmented Generation
The most widely deployed AI design pattern in production. Learn how to ground LLM responses in your own data using retrieval, chunking, reranking, and query transformation — from naive RAG to advanced self-RAG architectures.
The Problem: LLM Knowledge Limitations
Large language models have a fundamental limitation: they only know what was in their training data. This creates three critical problems for production applications:
- Knowledge cutoff: The model does not know about events, documents, or data created after its training date. Ask GPT-4 about something that happened last week and it cannot answer.
- No access to private data: The model has never seen your company's internal documentation, customer records, product specs, or proprietary knowledge base.
- Hallucination: When the model lacks knowledge, it often generates plausible-sounding but completely fabricated information instead of saying "I do not know."
The Solution: Retrieve, Then Generate
The RAG pattern solves this by adding a retrieval step before generation. Instead of relying solely on the model's internal knowledge, RAG retrieves relevant documents from an external knowledge base and includes them in the prompt context.
RAG Architecture Diagram
RAG ARCHITECTURE
================
User Query: "What is our refund policy for enterprise plans?"
|
v
+------------------+
| 1. EMBED QUERY | Convert query to vector using embedding model
+------------------+
|
v
+------------------+
| 2. RETRIEVE | Search vector DB for similar document chunks
| Vector Database | Return top-k most relevant chunks
+------------------+
|
v
+------------------+
| 3. RERANK | (Optional) Re-score results with cross-encoder
| Cross-Encoder | Improve precision of top results
+------------------+
|
v
+------------------+
| 4. AUGMENT | Insert retrieved chunks into LLM prompt
| Prompt Builder | "Based on these documents: {chunks}..."
+------------------+
|
v
+------------------+
| 5. GENERATE | LLM generates answer grounded in retrieved docs
| LLM (Claude/GPT) | Includes citations to source documents
+------------------+
|
v
Answer: "Enterprise plans have a 30-day refund policy..."
[Source: refund-policy.md, section 4.2]
Three Generations of RAG
RAG has evolved significantly since its introduction. Understanding the three generations helps you choose the right complexity level for your use case.
Naive RAG (Generation 1)
The simplest form: embed, retrieve, generate. No preprocessing of the query, no reranking, and fixed chunking.
# Naive RAG - The simplest implementation
import anthropic
import chromadb

client = anthropic.Anthropic()
chroma = chromadb.PersistentClient(path="./chroma_db")
collection = chroma.get_or_create_collection("documents")

def naive_rag(query: str) -> str:
    # Step 1: Retrieve (ChromaDB handles embedding internally)
    results = collection.query(
        query_texts=[query],
        n_results=5
    )

    # Step 2: Build context from retrieved chunks
    context = "\n\n---\n\n".join(results["documents"][0])

    # Step 3: Generate with context
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"""Answer the question based on the following context.
If the context does not contain enough information, say so.

Context:
{context}

Question: {query}"""
        }]
    )
    return response.content[0].text
Advanced RAG (Generation 2)
Adds pre-retrieval processing (query transformation), improved retrieval (hybrid search), and post-retrieval processing (reranking, compression).
# Advanced RAG - With query transformation, hybrid search, and reranking
import anthropic
import chromadb
from sentence_transformers import CrossEncoder

client = anthropic.Anthropic()
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
chroma = chromadb.PersistentClient(path="./chroma_db")
collection = chroma.get_or_create_collection("documents")

def transform_query(original_query: str) -> list[str]:
    """Generate multiple search queries from the original question."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": f"""Generate 3 different search queries that would help
answer this question. Return only the queries, one per line.

Question: {original_query}"""
        }]
    )
    queries = response.content[0].text.strip().split("\n")
    return [original_query] + queries  # Include original

def hybrid_retrieve(queries: list[str], k: int = 20) -> list[dict]:
    """Retrieve from multiple queries and deduplicate."""
    all_results = {}
    for query in queries:
        results = collection.query(
            query_texts=[query],
            n_results=k,
            include=["documents", "metadatas", "distances"]
        )
        for i, doc_id in enumerate(results["ids"][0]):
            if doc_id not in all_results:
                all_results[doc_id] = {
                    "id": doc_id,
                    "text": results["documents"][0][i],
                    "metadata": results["metadatas"][0][i],
                    "best_distance": results["distances"][0][i]
                }
            else:
                # Keep the best distance score
                existing = all_results[doc_id]["best_distance"]
                new_dist = results["distances"][0][i]
                if new_dist < existing:
                    all_results[doc_id]["best_distance"] = new_dist
    return list(all_results.values())

def rerank(query: str, candidates: list[dict], top_k: int = 5) -> list[dict]:
    """Re-score candidates with cross-encoder for better precision."""
    pairs = [(query, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)
    for i, candidate in enumerate(candidates):
        candidate["rerank_score"] = float(scores[i])
    candidates.sort(key=lambda x: x["rerank_score"], reverse=True)
    return candidates[:top_k]

def advanced_rag(query: str) -> str:
    # Step 1: Query transformation
    queries = transform_query(query)

    # Step 2: Hybrid retrieval (multiple queries)
    candidates = hybrid_retrieve(queries, k=20)

    # Step 3: Reranking
    top_chunks = rerank(query, candidates, top_k=5)

    # Step 4: Generate with high-quality context
    context = "\n\n---\n\n".join([
        f"[Source: {c['metadata'].get('source', 'unknown')}]\n{c['text']}"
        for c in top_chunks
    ])
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"""Answer the question using ONLY the provided context.
Cite your sources using [Source: filename] format.
If the context does not contain the answer, say "I don't have
enough information to answer this question."

Context:
{context}

Question: {query}"""
        }]
    )
    return response.content[0].text
Modular RAG (Generation 3)
Treats each RAG component as a pluggable module. The pipeline dynamically selects which modules to activate based on the query type.
# Modular RAG - Pluggable components with adaptive pipeline
class RAGModule:
    """Base class for all RAG modules."""
    def process(self, state: dict) -> dict:
        raise NotImplementedError

class QueryClassifier(RAGModule):
    """Classify query to determine which pipeline to use."""
    def process(self, state: dict) -> dict:
        query = state["query"]
        # Use LLM to classify query type
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=50,
            messages=[{
                "role": "user",
                "content": f"""Classify this query into one category:
- FACTUAL: needs specific facts from documents
- ANALYTICAL: needs synthesis across multiple documents
- CONVERSATIONAL: general chat, no retrieval needed

Query: {query}
Category:"""
            }]
        )
        state["query_type"] = response.content[0].text.strip()
        return state

class AdaptiveRetriever(RAGModule):
    """Choose retrieval strategy based on query type."""
    def process(self, state: dict) -> dict:
        if state["query_type"] == "CONVERSATIONAL":
            state["chunks"] = []  # No retrieval needed
        elif state["query_type"] == "ANALYTICAL":
            # Broader retrieval for analytical queries
            state["chunks"] = hybrid_retrieve(
                transform_query(state["query"]), k=30
            )
        else:
            # Focused retrieval for factual queries
            state["chunks"] = hybrid_retrieve(
                [state["query"]], k=10
            )
        return state

class SelfReflection(RAGModule):
    """Check if retrieved context is sufficient to answer."""
    def process(self, state: dict) -> dict:
        if not state["chunks"]:
            state["needs_retrieval"] = False
            return state
        context = "\n".join([c["text"][:200] for c in state["chunks"][:5]])
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=50,
            messages=[{
                "role": "user",
                "content": f"""Can this context answer the question?

Context (preview): {context}
Question: {state['query']}
Answer YES or NO:"""
            }]
        )
        state["context_sufficient"] = "YES" in response.content[0].text
        return state

class ModularRAGPipeline:
    """Compose modules into an adaptive pipeline."""
    def __init__(self):
        self.classifier = QueryClassifier()
        self.retriever = AdaptiveRetriever()
        self.reflection = SelfReflection()

    def run(self, query: str) -> str:
        state = {"query": query}
        state = self.classifier.process(state)
        state = self.retriever.process(state)
        state = self.reflection.process(state)
        if not state.get("context_sufficient", True) and state["chunks"]:
            # Retry with expanded retrieval
            state["query_type"] = "ANALYTICAL"
            state = self.retriever.process(state)
        return self._generate(state)

    def _generate(self, state: dict) -> str:
        # ... generate response from final state
        pass
Chunking Strategies
How you split documents into chunks has an enormous impact on RAG quality. The wrong chunking strategy is the most common cause of poor RAG performance.
| Strategy | How It Works | Best For | Chunk Size |
|---|---|---|---|
| Fixed-size | Split every N characters/tokens with overlap | Unstructured text, quick prototyping | 500-1000 tokens |
| Sentence-based | Split on sentence boundaries | Well-structured prose, articles | 3-5 sentences |
| Semantic | Use embedding similarity to find topic boundaries | Long documents with topic shifts | Variable |
| Recursive | Split by headers, then paragraphs, then sentences | Structured docs (markdown, HTML) | Variable |
| Document-based | Each document is one chunk | Short documents (emails, tickets) | Entire doc |
| Parent-child | Index small chunks, retrieve parent sections | Precision retrieval with full context | Small index, large retrieval |
# Chunking implementation examples
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    TokenTextSplitter,
)

# Fixed-size chunking with overlap
fixed_splitter = TokenTextSplitter(
    chunk_size=500,
    chunk_overlap=50  # 10% overlap prevents cutting mid-thought
)

# Recursive chunking (best general-purpose choice)
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],  # Paragraphs first, then lines, then sentences
    length_function=len
)

# Parent-child chunking strategy
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

def parent_child_chunk(document: str):
    """Index small chunks, but retrieve their parent sections."""
    parent_chunks = parent_splitter.split_text(document)
    index_entries = []
    for parent_idx, parent in enumerate(parent_chunks):
        children = child_splitter.split_text(parent)
        for child in children:
            index_entries.append({
                "text": child,          # Indexed for search
                "parent_text": parent,  # Returned on retrieval
                "parent_id": parent_idx
            })
    return index_entries
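The table above also lists semantic chunking, which the snippets do not cover. Below is a minimal sketch: it splits on sentence boundaries, then starts a new chunk wherever the similarity between adjacent sentence embeddings drops below a threshold. Here `embed_fn` is a placeholder for any callable mapping a list of sentences to a NumPy array (e.g. `SentenceTransformer.encode`), and the 0.75 threshold is an illustrative default, not a tuned value.

```python
import re
import numpy as np

def semantic_chunk(text: str, embed_fn, threshold: float = 0.75) -> list[str]:
    """Group consecutive sentences; break where cosine similarity between
    adjacent sentences falls below `threshold` (a likely topic boundary)."""
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    if not sentences:
        return []
    embeddings = np.asarray(embed_fn(sentences), dtype=float)
    # Normalize rows so the dot product below equals cosine similarity
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    embeddings = embeddings / np.clip(norms, 1e-10, None)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = float(np.dot(embeddings[i - 1], embeddings[i]))
        if similarity < threshold:
            chunks.append(" ".join(current))  # Topic shift: close the chunk
            current = [sentences[i]]
        else:
            current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```

In practice you would also cap chunk length, since a long run of similar sentences can otherwise produce an oversized chunk.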
Retrieval Strategies
How you search for relevant chunks is the second most important factor in RAG quality (after chunking).
Dense Retrieval (Vector Search)
Convert query and documents to vectors, find nearest neighbors. Best for semantic similarity.
# Dense retrieval with sentence-transformers
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

def dense_retrieve(query: str, corpus_embeddings, corpus_texts, top_k=5):
    # Normalize so the dot product below equals cosine similarity
    # (corpus_embeddings must be normalized the same way at index time)
    query_embedding = model.encode([query], normalize_embeddings=True)
    similarities = np.dot(corpus_embeddings, query_embedding.T).flatten()
    top_indices = np.argsort(similarities)[-top_k:][::-1]
    return [(corpus_texts[i], similarities[i]) for i in top_indices]
Sparse Retrieval (BM25)
Keyword-based scoring using term frequency. Best for exact term matching.
# Sparse retrieval with BM25
from rank_bm25 import BM25Okapi
import numpy as np
import re

def tokenize(text):
    return re.findall(r'\w+', text.lower())

corpus_tokens = [tokenize(doc) for doc in corpus_texts]
bm25 = BM25Okapi(corpus_tokens)

def sparse_retrieve(query: str, top_k=5):
    query_tokens = tokenize(query)
    scores = bm25.get_scores(query_tokens)
    top_indices = np.argsort(scores)[-top_k:][::-1]
    return [(corpus_texts[i], scores[i]) for i in top_indices]
Hybrid Retrieval (Dense + Sparse)
Combine both approaches using Reciprocal Rank Fusion (RRF) for the best of both worlds.
# Hybrid retrieval with Reciprocal Rank Fusion
def reciprocal_rank_fusion(
    dense_results: list,
    sparse_results: list,
    k: int = 60
) -> list:
    """Combine dense and sparse results using RRF."""
    scores = {}
    for rank, (text, _) in enumerate(dense_results):
        scores[text] = scores.get(text, 0) + 1.0 / (k + rank + 1)
    for rank, (text, _) in enumerate(sparse_results):
        scores[text] = scores.get(text, 0) + 1.0 / (k + rank + 1)
    sorted_results = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return sorted_results

# Usage
dense = dense_retrieve(query, embeddings, texts, top_k=20)
sparse = sparse_retrieve(query, top_k=20)
hybrid = reciprocal_rank_fusion(dense, sparse)
final_chunks = [text for text, score in hybrid[:5]]
Query Transformation Techniques
The user's raw query is often not the best search query. These techniques transform it for better retrieval.
HyDE (Hypothetical Document Embedding)
# HyDE: Generate a hypothetical answer, then search with THAT
def hyde_retrieve(query: str, collection, top_k=5):
    # Step 1: Generate hypothetical answer
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"Write a short paragraph that would answer: {query}"
        }]
    )
    hypothetical_doc = response.content[0].text

    # Step 2: Search using the hypothetical document as the query
    # The hypothetical doc is closer in embedding space to real docs
    results = collection.query(
        query_texts=[hypothetical_doc],
        n_results=top_k
    )
    return results
Step-Back Prompting
# Step-back: Ask a broader question first for better context
def step_back_retrieve(query: str, collection, top_k=5):
    # Generate a more general version of the question
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": f"""Given this specific question, generate a broader
"step-back" question that would help provide context.

Specific: {query}
Step-back question:"""
        }]
    )
    broad_query = response.content[0].text.strip()

    # Retrieve for both queries
    specific_results = collection.query(query_texts=[query], n_results=top_k)
    broad_results = collection.query(query_texts=[broad_query], n_results=top_k)

    # Combine and deduplicate
    return merge_results(specific_results, broad_results)
Query Decomposition
# Decompose complex queries into sub-questions
def decompose_and_retrieve(query: str, collection, top_k=3):
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": f"""Break this complex question into 2-4 simpler
sub-questions that can be answered independently.
Return one question per line.

Complex question: {query}"""
        }]
    )
    sub_questions = response.content[0].text.strip().split("\n")

    all_chunks = []
    for sub_q in sub_questions:
        results = collection.query(query_texts=[sub_q.strip()], n_results=top_k)
        all_chunks.extend(zip(results["documents"][0], results["ids"][0]))

    # Deduplicate by ID
    seen = set()
    unique_chunks = []
    for text, doc_id in all_chunks:
        if doc_id not in seen:
            seen.add(doc_id)
            unique_chunks.append(text)
    return unique_chunks
Self-RAG: Adaptive Retrieval
Self-RAG lets the model decide whether retrieval is even necessary, and critically evaluates the relevance and support of retrieved documents.
# Self-RAG: The model decides when and how to retrieve
def self_rag(query: str, collection) -> str:
    # Step 1: Does this query need retrieval?
    need_retrieval = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=10,
        messages=[{
            "role": "user",
            "content": f"""Does answering this question require looking up
specific factual information? Answer YES or NO.

Question: {query}"""
        }]
    )
    if "NO" in need_retrieval.content[0].text:
        # Answer directly without retrieval
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            messages=[{"role": "user", "content": query}]
        )
        return response.content[0].text

    # Step 2: Retrieve
    results = collection.query(query_texts=[query], n_results=5)
    chunks = results["documents"][0]

    # Step 3: Evaluate relevance of each chunk
    relevant_chunks = []
    for chunk in chunks:
        relevance = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=10,
            messages=[{
                "role": "user",
                "content": f"""Is this document relevant to the question?

Document: {chunk[:500]}
Question: {query}
Answer RELEVANT or IRRELEVANT:"""
            }]
        )
        # Check IRRELEVANT first: "RELEVANT" is a substring of "IRRELEVANT",
        # so a plain `"RELEVANT" in text` check would accept every chunk
        verdict = relevance.content[0].text
        if "IRRELEVANT" not in verdict and "RELEVANT" in verdict:
            relevant_chunks.append(chunk)

    if not relevant_chunks:
        return "I could not find relevant information to answer this question."

    # Step 4: Generate with validated context
    context = "\n\n---\n\n".join(relevant_chunks)
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"""Answer based on the verified context below.

Context:\n{context}\n\nQuestion: {query}"""
        }]
    )
    return response.content[0].text
Complete RAG Pipeline with LangChain
# Production-ready RAG with LangChain + Claude + ChromaDB
from langchain_anthropic import ChatAnthropic
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_community.document_loaders import DirectoryLoader

# 1. Load documents
loader = DirectoryLoader("./docs/", glob="**/*.md")
documents = loader.load()
print(f"Loaded {len(documents)} documents")

# 2. Chunk documents
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n## ", "\n### ", "\n\n", "\n", ". ", " "]
)
chunks = splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks")

# 3. Create vector store
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5")
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

# 4. Create retriever with search kwargs
retriever = vectorstore.as_retriever(
    search_type="mmr",  # Maximal Marginal Relevance for diversity
    search_kwargs={"k": 5, "fetch_k": 20}
)

# 5. Define prompt template
prompt = PromptTemplate(
    template="""You are a helpful assistant. Answer the question based ONLY
on the following context. If the context does not contain the answer,
say "I don't have enough information to answer this."
Always cite your sources.

Context:
{context}

Question: {question}

Answer:""",
    input_variables=["context", "question"]
)

# 6. Create chain
llm = ChatAnthropic(model="claude-sonnet-4-20250514", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True
)

# 7. Query
result = qa_chain.invoke({"query": "What is our refund policy?"})
print(result["result"])
for doc in result["source_documents"]:
    print(f"  Source: {doc.metadata['source']}")
RAG Anti-Patterns (Common Mistakes)
Avoid these common RAG mistakes:
- Chunks too large (2000+ tokens): Large chunks dilute relevance. The retrieved chunk contains your answer but also 80% irrelevant text that confuses the LLM.
- Chunks too small (50-100 tokens): Small chunks lose context. A sentence fragment about "the refund policy" is useless without knowing which product it applies to.
- No overlap between chunks: Sentences that span chunk boundaries get cut in half. Always use 10-20% overlap.
- Skipping reranking: Vector search is approximate. Reranking with a cross-encoder improves top-5 precision by 15-30% with minimal latency cost.
- Ignoring metadata: Storing only text without source, date, author, or section info makes it impossible to filter, cite, or debug.
- Not testing retrieval separately: Measure retrieval quality (recall@k, precision@k) independently from generation quality. Poor retrieval cannot be fixed by a better LLM.
- Stuffing too many chunks: Including 20 chunks in the prompt fills the context window with noise. 3-5 high-quality chunks outperform 20 mediocre ones.
When NOT to Use RAG
RAG is not always the right pattern. Do not use RAG when:
- The LLM already knows the answer: For general knowledge questions (e.g., "explain photosynthesis"), RAG adds latency and cost without improving quality.
- Your data fits in the context window: If your entire knowledge base is under 100K tokens, just include it all in the prompt. No retrieval needed.
- You need real-time data: RAG works with indexed data. If you need live database queries or API calls, use an Agent pattern with tool use instead.
- The task is creative: Writing fiction, brainstorming ideas, or generating creative content does not benefit from document retrieval.
- You need structured data queries: "Show me all orders over $500 from last month" requires SQL, not semantic search. Use text-to-SQL instead of RAG.
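For the context-window case above, a quick heuristic can decide whether to skip RAG entirely. This sketch assumes a rough 4-characters-per-token ratio for English text; actual counts depend on the tokenizer, so use your provider's token-counting utility for exact figures.

```python
# Heuristic: skip retrieval when the whole corpus fits in the context window.
# chars_per_token=4 is a rough English-text approximation, not an exact count.
def fits_in_context(documents: list[str], max_tokens: int = 100_000,
                    chars_per_token: int = 4) -> bool:
    total_chars = sum(len(doc) for doc in documents)
    return total_chars / chars_per_token <= max_tokens
```

If this returns True, concatenating the documents into the prompt is simpler, cheaper to build, and immune to retrieval errors.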
RAG Evaluation Metrics
| Metric | What It Measures | Target |
|---|---|---|
| Retrieval Recall@k | % of relevant docs found in top-k results | > 85% |
| Retrieval Precision@k | % of top-k results that are actually relevant | > 70% |
| Answer Faithfulness | Is the answer supported by retrieved context? | > 90% |
| Answer Relevance | Does the answer address the question asked? | > 85% |
| Context Relevance | Are retrieved chunks relevant to the question? | > 75% |
| Hallucination Rate | % of responses with unsupported claims | < 5% |
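The two retrieval-side metrics in the table reduce to simple set arithmetic once you have a labeled set of relevant document IDs per query. A minimal sketch (function names are my own):

```python
def recall_at_k(retrieved_ids: list, relevant_ids: list, k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def precision_at_k(retrieved_ids: list, relevant_ids: list, k: int) -> float:
    """Fraction of the top-k results that are actually relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    return hits / len(top_k)
```

Averaging these over an evaluation set of queries gives the Recall@k and Precision@k figures to compare against the targets above; the generation-side metrics (faithfulness, relevance, hallucination rate) typically need an LLM or human judge.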
What's Next
In the next lesson, we explore the Agent and ReAct pattern — how to build AI systems that can reason about tasks, use tools, and take multi-step actions autonomously. You will learn the observe-think-act loop, tool calling, multi-agent architectures, and when agents are the right choice versus simpler patterns like prompt chaining.
Lilly Tech Systems