# Advanced Retrieval Strategies
Naive top-K retrieval gets you 70% accuracy. Advanced retrieval strategies — multi-query, HyDE, re-ranking, and parent-child retrieval — push that to 90%+. This lesson covers the techniques that separate demo RAG from production RAG.
## Why Naive Retrieval Fails
The basic pattern of "embed the query, find top-K similar chunks" has three fundamental limitations:
- Query-document mismatch: Users ask questions ("How do I reset my password?") but your chunks contain statements ("To reset your password, navigate to Settings..."). The question embedding and the answer embedding may not be close in vector space.
- Single perspective: A single query embedding captures only one interpretation. The query "Python performance" could mean language speed, code optimization, or the Python runtime benchmarks.
- Chunk granularity: Small chunks rank well on relevance but lack context. Large chunks contain context but dilute relevance scores.
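To make the baseline concrete, here is a minimal sketch of naive top-K retrieval. The vectors and documents are illustrative stand-ins, not real embedding-model output:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=2):
    """Naive retrieval: rank every chunk by similarity to the query."""
    scored = sorted(enumerate(chunk_vecs),
                    key=lambda pair: cosine(query_vec, pair[1]),
                    reverse=True)
    return [i for i, _ in scored[:k]]

# Toy 3-d "embeddings" standing in for real model output
query = [1.0, 0.2, 0.0]
chunks = [
    [0.9, 0.1, 0.0],  # points the same way as the query
    [0.5, 0.5, 0.5],  # partially related
    [0.0, 0.1, 1.0],  # nearly orthogonal (unrelated)
]
print(top_k(query, chunks, k=2))  # → [0, 1]
```

Every failure mode above happens inside that one similarity call: if the question vector and the answer vector don't point the same way, no value of K will surface the right chunk.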
## Strategy 1: Multi-Query Retrieval
Generate multiple variations of the user's query using an LLM, retrieve results for each variation, and merge the results. This captures different perspectives on the same question.
```python
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Qdrant
from qdrant_client import QdrantClient

qdrant_client = QdrantClient(url="http://localhost:6333")

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.3)
vectorstore = Qdrant(
    client=qdrant_client,
    collection_name="docs",
    embeddings=OpenAIEmbeddings(),
)

retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    llm=llm,
)

# User asks: "How do I handle authentication errors?"
# The LLM generates 3 query variations:
#   1. "What causes authentication failures in the API?"
#   2. "How to debug 401 unauthorized errors"
#   3. "Authentication error handling best practices"
# Results from all 3 queries are merged and deduplicated
docs = retriever.get_relevant_documents("How do I handle authentication errors?")
```
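Under the hood, the merge step is essentially a unique union over the per-variation result lists, keeping the first occurrence of each document. A minimal sketch, with doc IDs standing in for retrieved Document objects:

```python
def merge_unique(result_lists):
    """Merge ranked result lists from each query variation,
    keeping the first occurrence of each document."""
    seen, merged = set(), []
    for results in result_lists:
        for doc in results:
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

# Hypothetical results from three query variations (doc IDs for brevity)
variation_results = [
    ["doc_a", "doc_b"],   # "What causes authentication failures in the API?"
    ["doc_b", "doc_c"],   # "How to debug 401 unauthorized errors"
    ["doc_a", "doc_d"],   # "Authentication error handling best practices"
]
print(merge_unique(variation_results))  # → ['doc_a', 'doc_b', 'doc_c', 'doc_d']
```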
## Strategy 2: Hypothetical Document Embeddings (HyDE)
Instead of embedding the question, ask the LLM to generate a hypothetical answer, then embed that answer to search for similar real documents. This solves the query-document mismatch problem because you are now searching document space with a document-like vector.
```python
from langchain.chains import HypotheticalDocumentEmbedder
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Qdrant

llm = ChatOpenAI(model="gpt-4o-mini")
base_embeddings = OpenAIEmbeddings()
hyde_embeddings = HypotheticalDocumentEmbedder.from_llm(
    llm=llm,
    base_embeddings=base_embeddings,
    prompt_key="web_search",  # or pass a custom prompt
)

# User asks: "What is the refund policy for enterprise plans?"
# HyDE generates: "Our enterprise refund policy allows customers to request
# a full refund within 30 days of purchase. After 30 days, a prorated
# refund is available..."
# This hypothetical doc is embedded and used for vector search
# The retrieved chunks will be REAL documents about refund policies
vectorstore = Qdrant(
    client=qdrant_client,  # same Qdrant client as above
    collection_name="docs",
    embeddings=hyde_embeddings,
)
docs = vectorstore.similarity_search("What is the refund policy for enterprise plans?")
```
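The mechanics are easy to replicate without a framework. Here is a toy end-to-end sketch of the HyDE flow, where generate_answer and embed are hypothetical stand-ins for a real LLM call and embedding model:

```python
def generate_answer(question):
    # Stand-in for an LLM call that drafts a plausible (possibly wrong) answer
    return "Enterprise customers may request a full refund within 30 days."

def embed(text):
    # Stand-in embedding: term counts over a tiny fixed vocabulary
    vocab = ["refund", "enterprise", "days", "password", "settings"]
    words = [w.strip(".,?:").lower() for w in text.split()]
    return [sum(w.startswith(term) for w in words) for term in vocab]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

docs = [
    "Refunds: enterprise plans get a full refund within 30 days.",
    "To reset your password, navigate to Settings.",
]

question = "What is the refund policy for enterprise plans?"
# Embed the hypothetical ANSWER, not the question, then search document space
hyde_vec = embed(generate_answer(question))
best = max(range(len(docs)), key=lambda i: dot(hyde_vec, embed(docs[i])))
print(docs[best])  # → the refund-policy document
```

The key line is that hyde_vec comes from the generated answer: the search compares document-like text against real documents, which is exactly what bridges the question/statement mismatch.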
## Strategy 3: Re-Ranking with Cross-Encoders
Bi-encoder embedding search is fast but imprecise. Cross-encoders are slower but much more accurate at judging relevance because they process the query and document together (not independently). The production pattern is: retrieve 20–50 candidates with fast vector search, then re-rank the top candidates with a cross-encoder to get the best 5.
```python
# Step 1: Fast retrieval (bi-encoder) - get 20 candidates
candidates = vectorstore.similarity_search(query, k=20)

# Step 2: Re-rank with a cross-encoder - keep the top 5
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

# Score each (query, document) pair jointly
pairs = [(query, doc.page_content) for doc in candidates]
scores = reranker.predict(pairs)

# Sort by cross-encoder score and keep the top 5
ranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
top_docs = [doc for score, doc in ranked[:5]]
```

Alternatively, the Cohere Rerank API gives you managed re-ranking with no GPU to host:

```python
import cohere

co = cohere.Client("your-api-key")
results = co.rerank(
    model="rerank-english-v3.0",
    query="How do I handle authentication errors?",
    documents=[doc.page_content for doc in candidates],
    top_n=5,
)
# results.results[0].index → index of the best matching candidate
# results.results[0].relevance_score → relevance from 0.0 to 1.0
```
Use a local model such as cross-encoder/ms-marco-MiniLM-L-12-v2 if you need to avoid external APIs.
## Strategy 4: Contextual Compression
Instead of passing entire retrieved chunks to the LLM, extract only the relevant sentences from each chunk. This reduces prompt size and focuses the generator on the most relevant information.
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 10}),
)

# Retrieves 10 chunks, then extracts only the sentences
# relevant to the query from each chunk
docs = compression_retriever.get_relevant_documents(
    "What are the SLA guarantees for the enterprise tier?"
)
# Each doc now contains only the relevant sentences, not the full chunk
```
## Strategy 5: Parent-Child Document Retrieval
Index small chunks for precise retrieval, but return the larger parent document for full context. This solves the granularity tradeoff: search on small chunks (high precision) but generate from large chunks (full context).
```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Small chunks for search (high precision)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)
# Large chunks for context (full information)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)

store = InMemoryStore()  # Use Redis or a database in production

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# At ingestion: splits into 1000-character parents, then 200-character children
# Children are embedded and indexed in the vector store
# Parents are stored in the docstore
retriever.add_documents(documents)

# At query time: searches children, returns parents
docs = retriever.get_relevant_documents("authentication error handling")
# Returns the 1000-character parent chunks containing the matching 200-character children
```
## Combining Strategies: The Production Retrieval Stack
In production, you combine multiple strategies into a retrieval pipeline. Here is the recommended stack, ordered by impact:
| Layer | Strategy | Impact | Added Cost | Added Latency |
|---|---|---|---|---|
| 1 | Hybrid search (vector + BM25) | +10–15% recall | $0 | ~2ms |
| 2 | Re-ranking (cross-encoder) | +15–25% precision | $0.001/query | ~50ms |
| 3 | Multi-query | +10–15% for ambiguous queries | $0.001/query | ~200ms |
| 4 | Parent-child retrieval | Better context in generation | $0 (storage cost only) | ~5ms |
| 5 | Contextual compression | -30% prompt tokens | $0.001/query | ~300ms |
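Layer 1, hybrid search, has no section of its own above. A common way to merge the vector and BM25 result lists is reciprocal rank fusion (RRF), which scores each document by the sum of 1 / (k + rank) across the lists it appears in. A minimal sketch, with doc IDs standing in for real hits:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists (e.g. vector search + BM25).
    score(d) = sum over lists of 1 / (k + rank of d in that list)."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]  # ranked by embedding similarity
bm25_hits = ["doc_b", "doc_d", "doc_a"]    # ranked by keyword match
print(reciprocal_rank_fusion([vector_hits, bm25_hits]))
# → ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

doc_b wins because it ranks high in both lists; documents that appear in only one list are pushed down, which is the behavior that buys the recall gain in the table.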
## Key Takeaways
- Naive top-K retrieval has three failure modes: query-document mismatch, single perspective, and chunk granularity tradeoff.
- Multi-query retrieval captures multiple interpretations of ambiguous questions.
- HyDE bridges the gap between question embeddings and document embeddings.
- Re-ranking with cross-encoders is the single highest-impact improvement for retrieval precision.
- Parent-child retrieval lets you search on small chunks but generate from large context.
- In production, stack hybrid search + re-ranking first, then add other strategies based on evaluation data.
Lilly Tech Systems