
Advanced Retrieval Strategies

Naive top-K retrieval gets you roughly 70% accuracy. Advanced retrieval strategies — multi-query, HyDE, re-ranking, and parent-child retrieval — can push that past 90%. This lesson covers the techniques that separate demo RAG from production RAG.

Why Naive Retrieval Fails

The basic pattern of "embed the query, find top-K similar chunks" has three fundamental limitations:

  • Query-document mismatch: Users ask questions ("How do I reset my password?") but your chunks contain statements ("To reset your password, navigate to Settings..."). The question embedding and the answer embedding may not be close in vector space.
  • Single perspective: A single query embedding captures only one interpretation. The query "Python performance" could mean interpreter speed, code optimization techniques, or runtime benchmark comparisons.
  • Chunk granularity: Small chunks rank well on relevance but lack context. Large chunks contain context but dilute relevance scores.
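To make these failure modes concrete, here is the naive pattern itself as a minimal sketch. Toy 3-dimensional vectors stand in for real embeddings, and `naive_top_k` is an illustrative name, not a library function:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def naive_top_k(query_vec, chunks, k=2):
    """Rank chunks by cosine similarity to the query embedding; keep top k."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return ranked[:k]

# Hand-picked toy vectors for illustration only
chunks = [
    {"text": "To reset your password, go to Settings.", "vec": [0.9, 0.1, 0.0]},
    {"text": "Billing is handled monthly.",              "vec": [0.0, 0.2, 0.9]},
    {"text": "Password requirements: 12+ characters.",  "vec": [0.8, 0.3, 0.1]},
]
top = naive_top_k([1.0, 0.0, 0.0], chunks, k=2)
# Both password chunks outrank the billing chunk
```

Every strategy below attacks one of the three limitations of this loop: what vector you search with, how many searches you run, or what granularity you search over.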

Strategy 1: Multi-Query Retrieval

Generate multiple variations of the user's query using an LLM, retrieve results for each variation, and merge the results. This captures different perspectives on the same question.

from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Qdrant
from qdrant_client import QdrantClient

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.3)
qdrant_client = QdrantClient(url="http://localhost:6333")  # adjust to your deployment
vectorstore = Qdrant(client=qdrant_client, collection_name="docs",
                     embeddings=OpenAIEmbeddings())

retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    llm=llm
)

# User asks: "How do I handle authentication errors?"
# The LLM generates 3 query variations:
#   1. "What causes authentication failures in the API?"
#   2. "How to debug 401 unauthorized errors"
#   3. "Authentication error handling best practices"
# Results from all 3 queries are merged and deduplicated
docs = retriever.get_relevant_documents("How do I handle authentication errors?")
💡
Apply at work: Multi-query retrieval adds 1 LLM call (~$0.001) and 2 extra vector searches (~1ms each) per query. The accuracy improvement is typically 10–15% for ambiguous queries. Enable it for user-facing systems where question quality varies widely.
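The merge-and-dedupe step can be sketched in plain Python. The round-robin interleaving here is one reasonable merge policy, not necessarily what MultiQueryRetriever does internally:

```python
def merge_and_dedupe(result_lists):
    """Merge ranked result lists from several query variations,
    round-robin by rank, dropping duplicate documents."""
    seen, merged = set(), []
    for rank in range(max(len(r) for r in result_lists)):
        for results in result_lists:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                merged.append(results[rank])
    return merged

merged = merge_and_dedupe([
    ["doc_a", "doc_b"],  # hits for variation 1
    ["doc_b", "doc_c"],  # hits for variation 2
    ["doc_a", "doc_d"],  # hits for variation 3
])
# → ["doc_a", "doc_b", "doc_c", "doc_d"]
```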

Strategy 2: Hypothetical Document Embeddings (HyDE)

Instead of embedding the question, ask the LLM to generate a hypothetical answer, then embed that answer to search for similar real documents. This solves the query-document mismatch problem because you are now searching document space with a document-like vector.

from langchain.chains import HypotheticalDocumentEmbedder
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

llm = ChatOpenAI(model="gpt-4o-mini")
base_embeddings = OpenAIEmbeddings()

hyde_embeddings = HypotheticalDocumentEmbedder.from_llm(
    llm=llm,
    base_embeddings=base_embeddings,
    prompt_key="web_search"  # or custom prompt
)

# User asks: "What is the refund policy for enterprise plans?"
# HyDE generates: "Our enterprise refund policy allows customers to request
#   a full refund within 30 days of purchase. After 30 days, a prorated
#   refund is available..."
# This hypothetical doc is embedded and used for vector search
# The retrieved chunks will be REAL documents about refund policies

vectorstore = Qdrant(client=qdrant_client, collection_name="docs",
                     embeddings=hyde_embeddings)
docs = vectorstore.similarity_search("What is the refund policy for enterprise plans?")
📝
When to use HyDE: HyDE works best when your knowledge base contains declarative content (manuals, policies, documentation) and users ask questions about it. It works poorly when the LLM generates a hypothetical answer that is factually wrong — the search will then find chunks similar to wrong information.
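Stripped of the LangChain wrapper, the HyDE flow is three injected steps. `hyde_search` and its parameters are illustrative names; the lambdas below are stand-ins for a real LLM, embedding model, and vector store:

```python
def hyde_search(question, generate_answer, embed, search):
    """HyDE as a plain pipeline: search document space with a
    document-shaped vector instead of the raw question embedding."""
    hypothetical = generate_answer(question)  # LLM drafts a plausible answer
    vector = embed(hypothetical)              # embed the answer, not the question
    return search(vector)                     # nearest REAL documents win

# Wiring with placeholder components:
docs = hyde_search(
    "What is the refund policy?",
    generate_answer=lambda q: "Refunds are available within 30 days of purchase.",
    embed=lambda text: [float(len(text))],    # placeholder embedding
    search=lambda vec: ["refund_policy.md"],  # placeholder vector search
)
```

The dependency injection makes the risk explicit: if `generate_answer` hallucinates, `search` faithfully retrieves documents similar to the hallucination.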

Strategy 3: Re-Ranking with Cross-Encoders

Bi-encoder embedding search is fast but imprecise. Cross-encoders are slower but much more accurate at judging relevance because they process the query and document together (not independently). The production pattern is: retrieve 20–50 candidates with fast vector search, then re-rank the top candidates with a cross-encoder to get the best 5.

# Step 1: Fast retrieval (bi-encoder) - get 20 candidates
candidates = vectorstore.similarity_search(query, k=20)

# Step 2: Re-rank with cross-encoder - keep top 5
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

# Score each (query, document) pair
pairs = [(query, doc.page_content) for doc in candidates]
scores = reranker.predict(pairs)

# Sort by cross-encoder score and keep top 5
ranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
top_docs = [doc for score, doc in ranked[:5]]

# Using Cohere Rerank API (managed, no GPU needed)
import cohere

co = cohere.Client("your-api-key")

results = co.rerank(
    model="rerank-english-v3.0",
    query="How do I handle authentication errors?",
    documents=[doc.page_content for doc in candidates],
    top_n=5
)

# results.results[0].index = index of the best matching document
# results.results[0].relevance_score = relevance from 0.0 to 1.0
💡
Apply at work: Re-ranking is the single highest-impact retrieval improvement you can make. It typically improves accuracy by 15–25%. Use Cohere Rerank API ($1/1000 queries) if you want zero infrastructure. Self-host cross-encoder/ms-marco-MiniLM-L-12-v2 if you need to avoid external APIs.

Strategy 4: Contextual Compression

Instead of passing entire retrieved chunks to the LLM, extract only the relevant sentences from each chunk. This reduces prompt size and focuses the generator on the most relevant information.

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
compressor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 10})
)

# Retrieves 10 chunks, then extracts only the sentences
# relevant to the query from each chunk
docs = compression_retriever.get_relevant_documents(
    "What are the SLA guarantees for the enterprise tier?"
)
# Each doc now contains only the relevant sentences, not the full chunk
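If an extra LLM call per retrieved chunk is too expensive, a crude keyword-overlap extractor approximates the same idea. This is a hand-rolled, LLM-free stand-in, not LangChain's LLMChainExtractor:

```python
import re

def extract_relevant_sentences(chunk, query, min_overlap=1):
    """Keep only sentences sharing at least `min_overlap` words with the
    query. A cheap extractive stand-in for LLM-based compression."""
    query_terms = set(re.findall(r"\w+", query.lower()))
    kept = []
    for sentence in re.split(r"(?<=[.!?])\s+", chunk):
        terms = set(re.findall(r"\w+", sentence.lower()))
        if len(terms & query_terms) >= min_overlap:
            kept.append(sentence)
    return " ".join(kept)

chunk = ("The enterprise tier includes priority support. "
         "SLA guarantees 99.9% uptime for enterprise customers. "
         "Our office is closed on holidays.")
compressed = extract_relevant_sentences(chunk, "SLA guarantees for the enterprise tier")
# The off-topic holidays sentence is dropped
```

Word overlap misses paraphrases that an LLM extractor would catch, so treat this as a latency/cost fallback rather than a replacement.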

Strategy 5: Parent-Child Document Retrieval

Index small chunks for precise retrieval, but return the larger parent document for full context. This solves the granularity tradeoff: search on small chunks (high precision) but generate from large chunks (full context).

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Small chunks for search (high precision)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)

# Large chunks for context (full information)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)

store = InMemoryStore()  # Use Redis or a database in production

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# At ingestion: splits into 1000-token parents, then 200-token children
# Children are embedded and indexed in the vector store
# Parents are stored in the docstore

retriever.add_documents(documents)

# At query time: searches children, returns parents
docs = retriever.get_relevant_documents("authentication error handling")
# Returns the 1000-token parent chunks that contain the matching 200-token children
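Under the hood the mechanics reduce to a child-to-parent ID map. This sketch uses substring matching as a stand-in for vector search over children; the function names are illustrative:

```python
def build_index(parents, child_size=40):
    """Split each parent into fixed-size children, tagging each child
    with its parent's ID (the role of the vector store + docstore pair)."""
    child_index = []
    for pid, parent in enumerate(parents):
        for start in range(0, len(parent), child_size):
            child_index.append({"parent_id": pid,
                                "text": parent[start:start + child_size]})
    return child_index

def search_children_return_parents(query_term, child_index, parents):
    """Match on small children (precision), return full parents (context)."""
    hit_ids = {c["parent_id"] for c in child_index if query_term in c["text"]}
    return [parents[pid] for pid in sorted(hit_ids)]

parents = [
    "Authentication errors occur when access tokens expire. Refresh the token to recover.",
    "Invoices are issued monthly and can be downloaded from the billing page.",
]
child_index = build_index(parents, child_size=40)
docs = search_children_return_parents("tokens", child_index, parents)
# docs holds the full first parent, not just the 40-char child that matched
```

Note the deduplication: if several children of one parent match, the parent is still returned once.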

Combining Strategies: The Production Retrieval Stack

In production, you combine multiple strategies into a retrieval pipeline. Here is the recommended stack, ordered by impact:

| Layer | Strategy | Impact | Added Cost | Added Latency |
|-------|----------|--------|------------|---------------|
| 1 | Hybrid search (vector + BM25) | +10–15% recall | $0 | ~2ms |
| 2 | Re-ranking (cross-encoder) | +15–25% precision | $0.001/query | ~50ms |
| 3 | Multi-query | +10–15% for ambiguous queries | $0.001/query | ~200ms |
| 4 | Parent-child retrieval | Better context in generation | $0 (storage cost only) | ~5ms |
| 5 | Contextual compression | −30% prompt tokens | $0.001/query | ~300ms |
💡
Apply at work: Start with layers 1 and 2 (hybrid search + re-ranking). These give you the biggest accuracy improvement with minimal cost and latency. Add multi-query and parent-child retrieval when your evaluation metrics plateau.
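Layer 1 (hybrid search) is the one strategy not shown above. A common way to fuse vector and BM25 rankings is reciprocal rank fusion (RRF); this sketch assumes each retriever returns an ordered list of document IDs:

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse ranked lists: each doc scores sum(1 / (k + rank)) across lists.
    k=60 is the conventional constant from the original RRF paper."""
    scores = {}
    for results in result_lists:
        for rank, doc in enumerate(results):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

fused = reciprocal_rank_fusion([
    ["doc_a", "doc_b", "doc_c"],  # vector search ranking
    ["doc_b", "doc_d", "doc_a"],  # BM25 ranking
])
# doc_b edges out doc_a: both appear in both lists, but doc_b ranks higher on average
```

Because RRF uses only rank positions, it sidesteps the problem that cosine scores and BM25 scores live on incompatible scales.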

Key Takeaways

  • Naive top-K retrieval has three failure modes: query-document mismatch, single perspective, and chunk granularity tradeoff.
  • Multi-query retrieval captures multiple interpretations of ambiguous questions.
  • HyDE bridges the gap between question embeddings and document embeddings.
  • Re-ranking with cross-encoders is the single highest-impact improvement for retrieval precision.
  • Parent-child retrieval lets you search on small chunks but generate from large context.
  • In production, stack hybrid search + re-ranking first, then add other strategies based on evaluation data.