Retrieval & Reranking
Advanced retrieval strategies that go beyond simple similarity search to dramatically improve RAG quality.
Retrieval Strategies
1. Basic Similarity Search
```python
# Simple top-k similarity search
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5}
)
docs = retriever.invoke("How do I deploy to production?")
```
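Under the hood, top-k similarity search just scores the query embedding against every document embedding and keeps the k highest. A minimal sketch with toy 2-D vectors (real vector stores use approximate nearest-neighbor indexes rather than a full scan; the vectors here are made up for illustration):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, doc_vecs, k):
    """Indices of the k document vectors most similar to the query."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Toy 2-D "embeddings": docs 0 and 1 point roughly the same way, doc 2 doesn't
doc_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(top_k([1.0, 0.05], doc_vecs, k=2))  # → [0, 1]
```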
2. MMR (Maximal Marginal Relevance)
MMR balances relevance with diversity. It prevents returning multiple chunks that say the same thing:
```python
# MMR: diverse results that are still relevant
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={
        "k": 5,
        "fetch_k": 20,      # Fetch 20 candidates
        "lambda_mult": 0.7  # 0 = max diversity, 1 = max relevance
    }
)
```
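The MMR selection itself is a greedy loop: each step picks the candidate that maximizes `lambda * relevance − (1 − lambda) * redundancy`, where redundancy is the candidate's similarity to anything already selected. A minimal sketch with made-up similarity scores, showing how `lambda_mult` trades a near-duplicate for a more diverse result:

```python
def mmr_select(query_sim, doc_sims, k, lambda_mult):
    """Greedy MMR over precomputed similarities.

    query_sim: doc-to-query similarity per document
    doc_sims:  doc_sims[i][j] = similarity between documents i and j
    """
    selected, candidates = [], list(range(len(query_sim)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            redundancy = max((doc_sims[i][j] for j in selected), default=0.0)
            return lambda_mult * query_sim[i] - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Doc 1 is nearly a duplicate of doc 0; doc 2 is less relevant but different
query_sim = [0.95, 0.94, 0.70]
doc_sims = [[1.0, 0.98, 0.2],
            [0.98, 1.0, 0.2],
            [0.2, 0.2, 1.0]]

print(mmr_select(query_sim, doc_sims, k=2, lambda_mult=0.7))  # → [0, 2]
print(mmr_select(query_sim, doc_sims, k=2, lambda_mult=1.0))  # → [0, 1]
```

With `lambda_mult=1.0` (pure relevance) the near-duplicate doc 1 wins; at `0.7` its redundancy penalty lets the diverse doc 2 in.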
3. Multi-Query Retrieval
Generate multiple variations of the user's query to retrieve a broader set of relevant documents:
```python
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_anthropic import ChatAnthropic

llm = ChatAnthropic(model="claude-sonnet-4-20250514")

# LLM generates query variations, retrieves for each
retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    llm=llm
)

# Original:  "How do I deploy?"
# Generated: "What are the deployment steps?"
#            "How to push to production?"
#            "Deployment guide and instructions"
docs = retriever.invoke("How do I deploy?")
```
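The final step of multi-query retrieval is merging the per-query hits into a deduplicated union, preserving order of first appearance. A minimal sketch of that merge (the doc IDs are placeholders, not real results):

```python
def unique_union(result_lists):
    """Merge per-query result lists, keeping each doc's first occurrence."""
    seen, merged = set(), []
    for results in result_lists:
        for doc in results:
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

per_query = [
    ["doc_a", "doc_b"],  # hits for "How do I deploy?"
    ["doc_b", "doc_c"],  # hits for "What are the deployment steps?"
    ["doc_a", "doc_d"],  # hits for "Deployment guide and instructions"
]
print(unique_union(per_query))  # → ['doc_a', 'doc_b', 'doc_c', 'doc_d']
```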
Reranking
Reranking takes the initial retrieval results and re-scores them using a more powerful (but slower) model. This dramatically improves precision.
Cross-Encoder Reranking
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

# Step 1: Retrieve broadly (top 20)
base_retriever = vectorstore.as_retriever(
    search_kwargs={"k": 20}
)

# Step 2: Rerank to find the best 5
reranker = CohereRerank(
    model="rerank-english-v3.0",
    top_n=5
)
retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=base_retriever
)

# Results are much more relevant than simple top-5
docs = retriever.invoke("How do I handle authentication?")
```
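The pattern itself is simple: score every candidate against the query with a more expensive function, then keep the top n. A minimal sketch of retrieve-broad-then-rerank; the keyword-overlap scorer here is a stand-in for a real cross-encoder, which jointly encodes each query/document pair with a transformer:

```python
def rerank(query, docs, score_fn, top_n):
    """Re-score candidates with a (slower) scoring function; keep the best."""
    scored = [(score_fn(query, doc), doc) for doc in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]

def overlap_score(query, doc):
    """Toy scorer: count of shared lowercase words (NOT a real cross-encoder)."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

candidates = [
    "auth tokens and session handling",
    "deployment pipeline overview",
    "how we handle authentication errors",
]
print(rerank("handle authentication", candidates, overlap_score, top_n=1))
# → ['how we handle authentication errors']
```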
HyDE (Hypothetical Document Embeddings)
Generate a hypothetical answer to the question, embed that, and use it for retrieval. This bridges the gap between question embeddings and document embeddings:
```python
from langchain.chains import HypotheticalDocumentEmbedder

# LLM generates a hypothetical answer
hyde_embeddings = HypotheticalDocumentEmbedder.from_llm(
    llm=llm,
    base_embeddings=embeddings,
    prompt_key="web_search"
)

# User asks: "What causes memory leaks in Node.js?"
# HyDE generates: "Memory leaks in Node.js are commonly
#   caused by unclosed event listeners, global variables..."
# This hypothetical doc is embedded and used for search
# Result: retrieves actual docs about Node.js memory leaks
```
Ensemble Retriever
Combine results from multiple retrieval strategies:
```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# Keyword-based retriever
bm25_retriever = BM25Retriever.from_documents(chunks, k=5)

# Vector-based retriever
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Combine with weights
ensemble = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6]  # 40% keyword, 60% semantic
)

docs = ensemble.invoke("error code E1234")
# BM25 catches the exact error code
# Vector catches semantically related troubleshooting docs
```
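Behind the scenes, the ensemble fuses the two ranked lists with weighted Reciprocal Rank Fusion: each retriever contributes `weight / (c + rank)` for every doc it returns, and docs are sorted by total score. A minimal sketch with placeholder doc IDs (the constant `c = 60` is the value commonly used in the RRF literature):

```python
def weighted_rrf(rankings, weights, c=60):
    """Weighted Reciprocal Rank Fusion over several ranked result lists."""
    scores = {}
    for ranked, weight in zip(rankings, weights):
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + weight / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["kb_e1234", "kb_errors"]             # exact keyword matches
vector_hits = ["kb_troubleshooting", "kb_e1234"]  # semantic neighbors

print(weighted_rrf([bm25_hits, vector_hits], weights=[0.4, 0.6]))
# → ['kb_e1234', 'kb_troubleshooting', 'kb_errors']
```

`kb_e1234` wins because both retrievers rank it; docs found by only one list fall in behind it.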
Query Transformation
Transform the user's query before retrieval to improve results:
| Technique | How It Works | When to Use |
|---|---|---|
| Query Rewriting | LLM rewrites the query for better retrieval | Vague or conversational queries |
| Step-Back Prompting | Generates a more general query first | Very specific questions |
| Sub-Question Decomposition | Breaks complex questions into parts | Multi-part questions |
| HyDE | Generates a hypothetical answer to embed | Mismatch between question and document embeddings |
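Most of these techniques boil down to a prompt that turns the user's query into one or more better retrieval queries. A minimal step-back sketch (the template wording and helper name are illustrative, not from any library; in a real pipeline the filled prompt is sent to the LLM, and both the original and step-back questions are used for retrieval):

```python
STEP_BACK_TEMPLATE = (
    "You are an expert at rephrasing questions. Given a specific question, "
    "write a more general 'step-back' question that captures the underlying "
    "topic.\n\n"
    "Question: {question}\n"
    "Step-back question:"
)

def build_step_back_prompt(question: str) -> str:
    """Fill the step-back template for a given user question."""
    return STEP_BACK_TEMPLATE.format(question=question)

prompt = build_step_back_prompt(
    "Why does my Postgres 14.2 instance run out of memory during VACUUM FULL?")
print(prompt)
# The LLM might respond with something like:
#   "How does PostgreSQL manage memory during maintenance operations?"
```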
What's Next?
The next lesson covers generation — how to construct prompts with retrieved context and generate high-quality, cited answers.
Lilly Tech Systems