Retrieval & Reranking
Advanced retrieval strategies that go beyond simple similarity search to dramatically improve RAG quality.
Retrieval Strategies
1. Basic Similarity Search
```python
# Simple top-k similarity search
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5}
)
docs = retriever.invoke("How do I deploy to production?")
```
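Under the hood, top-k similarity search just scores the query embedding against every document embedding and keeps the k highest. A minimal sketch with toy 2-D vectors (real vector stores use approximate nearest-neighbor indexes rather than a full scan; the vectors here are made up for illustration):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, doc_vecs, k):
    """Indices of the k document vectors most similar to the query."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Toy 2-D "embeddings": docs 0 and 1 point roughly the same way, doc 2 doesn't
doc_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(top_k([1.0, 0.05], doc_vecs, k=2))  # → [0, 1]
```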
2. MMR (Maximal Marginal Relevance)
MMR balances relevance with diversity. It prevents returning multiple chunks that say the same thing:
```python
# MMR: diverse results that are still relevant
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={
        "k": 5,
        "fetch_k": 20,      # Fetch 20 candidates
        "lambda_mult": 0.7  # 0 = max diversity, 1 = max relevance
    }
)
```
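The MMR selection itself is a greedy loop: each step picks the candidate that maximizes `lambda * relevance − (1 − lambda) * redundancy`, where redundancy is the candidate's similarity to anything already selected. A minimal sketch with made-up similarity scores, showing how `lambda_mult` trades a near-duplicate for a more diverse result:

```python
def mmr_select(query_sim, doc_sims, k, lambda_mult):
    """Greedy MMR over precomputed similarities.

    query_sim: doc-to-query similarity per document
    doc_sims:  doc_sims[i][j] = similarity between documents i and j
    """
    selected, candidates = [], list(range(len(query_sim)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            redundancy = max((doc_sims[i][j] for j in selected), default=0.0)
            return lambda_mult * query_sim[i] - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Doc 1 is nearly a duplicate of doc 0; doc 2 is less relevant but different
query_sim = [0.95, 0.94, 0.70]
doc_sims = [[1.0, 0.98, 0.2],
            [0.98, 1.0, 0.2],
            [0.2, 0.2, 1.0]]

print(mmr_select(query_sim, doc_sims, k=2, lambda_mult=0.7))  # → [0, 2]
print(mmr_select(query_sim, doc_sims, k=2, lambda_mult=1.0))  # → [0, 1]
```

With `lambda_mult=1.0` (pure relevance) the near-duplicate doc 1 wins; at `0.7` its redundancy penalty lets the diverse doc 2 in.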
3. Multi-Query Retrieval
Generate multiple variations of the user's query to retrieve a broader set of relevant documents:
```python
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_anthropic import ChatAnthropic

llm = ChatAnthropic(model="claude-sonnet-4-20250514")

# LLM generates query variations, retrieves for each
retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    llm=llm
)

# Original:  "How do I deploy?"
# Generated: "What are the deployment steps?"
#            "How to push to production?"
#            "Deployment guide and instructions"
docs = retriever.invoke("How do I deploy?")
```
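The final step of multi-query retrieval is merging the per-query hits into a deduplicated union, preserving order of first appearance. A minimal sketch of that merge (the doc IDs are placeholders, not real results):

```python
def unique_union(result_lists):
    """Merge per-query result lists, keeping each doc's first occurrence."""
    seen, merged = set(), []
    for results in result_lists:
        for doc in results:
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

per_query = [
    ["doc_a", "doc_b"],  # hits for "How do I deploy?"
    ["doc_b", "doc_c"],  # hits for "What are the deployment steps?"
    ["doc_a", "doc_d"],  # hits for "Deployment guide and instructions"
]
print(unique_union(per_query))  # → ['doc_a', 'doc_b', 'doc_c', 'doc_d']
```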
Reranking
Reranking takes the initial retrieval results and re-scores them using a more powerful (but slower) model. This dramatically improves precision.
Cross-Encoder Reranking
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

# Step 1: Retrieve broadly (top 20)
base_retriever = vectorstore.as_retriever(
    search_kwargs={"k": 20}
)

# Step 2: Rerank to find the best 5
reranker = CohereRerank(
    model="rerank-english-v3.0",
    top_n=5
)
retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=base_retriever
)

# Results are much more relevant than simple top-5
docs = retriever.invoke("How do I handle authentication?")
```
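The pattern itself is simple: score every candidate against the query with a more expensive function, then keep the top n. A minimal sketch of retrieve-broad-then-rerank; the keyword-overlap scorer here is a stand-in for a real cross-encoder, which jointly encodes each query/document pair with a transformer:

```python
def rerank(query, docs, score_fn, top_n):
    """Re-score candidates with a (slower) scoring function; keep the best."""
    scored = [(score_fn(query, doc), doc) for doc in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]

def overlap_score(query, doc):
    """Toy scorer: count of shared lowercase words (NOT a real cross-encoder)."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

candidates = [
    "auth tokens and session handling",
    "deployment pipeline overview",
    "how we handle authentication errors",
]
print(rerank("handle authentication", candidates, overlap_score, top_n=1))
# → ['how we handle authentication errors']
```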
HyDE (Hypothetical Document Embeddings)
Generate a hypothetical answer to the question, embed that, and use it for retrieval. This bridges the gap between question embeddings and document embeddings:
```python
from langchain.chains import HypotheticalDocumentEmbedder

# LLM generates a hypothetical answer
hyde_embeddings = HypotheticalDocumentEmbedder.from_llm(
    llm=llm,
    base_embeddings=embeddings,
    prompt_key="web_search"
)

# User asks: "What causes memory leaks in Node.js?"
# HyDE generates: "Memory leaks in Node.js are commonly
#   caused by unclosed event listeners, global variables..."
# This hypothetical doc is embedded and used for search
# Result: retrieves actual docs about Node.js memory leaks
```
Ensemble Retriever
Combine results from multiple retrieval strategies:
```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# Keyword-based retriever
bm25_retriever = BM25Retriever.from_documents(chunks, k=5)

# Vector-based retriever
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Combine with weights
ensemble = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6]  # 40% keyword, 60% semantic
)

docs = ensemble.invoke("error code E1234")
# BM25 catches the exact error code
# Vector catches semantically related troubleshooting docs
```
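Behind the scenes, the ensemble fuses the two ranked lists with weighted Reciprocal Rank Fusion: each retriever contributes `weight / (c + rank)` for every doc it returns, and docs are sorted by total score. A minimal sketch with placeholder doc IDs (the constant `c = 60` is the value commonly used in the RRF literature):

```python
def weighted_rrf(rankings, weights, c=60):
    """Weighted Reciprocal Rank Fusion over several ranked result lists."""
    scores = {}
    for ranked, weight in zip(rankings, weights):
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + weight / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["kb_e1234", "kb_errors"]             # exact keyword matches
vector_hits = ["kb_troubleshooting", "kb_e1234"]  # semantic neighbors

print(weighted_rrf([bm25_hits, vector_hits], weights=[0.4, 0.6]))
# → ['kb_e1234', 'kb_troubleshooting', 'kb_errors']
```

`kb_e1234` wins because both retrievers rank it; docs found by only one list fall in behind it.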
Query Transformation
Transform the user's query before retrieval to improve results:
| Technique | How It Works | When to Use |
|---|---|---|
| Query Rewriting | LLM rewrites the query for better retrieval | Vague or conversational queries |
| Step-Back Prompting | Generates a more general query first | Very specific questions |
| Sub-Question Decomposition | Breaks complex questions into parts | Multi-part questions |
| HyDE | Generates a hypothetical answer to embed | Mismatch between question and document embeddings |
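Most of these techniques boil down to a prompt that turns the user's query into one or more better retrieval queries. A minimal step-back sketch (the template wording and helper name are illustrative, not from any library; in a real pipeline the filled prompt is sent to the LLM, and both the original and step-back questions are used for retrieval):

```python
STEP_BACK_TEMPLATE = (
    "You are an expert at rephrasing questions. Given a specific question, "
    "write a more general 'step-back' question that captures the underlying "
    "topic.\n\n"
    "Question: {question}\n"
    "Step-back question:"
)

def build_step_back_prompt(question: str) -> str:
    """Fill the step-back template for a given user question."""
    return STEP_BACK_TEMPLATE.format(question=question)

prompt = build_step_back_prompt(
    "Why does my Postgres 14.2 instance run out of memory during VACUUM FULL?")
print(prompt)
# The LLM might respond with something like:
#   "How does PostgreSQL manage memory during maintenance operations?"
```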
What's Next?
The next lesson covers generation — how to construct prompts with retrieved context and generate high-quality, cited answers.
Lilly Tech Systems