Step 4: Hybrid Search & Re-ranking
This is the core of the AI search engine. You will combine BM25 keyword scores with semantic vector scores using Reciprocal Rank Fusion (RRF), then apply a cross-encoder model to re-rank the top candidates for maximum precision. The result is a search system that outperforms either approach alone.
Why Hybrid Search?
We saw in previous lessons that keyword and semantic search have complementary strengths:
- BM25 excels at exact matches: "FastAPI", error codes, product names.
- Semantic excels at meaning: "how to build web apps" finds "FastAPI tutorial."
- Hybrid gets both: exact matches rank high AND conceptually similar documents appear.
Research consistently shows that hybrid search outperforms either approach individually. The question is how to combine the scores.
Reciprocal Rank Fusion (RRF)
RRF is a simple, effective algorithm for combining ranked lists from different scoring systems. It does not require score normalization because it only uses rank positions:
RRF Score = sum( 1 / (k + rank_i) ) for each ranking system i
Where:
- k = 60 (constant that prevents top-ranked items from dominating)
- rank_i = position in ranking system i (1-based)
Example for document D:
BM25 rank: 3 -> 1/(60+3) = 0.01587
Semantic rank: 1 -> 1/(60+1) = 0.01639
RRF Score: 0.01587 + 0.01639 = 0.03226
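The arithmetic can be checked in a few lines of Python (a standalone sketch, independent of the app code):

```python
# RRF contribution for document D: 1 / (k + rank), with the standard k = 60
k = 60
bm25_contrib = 1 / (k + 3)      # ranked 3rd by BM25
semantic_contrib = 1 / (k + 1)  # ranked 1st by semantic search

print(round(bm25_contrib, 5))      # 0.01587
print(round(semantic_contrib, 5))  # 0.01639
# Sum of the rounded contributions, matching the figures above
print(round(round(bm25_contrib, 5) + round(semantic_contrib, 5), 5))  # 0.03226
```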
The Hybrid Search Module
Create the hybrid search module with RRF fusion:
# app/search/hybrid.py
"""Hybrid search combining BM25 and semantic search with RRF fusion."""
from app.search.keyword import keyword_search
from app.search.semantic import semantic_search
from app.config import get_settings
import logging
logger = logging.getLogger(__name__)
settings = get_settings()
def reciprocal_rank_fusion(
    keyword_results: list[dict],
    semantic_results: list[dict],
    k: int = 60
) -> list[dict]:
    """Combine two ranked lists using Reciprocal Rank Fusion.

    Args:
        keyword_results: Results from BM25 keyword search.
        semantic_results: Results from semantic vector search.
        k: RRF constant (default 60, from the original paper).

    Returns:
        Merged and re-ranked list of results with RRF scores.
    """
    # Build a map of document_id -> result data + scores
    doc_map = {}

    # Process keyword results
    for rank, result in enumerate(keyword_results, start=1):
        doc_id = result["id"]
        rrf_score = 1.0 / (k + rank)
        if doc_id not in doc_map:
            doc_map[doc_id] = {
                "id": doc_id,
                "source": result["source"],
                "highlights": result.get("highlights", {}),
                "rrf_score": 0.0,
                "keyword_rank": None,
                "keyword_score": None,
                "semantic_rank": None,
                "semantic_score": None
            }
        doc_map[doc_id]["rrf_score"] += rrf_score
        doc_map[doc_id]["keyword_rank"] = rank
        doc_map[doc_id]["keyword_score"] = result["score"]

    # Process semantic results
    for rank, result in enumerate(semantic_results, start=1):
        doc_id = result["id"]
        rrf_score = 1.0 / (k + rank)
        if doc_id not in doc_map:
            doc_map[doc_id] = {
                "id": doc_id,
                "source": result["source"],
                "highlights": result.get("highlights", {}),
                "rrf_score": 0.0,
                "keyword_rank": None,
                "keyword_score": None,
                "semantic_rank": None,
                "semantic_score": None
            }
        doc_map[doc_id]["rrf_score"] += rrf_score
        doc_map[doc_id]["semantic_rank"] = rank
        doc_map[doc_id]["semantic_score"] = result["score"]

    # Sort by RRF score descending
    fused = sorted(doc_map.values(), key=lambda x: x["rrf_score"], reverse=True)
    return fused
def hybrid_search(
    query: str,
    top_k: int | None = None,
    category: str | None = None,
    tags: list[str] | None = None,
    rrf_k: int = 60,
    candidate_multiplier: int = 3
) -> dict:
    """Run hybrid search: BM25 + semantic + RRF fusion.

    Args:
        query: The user's search query.
        top_k: Final number of results to return.
        category: Optional category filter.
        tags: Optional tag filters.
        rrf_k: RRF constant (default 60).
        candidate_multiplier: Fetch N * top_k candidates from each system.

    Returns:
        Dict with fused 'results' list and metadata.
    """
    if top_k is None:
        top_k = settings.search_top_k

    # Fetch more candidates than needed for better fusion
    fetch_k = top_k * candidate_multiplier

    # Run both searches sequentially (an async version could run them
    # concurrently with asyncio.gather)
    kw_results = keyword_search(
        query=query, top_k=fetch_k, category=category, tags=tags
    )
    sem_results = semantic_search(
        query=query, top_k=fetch_k, category=category, tags=tags
    )

    # Fuse with RRF
    fused = reciprocal_rank_fusion(
        kw_results["results"],
        sem_results["results"],
        k=rrf_k
    )

    # Take top_k results
    results = fused[:top_k]

    # Convert to standard result format
    formatted = []
    for item in results:
        formatted.append({
            "id": item["id"],
            "score": item["rrf_score"],
            "source": item["source"],
            "highlights": item["highlights"],
            "ranking_details": {
                "keyword_rank": item["keyword_rank"],
                "keyword_score": item["keyword_score"],
                "semantic_rank": item["semantic_rank"],
                "semantic_score": item["semantic_score"]
            }
        })

    total = max(kw_results["total"], sem_results["total"])
    logger.info(
        f"Hybrid search for '{query}': "
        f"keyword={len(kw_results['results'])}, "
        f"semantic={len(sem_results['results'])}, "
        f"fused={len(formatted)}"
    )

    return {
        "results": formatted,
        "total": total,
        "query": query,
        "mode": "hybrid"
    }
Cross-Encoder Re-ranking
RRF gives us a good initial ranking, but we can improve precision further with a cross-encoder. Unlike bi-encoders (sentence-transformers) that encode query and document separately, a cross-encoder processes the query-document pair together, enabling deeper interaction between them:
Bi-encoder (fast, used for retrieval):
encode("search query") -> [0.12, -0.34, ...]
encode("document text") -> [0.45, 0.23, ...]
score = cosine(query_vec, doc_vec)
Cross-encoder (slow, used for re-ranking):
score = model("search query", "document text") -> 8.73
# Processes both texts together through all transformer layers
# Much more accurate, but too slow for initial retrieval
Create the re-ranking module:
# app/search/reranker.py
"""Cross-encoder re-ranking for search results."""
from sentence_transformers import CrossEncoder
from app.config import get_settings
import logging
logger = logging.getLogger(__name__)
settings = get_settings()
# Load model once at module level
_reranker = None
def get_reranker() -> CrossEncoder:
    """Lazy-load the cross-encoder re-ranking model."""
    global _reranker
    if _reranker is None:
        logger.info(f"Loading re-ranker model: {settings.reranker_model}")
        _reranker = CrossEncoder(settings.reranker_model)
        logger.info("Re-ranker model loaded")
    return _reranker
def rerank_results(
    query: str,
    results: list[dict],
    top_k: int | None = None
) -> list[dict]:
    """Re-rank search results using a cross-encoder model.

    The cross-encoder scores each (query, document) pair for relevance.
    This is more accurate than bi-encoder similarity but slower,
    so we only apply it to the top candidates from initial retrieval.

    Args:
        query: The original search query.
        results: List of result dicts from hybrid/keyword/semantic search.
        top_k: Number of results to return after re-ranking.

    Returns:
        Re-ranked list of results with updated scores.
    """
    if not results:
        return results
    if top_k is None:
        top_k = settings.search_top_k

    reranker = get_reranker()

    # Build query-document pairs for the cross-encoder
    pairs = []
    for result in results:
        title = result["source"].get("title", "")
        body = result["source"].get("body", "")
        # Truncate to ~2000 characters as a rough proxy for the
        # cross-encoder's 512-token input limit
        doc_text = f"{title}. {body}"[:2000]
        pairs.append((query, doc_text))

    # Score all pairs
    scores = reranker.predict(pairs)

    # Attach scores and sort
    for result, score in zip(results, scores):
        result["rerank_score"] = float(score)
        result["original_score"] = result["score"]
        result["score"] = float(score)  # Replace score with rerank score

    # Sort by cross-encoder score descending
    reranked = sorted(results, key=lambda x: x["score"], reverse=True)

    logger.info(
        f"Re-ranked {len(results)} results. "
        f"Top score: {reranked[0]['score']:.4f}, "
        f"Bottom score: {reranked[-1]['score']:.4f}"
    )
    return reranked[:top_k]
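The score bookkeeping (replace `score`, preserve `original_score`) can be seen in isolation by substituting a stub for the model; the scores below are made up for illustration:

```python
results = [
    {"id": "doc_a", "score": 0.0325},  # RRF scores from fusion
    {"id": "doc_b", "score": 0.0323},
]
stub_scores = [2.1, 7.8]  # pretend cross-encoder relevance scores

# Same attach-and-sort logic as rerank_results, minus the model call
for result, score in zip(results, stub_scores):
    result["rerank_score"] = float(score)
    result["original_score"] = result["score"]
    result["score"] = float(score)  # cross-encoder score becomes the ranking key

reranked = sorted(results, key=lambda x: x["score"], reverse=True)
print([r["id"] for r in reranked])    # ['doc_b', 'doc_a']
print(reranked[0]["original_score"])  # 0.0323
```

Keeping the original RRF score alongside the re-rank score makes it easy to see when the cross-encoder disagrees with the first-stage ranking.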
Complete Hybrid + Re-ranking Pipeline
Update the search API to use the full pipeline:
# Update app/main.py - complete search endpoint
from fastapi import HTTPException

from app.search.keyword import keyword_search
from app.search.semantic import semantic_search
from app.search.hybrid import hybrid_search
from app.search.reranker import rerank_results

@app.get("/api/search")
async def search(
    q: str,
    mode: str | None = None,
    top_k: int = 10,
    category: str | None = None,
    tags: str | None = None,
    page: int = 1,
    rerank: bool = True
):
    """Search documents with keyword, semantic, or hybrid matching.

    Query params:
        q: Search query string
        mode: 'keyword', 'semantic', or 'hybrid' (default from settings)
        top_k: Number of results per page
        category: Filter by category
        tags: Comma-separated tag filters
        page: Page number (1-based)
        rerank: Apply cross-encoder re-ranking (default True for hybrid)
    """
    if mode is None:
        mode = settings.search_default_mode

    tag_list = tags.split(",") if tags else None
    from_offset = (page - 1) * top_k

    if mode == "keyword":
        result = keyword_search(
            query=q, top_k=top_k, category=category,
            tags=tag_list, from_offset=from_offset
        )
    elif mode == "semantic":
        result = semantic_search(
            query=q, top_k=top_k, category=category, tags=tag_list
        )
    elif mode == "hybrid":
        # Fetch more candidates for re-ranking
        fetch_k = settings.reranker_top_k if rerank else top_k
        result = hybrid_search(
            query=q, top_k=fetch_k, category=category, tags=tag_list
        )
        # Apply cross-encoder re-ranking
        if rerank and result["results"]:
            result["results"] = rerank_results(
                query=q, results=result["results"], top_k=top_k
            )
            result["reranked"] = True
    else:
        raise HTTPException(status_code=400, detail=f"Unknown search mode: {mode}")

    return result
Performance Characteristics
Search Mode Performance (approximate, 100K documents):
| Mode | Latency | Precision | Recall |
|-------------------|-----------|-----------|--------|
| Keyword (BM25) | 5-15 ms | High | Medium |
| Semantic (kNN) | 10-30 ms | Medium | High |
| Hybrid (RRF) | 20-50 ms | High | High |
| Hybrid + Rerank | 50-200 ms | Highest | High |
The cross-encoder adds 30-150 ms depending on the number of candidates, so only re-rank the top ~20 candidates to keep latency acceptable. Use the rerank=false parameter for autocomplete and instant results, but enable re-ranking for the main search results page, where users expect higher quality.

Test the Complete Pipeline
# Hybrid search with re-ranking (default)
curl "http://localhost:8000/api/search?q=how+to+build+web+applications&mode=hybrid"
# Hybrid without re-ranking (faster)
curl "http://localhost:8000/api/search?q=how+to+build+web+applications&mode=hybrid&rerank=false"
# Compare all three modes for the same query
curl "http://localhost:8000/api/search?q=vector+database&mode=keyword"
curl "http://localhost:8000/api/search?q=vector+database&mode=semantic"
curl "http://localhost:8000/api/search?q=vector+database&mode=hybrid"
# The hybrid results should include the best from both approaches
Key Takeaways
- Reciprocal Rank Fusion combines ranked lists without requiring score normalization. It is simple, robust, and dataset-agnostic.
- The RRF constant k=60 balances the influence of top-ranked vs lower-ranked items.
- Cross-encoder re-ranking processes query-document pairs together for the highest accuracy, but is too slow for initial retrieval.
- The two-stage pipeline (retrieve then re-rank) gives us both speed and precision.
- Ranking details (keyword rank, semantic rank, re-rank score) are returned for debugging and transparency.
What Is Next
The search backend is complete. In the next lesson, you will build the search interface — a production-quality UI with autocomplete, faceted filters, highlighted snippets, and paginated results.
Lilly Tech Systems