Step 2: Keyword Search (BM25)
In this lesson, you will implement traditional keyword search using Elasticsearch BM25. You will build multi-field queries with boosting, add phrase matching, configure filters, and return highlighted snippets. This forms the lexical foundation of our hybrid search engine.
How BM25 Works
BM25 (Best Matching 25) is the default scoring algorithm in Elasticsearch. It ranks documents based on term frequency and inverse document frequency:
- Term Frequency (TF): How often the query term appears in the document. More occurrences mean higher relevance, with diminishing returns.
- Inverse Document Frequency (IDF): How rare the term is across all documents. Rare terms contribute more to relevance than common ones.
- Document Length Normalization: Shorter documents with the same term frequency score higher, because the term is proportionally more important.
BM25 Score = IDF(q) * [ TF(q,d) * (k1 + 1) / (TF(q,d) + k1 * (1 - b + b * |d| / avgdl)) ]
Where:
- k1 = 1.2 controls term-frequency saturation (how quickly additional occurrences stop adding score)
- b = 0.75 controls length normalization (0 disables it, 1 applies it fully)
- |d| = document length
- avgdl = average document length
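To make the formula concrete, here is a minimal Python sketch of the per-term BM25 score. It uses Lucene's non-negative IDF variant (which is what Elasticsearch implements); the document counts and lengths are invented for illustration:

```python
import math

def bm25_term_score(tf: float, df: int, n_docs: int, doc_len: int,
                    avgdl: float, k1: float = 1.2, b: float = 0.75) -> float:
    """Score one query term against one document with BM25.

    tf: term frequency in the document
    df: number of documents containing the term
    n_docs: total documents in the index
    doc_len: this document's length in tokens
    avgdl: average document length in the index
    """
    # Lucene's non-negative IDF variant (used by Elasticsearch)
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    length_norm = k1 * (1 - b + b * doc_len / avgdl)
    return idf * tf * (k1 + 1) / (tf + length_norm)

# Length normalization: the same tf scores higher in a shorter document
short_doc = bm25_term_score(tf=3, df=10, n_docs=1000, doc_len=50, avgdl=200)
long_doc = bm25_term_score(tf=3, df=10, n_docs=1000, doc_len=500, avgdl=200)
assert short_doc > long_doc

# Diminishing returns: going from tf=1 to 2 adds more than tf=2 to 4
s1, s2, s4 = (bm25_term_score(tf, 10, 1000, 200, 200) for tf in (1, 2, 4))
assert (s2 - s1) > (s4 - s2)
```

The two assertions demonstrate the TF saturation and length-normalization behaviors described above.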
The Keyword Search Module
Create the keyword search module with multi-field matching, boosting, filtering, and highlighting:
# app/search/keyword.py
"""BM25 keyword search with multi-field matching and highlighting."""
from app.elasticsearch.client import SearchClient
from app.config import get_settings
import logging
logger = logging.getLogger(__name__)
settings = get_settings()
def keyword_search(
    query: str,
    top_k: int | None = None,
    category: str | None = None,
    tags: list[str] | None = None,
    from_offset: int = 0
) -> dict:
"""Run a BM25 keyword search against Elasticsearch.
Args:
query: The user's search query string.
top_k: Number of results to return (default from settings).
category: Optional category filter.
tags: Optional tag filters (AND logic).
from_offset: Pagination offset.
Returns:
Dict with 'results' list and 'total' count.
"""
if top_k is None:
top_k = settings.search_top_k
client = SearchClient()
# Build the query
must_clauses = [
{
"multi_match": {
"query": query,
"fields": [
"title^3", # Title matches are 3x more important
"title.keyword^5", # Exact title match is 5x
"body"
],
"type": "best_fields",
"fuzziness": "AUTO",
"prefix_length": 2,
"minimum_should_match": "75%"
}
}
]
# Optional: add phrase boost for exact phrase matches
should_clauses = [
{
"match_phrase": {
"title": {
"query": query,
"boost": 10,
"slop": 1
}
}
},
{
"match_phrase": {
"body": {
"query": query,
"boost": 2,
"slop": 2
}
}
}
]
# Build filter clauses
filter_clauses = []
if category:
filter_clauses.append({"term": {"category": category}})
if tags:
for tag in tags:
filter_clauses.append({"term": {"tags": tag}})
# Assemble the full query
search_body = {
"query": {
"bool": {
"must": must_clauses,
"should": should_clauses,
"filter": filter_clauses
}
},
"highlight": {
"fields": {
"title": {
"number_of_fragments": 1,
"fragment_size": 200
},
"body": {
"number_of_fragments": 3,
"fragment_size": 150
}
},
"pre_tags": ["<mark>"],
"post_tags": ["</mark>"]
},
"from": from_offset,
"size": top_k,
"_source": ["title", "body", "category", "tags", "url", "created_at"]
}
# Execute search (elasticsearch-py 8.x deprecates body= in favor of
# top-level keyword arguments, but it still works here)
response = client.es.search(
    index=client.index_name,
    body=search_body
)
# Parse results
results = []
for hit in response["hits"]["hits"]:
result = {
"id": hit["_id"],
"score": hit["_score"],
"source": hit["_source"],
"highlights": hit.get("highlight", {})
}
results.append(result)
total = response["hits"]["total"]["value"]
logger.info(f"Keyword search for '{query}': {total} total, returning {len(results)}")
return {
"results": results,
"total": total,
"query": query,
"mode": "keyword"
}
Understanding the Query Structure
Let us break down the key decisions in the query:
Multi-Match with Field Boosting
"multi_match": {
"query": "machine learning",
"fields": [
"title^3", # Title match worth 3x body match
"title.keyword^5", # Exact title match worth 5x
"body" # Body match (base score)
],
"type": "best_fields", # Take the highest-scoring field
"fuzziness": "AUTO" # Allow typos: 1 edit for 3-5 chars, 2 for 6+
}
The best_fields type takes the score from whichever field matched best. This prevents dilution when a term appears in the title but not the body.
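A toy sketch of that difference, with invented per-field scores and the boosts from the query above: best_fields keeps only the highest boosted field score (Lucene's dis_max), while most_fields would sum them.

```python
# Per-field BM25 scores for one document, with boosts title^3, body^1.
# (The raw scores 2.0 and 1.5 are invented for illustration.)
field_scores = {"title": 2.0 * 3, "body": 1.5 * 1}

# best_fields: take the single highest boosted field score (dis_max)
best_fields = max(field_scores.values())

# most_fields, for contrast, would sum the field scores
most_fields = sum(field_scores.values())

print(best_fields)  # 6.0 -- the strong title match is not diluted
print(most_fields)  # 7.5
```

With best_fields, a document that matches strongly in one field ranks on that strength alone, which is usually what you want for short queries against title and body.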
Phrase Boosting
"match_phrase": {
"title": {
"query": "machine learning",
"boost": 10,
"slop": 1 # Allow 1 word between terms
}
}
When the query appears as an exact phrase (or near-phrase) in the title, the document gets a large boost. "Introduction to Machine Learning" scores much higher than a document that just mentions "machine" and "learning" separately.
Highlighting
"highlight": {
"fields": {
"body": {
"number_of_fragments": 3, // Return up to 3 highlighted snippets
"fragment_size": 150 // Each snippet is ~150 characters
}
},
"pre_tags": ["<mark>"],
"post_tags": ["</mark>"]
}
Highlighting wraps matched terms in <mark> tags so the UI can visually emphasize them. The fragment_size controls how much context surrounds each match.
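On the client side you might flatten those fragments into a single display snippet. This helper is an illustrative sketch, not part of the module above; the " ... " joiner and 150-character fallback are arbitrary choices:

```python
def render_snippet(highlights: dict, source: dict) -> str:
    """Flatten highlight fragments into one display snippet.

    Prefers highlighted body fragments; falls back to truncating the
    plain body when the body field produced no highlights.
    """
    fragments = highlights.get("body")
    if fragments:
        return " ... ".join(fragments)
    body = source.get("body", "")
    return body[:150] + ("..." if len(body) > 150 else "")

hit_highlights = {
    "body": ["<mark>Machine learning</mark> is a subset...",
             "applications of <mark>machine learning</mark>"]
}
snippet = render_snippet(hit_highlights, {})
print(snippet)
# <mark>Machine learning</mark> is a subset... ... applications of <mark>machine learning</mark>
```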
Add the Search API Route
Add the keyword search endpoint to app/main.py:
# Add to app/main.py
from app.search.keyword import keyword_search
@app.get("/api/search")
async def search(
    q: str,
    mode: str = "keyword",
    top_k: int = 10,
    category: str | None = None,
    tags: str | None = None,
    page: int = 1
):
"""Search documents with keyword (BM25) matching.
Query params:
q: Search query string
mode: Search mode - 'keyword' for now, 'semantic' and 'hybrid' later
top_k: Number of results per page
category: Filter by category
tags: Comma-separated tag filters
page: Page number (1-based)
"""
tag_list = tags.split(",") if tags else None
from_offset = (page - 1) * top_k
if mode == "keyword":
return keyword_search(
query=q,
top_k=top_k,
category=category,
tags=tag_list,
from_offset=from_offset
)
return {"error": f"Mode '{mode}' not implemented yet"}
Test Keyword Search
# Basic keyword search
curl "http://localhost:8000/api/search?q=machine+learning&mode=keyword"
# With category filter
curl "http://localhost:8000/api/search?q=python&category=web-development"
# With tag filter
curl "http://localhost:8000/api/search?q=database&tags=vectors,embeddings"
# Pagination
curl "http://localhost:8000/api/search?q=AI&page=2&top_k=5"
Expected response structure:
{
"results": [
{
"id": "a1b2c3d4",
"score": 8.234,
"source": {
"title": "Introduction to Machine Learning",
"body": "Machine learning is a subset of artificial intelligence...",
"category": "machine-learning",
"tags": ["ml", "ai", "beginner"],
"url": "/articles/intro-ml"
},
"highlights": {
"title": ["Introduction to <mark>Machine Learning</mark>"],
"body": ["<mark>Machine learning</mark> is a subset of artificial intelligence..."]
}
}
],
"total": 1,
"query": "machine learning",
"mode": "keyword"
}
Relevance Tuning Tips
- Increase title boost if users typically search for document names or headings.
- Lower minimum_should_match from 75% to 50% if you want broader recall at the cost of precision.
- Disable fuzziness for technical queries where exact terms matter (code, IDs, error messages).
- Add field-specific analyzers for different content types (e.g., a code analyzer that preserves underscores and dots).
- Use function_score to boost recent documents or popular documents.
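As a sketch of that last tip, the bool query built in the module above could be wrapped in a function_score with a Gaussian decay on created_at. The 30-day scale and 0.5 decay here are illustrative starting points, not tuned recommendations:

```python
# Sketch: wrap the module's bool query in a function_score that decays
# scores for older documents.
def with_recency_boost(bool_query: dict) -> dict:
    return {
        "function_score": {
            "query": bool_query,
            "functions": [
                {
                    "gauss": {
                        "created_at": {
                            "origin": "now",   # newest docs keep full score
                            "scale": "30d",    # score halves ~30 days out
                            "decay": 0.5
                        }
                    }
                }
            ],
            "boost_mode": "multiply"  # recency scales the BM25 score
        }
    }
```

Because boost_mode is multiply, a document's final score is its BM25 score scaled by the decay function, so stale documents sink without disappearing.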
Key Takeaways
- BM25 keyword search is fast, explainable, and excellent at exact term matching.
- Multi-field queries with boosting let you weight title matches higher than body matches.
- Phrase boosting dramatically improves precision for multi-word queries.
- Highlighting with <mark> tags gives users visual feedback on why each result matched.
- Filters (category, tags) narrow results without affecting relevance scores.
What Is Next
Keyword search cannot find conceptually similar documents when different words are used. In the next lesson, you will add semantic search with dense vectors so "automobile maintenance" finds "car repair" even when no keywords overlap.