Intermediate

Step 2: Keyword Search (BM25)

In this lesson, you will implement traditional keyword search using Elasticsearch BM25. You will build multi-field queries with boosting, add phrase matching, configure filters, and return highlighted snippets. This forms the lexical foundation of our hybrid search engine.

How BM25 Works

BM25 (Best Matching 25) is the default scoring algorithm in Elasticsearch. It ranks documents based on term frequency and inverse document frequency:

  • Term Frequency (TF): How often the query term appears in the document. More occurrences mean higher relevance, with diminishing returns.
  • Inverse Document Frequency (IDF): How rare the term is across all documents. Rare terms contribute more to relevance than common ones.
  • Document Length Normalization: Shorter documents with the same term frequency score higher, because the term is proportionally more important.

BM25 Score = IDF(q) * [ TF(q,d) * (k1 + 1) / (TF(q,d) + k1 * (1 - b + b * |d| / avgdl)) ]

Where:
  - k1 = 1.2 (term frequency saturation)
  - b  = 0.75 (length normalization factor)
  - |d| = document length
  - avgdl = average document length
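
To make the formula concrete, here is a minimal sketch of the per-term score in plain Python. It uses Lucene's non-negative IDF variant (`log(1 + (N - df + 0.5) / (df + 0.5))`), which is what Elasticsearch's default similarity computes; the example numbers are illustrative.

```python
import math


def bm25_term_score(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Score a single query term against one document (Lucene-style BM25)."""
    # Non-negative IDF: rare terms (small df) contribute more.
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    # Length normalization: longer-than-average docs are penalized via b.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    # Term frequency saturates: doubling tf does not double the score.
    return idf * tf * (k1 + 1) / (tf + norm)


# A term appearing 3 times in a 100-word doc, found in 10 of 1,000 docs:
score = bm25_term_score(tf=3, df=10, n_docs=1000, doc_len=100, avg_doc_len=120)
```

You can verify the properties listed above directly: raising `tf` from 3 to 6 less than doubles the score (saturation), and the same `tf` in a 50-word document scores higher than in a 100-word one (length normalization).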
💡 When BM25 Shines: Keyword search excels at exact term matching, proper nouns, technical jargon, error codes, and product names. A search for "FastAPI uvicorn" should find exact matches, not conceptually similar pages about "web frameworks." This is why hybrid search combines both approaches.

The Keyword Search Module

Create the keyword search module with multi-field matching, boosting, filtering, and highlighting:

# app/search/keyword.py
"""BM25 keyword search with multi-field matching and highlighting."""
from app.elasticsearch.client import SearchClient
from app.config import get_settings
import logging

logger = logging.getLogger(__name__)
settings = get_settings()


def keyword_search(
    query: str,
    top_k: int | None = None,
    category: str | None = None,
    tags: list[str] | None = None,
    from_offset: int = 0
) -> dict:
    """Run a BM25 keyword search against Elasticsearch.

    Args:
        query: The user's search query string.
        top_k: Number of results to return (default from settings).
        category: Optional category filter.
        tags: Optional tag filters (AND logic).
        from_offset: Pagination offset.

    Returns:
        Dict with 'results' list and 'total' count.
    """
    if top_k is None:
        top_k = settings.search_top_k

    client = SearchClient()

    # Build the query
    must_clauses = [
        {
            "multi_match": {
                "query": query,
                "fields": [
                    "title^3",       # Title matches are 3x more important
                    "title.keyword^5",  # Exact title match is 5x
                    "body"
                ],
                "type": "best_fields",
                "fuzziness": "AUTO",
                "prefix_length": 2,
                "minimum_should_match": "75%"
            }
        }
    ]

    # Optional: add phrase boost for exact phrase matches
    should_clauses = [
        {
            "match_phrase": {
                "title": {
                    "query": query,
                    "boost": 10,
                    "slop": 1
                }
            }
        },
        {
            "match_phrase": {
                "body": {
                    "query": query,
                    "boost": 2,
                    "slop": 2
                }
            }
        }
    ]

    # Build filter clauses
    filter_clauses = []
    if category:
        filter_clauses.append({"term": {"category": category}})
    if tags:
        for tag in tags:
            filter_clauses.append({"term": {"tags": tag}})

    # Assemble the full query
    search_body = {
        "query": {
            "bool": {
                "must": must_clauses,
                "should": should_clauses,
                "filter": filter_clauses
            }
        },
        "highlight": {
            "fields": {
                "title": {
                    "number_of_fragments": 1,
                    "fragment_size": 200
                },
                "body": {
                    "number_of_fragments": 3,
                    "fragment_size": 150
                }
            },
            "pre_tags": ["<mark>"],
            "post_tags": ["</mark>"]
        },
        "from": from_offset,
        "size": top_k,
        "_source": ["title", "body", "category", "tags", "url", "created_at"]
    }

    # Execute search
    response = client.es.search(
        index=client.index_name,
        body=search_body
    )

    # Parse results
    results = []
    for hit in response["hits"]["hits"]:
        result = {
            "id": hit["_id"],
            "score": hit["_score"],
            "source": hit["_source"],
            "highlights": hit.get("highlight", {})
        }
        results.append(result)

    total = response["hits"]["total"]["value"]

    logger.info(f"Keyword search for '{query}': {total} total, returning {len(results)}")

    return {
        "results": results,
        "total": total,
        "query": query,
        "mode": "keyword"
    }

Understanding the Query Structure

Let us break down the key decisions in the query:

Multi-Match with Field Boosting

"multi_match": {
    "query": "machine learning",
    "fields": [
        "title^3",           # Title match worth 3x body match
        "title.keyword^5",   # Exact title match worth 5x
        "body"               # Body match (base score)
    ],
    "type": "best_fields",   # Take the highest-scoring field
    "fuzziness": "AUTO"      # Allow typos: 1 edit for 3-5 chars, 2 for 6+
}

The best_fields type takes the score from whichever field matched best. This prevents dilution when a term appears in the title but not the body.
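
If you instead want every matching field to contribute to the score, most_fields is the usual alternative. This is a hypothetical variant of the lesson's query, not a change to the module above; the field names and boosts match the index used in this lesson.

```python
# Alternative to best_fields: sum the per-field scores instead of taking
# only the highest one. Useful when the same text is analyzed multiple
# ways (e.g. stemmed and unstemmed) and each match adds evidence.
most_fields_query = {
    "multi_match": {
        "query": "machine learning",
        "fields": ["title^3", "body"],
        "type": "most_fields",  # sum scores across matching fields
    }
}
```

With best_fields, a document matching strongly in the title alone is not penalized for a weak body match; with most_fields, matches in both fields reinforce each other.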

Phrase Boosting

"match_phrase": {
    "title": {
        "query": "machine learning",
        "boost": 10,
        "slop": 1    # Allow 1 word between terms
    }
}

When the query appears as an exact phrase (or near-phrase) in the title, the document gets a large boost. "Introduction to Machine Learning" scores much higher than a document that just mentions "machine" and "learning" separately.

Highlighting

"highlight": {
    "fields": {
        "body": {
            "number_of_fragments": 3,   // Return up to 3 highlighted snippets
            "fragment_size": 150         // Each snippet is ~150 characters
        }
    },
    "pre_tags": ["<mark>"],
    "post_tags": ["</mark>"]
}

Highlighting wraps matched terms in <mark> tags so the UI can visually emphasize them. The fragment_size controls how much context surrounds each match.
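
On the consumer side, a small helper can join the highlight fragments into a display string, falling back to the raw source when a field produced no highlights. This is a sketch that assumes result dicts shaped like the ones keyword_search returns (with "source" and "highlights" keys):

```python
def render_snippets(hit: dict, field: str = "body", sep: str = " … ") -> str:
    """Join highlight fragments for a field, falling back to the raw source."""
    fragments = hit.get("highlights", {}).get(field)
    if fragments:
        # Elasticsearch returns each fragment with <mark> tags already applied.
        return sep.join(fragments)
    # No highlight for this field: fall back to a truncated source value.
    return hit.get("source", {}).get(field, "")[:150]


hit = {
    "source": {"body": "Machine learning is a subset of AI."},
    "highlights": {"body": ["<mark>Machine learning</mark> is a subset"]},
}
snippet = render_snippets(hit)  # "<mark>Machine learning</mark> is a subset"
```

Remember to HTML-escape the source text before rendering if the fallback path is used, since only the highlighted fragments come back pre-tagged.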

Add the Search API Route

Add the keyword search endpoint to app/main.py:

# Add to app/main.py
from app.search.keyword import keyword_search


@app.get("/api/search")
async def search(
    q: str,
    mode: str = "keyword",
    top_k: int = 10,
    category: str | None = None,
    tags: str | None = None,
    page: int = 1
):
    """Search documents with keyword (BM25) matching.

    Query params:
        q: Search query string
        mode: Search mode - 'keyword' for now, 'semantic' and 'hybrid' later
        top_k: Number of results per page
        category: Filter by category
        tags: Comma-separated tag filters
        page: Page number (1-based)
    """
    tag_list = tags.split(",") if tags else None
    from_offset = (page - 1) * top_k

    if mode == "keyword":
        return keyword_search(
            query=q,
            top_k=top_k,
            category=category,
            tags=tag_list,
            from_offset=from_offset
        )

    return {"error": f"Mode '{mode}' not implemented yet"}

Test Keyword Search

# Basic keyword search
curl "http://localhost:8000/api/search?q=machine+learning&mode=keyword"

# With category filter
curl "http://localhost:8000/api/search?q=python&category=web-development"

# With tag filter
curl "http://localhost:8000/api/search?q=database&tags=vectors,embeddings"

# Pagination
curl "http://localhost:8000/api/search?q=AI&page=2&top_k=5"

Expected response structure:

{
  "results": [
    {
      "id": "a1b2c3d4",
      "score": 8.234,
      "source": {
        "title": "Introduction to Machine Learning",
        "body": "Machine learning is a subset of artificial intelligence...",
        "category": "machine-learning",
        "tags": ["ml", "ai", "beginner"],
        "url": "/articles/intro-ml"
      },
      "highlights": {
        "title": ["Introduction to <mark>Machine Learning</mark>"],
        "body": ["<mark>Machine learning</mark> is a subset of artificial intelligence..."]
      }
    }
  ],
  "total": 1,
  "query": "machine learning",
  "mode": "keyword"
}

Relevance Tuning Tips

  • Increase title boost if users typically search for document names or headings.
  • Lower minimum_should_match from 75% to 50% if you want broader recall at the cost of precision.
  • Disable fuzziness for technical queries where exact terms matter (code, IDs, error messages).
  • Add field-specific analyzers for different content types (e.g., a code analyzer that preserves underscores and dots).
  • Use function_score to boost recent documents or popular documents.
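
The last tip can be sketched as a wrapper that adds a recency boost around any query body. This is a hypothetical example, not part of the module above; the created_at field name, 90-day scale, and multiply boost mode are all assumptions you would tune for your corpus.

```python
def with_recency_boost(query: dict) -> dict:
    """Wrap a query in function_score with a Gaussian decay on created_at."""
    return {
        "function_score": {
            "query": query,
            "functions": [
                {
                    "gauss": {
                        "created_at": {
                            "origin": "now",
                            "scale": "90d",  # score decays to `decay` at 90 days old
                            "decay": 0.5,
                        }
                    }
                }
            ],
            # Multiply the BM25 score by the decay factor, so a 90-day-old
            # document keeps half of its relevance score.
            "boost_mode": "multiply",
        }
    }


# Usage: wrap the bool query before putting it in search_body["query"].
boosted = with_recency_boost({"match_all": {}})
```

Because the decay multiplies rather than replaces the BM25 score, highly relevant old documents can still outrank marginally relevant new ones.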

Key Takeaways

  • BM25 keyword search is fast, explainable, and excellent at exact term matching.
  • Multi-field queries with boosting let you weight title matches higher than body matches.
  • Phrase boosting dramatically improves precision for multi-word queries.
  • Highlighting with <mark> tags gives users visual feedback on why each result matched.
  • Filters (category, tags) narrow results without affecting relevance scores.

What Is Next

Keyword search cannot find conceptually similar documents when different words are used. In the next lesson, you will add semantic search with dense vectors so "automobile maintenance" finds "car repair" even when no keywords overlap.