Best Practices & Checklist
This final lesson distills everything into actionable metrics, testing strategies, and a production checklist. You will learn how to measure search quality, run A/B tests on ranking changes, and avoid the most common mistakes search engineers make.
Search Quality Metrics
You cannot improve what you do not measure. These are the metrics every search team must track:
| Metric | What It Measures | Formula | Good Target |
|---|---|---|---|
| MRR | Mean Reciprocal Rank — how high the first relevant result appears | Average of 1/rank of first relevant result | > 0.5 |
| NDCG@k | Normalized Discounted Cumulative Gain — overall ranking quality | DCG@k / ideal DCG@k (accounts for graded relevance) | > 0.6 at k=10 |
| MAP | Mean Average Precision — precision across all recall levels | Mean of AP for each query (AP = area under P-R curve) | > 0.4 |
| Zero-result rate | Percentage of queries that return no results | Queries with 0 results / total queries | < 5% |
| Click-through rate | Percentage of searches where user clicks a result | Searches with clicks / total searches | > 50% |
| p50/p95/p99 latency | Search response time at percentiles | Percentile of response time distribution | p50 < 50ms, p95 < 200ms |
```python
import numpy as np


def mrr(ranked_results: list, relevant_ids: set) -> float:
    """Mean Reciprocal Rank for a single query."""
    for rank, result in enumerate(ranked_results, 1):
        if result.doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_results: list, relevance_labels: dict, k: int = 10) -> float:
    """NDCG@k for a single query."""
    def dcg(scores, k):
        return sum(
            (2**score - 1) / np.log2(rank + 2)
            for rank, score in enumerate(scores[:k])
        )

    actual_scores = [relevance_labels.get(r.doc_id, 0) for r in ranked_results]
    ideal_scores = sorted(relevance_labels.values(), reverse=True)
    actual_dcg = dcg(actual_scores, k)
    ideal_dcg = dcg(ideal_scores, k)
    return actual_dcg / ideal_dcg if ideal_dcg > 0 else 0.0


def evaluate_search(queries: list, search_fn, ground_truth: dict) -> dict:
    """Run full evaluation across a query set."""
    mrr_scores, ndcg_scores = [], []
    for query_data in queries:
        query = query_data["query"]
        relevant = set(query_data["relevant_doc_ids"])
        labels = query_data.get("relevance_labels", {})
        results = search_fn(query)
        mrr_scores.append(mrr(results, relevant))
        if labels:
            ndcg_scores.append(ndcg_at_k(results, labels))
    return {
        "mrr": round(np.mean(mrr_scores), 4),
        "ndcg@10": round(np.mean(ndcg_scores), 4) if ndcg_scores else None,
        "num_queries": len(queries),
    }
```
A/B Testing Search Changes
Every ranking change must be validated with an A/B test. Here is the production pattern:
```python
import hashlib


class SearchABTest:
    """Route users to control or treatment search pipelines."""

    def __init__(self, test_name: str, treatment_pct: float = 0.1):
        self.test_name = test_name
        self.treatment_pct = treatment_pct
        self.events = []  # buffered locally; in production, ship to analytics

    def get_variant(self, user_id: str) -> str:
        """Deterministic assignment: same user always gets same variant."""
        hash_input = f"{self.test_name}:{user_id}"
        hash_val = int(hashlib.md5(hash_input.encode()).hexdigest(), 16)
        bucket = (hash_val % 1000) / 1000
        return "treatment" if bucket < self.treatment_pct else "control"

    def search(self, query, user_id, control_pipeline, treatment_pipeline):
        variant = self.get_variant(user_id)
        if variant == "treatment":
            results = treatment_pipeline.search(query)
        else:
            results = control_pipeline.search(query)
        # Log for analysis
        self._log_event({
            "test": self.test_name,
            "variant": variant,
            "user_id": user_id,
            "query": query,
            "result_ids": [r.doc_id for r in results[:10]],
        })
        return results

    def _log_event(self, event: dict) -> None:
        # Minimal stub: replace with your real event logger
        self.events.append(event)


# Key A/B test metrics to track:
# - Click-through rate (CTR) per variant
# - Mean Reciprocal Rank (MRR) from clicks
# - Zero-result rate
# - Reformulation rate (user changes query)
# - Revenue per search (for e-commerce)
# Minimum sample: 1,000 searches per variant for meaningful results
```
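Once both variants have enough traffic, you need a way to tell a real CTR difference from noise. A standard two-proportion z-test is enough for CTR; this sketch (the helper name `ctr_z_test` is ours, not from the course) uses only the standard library:

```python
import math


def ctr_z_test(clicks_a: int, searches_a: int,
               clicks_b: int, searches_b: int) -> tuple:
    """Two-proportion z-test for the CTR difference between variants.

    Returns (z, two_sided_p_value). Small p (< 0.05) suggests the
    difference is unlikely to be chance.
    """
    p_a = clicks_a / searches_a
    p_b = clicks_b / searches_b
    # Pooled proportion under the null hypothesis (no difference)
    p_pool = (clicks_a + clicks_b) / (searches_a + searches_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / searches_a + 1 / searches_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF (via erf)
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value
```

For example, 500/1000 clicks in control vs 560/1000 in treatment gives p ≈ 0.007, a significant lift; the same CTR in both arms gives p ≈ 1.0.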
Relevance Tuning Process
Follow this systematic process to improve search relevance:
Step 1: Build Evaluation Set
Create 100+ query-result pairs with graded relevance labels (0=irrelevant, 1=partial, 2=relevant, 3=perfect). Include head queries (popular), torso queries (medium), and tail queries (rare). This is your ground truth.
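An evaluation-set entry can be as simple as a dict per query. The doc IDs and labels below are made up for illustration; the field names match the `evaluate_search` harness shown earlier in this lesson:

```python
# Hypothetical evaluation-set entries (doc IDs are invented).
# Labels use the graded scale: 0=irrelevant, 1=partial, 2=relevant, 3=perfect.
eval_set = [
    {  # head query: popular, many matching documents
        "query": "wireless headphones",
        "relevant_doc_ids": ["doc_101", "doc_205"],
        "relevance_labels": {"doc_101": 3, "doc_205": 2, "doc_330": 1},
    },
    {  # tail query: rare, often a single good answer
        "query": "replacement ear pads wh-1000xm4",
        "relevant_doc_ids": ["doc_877"],
        "relevance_labels": {"doc_877": 3},
    },
]
```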
Step 2: Baseline Metrics
Run evaluation and record MRR, NDCG@10, and zero-result rate. This is your starting point. Every change must be measured against this baseline. Log all results for comparison.
Step 3: Identify Failure Categories
Manually review the worst 20% of queries. Categorize failures: wrong entity resolution? Missing synonyms? Bad chunking? Ranking issue? Each category suggests a specific fix from this course.
Step 4: Implement and Measure
Make one change at a time. Re-run evaluation after each change. If NDCG improves by 2%+ on your evaluation set, A/B test it in production. If it regresses, revert. Track the impact of every change.
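The 2%+ threshold from Step 4 can be encoded as a simple ship gate. The helper name and the metric-dict shape are ours (matching the output of the evaluation harness earlier in this lesson):

```python
def should_ship(baseline: dict, candidate: dict, min_lift: float = 0.02) -> bool:
    """Gate a ranking change: require a relative NDCG@10 lift of min_lift
    (default 2%) on the offline eval set before promoting to an A/B test."""
    base = baseline["ndcg@10"]
    cand = candidate["ndcg@10"]
    return (cand - base) / base >= min_lift
```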
Production Deployment Checklist
Indexing (Lesson 2)
- Document processing pipeline tested — handles PDFs, HTML, tables, and edge cases
- Content hashing implemented — incremental indexing reduces costs by 80%+
- Elasticsearch index mapping reviewed — correct field types, analyzers, and HNSW params
- NRT indexing pipeline deployed — new content searchable within seconds if needed
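The content-hashing item above can be sketched as a change-detection pass: hash each document, compare against the hash recorded at the last run, and reindex only what changed. Helper names here are illustrative, not from the course code:

```python
import hashlib


def content_hash(doc_text: str) -> str:
    """Stable hash of document content for change detection."""
    return hashlib.sha256(doc_text.encode("utf-8")).hexdigest()


def docs_to_reindex(docs: dict, seen_hashes: dict) -> list:
    """Return IDs of documents whose content changed since the last run.

    docs: {doc_id: text}; seen_hashes: {doc_id: hash}, updated in place.
    """
    changed = []
    for doc_id, text in docs.items():
        h = content_hash(text)
        if seen_hashes.get(doc_id) != h:
            changed.append(doc_id)
            seen_hashes[doc_id] = h
    return changed
```

Unchanged documents are skipped entirely, which is where the indexing-cost savings come from.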
Retrieval & Ranking (Lesson 3)
- Hybrid search configured — BM25 + vector with RRF or tuned alpha
- Cross-encoder re-ranking deployed — top 50–100 candidates re-ranked
- Business rules layer implemented — sponsored, blocked, freshness boosts
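The RRF option in the hybrid-search item merges BM25 and vector result lists without any score calibration — each document's fused score is the sum of `1/(k + rank)` over the lists it appears in. A minimal sketch (list inputs are doc-ID lists in rank order; `k=60` is the commonly used constant):

```python
def rrf_fuse(bm25_ids: list, vector_ids: list, k: int = 60) -> list:
    """Reciprocal Rank Fusion: combine two ranked lists of doc IDs.

    Scores depend only on rank positions, so BM25 and cosine scores
    never need to be put on a common scale.
    """
    scores = {}
    for ranked in (bm25_ids, vector_ids):
        for rank, doc_id in enumerate(ranked, 1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked well in both lists (like "a" below) outranks one that is top-ranked in only one list.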
Query Understanding (Lesson 4)
- Spell correction with domain dictionary — brand names, product terms, tech terms added
- Intent classification deployed — navigational, informational, transactional routing
- Entity recognition extracting filters — brand, color, price from free text
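The filter-extraction item above can be sketched with vocabulary lookups plus a price pattern. The vocabularies and regex here are toy assumptions; a real system would load brand/color lists from config and use a trained NER model for harder cases:

```python
import re

# Hypothetical domain vocabularies — stand-ins for real config data.
BRANDS = {"sony", "bose", "apple"}
COLORS = {"black", "white", "red"}
PRICE_RE = re.compile(r"under \$?(\d+)")


def extract_filters(query: str) -> dict:
    """Pull structured filters (brand, color, max price) out of free text."""
    lowered = query.lower()
    tokens = lowered.split()
    filters = {}
    brands = [t for t in tokens if t in BRANDS]
    if brands:
        filters["brand"] = brands[0]
    colors = [t for t in tokens if t in COLORS]
    if colors:
        filters["color"] = colors[0]
    m = PRICE_RE.search(lowered)
    if m:
        filters["max_price"] = int(m.group(1))
    return filters
```

The extracted filters become structured query clauses, leaving only the residual terms ("headphones") for full-text matching.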
Infrastructure (Lesson 6)
- Cluster properly sized — dedicated masters, hot/warm data nodes, coordinating nodes
- Caching layers deployed — result cache, embedding cache, ES request cache
- Monitoring dashboard live — latency percentiles, error rate, cache hit rate, zero-result rate
- Alerting configured — p95 latency spikes, error rate increases, indexing lag
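The result-cache item above amounts to a keyed store with a TTL. A minimal in-process sketch (a production deployment would typically use Redis or the ES request cache instead):

```python
import time


class ResultCache:
    """Tiny in-process result cache with TTL expiry.

    Illustrative only — no eviction policy or size bound; a real
    deployment would use Redis with a TTL per key.
    """

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._store = {}

    def get(self, query: str):
        entry = self._store.get(query)
        if entry is None:
            return None
        results, stored_at = entry
        if time.time() - stored_at > self.ttl:
            del self._store[query]  # expired: drop and miss
            return None
        return results

    def put(self, query: str, results) -> None:
        self._store[query] = (results, time.time())
```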
Measurement (This Lesson)
- Evaluation dataset created — 100+ queries with graded relevance labels
- Baseline metrics recorded — MRR, NDCG@10, zero-result rate, CTR
- A/B testing framework deployed — deterministic user assignment, metric logging
- Click logging pipeline deployed — query, results, clicks, dwell time all captured
Frequently Asked Questions
**Should I use Elasticsearch or a dedicated vector database?**

Use Elasticsearch if you need full-text search AND vector search in one system (the most common case). Use a dedicated vector database (Pinecone, Qdrant, Weaviate) if vector search is your primary use case and you need advanced vector features like quantization or multi-tenancy. For most teams, Elasticsearch 8.x with native kNN search is the right starting point — it avoids managing two systems.
**When do I need to worry about scale?**

A single Elasticsearch node with 64GB RAM can handle 10–50 million documents with sub-100ms latency. Vector search with HNSW needs about 7.5GB RAM per million 1536-d vectors. Start worrying about scaling at 10M+ documents or 1,000+ queries per second. Before that, focus on search quality (relevance tuning) rather than infrastructure.
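The ~7.5GB figure follows from raw float storage plus HNSW graph overhead: 1M × 1536 dims × 4 bytes ≈ 6.1GB raw. A back-of-envelope helper (the ~1.25× overhead factor is our assumption — measure on your own index):

```python
def hnsw_ram_gb(num_vectors: int, dims: int,
                bytes_per_float: int = 4, graph_overhead: float = 1.25) -> float:
    """Rough HNSW RAM estimate: raw float32 storage times a graph-overhead
    factor. The overhead factor is an assumption; actual usage depends on
    HNSW parameters (m, ef_construction) and the engine."""
    raw_bytes = num_vectors * dims * bytes_per_float
    return raw_bytes * graph_overhead / 1e9
```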
**How do I handle multilingual search?**

For BM25: use language-specific analyzers in Elasticsearch (one field per language, or language-specific indices). For vector search: use a multilingual embedding model (Cohere embed-v4, multilingual-e5-large) that maps all languages into the same vector space. A query in English can then retrieve documents in French or Japanese. Test retrieval quality per language — it varies significantly.
**Which embedding model should I choose?**

Start with OpenAI text-embedding-3-small (1536-d, $0.02/1M tokens) for general purpose. If cost is a concern, use text-embedding-3-small with reduced dimensions (256-d or 512-d). For open-source, nomic-embed-text or bge-large-en-v1.5 are strong options. Always benchmark against YOUR data — MTEB leaderboard scores do not always predict domain-specific performance.
**How do I know whether my search is good enough?**

Track these metrics weekly: MRR > 0.5 (first relevant result in the top 2 on average), zero-result rate < 5%, CTR > 50%, and p95 latency < 200ms. More importantly, review the bottom 10% of queries manually each week. If users consistently reformulate queries or leave without clicking, those queries are failing. Fix the failure categories one at a time.
**When should I add personalization?**

Add personalization after your baseline relevance is solid (MRR > 0.5, NDCG > 0.6). Personalization on top of poor relevance makes things worse — you are personalizing bad results. Start with location (easy, high impact) and session context (no login required). Add historical personalization only when you have enough user data (1,000+ interactions per user segment).
**How should I handle queries that return zero results?**

Zero results is the worst search experience. Layer your fallback strategy: (1) spell-correct and retry, (2) relax filters and retry, (3) use semantic search with no filters, (4) show "did you mean?" suggestions, (5) show popular/trending results as a final fallback. Log all zero-result queries and review them weekly to identify gaps in your index or query understanding.
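A condensed sketch of that layered fallback (only the spell-correct and popular-results layers, with stand-in callables for the real pipeline components):

```python
def search_with_fallbacks(query: str, search_fn, spell_correct, popular_results):
    """Never return an empty page: try the query, then a spell-corrected
    retry, then popular results. search_fn, spell_correct, and
    popular_results are stand-ins for real pipeline components.

    Returns (results, source) so the fallback path can be logged.
    """
    results = search_fn(query)
    if results:
        return results, "primary"
    corrected = spell_correct(query)
    if corrected != query:
        results = search_fn(corrected)
        if results:
            return results, "spell_corrected"
    return popular_results, "popular_fallback"
```

Logging the `source` tag per search is what makes the weekly zero-result review possible.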
**What is the biggest mistake search teams make?**

Not measuring. Teams build a search system, ship it, and never look at whether users are finding what they need. Without metrics (MRR, NDCG, CTR, zero-result rate), you are flying blind. The second biggest mistake is optimizing infrastructure (more shards, bigger clusters) when the problem is relevance (bad ranking, missing synonyms, no spell correction). Fix relevance first, then scale.
Course Summary
| Week | Focus | Deliverable |
|---|---|---|
| 1 | Indexing + Hybrid Search | Documents indexed in Elasticsearch with BM25 + vector search working |
| 2 | Ranking Pipeline | Multi-stage ranking with cross-encoder re-ranking deployed |
| 3 | Query Understanding | Spell correction, intent detection, and entity extraction live |
| 4 | Evaluation | 100+ query evaluation set, baseline metrics, A/B test framework |
| 5 | Optimization | Caching, latency optimization, monitoring dashboard deployed |
| 6 | Launch | Production deployment with A/B testing and click logging live |
Lilly Tech Systems