Best Practices & Checklist
This final lesson distills everything into actionable metrics, testing strategies, and a production checklist. You will learn how to measure search quality, run A/B tests on ranking changes, and avoid the most common mistakes search engineers make.
Search Quality Metrics
You cannot improve what you do not measure. These are the metrics every search team must track:
| Metric | What It Measures | Formula | Good Target |
|---|---|---|---|
| MRR | Mean Reciprocal Rank — how high the first relevant result appears | Average of 1/rank of first relevant result | > 0.5 |
| NDCG@k | Normalized Discounted Cumulative Gain — overall ranking quality | DCG@k / ideal DCG@k (accounts for graded relevance) | > 0.6 at k=10 |
| MAP | Mean Average Precision — precision across all recall levels | Mean of AP for each query (AP = area under P-R curve) | > 0.4 |
| Zero-result rate | Percentage of queries that return no results | Queries with 0 results / total queries | < 5% |
| Click-through rate | Percentage of searches where user clicks a result | Searches with clicks / total searches | > 50% |
| p50/p95/p99 latency | Search response time at percentiles | Percentile of response time distribution | p50 < 50ms, p95 < 200ms |
```python
import numpy as np


def mrr(ranked_results: list, relevant_ids: set) -> float:
    """Mean Reciprocal Rank for a single query."""
    for rank, result in enumerate(ranked_results, 1):
        if result.doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_results: list, relevance_labels: dict, k: int = 10) -> float:
    """NDCG@k for a single query."""
    def dcg(scores, k):
        return sum(
            (2**score - 1) / np.log2(rank + 2)
            for rank, score in enumerate(scores[:k])
        )

    actual_scores = [relevance_labels.get(r.doc_id, 0) for r in ranked_results]
    ideal_scores = sorted(relevance_labels.values(), reverse=True)
    actual_dcg = dcg(actual_scores, k)
    ideal_dcg = dcg(ideal_scores, k)
    return actual_dcg / ideal_dcg if ideal_dcg > 0 else 0.0


def evaluate_search(queries: list, search_fn, ground_truth: dict) -> dict:
    """Run full evaluation across a query set."""
    mrr_scores, ndcg_scores = [], []
    for query_data in queries:
        query = query_data["query"]
        relevant = set(query_data["relevant_doc_ids"])
        labels = query_data.get("relevance_labels", {})
        results = search_fn(query)
        mrr_scores.append(mrr(results, relevant))
        if labels:
            ndcg_scores.append(ndcg_at_k(results, labels))
    return {
        "mrr": round(np.mean(mrr_scores), 4),
        "ndcg@10": round(np.mean(ndcg_scores), 4) if ndcg_scores else None,
        "num_queries": len(queries),
    }
```
A/B Testing Search Changes
Every ranking change must be validated with an A/B test. Here is the production pattern:
```python
import hashlib


class SearchABTest:
    """Route users to control or treatment search pipelines."""

    def __init__(self, test_name: str, treatment_pct: float = 0.1):
        self.test_name = test_name
        self.treatment_pct = treatment_pct
        self.events = []  # buffered locally; in production, ship to analytics

    def get_variant(self, user_id: str) -> str:
        """Deterministic assignment: same user always gets same variant."""
        hash_input = f"{self.test_name}:{user_id}"
        hash_val = int(hashlib.md5(hash_input.encode()).hexdigest(), 16)
        bucket = (hash_val % 1000) / 1000
        return "treatment" if bucket < self.treatment_pct else "control"

    def search(self, query, user_id, control_pipeline, treatment_pipeline):
        variant = self.get_variant(user_id)
        if variant == "treatment":
            results = treatment_pipeline.search(query)
        else:
            results = control_pipeline.search(query)
        # Log for analysis
        self._log_event({
            "test": self.test_name,
            "variant": variant,
            "user_id": user_id,
            "query": query,
            "result_ids": [r.doc_id for r in results[:10]],
        })
        return results

    def _log_event(self, event: dict) -> None:
        # Minimal stub: replace with your real event logger
        self.events.append(event)


# Key A/B test metrics to track:
# - Click-through rate (CTR) per variant
# - Mean Reciprocal Rank (MRR) from clicks
# - Zero-result rate
# - Reformulation rate (user changes query)
# - Revenue per search (for e-commerce)
# Minimum sample: 1,000 searches per variant for meaningful results
```
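Once both variants have enough traffic, you need a way to tell a real CTR difference from noise. A standard two-proportion z-test is enough for CTR; this sketch (the helper name `ctr_z_test` is ours, not from the course) uses only the standard library:

```python
import math


def ctr_z_test(clicks_a: int, searches_a: int,
               clicks_b: int, searches_b: int) -> tuple:
    """Two-proportion z-test for the CTR difference between variants.

    Returns (z, two_sided_p_value). Small p (< 0.05) suggests the
    difference is unlikely to be chance.
    """
    p_a = clicks_a / searches_a
    p_b = clicks_b / searches_b
    # Pooled proportion under the null hypothesis (no difference)
    p_pool = (clicks_a + clicks_b) / (searches_a + searches_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / searches_a + 1 / searches_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF (via erf)
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value
```

For example, 500/1000 clicks in control vs 560/1000 in treatment gives p ≈ 0.007, a significant lift; the same CTR in both arms gives p ≈ 1.0.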
Relevance Tuning Process
Follow this systematic process to improve search relevance:
Step 1: Build Evaluation Set
Create 100+ query-result pairs with graded relevance labels (0=irrelevant, 1=partial, 2=relevant, 3=perfect). Include head queries (popular), torso queries (medium), and tail queries (rare). This is your ground truth.
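An evaluation-set entry can be as simple as a dict per query. The doc IDs and labels below are made up for illustration; the field names match the `evaluate_search` harness shown earlier in this lesson:

```python
# Hypothetical evaluation-set entries (doc IDs are invented).
# Labels use the graded scale: 0=irrelevant, 1=partial, 2=relevant, 3=perfect.
eval_set = [
    {  # head query: popular, many matching documents
        "query": "wireless headphones",
        "relevant_doc_ids": ["doc_101", "doc_205"],
        "relevance_labels": {"doc_101": 3, "doc_205": 2, "doc_330": 1},
    },
    {  # tail query: rare, often a single good answer
        "query": "replacement ear pads wh-1000xm4",
        "relevant_doc_ids": ["doc_877"],
        "relevance_labels": {"doc_877": 3},
    },
]
```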
Step 2: Baseline Metrics
Run evaluation and record MRR, NDCG@10, and zero-result rate. This is your starting point. Every change must be measured against this baseline. Log all results for comparison.
Step 3: Identify Failure Categories
Manually review the worst 20% of queries. Categorize failures: wrong entity resolution? Missing synonyms? Bad chunking? Ranking issue? Each category suggests a specific fix from this course.
Step 4: Implement and Measure
Make one change at a time. Re-run evaluation after each change. If NDCG improves by 2%+ on your evaluation set, A/B test it in production. If it regresses, revert. Track the impact of every change.
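The 2%+ threshold from Step 4 can be encoded as a simple ship gate. The helper name and the metric-dict shape are ours (matching the output of the evaluation harness earlier in this lesson):

```python
def should_ship(baseline: dict, candidate: dict, min_lift: float = 0.02) -> bool:
    """Gate a ranking change: require a relative NDCG@10 lift of min_lift
    (default 2%) on the offline eval set before promoting to an A/B test."""
    base = baseline["ndcg@10"]
    cand = candidate["ndcg@10"]
    return (cand - base) / base >= min_lift
```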
Production Deployment Checklist
Indexing (Lesson 2)
- Document processing pipeline tested — handles PDFs, HTML, tables, and edge cases
- Content hashing implemented — incremental indexing reduces costs by 80%+
- Elasticsearch index mapping reviewed — correct field types, analyzers, and HNSW params
- NRT indexing pipeline deployed — new content searchable within seconds if needed
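The content-hashing item above can be sketched as a change-detection pass: hash each document, compare against the hash recorded at the last run, and reindex only what changed. Helper names here are illustrative, not from the course code:

```python
import hashlib


def content_hash(doc_text: str) -> str:
    """Stable hash of document content for change detection."""
    return hashlib.sha256(doc_text.encode("utf-8")).hexdigest()


def docs_to_reindex(docs: dict, seen_hashes: dict) -> list:
    """Return IDs of documents whose content changed since the last run.

    docs: {doc_id: text}; seen_hashes: {doc_id: hash}, updated in place.
    """
    changed = []
    for doc_id, text in docs.items():
        h = content_hash(text)
        if seen_hashes.get(doc_id) != h:
            changed.append(doc_id)
            seen_hashes[doc_id] = h
    return changed
```

Unchanged documents are skipped entirely, which is where the indexing-cost savings come from.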
Retrieval & Ranking (Lesson 3)
- Hybrid search configured — BM25 + vector with RRF or tuned alpha
- Cross-encoder re-ranking deployed — top 50–100 candidates re-ranked
- Business rules layer implemented — sponsored, blocked, freshness boosts
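The RRF option in the hybrid-search item merges BM25 and vector result lists without any score calibration — each document's fused score is the sum of `1/(k + rank)` over the lists it appears in. A minimal sketch (list inputs are doc-ID lists in rank order; `k=60` is the commonly used constant):

```python
def rrf_fuse(bm25_ids: list, vector_ids: list, k: int = 60) -> list:
    """Reciprocal Rank Fusion: combine two ranked lists of doc IDs.

    Scores depend only on rank positions, so BM25 and cosine scores
    never need to be put on a common scale.
    """
    scores = {}
    for ranked in (bm25_ids, vector_ids):
        for rank, doc_id in enumerate(ranked, 1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked well in both lists (like "a" below) outranks one that is top-ranked in only one list.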
Query Understanding (Lesson 4)
- Spell correction with domain dictionary — brand names, product terms, tech terms added
- Intent classification deployed — navigational, informational, transactional routing
- Entity recognition extracting filters — brand, color, price from free text
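The filter-extraction item above can be sketched with vocabulary lookups plus a price pattern. The vocabularies and regex here are toy assumptions; a real system would load brand/color lists from config and use a trained NER model for harder cases:

```python
import re

# Hypothetical domain vocabularies — stand-ins for real config data.
BRANDS = {"sony", "bose", "apple"}
COLORS = {"black", "white", "red"}
PRICE_RE = re.compile(r"under \$?(\d+)")


def extract_filters(query: str) -> dict:
    """Pull structured filters (brand, color, max price) out of free text."""
    lowered = query.lower()
    tokens = lowered.split()
    filters = {}
    brands = [t for t in tokens if t in BRANDS]
    if brands:
        filters["brand"] = brands[0]
    colors = [t for t in tokens if t in COLORS]
    if colors:
        filters["color"] = colors[0]
    m = PRICE_RE.search(lowered)
    if m:
        filters["max_price"] = int(m.group(1))
    return filters
```

The extracted filters become structured query clauses, leaving only the residual terms ("headphones") for full-text matching.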
Infrastructure (Lesson 6)
- Cluster properly sized — dedicated masters, hot/warm data nodes, coordinating nodes
- Caching layers deployed — result cache, embedding cache, ES request cache
- Monitoring dashboard live — latency percentiles, error rate, cache hit rate, zero-result rate
- Alerting configured — p95 latency spikes, error rate increases, indexing lag
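The result-cache item above amounts to a keyed store with a TTL. A minimal in-process sketch (a production deployment would typically use Redis or the ES request cache instead):

```python
import time


class ResultCache:
    """Tiny in-process result cache with TTL expiry.

    Illustrative only — no eviction policy or size bound; a real
    deployment would use Redis with a TTL per key.
    """

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._store = {}

    def get(self, query: str):
        entry = self._store.get(query)
        if entry is None:
            return None
        results, stored_at = entry
        if time.time() - stored_at > self.ttl:
            del self._store[query]  # expired: drop and miss
            return None
        return results

    def put(self, query: str, results) -> None:
        self._store[query] = (results, time.time())
```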
Measurement (This Lesson)
- Evaluation dataset created — 100+ queries with graded relevance labels
- Baseline metrics recorded — MRR, NDCG@10, zero-result rate, CTR
- A/B testing framework deployed — deterministic user assignment, metric logging
- Click logging pipeline deployed — query, results, clicks, dwell time all captured
Frequently Asked Questions
**Should I use Elasticsearch or a dedicated vector database?**

Use Elasticsearch if you need full-text search AND vector search in one system (the most common case). Use a dedicated vector database (Pinecone, Qdrant, Weaviate) if vector search is your primary use case and you need advanced vector features like quantization or multi-tenancy. For most teams, Elasticsearch 8.x with native kNN search is the right starting point — it avoids managing two systems.
**When do I need to worry about scale?**

A single Elasticsearch node with 64GB RAM can handle 10–50 million documents with sub-100ms latency. Vector search with HNSW needs about 7.5GB RAM per million 1536-d vectors. Start worrying about scaling at 10M+ documents or 1,000+ queries per second. Before that, focus on search quality (relevance tuning) rather than infrastructure.
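The ~7.5GB figure follows from raw float storage plus HNSW graph overhead: 1M × 1536 dims × 4 bytes ≈ 6.1GB raw. A back-of-envelope helper (the ~1.25× overhead factor is our assumption — measure on your own index):

```python
def hnsw_ram_gb(num_vectors: int, dims: int,
                bytes_per_float: int = 4, graph_overhead: float = 1.25) -> float:
    """Rough HNSW RAM estimate: raw float32 storage times a graph-overhead
    factor. The overhead factor is an assumption; actual usage depends on
    HNSW parameters (m, ef_construction) and the engine."""
    raw_bytes = num_vectors * dims * bytes_per_float
    return raw_bytes * graph_overhead / 1e9
```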
**How do I handle multilingual search?**

For BM25: use language-specific analyzers in Elasticsearch (one field per language, or language-specific indices). For vector search: use a multilingual embedding model (Cohere embed-v4, multilingual-e5-large) that maps all languages into the same vector space. A query in English can then retrieve documents in French or Japanese. Test retrieval quality per language — it varies significantly.
**Which embedding model should I choose?**

Start with OpenAI text-embedding-3-small (1536-d, $0.02/1M tokens) for general purpose. If cost is a concern, use text-embedding-3-small with reduced dimensions (256-d or 512-d). For open-source, nomic-embed-text or bge-large-en-v1.5 are strong options. Always benchmark against YOUR data — MTEB leaderboard scores do not always predict domain-specific performance.
**How do I know whether my search is good enough?**

Track these metrics weekly: MRR > 0.5 (first relevant result in the top 2 on average), zero-result rate < 5%, CTR > 50%, and p95 latency < 200ms. More importantly, review the bottom 10% of queries manually each week. If users consistently reformulate queries or leave without clicking, those queries are failing. Fix the failure categories one at a time.
**When should I add personalization?**

Add personalization after your baseline relevance is solid (MRR > 0.5, NDCG > 0.6). Personalization on top of poor relevance makes things worse — you are personalizing bad results. Start with location (easy, high impact) and session context (no login required). Add historical personalization only when you have enough user data (1,000+ interactions per user segment).
**How should I handle queries that return zero results?**

Zero results is the worst search experience. Layer your fallback strategy: (1) spell-correct and retry, (2) relax filters and retry, (3) use semantic search with no filters, (4) show "did you mean?" suggestions, (5) show popular/trending results as a final fallback. Log all zero-result queries and review them weekly to identify gaps in your index or query understanding.
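A condensed sketch of that layered fallback (only the spell-correct and popular-results layers, with stand-in callables for the real pipeline components):

```python
def search_with_fallbacks(query: str, search_fn, spell_correct, popular_results):
    """Never return an empty page: try the query, then a spell-corrected
    retry, then popular results. search_fn, spell_correct, and
    popular_results are stand-ins for real pipeline components.

    Returns (results, source) so the fallback path can be logged.
    """
    results = search_fn(query)
    if results:
        return results, "primary"
    corrected = spell_correct(query)
    if corrected != query:
        results = search_fn(corrected)
        if results:
            return results, "spell_corrected"
    return popular_results, "popular_fallback"
```

Logging the `source` tag per search is what makes the weekly zero-result review possible.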
**What is the biggest mistake search teams make?**

Not measuring. Teams build a search system, ship it, and never look at whether users are finding what they need. Without metrics (MRR, NDCG, CTR, zero-result rate), you are flying blind. The second biggest mistake is optimizing infrastructure (more shards, bigger clusters) when the problem is relevance (bad ranking, missing synonyms, no spell correction). Fix relevance first, then scale.
Course Summary
| Week | Focus | Deliverable |
|---|---|---|
| 1 | Indexing + Hybrid Search | Documents indexed in Elasticsearch with BM25 + vector search working |
| 2 | Ranking Pipeline | Multi-stage ranking with cross-encoder re-ranking deployed |
| 3 | Query Understanding | Spell correction, intent detection, and entity extraction live |
| 4 | Evaluation | 100+ query evaluation set, baseline metrics, A/B test framework |
| 5 | Optimization | Caching, latency optimization, monitoring dashboard deployed |
| 6 | Launch | Production deployment with A/B testing and click logging live |
Lilly Tech Systems