# Scaling Search Infrastructure
When your search system grows from thousands to billions of documents and from hundreds to millions of queries per day, you need different architecture patterns. This lesson covers Elasticsearch cluster design, vector index scaling, caching, geo-distribution, and latency optimization.
## Elasticsearch/OpenSearch Cluster Design
A production Elasticsearch cluster separates concerns across node roles for stability and performance:
| Node Role | Responsibility | Sizing | Count |
|---|---|---|---|
| Master | Cluster state, shard allocation, index creation | 4 vCPU, 8GB RAM, 50GB SSD | 3 (odd number for quorum) |
| Data (hot) | Active indices, recent data, search and indexing | 16 vCPU, 64GB RAM, 1TB NVMe SSD | 3–12 (scale horizontally) |
| Data (warm) | Older indices, read-only, lower query volume | 8 vCPU, 32GB RAM, 4TB HDD | 2–6 |
| Coordinating | Route queries, merge results from data nodes | 8 vCPU, 16GB RAM | 2–4 (behind load balancer) |
| ML/Vector | Vector search, cross-encoder re-ranking | 8 vCPU, 32GB RAM, GPU optional | 2–4 |
```yaml
# Elasticsearch cluster configuration for production
# elasticsearch.yml on a data node
cluster.name: search-production
node.name: data-hot-1
node.roles: [data_hot, data_content]

# Memory: give 50% of RAM to the JVM heap, capped at ~31GB
# (above ~32GB the JVM loses compressed object pointers).
# The other 50% is used by the OS for the filesystem cache.
# ES_JAVA_OPTS: "-Xms31g -Xmx31g"

# Shard sizing rules of thumb:
# - Target 20-40GB per shard for keyword search
# - Target 10-20GB per shard for vector search (HNSW uses more memory)
# - Max 20 shards per GB of heap memory
# - Each shard has overhead: aim for fewer, larger shards
```
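The shard-sizing rules above reduce to simple arithmetic. A minimal helper, sketched here with an illustrative 30GB target (the midpoint of the 20–40GB rule of thumb):

```python
import math

def recommended_shard_count(index_size_gb: float,
                            target_shard_gb: float = 30.0) -> int:
    """Primary-shard count from the 20-40GB-per-shard rule of thumb.

    Rounds up so that no shard exceeds the target size; a 300GB index
    at a 30GB target yields 10 primary shards.
    """
    return max(1, math.ceil(index_size_gb / target_shard_gb))
```

You would set this at index creation via `number_of_shards`; shard count cannot be increased later without reindexing (only reduced via the shrink action).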
Index lifecycle management moves indices from the hot tier to warm and eventually deletes them:

```json
PUT _ilm/policy/search-lifecycle
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": { "max_size": "50gb", "max_age": "7d" },
          "set_priority": { "priority": 100 }
        }
      },
      "warm": {
        "min_age": "30d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 },
          "set_priority": { "priority": 50 }
        }
      },
      "delete": {
        "min_age": "365d",
        "actions": { "delete": {} }
      }
    }
  }
}
```
## Vector Index Scaling (Billions of Vectors)
HNSW indices must fit in memory for fast search. At billion-scale, you need to partition vectors across nodes and use quantization to reduce memory.
```python
# Memory calculation for HNSW vector indices
def calculate_hnsw_memory(
    num_vectors: int,
    dimensions: int,
    m: int = 16,             # HNSW connections per node
    bytes_per_dim: int = 4,  # float32
) -> dict:
    """Calculate memory requirements for an HNSW index."""
    # Vector storage
    vector_bytes = num_vectors * dimensions * bytes_per_dim
    # Graph storage: m bidirectional edges per node, stored as int32 ids
    graph_bytes = num_vectors * m * 2 * 4
    # Metadata overhead: ~100 bytes per vector
    metadata_bytes = num_vectors * 100
    total_bytes = vector_bytes + graph_bytes + metadata_bytes
    total_gb = total_bytes / (1024**3)
    return {
        "vectors_gb": vector_bytes / (1024**3),
        "graph_gb": graph_bytes / (1024**3),
        "metadata_gb": metadata_bytes / (1024**3),
        "total_gb": round(total_gb, 1),
        "recommended_ram_gb": round(total_gb * 1.3, 1),  # 30% headroom
    }

# Examples (1536-d float32; "RAM" includes the 30% headroom):
print(calculate_hnsw_memory(1_000_000, 1536))      # 1M vectors:   ~5.9 GB index, ~7.7 GB RAM
print(calculate_hnsw_memory(10_000_000, 1536))     # 10M vectors:  ~59 GB index, ~77 GB RAM
print(calculate_hnsw_memory(100_000_000, 1536))    # 100M vectors: ~593 GB index, ~771 GB RAM
print(calculate_hnsw_memory(1_000_000_000, 1536))  # 1B vectors:   ~5.8 TB index, ~7.5 TB RAM

# At 100M+ vectors, you MUST use one of these strategies:
# 1. Shard across multiple nodes (most common)
# 2. Product quantization (PQ): e.g. encode each 1536-d float32 vector
#    as 96 one-byte codes, ~64x less vector memory
# 3. Scalar quantization: float32 -> int8, 4x less memory
# 4. Reduce embedding dimensions (text-embedding-3-small supports
#    shortened embeddings, e.g. 256-d)
```
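Strategy 3 is simple enough to show end to end. A minimal pure-Python sketch of symmetric int8 scalar quantization; real systems use the vector store's built-in quantization, and the one-scale-per-vector scheme here is an illustrative simplification:

```python
def quantize_int8(vector: list[float]) -> tuple[list[int], float]:
    """Symmetric scalar quantization: float32 -> int8, 4x smaller.

    Stores one scale factor per vector so the codes can be
    approximately dequantized at search time.
    """
    scale = max(abs(v) for v in vector) / 127 or 1.0
    codes = [round(v / scale) for v in vector]
    return codes, scale

def dequantize(codes: list[int], scale: float) -> list[float]:
    """Approximate reconstruction; error per dimension is at most scale/2."""
    return [c * scale for c in codes]
```

Each dimension shrinks from 4 bytes to 1, so the 1B-vector example above drops from roughly 5.7TB of raw vectors to about 1.4TB, at the cost of a small recall loss that is usually recovered by re-scoring top candidates with the original vectors.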
### Scaling Strategy: Sharded Vector Search
```python
import asyncio
import hashlib

class ShardedVectorSearch:
    """Distribute vector search across multiple nodes."""

    def __init__(self, shard_clients: list):
        self.shards = shard_clients  # List of Qdrant/Pinecone clients

    async def search(self, query_vector, top_k=10, filters=None):
        """Search all shards in parallel and merge results."""
        async def search_shard(shard, shard_id):
            results = await shard.search(
                query_vector=query_vector,
                limit=top_k * 2,  # Over-fetch for better merge quality
                filters=filters,
            )
            return [(r.id, r.score, shard_id) for r in results]

        # Fan-out: query all shards in parallel
        tasks = [search_shard(s, i) for i, s in enumerate(self.shards)]
        shard_results = await asyncio.gather(*tasks)

        # Merge: combine and re-sort by score
        all_results = []
        for results in shard_results:
            all_results.extend(results)
        all_results.sort(key=lambda x: x[1], reverse=True)
        return all_results[:top_k]

    def assign_shard(self, doc_id: str) -> int:
        """Deterministic shard assignment by hashing the document ID.

        Note: modulo hashing is not consistent hashing -- adding a
        shard reassigns most documents. Use a hash ring if you expect
        to reshard.
        """
        hash_val = int(hashlib.md5(doc_id.encode()).hexdigest(), 16)
        return hash_val % len(self.shards)
```
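Plain modulo hashing reassigns almost every document when the shard count changes, forcing a near-total re-migration. A minimal consistent-hash ring avoids that; the virtual-node count of 100 is an illustrative choice:

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring for shard assignment.

    Adding or removing a shard moves only ~1/N of the documents,
    versus nearly all of them under modulo hashing. Virtual nodes
    smooth out the distribution across shards.
    """

    def __init__(self, shard_ids: list[str], vnodes: int = 100):
        self.ring = sorted(
            (self._hash(f"{sid}:{v}"), sid)
            for sid in shard_ids
            for v in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def assign(self, doc_id: str) -> str:
        """Walk clockwise to the first virtual node at or after the hash."""
        idx = bisect.bisect(self.keys, self._hash(doc_id)) % len(self.ring)
        return self.ring[idx][1]
```

Swapping this in for `assign_shard` keeps assignment deterministic while making shard-count changes incremental rather than a full reshuffle.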
## Caching Strategies
Caching is the single most effective latency optimization. Layer your caches:
| Cache Layer | What It Caches | Hit Rate | Latency Saved |
|---|---|---|---|
| Query result cache | Exact query string to final results | 15–30% | 90–95% (skip entire pipeline) |
| Semantic query cache | Similar queries mapped to same results | 10–20% | 80–90% |
| Embedding cache | Query string to embedding vector | 20–40% | 30–50ms (skip API call) |
| ES request cache | Elasticsearch query to raw results | 30–50% | 10–20ms |
| OS file cache | Index segments in memory | 80–95% | Disk I/O avoided |
```python
import redis
import json
import hashlib
from typing import Optional

class SearchCache:
    """Multi-layer search cache with TTL management."""

    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client

    def get_results(self, query: str, filters: dict) -> Optional[list]:
        """Check the exact-match result cache."""
        key = self._cache_key(query, filters)
        cached = self.redis.get(f"search:exact:{key}")
        if cached:
            return json.loads(cached)
        return None

    def set_results(self, query: str, filters: dict, results: list,
                    ttl: int = 300):
        """Cache results for 5 minutes by default."""
        key = self._cache_key(query, filters)
        self.redis.setex(f"search:exact:{key}", ttl, json.dumps(results))

    def get_embedding(self, text: str) -> Optional[list]:
        """Check the embedding cache to avoid repeated API calls."""
        key = hashlib.md5(text.encode()).hexdigest()
        cached = self.redis.get(f"search:emb:{key}")
        if cached:
            return json.loads(cached)
        return None

    def set_embedding(self, text: str, embedding: list, ttl: int = 86400):
        """Cache embeddings for 24 hours."""
        key = hashlib.md5(text.encode()).hexdigest()
        self.redis.setex(f"search:emb:{key}", ttl, json.dumps(embedding))

    def _cache_key(self, query: str, filters: dict) -> str:
        raw = f"{query.lower().strip()}|{json.dumps(filters, sort_keys=True)}"
        return hashlib.md5(raw.encode()).hexdigest()
```
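The cache-layer table also lists a semantic query cache, which exact-match keys cannot provide. A minimal in-memory sketch using cosine similarity over cached query embeddings; the 0.97 threshold and the linear scan are illustrative assumptions (production versions store embeddings in a vector index, e.g. a Redis vector field):

```python
import math

class SemanticQueryCache:
    """Serve cached results for queries whose embeddings nearly match.

    Catches paraphrases like "reset my password" vs "how do I reset
    my password" that an exact-string cache misses.
    """

    def __init__(self, threshold: float = 0.97):
        self.threshold = threshold
        self.entries: list[tuple[list, list]] = []  # (embedding, results)

    @staticmethod
    def _cosine(a, b) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = (math.sqrt(sum(x * x for x in a))
                * math.sqrt(sum(y * y for y in b)))
        return dot / norm if norm else 0.0

    def get(self, query_embedding):
        """Linear scan; return results of the first near-duplicate query."""
        for emb, results in self.entries:
            if self._cosine(query_embedding, emb) >= self.threshold:
                return results
        return None

    def put(self, query_embedding, results):
        self.entries.append((query_embedding, results))
```

The threshold trades hit rate against the risk of serving results for a subtly different query; it should be tuned against logged query pairs.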
## Geo-Distributed Search
For global applications, deploy search clusters in multiple regions to reduce latency:
```python
# Geo-distributed search architecture
#
# Strategy 1: Full replication (simple, consistent)
# - Full index copy in each region (US-East, EU-West, AP-Southeast)
# - Writes go to the primary region and replicate asynchronously to the others
# - Reads served by the nearest region
# - Replication lag: 1-5 seconds
# - Best for: read-heavy workloads, content that doesn't change per-region
#
# Strategy 2: Geo-partitioned (lower cost, some complexity)
# - Each region holds only local content + global content
# - "Restaurants near me" searches the local index only
# - "Python tutorial" searches the global index
# - Best for: location-heavy content (marketplaces, local services)

import math

class GeoRouter:
    """Route search queries to the nearest cluster."""

    def __init__(self, clusters: dict):
        # {"us-east": ESClient, "eu-west": ESClient, "ap-southeast": ESClient}
        self.clusters = clusters
        self.region_coords = {
            "us-east": (39.0, -77.0),
            "eu-west": (53.0, -8.0),
            "ap-southeast": (1.3, 103.8),
        }

    def route(self, user_lat: float, user_lng: float) -> str:
        """Find the nearest cluster by haversine distance."""
        nearest = min(
            self.region_coords.items(),
            key=lambda r: self._distance(user_lat, user_lng, r[1][0], r[1][1]),
        )
        return nearest[0]

    @staticmethod
    def _distance(lat1: float, lng1: float, lat2: float, lng2: float) -> float:
        """Haversine great-circle distance in kilometers."""
        d_lat = math.radians(lat2 - lat1)
        d_lng = math.radians(lng2 - lng1)
        a = (math.sin(d_lat / 2) ** 2
             + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2))
             * math.sin(d_lng / 2) ** 2)
        return 2 * 6371 * math.asin(math.sqrt(a))

    def search(self, query: str, user_lat: float, user_lng: float, **kwargs):
        region = self.route(user_lat, user_lng)
        client = self.clusters[region]
        return client.search(query=query, **kwargs)
```
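Nearest-region routing needs a failover path for when that region's cluster is unhealthy. A self-contained sketch; the `is_healthy` callable stands in for whatever health check your deployment exposes (an assumption, not a specific API):

```python
import math

def route_with_failover(region_coords: dict, user_lat: float,
                        user_lng: float, is_healthy) -> str:
    """Pick the nearest healthy region, falling back to the next-nearest.

    region_coords maps region name -> (lat, lng); is_healthy is a
    callable (region -> bool), assumed to wrap your health checks.
    """
    def haversine(lat1, lng1, lat2, lng2):
        d_lat = math.radians(lat2 - lat1)
        d_lng = math.radians(lng2 - lng1)
        a = (math.sin(d_lat / 2) ** 2
             + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2))
             * math.sin(d_lng / 2) ** 2)
        return 2 * 6371 * math.asin(math.sqrt(a))

    # Try regions in order of increasing distance from the user
    for region, (lat, lng) in sorted(
        region_coords.items(),
        key=lambda r: haversine(user_lat, user_lng, r[1][0], r[1][1]),
    ):
        if is_healthy(region):
            return region
    raise RuntimeError("no healthy search region available")
```

Failing over to a farther region adds tens of milliseconds of network latency but keeps search available during a regional outage.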
## Latency Optimization Checklist
Follow this checklist to achieve sub-100ms search latency at scale:
### Index Optimization

- Shard size 20–40GB; prefer fewer, larger shards
- Force-merge read-only indices to 1 segment
- Use `keyword` for exact-match fields, `text` only for full-text
- Disable `_source` if you do not need full documents
### Query Optimization

- Use `filter` context for non-scoring clauses (cached by ES)
- Limit `knn.num_candidates` to 100–200
- Use `terminate_after` for approximate counts
- Avoid wildcard queries at the start of terms
### Infrastructure
- NVMe SSDs for hot data nodes
- 50% of RAM for JVM heap, 50% for OS file cache
- Coordinating nodes behind a load balancer
- Network: same availability zone for data + coordinating
### Application Layer
- Cache query embeddings (saves 30–50ms per query)
- Cache full results for popular queries (saves 80–95ms)
- Connection pooling to Elasticsearch
- Async cross-encoder re-ranking with timeout fallback
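The last item above, async re-ranking with a timeout fallback, can be sketched in a few lines; the 150ms budget and the `rerank_fn` callable are illustrative assumptions:

```python
import asyncio

async def rerank_with_fallback(candidates: list, rerank_fn,
                               timeout_s: float = 0.15) -> list:
    """Apply cross-encoder re-ranking within a latency budget.

    rerank_fn is an assumed async callable that returns the candidates
    in re-ranked order. If it exceeds the budget, serve the first-stage
    retrieval order instead of blowing the overall latency target.
    """
    try:
        return await asyncio.wait_for(rerank_fn(candidates), timeout=timeout_s)
    except asyncio.TimeoutError:
        return candidates  # Fallback: unreranked, but on time
```

The trade-off is explicit: occasionally serving slightly worse rankings in exchange for a hard cap on tail latency.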
## Key Takeaways
- Separate Elasticsearch node roles: dedicated masters, hot/warm data nodes, coordinating nodes behind a load balancer.
- HNSW vector indices need roughly 6GB of index (~7.7GB of RAM with headroom) per 1M vectors (1536-d, float32). At 100M+, use sharding, quantization, or dimension reduction.
- Layer your caches: result cache (highest impact), embedding cache (saves API calls), ES request cache, OS file cache.
- Geo-distribute clusters for global users. Full replication is simpler; geo-partitioning is cheaper for location-heavy content.
- Target 20–40GB per shard, force-merge read-only indices, and use filter context for non-scoring clauses.
## What's Next
In the final lesson, we cover best practices and a production checklist — search quality metrics (MRR, NDCG), A/B testing search changes, relevance tuning, and a comprehensive FAQ for search engineers.
Lilly Tech Systems