
Scaling Search Infrastructure

When your search system grows from thousands to billions of documents and from hundreds to millions of queries per day, you need different architecture patterns. This lesson covers Elasticsearch cluster design, vector index scaling, caching, geo-distribution, and latency optimization.

Elasticsearch/OpenSearch Cluster Design

A production Elasticsearch cluster separates concerns across node roles for stability and performance:

Node Role    | Responsibility                                   | Sizing                          | Count
Master       | Cluster state, shard allocation, index creation  | 4 vCPU, 8GB RAM, 50GB SSD       | 3 (odd number for quorum)
Data (hot)   | Active indices, recent data, search and indexing | 16 vCPU, 64GB RAM, 1TB NVMe SSD | 3–12 (scale horizontally)
Data (warm)  | Older indices, read-only, lower query volume     | 8 vCPU, 32GB RAM, 4TB HDD       | 2–6
Coordinating | Route queries, merge results from data nodes     | 8 vCPU, 16GB RAM                | 2–4 (behind load balancer)
ML/Vector    | Vector search, cross-encoder re-ranking          | 8 vCPU, 32GB RAM, GPU optional  | 2–4
# Elasticsearch cluster configuration for production
# elasticsearch.yml on a data node

cluster.name: search-production
node.name: data-hot-1
node.roles: [data_hot, data_content]

# Memory: give 50% of RAM to JVM heap, max 31GB
# The other 50% is used by the OS for file system cache
# ES_JAVA_OPTS: "-Xms31g -Xmx31g"

# Shard sizing rules of thumb:
# - Target 20-40GB per shard for keyword search
# - Target 10-20GB per shard for vector search (HNSW uses more memory)
# - Max 20 shards per GB of heap memory
# - Each shard has overhead: aim for fewer, larger shards

# Index lifecycle management
PUT _ilm/policy/search-lifecycle
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": { "max_size": "50gb", "max_age": "7d" },
          "set_priority": { "priority": 100 }
        }
      },
      "warm": {
        "min_age": "30d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 },
          "set_priority": { "priority": 50 }
        }
      },
      "delete": {
        "min_age": "365d",
        "actions": { "delete": {} }
      }
    }
  }
}
💡
Apply at work: The most common Elasticsearch scaling mistake is too many small shards. Each shard consumes memory for metadata regardless of size. A cluster with 10,000 tiny shards will be slower than one with 100 properly-sized shards. Target 20–40GB per shard and use index lifecycle management to merge old indices.
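
The 20–40GB target above converts directly into a primary shard count. A tiny sizing sketch (`recommend_shards` is a hypothetical helper name, using the 30GB midpoint by default):

```python
import math

def recommend_shards(index_size_gb: float, target_shard_gb: float = 30.0) -> int:
    """Primary shard count from total index size, using the
    20-40GB-per-shard rule of thumb (30GB midpoint by default)."""
    return max(1, math.ceil(index_size_gb / target_shard_gb))

print(recommend_shards(300))  # 300GB index -> 10 primary shards
print(recommend_shards(15))   # small index -> 1 shard
```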

Vector Index Scaling (Billions of Vectors)

HNSW indices must fit in memory for fast search. At billion-scale, you need to partition vectors across nodes and use quantization to reduce memory.

# Memory calculation for HNSW vector indices
def calculate_hnsw_memory(
    num_vectors: int,
    dimensions: int,
    m: int = 16,         # HNSW connections per node
    bytes_per_dim: int = 4  # float32
) -> dict:
    """Calculate memory requirements for an HNSW index."""
    # Vector storage: num_vectors * dimensions * bytes_per_dim
    vector_bytes = num_vectors * dimensions * bytes_per_dim

    # Graph storage: num_vectors * m * 2 * 4 bytes (bidirectional edges, int32)
    graph_bytes = num_vectors * m * 2 * 4

    # Metadata overhead: ~100 bytes per vector
    metadata_bytes = num_vectors * 100

    total_bytes = vector_bytes + graph_bytes + metadata_bytes
    total_gb = total_bytes / (1024**3)

    return {
        "vectors_gb": vector_bytes / (1024**3),
        "graph_gb": graph_bytes / (1024**3),
        "metadata_gb": metadata_bytes / (1024**3),
        "total_gb": round(total_gb, 1),
        "recommended_ram_gb": round(total_gb * 1.3, 1),  # 30% headroom
    }

# Examples:
print(calculate_hnsw_memory(1_000_000, 1536))      # 1M vectors: ~5.9 GB (~7.7 GB RAM with headroom)
print(calculate_hnsw_memory(10_000_000, 1536))     # 10M vectors: ~59 GB
print(calculate_hnsw_memory(100_000_000, 1536))    # 100M vectors: ~593 GB
print(calculate_hnsw_memory(1_000_000_000, 1536))  # 1B vectors: ~5.8 TB

# At 100M+ vectors, you MUST use one of these strategies:
# 1. Shard across multiple nodes (most common)
# 2. Product quantization (PQ): encode each 1536-d float32 vector (6144 bytes)
#    as ~96 one-byte codes, ~64x less vector memory
# 3. Scalar quantization: float32 -> int8, 4x less memory
# 4. Reduce embedding dimensions (text-embedding-3-small supports 256-d)
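
To see why quantization matters at this scale, here is a rough memory comparison of raw vector storage under each option (a sketch; assumes PQ with 96 one-byte codes per vector and ignores HNSW graph overhead):

```python
def vector_memory_gb(num_vectors: int, dimensions: int,
                     bytes_per_dim: float) -> float:
    """Raw vector storage only (excludes HNSW graph and metadata)."""
    return num_vectors * dimensions * bytes_per_dim / (1024**3)

n, d = 100_000_000, 1536
fp32 = vector_memory_gb(n, d, 4)   # float32 baseline: ~572 GB
int8 = vector_memory_gb(n, d, 1)   # scalar quantization: ~143 GB (4x smaller)
pq   = n * 96 / (1024**3)          # 96 one-byte PQ codes: ~9 GB (64x smaller)
```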

Scaling Strategy: Sharded Vector Search

import asyncio


class ShardedVectorSearch:
    """Distribute vector search across multiple nodes."""

    def __init__(self, shard_clients: list):
        self.shards = shard_clients  # List of Qdrant/Pinecone clients

    async def search(self, query_vector, top_k=10, filters=None):
        """Search all shards in parallel and merge results."""

        async def search_shard(shard, shard_id):
            results = await shard.search(
                query_vector=query_vector,
                limit=top_k * 2,  # Over-fetch for better merge quality
                filters=filters,
            )
            return [(r.id, r.score, shard_id) for r in results]

        # Fan-out: query all shards in parallel
        tasks = [search_shard(s, i) for i, s in enumerate(self.shards)]
        shard_results = await asyncio.gather(*tasks)

        # Merge: combine and re-sort by score
        all_results = []
        for results in shard_results:
            all_results.extend(results)

        all_results.sort(key=lambda x: x[1], reverse=True)
        return all_results[:top_k]

    def assign_shard(self, doc_id: str) -> int:
        """Deterministic shard assignment using consistent hashing."""
        import hashlib
        hash_val = int(hashlib.md5(doc_id.encode()).hexdigest(), 16)
        return hash_val % len(self.shards)
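
The merge step of the class above can be factored out and checked on its own (a hypothetical helper mirroring the fan-out/merge logic):

```python
def merge_shard_results(shard_results: list, top_k: int) -> list:
    """Flatten per-shard (id, score, shard_id) hits and take a global top-k."""
    merged = [hit for results in shard_results for hit in results]
    merged.sort(key=lambda hit: hit[1], reverse=True)
    return merged[:top_k]

top = merge_shard_results(
    [[("a", 0.9, 0), ("b", 0.5, 0)], [("c", 0.8, 1)]],
    top_k=2,
)
print(top)  # [('a', 0.9, 0), ('c', 0.8, 1)]
```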

Caching Strategies

Caching is the single most effective latency optimization. Layer your caches:

Cache Layer          | What It Caches                         | Hit Rate | Latency Saved
Query result cache   | Exact query string to final results    | 15–30%   | 90–95% (skip entire pipeline)
Semantic query cache | Similar queries mapped to same results | 10–20%   | 80–90%
Embedding cache      | Query string to embedding vector       | 20–40%   | 30–50ms (skip API call)
ES request cache     | Elasticsearch query to raw results     | 30–50%   | 10–20ms
OS file cache        | Index segments in memory               | 80–95%   | Disk I/O avoided
import redis
import json
import hashlib
from typing import Optional

class SearchCache:
    """Multi-layer search cache with TTL management."""

    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client

    def get_results(self, query: str, filters: dict) -> Optional[list]:
        """Check exact match cache."""
        key = self._cache_key(query, filters)
        cached = self.redis.get(f"search:exact:{key}")
        if cached:
            return json.loads(cached)
        return None

    def set_results(self, query: str, filters: dict, results: list,
                    ttl: int = 300):
        """Cache results for 5 minutes by default."""
        key = self._cache_key(query, filters)
        self.redis.setex(
            f"search:exact:{key}", ttl, json.dumps(results)
        )

    def get_embedding(self, text: str) -> Optional[list]:
        """Cache embeddings to avoid repeated API calls."""
        key = hashlib.md5(text.encode()).hexdigest()
        cached = self.redis.get(f"search:emb:{key}")
        if cached:
            return json.loads(cached)
        return None

    def set_embedding(self, text: str, embedding: list, ttl: int = 86400):
        """Cache embeddings for 24 hours."""
        key = hashlib.md5(text.encode()).hexdigest()
        self.redis.setex(f"search:emb:{key}", ttl, json.dumps(embedding))

    def _cache_key(self, query: str, filters: dict) -> str:
        raw = f"{query.lower().strip()}|{json.dumps(filters, sort_keys=True)}"
        return hashlib.md5(raw.encode()).hexdigest()
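
The semantic query cache from the table above maps near-duplicate queries to the same cached results. A minimal in-memory sketch using cosine similarity over query embeddings (names are illustrative; a production system would use a small vector index rather than a linear scan):

```python
import math

def cosine(a: list, b: list) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

class SemanticQueryCache:
    """Serve cached results for queries whose embeddings are close enough."""

    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries = []  # list of (query_embedding, results)

    def get(self, query_embedding: list):
        """Return results of the most similar cached query, or None."""
        best_results, best_sim = None, 0.0
        for emb, results in self.entries:
            sim = cosine(query_embedding, emb)
            if sim > best_sim:
                best_results, best_sim = results, sim
        return best_results if best_sim >= self.threshold else None

    def put(self, query_embedding: list, results: list):
        self.entries.append((query_embedding, results))

cache = SemanticQueryCache(threshold=0.9)
cache.put([1.0, 0.0], ["doc1", "doc2"])
print(cache.get([0.99, 0.05]))  # near-duplicate query -> ['doc1', 'doc2']
print(cache.get([0.0, 1.0]))    # unrelated query -> None
```

The threshold is the key tuning knob: too low and users see results for a different question; too high and the cache never hits.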

Geo-Distributed Search

For global applications, deploy search clusters in multiple regions to reduce latency:

# Geo-distributed search architecture
#
# Strategy 1: Full replication (simple, consistent)
# - Full index copy in each region (US-East, EU-West, AP-Southeast)
# - Writes go to primary region, replicate async to others
# - Reads served by nearest region
# - Replication lag: 1-5 seconds
# - Best for: read-heavy workloads, content that doesn't change per-region
#
# Strategy 2: Geo-partitioned (lower cost, some complexity)
# - Each region holds only local content + global content
# - "Restaurants near me" searches local index only
# - "Python tutorial" searches global index
# - Best for: location-heavy content (marketplace, local services)

class GeoRouter:
    """Route search queries to the nearest cluster."""

    def __init__(self, clusters: dict):
        # {"us-east": ESClient, "eu-west": ESClient, "ap-southeast": ESClient}
        self.clusters = clusters
        self.region_coords = {
            "us-east": (39.0, -77.0),
            "eu-west": (53.0, -8.0),
            "ap-southeast": (1.3, 103.8),
        }

    def route(self, user_lat: float, user_lng: float) -> str:
        """Find nearest cluster by haversine distance."""
        nearest = min(
            self.region_coords.items(),
            key=lambda r: self._distance(user_lat, user_lng, r[1][0], r[1][1])
        )
        return nearest[0]

    def _distance(self, lat1: float, lng1: float,
                  lat2: float, lng2: float) -> float:
        """Haversine distance in kilometers (used by route above)."""
        import math
        earth_radius_km = 6371.0
        dlat = math.radians(lat2 - lat1)
        dlng = math.radians(lng2 - lng1)
        a = (math.sin(dlat / 2) ** 2
             + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2))
             * math.sin(dlng / 2) ** 2)
        return 2 * earth_radius_km * math.asin(math.sqrt(a))

    def search(self, query: str, user_lat: float, user_lng: float, **kwargs):
        region = self.route(user_lat, user_lng)
        client = self.clusters[region]
        return client.search(query=query, **kwargs)

Latency Optimization Checklist

Follow this checklist to achieve sub-100ms search latency at scale:

Index Optimization

  • Shard size 20–40GB, fewer larger shards
  • Force-merge read-only indices to 1 segment
  • Use keyword for exact match fields, text only for full-text
  • Disable _source if you do not need full documents

Query Optimization

  • Use filter context for non-scoring clauses (cached by ES)
  • Limit knn.num_candidates to 100–200
  • Use terminate_after for approximate counts
  • Avoid wildcard queries at the start of terms
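
The first two bullets in concrete form, as a request body (a sketch; field names are illustrative):

```python
# Scoring clause goes in "must"; non-scoring constraints go in "filter",
# which Elasticsearch caches as bitsets and never scores.
query_body = {
    "query": {
        "bool": {
            "must": [
                {"match": {"title": "python tutorial"}},          # scored
            ],
            "filter": [                                           # cached
                {"term": {"status": "published"}},
                {"range": {"published_at": {"gte": "now-1y"}}},
            ],
        }
    },
    "knn": {"field": "embedding", "k": 10, "num_candidates": 150},
    "terminate_after": 10_000,  # approximate hit counts, stop per-shard early
    "size": 10,
}
```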

Infrastructure

  • NVMe SSDs for hot data nodes
  • 50% of RAM for JVM heap, 50% for OS file cache
  • Coordinating nodes behind a load balancer
  • Network: same availability zone for data + coordinating

Application Layer

  • Cache query embeddings (saves 30–50ms per query)
  • Cache full results for popular queries (saves 80–95ms)
  • Connection pooling to Elasticsearch
  • Async cross-encoder re-ranking with timeout fallback
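
The last bullet can be sketched with `asyncio.wait_for` (hypothetical names; the re-ranker is assumed to be an async callable returning a re-ordered candidate list):

```python
import asyncio

async def rerank_with_fallback(query, candidates, rerank_fn, timeout_s=0.15):
    """Cross-encoder re-rank within a latency budget; on timeout, serve
    the first-stage order instead of blocking the response."""
    try:
        return await asyncio.wait_for(rerank_fn(query, candidates), timeout_s)
    except asyncio.TimeoutError:
        return candidates  # fall back to first-stage ranking

async def slow_rerank(query, candidates):
    # Stand-in for a cross-encoder call that occasionally runs long
    await asyncio.sleep(1.0)
    return list(reversed(candidates))

result = asyncio.run(
    rerank_with_fallback("q", ["a", "b"], slow_rerank, timeout_s=0.05)
)
print(result)  # ['a', 'b'] (timed out, fell back to first-stage order)
```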
📝
Production reality: The biggest latency win is almost always caching, not cluster optimization. A 30% cache hit rate on your query result cache saves more aggregate latency than any index tuning. Invest in cache hit rate monitoring and cache warming for popular queries.

Key Takeaways

  • Separate Elasticsearch node roles: dedicated masters, hot/warm data nodes, coordinating nodes behind a load balancer.
  • HNSW vector indices need roughly 6GB per 1M vectors (1536-d, float32), or ~7.7GB of RAM with headroom. At 100M+, use sharding, quantization, or dimension reduction.
  • Layer your caches: result cache (highest impact), embedding cache (saves API calls), ES request cache, OS file cache.
  • Geo-distribute clusters for global users. Full replication is simpler; geo-partitioning is cheaper for location-heavy content.
  • Target 20–40GB per shard, force-merge read-only indices, and use filter context for non-scoring clauses.

What's Next

In the final lesson, we cover best practices and a production checklist — search quality metrics (MRR, NDCG), A/B testing search changes, relevance tuning, and a comprehensive FAQ for search engineers.