Design Google Image Search
Design a system that lets users search billions of images using text queries and reverse image search. This question tests your understanding of multimodal embeddings, vector search at massive scale, and the infrastructure needed to serve image-heavy results with low latency.
Step 1: Clarify Requirements
Requirements (confirmed with interviewer):
Scale:
- 10B+ images indexed (crawled from the web)
- 100K new images added per hour
- 500M search queries per day (~6K QPS average, 20K QPS peak)
- Results page: return top 50 images per query
Latency:
- Text-to-image search: < 300ms (end-to-end including thumbnails)
- Reverse image search: < 500ms
- Image thumbnail loading: < 100ms (CDN-served)
Functional:
- Text query -> ranked list of relevant images
- Image query -> visually similar images (reverse image search)
- SafeSearch filtering (explicit content)
- Image metadata: source URL, title, dimensions, date
- Query suggestions and autocomplete
Success Metrics:
- Online: Click-through rate, query abandonment rate, dwell time
- Offline: NDCG@50, Precision@10, Recall@100
- Quality: Human relevance ratings (1-5 scale)
Constraints:
- Must handle queries in 100+ languages
- Copyright and licensing metadata where available
- Near-duplicate detection (don't show same image 5 times)
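These scale numbers can be sanity-checked with quick back-of-envelope arithmetic. A sketch (the ~3.5x peak-to-average factor is an assumption chosen to match the stated 20K QPS peak):

```python
# Back-of-envelope sanity check for the stated scale numbers.
QUERIES_PER_DAY = 500_000_000
SECONDS_PER_DAY = 86_400

avg_qps = QUERIES_PER_DAY / SECONDS_PER_DAY   # ~5.8K QPS average
peak_qps = avg_qps * 3.5                      # assumed peak factor -> ~20K QPS

# Embedding storage: 10B images, 512-dim float32 vectors
embedding_bytes = 10_000_000_000 * 512 * 4
embedding_tb = embedding_bytes / 1e12         # ~20.5 TB

print(f"avg QPS: {avg_qps:,.0f}")                    # 5,787
print(f"peak QPS: {peak_qps:,.0f}")                  # 20,255
print(f"embedding storage: {embedding_tb:.1f} TB")   # 20.5 TB
```

Doing this arithmetic out loud in the interview shows the interviewer you can translate product numbers into infrastructure requirements.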
Step 2: High-Level Architecture
Architecture: Image Search System
OFFLINE PIPELINE (Index Building):
[Web Crawler] ---> 100K new images/hour
|
v
[Image Processing Pipeline]
|--- Download and validate image
|--- Generate thumbnails (150px, 300px, 600px)
|--- Upload thumbnails to CDN (Cloud Storage + CDN)
|--- Extract metadata (EXIF, alt text, page title, surrounding text)
|
v
[Embedding Generation Pipeline (GPU Cluster)]
|--- CLIP model: image -> 512-dim embedding
|--- Text encoder: metadata text -> 512-dim embedding
|--- Combined embedding for each image
|--- SafeSearch classifier: score 0-1 (explicit content probability)
|--- Near-duplicate detection: perceptual hash + embedding clustering
|
v
[Vector Index Builder]
|--- Add embeddings to distributed FAISS/ScaNN index
|--- Rebuild index shards incrementally (every 6 hours)
|--- Store metadata in document store (Elasticsearch)
ONLINE PIPELINE (Query Serving):
User Query ("sunset over mountains" or uploaded image)
|
v
[Query Understanding Service] (< 30ms)
|--- Spell correction, query expansion
|--- Language detection and translation
|--- Intent classification (text search vs. visual search)
|--- SafeSearch setting detection
|
v
[Query Embedding Service] (< 20ms)
|--- Text query -> CLIP text encoder -> 512-dim embedding
|--- Image query -> CLIP image encoder -> 512-dim embedding
|
v
[Vector Search Service] (< 50ms)
|--- Distributed ANN search across index shards
|--- Return top-500 candidates with similarity scores
|
v
[Ranking Service] (< 50ms)
|--- Re-rank 500 candidates using richer features:
| - Visual similarity score (from ANN)
| - Text relevance (BM25 on metadata)
| - Image quality score (resolution, aesthetics)
| - Source authority (domain trust score)
| - Freshness (prefer recent images)
| - SafeSearch filter
|--- Near-duplicate removal
|--- Diversity injection (different sources, styles)
|
v
[Response Assembly] (< 30ms)
    |--- Fetch thumbnail URLs from CDN
|--- Assemble metadata (title, source, dimensions)
|--- Return top 50 results
|
v
[CDN] ---> Serve thumbnail images to client (< 100ms)
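A quick check on the online pipeline: the per-stage budgets in the diagram should sum comfortably under the 300ms end-to-end target, leaving headroom for network hops and thumbnail rendering. A sketch using the stage budgets listed above:

```python
# Per-stage latency budgets from the online pipeline (milliseconds)
stage_budgets_ms = {
    "query_understanding": 30,
    "query_embedding": 20,
    "vector_search": 50,
    "ranking": 50,
    "response_assembly": 30,
}

backend_total = sum(stage_budgets_ms.values())  # 180 ms of backend budget
headroom = 300 - backend_total                  # 120 ms left for network + render
print(backend_total, headroom)                  # 180 120
```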
Step 3: Deep Dive — Embedding Pipeline
The embedding pipeline processes 100K images per hour and must produce high-quality multimodal embeddings that align text and image representations in the same vector space.
CLIP-Based Embedding Architecture
# Embedding generation pipeline
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

class ImageEmbeddingPipeline:
    def __init__(self):
        # ViT-B/32 projects to 512 dims (ViT-L/14 would give 768-dim embeddings)
        self.clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        self.clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
        self.safesearch_classifier = SafeSearchModel()

    def process_image(self, image_url: str, metadata: dict) -> ImageIndex:
        # 1. Download and validate
        image = download_image(image_url, timeout=10)
        if image.width < 100 or image.height < 100:
            raise ImageTooSmall()

        # 2. Generate image embedding (CLIP visual encoder)
        image_inputs = self.clip_processor(images=image, return_tensors="pt")
        image_embedding = self.clip_model.get_image_features(**image_inputs)
        image_embedding = F.normalize(image_embedding, dim=-1)  # [1, 512]

        # 3. Generate text embedding from metadata
        text = f"{metadata['alt_text']} {metadata['page_title']} {metadata['surrounding_text']}"
        text_inputs = self.clip_processor(text=text, return_tensors="pt", truncation=True)
        text_embedding = self.clip_model.get_text_features(**text_inputs)
        text_embedding = F.normalize(text_embedding, dim=-1)  # [1, 512]

        # 4. Combined embedding (weighted average, re-normalized)
        combined = F.normalize(0.7 * image_embedding + 0.3 * text_embedding, dim=-1)

        # 5. SafeSearch classification
        safesearch_score = self.safesearch_classifier(image)

        # 6. Perceptual hash for near-duplicate detection
        phash = compute_perceptual_hash(image)

        # 7. Generate thumbnails
        thumbnails = {
            "small": resize(image, 150),   # 150px width
            "medium": resize(image, 300),  # 300px width
            "large": resize(image, 600),   # 600px width
        }

        return ImageIndex(
            embedding=combined,  # 512-dim float32
            safesearch_score=safesearch_score,
            phash=phash,
            metadata=metadata,
            thumbnails=thumbnails,
        )
# Scale calculations:
# 100K images/hour = ~28 images/second
# CLIP inference: ~50ms per image on GPU (batched: ~5ms per image)
# Pipeline: 4 GPUs handle 28 images/sec with room to spare
# Storage: 10B images * 512 dims * 4 bytes = 20 TB for embeddings alone
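The `compute_perceptual_hash` call above is left abstract. A minimal pure-Python average-hash (aHash) over an 8x8 grayscale grid illustrates the idea; production systems typically use a DCT-based pHash, and the nearest-neighbor downsampling here is a simplifying assumption:

```python
def average_hash(pixels, hash_size=8):
    """Perceptual average-hash: downsample to a hash_size x hash_size grid,
    then emit one bit per cell (1 if the cell is above the grid mean).
    `pixels` is a 2D list of grayscale values (rows of ints 0-255)."""
    h, w = len(pixels), len(pixels[0])
    # Nearest-neighbor downsample to the grid
    grid = [
        [pixels[r * h // hash_size][c * w // hash_size] for c in range(hash_size)]
        for r in range(hash_size)
    ]
    flat = [v for row in grid for v in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for v in flat:
        bits = (bits << 1) | (1 if v > mean else 0)
    return bits  # 64-bit integer hash

def hamming_distance(a: int, b: int) -> int:
    # Near-duplicate images produce hashes with a small Hamming distance
    return bin(a ^ b).count("1")
```

Two images are flagged as near-duplicates when the Hamming distance between their hashes falls below a threshold (e.g., <= 5 bits out of 64); global brightness shifts barely move the hash, while unrelated images land far apart.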
Step 3 (continued): Deep Dive — Vector Search at Scale
Searching 10B vectors in under 50ms requires distributed indexing with quantization; exhaustive brute-force search over full-precision vectors is infeasible at this scale.
Distributed FAISS Architecture
# Vector index architecture for 10B images
INDEX_CONFIG = {
    "total_vectors": 10_000_000_000,  # 10B images
    "embedding_dim": 512,
    "index_type": "IVF65536,PQ64",    # Inverted file index + product quantization

    # Sharding strategy:
    "num_shards": 100,        # 100M vectors per shard
    "replicas_per_shard": 3,  # 3 replicas for availability
    "total_machines": 300,    # 100 shards * 3 replicas

    # Product quantization reduces memory:
    #   Raw: 10B * 512 dims * 4 bytes = ~20 TB
    #   PQ64: 10B * 64 bytes = 640 GB (32x compression)
    #   Per shard: 640 GB / 100 shards = 6.4 GB (fits in RAM)

    # Search performance:
    #   nprobe=32 (search 32 of 65536 clusters per shard)
    #   Per-shard search: ~5ms
    #   Scatter-gather across 100 shards: ~15ms (parallel)
    #   Merge top results from all shards: ~5ms
    #   Total ANN search: ~20ms
}
# Search flow:
# 1. Query embedding arrives (512-dim vector)
# 2. Scatter query to all 100 shards (parallel gRPC calls)
# 3. Each shard searches its local IVF-PQ index (nprobe=32)
# 4. Each shard returns top-50 results with distances
# 5. Coordinator merges 100 * 50 = 5000 results
# 6. Global top-500 selected for re-ranking
# Index updates:
# New images: Added to a small "delta index" (in-memory, brute-force)
# Every 6 hours: Delta merged into main index shards
# Full rebuild: Monthly (redistributes vectors for balanced shards)
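Step 5 of the search flow (merging per-shard top-k lists into a global top-k) can be sketched with a heap. The `(image_id, similarity)` pairs here are hypothetical shard outputs:

```python
import heapq
from itertools import chain

def merge_shard_results(shard_results, k=500):
    """Merge per-shard candidate lists into the global top-k by similarity.
    shard_results: list of lists of (image_id, similarity) tuples,
    one list per shard."""
    candidates = chain.from_iterable(shard_results)
    # nlargest keeps a k-sized heap: O(n log k) over n total candidates
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])

# Example: two shards each return their local top results
shards = [
    [("a", 0.9), ("b", 0.5)],
    [("c", 0.8), ("d", 0.1)],
]
top = merge_shard_results(shards, k=3)
print([image_id for image_id, _ in top])  # ['a', 'c', 'b']
```

In production the coordinator issues the 100 shard RPCs in parallel and merges as responses stream in, but the merge logic itself is exactly this heap-based selection.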
Query Understanding Pipeline
# Query understanding for image search
class QueryUnderstanding:
    def process(self, raw_query: str) -> ProcessedQuery:
        # 1. Spell correction
        corrected = self.spell_checker.correct(raw_query)  # "sunet mountains" -> "sunset mountains"

        # 2. Language detection and translation
        lang = self.lang_detector.detect(corrected)
        if lang != "en":
            corrected_en = self.translator.translate(corrected, target="en")
        else:
            corrected_en = corrected

        # 3. Query expansion
        # "sunset mountains" -> "sunset mountains landscape scenic golden hour"
        expanded = self.query_expander.expand(corrected_en, max_terms=3)

        # 4. Intent classification
        intent = self.intent_classifier.classify(corrected_en)
        # Intents: SPECIFIC_IMAGE, CATEGORY_BROWSE, VISUAL_SIMILAR, INFORMATIONAL

        # 5. SafeSearch interpretation
        safesearch = self.safesearch_detector.check(corrected_en)

        return ProcessedQuery(
            original=raw_query,
            corrected=corrected_en,
            expanded=expanded,
            intent=intent,
            safesearch_level=safesearch,
            language=lang,
        )
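The `spell_checker` above is a black box. A common baseline is edit-distance lookup against a vocabulary mined from query logs; a minimal sketch (production spell correction also leverages language models and click data):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming (two-row variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                # deletion
                curr[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),   # substitution (free if chars match)
            ))
        prev = curr
    return prev[-1]

def correct(word: str, vocabulary: list, max_dist: int = 2) -> str:
    """Return the closest vocabulary word within max_dist edits,
    otherwise keep the word unchanged."""
    best = min(vocabulary, key=lambda v: edit_distance(word, v))
    return best if edit_distance(word, best) <= max_dist else word

print(correct("sunet", ["sunset", "mountain", "beach"]))  # sunset
```

Scanning the whole vocabulary per word is O(V); real systems prune candidates first (e.g., by length or character n-gram overlap) before scoring edit distance.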
Step 4: Trade-Offs Discussion
| Decision | Chosen | Alternative | Rationale |
|---|---|---|---|
| Embedding model | CLIP (ViT-B/32, 512-dim) | Custom-trained model | CLIP provides strong zero-shot text-image alignment out of the box; a custom model requires massive paired training data |
| Vector index | FAISS IVF-PQ | HNSW graph index | IVF-PQ uses 30x less memory at this scale; HNSW has better recall but requires full vectors in RAM |
| Sharding | Random sharding | Semantic sharding | Random is simpler and avoids hot shards; every query searches all shards anyway |
| Re-ranking | Two-stage (ANN + re-ranker) | Single-stage ANN only | Re-ranker uses richer features (text match, quality, freshness) that ANN cannot encode |
| Thumbnails | Pre-generated + CDN | On-the-fly resize | Pre-generation avoids resize latency at serve time; CDN provides global edge caching |
Caching Strategy
- Query cache (Redis): Cache search results for popular queries (30% hit rate, TTL: 1 hour). "cute puppies" gets millions of queries per day.
- Embedding cache: Cache CLIP text embeddings for frequent queries to skip text encoder inference (saves 20ms per cached query)
- CDN for thumbnails: All thumbnails served from CDN edge nodes. Cache hit rate: 95%+. Total CDN cost: ~$50K/month for 10B images
- Result prefetch: Prefetch page 2 results while user views page 1, reducing perceived latency for pagination
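The query cache in the first bullet can be sketched as a TTL map keyed by the normalized query plus the SafeSearch setting. A minimal in-process stand-in for Redis (the 1-hour default TTL follows the figure above):

```python
import time

class TTLQueryCache:
    """Minimal in-process TTL cache; a stand-in for the Redis query cache.
    Keys combine the normalized query with the SafeSearch setting so
    strict and moderate results never mix."""

    def __init__(self, ttl_seconds: float = 3600):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expiry_timestamp, results)

    def _key(self, query: str, safesearch: str) -> str:
        return f"{query.strip().lower()}|{safesearch}"

    def get(self, query: str, safesearch: str):
        entry = self._store.get(self._key(query, safesearch))
        if entry is None:
            return None
        expiry, results = entry
        if time.monotonic() > expiry:
            del self._store[self._key(query, safesearch)]  # lazy expiry
            return None
        return results

    def put(self, query: str, safesearch: str, results) -> None:
        key = self._key(query, safesearch)
        self._store[key] = (time.monotonic() + self.ttl, results)
```

With a 30% hit rate on ~6K QPS, roughly 1.8K queries per second skip the embedding, ANN, and ranking stages entirely, which is why this cache pays for itself.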
Key Takeaways
- Image search is fundamentally a vector search problem — convert images and queries to embeddings in a shared space (CLIP)
- 10B-scale vector search requires distributed indexing with quantization (IVF-PQ reduces 20TB to 640GB)
- Two-stage retrieval: fast ANN search (top 500) then re-ranking with rich features (top 50)
- Query understanding is critical: spell correction, language translation, intent classification, and query expansion
- Near-duplicate removal and diversity injection prevent showing the same image multiple times
- CDN integration for thumbnail serving is as important as the search system itself