Design Google Image Search
Design a system that lets users search billions of images using text queries and reverse image search. This question tests your understanding of multimodal embeddings, vector search at massive scale, and the infrastructure needed to serve image-heavy results with low latency.
Step 1: Clarify Requirements
Requirements (confirmed with interviewer):
Scale:
- 10B+ images indexed (crawled from the web)
- 100K new images added per hour
- 500M search queries per day (~6K QPS average, 20K QPS peak)
- Results page: return top 50 images per query
Latency:
- Text-to-image search: < 300ms (end-to-end including thumbnails)
- Reverse image search: < 500ms
- Image thumbnail loading: < 100ms (CDN-served)
Functional:
- Text query -> ranked list of relevant images
- Image query -> visually similar images (reverse image search)
- SafeSearch filtering (explicit content)
- Image metadata: source URL, title, dimensions, date
- Query suggestions and autocomplete
Success Metrics:
- Online: Click-through rate, query abandonment rate, dwell time
- Offline: NDCG@50, Precision@10, Recall@100
- Quality: Human relevance ratings (1-5 scale)
Constraints:
- Must handle queries in 100+ languages
- Copyright and licensing metadata where available
- Near-duplicate detection (don't show same image 5 times)
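These scale numbers can be sanity-checked with quick back-of-envelope arithmetic. A sketch (the ~3.5x peak-to-average factor is an assumption chosen to match the stated 20K QPS peak):

```python
# Back-of-envelope sanity check for the stated scale numbers.
QUERIES_PER_DAY = 500_000_000
SECONDS_PER_DAY = 86_400

avg_qps = QUERIES_PER_DAY / SECONDS_PER_DAY   # ~5.8K QPS average
peak_qps = avg_qps * 3.5                      # assumed peak factor -> ~20K QPS

# Embedding storage: 10B images, 512-dim float32 vectors
embedding_bytes = 10_000_000_000 * 512 * 4
embedding_tb = embedding_bytes / 1e12         # ~20.5 TB

print(f"avg QPS: {avg_qps:,.0f}")                    # 5,787
print(f"peak QPS: {peak_qps:,.0f}")                  # 20,255
print(f"embedding storage: {embedding_tb:.1f} TB")   # 20.5 TB
```

Doing this arithmetic out loud in the interview shows the interviewer you can translate product numbers into infrastructure requirements.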
Step 2: High-Level Architecture
Architecture: Image Search System
OFFLINE PIPELINE (Index Building):
[Web Crawler] ---> 100K new images/hour
|
v
[Image Processing Pipeline]
|--- Download and validate image
|--- Generate thumbnails (150px, 300px, 600px)
|--- Upload thumbnails to CDN (Cloud Storage + CDN)
|--- Extract metadata (EXIF, alt text, page title, surrounding text)
|
v
[Embedding Generation Pipeline (GPU Cluster)]
|--- CLIP model: image -> 512-dim embedding
|--- Text encoder: metadata text -> 512-dim embedding
|--- Combined embedding for each image
|--- SafeSearch classifier: score 0-1 (explicit content probability)
|--- Near-duplicate detection: perceptual hash + embedding clustering
|
v
[Vector Index Builder]
|--- Add embeddings to distributed FAISS/ScaNN index
|--- Rebuild index shards incrementally (every 6 hours)
|--- Store metadata in document store (Elasticsearch)
ONLINE PIPELINE (Query Serving):
User Query ("sunset over mountains" or uploaded image)
|
v
[Query Understanding Service] (< 30ms)
|--- Spell correction, query expansion
|--- Language detection and translation
|--- Intent classification (text search vs. visual search)
|--- SafeSearch setting detection
|
v
[Query Embedding Service] (< 20ms)
|--- Text query -> CLIP text encoder -> 512-dim embedding
|--- Image query -> CLIP image encoder -> 512-dim embedding
|
v
[Vector Search Service] (< 50ms)
|--- Distributed ANN search across index shards
|--- Return top-500 candidates with similarity scores
|
v
[Ranking Service] (< 50ms)
|--- Re-rank 500 candidates using richer features:
| - Visual similarity score (from ANN)
| - Text relevance (BM25 on metadata)
| - Image quality score (resolution, aesthetics)
| - Source authority (domain trust score)
| - Freshness (prefer recent images)
| - SafeSearch filter
|--- Near-duplicate removal
|--- Diversity injection (different sources, styles)
|
v
[Response Assembly] (< 30ms)
    |--- Fetch thumbnail URLs from CDN
|--- Assemble metadata (title, source, dimensions)
|--- Return top 50 results
|
v
[CDN] ---> Serve thumbnail images to client (< 100ms)
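A quick check on the online pipeline: the per-stage budgets in the diagram should sum comfortably under the 300ms end-to-end target, leaving headroom for network hops and thumbnail rendering. A sketch using the stage budgets listed above:

```python
# Per-stage latency budgets from the online pipeline (milliseconds)
stage_budgets_ms = {
    "query_understanding": 30,
    "query_embedding": 20,
    "vector_search": 50,
    "ranking": 50,
    "response_assembly": 30,
}

backend_total = sum(stage_budgets_ms.values())  # 180 ms of backend budget
headroom = 300 - backend_total                  # 120 ms left for network + render
print(backend_total, headroom)                  # 180 120
```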
Step 3: Deep Dive — Embedding Pipeline
The embedding pipeline processes 100K images per hour and must produce high-quality multimodal embeddings that align text and image representations in the same vector space.
CLIP-Based Embedding Architecture
# Embedding generation pipeline
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

class ImageEmbeddingPipeline:
    def __init__(self):
        # ViT-B/32 projects to 512 dims (ViT-L/14 would give 768-dim embeddings)
        self.clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        self.clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
        self.safesearch_classifier = SafeSearchModel()

    def process_image(self, image_url: str, metadata: dict) -> ImageIndex:
        # 1. Download and validate
        image = download_image(image_url, timeout=10)
        if image.width < 100 or image.height < 100:
            raise ImageTooSmall()

        # 2. Generate image embedding (CLIP visual encoder)
        image_inputs = self.clip_processor(images=image, return_tensors="pt")
        image_embedding = self.clip_model.get_image_features(**image_inputs)
        image_embedding = F.normalize(image_embedding, dim=-1)  # [1, 512]

        # 3. Generate text embedding from metadata
        text = f"{metadata['alt_text']} {metadata['page_title']} {metadata['surrounding_text']}"
        text_inputs = self.clip_processor(text=text, return_tensors="pt", truncation=True)
        text_embedding = self.clip_model.get_text_features(**text_inputs)
        text_embedding = F.normalize(text_embedding, dim=-1)  # [1, 512]

        # 4. Combined embedding (weighted average, re-normalized)
        combined = F.normalize(0.7 * image_embedding + 0.3 * text_embedding, dim=-1)

        # 5. SafeSearch classification
        safesearch_score = self.safesearch_classifier(image)

        # 6. Perceptual hash for near-duplicate detection
        phash = compute_perceptual_hash(image)

        # 7. Generate thumbnails
        thumbnails = {
            "small": resize(image, 150),   # 150px width
            "medium": resize(image, 300),  # 300px width
            "large": resize(image, 600),   # 600px width
        }

        return ImageIndex(
            embedding=combined,  # 512-dim float32
            safesearch_score=safesearch_score,
            phash=phash,
            metadata=metadata,
            thumbnails=thumbnails,
        )
# Scale calculations:
# 100K images/hour = ~28 images/second
# CLIP inference: ~50ms per image on GPU (batched: ~5ms per image)
# Pipeline: 4 GPUs handle 28 images/sec with room to spare
# Storage: 10B images * 512 dims * 4 bytes = 20 TB for embeddings alone
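The `compute_perceptual_hash` call above is left abstract. A minimal pure-Python average-hash (aHash) over an 8x8 grayscale grid illustrates the idea; production systems typically use a DCT-based pHash, and the nearest-neighbor downsampling here is a simplifying assumption:

```python
def average_hash(pixels, hash_size=8):
    """Perceptual average-hash: downsample to a hash_size x hash_size grid,
    then emit one bit per cell (1 if the cell is above the grid mean).
    `pixels` is a 2D list of grayscale values (rows of ints 0-255)."""
    h, w = len(pixels), len(pixels[0])
    # Nearest-neighbor downsample to the grid
    grid = [
        [pixels[r * h // hash_size][c * w // hash_size] for c in range(hash_size)]
        for r in range(hash_size)
    ]
    flat = [v for row in grid for v in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for v in flat:
        bits = (bits << 1) | (1 if v > mean else 0)
    return bits  # 64-bit integer hash

def hamming_distance(a: int, b: int) -> int:
    # Near-duplicate images produce hashes with a small Hamming distance
    return bin(a ^ b).count("1")
```

Two images are flagged as near-duplicates when the Hamming distance between their hashes falls below a threshold (e.g., <= 5 bits out of 64); global brightness shifts barely move the hash, while unrelated images land far apart.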
Step 3 (continued): Deep Dive — Vector Search at Scale
Searching 10B vectors in under 50ms requires distributed indexing with quantization; exhaustive brute-force search over full-precision vectors is infeasible at this scale.
Distributed FAISS Architecture
# Vector index architecture for 10B images
INDEX_CONFIG = {
    "total_vectors": 10_000_000_000,  # 10B images
    "embedding_dim": 512,
    "index_type": "IVF65536,PQ64",    # Inverted file index + product quantization

    # Sharding strategy:
    "num_shards": 100,        # 100M vectors per shard
    "replicas_per_shard": 3,  # 3 replicas for availability
    "total_machines": 300,    # 100 shards * 3 replicas

    # Product quantization reduces memory:
    #   Raw: 10B * 512 dims * 4 bytes = ~20 TB
    #   PQ64: 10B * 64 bytes = 640 GB (32x compression)
    #   Per shard: 640 GB / 100 shards = 6.4 GB (fits in RAM)

    # Search performance:
    #   nprobe=32 (search 32 of 65536 clusters per shard)
    #   Per-shard search: ~5ms
    #   Scatter-gather across 100 shards: ~15ms (parallel)
    #   Merge top results from all shards: ~5ms
    #   Total ANN search: ~20ms
}
# Search flow:
# 1. Query embedding arrives (512-dim vector)
# 2. Scatter query to all 100 shards (parallel gRPC calls)
# 3. Each shard searches its local IVF-PQ index (nprobe=32)
# 4. Each shard returns top-50 results with distances
# 5. Coordinator merges 100 * 50 = 5000 results
# 6. Global top-500 selected for re-ranking
# Index updates:
# New images: Added to a small "delta index" (in-memory, brute-force)
# Every 6 hours: Delta merged into main index shards
# Full rebuild: Monthly (redistributes vectors for balanced shards)
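Step 5 of the search flow (merging per-shard top-k lists into a global top-k) can be sketched with a heap. The `(image_id, similarity)` pairs here are hypothetical shard outputs:

```python
import heapq
from itertools import chain

def merge_shard_results(shard_results, k=500):
    """Merge per-shard candidate lists into the global top-k by similarity.
    shard_results: list of lists of (image_id, similarity) tuples,
    one list per shard."""
    candidates = chain.from_iterable(shard_results)
    # nlargest keeps a k-sized heap: O(n log k) over n total candidates
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])

# Example: two shards each return their local top results
shards = [
    [("a", 0.9), ("b", 0.5)],
    [("c", 0.8), ("d", 0.1)],
]
top = merge_shard_results(shards, k=3)
print([image_id for image_id, _ in top])  # ['a', 'c', 'b']
```

In production the coordinator issues the 100 shard RPCs in parallel and merges as responses stream in, but the merge logic itself is exactly this heap-based selection.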
Query Understanding Pipeline
# Query understanding for image search
class QueryUnderstanding:
    def process(self, raw_query: str) -> ProcessedQuery:
        # 1. Spell correction
        corrected = self.spell_checker.correct(raw_query)  # "sunet mountains" -> "sunset mountains"

        # 2. Language detection and translation
        lang = self.lang_detector.detect(corrected)
        if lang != "en":
            corrected_en = self.translator.translate(corrected, target="en")
        else:
            corrected_en = corrected

        # 3. Query expansion
        # "sunset mountains" -> "sunset mountains landscape scenic golden hour"
        expanded = self.query_expander.expand(corrected_en, max_terms=3)

        # 4. Intent classification
        intent = self.intent_classifier.classify(corrected_en)
        # Intents: SPECIFIC_IMAGE, CATEGORY_BROWSE, VISUAL_SIMILAR, INFORMATIONAL

        # 5. SafeSearch interpretation
        safesearch = self.safesearch_detector.check(corrected_en)

        return ProcessedQuery(
            original=raw_query,
            corrected=corrected_en,
            expanded=expanded,
            intent=intent,
            safesearch_level=safesearch,
            language=lang,
        )
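The `spell_checker` above is a black box. A common baseline is edit-distance lookup against a vocabulary mined from query logs; a minimal sketch (production spell correction also leverages language models and click data):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming (two-row variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                # deletion
                curr[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),   # substitution (free if chars match)
            ))
        prev = curr
    return prev[-1]

def correct(word: str, vocabulary: list, max_dist: int = 2) -> str:
    """Return the closest vocabulary word within max_dist edits,
    otherwise keep the word unchanged."""
    best = min(vocabulary, key=lambda v: edit_distance(word, v))
    return best if edit_distance(word, best) <= max_dist else word

print(correct("sunet", ["sunset", "mountain", "beach"]))  # sunset
```

Scanning the whole vocabulary per word is O(V); real systems prune candidates first (e.g., by length or character n-gram overlap) before scoring edit distance.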
Step 4: Trade-Offs Discussion
| Decision | Chosen | Alternative | Rationale |
|---|---|---|---|
| Embedding model | CLIP (ViT-B/32, 512-dim) | Custom-trained model | CLIP provides strong zero-shot text-image alignment out of the box; a custom model requires massive paired training data |
| Vector index | FAISS IVF-PQ | HNSW graph index | IVF-PQ uses 30x less memory at this scale; HNSW has better recall but requires full vectors in RAM |
| Sharding | Random sharding | Semantic sharding | Random is simpler and avoids hot shards; every query searches all shards anyway |
| Re-ranking | Two-stage (ANN + re-ranker) | Single-stage ANN only | Re-ranker uses richer features (text match, quality, freshness) that ANN cannot encode |
| Thumbnails | Pre-generated + CDN | On-the-fly resize | Pre-generation avoids resize latency at serve time; CDN provides global edge caching |
Caching Strategy
- Query cache (Redis): Cache search results for popular queries (30% hit rate, TTL: 1 hour). "cute puppies" gets millions of queries per day.
- Embedding cache: Cache CLIP text embeddings for frequent queries to skip text encoder inference (saves 20ms per cached query)
- CDN for thumbnails: All thumbnails served from CDN edge nodes. Cache hit rate: 95%+. Total CDN cost: ~$50K/month for 10B images
- Result prefetch: Prefetch page 2 results while user views page 1, reducing perceived latency for pagination
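The query cache in the first bullet can be sketched as a TTL map keyed by the normalized query plus the SafeSearch setting. A minimal in-process stand-in for Redis (the 1-hour default TTL follows the figure above):

```python
import time

class TTLQueryCache:
    """Minimal in-process TTL cache; a stand-in for the Redis query cache.
    Keys combine the normalized query with the SafeSearch setting so
    strict and moderate results never mix."""

    def __init__(self, ttl_seconds: float = 3600):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expiry_timestamp, results)

    def _key(self, query: str, safesearch: str) -> str:
        return f"{query.strip().lower()}|{safesearch}"

    def get(self, query: str, safesearch: str):
        entry = self._store.get(self._key(query, safesearch))
        if entry is None:
            return None
        expiry, results = entry
        if time.monotonic() > expiry:
            del self._store[self._key(query, safesearch)]  # lazy expiry
            return None
        return results

    def put(self, query: str, safesearch: str, results) -> None:
        key = self._key(query, safesearch)
        self._store[key] = (time.monotonic() + self.ttl, results)
```

With a 30% hit rate on ~6K QPS, roughly 1.8K queries per second skip the embedding, ANN, and ranking stages entirely, which is why this cache pays for itself.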
Key Takeaways
- Image search is fundamentally a vector search problem — convert images and queries to embeddings in a shared space (CLIP)
- 10B-scale vector search requires distributed indexing with quantization (IVF-PQ reduces 20TB to 640GB)
- Two-stage retrieval: fast ANN search (top 500) then re-ranking with rich features (top 50)
- Query understanding is critical: spell correction, language translation, intent classification, and query expansion
- Near-duplicate removal and diversity injection prevent showing the same image multiple times
- CDN integration for thumbnail serving is as important as the search system itself