Design Search Autocomplete with ML
A complete walkthrough of designing an ML-powered search autocomplete system. Move beyond simple trie-based approaches to build a personalized, real-time query prediction system of the kind used at Google, Amazon, and Spotify.
Step 1: Clarify Requirements
- “What kind of search?” — General web search, e-commerce product search, or in-app search
- “Scale?” — 10B queries/day, 100K QPS peak, suggestions must appear in <100ms
- “Should suggestions be personalized?” — Yes, based on user search history and profile
- “How many suggestions to show?” — Top 5–10 ranked suggestions
- “Safety constraints?” — Must filter offensive, illegal, and sensitive suggestions
ML Problem Formulation
# Problem formulation
# Business goal: Help users find what they want faster
# ML task: Query completion ranking
# Input: Partial query prefix + user context
# Output: Ranked list of query completions
# Predict: P(user selects completion | prefix, user, context)
# Training data: Search logs with (prefix, completion, selected/not)
# Optimization: Pairwise ranking loss (or listwise NDCG loss)
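The pairwise objective above can be sketched concretely. The following is an illustrative example (function names and scores are made up): a RankNet-style logistic loss over pairs mined from logs, where one completion was selected and another was shown but skipped.

```python
import math

def pairwise_logistic_loss(score_selected, score_rejected):
    # RankNet-style pairwise loss: small when the selected completion
    # outscores the skipped one, large when the order is inverted.
    return math.log(1.0 + math.exp(-(score_selected - score_rejected)))

# Pairs mined from logs: model scores for (selected, shown-but-skipped).
pairs = [(2.1, 0.3), (1.2, 1.5), (0.9, -0.4)]
batch_loss = sum(pairwise_logistic_loss(s, r) for s, r in pairs) / len(pairs)
```

Minimizing this loss pushes the model to score user-selected completions above the ones the user scrolled past, which is exactly the ranking behavior autocomplete needs.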
Step 2: High-Level Architecture
# Architecture: Retrieval + Ranking (two-stage)
#
# [User types "how to"] --> [Candidate Retrieval] --> [ML Ranking] --> [Filtering] --> [Display]
#                                      |                      |               |
#                              ~1000 candidates           ~50 scored       ~8 shown
#                             (trie + popularity)       (personalized)  (safe, diverse)
#
# Offline systems:
# [Query Log Aggregation] --> [Training Pipeline] --> [Model Registry]
# [Trie Builder] --> [Distributed Trie Shards]
# [Safety Classifier] --> [Blocked Query List]
Why Not Just a Trie?
| Approach | Pros | Cons |
|---|---|---|
| Trie + Frequency | Simple, fast (<5ms), easy to implement | No personalization, no context, stale popularity counts |
| Trie + ML Ranking | Fast retrieval + personalized ranking | More complex infrastructure |
| Pure ML (seq2seq) | Can generate novel suggestions | Too slow for the 100ms budget, hard to control for safety |
Step 3: Deep Dive — Candidate Retrieval
Building the Candidate Index
The candidate index stores all possible query completions and enables fast prefix matching.
# Candidate index construction (offline, runs daily)
#
# 1. Aggregate search logs from the last 30 days
# 2. Count query frequency: {"how to cook rice": 1.2M, ...}
# 3. Filter: remove queries with count < 100 (long tail noise)
# 4. Safety filter: remove offensive/illegal queries
# 5. Build trie with top-K completions per prefix node
# 6. Shard trie across machines by prefix hash
#
# Storage: ~500M unique queries, ~50GB trie index
# Update frequency: Daily full rebuild + hourly delta for trending
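Step 5 above (top-K completions per prefix node) can be sketched as follows. This is a simplified illustration: a flat dict of prefixes stands in for a real pointer-based, sharded trie, and the queries and counts are invented.

```python
import heapq
from collections import defaultdict

def build_prefix_index(query_counts, top_k=8, min_count=100):
    # Map each prefix to its top-K completions by 30-day frequency.
    index = defaultdict(list)
    for query, count in query_counts.items():
        if count < min_count:
            continue  # step 3: drop long-tail noise
        for i in range(1, len(query) + 1):
            index[query[:i]].append((count, query))
    return {prefix: [q for _, q in heapq.nlargest(top_k, cands)]
            for prefix, cands in index.items()}

index = build_prefix_index({
    "how to cook rice": 1_200_000,
    "how to tie a tie": 800_000,
    "how to x": 50,          # below min_count, filtered out
}, top_k=2)
index["how to "]  # top completions for the prefix
```

Storing only the top-K per node keeps retrieval O(1) per keystroke at the cost of a larger index, which is why the design shards the trie by prefix hash.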
Trending Query Handling
The daily trie misses breaking news and trending topics. Add a real-time trending layer:
- Stream search logs into a sliding window counter (last 1 hour)
- Detect queries with velocity > 10x their 7-day average
- Inject trending queries into the candidate set at retrieval time
- Apply safety checks before surfacing trending suggestions
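The velocity check in the second bullet can be sketched as below. The threshold and counts are illustrative, and the zero-baseline branch is an assumption about how brand-new queries would be handled.

```python
def is_trending(hourly_count, seven_day_avg_hourly, velocity_threshold=10.0):
    # Flag a query whose last-hour volume exceeds N x its 7-day hourly average.
    if seven_day_avg_hourly <= 0:
        return hourly_count > 0  # brand-new query: any volume counts as trending
    return hourly_count / seven_day_avg_hourly > velocity_threshold

is_trending(5_000, 200)  # 25x baseline -> trending
is_trending(150, 200)    # below baseline -> not trending
```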
Deep Dive — ML Ranking Model
Features for Query Ranking
| Feature Group | Features | Description |
|---|---|---|
| Query Features | query_frequency, query_length, query_recency, query_category | Popularity and freshness of the completion |
| Prefix Match | prefix_match_ratio, is_exact_prefix, edit_distance | How well the completion matches the typed prefix |
| User Features | past_queries_embedding, user_language, user_location, search_frequency | User preferences and behavior |
| Context Features | time_of_day, device_type, previous_query_in_session | Current session and device context |
| User-Query Cross | user_searched_before, user_category_affinity, cosine_similarity(user_emb, query_emb) | Personalization signals |
Model Architecture
# Lightweight ranking model (must score 1000 candidates in <50ms)
#
# Option A: LambdaMART (gradient boosted trees)
# - Fast inference (~0.01ms per candidate)
# - Handles sparse features well
# - Easy to interpret feature importance
# - Recommended for v1
#
# Option B: Two-Tower Neural Network
# - User tower: encodes user features into 64-dim vector
# - Query tower: encodes query features into 64-dim vector
# - Score = dot_product(user_vector, query_vector)
# - User vector can be precomputed and cached
# - Recommended for v2 with personalization
#
# Option C: Cross-Attention Transformer
# - Highest quality but too slow for autocomplete latency
# - Could work for voice search where latency budget is higher
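Option B's serving-time math is worth making concrete. In this minimal sketch (vectors and query names are made up, and real towers would be learned networks), the user vector is computed once per session and cached, so each candidate costs only a dot product:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def rank_candidates(user_vec, candidate_vecs):
    # Two-tower scoring: score = dot(user_vector, query_vector).
    # Query vectors are precomputed offline; the user vector is cached
    # at session start, so serving is just len(candidates) dot products.
    scored = sorted(candidate_vecs.items(),
                    key=lambda kv: dot(user_vec, kv[1]), reverse=True)
    return [query for query, _ in scored]

user_vec = [0.9, 0.1]  # e.g. leans toward the "cooking" direction
candidates = {
    "how to cook rice": [0.8, 0.2],
    "how to code in c": [0.1, 0.9],
}
rank_candidates(user_vec, candidates)
```

This separability is what makes the two-tower option viable inside the 50ms scoring budget, while a cross-attention model (Option C) must run a full forward pass per candidate.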
Deep Dive — Personalization
Personalization is what makes ML-powered autocomplete significantly better than frequency-based approaches.
Personalization Signals
- Short-term: Queries in the current session — if user searched “flights to Paris,” then “hot” should suggest “hotels in Paris” over “hot dogs”
- Medium-term: Recent search history (last 30 days) — a user who frequently searches for recipes gets food-related suggestions
- Long-term: User profile attributes — language, location, demographics influence suggestion relevance
- Behavioral: Past click-through on suggestions — if user always clicks the 3rd suggestion, adjust ranking
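The short-term signal can be approximated with a simple sketch like the one below (the term-overlap heuristic and boost factor are assumptions; a production system would use session embeddings as a ranking feature rather than a hard multiplier):

```python
def session_boost(candidate, session_queries, boost=2.0):
    # Multiply a candidate's base score when it shares terms with
    # queries from the current session (short-term personalization).
    session_terms = {t for q in session_queries for t in q.lower().split()}
    overlap = set(candidate.lower().split()) & session_terms
    return boost if overlap else 1.0

session = ["flights to paris"]
session_boost("hotels in paris", session)  # shares "paris" -> boosted
session_boost("hot dogs", session)         # no overlap -> unchanged
```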
Deep Dive — Real-Time Serving
Latency Optimization
Users expect suggestions to appear as they type. Every keystroke triggers a request, so latency is critical.
| Optimization | Technique | Impact |
|---|---|---|
| Client-side debouncing | Wait 50ms after last keystroke before sending request | Reduces QPS by 60% |
| User vector caching | Precompute and cache user embedding at session start | Eliminates 10ms per request |
| Trie sharding | Shard by prefix hash, route to correct shard | Reduces index lookup time |
| Result caching | Cache top suggestions for popular prefixes | 80% cache hit rate, <1ms |
| Early termination | Stop scoring if top-8 candidates are confidently ranked | Reduces scoring time by 40% |
| Model quantization | INT8 quantized model for inference | 2x speedup with <1% quality loss |
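The result-caching row is the biggest single win, so here is a minimal sketch of a prefix cache with a TTL (class name and TTL value are invented for illustration; a production system would use a shared cache such as Redis with an eviction policy):

```python
import time

class PrefixCache:
    # Tiny TTL cache for top suggestions of popular prefixes;
    # a hit skips retrieval and ranking entirely.
    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self.store = {}

    def get(self, prefix):
        entry = self.store.get(prefix)
        if entry and time.time() - entry[1] < self.ttl:
            return entry[0]
        return None  # missing or expired

    def put(self, prefix, suggestions):
        self.store[prefix] = (suggestions, time.time())

cache = PrefixCache()
cache.put("how to", ["how to cook rice"])
cache.get("how to")  # cache hit
```

The short TTL matters: popular-prefix results go stale as trending queries arrive, so cached entries must expire faster than the hourly trie delta.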
Deep Dive — Evaluation
Offline Metrics
| Metric | Description | Target |
|---|---|---|
| MRR (Mean Reciprocal Rank) | Average 1/rank of the selected suggestion | > 0.45 |
| Success@5 | Fraction of queries where selected completion is in top 5 | > 0.70 |
| pMRR (personalized MRR) | MRR improvement over non-personalized baseline | > 5% lift |
| Character Savings Ratio | Fraction of characters the user did not need to type | > 0.50 |
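MRR is simple to compute from logged sessions. A minimal sketch (the rank data is invented; `None` marks a query where the user selected no shown suggestion):

```python
def mean_reciprocal_rank(selected_ranks):
    # selected_ranks[i] is the 1-based position of the suggestion the
    # user picked on query i, or None if they picked none of them.
    reciprocal = [1.0 / r for r in selected_ranks if r is not None]
    return sum(reciprocal) / len(selected_ranks)

mean_reciprocal_rank([1, 2, None, 4])  # (1 + 0.5 + 0 + 0.25) / 4
```

Counting abandoned queries as zero (rather than dropping them) keeps the metric honest: a model that only ranks easy prefixes well cannot inflate its score.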
Online Metrics (A/B Test)
| Metric | Description | Expectation |
|---|---|---|
| Suggestion Click Rate | Fraction of queries where user clicks a suggestion | Primary metric |
| Time to Search | Time from first keystroke to search execution | Should decrease |
| Query Abandonment Rate | Fraction of users who start typing but leave | Should not increase |
| Offensive Suggestion Rate | Fraction of shown suggestions flagged as offensive | Must be < 0.001% |
Step 4: Trade-Offs & Extensions
Latency vs. Quality
A more complex model gives better suggestions but may exceed the 100ms budget. Use a cascade: serve cached results for common prefixes, and only run ML ranking for uncommon ones.
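The cascade can be sketched in a few lines. This is an illustrative skeleton only: `retrieve` and `ml_rank` are hypothetical stand-ins for the trie lookup and ranking model described above.

```python
def serve_suggestions(prefix, cache, retrieve, ml_rank):
    # Cascade: common prefixes hit the cache and skip the model entirely;
    # uncommon prefixes pay for retrieval + ML ranking once, then warm it.
    if prefix in cache:
        return cache[prefix]
    ranked = ml_rank(prefix, retrieve(prefix))
    cache[prefix] = ranked
    return ranked

cache = {"how to": ["how to cook rice", "how to tie a tie"]}
retrieve = lambda p: [q for q in ["house prices", "how to cook rice"]
                      if q.startswith(p)]
ml_rank = lambda p, cands: sorted(cands)  # stand-in for the real ranker

serve_suggestions("how to", cache, retrieve, ml_rank)  # cache hit, no model call
serve_suggestions("hou", cache, retrieve, ml_rank)     # cache miss, full path
```

In production the cache would carry the TTL and safety re-checks discussed earlier; the key point is that the expensive model runs only on the cache-miss path.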
Multi-Language Support
Supporting transliteration (typing “namaste” to search in Hindi) requires a character-level model that can handle mixed-script input. This significantly increases complexity.
Privacy vs. Personalization
Stronger personalization requires more user data. Consider on-device personalization: keep user history on the client and merge with server-side popularity signals.
Generative Suggestions
Instead of only suggesting previously seen queries, use an LLM to generate novel completions. This is powerful but hard to control for safety and factuality.