Intermediate

Design Search Autocomplete with ML

A complete walkthrough of designing an ML-powered search autocomplete system. Move beyond simple trie-based approaches to build a personalized, real-time query prediction system of the kind used at Google, Amazon, and Spotify.

Step 1: Clarify Requirements

📝
Key clarifications:
  • “What kind of search?” — Assume general web search here; e-commerce product search and in-app search follow the same design pattern
  • “Scale?” — 10B queries/day, 100K QPS peak, suggestions must appear in <100ms
  • “Should suggestions be personalized?” — Yes, based on user search history and profile
  • “How many suggestions to show?” — Top 5–10 ranked suggestions
  • “Safety constraints?” — Must filter offensive, illegal, and sensitive suggestions

ML Problem Formulation

# Problem formulation
# Business goal:    Help users find what they want faster
# ML task:          Query completion ranking
# Input:            Partial query prefix + user context
# Output:           Ranked list of query completions
# Predict:          P(user selects completion | prefix, user, context)
# Training data:    Search logs with (prefix, completion, selected/not)
# Optimization:     Pairwise ranking loss (or listwise NDCG loss)
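The pairwise ranking objective above can be sketched with a RankNet-style logistic loss. This is a minimal illustration, not a production training loop; the scores and the helper name are illustrative:

```python
import math

def pairwise_logistic_loss(score_selected: float, score_rejected: float) -> float:
    """RankNet-style pairwise loss: penalize the model when a
    shown-but-skipped completion outscores the one the user selected."""
    return math.log(1.0 + math.exp(score_rejected - score_selected))

# Each training example pairs the selected completion against a
# completion that was shown for the same prefix but not selected.
loss_correct = pairwise_logistic_loss(2.0, 0.5)  # correct ordering -> small loss
loss_wrong = pairwise_logistic_loss(0.5, 2.0)    # inverted ordering -> large loss
```

Summing this loss over all (selected, rejected) pairs per prefix directly optimizes the ordering of completions rather than their absolute scores.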

Step 2: High-Level Architecture

# Architecture: Retrieval + Ranking (two-stage)
#
# [User types "how to"] --> [Candidate Retrieval] --> [ML Ranking] --> [Filtering] --> [Display]
#                                  |                      |               |
#                           ~1000 candidates         ~50 scored      ~8 shown
#                           (trie + popularity)    (personalized)   (safe, diverse)
#
# Offline systems:
# [Query Log Aggregation] --> [Training Pipeline] --> [Model Registry]
# [Trie Builder] --> [Distributed Trie Shards]
# [Safety Classifier] --> [Blocked Query List]

Why Not Just a Trie?

| Approach | Pros | Cons |
| --- | --- | --- |
| Trie + Frequency | Simple, fast (<5ms), easy to implement | No personalization, no context, stale popularity counts |
| Trie + ML Ranking | Fast retrieval + personalized ranking | More complex infrastructure |
| Pure ML (seq2seq) | Can generate novel suggestions | Too slow for 100ms budget, hard to control safety |
💡
Best approach: Use a trie (or prefix index) for fast candidate retrieval, then apply an ML model to re-rank candidates for each user. This gives you the speed of trie lookup with the quality of personalized ML ranking.
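The two-stage pattern can be sketched with a toy prefix index standing in for the sharded trie, plus a pluggable re-ranking step. The class and queries below are illustrative, and a real index would bound prefix depth and memory:

```python
from collections import defaultdict

class PrefixIndex:
    """Toy prefix index: maps each prefix to its top-K completions
    by frequency (stands in for the sharded trie)."""
    def __init__(self, query_counts: dict, k: int = 1000):
        self.table = defaultdict(list)
        for query, count in query_counts.items():
            for i in range(1, len(query) + 1):
                self.table[query[:i]].append((query, count))
        for prefix, cands in self.table.items():
            cands.sort(key=lambda qc: -qc[1])       # most frequent first
            self.table[prefix] = cands[:k]

    def retrieve(self, prefix: str):
        return [q for q, _ in self.table.get(prefix, [])]

def rerank(candidates, score_fn, n=8):
    """Stage 2: re-rank trie candidates with a per-user scoring
    function (any callable here; the ML model in production)."""
    return sorted(candidates, key=score_fn, reverse=True)[:n]

index = PrefixIndex({"how to cook rice": 1200, "how to tie a tie": 900,
                     "hotels in paris": 800})
cands = index.retrieve("how to")    # fast stage-1 lookup by frequency
```

Stage 1 is user-independent and cacheable; only stage 2 needs per-user features, which keeps the personalized work bounded to ~1000 candidates.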

Step 3: Deep Dive — Candidate Retrieval

Building the Candidate Index

The candidate index stores all possible query completions and enables fast prefix matching.

# Candidate index construction (offline, runs daily)
#
# 1. Aggregate search logs from the last 30 days
# 2. Count query frequency: {"how to cook rice": 1.2M, ...}
# 3. Filter: remove queries with count < 100 (long tail noise)
# 4. Safety filter: remove offensive/illegal queries
# 5. Build trie with top-K completions per prefix node
# 6. Shard trie across machines by prefix hash
#
# Storage: ~500M unique queries, ~50GB trie index
# Update frequency: Daily full rebuild + hourly delta for trending
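Steps 2–4 of the offline pipeline can be sketched as a single filtering pass. The thresholds mirror the numbers above; the substring-based safety check is a stand-in for the real safety classifier:

```python
from collections import Counter

def build_candidates(search_logs, blocked_terms, min_count=100):
    """Offline steps 2-4: count query frequency, drop long-tail
    noise, and apply a safety filter before trie construction."""
    counts = Counter(q.strip().lower() for q in search_logs)
    return {
        q: c for q, c in counts.items()
        if c >= min_count and not any(t in q for t in blocked_terms)
    }

logs = (["how to cook rice"] * 150    # frequent, safe -> kept
        + ["rare typo qery"] * 3      # below min_count -> dropped
        + ["bad query"] * 200)        # matches blocklist -> dropped
cands = build_candidates(logs, blocked_terms={"bad"})
```

The surviving dictionary feeds the trie builder in step 5; the daily rebuild reruns this pass over the full 30-day window.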

Trending Query Handling

The daily trie misses breaking news and trending topics. Add a real-time trending layer:

  • Stream search logs into a sliding window counter (last 1 hour)
  • Detect queries with velocity > 10x their 7-day average
  • Inject trending queries into the candidate set at retrieval time
  • Apply safety checks before surfacing trending suggestions
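The sliding-window velocity check above can be sketched as follows. The class, baseline values, and timestamps are illustrative; a production version would use a streaming framework rather than an in-memory deque:

```python
from collections import deque

class TrendingDetector:
    """Sliding 1-hour window counter; flags queries whose hourly
    count exceeds `velocity` times their 7-day hourly average."""
    def __init__(self, baseline_hourly: dict, velocity: float = 10.0,
                 window_s: int = 3600):
        self.baseline = baseline_hourly   # query -> avg count/hour over 7 days
        self.velocity = velocity
        self.window_s = window_s
        self.events = deque()             # (timestamp, query), oldest first
        self.counts = {}

    def record(self, query: str, now: float):
        self.events.append((now, query))
        self.counts[query] = self.counts.get(query, 0) + 1
        # Evict events that fell out of the sliding window.
        while self.events and self.events[0][0] < now - self.window_s:
            _, old = self.events.popleft()
            self.counts[old] -= 1

    def is_trending(self, query: str) -> bool:
        base = self.baseline.get(query, 1.0)
        return self.counts.get(query, 0) > self.velocity * base

det = TrendingDetector({"election results": 2.0})
for i in range(25):                     # 25 hits in one hour vs. baseline of 2/hour
    det.record("election results", now=float(i))
```

Queries flagged here are merged into the candidate set at retrieval time, after passing the safety checks.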

Deep Dive — ML Ranking Model

Features for Query Ranking

| Feature Group | Features | Description |
| --- | --- | --- |
| Query Features | query_frequency, query_length, query_recency, query_category | Popularity and freshness of the completion |
| Prefix Match | prefix_match_ratio, is_exact_prefix, edit_distance | How well the completion matches the typed prefix |
| User Features | past_queries_embedding, user_language, user_location, search_frequency | User preferences and behavior |
| Context Features | time_of_day, device_type, previous_query_in_session | Current session and device context |
| User-Query Cross | user_searched_before, user_category_affinity, cosine_similarity(user_emb, query_emb) | Personalization signals |

Model Architecture

# Lightweight ranking model (must score 1000 candidates in <50ms)
#
# Option A: LambdaMART (gradient boosted trees)
#   - Fast inference (~0.01ms per candidate)
#   - Handles sparse features well
#   - Easy to interpret feature importance
#   - Recommended for v1
#
# Option B: Two-Tower Neural Network
#   - User tower: encodes user features into 64-dim vector
#   - Query tower: encodes query features into 64-dim vector
#   - Score = dot_product(user_vector, query_vector)
#   - User vector can be precomputed and cached
#   - Recommended for v2 with personalization
#
# Option C: Cross-Attention Transformer
#   - Highest quality but too slow for autocomplete latency
#   - Could work for voice search where latency budget is higher
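Option B's serving-time economics can be sketched in a few lines. The vectors are made up for illustration; the point is that the user tower runs once per session while each candidate costs only a dot product:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def two_tower_score(user_vec, query_vec):
    """Two-tower serving: the user vector is precomputed and cached
    at session start; scoring a candidate is one dot product."""
    return dot(user_vec, query_vec)

user_vec = [0.2, 0.9, 0.1]                 # cached output of the user tower
candidates = {                              # precomputed query-tower vectors
    "hotels in paris": [0.1, 0.95, 0.0],
    "hot dogs near me": [0.9, 0.05, 0.3],
}
ranked = sorted(candidates,
                key=lambda q: two_tower_score(user_vec, candidates[q]),
                reverse=True)
```

Because query vectors are also precomputable, the online cost per candidate is O(embedding dim), which is what makes the two-tower design fit the latency budget.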

Deep Dive — Personalization

Personalization is what makes ML-powered autocomplete significantly better than frequency-based approaches.

Personalization Signals

  • Short-term: Queries in the current session — if the user just searched “flights to Paris,” then “hot” should suggest “hotels in Paris” over “hot dogs”
  • Medium-term: Recent search history (last 30 days) — a user who frequently searches for recipes gets food-related suggestions
  • Long-term: User profile attributes — language, location, and demographics influence suggestion relevance
  • Behavioral: Past click-through on suggestions — if a user consistently clicks the 3rd suggestion, adjust the ranking accordingly

Privacy consideration: Personalization requires storing user search history. Discuss with the interviewer how you would handle GDPR/CCPA compliance: anonymization, retention limits, opt-out mechanisms, and differential privacy for aggregate statistics.
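The short-term session signal can be sketched as a simple score multiplier. The boost value and the term-overlap rule are illustrative stand-ins for what a learned cross feature would capture:

```python
def session_boost(candidate: str, session_queries: list,
                  boost: float = 2.0) -> float:
    """Short-term personalization: boost a candidate's base score
    when it shares a term with a query from the current session."""
    session_terms = {t for q in session_queries for t in q.lower().split()}
    overlap = set(candidate.lower().split()) & session_terms
    return boost if overlap else 1.0

session = ["flights to paris"]
score_hotels = 0.5 * session_boost("hotels in paris", session)  # shares "paris"
score_hotdog = 0.5 * session_boost("hot dogs", session)         # no overlap
```

In the real system this signal enters the model as a feature (previous_query_in_session) rather than a hand-tuned multiplier, so its weight is learned from click data.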

Deep Dive — Real-Time Serving

Latency Optimization

Users expect suggestions to appear as they type. Every keystroke triggers a request, so latency is critical.

| Optimization | Technique | Impact |
| --- | --- | --- |
| Client-side debouncing | Wait 50ms after the last keystroke before sending a request | Reduces QPS by 60% |
| User vector caching | Precompute and cache the user embedding at session start | Eliminates 10ms per request |
| Trie sharding | Shard by prefix hash, route to the correct shard | Reduces index lookup time |
| Result caching | Cache top suggestions for popular prefixes | 80% cache hit rate, <1ms |
| Early termination | Stop scoring once the top 8 candidates are confidently ranked | Reduces scoring time by 40% |
| Model quantization | INT8-quantized model for inference | 2x speedup with <1% quality loss |
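The result-caching row can be sketched as a small LRU cache keyed by prefix. The class and capacity are illustrative; production systems typically use a distributed cache with TTLs so trending suggestions are not served stale:

```python
from collections import OrderedDict

class PrefixCache:
    """LRU cache for the final suggestion list of popular prefixes;
    repeat prefixes are served without touching the ranking model."""
    def __init__(self, capacity: int = 100_000):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, prefix: str):
        if prefix not in self.data:
            return None
        self.data.move_to_end(prefix)       # mark as recently used
        return self.data[prefix]

    def put(self, prefix: str, suggestions: list):
        self.data[prefix] = suggestions
        self.data.move_to_end(prefix)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)   # evict least recently used

cache = PrefixCache(capacity=2)
cache.put("how", ["how to cook rice"])
cache.put("hot", ["hotels in paris"])
cache.get("how")                  # refresh "how"
cache.put("wea", ["weather"])     # evicts "hot", the LRU entry
```

Note that caching only works for the non-personalized head of traffic; personalized requests must still reach the ranking model, which is why the other optimizations in the table matter.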

Deep Dive — Evaluation

Offline Metrics

| Metric | Description | Target |
| --- | --- | --- |
| MRR (Mean Reciprocal Rank) | Average 1/rank of the selected suggestion | > 0.45 |
| Success@5 | Fraction of queries where the selected completion is in the top 5 | > 0.70 |
| pMRR (personalized MRR) | MRR improvement over the non-personalized baseline | > 5% lift |
| Character Savings Ratio | Fraction of characters the user did not need to type | > 0.50 |
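MRR and Success@K are straightforward to compute from logged impressions. A minimal sketch, assuming each impression records the shown list and the completion the user ultimately selected:

```python
def mrr(ranked_lists, selections):
    """Mean Reciprocal Rank: average of 1/rank of the suggestion the
    user selected (contributes 0 when it was not shown at all)."""
    total = 0.0
    for shown, picked in zip(ranked_lists, selections):
        total += 1.0 / (shown.index(picked) + 1) if picked in shown else 0.0
    return total / len(ranked_lists)

def success_at_k(ranked_lists, selections, k=5):
    """Fraction of impressions where the selection was in the top k."""
    hits = sum(1 for shown, picked in zip(ranked_lists, selections)
               if picked in shown[:k])
    return hits / len(ranked_lists)

shown = [["a", "b", "c"], ["x", "y", "z"]]   # two logged impressions
picked = ["b", "x"]                          # user's selection per impression
```

For this toy log, MRR is (1/2 + 1/1) / 2 = 0.75, illustrating how a single top-ranked hit dominates the average.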

Online Metrics (A/B Test)

| Metric | Description | Guardrail |
| --- | --- | --- |
| Suggestion Click Rate | Fraction of queries where the user clicks a suggestion | Primary metric |
| Time to Search | Time from first keystroke to search execution | Should decrease |
| Query Abandonment Rate | Fraction of users who start typing but leave | Should not increase |
| Offensive Suggestion Rate | Fraction of shown suggestions flagged as offensive | Must be < 0.001% |

Step 4: Trade-Offs & Extensions

Latency vs. Quality

A more complex model gives better suggestions but may exceed the 100ms budget. Use a cascade: serve cached results for common prefixes, and only run ML ranking for uncommon ones.

🌐

Multi-Language Support

Supporting transliteration (typing “namaste” to search in Hindi) requires a character-level model that can handle mixed-script input. This significantly increases complexity.

🔒

Privacy vs. Personalization

Stronger personalization requires more user data. Consider on-device personalization: keep user history on the client and merge with server-side popularity signals.

🌱

Generative Suggestions

Instead of only suggesting previously seen queries, use an LLM to generate novel completions. This is powerful but hard to control for safety and factuality.