Design Search Autocomplete with ML
A complete walkthrough of designing an ML-powered search autocomplete system. Move beyond simple trie-based approaches to build a personalized, real-time query prediction system of the kind used at Google, Amazon, and Spotify.
Step 1: Clarify Requirements
- “What kind of search?” — General web search, e-commerce product search, or in-app search
- “Scale?” — 10B queries/day, 100K QPS peak, suggestions must appear in <100ms
- “Should suggestions be personalized?” — Yes, based on user search history and profile
- “How many suggestions to show?” — Top 5–10 ranked suggestions
- “Safety constraints?” — Must filter offensive, illegal, and sensitive suggestions
ML Problem Formulation
# Problem formulation
# Business goal: Help users find what they want faster
# ML task: Query completion ranking
# Input: Partial query prefix + user context
# Output: Ranked list of query completions
# Predict: P(user selects completion | prefix, user, context)
# Training data: Search logs with (prefix, completion, selected/not)
# Optimization: Pairwise ranking loss (or listwise NDCG loss)
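The pairwise objective above can be sketched concretely. The following is an illustrative example (function names and scores are made up): a RankNet-style logistic loss over pairs mined from logs, where one completion was selected and another was shown but skipped.

```python
import math

def pairwise_logistic_loss(score_selected, score_rejected):
    # RankNet-style pairwise loss: small when the selected completion
    # outscores the skipped one, large when the order is inverted.
    return math.log(1.0 + math.exp(-(score_selected - score_rejected)))

# Pairs mined from logs: model scores for (selected, shown-but-skipped).
pairs = [(2.1, 0.3), (1.2, 1.5), (0.9, -0.4)]
batch_loss = sum(pairwise_logistic_loss(s, r) for s, r in pairs) / len(pairs)
```

Minimizing this loss pushes the model to score user-selected completions above the ones the user scrolled past, which is exactly the ranking behavior autocomplete needs.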
Step 2: High-Level Architecture
# Architecture: Retrieval + Ranking (two-stage)
#
# [User types "how to"] --> [Candidate Retrieval] --> [ML Ranking] --> [Filtering] --> [Display]
#                                      |                      |               |
#                              ~1000 candidates           ~50 scored       ~8 shown
#                             (trie + popularity)       (personalized)  (safe, diverse)
#
# Offline systems:
# [Query Log Aggregation] --> [Training Pipeline] --> [Model Registry]
# [Trie Builder] --> [Distributed Trie Shards]
# [Safety Classifier] --> [Blocked Query List]
Why Not Just a Trie?
| Approach | Pros | Cons |
|---|---|---|
| Trie + Frequency | Simple, fast (<5ms), easy to implement | No personalization, no context, stale popularity counts |
| Trie + ML Ranking | Fast retrieval + personalized ranking | More complex infrastructure |
| Pure ML (seq2seq) | Can generate novel suggestions | Too slow for the 100ms budget, hard to control for safety |
Step 3: Deep Dive — Candidate Retrieval
Building the Candidate Index
The candidate index stores all possible query completions and enables fast prefix matching.
# Candidate index construction (offline, runs daily)
#
# 1. Aggregate search logs from the last 30 days
# 2. Count query frequency: {"how to cook rice": 1.2M, ...}
# 3. Filter: remove queries with count < 100 (long tail noise)
# 4. Safety filter: remove offensive/illegal queries
# 5. Build trie with top-K completions per prefix node
# 6. Shard trie across machines by prefix hash
#
# Storage: ~500M unique queries, ~50GB trie index
# Update frequency: Daily full rebuild + hourly delta for trending
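Step 5 above (top-K completions per prefix node) can be sketched as follows. This is a simplified illustration: a flat dict of prefixes stands in for a real pointer-based, sharded trie, and the queries and counts are invented.

```python
import heapq
from collections import defaultdict

def build_prefix_index(query_counts, top_k=8, min_count=100):
    # Map each prefix to its top-K completions by 30-day frequency.
    index = defaultdict(list)
    for query, count in query_counts.items():
        if count < min_count:
            continue  # step 3: drop long-tail noise
        for i in range(1, len(query) + 1):
            index[query[:i]].append((count, query))
    return {prefix: [q for _, q in heapq.nlargest(top_k, cands)]
            for prefix, cands in index.items()}

index = build_prefix_index({
    "how to cook rice": 1_200_000,
    "how to tie a tie": 800_000,
    "how to x": 50,          # below min_count, filtered out
}, top_k=2)
index["how to "]  # top completions for the prefix
```

Storing only the top-K per node keeps retrieval O(1) per keystroke at the cost of a larger index, which is why the design shards the trie by prefix hash.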
Trending Query Handling
The daily trie misses breaking news and trending topics. Add a real-time trending layer:
- Stream search logs into a sliding window counter (last 1 hour)
- Detect queries with velocity > 10x their 7-day average
- Inject trending queries into the candidate set at retrieval time
- Apply safety checks before surfacing trending suggestions
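The velocity check in the second bullet can be sketched as below. The threshold and counts are illustrative, and the zero-baseline branch is an assumption about how brand-new queries would be handled.

```python
def is_trending(hourly_count, seven_day_avg_hourly, velocity_threshold=10.0):
    # Flag a query whose last-hour volume exceeds N x its 7-day hourly average.
    if seven_day_avg_hourly <= 0:
        return hourly_count > 0  # brand-new query: any volume counts as trending
    return hourly_count / seven_day_avg_hourly > velocity_threshold

is_trending(5_000, 200)  # 25x baseline -> trending
is_trending(150, 200)    # below baseline -> not trending
```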
Deep Dive — ML Ranking Model
Features for Query Ranking
| Feature Group | Features | Description |
|---|---|---|
| Query Features | query_frequency, query_length, query_recency, query_category | Popularity and freshness of the completion |
| Prefix Match | prefix_match_ratio, is_exact_prefix, edit_distance | How well the completion matches the typed prefix |
| User Features | past_queries_embedding, user_language, user_location, search_frequency | User preferences and behavior |
| Context Features | time_of_day, device_type, previous_query_in_session | Current session and device context |
| User-Query Cross | user_searched_before, user_category_affinity, cosine_similarity(user_emb, query_emb) | Personalization signals |
Model Architecture
# Lightweight ranking model (must score 1000 candidates in <50ms)
#
# Option A: LambdaMART (gradient boosted trees)
# - Fast inference (~0.01ms per candidate)
# - Handles sparse features well
# - Easy to interpret feature importance
# - Recommended for v1
#
# Option B: Two-Tower Neural Network
# - User tower: encodes user features into 64-dim vector
# - Query tower: encodes query features into 64-dim vector
# - Score = dot_product(user_vector, query_vector)
# - User vector can be precomputed and cached
# - Recommended for v2 with personalization
#
# Option C: Cross-Attention Transformer
# - Highest quality but too slow for autocomplete latency
# - Could work for voice search where latency budget is higher
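Option B's serving-time math is worth making concrete. In this minimal sketch (vectors and query names are made up, and real towers would be learned networks), the user vector is computed once per session and cached, so each candidate costs only a dot product:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def rank_candidates(user_vec, candidate_vecs):
    # Two-tower scoring: score = dot(user_vector, query_vector).
    # Query vectors are precomputed offline; the user vector is cached
    # at session start, so serving is just len(candidates) dot products.
    scored = sorted(candidate_vecs.items(),
                    key=lambda kv: dot(user_vec, kv[1]), reverse=True)
    return [query for query, _ in scored]

user_vec = [0.9, 0.1]  # e.g. leans toward the "cooking" direction
candidates = {
    "how to cook rice": [0.8, 0.2],
    "how to code in c": [0.1, 0.9],
}
rank_candidates(user_vec, candidates)
```

This separability is what makes the two-tower option viable inside the 50ms scoring budget, while a cross-attention model (Option C) must run a full forward pass per candidate.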
Deep Dive — Personalization
Personalization is what makes ML-powered autocomplete significantly better than frequency-based approaches.
Personalization Signals
- Short-term: Queries in the current session — if user searched “flights to Paris,” then “hot” should suggest “hotels in Paris” over “hot dogs”
- Medium-term: Recent search history (last 30 days) — a user who frequently searches for recipes gets food-related suggestions
- Long-term: User profile attributes — language, location, demographics influence suggestion relevance
- Behavioral: Past click-through on suggestions — if user always clicks the 3rd suggestion, adjust ranking
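The short-term signal can be approximated with a simple sketch like the one below (the term-overlap heuristic and boost factor are assumptions; a production system would use session embeddings as a ranking feature rather than a hard multiplier):

```python
def session_boost(candidate, session_queries, boost=2.0):
    # Multiply a candidate's base score when it shares terms with
    # queries from the current session (short-term personalization).
    session_terms = {t for q in session_queries for t in q.lower().split()}
    overlap = set(candidate.lower().split()) & session_terms
    return boost if overlap else 1.0

session = ["flights to paris"]
session_boost("hotels in paris", session)  # shares "paris" -> boosted
session_boost("hot dogs", session)         # no overlap -> unchanged
```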
Deep Dive — Real-Time Serving
Latency Optimization
Users expect suggestions to appear as they type. Every keystroke triggers a request, so latency is critical.
| Optimization | Technique | Impact |
|---|---|---|
| Client-side debouncing | Wait 50ms after last keystroke before sending request | Reduces QPS by 60% |
| User vector caching | Precompute and cache user embedding at session start | Eliminates 10ms per request |
| Trie sharding | Shard by prefix hash, route to correct shard | Reduces index lookup time |
| Result caching | Cache top suggestions for popular prefixes | 80% cache hit rate, <1ms |
| Early termination | Stop scoring if top-8 candidates are confidently ranked | Reduces scoring time by 40% |
| Model quantization | INT8 quantized model for inference | 2x speedup with <1% quality loss |
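The result-caching row is the biggest single win, so here is a minimal sketch of a prefix cache with a TTL (class name and TTL value are invented for illustration; a production system would use a shared cache such as Redis with an eviction policy):

```python
import time

class PrefixCache:
    # Tiny TTL cache for top suggestions of popular prefixes;
    # a hit skips retrieval and ranking entirely.
    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self.store = {}

    def get(self, prefix):
        entry = self.store.get(prefix)
        if entry and time.time() - entry[1] < self.ttl:
            return entry[0]
        return None  # missing or expired

    def put(self, prefix, suggestions):
        self.store[prefix] = (suggestions, time.time())

cache = PrefixCache()
cache.put("how to", ["how to cook rice"])
cache.get("how to")  # cache hit
```

The short TTL matters: popular-prefix results go stale as trending queries arrive, so cached entries must expire faster than the hourly trie delta.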
Deep Dive — Evaluation
Offline Metrics
| Metric | Description | Target |
|---|---|---|
| MRR (Mean Reciprocal Rank) | Average 1/rank of the selected suggestion | > 0.45 |
| Success@5 | Fraction of queries where selected completion is in top 5 | > 0.70 |
| pMRR (personalized MRR) | MRR improvement over non-personalized baseline | > 5% lift |
| Character Savings Ratio | Fraction of characters the user did not need to type | > 0.50 |
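MRR is simple to compute from logged sessions. A minimal sketch (the rank data is invented; `None` marks a query where the user selected no shown suggestion):

```python
def mean_reciprocal_rank(selected_ranks):
    # selected_ranks[i] is the 1-based position of the suggestion the
    # user picked on query i, or None if they picked none of them.
    reciprocal = [1.0 / r for r in selected_ranks if r is not None]
    return sum(reciprocal) / len(selected_ranks)

mean_reciprocal_rank([1, 2, None, 4])  # (1 + 0.5 + 0 + 0.25) / 4
```

Counting abandoned queries as zero (rather than dropping them) keeps the metric honest: a model that only ranks easy prefixes well cannot inflate its score.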
Online Metrics (A/B Test)
| Metric | Description | Expectation |
|---|---|---|
| Suggestion Click Rate | Fraction of queries where user clicks a suggestion | Primary metric |
| Time to Search | Time from first keystroke to search execution | Should decrease |
| Query Abandonment Rate | Fraction of users who start typing but leave | Should not increase |
| Offensive Suggestion Rate | Fraction of shown suggestions flagged as offensive | Must be < 0.001% |
Step 4: Trade-Offs & Extensions
Latency vs. Quality
A more complex model gives better suggestions but may exceed the 100ms budget. Use a cascade: serve cached results for common prefixes, and only run ML ranking for uncommon ones.
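The cascade can be sketched in a few lines. This is an illustrative skeleton only: `retrieve` and `ml_rank` are hypothetical stand-ins for the trie lookup and ranking model described above.

```python
def serve_suggestions(prefix, cache, retrieve, ml_rank):
    # Cascade: common prefixes hit the cache and skip the model entirely;
    # uncommon prefixes pay for retrieval + ML ranking once, then warm it.
    if prefix in cache:
        return cache[prefix]
    ranked = ml_rank(prefix, retrieve(prefix))
    cache[prefix] = ranked
    return ranked

cache = {"how to": ["how to cook rice", "how to tie a tie"]}
retrieve = lambda p: [q for q in ["house prices", "how to cook rice"]
                      if q.startswith(p)]
ml_rank = lambda p, cands: sorted(cands)  # stand-in for the real ranker

serve_suggestions("how to", cache, retrieve, ml_rank)  # cache hit, no model call
serve_suggestions("hou", cache, retrieve, ml_rank)     # cache miss, full path
```

In production the cache would carry the TTL and safety re-checks discussed earlier; the key point is that the expensive model runs only on the cache-miss path.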
Multi-Language Support
Supporting transliteration (typing “namaste” to search in Hindi) requires a character-level model that can handle mixed-script input. This significantly increases complexity.
Privacy vs. Personalization
Stronger personalization requires more user data. Consider on-device personalization: keep user history on the client and merge with server-side popularity signals.
Generative Suggestions
Instead of only suggesting previously seen queries, use an LLM to generate novel completions. This is powerful but hard to control for safety and factuality.