Design YouTube Recommendations
This is one of the most common ML system design interview questions. You need to design a recommendation system that serves 1B+ daily active users with sub-200ms latency over a catalog of 800M+ videos. We will walk through a complete interview answer.
Step 1: Clarify Requirements
Start by scoping the problem with clarifying questions. In an actual interview, the interviewer will confirm or adjust numbers like these:
Requirements (confirmed with interviewer):
Scale:
- 1B daily active users
- 800M+ videos in catalog
- Average user watches 10 videos/session, 2 sessions/day
- Peak QPS: ~500K recommendation requests/second
Latency:
- Homepage recommendations: < 200ms end-to-end
- "Up Next" sidebar: < 100ms (prefetched)
Functional:
- Personalized homepage feed (30 videos)
- "Up Next" video recommendations
- Must handle cold start (new users, new videos)
Success Metrics:
- Online: Watch time per session, click-through rate, user retention
- Offline: Recall@500 (candidate gen), NDCG@30 (ranking)
- Business: Daily active users, total watch time, ad revenue
Constraints:
- Must support 100+ languages
- Content policy: filter violating content before showing
- Diversity: avoid filter bubbles, don't show too many similar videos
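The peak-QPS figure can be sanity-checked from the usage numbers above. The sketch below assumes one homepage request per session, one "Up Next" request per video watched, and a 2x peak-to-average ratio (the ratio is an assumption, not from the requirements):

```python
# Back-of-envelope QPS check from the confirmed scale numbers.
DAU = 1_000_000_000
SESSIONS_PER_DAY = 2
VIDEOS_PER_SESSION = 10
SECONDS_PER_DAY = 86_400
PEAK_TO_AVG = 2  # assumed peak-to-average traffic ratio

homepage_qps = DAU * SESSIONS_PER_DAY / SECONDS_PER_DAY                       # one feed load per session
up_next_qps = DAU * SESSIONS_PER_DAY * VIDEOS_PER_SESSION / SECONDS_PER_DAY   # one request per video watched
avg_qps = homepage_qps + up_next_qps
peak_qps = avg_qps * PEAK_TO_AVG
print(f"avg ~{avg_qps:,.0f} QPS, peak ~{peak_qps:,.0f} QPS")
```

This lands at roughly 255K average and ~510K peak QPS, consistent with the ~500K estimate above.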
Step 2: High-Level Architecture
The standard pattern for large-scale recommendation systems is a multi-stage pipeline: candidate generation narrows millions of items to hundreds, then a ranking model scores and orders them.
Architecture: YouTube Recommendation System
User Request (user_id, context)
|
v
[API Gateway] -----> [Rate Limiter, Auth]
|
v
[Recommendation Service]
|
|--- Stage 1: CANDIDATE GENERATION (retrieve ~1000 from 800M)
| |
| |--- [Two-Tower Model] ---> [ANN Index (ScaNN)]
| | Query tower: user features -> 256-dim embedding
| | Item tower: video features -> 256-dim embedding
| | ANN search: find top-1000 nearest videos
| |
| |--- [Collaborative Filtering] ---> Top-200 from similar users
| |
| |--- [Trending/Fresh] ---> Top-100 recent popular videos
| |
| Union + dedup = ~1000 candidates
|
|--- Stage 2: RANKING (score 1000 candidates -> top 30)
| |
| |--- [Feature Store (Redis)] ---> real-time user features
| | - last 50 watched videos
| | - session context (time, device, location)
| | - real-time engagement signals
| |
| |--- [Deep Ranking Model (GPU)] ---> score each candidate
| | - Multi-task: P(click), P(watch>50%), P(like), P(share)
| | - Combined score = weighted sum of predictions
| |
| Sort by combined score -> top 30
|
|--- Stage 3: POST-PROCESSING
| |--- Diversity injection (no more than 3 from same channel)
| |--- Content policy filter (remove flagged videos)
| |--- Business rules (insert ads, promoted content)
| |--- Freshness boost (uplift for recently uploaded videos)
|
v
[Response: 30 ranked videos with metadata]
OFFLINE PIPELINES:
[Training Pipeline (daily)] ---> retrain ranking model on yesterday's data
[Embedding Pipeline (daily)] ---> regenerate video embeddings, rebuild ANN index
[Feature Pipeline (real-time)] ---> Kafka -> Flink -> Redis feature store
[Monitoring] ---> track CTR, watch time, model drift, A/B test metrics
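The request path in the diagram above can be sketched as a single orchestration function. Everything below is a stand-in: candidate sources, the scorer, and the policy check are passed in as plain callables, and only the stage ordering mirrors the diagram.

```python
def recommend(user_id, sources, score_fn, policy_ok, channel_of, k=30, max_per_channel=3):
    """Three-stage flow: retrieve -> rank -> post-process (all components injected)."""
    # Stage 1: union of candidate sources, deduplicated while preserving order
    candidates = list(dict.fromkeys(v for src in sources for v in src(user_id)))
    # Stage 2: score every candidate and sort descending
    ranked = sorted(candidates, key=score_fn, reverse=True)
    # Stage 3: content-policy filter plus a per-channel diversity cap
    out, per_channel = [], {}
    for v in ranked:
        ch = channel_of(v)
        if policy_ok(v) and per_channel.get(ch, 0) < max_per_channel:
            out.append(v)
            per_channel[ch] = per_channel.get(ch, 0) + 1
        if len(out) == k:
            break
    return out
```

In a real system each stage would be an RPC to a separate service; the point here is that filtering happens after ranking, so the ranker can score a surplus of candidates.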
Step 3: Deep Dive — Candidate Generation
The candidate generation stage is the most critical for system efficiency. It must narrow 800M videos to ~1000 candidates in under 50ms.
Two-Tower Model Architecture
The two-tower (dual encoder) approach trains separate neural networks for users and items, mapping both into the same embedding space. At serving time, user embeddings are computed online while item embeddings are precomputed.
# Two-Tower Model for Candidate Generation
# (num_users / num_videos are assumed defined; in practice raw IDs are
#  hashed into a fixed number of embedding buckets)
import torch
import torch.nn as nn
import torch.nn.functional as F
class QueryTower(nn.Module):
"""Encodes user context into a 256-dim embedding."""
def __init__(self):
super().__init__()
self.user_embedding = nn.Embedding(num_users, 64)
self.watch_history_encoder = nn.GRU(
input_size=128, hidden_size=128, num_layers=2, batch_first=True
)
self.context_mlp = nn.Sequential(
nn.Linear(64 + 128 + 32, 256), # user_emb + history + context
nn.ReLU(),
nn.Linear(256, 256),
nn.LayerNorm(256)
)
def forward(self, user_id, watch_history, context_features):
user_emb = self.user_embedding(user_id) # [B, 64]
        _, h_n = self.watch_history_encoder(watch_history)  # h_n: [num_layers, B, 128]
        history_emb = h_n[-1]  # hidden state of the last GRU layer: [B, 128]
combined = torch.cat([user_emb, history_emb, context_features], dim=-1)
return F.normalize(self.context_mlp(combined), dim=-1) # [B, 256]
class ItemTower(nn.Module):
"""Encodes video features into a 256-dim embedding."""
def __init__(self):
super().__init__()
self.video_embedding = nn.Embedding(num_videos, 128)
self.text_encoder = nn.Linear(768, 128) # from pre-trained BERT
self.mlp = nn.Sequential(
nn.Linear(128 + 128 + 64, 256), # video_emb + text + metadata
nn.ReLU(),
nn.Linear(256, 256),
nn.LayerNorm(256)
)
def forward(self, video_id, title_embedding, metadata_features):
video_emb = self.video_embedding(video_id) # [B, 128]
text_emb = self.text_encoder(title_embedding) # [B, 128]
combined = torch.cat([video_emb, text_emb, metadata_features], dim=-1)
return F.normalize(self.mlp(combined), dim=-1) # [B, 256]
# Training: contrastive loss (in-batch negatives)
# Positive pairs: (user, video they watched)
# Negative pairs: (user, other videos in the batch)
# Loss: sampled softmax with log-Q correction for popularity bias
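The training recipe described in the comments above can be sketched as a loss function. Here `item_log_q` holds each item's log sampling probability (estimated from popularity); the helper name and shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def in_batch_softmax_loss(user_emb, item_emb, item_log_q):
    """Sampled softmax over in-batch negatives with log-Q popularity correction.

    user_emb, item_emb: [B, 256] L2-normalized tower outputs for matched
        (user, watched-video) pairs, so the diagonal of the logit matrix
        corresponds to positives.
    item_log_q: [B] log probability of each item appearing in a batch.
    """
    logits = user_emb @ item_emb.T              # [B, B] similarity matrix
    logits = logits - item_log_q.unsqueeze(0)   # subtract log Q(item) to de-bias popular items
    labels = torch.arange(user_emb.size(0), device=user_emb.device)
    return F.cross_entropy(logits, labels)
```

Without the log-Q term, popular videos dominate the negatives and the model learns to suppress exactly the items users watch most.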
Scale Calculations for ANN Index
| Parameter | Value | Calculation |
|---|---|---|
| Video embeddings size | ~200 GB | 800M videos * 256 dims * 4 bytes = 819 GB raw, ~200 GB with quantization |
| ANN search latency | <10ms | ScaNN with quantized embeddings, partitioned into 10K clusters |
| Index rebuild time | ~2 hours | Distributed across 50 machines, rebuild daily with new videos |
| Query tower inference | <5ms | Small model (2M params), runs on CPU |
| Total candidate gen latency | <20ms | Query tower (5ms) + ANN search (10ms) + merge (5ms) |
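As a reference point, the exact computation that ScaNN approximates is a brute-force top-k dot-product search. The NumPy sketch below is infeasible at 800M vectors, which is why quantization and cluster partitioning are needed, but it defines what "recall" is measured against:

```python
import numpy as np

def top_k_videos(query_emb, video_embs, k):
    """Exact top-k by dot product (cosine similarity for L2-normalized embeddings)."""
    scores = video_embs @ query_emb          # [N] similarity to the query
    idx = np.argpartition(-scores, k)[:k]    # O(N) partial selection of the k best
    return idx[np.argsort(-scores[idx])]     # sort only the k winners
```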
Step 4: Deep Dive — Ranking Model
The ranking model scores ~1000 candidates with a richer feature set than candidate generation allows. This is where prediction quality matters most.
# Multi-Task Deep Ranking Model
import torch
import torch.nn as nn
import torch.nn.functional as F

feature_dim = 512  # illustrative total input feature dimension
class YouTubeRanker(nn.Module):
"""Multi-task ranking model predicting multiple engagement signals."""
def __init__(self):
super().__init__()
# Shared feature encoder
self.shared_encoder = nn.Sequential(
nn.Linear(feature_dim, 1024),
nn.ReLU(), nn.BatchNorm1d(1024), nn.Dropout(0.2),
nn.Linear(1024, 512),
nn.ReLU(), nn.BatchNorm1d(512), nn.Dropout(0.1),
nn.Linear(512, 256),
nn.ReLU()
)
# Task-specific heads
self.click_head = nn.Linear(256, 1) # P(click)
self.watch_head = nn.Linear(256, 1) # P(watch > 50%)
self.like_head = nn.Linear(256, 1) # P(like)
self.share_head = nn.Linear(256, 1) # P(share)
self.watch_time_head = nn.Linear(256, 1) # Expected watch time (regression)
def forward(self, features):
shared = self.shared_encoder(features)
return {
"p_click": torch.sigmoid(self.click_head(shared)),
"p_watch": torch.sigmoid(self.watch_head(shared)),
"p_like": torch.sigmoid(self.like_head(shared)),
"p_share": torch.sigmoid(self.share_head(shared)),
"watch_time": F.softplus(self.watch_time_head(shared))
}
# Combined ranking score (weights tuned via online experiments)
# score = 0.3 * p_click + 0.5 * p_watch * watch_time + 0.1 * p_like + 0.1 * p_share
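The commented formula can be wrapped as a scoring function. Note one addition that is an assumption, not part of the original formula: expected watch time is capped and normalized into [0, 1] so the regression output is commensurate with the probabilities it is blended with.

```python
def combined_score(p, watch_time_cap=600.0):
    """Blend task predictions into one ranking score (weights illustrative, tuned online)."""
    wt = min(p["watch_time"], watch_time_cap) / watch_time_cap  # normalize to [0, 1] (assumption)
    return (0.3 * p["p_click"]
            + 0.5 * p["p_watch"] * wt
            + 0.1 * p["p_like"]
            + 0.1 * p["p_share"])
```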
# Key features for ranking:
RANKING_FEATURES = {
"user_features": [
"user_embedding (256-dim)",
"watch_history_last_50 (sequence embedding)",
"user_age_bucket, gender, country",
"subscription_channels (set embedding)",
"time_since_last_session"
],
"video_features": [
"video_embedding (256-dim)",
"title_embedding (BERT, 768-dim)",
"duration_seconds, upload_age_hours",
"channel_subscriber_count",
"historical_ctr, avg_watch_percentage",
"category_id, language"
],
"cross_features": [
"user_channel_watch_count",
"user_category_affinity_score",
"user_language_match",
"time_of_day_preference_match"
]
}
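As an illustration of how the cross features above might be derived from raw watch logs (field names such as `channel_id` are assumptions, not YouTube's actual schema):

```python
from collections import Counter

def cross_features(watch_history, candidate):
    """watch_history: list of dicts with 'channel_id' / 'category_id'; candidate: one such dict."""
    channel_counts = Counter(e["channel_id"] for e in watch_history)
    category_counts = Counter(e["category_id"] for e in watch_history)
    total = max(len(watch_history), 1)
    return {
        "user_channel_watch_count": channel_counts[candidate["channel_id"]],
        "user_category_affinity_score": category_counts[candidate["category_id"]] / total,
    }
```

In production these would be precomputed by the streaming feature pipeline and read from the feature store at request time, not computed inline.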
Diversity Injection
Pure relevance ranking creates filter bubbles. YouTube uses several techniques to ensure diverse recommendations:
- MMR (Maximal Marginal Relevance): After selecting each video, penalize remaining candidates that are too similar (cosine similarity > 0.8 in embedding space)
- Category caps: No more than 3 videos from the same channel, no more than 5 from the same category in top 30
- Exploration slots: Reserve 3–5 positions for serendipitous recommendations (random sampling from a broader candidate pool)
- Freshness boost: Multiply score by a freshness factor for videos uploaded in the last 24 hours to surface new content
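The MMR step described above can be sketched as a greedy re-ranker. This is a simplified version: `lam` trades relevance against redundancy, and embeddings are assumed L2-normalized so the dot product is cosine similarity.

```python
import numpy as np

def mmr_select(scores, embs, k, lam=0.7):
    """Greedy MMR: repeatedly pick the item maximizing
    lam * relevance - (1 - lam) * max similarity to items already picked."""
    remaining = list(range(len(scores)))
    picked = []
    while remaining and len(picked) < k:
        def mmr(i):
            max_sim = max((float(embs[i] @ embs[j]) for j in picked), default=0.0)
            return lam * scores[i] - (1 - lam) * max_sim
        best = max(remaining, key=mmr)
        picked.append(best)
        remaining.remove(best)
    return picked
```

With `lam = 1.0` this degenerates to pure relevance ranking; lowering it pushes near-duplicate videos further down the list.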
Step 5: Trade-Offs Discussion
| Decision | Chosen Approach | Alternative | Why This Choice |
|---|---|---|---|
| Candidate gen | Two-tower + ANN | Collaborative filtering only | Two-tower handles cold start (new videos have content features), CF alone cannot |
| Ranking model | Multi-task DNN | Single-task (click prediction only) | Multi-task captures multiple engagement signals, reduces clickbait |
| Feature freshness | Real-time via Kafka+Flink | Batch-only (hourly updates) | Real-time session context dramatically improves relevance |
| ANN index | ScaNN (quantized) | FAISS (HNSW) | ScaNN provides better recall at the same latency budget for this scale |
| Model update | Daily retrain + online fine-tune | Weekly batch retrain | Daily captures trends faster; online fine-tune handles breaking news |
Failure Modes and Mitigations
Failure Scenarios:
1. ANN index is stale (rebuild failed)
Mitigation: Serve from previous day's index (data is 24-48h old, still usable)
Detection: Monitor index build pipeline, alert if > 2 hours late
2. Ranking model latency spike
Mitigation: Circuit breaker -> fall back to simpler linear model (5ms instead of 30ms)
Detection: p99 latency monitor, auto-switch at > 100ms
3. Feature store (Redis) down
Mitigation: Serve with default/average features, quality degrades ~15% but system stays up
Detection: Redis health check, feature null rate monitoring
4. New video cold start
Mitigation: Use content-based features (title BERT embedding, category, channel history)
Bootstrap: Show to small % of users, collect signals, then include in collaborative signals
5. Popularity bias (only recommend popular videos)
Mitigation: Log-Q correction in training, exploration slots, freshness boost
Monitor: Track recommendation diversity metrics (Gini coefficient, coverage)
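The circuit breaker from failure mode 2 can be sketched as a rolling-latency check; the class name, thresholds, and window size are illustrative:

```python
import collections
import statistics

class RankingFallback:
    """Switch to a cheap linear model while observed p99 latency exceeds the budget."""

    def __init__(self, budget_ms=100.0, window=1000):
        self.latencies = collections.deque(maxlen=window)  # rolling window of recent latencies
        self.budget_ms = budget_ms

    def record(self, latency_ms):
        self.latencies.append(latency_ms)

    def use_fallback(self):
        if len(self.latencies) < 20:  # not enough samples to estimate p99 yet
            return False
        p99 = statistics.quantiles(self.latencies, n=100)[98]  # 99th percentile
        return p99 > self.budget_ms
```

A production breaker would also add hysteresis (a cooldown before switching back) to avoid flapping between the two models.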
Key Takeaways
- Use multi-stage architecture: candidate generation (800M → 1000) then ranking (1000 → 30)
- Two-tower model enables fast ANN retrieval and handles cold start via content features
- Multi-task ranking predicts multiple engagement signals to avoid optimizing for clicks alone
- Diversity injection is not optional — it prevents filter bubbles and improves long-term retention
- Always include scale calculations: storage, latency breakdown, QPS at each stage
- Discuss failure modes and fallback strategies — this separates strong from average candidates
Lilly Tech Systems