Intermediate

Design YouTube Recommendations

This is one of the most common ML system design interview questions. You are asked to design a recommendation system that serves 1B+ daily active users with sub-200ms latency, over a catalog of 800M+ videos. We will walk through a complete interview answer.

Step 1: Clarify Requirements

Start by scoping the problem with questions like the ones below. In an actual interview, the interviewer will confirm or adjust these numbers:

Requirements (confirmed with interviewer):

Scale:
- 1B daily active users
- 800M+ videos in catalog
- Average user watches 10 videos/session, 2 sessions/day
- Peak QPS: ~500K recommendation requests/second

Latency:
- Homepage recommendations: < 200ms end-to-end
- "Up Next" sidebar: < 100ms (prefetched)

Functional:
- Personalized homepage feed (30 videos)
- "Up Next" video recommendations
- Must handle cold start (new users, new videos)

Success Metrics:
- Online: Watch time per session, click-through rate, user retention
- Offline: Recall@500 (candidate gen), NDCG@30 (ranking)
- Business: Daily active users, total watch time, ad revenue

Constraints:
- Must support 100+ languages
- Content policy: filter violating content before showing
- Diversity: avoid filter bubbles, don't show too many similar videos
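The ~500K peak QPS figure above can be sanity-checked with a quick back-of-envelope calculation. The per-session request count and the peak-to-average ratio here are assumptions, not confirmed numbers:

```python
# Back-of-envelope check for the ~500K peak QPS figure.
# Assumptions: each session issues 1 homepage request plus ~10 "Up Next"
# requests (one per video watched), and peak traffic is ~2x the daily average.
DAU = 1_000_000_000
SESSIONS_PER_USER = 2
REQUESTS_PER_SESSION = 1 + 10          # homepage + one "Up Next" per video
PEAK_TO_AVG = 2.0
SECONDS_PER_DAY = 86_400

requests_per_day = DAU * SESSIONS_PER_USER * REQUESTS_PER_SESSION
avg_qps = requests_per_day / SECONDS_PER_DAY
peak_qps = avg_qps * PEAK_TO_AVG

print(f"avg QPS:  {avg_qps:,.0f}")    # ~255K
print(f"peak QPS: {peak_qps:,.0f}")   # ~509K, consistent with the ~500K target
```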

Step 2: High-Level Architecture

The standard pattern for large-scale recommendation systems is a multi-stage pipeline: candidate generation narrows millions of items to hundreds, then a ranking model scores and orders them.

Architecture: YouTube Recommendation System

User Request (user_id, context)
    |
    v
[API Gateway] -----> [Rate Limiter, Auth]
    |
    v
[Recommendation Service]
    |
    |--- Stage 1: CANDIDATE GENERATION (retrieve ~1000 from 800M)
    |       |
    |       |--- [Two-Tower Model] ---> [ANN Index (ScaNN)]
    |       |       Query tower: user features -> 256-dim embedding
    |       |       Item tower: video features -> 256-dim embedding
    |       |       ANN search: find top-1000 nearest videos
    |       |
    |       |--- [Collaborative Filtering] ---> Top-200 from similar users
    |       |
    |       |--- [Trending/Fresh] ---> Top-100 recent popular videos
    |       |
    |       Union + dedup = ~1000 candidates
    |
    |--- Stage 2: RANKING (score 1000 candidates -> top 30)
    |       |
    |       |--- [Feature Store (Redis)] ---> real-time user features
    |       |       - last 50 watched videos
    |       |       - session context (time, device, location)
    |       |       - real-time engagement signals
    |       |
    |       |--- [Deep Ranking Model (GPU)] ---> score each candidate
    |       |       - Multi-task: P(click), P(watch>50%), P(like), P(share)
    |       |       - Combined score = weighted sum of predictions
    |       |
    |       Sort by combined score -> top 30
    |
    |--- Stage 3: POST-PROCESSING
    |       |--- Diversity injection (no more than 3 from same channel)
    |       |--- Content policy filter (remove flagged videos)
    |       |--- Business rules (insert ads, promoted content)
    |       |--- Freshness boost (uplift for recently uploaded videos)
    |
    v
[Response: 30 ranked videos with metadata]

OFFLINE PIPELINES:
    [Training Pipeline (daily)] ---> retrain ranking model on yesterday's data
    [Embedding Pipeline (daily)] ---> regenerate video embeddings, rebuild ANN index
    [Feature Pipeline (real-time)] ---> Kafka -> Flink -> Redis feature store
    [Monitoring] ---> track CTR, watch time, model drift, A/B test metrics

Step 3: Deep Dive — Candidate Generation

The candidate generation stage is the most critical for system efficiency. It must narrow 800M videos to ~1000 candidates in under 50ms.

Two-Tower Model Architecture

The two-tower (dual encoder) approach trains separate neural networks for users and items, mapping both into the same embedding space. At serving time, user embeddings are computed online while item embeddings are precomputed.

# Two-Tower Model for Candidate Generation
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative vocabulary sizes; production embedding tables are hashed/sharded
num_users, num_videos = 10_000_000, 1_000_000

class QueryTower(nn.Module):
    """Encodes user context into a 256-dim embedding."""
    def __init__(self):
        super().__init__()
        self.user_embedding = nn.Embedding(num_users, 64)
        self.watch_history_encoder = nn.GRU(
            input_size=128, hidden_size=128, num_layers=2, batch_first=True
        )
        self.context_mlp = nn.Sequential(
            nn.Linear(64 + 128 + 32, 256),  # user_emb + history + context
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.LayerNorm(256)
        )

    def forward(self, user_id, watch_history, context_features):
        user_emb = self.user_embedding(user_id)           # [B, 64]
        _, history_emb = self.watch_history_encoder(watch_history)  # [B, 128]
        history_emb = history_emb[-1]
        combined = torch.cat([user_emb, history_emb, context_features], dim=-1)
        return F.normalize(self.context_mlp(combined), dim=-1)  # [B, 256]


class ItemTower(nn.Module):
    """Encodes video features into a 256-dim embedding."""
    def __init__(self):
        super().__init__()
        self.video_embedding = nn.Embedding(num_videos, 128)
        self.text_encoder = nn.Linear(768, 128)  # from pre-trained BERT
        self.mlp = nn.Sequential(
            nn.Linear(128 + 128 + 64, 256),  # video_emb + text + metadata
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.LayerNorm(256)
        )

    def forward(self, video_id, title_embedding, metadata_features):
        video_emb = self.video_embedding(video_id)     # [B, 128]
        text_emb = self.text_encoder(title_embedding)   # [B, 128]
        combined = torch.cat([video_emb, text_emb, metadata_features], dim=-1)
        return F.normalize(self.mlp(combined), dim=-1)  # [B, 256]

# Training: contrastive loss (in-batch negatives)
# Positive pairs: (user, video they watched)
# Negative pairs: (user, other videos in the batch)
# Loss: sampled softmax with log-Q correction for popularity bias
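The training objective sketched in the comments above can be written as an in-batch softmax with a log-Q correction. This is a minimal sketch; in practice `item_log_q` would come from item frequency statistics over the training stream:

```python
import torch
import torch.nn.functional as F

def in_batch_softmax_loss(user_emb, item_emb, item_log_q):
    """Contrastive loss with in-batch negatives and log-Q popularity correction.

    user_emb:   [B, D] L2-normalized query-tower outputs
    item_emb:   [B, D] L2-normalized item-tower outputs (row i is the positive for user i)
    item_log_q: [B]    log sampling probability of each item; subtracting it
                       down-weights popular items that appear often as negatives
    """
    logits = user_emb @ item_emb.T           # [B, B] similarity of every user to every item
    logits = logits - item_log_q[None, :]    # log-Q correction per candidate item
    labels = torch.arange(user_emb.size(0))  # diagonal entries are the positives
    return F.cross_entropy(logits, labels)

# Usage sketch:
# loss = in_batch_softmax_loss(query_tower(...), item_tower(...), log_q_batch)
```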

Scale Calculations for ANN Index

| Parameter | Value | Calculation |
|---|---|---|
| Video embeddings size | ~200 GB | 800M videos × 256 dims × 4 bytes ≈ 819 GB raw, ~200 GB with quantization |
| ANN search latency | <10ms | ScaNN with quantized embeddings, partitioned into 10K clusters |
| Index rebuild time | ~2 hours | Distributed across 50 machines, rebuilt daily with new videos |
| Query tower inference | <5ms | Small model (2M params), runs on CPU |
| Total candidate gen latency | <20ms | Query tower (5ms) + ANN search (10ms) + merge (5ms) |
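At serving time, retrieval is a maximum-inner-product search over the precomputed item embeddings. A brute-force NumPy version illustrates the operation (shown as a stand-in; production uses a quantized, partitioned index such as ScaNN to hit the <10ms budget at 800M items):

```python
import numpy as np

def top_k_ann(query_emb, item_embs, k=1000):
    """Exact top-k by inner product; ScaNN approximates this at scale.

    query_emb: [D]    query-tower output for one user
    item_embs: [N, D] unit-norm item-tower outputs (precomputed offline)
    """
    scores = item_embs @ query_emb         # [N] inner products
    top = np.argpartition(-scores, k)[:k]  # unordered top-k in O(N)
    return top[np.argsort(-scores[top])]   # sort only the k winners

# Usage sketch (small N for illustration):
rng = np.random.default_rng(0)
items = rng.normal(size=(10_000, 256)).astype(np.float32)
items /= np.linalg.norm(items, axis=1, keepdims=True)
query = items[42] + 0.01 * rng.normal(size=256).astype(np.float32)
candidates = top_k_ann(query, items, k=100)  # item 42 should rank first
```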

Step 4: Deep Dive — Ranking Model

The ranking model scores ~1000 candidates with a richer feature set than candidate generation allows. This is where prediction quality matters most.

# Multi-Task Deep Ranking Model
import torch
import torch.nn as nn
import torch.nn.functional as F

feature_dim = 2048  # concatenated user + video + cross features (illustrative)

class YouTubeRanker(nn.Module):
    """Multi-task ranking model predicting multiple engagement signals."""

    def __init__(self):
        super().__init__()
        # Shared feature encoder
        self.shared_encoder = nn.Sequential(
            nn.Linear(feature_dim, 1024),
            nn.ReLU(), nn.BatchNorm1d(1024), nn.Dropout(0.2),
            nn.Linear(1024, 512),
            nn.ReLU(), nn.BatchNorm1d(512), nn.Dropout(0.1),
            nn.Linear(512, 256),
            nn.ReLU()
        )

        # Task-specific heads
        self.click_head = nn.Linear(256, 1)        # P(click)
        self.watch_head = nn.Linear(256, 1)         # P(watch > 50%)
        self.like_head = nn.Linear(256, 1)          # P(like)
        self.share_head = nn.Linear(256, 1)         # P(share)
        self.watch_time_head = nn.Linear(256, 1)    # Expected watch time (regression)

    def forward(self, features):
        shared = self.shared_encoder(features)
        return {
            "p_click": torch.sigmoid(self.click_head(shared)),
            "p_watch": torch.sigmoid(self.watch_head(shared)),
            "p_like": torch.sigmoid(self.like_head(shared)),
            "p_share": torch.sigmoid(self.share_head(shared)),
            "watch_time": F.softplus(self.watch_time_head(shared))
        }

# Combined ranking score (weights tuned via online experiments)
# score = 0.3 * p_click + 0.5 * p_watch * watch_time + 0.1 * p_like + 0.1 * p_share

# Key features for ranking:
RANKING_FEATURES = {
    "user_features": [
        "user_embedding (256-dim)",
        "watch_history_last_50 (sequence embedding)",
        "user_age_bucket, gender, country",
        "subscription_channels (set embedding)",
        "time_since_last_session"
    ],
    "video_features": [
        "video_embedding (256-dim)",
        "title_embedding (BERT, 768-dim)",
        "duration_seconds, upload_age_hours",
        "channel_subscriber_count",
        "historical_ctr, avg_watch_percentage",
        "category_id, language"
    ],
    "cross_features": [
        "user_channel_watch_count",
        "user_category_affinity_score",
        "user_language_match",
        "time_of_day_preference_match"
    ]
}
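The per-task predictions are blended into a single ranking score. A minimal sketch using the example weights from the comment above (in production these weights are tuned via online A/B tests):

```python
def combined_score(preds, w_click=0.3, w_watch=0.5, w_like=0.1, w_share=0.1):
    """Blend multi-task outputs into one scalar per candidate.

    preds: dict as returned by YouTubeRanker.forward. The watch term is
    weighted by expected watch time so that long, fully-watched videos
    outrank clickbait with high CTR but low completion.
    """
    return (w_click * preds["p_click"]
            + w_watch * preds["p_watch"] * preds["watch_time"]
            + w_like * preds["p_like"]
            + w_share * preds["p_share"])

# Usage sketch:
# scores = combined_score(ranker(features)).squeeze(-1)   # [num_candidates]
# top30 = scores.argsort(descending=True)[:30]
```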

Diversity Injection

Pure relevance ranking creates filter bubbles. YouTube uses several techniques to ensure diverse recommendations:

  • MMR (Maximal Marginal Relevance): After selecting each video, penalize remaining candidates that are too similar (cosine similarity > 0.8 in embedding space)
  • Category caps: No more than 3 videos from the same channel, no more than 5 from the same category in top 30
  • Exploration slots: Reserve 3–5 positions for serendipitous recommendations (random sampling from a broader candidate pool)
  • Freshness boost: Multiply score by a freshness factor for videos uploaded in the last 24 hours to surface new content
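The MMR step above can be sketched as greedy selection that trades off relevance against similarity to already-picked videos. This assumes precomputed unit-norm embeddings; λ and the channel cap are the tuning knobs:

```python
import numpy as np
from collections import Counter

def mmr_rerank(scores, embs, channels, k=30, lam=0.7, max_per_channel=3):
    """Greedy MMR with a per-channel cap.

    scores:   [N] relevance scores from the ranking model
    embs:     [N, D] unit-norm video embeddings (cosine sim = dot product)
    channels: [N] channel id per video
    lam:      1.0 = pure relevance, 0.0 = pure diversity
    """
    selected, per_channel = [], Counter()
    remaining = set(range(len(scores)))
    while remaining and len(selected) < k:
        best, best_val = None, -np.inf
        for i in remaining:
            if per_channel[channels[i]] >= max_per_channel:
                continue
            # Penalty: max similarity to anything already selected
            sim = max((embs[i] @ embs[j] for j in selected), default=0.0)
            val = lam * scores[i] - (1 - lam) * sim
            if val > best_val:
                best, best_val = i, val
        if best is None:  # every remaining video hits the channel cap
            break
        selected.append(best)
        per_channel[channels[best]] += 1
        remaining.remove(best)
    return selected
```

This is O(k·N) per request; in production the similarity penalty is typically approximated per category or channel rather than computed pairwise.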

Step 5: Trade-Offs Discussion

| Decision | Chosen Approach | Alternative | Why This Choice |
|---|---|---|---|
| Candidate gen | Two-tower + ANN | Collaborative filtering only | Two-tower handles cold start (new videos have content features); CF alone cannot |
| Ranking model | Multi-task DNN | Single-task (click prediction only) | Multi-task captures multiple engagement signals, reduces clickbait |
| Feature freshness | Real-time via Kafka + Flink | Batch-only (hourly updates) | Real-time session context dramatically improves relevance |
| ANN index | ScaNN (quantized) | FAISS (HNSW) | ScaNN provides better recall at the same latency budget at this scale |
| Model update | Daily retrain + online fine-tune | Weekly batch retrain | Daily captures trends faster; online fine-tune handles breaking news |

Failure Modes and Mitigations

Failure Scenarios:

1. ANN index is stale (rebuild failed)
   Mitigation: Serve from previous day's index (data is 24-48h old, still usable)
   Detection: Monitor index build pipeline, alert if > 2 hours late

2. Ranking model latency spike
   Mitigation: Circuit breaker -> fall back to simpler linear model (5ms instead of 30ms)
   Detection: p99 latency monitor, auto-switch at > 100ms

3. Feature store (Redis) down
   Mitigation: Serve with default/average features, quality degrades ~15% but system stays up
   Detection: Redis health check, feature null rate monitoring

4. New video cold start
   Mitigation: Use content-based features (title BERT embedding, category, channel history)
   Bootstrap: Show to small % of users, collect signals, then include in collaborative signals

5. Popularity bias (only recommend popular videos)
   Mitigation: Log-Q correction in training, exploration slots, freshness boost
   Monitor: Track recommendation diversity metrics (Gini coefficient, coverage)
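Failure mode 2 is typically handled with a latency circuit breaker in the serving path. A minimal sketch (the thresholds and the sliding-window size are assumptions, chosen to match the numbers above):

```python
import time
from collections import deque

class LatencyCircuitBreaker:
    """Route to a cheap fallback model when the main ranker's p99 degrades.

    Tracks a sliding window of recent request latencies; if the window's
    p99 exceeds the threshold, callers switch to the fallback until the
    observed latencies recover.
    """
    def __init__(self, threshold_ms=100.0, window=1000):
        self.threshold_ms = threshold_ms
        self.latencies = deque(maxlen=window)

    def record(self, latency_ms):
        self.latencies.append(latency_ms)

    def p99(self):
        if not self.latencies:
            return 0.0
        ordered = sorted(self.latencies)
        return ordered[int(0.99 * (len(ordered) - 1))]

    def use_fallback(self):
        return self.p99() > self.threshold_ms

# Usage sketch:
# breaker = LatencyCircuitBreaker()
# model = linear_fallback if breaker.use_fallback() else deep_ranker
# start = time.perf_counter()
# scores = model(candidates)
# breaker.record((time.perf_counter() - start) * 1000)
```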

Key Takeaways

💡
  • Use multi-stage architecture: candidate generation (800M → 1000) then ranking (1000 → 30)
  • Two-tower model enables fast ANN retrieval and handles cold start via content features
  • Multi-task ranking predicts multiple engagement signals to avoid optimizing for clicks alone
  • Diversity injection is not optional — it prevents filter bubbles and improves long-term retention
  • Always include scale calculations: storage, latency breakdown, QPS at each stage
  • Discuss failure modes and fallback strategies — this separates strong from average candidates