Design Ad Click Prediction
A complete walkthrough of designing an ad click prediction system. This is arguably the highest-revenue ML application in the world — a 0.1% improvement in CTR prediction can mean billions of dollars in annual revenue for companies like Google and Meta.
Step 1: Clarify Requirements
- “What type of ads?” — Search ads (query-triggered) or display/feed ads (browsing context)
- “Scale?” — 10B ad impressions/day, 100K QPS, score 1000 ads per request in <50ms
- “What are we optimizing?” — Platform revenue: under CPC, expected revenue per impression = bid × P(click); conversion-aware ranking extends this to bid × P(click) × P(conversion|click)
- “Pricing model?” — CPC (cost-per-click), so we charge on click, need accurate CTR
- “Calibration matters?” — Yes, predicted probabilities must be accurate (not just ranking order)
ML Problem Formulation
# Problem formulation
# Business goal: Maximize ad revenue while maintaining user experience
# ML task: Binary classification (click / no-click)
# Predict: P(click | user, ad, context)
# Ranking score: bid * P(click) * quality_factor
# Training data: Click/impression logs (heavily imbalanced: ~2% CTR)
# Loss function: Log loss (binary cross-entropy)
# Key requirement: Calibrated probabilities (not just correct ranking)
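A minimal sketch of two pieces of this formulation: the log-loss objective, and the standard probability correction applied when training on negative-downsampled click logs (keep each non-click with probability `w`, then re-calibrate at serving time). Function names and the sampling rate are illustrative.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Binary cross-entropy averaged over examples (the training loss)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def correct_downsampled_p(p, w):
    """Re-calibrate a probability from a model trained with negatives
    downsampled at rate w: q = p / (p + (1 - p) / w).
    With w = 1 (no downsampling) the prediction is unchanged."""
    return p / (p + (1 - p) / w)
```

Without this correction, a model trained on downsampled data systematically over-predicts CTR, which violates the calibration requirement above.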
Step 2: High-Level Architecture
# Architecture: Ad Serving Pipeline
#
# [User visits page / types query]
# |
# [Ad Retrieval] --> Select ~10,000 eligible ads (targeting, budget, etc.)
# |
# [Lightweight Pre-Filter] --> Reduce to ~1,000 ads (simple model)
# |
# [CTR Prediction Model] --> Score each ad: P(click | user, ad, context)
# |
# [Auction Engine] --> rank = bid * P(click) * quality_factor
# |
# [Ad Selection] --> Pick top-K ads for available slots
# |
# [Ad Rendering] --> Show ads to user, log impression
# |
# [Click/No-Click] --> Log feedback for model training
#
# Offline:
# [Click Logs] --> [Feature Pipeline] --> [Model Training] --> [Model Registry]
# [Calibration Pipeline] --> [Isotonic Regression / Platt Scaling]
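The auction step above can be sketched as a simple sort by `bid * P(click) * quality_factor`; the ad dictionaries and values here are illustrative, not a production schema.

```python
def auction_rank(candidates):
    """Rank eligible ads by the auction score bid * pCTR * quality."""
    return sorted(candidates,
                  key=lambda ad: ad["bid"] * ad["pctr"] * ad["quality"],
                  reverse=True)

ads = [
    {"id": "a", "bid": 2.0, "pctr": 0.010, "quality": 1.0},  # score 0.020
    {"id": "b", "bid": 1.0, "pctr": 0.030, "quality": 1.0},  # score 0.030
    {"id": "c", "bid": 5.0, "pctr": 0.004, "quality": 0.5},  # score 0.010
]
top = auction_rank(ads)[0]  # "b": a higher predicted CTR beats a higher bid
```

Note how ad "c" bids the most but ranks last: the quality factor and low predicted CTR discount the raw bid.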
Step 3: Deep Dive — Feature Engineering
User Features
| Feature | Type | Description |
|---|---|---|
| user_segment | Categorical | Interest-based segments: tech, fashion, sports, etc. |
| user_historical_ctr | Numerical | User’s overall click rate on ads (smoothed) |
| user_ad_category_ctr | Numerical | User’s click rate on this ad category specifically |
| user_recency | Numerical | Days since last click on any ad |
| user_device_type | Categorical | Mobile, desktop, tablet (CTR varies 2–3x) |
| user_embedding | Dense vector | Learned user representation from interaction history |
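The "(smoothed)" qualifier on the historical-CTR features above usually means shrinkage toward a global prior, so a user with 5 impressions isn't assigned a wild 0% or 40% CTR. A minimal Beta-prior sketch, with illustrative prior values:

```python
def smoothed_ctr(clicks, impressions, prior_ctr=0.02, prior_strength=100):
    """Shrink an observed CTR toward a global prior. With few impressions
    the estimate stays near prior_ctr; with many, the data dominates.
    Equivalent to a Beta(prior_ctr * s, (1 - prior_ctr) * s) prior."""
    return (clicks + prior_ctr * prior_strength) / (impressions + prior_strength)
```

For example, a user with 0 clicks in 5 impressions gets ~1.9% (near the prior), while 500 clicks in 10,000 impressions gets ~4.97% (near the empirical 5%).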
Ad Features
| Feature | Type | Description |
|---|---|---|
| ad_historical_ctr | Numerical | Overall CTR of this ad creative (smoothed) |
| advertiser_quality | Numerical | Advertiser’s overall ad quality score |
| ad_category | Categorical | Product category: electronics, clothing, travel, etc. |
| ad_age_hours | Numerical | How long since the ad campaign started (creative fatigue) |
| ad_text_embedding | Dense vector | Embedding of ad copy and title |
| ad_image_embedding | Dense vector | Visual embedding of ad creative (CNN features) |
| landing_page_quality | Numerical | Score of ad landing page (relevance, load speed) |
Context Features
| Feature | Type | Description |
|---|---|---|
| search_query | Text embedding | For search ads: the user’s search query |
| page_context | Categorical | What page/section the ad appears on |
| ad_position | Categorical | Slot position (top vs. sidebar vs. bottom) |
| time_of_day | Cyclical | CTR varies significantly by hour |
| competitor_ads | Numerical | Number of competing ads in same auction |
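"Cyclical" encoding for `time_of_day` means mapping the hour onto a circle with sin/cos, so 23:00 and 00:00 end up close in feature space instead of 23 units apart. A minimal sketch:

```python
import math

def cyclical_encode(hour, period=24):
    """Encode hour-of-day as a point on the unit circle (sin, cos),
    so adjacent hours are adjacent in feature space across midnight."""
    angle = 2 * math.pi * hour / period
    return (math.sin(angle), math.cos(angle))
```

The same trick applies to day-of-week (`period=7`) or month (`period=12`).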
Cross Features (Most Important for CTR)
| Feature | Type | Description |
|---|---|---|
| user_x_ad_category_ctr | Numerical | Historical CTR of this user on this ad category |
| query_x_ad_relevance | Numerical | Cosine similarity between query and ad embeddings |
| user_x_advertiser_history | Numerical | Past clicks/impressions for this user-advertiser pair |
| position_x_device | Categorical | Mobile top-slot CTR differs from desktop top-slot |
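The `query_x_ad_relevance` feature above is just the cosine similarity between the query and ad embeddings; a minimal sketch on plain lists:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors, e.g. a query
    embedding and an ad text embedding, for the relevance cross feature."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```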
Deep Dive — Model Architecture
Evolution of CTR Models
# Historical progression of CTR models:
#
# V1: Logistic Regression (2010s)
# - Feature crosses done manually
# - Billions of sparse features (one-hot encoded)
# - Pros: Fast, interpretable, scales to billions of features
# - Cons: Cannot discover new feature interactions
#
# V2: GBDT + LR (Facebook, 2014)
# - GBDT extracts non-linear features
# - LR combines GBDT leaf indices with original features
# - Pros: Automatic feature interactions
# - Cons: GBDT is expensive to retrain
#
# V3: Deep Learning (2016+)
# - Wide & Deep (Google): Combines memorization and generalization
# - DeepFM: Deep network + factorization machine
# - DCN v2: Explicit cross network + deep network
# - Pros: Learns arbitrary feature interactions
# - Cons: Requires GPU serving, harder to debug
#
# V4: Two-Tower + Cross (current state of the art)
# - User tower (precomputed) + Ad tower
# - Cross-attention for fine-grained interactions
# - Pros: User embeddings cached, fast serving
# - Cons: Most complex to maintain
Recommended Architecture: DCN v2
# DCN v2 (Deep & Cross Network v2)
#
# Input: [user_features, ad_features, context_features, cross_features]
# |
# [Embedding Layer] -- sparse categoricals to dense (64-dim each)
# |
# [Cross Network] [Deep Network]
#  x_1 = x_0 ⊙ (W_1·x_0 + b_1) + x_0     h_1 = ReLU(W_1 * x + b_1)
#  x_2 = x_0 ⊙ (W_2·x_1 + b_2) + x_1     h_2 = ReLU(W_2 * h_1 + b_2)
#  (3 cross layers; ⊙ = elementwise)     h_3 = ReLU(W_3 * h_2 + b_3)
# | |
# [Concatenate cross_output + deep_output]
# |
# [Output Layer] --> sigmoid --> P(click)
#
# Why DCN v2:
# - Cross network captures explicit feature interactions (user x ad)
# - Deep network captures implicit non-linear patterns
# - More parameter-efficient than full attention mechanisms
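A single DCN-v2 cross layer, x_{l+1} = x_0 ⊙ (W_l·x_l + b_l) + x_l, is a few lines of NumPy. This toy sketch (random weights, tiny dimension) just illustrates the shape of the computation, not a trained model:

```python
import numpy as np

def cross_layer(x0, xl, W, b):
    """One DCN-v2 cross layer: x0 * (W @ xl + b) + xl, with * elementwise.
    Each stacked layer raises the polynomial order of explicit feature
    interactions by one, at O(d^2) parameters per layer."""
    return x0 * (W @ xl + b) + xl

rng = np.random.default_rng(0)
d = 4                                   # toy embedding dimension
x0 = rng.normal(size=d)                 # concatenated input embeddings
W1, b1 = rng.normal(size=(d, d)), np.zeros(d)
x1 = cross_layer(x0, x0, W1, b1)        # first cross layer
x2 = cross_layer(x0, x1, W1, b1)        # stacking -> higher-order crosses
```

The residual term `+ xl` means a zero weight matrix reduces the layer to the identity, which stabilizes training of stacked cross layers.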
Deep Dive — Exploration-Exploitation
If you always show the ad with the highest predicted CTR, you never explore potentially better ads. This is the exploration-exploitation dilemma.
Strategies
| Strategy | How It Works | Trade-Off |
|---|---|---|
| Epsilon-Greedy | Show random ad with probability ε (e.g., 5%) | Simple but wastes exploration budget |
| Thompson Sampling | Sample CTR from posterior distribution, show highest sample | Better exploration but harder to implement at scale |
| Upper Confidence Bound | Score = predicted_ctr + exploration_bonus / sqrt(impressions) | Principled but requires tracking impression counts |
| New Ad Boost | Give new ads extra exposure until confidence interval is small enough | Simple, practical, used widely in production |
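Thompson sampling from the table above is compact when each ad's CTR posterior is modeled as Beta(clicks + 1, impressions − clicks + 1); a minimal sketch with an illustrative ad schema:

```python
import random

def thompson_pick(ads):
    """Thompson sampling: draw a CTR sample from each ad's Beta posterior
    and show the ad with the highest draw. Ads with few impressions have
    wide posteriors, so they occasionally win and get explored."""
    best, best_sample = None, -1.0
    for ad in ads:
        sample = random.betavariate(ad["clicks"] + 1,
                                    ad["impressions"] - ad["clicks"] + 1)
        if sample > best_sample:
            best, best_sample = ad, sample
    return best
```

As an ad accumulates impressions its posterior narrows, and the policy smoothly shifts from exploration to exploitation without a tuned ε.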
Deep Dive — Calibration
CTR models must produce calibrated probabilities, not just correct rankings.
# Calibration pipeline
#
# Problem: Deep learning models are often overconfident
# Solution: Post-hoc calibration
#
# Method 1: Platt Scaling
# calibrated_p = sigmoid(a * raw_p + b)
# Fit a, b on held-out calibration set
#
# Method 2: Isotonic Regression
# Non-parametric calibration: learn a monotonic mapping
# calibrated_p = isotonic_function(raw_p)
# More flexible but can overfit with small calibration sets
#
# Method 3: Temperature Scaling
# calibrated_p = sigmoid(logit(raw_p) / temperature)
# Fit temperature on held-out set
# Simplest, works well in practice
#
# Validation: Plot calibration curve
# x-axis: binned predicted probability
# y-axis: actual click rate in each bin
# Perfect calibration = diagonal line
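The calibration-curve validation above is usually summarized as expected calibration error: bin the predictions, compare mean predicted probability to the empirical click rate per bin, and average the gaps. A minimal sketch:

```python
def calibration_error(p_pred, y_true, n_bins=10):
    """Expected calibration error: average |mean predicted p - empirical
    click rate| over equal-width probability bins, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(p_pred, y_true):
        idx = min(int(p * n_bins), n_bins - 1)   # clamp p == 1.0 into last bin
        bins[idx].append((p, y))
    ece, n = 0.0, len(p_pred)
    for b in bins:
        if not b:
            continue
        avg_p = sum(p for p, _ in b) / len(b)    # mean prediction in bin
        avg_y = sum(y for _, y in b) / len(b)    # empirical rate in bin
        ece += abs(avg_p - avg_y) * len(b) / n
    return ece
```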
Metrics & Evaluation
Offline Metrics
| Metric | What It Measures | Target |
|---|---|---|
| AUC-ROC | Ranking quality | > 0.78 |
| Log Loss | Probability calibration + ranking | < 0.42 |
| Calibration Error | abs(predicted_ctr - actual_ctr) averaged across bins | < 2% |
| PRAUC | Precision-recall balance (for imbalanced data) | > 0.15 |
| Normalized Entropy (NE) | Model log loss divided by log loss of a constant average-CTR predictor (lower is better) | < 1, with ≥ 5% gain (1 − NE ≥ 0.05) |
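Normalized entropy from the table above is the model's log loss divided by the log loss of always predicting the dataset's average CTR; values below 1 beat the naive baseline. A minimal sketch:

```python
import math

def normalized_entropy(y_true, p_pred, eps=1e-15):
    """NE = model log loss / log loss of predicting the empirical CTR
    for every example. NE < 1 means the model beats the constant baseline;
    'NE gain' is 1 - NE."""
    n = len(y_true)
    base_ctr = sum(y_true) / n
    def ll(p, y):
        p = min(max(p, eps), 1 - eps)
        return -(y * math.log(p) + (1 - y) * math.log(1 - p))
    model = sum(ll(p, y) for p, y in zip(p_pred, y_true)) / n
    naive = sum(ll(base_ctr, y) for y in y_true) / n
    return model / naive
```

NE is popular for CTR work because, unlike raw log loss, it is comparable across datasets with different background click rates.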
Online Metrics (A/B Test)
| Metric | Description | Guardrail |
|---|---|---|
| Revenue per 1000 impressions (RPM) | Primary business metric | Must increase |
| CTR | Click-through rate | Must not decrease |
| Advertiser ROI | Conversion rate for advertisers | Must not decrease (long-term health) |
| User ad engagement | Do users interact with ads? | Track but not optimize directly |
| Ad load | Number of ads per page | Must not increase (user experience) |
Step 4: Trade-Offs & Extensions
Revenue vs. User Experience
Showing more ads increases short-term revenue but degrades user experience, reducing long-term engagement. Use a quality factor that penalizes low-relevance ads regardless of bid.
Real-Time vs. Near-Real-Time Features
Features like “user clicked similar ad 2 minutes ago” are powerful but expensive to compute. Use a tiered feature freshness: some features updated per-request, others hourly.
Privacy-Preserving Ads
With cookie deprecation and privacy regulations, CTR prediction must adapt. Explore federated learning, on-device prediction, and contextual targeting as alternatives to user tracking.
Multi-Objective: Click + Conversion
Predicting clicks alone can favor clickbait ads. Add a conversion prediction head and rank by expected_revenue = bid × P(click) × P(conversion|click).
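A toy comparison of the conversion-aware score, with illustrative numbers: a clickbait ad with 5× the CTR still loses to an ad that actually converts.

```python
def expected_revenue(bid, p_click, p_conv_given_click):
    """Conversion-aware auction score: bid * P(click) * P(conversion | click).
    Penalizes ads that attract clicks but rarely convert."""
    return bid * p_click * p_conv_given_click

clickbait = expected_revenue(1.0, 0.10, 0.001)  # high CTR, ~no conversions
honest    = expected_revenue(1.0, 0.02, 0.050)  # modest CTR, real conversions
```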