Design Ad Click Prediction
A complete walkthrough of designing an ad click prediction system. This is arguably the highest-revenue ML application in the world — a 0.1% improvement in CTR prediction can mean billions of dollars in annual revenue for companies like Google and Meta.
Step 1: Clarify Requirements
- “What type of ads?” — Search ads (query-triggered) or display/feed ads (browsing context)
- “Scale?” — 10B ad impressions/day, 100K QPS, score 1000 ads per request in <50ms
- “What are we optimizing?” — Platform revenue: under CPC, expected revenue per impression = bid × P(click); conversion-aware ranking extends this to bid × P(click) × P(conversion|click)
- “Pricing model?” — CPC (cost-per-click), so we charge on click, need accurate CTR
- “Calibration matters?” — Yes, predicted probabilities must be accurate (not just ranking order)
ML Problem Formulation
# Problem formulation
# Business goal: Maximize ad revenue while maintaining user experience
# ML task: Binary classification (click / no-click)
# Predict: P(click | user, ad, context)
# Ranking score: bid * P(click) * quality_factor
# Training data: Click/impression logs (heavily imbalanced: ~2% CTR)
# Loss function: Log loss (binary cross-entropy)
# Key requirement: Calibrated probabilities (not just correct ranking)
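A minimal sketch of two pieces of this formulation: the log-loss objective, and the standard probability correction applied when training on negative-downsampled click logs (keep each non-click with probability `w`, then re-calibrate at serving time). Function names and the sampling rate are illustrative.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Binary cross-entropy averaged over examples (the training loss)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def correct_downsampled_p(p, w):
    """Re-calibrate a probability from a model trained with negatives
    downsampled at rate w: q = p / (p + (1 - p) / w).
    With w = 1 (no downsampling) the prediction is unchanged."""
    return p / (p + (1 - p) / w)
```

Without this correction, a model trained on downsampled data systematically over-predicts CTR, which violates the calibration requirement above.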
Step 2: High-Level Architecture
# Architecture: Ad Serving Pipeline
#
# [User visits page / types query]
# |
# [Ad Retrieval] --> Select ~10,000 eligible ads (targeting, budget, etc.)
# |
# [Lightweight Pre-Filter] --> Reduce to ~1,000 ads (simple model)
# |
# [CTR Prediction Model] --> Score each ad: P(click | user, ad, context)
# |
# [Auction Engine] --> rank = bid * P(click) * quality_factor
# |
# [Ad Selection] --> Pick top-K ads for available slots
# |
# [Ad Rendering] --> Show ads to user, log impression
# |
# [Click/No-Click] --> Log feedback for model training
#
# Offline:
# [Click Logs] --> [Feature Pipeline] --> [Model Training] --> [Model Registry]
# [Calibration Pipeline] --> [Isotonic Regression / Platt Scaling]
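The auction step above can be sketched as a simple sort by `bid * P(click) * quality_factor`; the ad dictionaries and values here are illustrative, not a production schema.

```python
def auction_rank(candidates):
    """Rank eligible ads by the auction score bid * pCTR * quality."""
    return sorted(candidates,
                  key=lambda ad: ad["bid"] * ad["pctr"] * ad["quality"],
                  reverse=True)

ads = [
    {"id": "a", "bid": 2.0, "pctr": 0.010, "quality": 1.0},  # score 0.020
    {"id": "b", "bid": 1.0, "pctr": 0.030, "quality": 1.0},  # score 0.030
    {"id": "c", "bid": 5.0, "pctr": 0.004, "quality": 0.5},  # score 0.010
]
top = auction_rank(ads)[0]  # "b": a higher predicted CTR beats a higher bid
```

Note how ad "c" bids the most but ranks last: the quality factor and low predicted CTR discount the raw bid.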
Step 3: Deep Dive — Feature Engineering
User Features
| Feature | Type | Description |
|---|---|---|
| user_segment | Categorical | Interest-based segments: tech, fashion, sports, etc. |
| user_historical_ctr | Numerical | User’s overall click rate on ads (smoothed) |
| user_ad_category_ctr | Numerical | User’s click rate on this ad category specifically |
| user_recency | Numerical | Days since last click on any ad |
| user_device_type | Categorical | Mobile, desktop, tablet (CTR varies 2–3x) |
| user_embedding | Dense vector | Learned user representation from interaction history |
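The "(smoothed)" qualifier on the historical-CTR features above usually means shrinkage toward a global prior, so a user with 5 impressions isn't assigned a wild 0% or 40% CTR. A minimal Beta-prior sketch, with illustrative prior values:

```python
def smoothed_ctr(clicks, impressions, prior_ctr=0.02, prior_strength=100):
    """Shrink an observed CTR toward a global prior. With few impressions
    the estimate stays near prior_ctr; with many, the data dominates.
    Equivalent to a Beta(prior_ctr * s, (1 - prior_ctr) * s) prior."""
    return (clicks + prior_ctr * prior_strength) / (impressions + prior_strength)
```

For example, a user with 0 clicks in 5 impressions gets ~1.9% (near the prior), while 500 clicks in 10,000 impressions gets ~4.97% (near the empirical 5%).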
Ad Features
| Feature | Type | Description |
|---|---|---|
| ad_historical_ctr | Numerical | Overall CTR of this ad creative (smoothed) |
| advertiser_quality | Numerical | Advertiser’s overall ad quality score |
| ad_category | Categorical | Product category: electronics, clothing, travel, etc. |
| ad_age_hours | Numerical | How long since the ad campaign started (creative fatigue) |
| ad_text_embedding | Dense vector | Embedding of ad copy and title |
| ad_image_embedding | Dense vector | Visual embedding of ad creative (CNN features) |
| landing_page_quality | Numerical | Score of ad landing page (relevance, load speed) |
Context Features
| Feature | Type | Description |
|---|---|---|
| search_query | Text embedding | For search ads: the user’s search query |
| page_context | Categorical | What page/section the ad appears on |
| ad_position | Categorical | Slot position (top vs. sidebar vs. bottom) |
| time_of_day | Cyclical | CTR varies significantly by hour |
| competitor_ads | Numerical | Number of competing ads in same auction |
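"Cyclical" encoding for `time_of_day` means mapping the hour onto a circle with sin/cos, so 23:00 and 00:00 end up close in feature space instead of 23 units apart. A minimal sketch:

```python
import math

def cyclical_encode(hour, period=24):
    """Encode hour-of-day as a point on the unit circle (sin, cos),
    so adjacent hours are adjacent in feature space across midnight."""
    angle = 2 * math.pi * hour / period
    return (math.sin(angle), math.cos(angle))
```

The same trick applies to day-of-week (`period=7`) or month (`period=12`).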
Cross Features (Most Important for CTR)
| Feature | Type | Description |
|---|---|---|
| user_x_ad_category_ctr | Numerical | Historical CTR of this user on this ad category |
| query_x_ad_relevance | Numerical | Cosine similarity between query and ad embeddings |
| user_x_advertiser_history | Numerical | Past clicks/impressions for this user-advertiser pair |
| position_x_device | Categorical | Mobile top-slot CTR differs from desktop top-slot |
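The `query_x_ad_relevance` feature above is just the cosine similarity between the query and ad embeddings; a minimal sketch on plain lists:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors, e.g. a query
    embedding and an ad text embedding, for the relevance cross feature."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```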
Deep Dive — Model Architecture
Evolution of CTR Models
# Historical progression of CTR models:
#
# V1: Logistic Regression (2010s)
# - Feature crosses done manually
# - Billions of sparse features (one-hot encoded)
# - Pros: Fast, interpretable, scales to billions of features
# - Cons: Cannot discover new feature interactions
#
# V2: GBDT + LR (Facebook, 2014)
# - GBDT extracts non-linear features
# - LR combines GBDT leaf indices with original features
# - Pros: Automatic feature interactions
# - Cons: GBDT is expensive to retrain
#
# V3: Deep Learning (2016+)
# - Wide & Deep (Google): Combines memorization and generalization
# - DeepFM: Deep network + factorization machine
# - DCN v2: Explicit cross network + deep network
# - Pros: Learns arbitrary feature interactions
# - Cons: Requires GPU serving, harder to debug
#
# V4: Two-Tower + Cross (current state of the art)
# - User tower (precomputed) + Ad tower
# - Cross-attention for fine-grained interactions
# - Pros: User embeddings cached, fast serving
# - Cons: Most complex to maintain
Recommended Architecture: DCN v2
# DCN v2 (Deep & Cross Network v2)
#
# Input: [user_features, ad_features, context_features, cross_features]
# |
# [Embedding Layer] -- sparse categoricals to dense (64-dim each)
# |
# [Cross Network] [Deep Network]
#  x_1 = x_0 ⊙ (W_1·x_0 + b_1) + x_0     h_1 = ReLU(W_1 * x + b_1)
#  x_2 = x_0 ⊙ (W_2·x_1 + b_2) + x_1     h_2 = ReLU(W_2 * h_1 + b_2)
#  (3 cross layers; ⊙ = elementwise)     h_3 = ReLU(W_3 * h_2 + b_3)
# | |
# [Concatenate cross_output + deep_output]
# |
# [Output Layer] --> sigmoid --> P(click)
#
# Why DCN v2:
# - Cross network captures explicit feature interactions (user x ad)
# - Deep network captures implicit non-linear patterns
# - More parameter-efficient than full attention mechanisms
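A single DCN-v2 cross layer, x_{l+1} = x_0 ⊙ (W_l·x_l + b_l) + x_l, is a few lines of NumPy. This toy sketch (random weights, tiny dimension) just illustrates the shape of the computation, not a trained model:

```python
import numpy as np

def cross_layer(x0, xl, W, b):
    """One DCN-v2 cross layer: x0 * (W @ xl + b) + xl, with * elementwise.
    Each stacked layer raises the polynomial order of explicit feature
    interactions by one, at O(d^2) parameters per layer."""
    return x0 * (W @ xl + b) + xl

rng = np.random.default_rng(0)
d = 4                                   # toy embedding dimension
x0 = rng.normal(size=d)                 # concatenated input embeddings
W1, b1 = rng.normal(size=(d, d)), np.zeros(d)
x1 = cross_layer(x0, x0, W1, b1)        # first cross layer
x2 = cross_layer(x0, x1, W1, b1)        # stacking -> higher-order crosses
```

The residual term `+ xl` means a zero weight matrix reduces the layer to the identity, which stabilizes training of stacked cross layers.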
Deep Dive — Exploration-Exploitation
If you always show the ad with the highest predicted CTR, you never explore potentially better ads. This is the exploration-exploitation dilemma.
Strategies
| Strategy | How It Works | Trade-Off |
|---|---|---|
| Epsilon-Greedy | Show random ad with probability ε (e.g., 5%) | Simple but wastes exploration budget |
| Thompson Sampling | Sample CTR from posterior distribution, show highest sample | Better exploration but harder to implement at scale |
| Upper Confidence Bound | Score = predicted_ctr + exploration_bonus / sqrt(impressions) | Principled but requires tracking impression counts |
| New Ad Boost | Give new ads extra exposure until confidence interval is small enough | Simple, practical, used widely in production |
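Thompson sampling from the table above is compact when each ad's CTR posterior is modeled as Beta(clicks + 1, impressions − clicks + 1); a minimal sketch with an illustrative ad schema:

```python
import random

def thompson_pick(ads):
    """Thompson sampling: draw a CTR sample from each ad's Beta posterior
    and show the ad with the highest draw. Ads with few impressions have
    wide posteriors, so they occasionally win and get explored."""
    best, best_sample = None, -1.0
    for ad in ads:
        sample = random.betavariate(ad["clicks"] + 1,
                                    ad["impressions"] - ad["clicks"] + 1)
        if sample > best_sample:
            best, best_sample = ad, sample
    return best
```

As an ad accumulates impressions its posterior narrows, and the policy smoothly shifts from exploration to exploitation without a tuned ε.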
Deep Dive — Calibration
CTR models must produce calibrated probabilities, not just correct rankings.
# Calibration pipeline
#
# Problem: Deep learning models are often overconfident
# Solution: Post-hoc calibration
#
# Method 1: Platt Scaling
# calibrated_p = sigmoid(a * raw_p + b)
# Fit a, b on held-out calibration set
#
# Method 2: Isotonic Regression
# Non-parametric calibration: learn a monotonic mapping
# calibrated_p = isotonic_function(raw_p)
# More flexible but can overfit with small calibration sets
#
# Method 3: Temperature Scaling
# calibrated_p = sigmoid(logit(raw_p) / temperature)
# Fit temperature on held-out set
# Simplest, works well in practice
#
# Validation: Plot calibration curve
# x-axis: binned predicted probability
# y-axis: actual click rate in each bin
# Perfect calibration = diagonal line
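The calibration-curve validation above is usually summarized as expected calibration error: bin the predictions, compare mean predicted probability to the empirical click rate per bin, and average the gaps. A minimal sketch:

```python
def calibration_error(p_pred, y_true, n_bins=10):
    """Expected calibration error: average |mean predicted p - empirical
    click rate| over equal-width probability bins, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(p_pred, y_true):
        idx = min(int(p * n_bins), n_bins - 1)   # clamp p == 1.0 into last bin
        bins[idx].append((p, y))
    ece, n = 0.0, len(p_pred)
    for b in bins:
        if not b:
            continue
        avg_p = sum(p for p, _ in b) / len(b)    # mean prediction in bin
        avg_y = sum(y for _, y in b) / len(b)    # empirical rate in bin
        ece += abs(avg_p - avg_y) * len(b) / n
    return ece
```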
Metrics & Evaluation
Offline Metrics
| Metric | What It Measures | Target |
|---|---|---|
| AUC-ROC | Ranking quality | > 0.78 |
| Log Loss | Probability calibration + ranking | < 0.42 |
| Calibration Error | abs(predicted_ctr - actual_ctr) averaged across bins | < 2% |
| PRAUC | Precision-recall balance (for imbalanced data) | > 0.15 |
| Normalized Entropy (NE) | Model log loss divided by log loss of a constant average-CTR predictor (lower is better) | < 1, with ≥ 5% gain (1 − NE ≥ 0.05) |
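Normalized entropy from the table above is the model's log loss divided by the log loss of always predicting the dataset's average CTR; values below 1 beat the naive baseline. A minimal sketch:

```python
import math

def normalized_entropy(y_true, p_pred, eps=1e-15):
    """NE = model log loss / log loss of predicting the empirical CTR
    for every example. NE < 1 means the model beats the constant baseline;
    'NE gain' is 1 - NE."""
    n = len(y_true)
    base_ctr = sum(y_true) / n
    def ll(p, y):
        p = min(max(p, eps), 1 - eps)
        return -(y * math.log(p) + (1 - y) * math.log(1 - p))
    model = sum(ll(p, y) for p, y in zip(p_pred, y_true)) / n
    naive = sum(ll(base_ctr, y) for y in y_true) / n
    return model / naive
```

NE is popular for CTR work because, unlike raw log loss, it is comparable across datasets with different background click rates.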
Online Metrics (A/B Test)
| Metric | Description | Guardrail |
|---|---|---|
| Revenue per 1000 impressions (RPM) | Primary business metric | Must increase |
| CTR | Click-through rate | Must not decrease |
| Advertiser ROI | Conversion rate for advertisers | Must not decrease (long-term health) |
| User ad engagement | Do users interact with ads? | Track but not optimize directly |
| Ad load | Number of ads per page | Must not increase (user experience) |
Step 4: Trade-Offs & Extensions
Revenue vs. User Experience
Showing more ads increases short-term revenue but degrades user experience, reducing long-term engagement. Use a quality factor that penalizes low-relevance ads regardless of bid.
Real-Time vs. Near-Real-Time Features
Features like “user clicked similar ad 2 minutes ago” are powerful but expensive to compute. Use a tiered feature freshness: some features updated per-request, others hourly.
Privacy-Preserving Ads
With cookie deprecation and privacy regulations, CTR prediction must adapt. Explore federated learning, on-device prediction, and contextual targeting as alternatives to user tracking.
Multi-Objective: Click + Conversion
Predicting clicks alone can favor clickbait ads. Add a conversion prediction head and rank by expected_revenue = bid × P(click) × P(conversion|click).
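A toy comparison of the conversion-aware score, with illustrative numbers: a clickbait ad with 5× the CTR still loses to an ad that actually converts.

```python
def expected_revenue(bid, p_click, p_conv_given_click):
    """Conversion-aware auction score: bid * P(click) * P(conversion | click).
    Penalizes ads that attract clicks but rarely convert."""
    return bid * p_click * p_conv_given_click

clickbait = expected_revenue(1.0, 0.10, 0.001)  # high CTR, ~no conversions
honest    = expected_revenue(1.0, 0.02, 0.050)  # modest CTR, real conversions
```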