Advanced

Design Ad Click Prediction

A complete walkthrough of designing an ad click prediction system. This is arguably the highest-revenue ML application in the world — a 0.1% improvement in CTR prediction can mean billions of dollars in annual revenue for companies like Google and Meta.

Step 1: Clarify Requirements

📝
Key clarifications:
  • “What type of ads?” — Search ads (query-triggered) or display/feed ads (browsing context)
  • “Scale?” — 10B ad impressions/day, 100K QPS, score 1000 ads per request in <50ms
  • “What are we optimizing?” — Revenue = bid × P(click) × P(conversion), ranked by expected revenue
  • “Pricing model?” — CPC (cost-per-click), so we charge on click, need accurate CTR
  • “Calibration matters?” — Yes, predicted probabilities must be accurate (not just ranking order)

ML Problem Formulation

# Problem formulation
# Business goal:    Maximize ad revenue while maintaining user experience
# ML task:          Binary classification (click / no-click)
# Predict:          P(click | user, ad, context)
# Ranking score:    bid * P(click) * quality_factor
# Training data:    Click/impression logs (heavily imbalanced: ~2% CTR)
# Loss function:    Log loss (binary cross-entropy)
# Key requirement:  Calibrated probabilities (not just correct ranking)
Why calibration matters: In ad ranking, we multiply P(click) by bid amount. If P(click) is systematically too high, low-bid ads unfairly beat high-bid ads. If too low, we undercharge advertisers and lose revenue. The predicted probability must closely match the actual click rate.
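
The effect described above can be sketched with a toy auction (quality factor omitted). The bids and CTRs below are invented for illustration, not from the source.

```python
# Toy illustration: a miscalibrated P(click) flips the auction outcome.

def rank_score(bid, p_click):
    """Expected-value ranking score: bid * P(click)."""
    return bid * p_click

ads = [
    {"name": "high_bid", "bid": 2.00, "true_ctr": 0.010},
    {"name": "low_bid",  "bid": 0.50, "true_ctr": 0.030},
]

# With calibrated predictions, the high-bid ad wins (0.020 > 0.015):
winner = max(ads, key=lambda a: rank_score(a["bid"], a["true_ctr"]))

# If the model overpredicts the low-bid ad's CTR by 2x, it wrongly wins
# (0.50 * 0.060 = 0.030 > 0.020):
overpredicted = max(ads, key=lambda a: rank_score(
    a["bid"], a["true_ctr"] * (2.0 if a["name"] == "low_bid" else 1.0)))
```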

Step 2: High-Level Architecture

# Architecture: Ad Serving Pipeline
#
# [User visits page / types query]
#   |
# [Ad Retrieval] --> Select ~10,000 eligible ads (targeting, budget, etc.)
#   |
# [Lightweight Pre-Filter] --> Reduce to ~1,000 ads (simple model)
#   |
# [CTR Prediction Model] --> Score each ad: P(click | user, ad, context)
#   |
# [Auction Engine] --> rank = bid * P(click) * quality_factor
#   |
# [Ad Selection] --> Pick top-K ads for available slots
#   |
# [Ad Rendering] --> Show ads to user, log impression
#   |
# [Click/No-Click] --> Log feedback for model training
#
# Offline:
# [Click Logs] --> [Feature Pipeline] --> [Model Training] --> [Model Registry]
# [Calibration Pipeline] --> [Isotonic Regression / Platt Scaling]

Step 3: Deep Dive — Feature Engineering

User Features

Feature | Type | Description
user_segment | Categorical | Interest-based segments: tech, fashion, sports, etc.
user_historical_ctr | Numerical | User’s overall click rate on ads (smoothed)
user_ad_category_ctr | Numerical | User’s click rate on this ad category specifically
user_recency | Numerical | Days since last click on any ad
user_device_type | Categorical | Mobile, desktop, tablet (CTR varies 2–3x)
user_embedding | Dense vector | Learned user representation from interaction history
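
Several features above are described as “smoothed” CTRs. One common recipe is empirical-Bayes shrinkage toward a global prior; the prior CTR and prior strength below are invented for illustration, not production values.

```python
# Shrink a raw CTR toward a global prior so sparse histories stay sane.

def smoothed_ctr(clicks, impressions, prior_ctr=0.02, prior_strength=100):
    """With little history the estimate stays near prior_ctr; with lots of
    history it is dominated by the observed data."""
    return (clicks + prior_ctr * prior_strength) / (impressions + prior_strength)

smoothed_ctr(0, 0)          # no history: falls back to the 2% prior
smoothed_ctr(5000, 100000)  # heavy history: dominated by the observed ~5% CTR
```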

Ad Features

Feature | Type | Description
ad_historical_ctr | Numerical | Overall CTR of this ad creative (smoothed)
advertiser_quality | Numerical | Advertiser’s overall ad quality score
ad_category | Categorical | Product category: electronics, clothing, travel, etc.
ad_age_hours | Numerical | How long since the ad campaign started (creative fatigue)
ad_text_embedding | Dense vector | Embedding of ad copy and title
ad_image_embedding | Dense vector | Visual embedding of ad creative (CNN features)
landing_page_quality | Numerical | Score of ad landing page (relevance, load speed)

Context Features

Feature | Type | Description
search_query | Text embedding | For search ads: the user’s search query
page_context | Categorical | What page/section the ad appears on
ad_position | Categorical | Slot position (top vs. sidebar vs. bottom)
time_of_day | Cyclical | CTR varies significantly by hour
competitor_ads | Numerical | Number of competing ads in the same auction
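
The time_of_day feature above is typed “Cyclical”: a common trick is to map the hour onto a circle with sin/cos so 23:00 and 00:00 land next to each other instead of 23 units apart. A minimal sketch:

```python
import math

def cyclical_hour(hour):
    """Encode an hour-of-day as a point on the unit circle."""
    angle = 2 * math.pi * hour / 24
    return (math.sin(angle), math.cos(angle))

# 23:00 is close to 01:00 in this encoding, but far from 12:00:
near = math.dist(cyclical_hour(23), cyclical_hour(1))
far = math.dist(cyclical_hour(23), cyclical_hour(12))
```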

Cross Features (Most Important for CTR)

Feature | Type | Description
user_x_ad_category_ctr | Numerical | Historical CTR of this user on this ad category
query_x_ad_relevance | Numerical | Cosine similarity between query and ad embeddings
user_x_advertiser_history | Numerical | Past clicks/impressions for this user-advertiser pair
position_x_device | Categorical | Mobile top-slot CTR differs from desktop top-slot
💡
Feature interaction is king: In CTR prediction, cross features (user-ad interactions) provide the most lift. A user who loves electronics and an electronics ad is a powerful signal. The model must capture these interactions efficiently — that is why architectures like DeepFM and DCN are popular.
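
The query_x_ad_relevance cross feature above is typically a cosine similarity between query and ad embeddings. A minimal sketch with toy 3-dimensional vectors (real embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

query_emb = [0.2, 0.8, 0.1]   # toy vector for a search query
ad_emb = [0.25, 0.7, 0.05]    # toy vector for a related ad
cosine_similarity(query_emb, ad_emb)  # high: a related query/ad pair
```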

Deep Dive — Model Architecture

Evolution of CTR Models

# Historical progression of CTR models:
#
# V1: Logistic Regression (2010s)
#     - Feature crosses done manually
#     - Billions of sparse features (one-hot encoded)
#     - Pros: Fast, interpretable, scales to billions of features
#     - Cons: Cannot discover new feature interactions
#
# V2: GBDT + LR (Facebook, 2014)
#     - GBDT extracts non-linear features
#     - LR combines GBDT leaf indices with original features
#     - Pros: Automatic feature interactions
#     - Cons: GBDT is expensive to retrain
#
# V3: Deep Learning (2016+)
#     - Wide & Deep (Google): Combines memorization and generalization
#     - DeepFM: Deep network + factorization machine
#     - DCN v2: Explicit cross network + deep network
#     - Pros: Learns arbitrary feature interactions
#     - Cons: Requires GPU serving, harder to debug
#
# V4: Two-Tower + Cross (current state of the art)
#     - User tower (precomputed) + Ad tower
#     - Cross-attention for fine-grained interactions
#     - Pros: User embeddings cached, fast serving
#     - Cons: Most complex to maintain

Recommended Architecture: DCN v2

# DCN v2 (Deep & Cross Network v2)
#
# Input: [user_features, ad_features, context_features, cross_features]
#   |
# [Embedding Layer] -- sparse categoricals to dense (64-dim each)
#   |
# [Cross Network]                      [Deep Network]
# x_1 = x_0 ⊙ (W_1 x_0 + b_1) + x_0   h_1 = ReLU(W_1 * x + b_1)
# x_2 = x_0 ⊙ (W_2 x_1 + b_2) + x_1   h_2 = ReLU(W_2 * h_1 + b_2)
# (3 cross layers; ⊙ = elementwise)   h_3 = ReLU(W_3 * h_2 + b_3)
#   |                       |
# [Concatenate cross_output + deep_output]
#   |
# [Output Layer] --> sigmoid --> P(click)
#
# Why DCN v2:
# - Cross network captures explicit feature interactions (user x ad)
# - Deep network captures implicit non-linear patterns
# - More parameter-efficient than full attention mechanisms
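
A minimal NumPy sketch of the cross network above. One DCN v2 cross layer computes x_{l+1} = x_0 ⊙ (W_l x_l + b_l) + x_l; the dimension and random weights below are toy values, not a production configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                        # concatenated-embedding dimension (toy)
x0 = rng.normal(size=d)      # input vector after the embedding layer

def cross_layer(x0, xl, W, b):
    """Explicit feature interaction with a residual connection."""
    return x0 * (W @ xl + b) + xl

x = x0
for _ in range(3):           # three cross layers, as in the diagram
    W = 0.1 * rng.normal(size=(d, d))
    b = np.zeros(d)
    x = cross_layer(x0, x, W, b)
```

Note the residual term: with zero weights each layer is the identity, which makes stacking layers stable.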

Deep Dive — Exploration-Exploitation

If you always show the ad with the highest predicted CTR, you never explore potentially better ads. This is the exploration-exploitation dilemma.

Strategies

Strategy | How It Works | Trade-Off
Epsilon-Greedy | Show a random ad with probability ε (e.g., 5%) | Simple but wastes exploration budget
Thompson Sampling | Sample CTR from the posterior distribution; show the highest sample | Better exploration but harder to implement at scale
Upper Confidence Bound | Score = predicted_ctr + exploration_bonus / sqrt(impressions) | Principled but requires tracking impression counts
New Ad Boost | Give new ads extra exposure until the confidence interval is small enough | Simple, practical, widely used in production
💡
Cold-start ads: New ads have no click history, so their predicted CTR is unreliable. Use a “new ad boost”: add a ranking bonus for roughly the first 1,000 impressions, then let the model take over. This is a practical form of exploration that advertisers expect.
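
A toy Thompson-sampling round, as in the table above: each ad’s CTR gets a Beta posterior from its click/impression counts, we sample from every posterior, and the highest sample wins. All counts here are invented.

```python
import random

random.seed(42)

ads = {
    "veteran":  {"clicks": 200, "impressions": 10000},  # well-measured ~2% CTR
    "newcomer": {"clicks": 0,   "impressions": 50},     # wide, uncertain posterior
}

def thompson_pick(ads):
    """Sample each ad's CTR from Beta(1 + clicks, 1 + non-clicks); pick the max."""
    samples = {
        name: random.betavariate(1 + a["clicks"],
                                 1 + a["impressions"] - a["clicks"])
        for name, a in ads.items()
    }
    return max(samples, key=samples.get)

# Over many rounds the uncertain newcomer still gets some exposure while the
# well-measured veteran takes the bulk of the traffic:
picks = [thompson_pick(ads) for _ in range(1000)]
```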

Deep Dive — Calibration

CTR models must produce calibrated probabilities, not just correct rankings.

# Calibration pipeline
#
# Problem: Deep learning models are often overconfident
# Solution: Post-hoc calibration
#
# Method 1: Platt Scaling
#   calibrated_p = sigmoid(a * raw_p + b)
#   Fit a, b on held-out calibration set
#
# Method 2: Isotonic Regression
#   Non-parametric calibration: learn a monotonic mapping
#   calibrated_p = isotonic_function(raw_p)
#   More flexible but can overfit with small calibration sets
#
# Method 3: Temperature Scaling
#   calibrated_p = sigmoid(logit(raw_p) / temperature)
#   Fit temperature on held-out set
#   Simplest, works well in practice
#
# Validation: Plot calibration curve
#   x-axis: binned predicted probability
#   y-axis: actual click rate in each bin
#   Perfect calibration = diagonal line
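
A sketch of temperature scaling (Method 3 above); the temperature values here are illustrative, and in practice T is fit on a held-out calibration set.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def temperature_scale(raw_p, temperature):
    """T > 1 softens overconfident predictions; T < 1 sharpens them."""
    return sigmoid(logit(raw_p) / temperature)

# An overconfident 0.9 is pulled toward 0.5 with T = 2:
temperature_scale(0.9, 2.0)  # -> 0.75 exactly, since sigmoid(ln(9)/2) = 3/4
```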

Metrics & Evaluation

Offline Metrics

Metric | What It Measures | Target
AUC-ROC | Ranking quality | > 0.78
Log Loss | Probability calibration + ranking | < 0.42
Calibration Error | abs(predicted_ctr - actual_ctr) averaged across bins | < 2%
PR-AUC | Precision-recall balance (for imbalanced data) | > 0.15
Normalized Entropy | Log loss relative to a naive background-CTR baseline | NE < 0.95 (≥ 5% improvement)
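
Normalized Entropy from the table above is the model’s average log loss divided by the log loss of always predicting the background CTR; NE < 1 beats the naive baseline. The labels and predictions below are synthetic.

```python
import math

def log_loss(y_true, y_pred):
    """Average binary cross-entropy, clipped away from log(0)."""
    eps = 1e-12
    return -sum(y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
                for y, p in zip(y_true, y_pred)) / len(y_true)

def normalized_entropy(y_true, y_pred):
    base_ctr = sum(y_true) / len(y_true)
    return log_loss(y_true, y_pred) / log_loss(y_true, [base_ctr] * len(y_true))

y = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # synthetic sample with 10% CTR
p = [0.6, 0.05, 0.05, 0.1, 0.05, 0.05, 0.1, 0.05, 0.05, 0.05]
normalized_entropy(y, p)  # < 1.0: better than predicting 0.1 for everyone
```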

Online Metrics (A/B Test)

Metric | Description | Guardrail
Revenue per 1000 impressions (RPM) | Primary business metric | Must increase
CTR | Click-through rate | Must not decrease
Advertiser ROI | Conversion rate for advertisers | Must not decrease (long-term health)
User ad engagement | Do users interact with ads? | Track, but do not optimize directly
Ad load | Number of ads per page | Must not increase (user experience)

Step 4: Trade-Offs & Extensions

💸

Revenue vs. User Experience

Showing more ads increases short-term revenue but degrades user experience, reducing long-term engagement. Use a quality factor that penalizes low-relevance ads regardless of bid.

Real-Time vs. Near-Real-Time Features

Features like “user clicked similar ad 2 minutes ago” are powerful but expensive to compute. Use a tiered feature freshness: some features updated per-request, others hourly.

🔒

Privacy-Preserving Ads

With cookie deprecation and privacy regulations, CTR prediction must adapt. Explore federated learning, on-device prediction, and contextual targeting as alternatives to user tracking.

📈

Multi-Objective: Click + Conversion

Predicting clicks alone can favor clickbait ads. Add a conversion prediction head and rank by expected_revenue = bid × P(click) × P(conversion|click).
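
The multi-objective ranking above can be sketched with invented numbers: ranked by P(click) alone, the clickbait ad wins; ranked by expected revenue, the converting ad wins.

```python
def expected_revenue(bid, p_click, p_conv_given_click):
    """Rank by bid * P(click) * P(conversion | click)."""
    return bid * p_click * p_conv_given_click

ads = [
    # Clickbait: high CTR, almost no conversions.
    {"name": "clickbait", "bid": 1.0, "p_click": 0.08, "p_conv": 0.001},
    # Relevant ad: lower CTR but real purchase intent.
    {"name": "relevant",  "bid": 1.0, "p_click": 0.03, "p_conv": 0.05},
]

# 0.03 * 0.05 = 0.0015 beats 0.08 * 0.001 = 0.00008:
winner = max(ads, key=lambda a: expected_revenue(a["bid"], a["p_click"], a["p_conv"]))
winner["name"]  # -> "relevant"
```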