Design Content Moderation System
A complete walkthrough of designing an ML-powered content moderation system. This is one of the most impactful and nuanced ML systems in production — it must handle text, images, and video across dozens of languages while balancing free expression with safety at the scale of billions of posts per day.
Step 1: Clarify Requirements
- “What content types?” — Text posts, images, videos, comments, live streams
- “Scale?” — 5B posts/day, 500K posts/second peak, 50+ languages
- “What policy categories?” — Hate speech, violence, nudity, spam, misinformation, self-harm, terrorism
- “Latency requirements?” — Pre-publish: <500ms; Reactive (reported): within 1 hour
- “What actions?” — Remove, reduce distribution, add warning label, send to human review
- “Appeal process?” — Yes, users can appeal decisions
ML Problem Formulation
# Problem formulation
# Business goal: Keep platform safe while minimizing over-enforcement
# ML task: Multi-label classification (content can violate multiple policies)
# Input: Text + images + video frames + metadata + context
# Output: Per-policy violation probability + severity score
# Labels: hate_speech, violence, nudity, spam, misinformation, ...
# Training data: Human-reviewed content with policy labels
# Loss function: Weighted multi-label BCE (weight by severity and FP cost)
# Key constraint: Must support 50+ languages; policies vary by region
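The weighted multi-label BCE above can be sketched in plain Python. The per-policy weights below are illustrative assumptions (the source only says to weight by severity and false-positive cost), not real production values:

```python
import math

# Hypothetical per-policy weights: severe categories (and categories where
# false positives are costly) get larger weights. Values are illustrative.
POLICY_WEIGHTS = {
    "hate_speech": 2.0,
    "violence": 2.0,
    "nudity": 1.0,
    "spam": 0.5,
}

def weighted_multilabel_bce(probs, labels, weights=POLICY_WEIGHTS):
    """Weighted multi-label binary cross-entropy.

    probs  -- dict of policy -> predicted violation probability in (0, 1)
    labels -- dict of policy -> ground-truth label (0 or 1)
    Each policy is an independent binary problem, so one post can
    contribute loss for several violated policies at once.
    """
    eps = 1e-7  # clamp probabilities to avoid log(0)
    total = 0.0
    for policy, w in weights.items():
        p = min(max(probs[policy], eps), 1 - eps)
        y = labels[policy]
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(weights)
```

In a real training loop this would be a framework loss (e.g. a weighted `BCEWithLogitsLoss` in PyTorch); the sketch just makes the per-label independence and weighting explicit.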
Step 2: High-Level Architecture
# Architecture: Multi-Layer Content Moderation
#
# [New Post Created]
# |
# [Layer 1: Hash Matching] --> Match known violating content (perceptual hash)
# | (exact match --> immediate removal) ~5ms, catches re-uploads of removed content
# |
# [Layer 2: Rule Engine] --> Keyword blocklists, regex patterns
# | (fast, high precision) ~10ms, catches obvious violations
# |
# [Layer 3: ML Classifiers] --> Deep learning models per modality
# |-- Text classifier ~50ms (multilingual BERT)
# |-- Image classifier ~100ms (Vision Transformer)
# |-- Video classifier ~500ms (sample frames + audio)
# |
# [Decision Engine] --> Combine scores + policy rules
# |
# [Action Router]
# |-- High confidence violation: Auto-remove
# |-- Medium confidence: Reduce distribution + queue for review
# |-- Low confidence: No action (but log for training)
# |-- User report: Prioritize in human review queue
# |
# [Human Review Queue] --> Expert reviewers make final decision
# |
# [Feedback Loop] --> Review decisions become training labels
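The action router above can be sketched as a simple threshold ladder. The specific cutoffs (0.95, 0.70) and the user-report fast path are assumptions for illustration; in practice they come from the policy engine described later:

```python
def route_action(score: float, user_reported: bool = False) -> str:
    """Map a violation score to an enforcement action.

    Thresholds are illustrative assumptions, not platform policy.
    """
    if user_reported:
        # User reports jump the queue regardless of model score
        return "priority_human_review"
    if score >= 0.95:
        return "auto_remove"
    if score >= 0.70:
        return "reduce_and_queue_review"
    # Low confidence: no enforcement, but log the score for future training
    return "allow_and_log"
```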
Step 3: Deep Dive — Multi-Modal Classification
Text Classification
# Text moderation model
#
# Architecture: Fine-tuned multilingual BERT (XLM-RoBERTa)
# - Handles 100+ languages in a single model
# - Fine-tuned on platform-specific labeled data
# - Multi-head output: one head per policy category
#
# Input processing:
# 1. Normalize text (Unicode normalization, emoji handling)
# 2. Detect language (route to language-specific rules)
# 3. Tokenize with sentencepiece (handles all scripts)
# 4. Truncate to 512 tokens (or chunk long posts)
#
# Challenges:
# - Sarcasm and context: "I'm going to kill it at my exam"
# - Code-switching: mixing languages in one post
# - Adversarial obfuscation: "h@te" instead of "hate"
# - Cultural context: gesture meanings vary by culture
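The normalization step and the adversarial-obfuscation challenge ("h@te" instead of "hate") can be illustrated with a minimal sketch. The substitution table is a tiny illustrative sample; production systems use far larger confusable-character maps:

```python
import unicodedata

# Toy character-substitution map (assumption: a few common "leet" swaps).
LEET_MAP = str.maketrans({"@": "a", "3": "e", "0": "o", "$": "s", "1": "i"})

def normalize_text(text: str) -> str:
    """Minimal normalization pass before tokenization."""
    # 1. Unicode NFKC normalization folds many visual look-alikes
    #    (fullwidth chars, compatibility forms) into canonical ASCII
    text = unicodedata.normalize("NFKC", text)
    # 2. Lowercase and undo common character substitutions ("h@te" -> "hate")
    return text.lower().translate(LEET_MAP)
```

Note the trade-off: aggressive substitution (e.g. "1" to "i") also mangles legitimate numbers, so real pipelines typically apply such maps as a feature for the classifier rather than destructively rewriting the input.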
Image Classification
# Image moderation model
#
# Architecture: Vision Transformer (ViT-L/14) fine-tuned for safety
# - Pre-trained on large image dataset
# - Fine-tuned on safety-labeled images (10M+ labeled images)
# - Multi-label output: nudity, violence, hate_symbols, self_harm, ...
#
# Additional image analysis:
# - OCR: Extract text from images (memes, screenshots)
# --> Feed extracted text to text classifier
# - Object detection: Weapons, drugs, hate symbols
# - Face analysis: Detect if real person (for non-consensual imagery)
#
# Key challenge: Context matters
# - Medical/educational nudity vs. sexual content
# - News photography of violence vs. glorification
# - Historical images with hate symbols vs. hate speech
Video Classification
# Video moderation pipeline
#
# Challenge: Processing full video at upload time is too expensive
#
# Strategy: Multi-resolution analysis
# 1. Thumbnail scan: Check cover frame (~50ms)
# 2. Key frame extraction: Sample 1 frame/second, run image classifier
# 3. Audio analysis: Speech-to-text + audio classifier (music, screaming)
# 4. Scene change detection: Focus on transitions (where violations hide)
#
# For live streams (highest urgency):
# - Sample frames every 5 seconds
# - Lower confidence threshold (err on side of caution)
# - Auto-terminate stream if high-severity violation detected
# - Human reviewer joins within 2 minutes for flagged streams
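The two sampling schedules above (1 frame/second for uploads, 1 frame every 5 seconds for live streams) can be sketched as:

```python
def frame_sample_times(duration_s: float, live: bool = False) -> list:
    """Return timestamps (in seconds) at which to extract frames.

    Uploads: 1 frame/second. Live streams: 1 frame every 5 seconds.
    Intervals match the strategy above; a real pipeline would also add
    scene-change timestamps to this list.
    """
    step = 5.0 if live else 1.0
    t, times = 0.0, []
    while t < duration_s:
        times.append(t)
        t += step
    return times
```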
Multi-Modal Fusion
# Combining signals from text + image + video
#
# Option A: Late fusion (independent models, combine scores)
# final_score = w1 * text_score + w2 * image_score + w3 * video_score
# Pros: Simple, each model can be updated independently
# Cons: Misses cross-modal context
#
# Option B: Early fusion (single multi-modal model)
# CLIP-like architecture: shared embedding space for text + image
# Pros: Understands "this text + this image together is harmful"
# Cons: Harder to train, single point of failure
#
# Recommended: Late fusion for V1, early fusion for high-priority categories
# - Hate speech memes: text alone is fine, image alone is fine,
# but text + image together is hateful --> needs early fusion
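Option A (late fusion) is literally the weighted sum shown above. The weights here are illustrative; in practice they would be tuned per policy category on validation data:

```python
def late_fusion(text_score: float, image_score: float, video_score: float,
                weights=(0.4, 0.35, 0.25)) -> float:
    """Combine independent per-modality scores into one violation score.

    Weights are illustrative assumptions that sum to 1.0 so the output
    stays a probability-like value in [0, 1].
    """
    w_text, w_image, w_video = weights
    return w_text * text_score + w_image * image_score + w_video * video_score
```

The hate-speech-meme failure mode is visible here: if the text scores 0.1 and the image scores 0.1, no weighting of independent scores can recover the harm of the combination, which is why early fusion is recommended for that category.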
Deep Dive — Human Review Loop
Human review is essential for content moderation. No ML system is accurate enough to make all decisions autonomously, especially for nuanced policy categories.
Review Queue Design
| Priority | Content Type | SLA | Routing |
|---|---|---|---|
| P0 - Critical | Child safety, terrorism, imminent threats | < 15 minutes | Specialized reviewers, auto-escalate to law enforcement |
| P1 - High | Graphic violence, non-consensual imagery | < 1 hour | Senior reviewers with domain expertise |
| P2 - Medium | Hate speech, harassment, misinformation | < 24 hours | General reviewers with policy training |
| P3 - Low | Spam, minor policy violations, appeals | < 72 hours | General reviewers or ML re-scoring |
Active Learning for Review Efficiency
# Problem: We can't review all 5B posts/day (would need millions of reviewers)
# Solution: Smart routing to maximize review impact
#
# Priority scoring:
# review_priority = model_uncertainty * content_virality * severity_weight
#
# Uncertainty sampling:
# - Model predicts P(violation) = 0.52 --> high uncertainty, needs review
# - Model predicts P(violation) = 0.99 --> auto-remove, no review needed
# - Model predicts P(violation) = 0.01 --> allow, no review needed
#
# Review decisions feed back into training:
# - Reviewer labels become training data
# - Focus on cases where model is uncertain = most informative labels
# - This is active learning: model improves fastest on its weakest areas
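The priority formula above can be made concrete. The uncertainty term here is a simple triangular function that peaks at P(violation) = 0.5, matching the examples (0.52 is high priority, 0.99 and 0.01 need no review); a real system might use entropy or an ensemble-disagreement measure instead:

```python
def review_priority(p_violation: float, virality: float,
                    severity_weight: float) -> float:
    """review_priority = model_uncertainty * content_virality * severity_weight.

    Uncertainty is 1.0 when the model is maximally unsure (p = 0.5) and
    0.0 when it is fully confident (p = 0.0 or 1.0).
    """
    uncertainty = 1.0 - abs(p_violation - 0.5) * 2.0
    return uncertainty * virality * severity_weight
```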
Deep Dive — Policy Engine
The policy engine sits between ML classifiers and action routing. It translates policy rules into system behavior.
# Policy engine architecture
#
# Policies are configured as rules, not hardcoded:
# {
# "policy": "nudity",
# "regions": ["US", "EU"],
# "thresholds": {
# "auto_remove": 0.95,
# "reduce_distribution": 0.80,
# "human_review": 0.50,
# "allow": 0.00
# },
# "exceptions": ["medical_context", "art", "education"],
# "actions": {
# "remove": {"notify_user": true, "allow_appeal": true},
# "reduce": {"suppress_from_recommendations": true}
# }
# }
#
# Why a policy engine (not hardcoded)?
# - Policies change frequently (new regulations, new threats)
# - Policies vary by country (EU vs. US vs. India)
# - Non-engineers (policy team) can update thresholds without code deploys
# - A/B test different thresholds to find optimal operating points
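Evaluating a rule like the JSON config above is a threshold walk plus an exception check. This sketch simplifies exception handling to a context-tag lookup; a real engine would also resolve the regional policy variant first:

```python
# Policy config mirroring the example above (exceptions simplified to tags).
POLICY = {
    "policy": "nudity",
    "thresholds": {
        "auto_remove": 0.95,
        "reduce_distribution": 0.80,
        "human_review": 0.50,
    },
    "exceptions": ["medical_context", "art", "education"],
}

def evaluate(score: float, context_tags: list, policy=POLICY) -> str:
    """Map a classifier score + context tags to an action for one policy."""
    # Exceptions short-circuit enforcement entirely
    if any(tag in policy["exceptions"] for tag in context_tags):
        return "allow"
    t = policy["thresholds"]
    if score >= t["auto_remove"]:
        return "auto_remove"
    if score >= t["reduce_distribution"]:
        return "reduce_distribution"
    if score >= t["human_review"]:
        return "human_review"
    return "allow"
```

Because the thresholds live in config rather than code, the policy team can A/B test new operating points by shipping a config change.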
Deep Dive — Scaling to Billions
Cost Optimization
| Strategy | Description | Impact |
|---|---|---|
| Cascade architecture | Cheap models first, expensive models only for uncertain cases | 70% of content classified by fast model, 30% need deep model |
| Hash deduplication | If identical content was already classified, reuse the result | Saves 20–30% of compute (viral reposts) |
| Model distillation | Distill large teacher model into smaller student for serving | 5x faster inference with <2% quality loss |
| Batch processing | For non-urgent content, batch GPU inference | Higher GPU utilization, lower cost per item |
| Regional processing | Process content in the region where it was posted | Reduces data transfer costs and latency |
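The cascade row in the table can be sketched as follows. The confidence band (0.05-0.95) is an illustrative assumption, and `fast_model`/`deep_model` are hypothetical stand-ins for the distilled student and the full teacher model:

```python
def cascade_classify(item, fast_model, deep_model,
                     low: float = 0.05, high: float = 0.95) -> float:
    """Run the cheap model first; escalate only uncertain cases.

    If the fast model's score falls outside the uncertain band
    (low, high), we trust it and skip the expensive model entirely.
    """
    p = fast_model(item)
    if p <= low or p >= high:
        return p  # confident fast-model verdict, no deep model needed
    return deep_model(item)  # uncertain band: pay for the expensive model
```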
Metrics & Evaluation
Offline Metrics
| Metric | Target | Why |
|---|---|---|
| Precision (per category) | > 90% | Over-enforcement erodes user trust |
| Recall (per category) | > 95% for P0, > 85% for P2 | Missing severe violations is unacceptable |
| False positive rate | < 1% | Wrongly removing legitimate content causes backlash |
| Cross-language parity | < 5% performance gap | Non-English content must be equally well-moderated |
Online Metrics
| Metric | Description | Guardrail |
|---|---|---|
| Violating view rate | % of content views on violating content before removal | Should decrease over time |
| Time to action | Median time from post creation to enforcement | Should decrease |
| Appeal overturn rate | % of enforced content overturned on appeal | < 10% (lower = better accuracy) |
| User reports per day | Volume of user reports for content that was not auto-caught | Should decrease |
| Reviewer agreement rate | How often reviewers agree with ML decision | > 85% |
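Two of the guardrail metrics above reduce to simple ratios over enforcement logs; a minimal sketch:

```python
def appeal_overturn_rate(appeal_outcomes: list) -> float:
    """Fraction of appealed enforcements that were overturned.

    appeal_outcomes -- list of booleans, True if overturned on appeal.
    Target per the table above: < 10% (0.10).
    """
    if not appeal_outcomes:
        return 0.0
    return sum(appeal_outcomes) / len(appeal_outcomes)

def violating_view_rate(views_on_violating: int, total_views: int) -> float:
    """Share of all content views that landed on violating content
    before it was removed. Should trend down over time."""
    return views_on_violating / total_views
```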
Step 4: Trade-Offs & Extensions
Precision vs. Recall by Severity
For child safety content (P0), maximize recall at any precision cost — missing one case is unacceptable. For borderline hate speech (P2), err toward precision to avoid silencing legitimate discussion.
Global Model vs. Regional Models
A swastika is a hate symbol in Europe but a religious symbol in South Asia. Build a global base model with regional fine-tuning layers that encode cultural context.
Pre-Publish vs. Reactive
Pre-publish blocking adds latency but prevents viral spread. Reactive moderation is faster to publish but harmful content may reach millions before removal. Use pre-publish for high-risk categories and reactive for lower severity.
LLM-Powered Review
Use large language models to assist human reviewers: generate policy-grounded explanations for why content was flagged, suggest the most likely policy violation, and draft user notification messages.