Design Content Moderation System
A complete walkthrough of designing an ML-powered content moderation system. This is one of the most impactful and nuanced ML systems in production — it must handle text, images, and video across dozens of languages while balancing free expression with safety at the scale of billions of posts per day.
Step 1: Clarify Requirements
- “What content types?” — Text posts, images, videos, comments, live streams
- “Scale?” — 5B posts/day, 500K posts/second peak, 50+ languages
- “What policy categories?” — Hate speech, violence, nudity, spam, misinformation, self-harm, terrorism
- “Latency requirements?” — Pre-publish: <500ms; Reactive (reported): within 1 hour
- “What actions?” — Remove, reduce distribution, add warning label, send to human review
- “Appeal process?” — Yes, users can appeal decisions
ML Problem Formulation
# Problem formulation
# Business goal: Keep platform safe while minimizing over-enforcement
# ML task: Multi-label classification (content can violate multiple policies)
# Input: Text + images + video frames + metadata + context
# Output: Per-policy violation probability + severity score
# Labels: hate_speech, violence, nudity, spam, misinformation, ...
# Training data: Human-reviewed content with policy labels
# Loss function: Weighted multi-label BCE (weight by severity and FP cost)
# Key constraint: Must support 50+ languages; policies vary by region
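The weighted multi-label BCE above can be sketched in plain Python. The per-policy weights below are illustrative assumptions (the source only says to weight by severity and false-positive cost), not real production values:

```python
import math

# Hypothetical per-policy weights: severe categories (and categories where
# false positives are costly) get larger weights. Values are illustrative.
POLICY_WEIGHTS = {
    "hate_speech": 2.0,
    "violence": 2.0,
    "nudity": 1.0,
    "spam": 0.5,
}

def weighted_multilabel_bce(probs, labels, weights=POLICY_WEIGHTS):
    """Weighted multi-label binary cross-entropy.

    probs  -- dict of policy -> predicted violation probability in (0, 1)
    labels -- dict of policy -> ground-truth label (0 or 1)
    Each policy is an independent binary problem, so one post can
    contribute loss for several violated policies at once.
    """
    eps = 1e-7  # clamp probabilities to avoid log(0)
    total = 0.0
    for policy, w in weights.items():
        p = min(max(probs[policy], eps), 1 - eps)
        y = labels[policy]
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(weights)
```

In a real training loop this would be a framework loss (e.g. a weighted `BCEWithLogitsLoss` in PyTorch); the sketch just makes the per-label independence and weighting explicit.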
Step 2: High-Level Architecture
# Architecture: Multi-Layer Content Moderation
#
# [New Post Created]
# |
# [Layer 1: Hash Matching] --> Match known violating content (perceptual hash)
# | (exact match --> immediate removal) ~5ms, catches re-uploads of removed content
# |
# [Layer 2: Rule Engine] --> Keyword blocklists, regex patterns
# | (fast, high precision) ~10ms, catches obvious violations
# |
# [Layer 3: ML Classifiers] --> Deep learning models per modality
# |-- Text classifier ~50ms (multilingual BERT)
# |-- Image classifier ~100ms (Vision Transformer)
# |-- Video classifier ~500ms (sample frames + audio)
# |
# [Decision Engine] --> Combine scores + policy rules
# |
# [Action Router]
# |-- High confidence violation: Auto-remove
# |-- Medium confidence: Reduce distribution + queue for review
# |-- Low confidence: No action (but log for training)
# |-- User report: Prioritize in human review queue
# |
# [Human Review Queue] --> Expert reviewers make final decision
# |
# [Feedback Loop] --> Review decisions become training labels
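The action router above can be sketched as a simple threshold ladder. The specific cutoffs (0.95, 0.70) and the user-report fast path are assumptions for illustration; in practice they come from the policy engine described later:

```python
def route_action(score: float, user_reported: bool = False) -> str:
    """Map a violation score to an enforcement action.

    Thresholds are illustrative assumptions, not platform policy.
    """
    if user_reported:
        # User reports jump the queue regardless of model score
        return "priority_human_review"
    if score >= 0.95:
        return "auto_remove"
    if score >= 0.70:
        return "reduce_and_queue_review"
    # Low confidence: no enforcement, but log the score for future training
    return "allow_and_log"
```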
Step 3: Deep Dive — Multi-Modal Classification
Text Classification
# Text moderation model
#
# Architecture: Fine-tuned multilingual BERT (XLM-RoBERTa)
# - Handles 100+ languages in a single model
# - Fine-tuned on platform-specific labeled data
# - Multi-head output: one head per policy category
#
# Input processing:
# 1. Normalize text (Unicode normalization, emoji handling)
# 2. Detect language (route to language-specific rules)
# 3. Tokenize with sentencepiece (handles all scripts)
# 4. Truncate to 512 tokens (or chunk long posts)
#
# Challenges:
# - Sarcasm and context: "I'm going to kill it at my exam"
# - Code-switching: mixing languages in one post
# - Adversarial obfuscation: "h@te" instead of "hate"
# - Cultural context: gesture meanings vary by culture
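The normalization step and the adversarial-obfuscation challenge ("h@te" instead of "hate") can be illustrated with a minimal sketch. The substitution table is a tiny illustrative sample; production systems use far larger confusable-character maps:

```python
import unicodedata

# Toy character-substitution map (assumption: a few common "leet" swaps).
LEET_MAP = str.maketrans({"@": "a", "3": "e", "0": "o", "$": "s", "1": "i"})

def normalize_text(text: str) -> str:
    """Minimal normalization pass before tokenization."""
    # 1. Unicode NFKC normalization folds many visual look-alikes
    #    (fullwidth chars, compatibility forms) into canonical ASCII
    text = unicodedata.normalize("NFKC", text)
    # 2. Lowercase and undo common character substitutions ("h@te" -> "hate")
    return text.lower().translate(LEET_MAP)
```

Note the trade-off: aggressive substitution (e.g. "1" to "i") also mangles legitimate numbers, so real pipelines typically apply such maps as a feature for the classifier rather than destructively rewriting the input.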
Image Classification
# Image moderation model
#
# Architecture: Vision Transformer (ViT-L/14) fine-tuned for safety
# - Pre-trained on large image dataset
# - Fine-tuned on safety-labeled images (10M+ labeled images)
# - Multi-label output: nudity, violence, hate_symbols, self_harm, ...
#
# Additional image analysis:
# - OCR: Extract text from images (memes, screenshots)
# --> Feed extracted text to text classifier
# - Object detection: Weapons, drugs, hate symbols
# - Face analysis: Detect if real person (for non-consensual imagery)
#
# Key challenge: Context matters
# - Medical/educational nudity vs. sexual content
# - News photography of violence vs. glorification
# - Historical images with hate symbols vs. hate speech
Video Classification
# Video moderation pipeline
#
# Challenge: Processing full video at upload time is too expensive
#
# Strategy: Multi-resolution analysis
# 1. Thumbnail scan: Check cover frame (~50ms)
# 2. Key frame extraction: Sample 1 frame/second, run image classifier
# 3. Audio analysis: Speech-to-text + audio classifier (music, screaming)
# 4. Scene change detection: Focus on transitions (where violations hide)
#
# For live streams (highest urgency):
# - Sample frames every 5 seconds
# - Lower confidence threshold (err on side of caution)
# - Auto-terminate stream if high-severity violation detected
# - Human reviewer joins within 2 minutes for flagged streams
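The two sampling schedules above (1 frame/second for uploads, 1 frame every 5 seconds for live streams) can be sketched as:

```python
def frame_sample_times(duration_s: float, live: bool = False) -> list:
    """Return timestamps (in seconds) at which to extract frames.

    Uploads: 1 frame/second. Live streams: 1 frame every 5 seconds.
    Intervals match the strategy above; a real pipeline would also add
    scene-change timestamps to this list.
    """
    step = 5.0 if live else 1.0
    t, times = 0.0, []
    while t < duration_s:
        times.append(t)
        t += step
    return times
```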
Multi-Modal Fusion
# Combining signals from text + image + video
#
# Option A: Late fusion (independent models, combine scores)
# final_score = w1 * text_score + w2 * image_score + w3 * video_score
# Pros: Simple, each model can be updated independently
# Cons: Misses cross-modal context
#
# Option B: Early fusion (single multi-modal model)
# CLIP-like architecture: shared embedding space for text + image
# Pros: Understands "this text + this image together is harmful"
# Cons: Harder to train, single point of failure
#
# Recommended: Late fusion for V1, early fusion for high-priority categories
# - Hate speech memes: text alone is fine, image alone is fine,
# but text + image together is hateful --> needs early fusion
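Option A (late fusion) is literally the weighted sum shown above. The weights here are illustrative; in practice they would be tuned per policy category on validation data:

```python
def late_fusion(text_score: float, image_score: float, video_score: float,
                weights=(0.4, 0.35, 0.25)) -> float:
    """Combine independent per-modality scores into one violation score.

    Weights are illustrative assumptions that sum to 1.0 so the output
    stays a probability-like value in [0, 1].
    """
    w_text, w_image, w_video = weights
    return w_text * text_score + w_image * image_score + w_video * video_score
```

The hate-speech-meme failure mode is visible here: if the text scores 0.1 and the image scores 0.1, no weighting of independent scores can recover the harm of the combination, which is why early fusion is recommended for that category.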
Deep Dive — Human Review Loop
Human review is essential for content moderation. No ML system is accurate enough to make all decisions autonomously, especially for nuanced policy categories.
Review Queue Design
| Priority | Content Type | SLA | Routing |
|---|---|---|---|
| P0 - Critical | Child safety, terrorism, imminent threats | < 15 minutes | Specialized reviewers, auto-escalate to law enforcement |
| P1 - High | Graphic violence, non-consensual imagery | < 1 hour | Senior reviewers with domain expertise |
| P2 - Medium | Hate speech, harassment, misinformation | < 24 hours | General reviewers with policy training |
| P3 - Low | Spam, minor policy violations, appeals | < 72 hours | General reviewers or ML re-scoring |
Active Learning for Review Efficiency
# Problem: We can't review all 5B posts/day (would need millions of reviewers)
# Solution: Smart routing to maximize review impact
#
# Priority scoring:
# review_priority = model_uncertainty * content_virality * severity_weight
#
# Uncertainty sampling:
# - Model predicts P(violation) = 0.52 --> high uncertainty, needs review
# - Model predicts P(violation) = 0.99 --> auto-remove, no review needed
# - Model predicts P(violation) = 0.01 --> allow, no review needed
#
# Review decisions feed back into training:
# - Reviewer labels become training data
# - Focus on cases where model is uncertain = most informative labels
# - This is active learning: model improves fastest on its weakest areas
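The priority formula above can be made concrete. The uncertainty term here is a simple triangular function that peaks at P(violation) = 0.5, matching the examples (0.52 is high priority, 0.99 and 0.01 need no review); a real system might use entropy or an ensemble-disagreement measure instead:

```python
def review_priority(p_violation: float, virality: float,
                    severity_weight: float) -> float:
    """review_priority = model_uncertainty * content_virality * severity_weight.

    Uncertainty is 1.0 when the model is maximally unsure (p = 0.5) and
    0.0 when it is fully confident (p = 0.0 or 1.0).
    """
    uncertainty = 1.0 - abs(p_violation - 0.5) * 2.0
    return uncertainty * virality * severity_weight
```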
Deep Dive — Policy Engine
The policy engine sits between ML classifiers and action routing. It translates policy rules into system behavior.
# Policy engine architecture
#
# Policies are configured as rules, not hardcoded:
# {
# "policy": "nudity",
# "regions": ["US", "EU"],
# "thresholds": {
# "auto_remove": 0.95,
# "reduce_distribution": 0.80,
# "human_review": 0.50,
# "allow": 0.00
# },
# "exceptions": ["medical_context", "art", "education"],
# "actions": {
# "remove": {"notify_user": true, "allow_appeal": true},
# "reduce": {"suppress_from_recommendations": true}
# }
# }
#
# Why a policy engine (not hardcoded)?
# - Policies change frequently (new regulations, new threats)
# - Policies vary by country (EU vs. US vs. India)
# - Non-engineers (policy team) can update thresholds without code deploys
# - A/B test different thresholds to find optimal operating points
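Evaluating a rule like the JSON config above is a threshold walk plus an exception check. This sketch simplifies exception handling to a context-tag lookup; a real engine would also resolve the regional policy variant first:

```python
# Policy config mirroring the example above (exceptions simplified to tags).
POLICY = {
    "policy": "nudity",
    "thresholds": {
        "auto_remove": 0.95,
        "reduce_distribution": 0.80,
        "human_review": 0.50,
    },
    "exceptions": ["medical_context", "art", "education"],
}

def evaluate(score: float, context_tags: list, policy=POLICY) -> str:
    """Map a classifier score + context tags to an action for one policy."""
    # Exceptions short-circuit enforcement entirely
    if any(tag in policy["exceptions"] for tag in context_tags):
        return "allow"
    t = policy["thresholds"]
    if score >= t["auto_remove"]:
        return "auto_remove"
    if score >= t["reduce_distribution"]:
        return "reduce_distribution"
    if score >= t["human_review"]:
        return "human_review"
    return "allow"
```

Because the thresholds live in config rather than code, the policy team can A/B test new operating points by shipping a config change.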
Deep Dive — Scaling to Billions
Cost Optimization
| Strategy | Description | Impact |
|---|---|---|
| Cascade architecture | Cheap models first, expensive models only for uncertain cases | 70% of content classified by fast model, 30% need deep model |
| Hash deduplication | If identical content was already classified, reuse the result | Saves 20–30% of compute (viral reposts) |
| Model distillation | Distill large teacher model into smaller student for serving | 5x faster inference with <2% quality loss |
| Batch processing | For non-urgent content, batch GPU inference | Higher GPU utilization, lower cost per item |
| Regional processing | Process content in the region where it was posted | Reduces data transfer costs and latency |
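The cascade row in the table can be sketched as follows. The confidence band (0.05-0.95) is an illustrative assumption, and `fast_model`/`deep_model` are hypothetical stand-ins for the distilled student and the full teacher model:

```python
def cascade_classify(item, fast_model, deep_model,
                     low: float = 0.05, high: float = 0.95) -> float:
    """Run the cheap model first; escalate only uncertain cases.

    If the fast model's score falls outside the uncertain band
    (low, high), we trust it and skip the expensive model entirely.
    """
    p = fast_model(item)
    if p <= low or p >= high:
        return p  # confident fast-model verdict, no deep model needed
    return deep_model(item)  # uncertain band: pay for the expensive model
```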
Metrics & Evaluation
Offline Metrics
| Metric | Target | Why |
|---|---|---|
| Precision (per category) | > 90% | Over-enforcement erodes user trust |
| Recall (per category) | > 95% for P0, > 85% for P2 | Missing severe violations is unacceptable |
| False positive rate | < 1% | Wrongly removing legitimate content causes backlash |
| Cross-language parity | < 5% performance gap | Non-English content must be equally well-moderated |
Online Metrics
| Metric | Description | Guardrail |
|---|---|---|
| Violating view rate | % of content views on violating content before removal | Should decrease over time |
| Time to action | Median time from post creation to enforcement | Should decrease |
| Appeal overturn rate | % of enforced content overturned on appeal | < 10% (lower = better accuracy) |
| User reports per day | Volume of user reports for content that was not auto-caught | Should decrease |
| Reviewer agreement rate | How often reviewers agree with ML decision | > 85% |
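Two of the guardrail metrics above reduce to simple ratios over enforcement logs; a minimal sketch:

```python
def appeal_overturn_rate(appeal_outcomes: list) -> float:
    """Fraction of appealed enforcements that were overturned.

    appeal_outcomes -- list of booleans, True if overturned on appeal.
    Target per the table above: < 10% (0.10).
    """
    if not appeal_outcomes:
        return 0.0
    return sum(appeal_outcomes) / len(appeal_outcomes)

def violating_view_rate(views_on_violating: int, total_views: int) -> float:
    """Share of all content views that landed on violating content
    before it was removed. Should trend down over time."""
    return views_on_violating / total_views
```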
Step 4: Trade-Offs & Extensions
Precision vs. Recall by Severity
For child safety content (P0), maximize recall at any precision cost — missing one case is unacceptable. For borderline hate speech (P2), err toward precision to avoid silencing legitimate discussion.
Global Model vs. Regional Models
A swastika is a hate symbol in Europe but a religious symbol in South Asia. Build a global base model with regional fine-tuning layers that encode cultural context.
Pre-Publish vs. Reactive
Pre-publish blocking adds latency but prevents viral spread. Reactive moderation is faster to publish but harmful content may reach millions before removal. Use pre-publish for high-risk categories and reactive for lower severity.
LLM-Powered Review
Use large language models to assist human reviewers: generate policy-grounded explanations for why content was flagged, suggest the most likely policy violation, and draft user notification messages.